Java FileReader Encoding Issue

Yes, you need to specify the encoding of the file you want to read.

Yes, this means that you have to know the encoding of the file you want to read.

No, there is no general way to guess the encoding of any given "plain text" file.

The one-argument constructors of FileReader always use the platform default encoding, which is generally a bad idea.

Since Java 11 FileReader has also gained constructors that accept an encoding: new FileReader(file, charset) and new FileReader(fileName, charset).

In earlier versions of Java, you need to use new InputStreamReader(new FileInputStream(pathToFile), <encoding>).
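Here is a minimal sketch of both approaches side by side. The class name ExplicitCharsetReaders and its method names are made up for illustration, and the caller is assumed to know the file's encoding already.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper class; the names are illustrative only.
final class ExplicitCharsetReaders {

    // Java 11 and later: FileReader accepts a Charset directly.
    static String firstLineJava11(String fileName, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName, charset))) {
            return reader.readLine();
        }
    }

    // Before Java 11: wrap a FileInputStream in an InputStreamReader.
    static String firstLineLegacy(String fileName, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), charset))) {
            return reader.readLine();
        }
    }
}
```

Either method could be called as, say, firstLineLegacy("data.txt", StandardCharsets.UTF_8), where "data.txt" is a placeholder file name. Both return the same text when the charset matches the one used to write the file.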

Read file utf-8

new FileReader(fileName)

As indicated in the documentation:

The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.

So, if your file is encoded using UTF-8, and your default encoding is not UTF-8, that won't work. The documentation explains what must be done in this case:

new InputStreamReader(new FileInputStream(fileName), "UTF-8")
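To see why a mismatched default "won't work", the sketch below decodes the same UTF-8 bytes twice: once as UTF-8 and once as ISO-8859-1, standing in for a non-UTF-8 platform default. The class name WrongCharsetDemo and the sample word are assumptions chosen for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any library.
final class WrongCharsetDemo {

    // Decodes bytes the way a Reader configured with 'charset' would.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "café" encoded as UTF-8; the é takes two bytes (0xC3 0xA9).
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        System.out.println("Platform default: " + Charset.defaultCharset());
        // Matching charset: four characters, as written.
        System.out.println(decode(bytes, StandardCharsets.UTF_8).length());
        // Mismatched charset: each byte of é becomes its own character, so five.
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1).length());
    }
}
```

The bytes on disk are the same in both cases; only the charset given to the decoder changes what the program sees.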

Found reliance on default encoding in FileReader

Use an explicit character encoding when opening a file instead of relying on the platform default (which varies between platforms), unless, of course, you intend to use the platform default. You can use InputStreamReader to convert a FileInputStream to a Reader using an explicit character encoding.

Wrong output when attempting to read a text file

Your file starts with a byte-order mark (U+FEFF). It should only occur as the first character of the file. It's not terribly widely used, but various Windows tools, including Notepad, do write it. You can just strip it from the start of the first line.

As an aside, I'd strongly recommend not using FileReader - before Java 11 it doesn't allow you to specify the encoding. I'd use Files.newBufferedReader, and either specify the encoding or let it default to UTF-8 (rather than the system default encoding which FileReader uses). When you're using BufferedReader, you can then just read a line at a time with readLine() too:

String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line.replace("\uFEFF", ""));
}
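A fuller sketch of the same idea, assuming the file is UTF-8 and stripping the byte-order mark only from the start of the first line. The class name BomAwareLines and the method readLines are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; assumes the file is encoded as UTF-8.
final class BomAwareLines {

    private static final String BOM = "\uFEFF";

    static List<String> readLines(Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                // The UTF-8 decoder passes the BOM through as U+FEFF,
                // so drop it if it leads the first line.
                if (first && line.startsWith(BOM)) {
                    line = line.substring(1);
                }
                first = false;
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Checking only the first line avoids touching any U+FEFF that appears later in the text on purpose.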

If you really want to read a character at a time, it's worth getting in the habit of using a StringBuilder instead of repeated string concatenation in a loop. Also note that your variable name of ascii is misleading: it's actually the UTF-16 code unit, which may or may not be an ASCII character.
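A minimal sketch of character-at-a-time reading with a StringBuilder, assuming a UTF-8 file. The class name CharAtATime and the variable name codeUnit are illustrative choices.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; assumes the file is encoded as UTF-8.
final class CharAtATime {

    static String readAll(Path path) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            int codeUnit; // a UTF-16 code unit, or -1 at end of stream
            while ((codeUnit = reader.read()) != -1) {
                text.append((char) codeUnit);
            }
        }
        return text.toString();
    }
}
```

Characters outside the Basic Multilingual Plane arrive as two code units (a surrogate pair), which is another reason a name like ascii would mislead.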

The encoding you specify should match the encoding used to write the file - at that point you should see the correct output instead of an extra character between each "real" character when using Unicode and Unicode big endian (Notepad's names for little-endian and big-endian UTF-16).
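The "extra character" effect can be reproduced in memory. The sketch below encodes a short string as UTF-16LE and decodes it with a single-byte charset, then with the matching one. The class name Utf16MismatchDemo and the method writeThenDecode are made up for illustration, and ISO-8859-1 stands in for whatever single-byte default the reader used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any library.
final class Utf16MismatchDemo {

    // Encodes 'text' with one charset and decodes the bytes with another.
    static String writeThenDecode(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        String wrong = writeThenDecode("Hi", StandardCharsets.UTF_16LE, StandardCharsets.ISO_8859_1);
        String right = writeThenDecode("Hi", StandardCharsets.UTF_16LE, StandardCharsets.UTF_16LE);

        // Mismatch: every second character is U+0000, so the length doubles.
        System.out.println(wrong.length());
        // Match: the original text comes back unchanged.
        System.out.println(right);
    }
}
```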

How to read a file in Java with specific character encoding?

So, first, as a heads up, do realize that fileName.getBytes() as you have there gets the bytes of the filename, not the file itself.

Second, reading the docs of FileReader:

The constructors of this class assume that the default character
encoding and the default byte-buffer size are appropriate. To specify
these values yourself, construct an InputStreamReader on a
FileInputStream.

So, sounds like FileReader actually isn't the way to go. If we take the advice in the docs, then you should just change your code to have:

String fileName = getFileNameToReadFromUserInput();
FileInputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is, getCorrectCharsetToApply());
BufferedReader buffReader = new BufferedReader(isr);

and not try to make a FileReader at all.

Java text encoding

Your problem is probably that you're opening a reader using the platform encoding.

You should manually specify the encoding whenever you convert between bytes and characters. If you know that the appropriate encoding is UTF-8 you can open a file thus:

FileInputStream inputFile = new FileInputStream(myFile);
try {
    // InputStreamReader decodes the stream's bytes using the named charset.
    Reader reader = new InputStreamReader(inputFile, "UTF-8");
    // Maybe buffer reader and do something with it.
} finally {
    inputFile.close();
}

Libraries like Guava can make this whole process easier.

Reading from a file Russian characters(javaSE)

You need to specify the encoding to be able to read the Russian characters. Don't use FileReader, as it will use the default platform encoding.

Instead use

new BufferedReader(new InputStreamReader(new FileInputStream(fileDir), "UTF-8"));

