How to Check Character Encoding of a File in Linux

How can I find encoding of a file via a script on Linux?

It sounds like you're looking for enca. It can guess and even convert between encodings. Just look at the man page.

Or, failing that, use file -i (Linux) or file -I (OS X). That will output MIME-type information for the file, which will also include the character-set encoding. I found a man-page for it, too :)

How can I be sure of the file encoding?

Well, first of all, note that ASCII is a subset of UTF-8, so if your file contains only ASCII characters, it's correct to say that it's encoded in ASCII and it's correct to say that it's encoded in UTF-8.

That being said, file typically only examines a short segment at the beginning of the file to determine its type, so it might report us-ascii even though the file contains non-ASCII characters beyond that initial segment. On the other hand, gedit might say the file is UTF-8 even if it's pure ASCII, because UTF-8 is gedit's preferred character encoding and it would save the file as UTF-8 if you added any non-ASCII characters during your edit session. Again, if that's what gedit is saying, it isn't wrong.

Now to your question:

  1. Run this command:

    tr -d \\000-\\177 < your-file | wc -c

    If the output says "0", then the file contains only ASCII characters. It's in ASCII (and it's also valid UTF-8). End of story.

  2. Run this command:

    iconv -f utf-8 -t ucs-4 < your-file >/dev/null

    If you get an error, the file does not contain valid UTF-8 (or at least, some part of it is corrupted).

    If you get no error, the file is extremely likely to be UTF-8. That's because UTF-8 has properties that make it very hard to mistake typical text in any other commonly used character encoding for valid UTF-8.
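The two steps above can be combined into a small shell function; `check_encoding` is just a name chosen for this sketch, and it assumes `tr`, `wc`, and `iconv` are available (they are on any normal Linux system):

```shell
# A minimal sketch of the two-step check described above.
check_encoding() {
    # Step 1: delete every byte in the ASCII range (octal 000-177)
    # and count what remains. Zero means pure ASCII.
    if [ "$(tr -d '\000-\177' < "$1" | wc -c)" -eq 0 ]; then
        echo "ascii"
        return
    fi
    # Step 2: a round-trip through iconv succeeds only if the
    # byte stream is valid UTF-8.
    if iconv -f utf-8 -t ucs-4 < "$1" > /dev/null 2>&1; then
        echo "utf-8"
    else
        echo "unknown"
    fi
}
```

Usage: `check_encoding your-file` prints `ascii`, `utf-8`, or `unknown`.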

Where is the character encoding of a text file stored in Linux?

It isn't, at least not by default. There's actually no difference between the way two files containing abcd but saved under different locales are stored in the filesystem, since the text string abcd is encoded identically in the ASCII subset of both encodings.

Ext filesystems do not log file encoding metadata. While it is possible to record a limited amount of data (on the order of a few kilobytes) along with a file on an ext filesystem by using extended attributes, gedit apparently does not use this to store character encoding, and instead caches a specific user's selected encoding for specific files. You can demonstrate this by logging in as another user (I logged in as root for this experiment) and opening the same file -- gedit will read it using the default system locale, not the custom locale that you saved it in under the other login.
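To illustrate what storing such metadata in extended attributes would look like, here is a sketch using `setfattr`/`getfattr` from the attr package; the name `user.charset` is an arbitrary convention invented for this example, and gedit does not read it:

```shell
# A sketch: record an encoding label in a user extended attribute.
# Requires the attr package and an xattr-capable filesystem (ext4 etc.).
# "user.charset" is a made-up attribute name, not anything gedit uses.
printf 'ahoj\n' > demo.txt
setfattr -n user.charset -v ISO-8859-2 demo.txt
getfattr -n user.charset --only-values demo.txt   # prints ISO-8859-2
```

Any tool that wanted to honour the attribute would still have to be taught to look for it, which is exactly why no common editor relies on this mechanism.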

How to determine encoding table of a text file

If you're on Linux, try file -i filename.txt.

$ file -i vol34.tex 
vol34.tex: text/x-tex; charset=us-ascii

For reference, here is my environment:

$ which file
/usr/bin/file
$ file --version
file-5.09
magic file from /etc/magic:/usr/share/misc/magic

Some file versions (e.g. file-5.04 on OS X/macOS) have slightly different command-line switches:

$ file -I vol34.tex 
vol34.tex: text/x-tex; charset=us-ascii
$ file --mime vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
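If you only need the charset and not the MIME type, recent `file` builds on both Linux and macOS also accept `--mime-encoding`, which sidesteps the `-i` vs `-I` difference; `detect_charset` is just a wrapper name chosen for this sketch:

```shell
# Print only the charset, portably across GNU/Linux and macOS builds
# of file(1). -b (brief) suppresses the "filename:" prefix.
detect_charset() {
    file -b --mime-encoding "$1"
}
```

For example, `detect_charset vol34.tex` would print just `us-ascii`.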


How to find file encoding type or convert any encoding type to UTF-8 in shell?

We do file encoding conversion with

vim -c "set encoding=utf8" -c "set fileencoding=utf8" -c "wq" filename

It works fine, and there's no need to specify the source encoding.
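If vim isn't available, an iconv-based sketch can do the same thing by letting `file` guess the source charset first; `to_utf8` is a name invented for this example, and the conversion is only as reliable as `file`'s guess:

```shell
# Detect the source charset with file(1), then convert to UTF-8
# with iconv, writing the result alongside the original.
to_utf8() {
    src=$(file -b --mime-encoding "$1")
    iconv -f "$src" -t utf-8 "$1" > "$1.utf8"
}
```

Unlike the vim command above, this fails cleanly (iconv exits non-zero) if the input isn't actually in the detected encoding.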


