How can I find encoding of a file via a script on Linux?
It sounds like you're looking for enca. It can guess and even convert between encodings. Just look at the man page.
Or, failing that, use file -i (Linux) or file -I (OS X). That will output MIME-type information for the file, which will also include the character-set encoding. I found a man-page for it, too :)
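Since the question asks for a script, here is a minimal sketch of how that could look. The function name print_encoding is made up for illustration, and it assumes your file(1) supports the --mime-encoding long option (the 5.x versions do):

```shell
#!/bin/sh
# print_encoding FILE...: print "name: charset" for each file.
# --mime-encoding asks file(1) for the charset only;
# -b (brief) drops the "filename:" prefix from file's own output.
print_encoding() {
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            printf '%s: unreadable\n' "$f" >&2
            continue
        fi
        printf '%s: %s\n' "$f" "$(file -b --mime-encoding -- "$f")"
    done
}

print_encoding "$@"
```

Saved as, say, enc.sh, you would call it as sh enc.sh *.txt. Keep in mind that the result is still a guess by file, not a guarantee.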
How can I be sure of the file encoding?
Well, first of all, note that ASCII is a subset of UTF-8, so if your file contains only ASCII characters, it's correct to say that it's encoded in ASCII and it's correct to say that it's encoded in UTF-8.
That being said, file typically only examines a short segment at the beginning of the file to determine its type, so it might declare a file us-ascii even though it contains non-ASCII characters, if those characters lie beyond that initial segment. On the other hand, gedit might say that the file is UTF-8 even if it's ASCII, because UTF-8 is gedit's preferred character encoding and it intends to save the file as UTF-8 if you add any non-ASCII characters during your edit session. Again, if that's what gedit is saying, it wouldn't be wrong.
Now to your question:
Run this command:
tr -d \\000-\\177 < your-file | wc -c
If the output says "0", then the file contains only ASCII characters. It's in ASCII (and it's also valid UTF-8). End of story.
Otherwise, run this command:
iconv -f utf-8 -t ucs-4 < your-file >/dev/null
If you get an error, the file does not contain valid UTF-8 (or at least, some part of it is corrupted).
If you get no error, the file is extremely likely to be UTF-8. That's because UTF-8 has properties that make it very hard to mistake typical text in any other commonly used character encoding for valid UTF-8.
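The two checks above can be rolled into one function. This is only a sketch: the name classify_encoding and the three labels it prints are my own choice, and "utf-8" here means "passed the iconv check", which is very strong evidence but not proof.

```shell
#!/bin/sh
# classify_encoding FILE: print "ascii", "utf-8", or "unknown".
classify_encoding() {
    # Check 1: delete every byte in the range 0x00-0x7F and count the rest.
    # LC_ALL=C makes tr work on raw bytes instead of multibyte characters;
    # $(( )) strips the padding some wc implementations print.
    non_ascii=$(( $(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c) ))
    if [ "$non_ascii" -eq 0 ]; then
        echo ascii
    # Check 2: iconv exits non-zero at the first invalid UTF-8 sequence.
    elif iconv -f utf-8 -t ucs-4 < "$1" > /dev/null 2>&1; then
        echo utf-8
    else
        echo unknown
    fi
}

if [ "$#" -gt 0 ]; then
    classify_encoding "$1"
fi
```

Unlike file, both checks read the whole file, so a stray non-ASCII byte near the end will not slip through.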
Where is the character encoding of a text file stored in Linux?
It isn't, at least not by default. There's actually no difference between the way those two files containing abcd are stored in the filesystem, since the text string abcd is encoded identically in the ASCII subset of both locales.
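You can check this yourself by dumping the bytes. The helper hex_in below is a hypothetical name, and ISO-8859-1 and UTF-8 are just stand-ins for whichever two encodings you are comparing:

```shell
#!/bin/sh
# hex_in ENCODING TEXT: convert TEXT (given as UTF-8) to ENCODING and
# print the resulting bytes as one hex string.
hex_in() {
    printf '%s' "$2" | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

for enc in ASCII ISO-8859-1 UTF-8; do
    printf '%s abcd: ' "$enc"
    hex_in "$enc" abcd            # 61626364 every time
done

# A non-ASCII character is where the encodings start to differ.
e_acute=$(printf '\303\251')      # e with acute accent, as UTF-8 bytes
printf 'ISO-8859-1 e-acute: '; hex_in ISO-8859-1 "$e_acute"   # e9
printf 'UTF-8 e-acute: ';      hex_in UTF-8 "$e_acute"        # c3a9
```

As long as a file holds only ASCII text, nothing in its bytes tells you which of these encodings the author had in mind.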
Ext filesystems do not log file encoding metadata. While it is possible to record a limited amount of data (on the order of a few kilobytes) along with a file on an ext filesystem by using extended attributes, gedit apparently does not use this to store character encoding, and instead caches a specific user's selected encoding for specific files. You can demonstrate this by logging in as another user (I logged in as root for this experiment) and opening the same file -- gedit will read it using the default system locale, not the custom locale that you saved it in under the other login.
How to determine encoding table of a text file
If you're on Linux, try file -i filename.txt.
$ file -i vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
For reference, here is my environment:
$ which file
/usr/bin/file
$ file --version
file-5.09
magic file from /etc/magic:/usr/share/misc/magic
Some file versions (e.g. file-5.04 on OS X/macOS) have slightly different command-line switches:
$ file -I vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
$ file --mime vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
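If one script has to run on both systems, something like this could pick the right switch and pull out just the charset. It is a sketch only: charset_of is a made-up name, and it assumes uname -s prints Darwin on OS X/macOS.

```shell
#!/bin/sh
# charset_of FILE: print only the charset part of file(1)'s MIME output.
charset_of() {
    case "$(uname -s)" in
        Darwin) flag=-I ;;    # OS X/macOS spelling
        *)      flag=-i ;;    # Linux spelling
    esac
    out=$(file "$flag" -- "$1") || return 1
    case "$out" in
        # Keep only what follows the last "charset=".
        *charset=*) printf '%s\n' "${out##*charset=}" ;;
        *)          return 1 ;;
    esac
}

if [ "$#" -gt 0 ]; then
    charset_of "$1"
fi
```

Given the output shown above, the long option --mime would also do and avoids the uname check altogether.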
How to find file encoding type or convert any encoding type to UTF-8 in shell?
We do file encoding conversion with
vim -c "set encoding=utf8" -c "set fileencoding=utf8" -c "wq" filename
It works fine, and there is no need to give the source encoding: Vim detects it on its own when it reads the file (guided by its fileencodings option).