Add "# Coding: Utf-8" to All Files

Set encoding and fileencoding to utf-8 in Vim

TL;DR

In the first case with set encoding=utf-8, you'll change the output encoding that is shown in the terminal.

In the second case with set fileencoding=utf-8, you'll change the output encoding of the file that is written.

As stated by @Dennis, you can set them both in your ~/.vimrc if you always want to work in utf-8.

More details

From the Vim wiki page on working with Unicode:

"encoding sets how vim shall represent characters internally. Utf-8 is necessary for most flavors of Unicode."

"fileencoding sets the encoding for a particular file (local to buffer); :setglobal sets the default value. An empty value can also be used: it defaults to same as 'encoding'. Or you may want to set one of the ucs encodings. It might make the same disk file bigger or smaller depending on your particular mix of characters. Also, IIUC, utf-8 is always big-endian (high bit first) while ucs can be big-endian or little-endian, so if you use it, you will probably need to set 'bomb' (see below)."

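The quoted remark about file size can be checked with a small sketch. It is only an illustration: the sample strings are made up, and Python's utf-16 codecs stand in here for the "ucs" encodings Vim offers (an assumption that holds for the characters used below).

```python
# Compare how many bytes the same text needs in different encodings.
# The sample strings are arbitrary illustrations, not taken from the Vim wiki.

def encoded_sizes(text):
    """Return the encoded length in bytes for a few common encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" variant fixes the byte order and writes no BOM.
        "utf-16-le": len(text.encode("utf-16-le")),
        # Plain "utf-16" prepends a 2-byte byte order mark (BOM).
        "utf-16": len(text.encode("utf-16")),
    }

ascii_sizes = encoded_sizes("hello")
cjk_sizes = encoded_sizes("日本語")
print(ascii_sizes)  # {'utf-8': 5, 'utf-16-le': 10, 'utf-16': 12}
print(cjk_sizes)    # {'utf-8': 9, 'utf-16-le': 6, 'utf-16': 8}
```

ASCII-heavy text is smaller in UTF-8, while text made mostly of CJK characters can be smaller in a 16-bit encoding.
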
Save all files in Visual Studio project as UTF-8

Since you're already in Visual Studio, why not just simply write the code?

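// Requires: using System.IO; using System.Text;
foreach (var f in new DirectoryInfo(@"...").GetFiles("*.cs", SearchOption.AllDirectories)) {
    // ReadAllText detects the encoding from a BOM and otherwise assumes UTF-8
    string s = File.ReadAllText(f.FullName);
    // Encoding.UTF8 writes the file back as UTF-8 with a BOM
    File.WriteAllText(f.FullName, s, Encoding.UTF8);
}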

Only three lines of code! I'm sure you can write this in less than a minute :-)

Changing encoding and charset to UTF-8

Is UTF-8 backwards compatible with ISO-8859-1?

Unicode is a superset of the code points contained in ISO-8859-1, so all of its characters can be represented in UTF-8, but they map to different byte values. The two encodings agree only for the ASCII range (0x00 to 0x7F); the ISO-8859-1 characters from 0x80 to 0xFF each take two bytes in UTF-8.

In terms of serving content or processing form submissions you are unlikely to have many issues.

It may mean a breaking change for URL handling. For example, a parameter value of naïve has two incompatible percent-encoded forms, the first from ISO-8859-1 and the second from UTF-8:

  • http://example.com/foo?p=na%EFve
  • http://example.com/foo?p=na%C3%AFve

This is only likely to be an issue if there are external applications relying on the old form.
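The two URL forms can be reproduced with Python's standard library. This is a minimal sketch; the variable names are illustrative only.

```python
from urllib.parse import quote, unquote

value = "naïve"

# Percent-encode the same value using each character encoding
latin1_form = quote(value, encoding="latin-1")
utf8_form = quote(value, encoding="utf-8")
print(latin1_form)  # na%EFve
print(utf8_form)    # na%C3%AFve

# Decoding the old ISO-8859-1 form as UTF-8 does not recover the value:
# the lone byte 0xEF is not valid UTF-8 and becomes a replacement character
misread = unquote(latin1_form, encoding="utf-8", errors="replace")
print(misread == value)  # False
```

A server that switches its URL decoding to UTF-8 would therefore misread links that external applications still generate in the old form.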

PhpStorm: Converting folders encoding to another

AFAIK it's not possible to do this for a whole folder at once, but it can be done for multiple files (e.g. all files in a certain folder):

  1. Select desired files in Project View panel
  2. Use File | File Encoding
  3. When asked -- make sure you choose "convert" and not just "read in another encoding".

You can repeat this procedure for each subfolder (still much faster than doing this for each file individually).


Another possible alternative is to use something like iconv (or any other similar tool) and do it in terminal/console.
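If iconv is not at hand, the same kind of batch conversion can be sketched in Python. The function name convert_tree and its parameters are hypothetical, and the sketch assumes you already know the source encoding of every matched file.

```python
from pathlib import Path

def convert_tree(root, pattern, source_encoding, target_encoding="utf-8"):
    """Re-encode every file under root matching pattern; return the converted paths."""
    converted = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        # Decode with the old encoding; raises UnicodeDecodeError on invalid input
        text = path.read_bytes().decode(source_encoding)
        # Work on raw bytes so that line endings are left untouched
        path.write_bytes(text.encode(target_encoding))
        converted.append(path)
    return converted
```

A call might look like convert_tree("src", "*.php", "iso-8859-1"), where the folder name and pattern are placeholders. Run it only once per tree and keep a backup: ISO-8859-1 accepts any byte sequence, so a second run would silently double-encode files that are already UTF-8.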

How can I avoid putting the magic encoding comment on top of every UTF-8 file in Ruby 1.9?

Explicit is better than implicit. Writing out the name of the encoding is good for your text editor, your interpreter, and anyone else who wants to look at the file. Different platforms have different defaults -- UTF-8, Windows-1252, Windows-1251, etc. -- and you will either hamper portability or platform integration if you automatically pick one over the other. Requiring more explicit encodings is a Good Thing.
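If you do go the explicit route, adding the comment to existing files can be automated. The following is a rough sketch in Python rather than Ruby; the helper name add_magic_comment and the simplified detection pattern are assumptions, not part of any Ruby tooling.

```python
import re

MAGIC = "# encoding: utf-8"
# Simplified, assumed pattern for an existing magic comment such as
# "# encoding: utf-8" or "# -*- coding: utf-8 -*-"
MAGIC_RE = re.compile(r"^#.*coding[:=]\s*[\w.-]+", re.IGNORECASE)

def add_magic_comment(source):
    """Return source with the magic comment added unless one is already present."""
    lines = source.splitlines(keepends=True)
    # The comment must be on line 1, or on line 2 when line 1 is a shebang
    insert_at = 1 if lines and lines[0].startswith("#!") else 0
    if len(lines) > insert_at and MAGIC_RE.match(lines[insert_at]):
        return source
    if insert_at == 1 and not lines[0].endswith("\n"):
        lines[0] += "\n"
    lines.insert(insert_at, MAGIC + "\n")
    return "".join(lines)
```

Applying it to each .rb file is then a matter of reading the file, passing its text through the function, and writing the result back.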

It might be a good idea to integrate your Rails app with GetText. Then all of your UTF-8 strings will be isolated to a small number of translation files, and your Ruby modules will be clean ASCII.

Unicode (UTF-8) reading and writing to files in Python

In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.

Writing Capit\xc3\xa1n into the file in a text editor means that the file actually contains the literal characters \xc3\xa1 (backslash, x, c, 3, and so on). Those are 8 bytes, and the code reads them all. We can see this by displaying the result:

# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'

# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'

Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.

In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:

# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán

The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.

In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:

# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
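For comparison, here is a minimal Python 3 sketch of the straightforward case, where the real character is written through a text-mode file with an explicit encoding. The scratch file location is made up for the example.

```python
import os
import tempfile

text = "Capitán\n"
path = os.path.join(tempfile.mkdtemp(), "f2")  # hypothetical scratch file

# Text mode with an explicit encoding: Python encodes on write.
# newline="" keeps "\n" as is on every platform.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(text)

# The file now holds the two bytes C3 A1 for á, not backslash escapes
with open(path, "rb") as f:
    raw = f.read()
print(raw)  # b'Capit\xc3\xa1n\n'

# Reading with the same encoding gives back the original string
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
print(round_trip == text)  # True
```

Here the file is 9 bytes long, and no escape-sequence decoding is needed when reading it back.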

