Set encoding and fileencoding to utf-8 in Vim
TL;DR
In the first case, set encoding=utf-8 changes the encoding Vim uses internally, which is also the output encoding shown in the terminal. In the second case, set fileencoding=utf-8 changes the encoding of the file that is written to disk.
As stated by @Dennis, you can set them both in your ~/.vimrc if you always want to work in UTF-8.
More details
From the Vim wiki on working with Unicode:
"encoding
sets how vim shall represent characters internally. Utf-8 is necessary for most flavors of Unicode."
"fileencoding
sets the encoding for a particular file (local to buffer); :setglobal sets the default value. An empty value can also be used: it defaults to same as 'encoding'. Or you may want to set one of the ucs encodings, It might make the same disk file bigger or smaller depending on your particular mix of characters. Also, IIUC, utf-8 is always big-endian (high bit first) while ucs can be big-endian or little-endian, so if you use it, you will probably need to set 'bomb" (see below)."
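Putting the two settings together, a minimal ~/.vimrc fragment might look like this (the setglobal form for fileencoding follows the wiki excerpt above; adjust to taste):

```vim
" Internal representation: UTF-8 handles all flavors of Unicode
set encoding=utf-8
" Default encoding for files written to disk
setglobal fileencoding=utf-8
```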
Save all files in Visual Studio project as UTF-8
Since you're already in Visual Studio, why not simply write the code?
foreach (var f in new DirectoryInfo(@"...").GetFiles("*.cs", SearchOption.AllDirectories)) {
    string s = File.ReadAllText(f.FullName);
    File.WriteAllText(f.FullName, s, Encoding.UTF8);
}
Only three lines of code! I'm sure you can write this in less than a minute :-)
(One caveat: File.ReadAllText assumes UTF-8 unless the file starts with a byte-order mark, so files saved in a legacy code page should be read with an explicit Encoding argument instead.)
Changing encoding and charset to UTF-8
Is UTF-8 backwards compatible with ISO-8859-1?
Unicode is a superset of the code points contained in ISO-8859-1, so all the "characters" can be represented in UTF-8, but how they map to byte values is different. There is overlap between the encoded values, but it is not 100%.
In terms of serving content or processing form submissions, you are unlikely to have many issues.
It may mean a breaking change for URL handling. For example, for a parameter value naïve
there would be two incompatible forms:
http://example.com/foo?p=na%EFve
http://example.com/foo?p=na%C3%AFve
This is only likely to be an issue if there are external applications relying on the old form.
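The two URL forms above can be reproduced with Python's urllib.parse.quote, just as an illustration of how the same character percent-encodes differently under each encoding (the parameter value naïve is taken from the URLs above):

```python
from urllib.parse import quote

# Percent-encoding of 'naïve' under the two encodings discussed above
print(quote('naïve', encoding='iso-8859-1'))  # na%EFve    (one byte for ï)
print(quote('naïve', encoding='utf-8'))       # na%C3%AFve (two bytes for ï)
```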
PhpStorm: Converting folders encoding to another
AFAIK it's not possible to do this for a whole folder at a time, but it can be done for multiple files (e.g. all files in a certain folder):
- Select desired files in Project View panel
- Use
File | File Encoding
- When asked, make sure you choose "convert" and not just "read in another encoding".
You can repeat this procedure for each subfolder (still much faster than doing this for each file individually).
Another possible alternative is to use something like iconv (or any other similar tool) and do it in the terminal/console.
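As a sketch of the iconv route (assuming bash and a find that supports -print0; the src directory, *.php pattern, ISO-8859-1 source encoding, and demo file are all made up for illustration — adjust them to your project):

```shell
# Demo setup: a file containing 'naïve' encoded as ISO-8859-1 (single 0xEF byte)
mkdir -p src
printf 'na\357ve\n' > src/a.php

# Convert every .php file under src/ from ISO-8859-1 to UTF-8 in place.
# iconv cannot overwrite its input, so write to a temp file and move it back.
find src -name '*.php' -print0 | while IFS= read -r -d '' f; do
  iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
```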
How can I avoid putting the magic encoding comment on top of every UTF-8 file in Ruby 1.9?
Explicit is better than implicit. Writing out the name of the encoding is good for your text editor, your interpreter, and anyone else who wants to look at the file. Different platforms have different defaults -- UTF-8, Windows-1252, Windows-1251, etc. -- and you will either hamper portability or platform integration if you automatically pick one over the other. Requiring more explicit encodings is a Good Thing.
It might be a good idea to integrate your Rails app with GetText. Then all of your UTF-8 strings will be isolated to a small number of translation files, and your Ruby modules will be clean ASCII.
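For reference, the magic comment under discussion is the first line of the source file (a minimal sketch; since Ruby 2.0 the default source encoding is UTF-8 and the comment is no longer required):

```ruby
# encoding: utf-8
# With the magic comment above, string literals in this file are UTF-8
s = "Capitán"
puts s.encoding  # UTF-8
```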
Unicode (UTF-8) reading and writing to files in Python
In the notation u'Capit\xe1n\n' (which should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1, where the u prefix is not allowed), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is a value in hexadecimal.
Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains the literal characters \xc3\xa1. Those are 8 bytes, and the code reads them all. We can see this by displaying the result:
# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'
# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
Instead, just input characters like á
in the editor, which should then handle the conversion to UTF-8 and save it.
In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape
codec:
# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán
The result is a str
that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1
in the original string. To get a unicode
result, decode again with UTF-8.
In 3.x, the string_escape
codec is replaced with unicode_escape
, and it is strictly enforced that we can only encode
from a str
to bytes
, and decode
from bytes
to str
. unicode_escape
needs to start with a bytes
in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3
and \xa1
as character escapes rather than byte escapes. As a result, we have to do a bit more work:
# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
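For completeness: in 3.x, the everyday way to read and write UTF-8 text is simply to pass encoding= to open() and let Python do the conversion (a sketch reusing the file name f2 from above):

```python
# Python 3.x: let open() handle UTF-8 conversion in both directions
with open('f2', 'w', encoding='utf-8') as f:
    f.write('Capitán\n')

with open('f2', 'rb') as f:
    print(f.read())        # b'Capit\xc3\xa1n\n' -- the raw UTF-8 bytes on disk

with open('f2', encoding='utf-8') as f:
    print(f.read())        # Capitán
```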