Writing Unicode Text to a Text File

Writing Unicode text to a text file?

Deal exclusively with unicode objects as much as possible by decoding things to unicode objects when you first get them and encoding them as necessary on the way out.
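In Python 3 terms, where the unicode type is simply str and encoded data is bytes, that boundary principle can be sketched like this (the helper names are illustrative, not from any library):

```python
def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode incoming bytes to text as soon as you receive them."""
    return raw.decode(encoding)

def write_text(text: str, encoding: str = "utf-8") -> bytes:
    """Encode text back to bytes only at the moment of output."""
    return text.encode(encoding)

data = "Δ, Й, ק".encode("utf-8")  # pretend this arrived from a file or socket
text = read_text(data)             # work with real text objects internally
out = write_text(text)             # encode only on the way out
```

Everything between the two boundaries then deals only with text, never with raw bytes.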

If your string is actually a unicode object, you'll need to encode it to a byte string in some encoding (here, UTF-8) before writing it to a file:

foo = u'Δ, Й, ק, ‎ م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()

When you read that file again, you'll get an encoded byte string that you can decode back to a unicode object:

f = open('test', 'r')
print f.read().decode('utf8')

How to write unicode text to file in python 2 & 3 using same code?

You say:

The only way to write it to a file in python2 is:

fp = open("/tmp/test", "w")
txt2 = txt.encode('utf-8')
fp.write(txt2) # It works

But that's not true. There are many ways to do it that are better than this. The One Obvious Way To Do It is with io.open. In 3.x, this is the same function as the builtin open. In 2.6 and 2.7, it's effectively a backport of the 3.x builtin. This means you get 3.x-style Unicode text files in both versions:

import io

fp = io.open("/tmp/test", "w", encoding='utf-8')
fp.write(txt) # It works; the text-mode file expects unicode and encodes for you

If you need compatibility with 2.5 or earlier (or possibly with 2.6 and 3.0, which support io.open but can be very slow in some cases), you can use the older way, codecs.open:

import codecs

fp = codecs.open("/tmp/test", "w", encoding='utf-8')
fp.write(txt) # It works

There are differences between the two under the covers, but most code you write isn't interested in the underlying raw file, the encoder buffer, or anything else beyond the basic file-like object API, so you can also use try/except ImportError to fall back to codecs when io isn't available.
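As a minimal sketch of that fallback (the name open_text is an assumption, not a standard API), assuming your code only needs the basic file-like API:

```python
try:
    # On 2.6+ and all of 3.x, io.open gives 3.x-style Unicode text files.
    from io import open as open_text
except ImportError:
    # On 2.5 and earlier there is no io module; codecs.open is the closest
    # equivalent for encoding-aware text files.
    import codecs

    def open_text(path, mode="r", encoding=None):
        return codecs.open(path, mode, encoding=encoding)

# Either implementation accepts an encoding and handles unicode text:
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test")
with open_text(path, "w", encoding="utf-8") as fp:
    fp.write(u"\u0394")
```

Reading the file back through the same function returns decoded text in both versions.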

How to write Unicode string to text file in R Windows?

I think setting the Encoding of (a copy of) str to "unknown" before using cat() is less magical and works just as well; it should avoid any unwanted character-set conversions in cat().

Here is an expanded example to demonstrate what I think happens in the original example:

print_info <- function(x) {
  print(x)
  print(Encoding(x))
  str(x)
  print(charToRaw(x))
}

cat("(1) Original string (UTF-8)\n")
str <- "\xe1\xbb\x8f"
Encoding(str) <- "UTF-8"
print_info(str)
cat(str, file="no-iconv")

cat("\n(2) Conversion to UTF-8, wrong input encoding (latin1)\n")
## from = "" is conversion from current locale, forcing "latin1" here
str2 <- iconv(str, from="latin1", to="UTF-8")
print_info(str2)
cat(str2, file="yes-iconv")

cat("\n(3) Converting (2) explicitly to latin1\n")
str3 <- iconv(str2, from="UTF-8", to="latin1")
print_info(str3)
cat(str3, file="latin")

cat("\n(4) Setting encoding of (1) to \"unknown\"\n")
str4 <- str
Encoding(str4) <- "unknown"
print_info(str4)
cat(str4, file="unknown")

In a "Latin-1" locale (see ?l10n_info) as used by R on Windows, output files "yes-iconv", "latin" and "unknown" should be correct (byte sequence 0xe1, 0xbb, 0x8f which is "ỏ").

In a "UTF-8" locale, files "no-iconv" and "unknown" should be correct.

The output of the example code is as follows, using R 3.3.2 64-bit Windows version running on Wine:

(1) Original string (UTF-8)
[1] "ỏ"
[1] "UTF-8"
chr "<U+1ECF>"
[1] e1 bb 8f

(2) Conversion to UTF-8, wrong input encoding (latin1)
[1] "á»\u008f"
[1] "UTF-8"
chr "á»\u008f"
[1] c3 a1 c2 bb c2 8f

(3) Converting (2) explicitly to latin1
[1] "á»"
[1] "latin1"
chr "á»"
[1] e1 bb 8f

(4) Setting encoding of (1) to "unknown"
[1] "á»"
[1] "unknown"
chr "á»"
[1] e1 bb 8f

In the original example, iconv() uses the default from = "" argument which means conversion from the current locale, which is effectively "latin1". Because the encoding of str is actually "UTF-8", the byte representation of the string is distorted in step (2), but then implicitly restored by cat() when it (presumably) converts the string back to the current locale, as demonstrated by the equivalent conversion in step (3).

Write Unicode (UTF-8) text file

You shouldn't be using old Pascal I/O at all. That did its job back in the 80s but is very obsolete today.


This century, you can use a TStringList. It is very commonly used in Delphi; for instance, VCL controls use TStrings to access a memo's lines of text and a combo box's or list box's items.

var SL := TStringList.Create;
try
  SL.Add('∫cos(x)dx = sin(x) + C');
  SL.Add('¬(a ∧ b) ⇔ ¬a ∨ ¬b');
  SL.SaveToFile(FileName, TEncoding.UTF8);
finally
  SL.Free;
end;

For more advanced needs, you can use a TStreamWriter:

var SW := TStreamWriter.Create(FileName, False, TEncoding.UTF8);
try
  SW.WriteLine('αβγδε');
  SW.WriteLine('ωφψξη');
finally
  SW.Free;
end;

And for very simple needs, there are the new TFile methods in IOUtils.pas:

var S := '⌬ is aromatic.';
TFile.WriteAllText(FileName, S, TEncoding.UTF8); // string (possibly with linebreaks)

var Lines: TArray<string>;
Lines := ['☃ is cold.', '☼ is hot.'];
TFile.WriteAllLines(FileName, Lines, TEncoding.UTF8); // string array

As you can see, all these modern options allow you to specify UTF-8 as the encoding. If you prefer some other encoding, like UTF-16, that's fine too.


Just forget about AssignFile, Reset, Rewrite, Append, CloseFile etc.

How to write unicode characters to files correctly in Python

Specify the encoding when opening the file, with encoding='utf-8':

text = 'سلام عزیزم! عزیزم سلام!'
with open('temp.txt', 'w', encoding='utf-8') as out_file:
    print(text)
    out_file.write(text)
with open('temp.txt', 'r', encoding='utf-8') as in_file:
    print(in_file.read())

How do I specify new lines in a string in order to write multiple lines to a file?

It depends on how correct you want to be. \n will usually do the job. If you really want to get it right, look up the newline character in the os module (it's called os.linesep).

Note: when writing to files using the Python text-mode API, do not use os.linesep. Just use \n; Python automatically translates that to the proper newline character for your platform.
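A small sketch of that note, assuming an ordinary text-mode file (the file name is illustrative):

```python
import os
import tempfile

# Inside Python strings, "\n" is the universal newline. The text layer of
# open() translates it to the platform convention (os.linesep) on write.
lines = ["first line", "second line"]
content = "\n".join(lines) + "\n"

path = os.path.join(tempfile.mkdtemp(), "multi.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(content)  # use "\n" here, never os.linesep

# Reading in text mode translates platform newlines back to "\n":
with open(path, "r", encoding="utf-8") as f:
    assert f.read().splitlines() == lines
```

Writing os.linesep through a text-mode file on Windows would produce "\r\r\n", which is exactly the bug the note warns about.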


