Writing Unicode text to a text file?
Deal exclusively with unicode objects as much as possible by decoding things to unicode objects when you first get them and encoding them as necessary on the way out.
If your string is actually a unicode object, you'll need to encode it to a byte string (a Python 2 str) in some encoding such as UTF-8 before writing it to a file:
foo = u'Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()
When you read that file again, you'll get a UTF-8-encoded byte string that you can decode to a unicode object:
f = file('test', 'r')
print f.read().decode('utf8')
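The same "decode on the way in, encode on the way out" rule can be sketched in Python 3 terms, where str plays the role of the unicode object and bytes the role of the encoded string. The helper names save_text and load_text and the temporary path are assumptions made up for this illustration; they are not part of the original answer.

```python
import os
import tempfile


def save_text(path, text, encoding="utf-8"):
    """Encode text only at the output boundary, then write raw bytes."""
    with open(path, "wb") as f:
        f.write(text.encode(encoding))


def load_text(path, encoding="utf-8"):
    """Read raw bytes and decode them as soon as they enter the program."""
    with open(path, "rb") as f:
        return f.read().decode(encoding)


foo = "Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말."
path = os.path.join(tempfile.mkdtemp(), "test")  # hypothetical scratch file

save_text(path, foo)
round_tripped = load_text(path)
print(round_tripped == foo)
```

Everything between the two helpers works with text only, so the encoding is named in exactly two places.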
How to write unicode text to file in python 2 & 3 using same code?
You say:
The only way to write it to a file in python2 is:
fp = open("/tmp/test", "w")
txt2 = txt.encode('utf-8')
fp.write(txt2) # It works
But that's not true. There are many ways to do it that are better than this. The One Obvious Way To Do It is with io.open. In 3.x, this is the same function as the builtin open. In 2.6 and 2.7, it's effectively a backport of the 3.x builtin. This means you get 3.x-style Unicode text files in both versions:
import io
fp = io.open("/tmp/test", "w", encoding='utf-8')
fp.write(txt) # Pass the unicode text itself; the file object encodes it
If you need compatibility with 2.5 or earlier, or possibly 2.6 and 3.0 (they support io.open, but it's very slow in some cases), you can use the older way, codecs.open:
import codecs
fp = codecs.open("/tmp/test", "w", encoding='utf-8')
fp.write(txt) # Pass the unicode text itself; the file object encodes it
There are differences between the two under the covers, but most code you write isn't going to be interested in the underlying raw file or the encoder buffer or anything else besides the basic file-like object API, so you can also use try/except ImportError to fall back to codecs if io isn't available.
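A minimal sketch of that fallback might look like the following. The alias text_open, the sample text, and the temporary path are assumptions chosen for this illustration. The text is written with escapes so the snippet does not depend on a source-encoding declaration, and try/finally is used instead of a with block so the same code would also parse on very old interpreters.

```python
import os
import tempfile

try:
    # Python 2.6+ and 3.x: io.open behaves like the 3.x builtin open.
    from io import open as text_open
except ImportError:
    # Older interpreters without the io module: fall back to codecs.open.
    from codecs import open as text_open

txt = u"\u0394, \u0419, \u05e7"  # unicode text: Greek, Cyrillic, Hebrew letters
path = os.path.join(tempfile.mkdtemp(), "test")  # hypothetical scratch file

fp = text_open(path, "w", encoding="utf-8")
try:
    fp.write(txt)  # write text, not pre-encoded bytes
finally:
    fp.close()
```

Either function accepts the same mode and encoding arguments here, which is why the rest of the code does not need to know which one was imported.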
How to write Unicode string to text file in R Windows?
I think setting the Encoding of (a copy of) str to "unknown" before using cat() is less magic and works just as well. I think that should avoid any unwanted character set conversions in cat().
Here is an expanded example to demonstrate what I think happens in the original example:
print_info <- function(x) {
print(x)
print(Encoding(x))
str(x)
print(charToRaw(x))
}
cat("(1) Original string (UTF-8)\n")
str <- "\xe1\xbb\x8f"
Encoding(str) <- "UTF-8"
print_info(str)
cat(str, file="no-iconv")
cat("\n(2) Conversion to UTF-8, wrong input encoding (latin1)\n")
## from = "" is conversion from current locale, forcing "latin1" here
str2 <- iconv(str, from="latin1", to="UTF-8")
print_info(str2)
cat(str2, file="yes-iconv")
cat("\n(3) Converting (2) explicitly to latin1\n")
str3 <- iconv(str2, from="UTF-8", to="latin1")
print_info(str3)
cat(str3, file="latin")
cat("\n(4) Setting encoding of (1) to \"unknown\"\n")
str4 <- str
Encoding(str4) <- "unknown"
print_info(str4)
cat(str4, file="unknown")
In a "Latin-1" locale (see ?l10n_info) as used by R on Windows, output files "yes-iconv", "latin" and "unknown" should be correct (byte sequence 0xe1, 0xbb, 0x8f which is "ỏ").
In a "UTF-8" locale, files "no-iconv" and "unknown" should be correct.
The output of the example code is as follows, using R 3.3.2 64-bit Windows version running on Wine:
(1) Original string (UTF-8)
[1] "ỏ"
[1] "UTF-8"
chr "<U+1ECF>""| __truncated__
[1] e1 bb 8f
(2) Conversion to UTF-8, wrong input encoding (latin1)
[1] "á»\u008f"
[1] "UTF-8"
chr "á»\u008f"
[1] c3 a1 c2 bb c2 8f
(3) Converting (2) explicitly to latin1
[1] "á»"
[1] "latin1"
chr "á»"
[1] e1 bb 8f
(4) Setting encoding of (1) to "unknown"
[1] "á»"
[1] "unknown"
chr "á»"
[1] e1 bb 8f
In the original example, iconv() uses the default from = "" argument which means conversion from the current locale, which is effectively "latin1". Because the encoding of str is actually "UTF-8", the byte representation of the string is distorted in step (2), but then implicitly restored by cat() when it (presumably) converts the string back to the current locale, as demonstrated by the equivalent conversion in step (3).
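The byte-level effect of steps (2) and (3) can be reproduced outside R. The following is only a sketch in Python, using Python's latin-1 codec as a stand-in for the Windows locale; it mirrors the byte sequences printed above and says nothing about how cat() works internally.

```python
# UTF-8 bytes of U+1ECF, the string from step (1).
original = b"\xe1\xbb\x8f"

# Step (2): the bytes are wrongly interpreted as Latin-1 and re-encoded
# as UTF-8, so each original byte becomes its own two-byte character.
misread = original.decode("latin-1")
distorted = misread.encode("utf-8")

# Step (3): converting back from UTF-8 to Latin-1 undoes the distortion.
restored = distorted.decode("utf-8").encode("latin-1")

print(distorted.hex())  # c3a1c2bbc28f
print(restored.hex())   # e1bb8f
print(restored == original)
```

The two wrong conversions cancel each other out, which is why the original example appeared to work.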
Write Unicode (UTF-8) text file
You shouldn't be using old Pascal I/O at all. That did its job back in the 80s but is very obsolete today.
This century, you can use the TStringList. This is very commonly used in Delphi. For instance, VCL controls use TStrings to access a memo's lines of text and a combo box's or list box's items.
var SL := TStringList.Create;
try
SL.Add('∫cos(x)dx = sin(x) + C');
SL.Add('¬(a ∧ b) ⇔ ¬a ∨ ¬b');
SL.SaveToFile(FileName, TEncoding.UTF8);
finally
SL.Free;
end;
For more advanced needs, you can use a TStreamWriter:
var SW := TStreamWriter.Create(FileName, False, TEncoding.UTF8);
try
SW.WriteLine('αβγδε');
SW.WriteLine('ωφψξη');
finally
SW.Free;
end;
And for very simple needs, there are the new TFile methods in IOUtils.pas:
var S := '⌬ is aromatic.';
TFile.WriteAllText(FileName, S, TEncoding.UTF8); // string (possibly with linebreaks)
var Lines: TArray<string>;
Lines := ['☃ is cold.', '☼ is hot.'];
TFile.WriteAllLines(FileName, Lines, TEncoding.UTF8); // string array
As you can see, all these modern options allow you to specify UTF8 as encoding. If you prefer to use some other encoding, like UTF16, that's fine too.
Just forget about AssignFile, Reset, Rewrite, Append, CloseFile etc.
python - How to write unicode characters to files correctly
Specify the encoding explicitly by passing encoding='utf-8' to open():
text = 'سلام عزیزم! عزیزم سلام!'
with open('temp.txt', 'w', encoding='utf-8') as out_file:
print(text)
out_file.write(text)
with open('temp.txt', 'r', encoding='utf-8') as in_file:
print(in_file.read())
How do I specify new lines in a string in order to write multiple lines to a file?
It depends on how correct you want to be. \n will usually do the job. If you really want to get it right, you look up the newline character in the os module. (It's actually called linesep.)
Note: when writing to files using the Python API, do not use os.linesep. Just use \n; Python automatically translates that to the proper newline character for your platform.
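To see that translation without depending on the current platform, the sketch below asks for Windows-style line endings explicitly through the newline parameter of the Python 3 builtin open and then inspects the raw bytes. The file name and the temporary directory are assumptions made for this illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines.txt")  # hypothetical scratch file

# Write plain "\n"; the text layer turns each one into the requested "\r\n".
with open(path, "w", encoding="utf-8", newline="\r\n") as f:
    f.write("first\nsecond\n")

# Binary mode shows what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()

# Reading in text mode with the default newline handling maps "\r\n" back to "\n".
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

print(raw)
print(text == "first\nsecond\n")
```

With the default newline=None, the same substitution happens using os.linesep, which is why writing os.linesep yourself on Windows would produce doubled carriage returns.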