Write and Read a File with UTF-8 Encoding

Unicode (UTF-8) reading and writing to files in Python

In the notation u'Capit\xe1n\n' (just 'Capit\xe1n\n' in 3.x, where the u prefix is a syntax error in 3.0 through 3.2 and optional again from 3.3 on), the \xe1 represents just one character. \x is an escape sequence indicating that e1 is the character's code point in hexadecimal.
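
A quick check in the 3.x interpreter confirms that the escape denotes a single character:

# Python 3.x - '\xe1' is a single character, U+00E1
>>> s = 'Capit\xe1n\n'
>>> len(s)
8
>>> s[5], hex(ord(s[5]))
('á', '0xe1')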

Typing Capit\xc3\xa1n into the file in a text editor means that the file actually contains the literal characters \xc3\xa1. Those are 8 bytes (backslash, x, c, 3, backslash, x, a, 1), and the code reads them all. We can see this by displaying the result:

# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'

# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'

Instead of typing escape sequences, just enter the character á itself in the editor; the editor will then handle the conversion to UTF-8 and save it.
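
For example, if f2 is re-saved so that it contains a literal á, reading it back with the right encoding recovers the character:

# Python 3.x - the editor stored the á as the two bytes c3 a1
>>> open('f2', encoding='utf-8').read()
'Capitán\n'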

In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:

# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán

The result is a str that is encoded in UTF-8, where the accented character is represented by the two bytes that were written \xc3\xa1 in the original string. To get a unicode result, decode again with UTF-8.
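
For example, chaining the two decodes:

# Python 2.x - string_escape yields UTF-8 bytes; decoding those yields unicode
>>> 'Capit\\xc3\\xa1n\n'.decode('string_escape').decode('utf-8')
u'Capit\xe1n\n'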

In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:

# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'

How to read a UTF-8 encoded text file using Python

Since you are using Python 3, just add the encoding parameter to open():

corpus = open(
    r"C:\Users\Customer\Desktop\DISSERTATION\ettuthokai.txt", encoding="utf-8"
).read()

Python reading from a file and saving to UTF-8

Process text to and from Unicode at the I/O boundaries of your program using open with the encoding parameter. Make sure to use the (hopefully documented) encoding of the file being read. The default encoding varies by OS (specifically, locale.getpreferredencoding(False) is the encoding used), so I recommend always explicitly using the encoding parameter for portability and clarity (Python 3 syntax below):

with open(filename, 'r', encoding='utf8') as f:
    text = f.read()

# process Unicode text

with open(filename, 'w', encoding='utf8') as f:
    f.write(text)
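
To see which encoding open() falls back to when the parameter is omitted, you can ask the locale module directly; the answer varies from machine to machine, which is exactly why passing the encoding explicitly is safer:

import locale

# the encoding open() uses when no encoding parameter is passed
print(locale.getpreferredencoding(False))  # e.g. 'UTF-8' on Linux, 'cp1252' on Western Windows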

If you are still using Python 2, or need Python 2/3-compatible code, the io module implements open with the same semantics as Python 3's open and exists in both versions:

import io

with io.open(filename, 'r', encoding='utf8') as f:
    text = f.read()

# process Unicode text

with io.open(filename, 'w', encoding='utf8') as f:
    f.write(text)

How to Read/Write UTF-8 text files in C?

This code worked for me:

/* fgetwc example */
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
    /* A UTF-8 locale makes the wide-character functions
       decode and encode UTF-8 on the underlying byte streams. */
    setlocale(LC_ALL, "en_US.UTF-8");

    FILE *fin = fopen("in.txt", "r");
    FILE *fout = fopen("out.txt", "w");
    if (fin == NULL || fout == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }

    wint_t wc;
    while ((wc = fgetwc(fin)) != WEOF) {
        /* work with "wc"; here we simply copy it to the output file */
        fputwc((wchar_t)wc, fout);
    }

    fclose(fin);
    fclose(fout);
    printf("File has been created...\n");
    return 0;
}

How to read a single UTF-8 character from a file in Python?

The file is binary but particular ranges in the file are UTF-8 encoded strings with the length coming before the string.

You have the length of the string, which is most likely the byte length, as that makes the most sense in a binary file. Read that range of bytes in binary mode and decode it after the fact. Here's a contrived example of writing a binary file containing a UTF-8 string with the length encoded first: a two-byte big-endian length followed by the encoded string data, surrounded by 10 bytes of random data on each side.

import os
import struct

string = "我不喜欢你女朋友。你需要一个新的。"

with open('sample.bin', 'wb') as f:
    f.write(os.urandom(10))                   # write 10 random bytes
    encoded = string.encode()
    f.write(len(encoded).to_bytes(2, 'big'))  # write a two-byte big-endian length
    f.write(encoded)                          # write the UTF-8-encoded string
    f.write(os.urandom(10))                   # 10 more random bytes

with open('sample.bin', 'rb') as f:
    print(f.read())  # show the raw data

# Option 1: seek to the known offset, read the length, then the string
with open('sample.bin', 'rb') as f:
    f.seek(10)
    length = int.from_bytes(f.read(2), 'big')
    result = f.read(length).decode()
    print(result)

# Option 2: read the fixed portion as a structure
with open('sample.bin', 'rb') as f:
    # read 10 bytes and a big-endian 16-bit length
    *_, length = struct.unpack('>10bH', f.read(12))
    result = f.read(length).decode()
    print(result)

Output:

b'\xa3\x1e\x07S8\xb9LA\xf0_\x003\xe6\x88\x91\xe4\xb8\x8d\xe5\x96\x9c\xe6\xac\xa2\xe4\xbd\xa0\xe5\xa5\xb3\xe6\x9c\x8b\xe5\x8f\x8b\xe3\x80\x82\xe4\xbd\xa0\xe9\x9c\x80\xe8\xa6\x81\xe4\xb8\x80\xe4\xb8\xaa\xe6\x96\xb0\xe7\x9a\x84\xe3\x80\x82ta\xacg\x9c\x82\x85\x95\xf9\x8c'
我不喜欢你女朋友。你需要一个新的。
我不喜欢你女朋友。你需要一个新的。

If you do need to read UTF-8 characters from a particular byte offset in a file, you can wrap the binary stream in a UTF-8 reader after seeking:

import codecs

with open('sample.bin', 'rb') as f:
    f.seek(12)  # skip the 10 random bytes and the two-byte length
    c = codecs.getreader('utf8')(f)
    print(c.read(1))

Output:

我
Write to UTF-8 file in Python

I believe the problem is that codecs.BOM_UTF8 is a byte string, not a Unicode string. I suspect the file handler is trying to guess what you really mean based on "I'm meant to be writing Unicode as UTF-8-encoded text, but you've given me a byte string!"
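
You can see the mismatch in the 2.x interpreter:

# Python 2.x - codecs.BOM_UTF8 is a plain byte string (str), not unicode
>>> import codecs
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'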

Try writing the Unicode string for the byte order mark (i.e. Unicode U+FEFF) directly, so that the file just encodes that as UTF-8:

import codecs

file = codecs.open("lol", "w", "utf-8")
file.write(u'\ufeff')
file.close()

(That seems to give the right answer - a file with bytes EF BB BF.)

EDIT: S. Lott's suggestion of using "utf-8-sig" as the encoding is a better one than explicitly writing the BOM yourself, but I'll leave this answer here as it explains what was going wrong before.
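
For reference, a minimal sketch of that "utf-8-sig" suggestion; the codec writes the BOM for you on the first write:

import codecs

file = codecs.open("lol", "w", "utf-8-sig")
file.write(u"hello")  # the BOM (EF BB BF) is emitted automatically before this data
file.close()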

Write file in UTF-8 mode using Perl

You want

use utf8;                       # Source code is encoded using UTF-8.

open(my $FH, ">:encoding(utf-8)", "test11.txt")
   or die $!;

print $FH "something Çirçös";

or

use utf8;                       # Source code is encoded using UTF-8.
use open ':encoding(utf-8)'; # Sets the default encoding for handles opened in scope.

open(my $FH, ">", "test11.txt")
   or die $!;

print $FH "something Çirçös";

Notes:

  • The encoding you want is utf-8 (case-insensitive), not utf8 (a Perl-specific encoding).
  • Don't use global vars; use lexical (my) vars.
  • If you leave off the instruction to encode, you might get lucky and get the right output (along with a "wide character" warning). Don't count on this. You won't always be lucky.

    # Unlucky.
    $ perl -we'use utf8; print "é"' | od -t x1
    0000000 e9
    0000001

    # Lucky.
    $ perl -we'use utf8; print "é♡"' | od -t x1
    Wide character in print at -e line 1.
    0000000 c3 a9 e2 99 a1
    0000005

    # Correct.
    $ perl -we'use utf8; binmode STDOUT, ":encoding(utf-8)"; print "é♡"' | od -t x1
    0000000 c3 a9 e2 99 a1
    0000005

What is the encoding to read and write files with special characters such as en dash, left quotes, etc?

Your code works OK if the file you're reading is already encoded in UTF-8, but it won't work if the file uses a different encoding. I would recommend loading the file into a text editor like Notepad++ that tells you the file's encoding (in the status bar). If it's not encoded in UTF-8 to start with, reading and writing it as UTF-8 won't work.

If you want to try reading the file in the system's default encoding, you can use Encoding.Default instead of Encoding.UTF8. Then write to a new file, because you can't meaningfully mix multiple encodings in the same file. The default encoding is likely to be the correct one if UTF-8 isn't.

string filePath = @"C:\users\yourname\desktop\TestFile.txt";
string[] lines = File.ReadAllLines(filePath, Encoding.Default);

string outFile = @"C:\users\yourname\desktop\outfile.txt";
Stream s = new FileStream(outFile, FileMode.Append);
StreamWriter sw = new StreamWriter(s, Encoding.UTF8, 1000, true);
foreach (var line in lines)
    sw.WriteLine(line);
sw.Close();
s.Close();  // the writer was created with leaveOpen: true, so close the stream too

Alternatively, if you have to append to the same file, use the same encoding you used for reading it, or rewrite the whole file. If the original file looks OK in Notepad, the system's default encoding is probably the correct one, and you can keep it by writing with Encoding.Default. If you want to change the encoding of the whole file to UTF-8, you have to rewrite the whole file instead of appending to it.

If Notepad++ shows an encoding other than UTF-8 in the status bar, then you can't read the file as UTF-8. You can only use UTF-8 if the status bar reports the file's encoding as UTF-8.

You can use the "Encoding" menu's "Convert to UTF-8" command in Notepad++ to make the file compatible with your application.

Warning: Don't confuse the "Encode in UTF-8" command with the "Convert to UTF-8" command. If the file looks correct, you want to use "Convert to UTF-8". If you use "Encode in UTF-8" that will re-interpret the existing data as a new encoding instead of changing the content to use a new encoding.

Edit: Change Encoding.GetEncoding(0) to Encoding.Default.


