Convert ASCII to UTF-8 Encoding

Force encode from US-ASCII to UTF-8 (iconv)

ASCII is a subset of UTF-8, so every ASCII file is already valid UTF-8. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" are exactly the same bytes, so there's nothing to do.

It looks like your problem is that the files are not actually ASCII. You need to determine what encoding they are using, and transcode them properly.
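
A minimal sketch of that workflow in Python, assuming the chardet package is installed and using hypothetical file names (input.txt, output.txt):

import chardet

# Read the raw bytes and guess the real encoding.
with open('input.txt', 'rb') as f:
    raw = f.read()
detected = chardet.detect(raw)['encoding']  # e.g. 'ISO-8859-1' or 'Windows-1252'

# Decode with the detected codec, then write the text back out as UTF-8.
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(raw.decode(detected))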

Convert extended ASCII character codes to UTF-8 byte codes

If your "extended ASCII" encoding is ISO-8859-1, then you're in luck. The first 256 code points of Unicode (the code points themselves, not their UTF-8 encodings) match ISO-8859-1 byte-for-byte, i.e. á (0xE1 in ISO-8859-1) == U+00E1.

If you have any other encoding, then you're out of luck: the byte-to-character mapping is arbitrary, so it requires a lookup table (a Rosetta stone), not calculation.
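
For illustration (Python, with CP437 picked arbitrarily as the "other" encoding): the codec's built-in lookup table is exactly that Rosetta stone.

>>> b'\x82'.decode('cp437')   # 0x82 means é only because the CP437 table says so
'é'
>>> b'\x82'.decode('latin1')  # the same byte is an unrelated control character in ISO-8859-1
'\x82'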

Once you have a Unicode code point, you can encode it to UTF-8 relatively easily using the specification found in https://www.rfc-editor.org/rfc/rfc3629. Since the question doesn't specify a programming language, it's out of scope to detail that conversion here.
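
As a sketch anyway (Python, purely for illustration; utf8_two_byte is a made-up helper name): the RFC 3629 two-byte case, which covers all of ISO-8859-1's non-ASCII range, is just a few bit operations.

def utf8_two_byte(cp):
    # RFC 3629: code points U+0080..U+07FF are encoded as 110xxxxx 10xxxxxx.
    assert 0x80 <= cp <= 0x7FF
    return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])

assert utf8_two_byte(0x00E1) == 'á'.encode('utf-8')  # both give b'\xc3\xa1'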

Percent-encoding is then a matter of applying the percent-encoding specification (RFC 3986) to the UTF-8 bytes.

Fortunately, most programming languages have built-in or third-party libraries for this kind of conversion.
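
In Python, for example, the standard library handles the whole chain (encode to UTF-8, then percent-encode each byte):

>>> from urllib.parse import quote
>>> quote('á')  # UTF-8-encodes first, then percent-encodes the two bytes
'%C3%A1'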

Python – How do I convert an ASCII string into UTF-8?

If the input string contains the raw byte ordinals (such as '\xc3\xa9', which displays as 'Ã©', instead of 'é'), use latin1 to encode it to bytes verbatim, then decode with the desired encoding.

>>> "pasé".encode('latin1').decode()
'pasé'

Encode a file from ASCII to UTF-8

ASCII is a subset of UTF-8. Any ASCII-encoded file is also valid UTF-8.

From the Wikipedia article on UTF-8:

The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.

In other words, your operation is a no-op; nothing should change.
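
You can check this directly in Python: encoding ASCII text with either codec produces identical bytes.

>>> 'Hello, world'.encode('ascii') == 'Hello, world'.encode('utf-8')
True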

Any tool that detects codecs (like chardet) would rightly still mark it as ASCII. Marking it as UTF-8 would also be valid, but so would marking it as ISO-8859-1 (Latin-1), CP-1252 (the Windows Latin-1-based codepage), or any number of other codecs that are supersets of ASCII. That would be confusing, however, since your data consists only of ASCII codepoints. Tools that accept only ASCII would accept your CSV file, whereas they would reject UTF-8 data containing codepoints beyond ASCII.

If the goal is to validate that any piece of text is valid UTF-8 by using chardet, then you'll have to accept ASCII too:

import chardet

def is_utf8(content):
    # chardet labels pure-ASCII bytes as 'ascii', which is still valid UTF-8.
    encoding = chardet.detect(content)['encoding']
    return encoding in {'utf-8', 'ascii'}
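
For example (chardet's guesses on very short inputs can vary between versions, so treat this as illustrative):

>>> is_utf8('héllo wörld'.encode('utf-8'))
True
>>> is_utf8(b'plain old ASCII')
True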

