How to clean \xc2\xa0 \xc2\xa0..... in text data

Those bytes are UTF-8 encoded text: \xc2\xa0 is the UTF-8 encoding of a non-breaking space. Open the file as UTF-8.

3.x:

with open(file, 'r', encoding='utf-8') as myfile:
    ...

2.x:

import codecs

with codecs.open(file, 'r', encoding='utf-8') as myfile:
    ...
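
To make the effect of the encoding argument concrete, here is a minimal Python 3 sketch. The sample bytes and the file name sample.txt are made up for illustration; the point is only that decoding as UTF-8 yields a single U+00A0 character, while decoding the same bytes as Latin-1 yields two characters.

```python
import os
import tempfile

# Hypothetical sample: "price", a UTF-8 encoded non-breaking space, "10".
raw = b"price\xc2\xa010"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # file name is an assumption
    with open(path, "wb") as f:
        f.write(raw)

    # Decoding as UTF-8 turns the two bytes into one character, U+00A0.
    with open(path, "r", encoding="utf-8") as f:
        as_utf8 = f.read()

    # Decoding as Latin-1 maps each byte to its own character instead.
    with open(path, "r", encoding="latin-1") as f:
        as_latin1 = f.read()

# Once the text is decoded correctly, the non-breaking space is one character.
cleaned = as_utf8.replace("\u00a0", " ")
```

After decoding with the right encoding, replacing "\u00a0" is enough; there is no need to search for the individual bytes.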

Unicode In Python, Completely Demystified

Python: Got \xa0 instead of space in CSV and cannot remove or convert

The \xa0 that you see is a sequence of 4 characters: \ x a 0. All these characters are plain ASCII, so no character set problem here.

Apparently, you are supposed to interpret these escape sequences. Your idea of replacing them with a space is good, but you have to be careful about the backslash character. When it appears in a string literal, it has to be written \\. So try this:

line = line.replace("\\xa0", " ")

or:

line = line.replace(r"\xa0", " ")

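The r in front of the string makes it a raw string literal, in which every character is taken literally, even a backslash. Note that replace returns a new string, so the result has to be assigned.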
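
The difference between the four-character text and the single character can be checked directly. This is a small Python 3 sketch; the sample line is invented for illustration.

```python
# Two ways of writing the same four characters: backslash, x, a, 0.
escaped = "\\xa0"
raw = r"\xa0"

# One character: the actual non-breaking space, U+00A0.
nbsp = "\xa0"

same_text = escaped == raw        # True: both are the literal text \xa0
lengths = (len(raw), len(nbsp))   # (4, 1)

# Hypothetical CSV field that contains the literal text \xa0.
line = r"10\xa0kg"
cleaned = line.replace(r"\xa0", " ")
```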


Note that the data in the CSV file is full of inconsistencies. Examples:

  • \n probably means a linebreak.
  • \\n also appears, and it probably means a linebreak also.
  • \xa0 is a nonbreaking space, encoded in ISO-8859-1.
  • \xc2\xa0 is a nonbreaking space, encoded in UTF-8.
  • \\xc2\\xa0 also appears, with the same meaning.
  • \\\\n also appears.

So to get meaningful content out of that file, you should repeatedly interpret the escape sequences until nothing changes. After that, try to interpret the resulting byte sequence as UTF-8. If it works, fine. If not, interpret it as Codepage 1252 (which is a superset of ISO-8859-1).
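
The procedure described above could be sketched in Python 3 roughly as follows. The function names are hypothetical, and the sketch assumes that the only escapes present are the ones listed (an escaped backslash, \n, and \xHH); a real file may need more cases.

```python
import re

# Matches one escaped backslash, one \n, or one \xHH escape (as bytes).
_ESCAPE = re.compile(rb"\\(\\|n|x[0-9A-Fa-f]{2})")


def _unescape_once(data: bytes) -> bytes:
    """Interpret each escape sequence in data exactly once."""
    def repl(match):
        token = match.group(1)
        if token == b"\\":
            return b"\\"                     # \\ becomes a single backslash
        if token == b"n":
            return b"\n"                     # \n becomes a line break
        return bytes([int(token[1:], 16)])   # \xHH becomes one byte

    return _ESCAPE.sub(repl, data)


def interpret_escapes(raw: str) -> str:
    """Unescape repeatedly, then decode as UTF-8 or, failing that, cp1252."""
    data = raw.encode("utf-8")
    while True:
        unescaped = _unescape_once(data)
        if unescaped == data:   # nothing changed, so we are done
            break
        data = unescaped
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # A few byte values are undefined in cp1252, hence errors="replace".
        return data.decode("cp1252", errors="replace")
```

Every successful pass shortens the data, so the loop always terminates. One caveat of this approach: a field that legitimately contains a backslash followed by n or x would be altered too, and a field mixing UTF-8 and ISO-8859-1 bytes falls back to cp1252 as a whole.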

How to replace decoded Non-breakable space (nbsp)

Problem Explanation

The reason why it's not working is that you are specifying the non-breaking space incorrectly.

The proper code for the non-breaking space in the UTF-8 encoding is 0xC2A0. It consists of two bytes, 0xC2 (194) and 0xA0 (160), so technically you're specifying only half of the character's code.

A Bit of Theory

Legacy character encodings used a fixed number of bits to encode every character in their set. For example, the original ASCII encoding used 7 bits per character, and the extended ASCII encodings used 8 bits.

UTF-8 is a so-called variable-width character encoding, which means that the number of bytes used to represent a character varies: a character code consists of one to four (8-bit) bytes (octets). The length depends on the code point. ASCII characters (U+0000 to U+007F) take one byte, and higher code points take two, three, or four bytes. This is loosely similar to Huffman coding in that the characters most common in English text get the shortest codes, but the assignment is fixed by code point rather than derived from character frequencies. That helps reduce the data size of typical text.
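
The variable width is easy to observe from Python 3. The sketch below encodes four sample characters (chosen here purely for illustration) and records how many bytes each one takes.

```python
# Sample characters, keyed by their Unicode names.
samples = {
    "LATIN CAPITAL LETTER A": "A",
    "NO-BREAK SPACE": "\u00a0",
    "EURO SIGN": "\u20ac",
    "GRINNING FACE": "\U0001F600",
}

# Encode each character as UTF-8 and keep the resulting bytes.
encoded = {name: ch.encode("utf-8") for name, ch in samples.items()}

for name, data in encoded.items():
    print(f"{name}: {len(data)} byte(s), hex {data.hex()}")
```

The non-breaking space comes out as the two bytes c2 a0, which is why searching for a single byte matches only part of the character.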

Solution

You can replace all occurrences of the UTF-8 non-breaking space in text using a simple (and fast) str_replace or using a more flexible regular expression, depending on your needs:

// faster solution
$regular_spaces = str_replace("\xc2\xa0", ' ', $original_string);

// more flexible solution
$regular_spaces = preg_replace('/\xc2\xa0/', ' ', $original_string);

Notes

Note that with str_replace you have to enclose the search string in double quotes ("). str_replace itself doesn't understand the textual representation of character codes, so those codes need to be converted into actual bytes first. PHP does that automatically: strings enclosed in double quotes are processed, and escape sequences (e.g. \n for a newline, \xc2 for a byte given in hexadecimal) are replaced by the actual characters (e.g. 0x0A for \n) before the string value is used.

In contrast, the regular expression engine behind preg_replace understands the textual representation of character codes itself, so you don't need PHP to convert them into actual characters and you can use apostrophes (single quotes, ') to enclose the pattern in this case.
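
For readers working in Python rather than PHP, a rough counterpart of the two calls above might look like this. The helper names are made up for this sketch; str.replace, bytes.replace, and re.sub are the standard library calls doing the work.

```python
import re

NBSP = "\u00a0"           # the character, once the text has been decoded
NBSP_UTF8 = b"\xc2\xa0"   # the same character as raw UTF-8 bytes


def replace_nbsp(text: str, replacement: str = " ") -> str:
    """Plain replacement on decoded text (the str_replace analogue)."""
    return text.replace(NBSP, replacement)


def replace_nbsp_bytes(data: bytes, replacement: bytes = b" ") -> bytes:
    """Plain replacement on undecoded UTF-8 bytes."""
    return data.replace(NBSP_UTF8, replacement)


def collapse_nbsp_runs(text: str) -> str:
    """Regex variant: turn each run of non-breaking spaces into one space."""
    return re.sub("\u00a0+", " ", text)
```

Which variant applies depends on whether the data has already been decoded: on str objects the non-breaking space is a single character, on bytes it is the two-byte sequence.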

How to remove all occurrences of c2a0 in a string with PHP?

$column = str_replace("\xc2\xa0", '', $column);

Why does python2 save text with encoding escape characters?

In [1]: text = 'perchè'

In [2]: text
Out[2]: 'perch\xc3\xa8'

In [3]: print text
perchè

When you evaluate an expression, IPython displays its repr. The repr of a string shows escapes instead of the actual characters. This is usually what you want, since it avoids problems with the stdout encoding and also lets you see exactly which characters are in the string (think of the multiple ways Unicode offers for obtaining the same character).

To see the real characters, write the string to stdout, for example with print (assuming stdout can handle the encoding of the string).
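
The session above is Python 2. A comparable check in Python 3, where text (str) and encoded data (bytes) are separate types, could look like this sketch:

```python
text = "perch\u00e8"          # the word from the session above, as str
data = text.encode("utf-8")   # the UTF-8 bytes that end up in a file

# repr of bytes shows non-ASCII bytes as \x escapes; nothing is damaged.
repr_of_bytes = repr(data)

# repr of str escapes non-printable characters such as U+00A0.
repr_of_nbsp = repr("a\u00a0b")

# Decoding the bytes gives back the original text.
round_trip = data.decode("utf-8")

print(repr_of_bytes)
print(repr_of_nbsp)
```

The escapes are only a display form: the data in memory and on disk is unchanged, and the round trip returns the original text.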

Most intelligent way of reading some data into Python

a) Yes it is normal. You are pasting UTF8 encoded HTML content into Calc. That content includes a UTF8 encoded NO-BREAK SPACE unicode character that is used for the empty columns of the table.

>>> s = '\xc2\xa0'    # UTF8 encoded string
>>> s.decode('utf8')
u'\xa0'
>>> import unicodedata
>>> print unicodedata.name(s.decode('utf8')) # decode to unicode and lookup name
NO-BREAK SPACE

It looks like you pasted the table into Calc using a "normal" paste. If you had instead pasted the data into Calc using "Paste Special" and selected "Unformatted text", you would have ended up with ASCII spaces instead of non-breaking spaces. Also, when saving the file, you can specify the encoding to use. Choose UTF8 or ASCII: without the non-breaking spaces the table contains only ASCII characters, so both end up the same.

b) If you decided to paste unformatted text into Calc then you can process the file like this:

import csv

with open('fomc.csv') as infile:
    data = []
    for row in csv.reader(infile):
        # Convert non-empty fields to float; use None for empty fields.
        data.append([float(field.strip()) if field.strip() else None for field in row])

data will contain:


[[0.125, 2.0, None, None, None], [0.25, None, None, None, None], ..., [4.25, None, None, None, 1.0]]

I've used None to represent the empty columns. You could use 0 or '' as you see fit. Also, I did not copy & paste the column headers into the CSV file, so I don't have to worry about them.

c) See b) - float conversions were performed on all non-empty strings while reading the file.
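
If the file still contains non-breaking spaces (that is, the table was pasted normally), a Python 3 variant of the same loop copes with them, because str.strip() with no arguments also removes U+00A0. The sketch below uses invented sample rows and a hypothetical helper name, and assumes the file was decoded as UTF-8.

```python
import csv
import io


def parse_rows(lines):
    """Convert non-empty fields to float; use None for empty fields."""
    rows = []
    for row in csv.reader(lines):
        # strip() removes ASCII spaces and non-breaking spaces alike.
        rows.append([float(f.strip()) if f.strip() else None for f in row])
    return rows


# Made-up sample standing in for open('fomc.csv', encoding='utf-8', newline='').
sample = io.StringIO("0.125,2.0,\u00a0,\u00a0\n0.25,\u00a0,,1\n")
parsed = parse_rows(sample)
```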

How can I remove non-breaking spaces from a text file in bash?

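Once we know it is a no-break space, I simply remove it with sed on a Mac. The character between the slashes is typed with Option+Space, so it is a literal no-break space and not a regular space:
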
cat test4.csv | sed 's/ //g'

