UnicodeEncodeError: 'charmap' codec can't encode character: character maps to <undefined> (print function)

Why am I getting the error UnicodeEncodeError: 'charmap' codec can't encode character '\u25b2' in position 84811: character maps to <undefined>?

The full error message contains hints. Here are the parts that seem most important:

Traceback ...
File "...\cp1252.py", ...
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' ...

The error is caused by the print call. Somewhere in your text there is a ZERO WIDTH SPACE character (Unicode U+200B). When you print to a Windows console, the string is encoded into the console's code page (cp1252 here), and ZERO WIDTH SPACE has no representation in that code page. The default Windows console is not really Unicode friendly.

There is little you can do about this in the Windows console itself. I would advise you to try one of these workarounds:

  • Do not print to the console; write to a UTF-8 file instead. You can then read it with a UTF-8 aware text editor such as Notepad++.

  • Manually encode the text before printing it, with errors='ignore' or errors='replace'. Offending characters are then dropped (ignore) or replaced with '?' (replace), and no error is raised. Note that printing the encoded result shows a bytes literal (b'...'):

      print(soup.prettify().encode('cp1252', errors='ignore'))
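A minimal sketch of the second workaround, assuming a made-up sample string (text is hypothetical and stands in for the scraped page). It shows which character cp1252 rejects and what the two error handlers produce:

```python
# Hypothetical sample text containing two characters cp1252 cannot encode:
# ZERO WIDTH SPACE (U+200B) and BLACK UP-POINTING TRIANGLE (U+25B2).
text = "price\u200bup \u25b2"

try:
    text.encode("cp1252")
    failed_char = None
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character that could not be encoded.
    failed_char = exc.object[exc.start]

# errors="ignore" drops unencodable characters, errors="replace" emits "?".
# Decoding back gives a str that is safe to print on a cp1252 console.
ignored = text.encode("cp1252", errors="ignore").decode("cp1252")
replaced = text.encode("cp1252", errors="replace").decode("cp1252")

print(ignored)
print(replaced)
```

Decoding the bytes back to str avoids printing a b'...' literal; whether dropping or replacing characters is acceptable depends on what the output is used for.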

UnicodeEncodeError: 'charmap' codec can't encode character '\u011f' in position 1087: character maps to undefined

It seems like you have multiple misunderstandings here.

soup.prettify().encode('cp1252', errors='ignore')

This does nothing useful: you create a string representing the HTML source (with .prettify), encode it as bytes (.encode), and then do nothing with the resulting object. The soup is unmodified.

Fortunately, you don't need or want to do anything about the encoding at this point in the process anyway. But it would be better to remove this line entirely, to avoid misleading yourself.

for e in soup.select("p"):
    corpus.append(e.text)

return corpus

This produces and returns a list of strings, which you later convert to a string forcibly using str. The result is the representation of the list: it is enclosed in [], with commas separating the items, and with quotes and escape sequences for each string. This is probably not what you wanted.

I assume you wanted to join the strings together, for example like '\n'.join(corpus). However, multiple-line data like this is not appropriate to store in a CSV. (An escaped list representation is also rather awkward to store in a CSV. You should probably think more about how you want to format the data.)
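To make the difference concrete, here is a small sketch with a hypothetical two-item corpus (the sample strings are invented for illustration):

```python
# Hypothetical list, standing in for what extract_corpus returns.
corpus = ["First paragraph.", "Second paragraph."]

# str() gives the list's representation: brackets, quotes and commas included.
as_repr = str(corpus)

# str.join() concatenates the items themselves with the chosen separator.
as_text = "\n".join(corpus)

print(as_repr)
print(as_text)
```

Which separator to use (newline, space, or something else) depends on how the data should be stored afterwards.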

review = str(scraper.extract_corpus(scraper.take_source(str(row.__getitem__('url')))))

First off, you should not call double-underscore methods like __getitem__ directly. I know they are written that way in the documentation; that is just an artifact of how Python works in general. You are meant to use __getitem__ thus: row['url'].

You should expect the result to be a string already, so the inner str call is useless. Then you use take_source, which has this error:

if 'http://' or 'https://' in url:

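This does not do what you want. Python parses the condition as 'http://' or ('https://' in url), and a non-empty string such as 'http://' is always truthy, so the function will always think the URL is "valid".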
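One possible way to write the intended check is sketched below. The function name has_http_scheme is hypothetical, and the sketch assumes the goal is simply to test the URL prefix:

```python
def has_http_scheme(url):
    """Return True if url starts with http:// or https:// (hypothetical helper)."""
    # str.startswith accepts a tuple of prefixes and tests each of them.
    return url.startswith(("http://", "https://"))


# The original condition evaluates to the string 'http://' no matter what url is,
# because `or` returns its first truthy operand.
always_truthy = "http://" or "https://" in "ftp://example.com"

print(has_http_scheme("https://example.com"))
print(has_http_scheme("ftp://example.com"))
print(repr(always_truthy))
```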

Anyway, once you manage to extract_corpus and forcibly produce a string from it, the actual problem you are asking about occurs:

with open('reviews.csv','a') as csv_f:

You cannot simply write any arbitrary string to a file in the cp1252 encoding (you know this is the one being used, because of the mention of cp1252.py in your stack trace; it is the default for your platform). This is the place where you are supposed to specify a file encoding. For example, you could specify that the file should be written using encoding='utf-8', which can handle any string. (You will also need to specify this explicitly when you open the file again for any other purpose.)
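A minimal sketch of opening the CSV file with an explicit encoding, assuming the csv module is used for writing. The temporary path and the sample row are made up for illustration; newline='' is what the csv module documentation recommends for files passed to csv.writer and csv.reader:

```python
import csv
import os
import tempfile

# Hypothetical output location and row, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "reviews.csv")
row = ["https://example.com/1", "g\u00fczel bir de\u011ferlendirme\u200b"]

# Specify the encoding when opening for writing...
with open(path, "a", encoding="utf-8", newline="") as csv_f:
    csv.writer(csv_f).writerow(row)

# ...and specify the same encoding when opening the file again.
with open(path, "r", encoding="utf-8", newline="") as csv_f:
    rows = list(csv.reader(csv_f))

print(rows == [row])
```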

If you insist on doing the encoding manually, then you would need to .encode the thing you are writing to the file. However, because .encode produces the raw encoded bytes, you would then need to open the file in a binary mode (like 'ab'), and that would also mean you have to handle newline translation yourself. It is not a pleasant task. Please just use the library the way it was designed to be used.


When it comes to handling text encodings etc. properly, you cannot write correct code of decent quality simply by trying to fix each error as it comes up, doing a web search for each error or silencing a type error with a forced conversion. You must actually understand what is going on. I cannot stress this enough. Please start here, and then also read here. Read both top to bottom, aiming to understand what is being said rather than trying to solve any specific problem.

python 3.2 UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 9629: character maps to undefined

When you open the file you want to write to, open it with a specific encoding that can handle all the characters.

with open('filename', 'w', encoding='utf-8') as f:
    print(r['body'], file=f)

fixing UnicodeEncodeError: 'charmap' codec can't encode character in python3

The Windows command line normally doesn't have a font that supports Asian characters unless your system locale is an Asian locale. Your system locale can be changed in Control Panel, Region and Language, Administrative tab (Windows 7).

Otherwise, you can try win-unicode-console, but you will still need to find a fixed-width console font that supports Asian characters.

Installing console fonts

UnicodeEncodeError: 'charmap' codec can't encode character '\x97' in position 206: character maps to undefined

Update:

The problem is not in the code above (writing the file) but in the parse() and parseList() methods, where the files are read.

Replace the following

# in parseList(...)
text = open(url, 'r')

# and in parse(..)
text = open(path + file, 'r')

with

# in parseList(...)
text = open(url, 'r', encoding='windows_1252')

# and in parse(..)
text = open(path + file, 'r', encoding='windows_1252')
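As a sketch of why the encoding matters when reading: in Windows-1252 the byte 0x97 is an em dash (U+2014). The file name and contents below are hypothetical, and the link to the '\x97' in the error message is an assumption about how the original data was decoded:

```python
import os
import tempfile

# Hypothetical legacy file containing the Windows-1252 byte 0x97.
path = os.path.join(tempfile.mkdtemp(), "legacy.txt")
with open(path, "wb") as f:
    f.write(b"2010 \x97 2012")

# Reading with the encoding the file was actually written in yields U+2014.
with open(path, "r", encoding="windows_1252") as f:
    text = f.read()

# Decoding the same byte as Latin-1 yields the control character U+0097,
# which the cp1252 codec cannot encode again.
misread = b"\x97".decode("latin-1")

print(text == "2010 \u2014 2012")
print(repr(misread))
```

Using a with block also closes the file, which the bare open(...) calls above do not do on their own.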

And don't forget to restore the code in your question above to its original state.

Python Youtube API: UnicodeEncodeError: 'charmap' codec can't encode character '\u279c' in position 7741: character maps to undefined

I would like to add an answer to this question for anyone who has a similar problem. The simplest solution is (as stvar answered):

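Try print(json.dumps(response, ensure_ascii=True)) instead. (Of course, you need import json too.)

The Windows terminal is unable to display certain characters. With ensure_ascii=True (the default for json.dumps), non-ASCII characters are written as \uXXXX escapes, so the output contains only ASCII, and this fixed the issue.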
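A short sketch of the effect, using a made-up response dictionary (the real YouTube API response is not shown in the question):

```python
import json

# Hypothetical stand-in for the API response; U+279C is the arrow from the error.
response = {"title": "Watch \u279c now"}

# With ensure_ascii=True every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(response, ensure_ascii=True)

# The result is pure ASCII, so encoding it for a cp1252 console cannot fail.
console_bytes = escaped.encode("cp1252")

print(escaped)
```

The escapes are lossless: json.loads(escaped) gives back the original dictionary, although the printed text shows \u279c rather than the arrow itself.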


