Python 3.2 UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 9629: character maps to <undefined>


When you open the file you want to write to, open it with a specific encoding that can handle all the characters.

with open('filename', 'w', encoding='utf-8') as f:
    print(r['body'], file=f)
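
For context, here is a minimal sketch of why the error occurs and why the explicit encoding fixes it (the file name and sample character are illustrative; cp1252 stands in for whatever charmap codec your platform defaults to):

# U+2190 (leftwards arrow) has no mapping in single-byte charmap
# codecs such as cp1252, so encoding it fails:
text = "go back \u2190"
try:
    text.encode('cp1252')
except UnicodeEncodeError as e:
    print(e)   # 'charmap' codec can't encode character '\u2190' ...

# UTF-8 can represent every Unicode character, so opening the file
# with an explicit encoding avoids the error entirely:
with open('filename', 'w', encoding='utf-8') as f:
    f.write(text)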

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 582: character maps to undefined

I fixed the problem by adding encoding="utf-8" when opening the CSV file.
The code:

import requests
from bs4 import BeautifulSoup
import csv
from itertools import zip_longest

job_title = []
company_name = []
location_name = []
job_skill = []
links = []
salary = []
requirements = []
date = []
page_num = 0
num = 1
while page_num != 5000:

    # Fetch one results page (Indeed paginates in steps of 10).
    result = requests.get(f"https://www.indeed.com/jobs?q=web%20development&start={page_num}")
    source = result.content
    soup = BeautifulSoup(source, "lxml")

    job_titles = soup.find_all("a", {"class": "jcs-JobTitle"})
    company_names = soup.find_all("span", {"class": "companyName"})
    location_names = soup.find_all("div", {"class": "companyLocation"})
    job_skills = soup.find_all("div", {"class": "job-snippet"})
    dates = soup.find_all("span", {"class": "date"})

    for i in range(len(job_titles)):
        job_title.append(job_titles[i].text.strip())
        links.append("https://www.indeed.com" + job_titles[i].attrs["href"])
        company_name.append(company_names[i].text.strip())
        location_name.append(location_names[i].text.strip())
        job_skill.append(job_skills[i].text.strip())
        date.append(dates[i].text.strip())

    page_num += 10
    print(f"{num}. Page switched...")
    num += 1

for link in links:
    result = requests.get(link)
    source = result.content
    soup = BeautifulSoup(source, "lxml")
    salaries = soup.find("span", {"class": "icl-u-xs-mr--xs attribute_snippet"})
    salary.append(salaries.text.strip() if salaries else "None")
    description = soup.find("div", {"id": "jobDescriptionText", "class": "jobsearch-jobDescriptionText"})
    requirement = description.ul if description else None
    requirements_text = ""
    if requirement:
        for li in requirement.find_all("li"):
            requirements_text += li.text.strip() + "| "
        requirements_text = requirements_text[:-2]  # drop the trailing "| "
    else:
        requirements_text = "None"
    requirements.append(requirements_text)

my_file = [job_title, company_name, location_name, job_skill, salary, links, date, requirements]
exported = zip_longest(*my_file)
with open("/Users/Rich/Desktop/testing/indeed.csv", "w", encoding="utf-8", newline="") as myfile:
    writer = csv.writer(myfile)
    writer.writerow(["Job titles", "Company names", "Location names", "Job skills", "Salaries", "Links", "Dates", "Requirements"])
    writer.writerows(exported)

but I don't know what encoding="utf-8" is for. Any idea?

Python : UnicodeEncodeError: 'charmap' codec can't encode character '\u2190' in position 2936

You have to specify a proper encoding for your file object at opening time. This can be done by passing it to the encoding keyword argument of the open() function.

For example:

with open(file_name, 'w', encoding='utf8') as text_file:
    pass
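
As a more concrete sketch (the file name is illustrative), the character from the error then round-trips cleanly, provided the same encoding is used for reading:

with open('arrows.txt', 'w', encoding='utf8') as text_file:
    text_file.write('go back \u2190')   # U+2190 is no problem for UTF-8

with open('arrows.txt', encoding='utf8') as text_file:
    data = text_file.read()             # 'go back ←'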

Python Youtube API: UnicodeEncodeError: 'charmap' codec can't encode character '\u279c' in position 7741: character maps to undefined

I would like to add an answer to this question for anyone who has a similar problem. The simplest solution is (as stvar answered):

Try print(json.dumps(response, ensure_ascii=True)) instead. (Of course, add import json too.)

The Windows terminal is unable to display certain characters; ensuring ASCII output fixes this issue.
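
A short sketch of the effect (the response dict here is a made-up stand-in for the real API response; note that ensure_ascii=True is in fact json.dumps's default):

import json

response = {"status": "done \u279c next step"}   # hypothetical payload

# Every non-ASCII character is emitted as a \uXXXX escape, so the
# output is safe for terminals that cannot display it directly:
print(json.dumps(response, ensure_ascii=True))
# {"status": "done \u279c next step"}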

UnicodeEncodeError: 'charmap' codec can't encode character '\u011f' in position 1087: character maps to undefined

It seems like you have multiple misunderstandings here.

soup.prettify().encode('cp1252', errors='ignore')

This does nothing useful: you create a string representing the HTML source (with .prettify), encode it as bytes (.encode), and then do nothing with the resulting object. The soup is unmodified.

Fortunately, you don't need or want to do anything about the encoding at this point in the process anyway. But it would be better to remove this line entirely, to avoid misleading yourself.
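
To see that nothing is modified (a tiny made-up document):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9</p>", "html.parser")
result = soup.prettify().encode('cp1252', errors='ignore')   # bytes, then discarded
assert "caf\u00e9" in str(soup)   # the soup itself is untouched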

for e in soup.select("p"):
    corpus.append(e.text)

return corpus

You produce and return a list of strings, which you later try to convert to a string forcibly using str. The result will be the representation of the list: i.e., it will be enclosed in [] and have commas separating the items, with quotes and escape sequences for each string. This is probably not what you wanted.

I assume you wanted to join the strings together, for example like '\n'.join(corpus). However, multiple-line data like this is not appropriate to store in a CSV. (An escaped list representation is also rather awkward to store in a CSV. You should probably think more about how you want to format the data.)
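
To illustrate the difference (the corpus contents are made up):

corpus = ['First paragraph.', 'Second, with a comma.']

print(str(corpus))
# ['First paragraph.', 'Second, with a comma.']

print('\n'.join(corpus))
# First paragraph.
# Second, with a comma.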

review = str(scraper.extract_corpus(scraper.take_source(str(row.__getitem__('url')))))

First off, you should not call double-underscore methods like __getitem__ directly. I know they are written that way in the documentation; that is just an artifact of how Python works in general. You are meant to use __getitem__ thus: row['url'].

You should expect the result to be a string already, so the inner str call is useless. Then you use take_source, which has this error:

if 'http://' or 'https://' in url:

This does not do what you want: the non-empty string literal 'http://' is always truthy, so the or expression is always true, and the function will always think the URL is "valid".
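
A quick demonstration of the bug, plus one way to write the intended check:

url = "ftp://not-a-web-url.example"

# Python reads the test as ('http://') or ('https://' in url); the
# non-empty string literal is always truthy, so the test always passes:
print(bool('http://' or 'https://' in url))        # True, regardless of url

# One correct spelling of the intended validation:
print(url.startswith(('http://', 'https://')))     # False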

Anyway, once you manage to extract_corpus and forcibly produce a string from it, the actual problem you are asking about occurs:

with open('reviews.csv','a') as csv_f:

You cannot simply write any arbitrary string to a file in the cp1252 encoding (you know this is the one being used, because of the mention of cp1252.py in your stack trace; it is the default for your platform). This is the place where you are supposed to specify a file encoding. For example, you could specify that the file should be written using encoding='utf-8', which can handle any string. (You will also need to specify this explicitly when you open the file again for any other purpose.)

If you insist on doing the encoding manually, then you would need to .encode the thing you are writing to the file. However, because .encode produces the raw encoded bytes, you would then need to open the file in a binary mode (like 'ab'), and that would also mean you have to handle universal newline encoding yourself. It is not a pleasant task. Please just use the library according to how it was designed to be used.
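
To make the contrast concrete (the sample text is made up; the file name is from the question):

text = "review with a special char: \u2192"

# The sane way: text mode with an explicit encoding; the file object
# handles encoding and newline translation for you.
with open('reviews.csv', 'a', encoding='utf-8') as csv_f:
    csv_f.write(text + '\n')

# The manual way warned against above: binary mode, hand-encoded
# bytes, and newline handling left entirely to you.
with open('reviews.csv', 'ab') as csv_f:
    csv_f.write((text + '\r\n').encode('utf-8'))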


When it comes to handling text encodings properly, you cannot write correct code of decent quality simply by fixing each error as it comes up, doing a web search for each message, or silencing a type error with a forced conversion. You must actually understand what is going on. I cannot stress this enough. Please start here, and then also read here. Read both top to bottom, aiming to understand what is being said rather than trying to solve any specific problem.


