python 3.2 UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 9629: character maps to undefined
When you open the file you want to write to, open it with a specific encoding that can handle all the characters.
with open('filename', 'w', encoding='utf-8') as f:
    print(r['body'], file=f)
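A minimal reproduction of what is going on, at the string level: a character outside the file's legacy 8-bit codec (here `'\u279c'` is used as a stand-in example, since cp1252 cannot represent it) fails to encode, while UTF-8 can encode any character.

```python
# '\u279c' (a heavy arrow) is not representable in legacy 8-bit
# codecs such as cp1252, which is what raises the 'charmap' error.
text = "arrow \u279c here"

try:
    # This is effectively what writing to a cp1252-opened file does.
    text.encode("cp1252")
except UnicodeEncodeError as e:
    print(e)  # 'charmap' codec can't encode character ...

# UTF-8 handles every Unicode character, so this always succeeds.
data = text.encode("utf-8")
print(data)
```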
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 582: character maps to undefined
I fixed the problem by adding encoding="utf-8" to the open() call for the csv file...
the code:
import requests
from bs4 import BeautifulSoup
import csv
from itertools import zip_longest

job_title = []
company_name = []
location_name = []
job_skill = []
links = []
salary = []
requirements = []
date = []
page_num = 0
num = 1

# Walk the listing pages 10 results at a time.
while page_num != 5000:
    result = requests.get(f"https://www.indeed.com/jobs?q=web%20development&start={page_num}")
    source = result.content
    soup = BeautifulSoup(source, "lxml")
    job_titles = soup.find_all("a", {"class": "jcs-JobTitle"})
    company_names = soup.find_all("span", {"class": "companyName"})
    location_names = soup.find_all("div", {"class": "companyLocation"})
    job_skills = soup.find_all("div", {"class": "job-snippet"})
    dates = soup.find_all("span", {"class": "date"})
    for i in range(len(job_titles)):
        job_title.append(job_titles[i].text.strip())
        links.append("https://www.indeed.com" + job_titles[i].attrs["href"])
        company_name.append(company_names[i].text.strip())
        location_name.append(location_names[i].text.strip())
        job_skill.append(job_skills[i].text.strip())
        date.append(dates[i].text.strip())
    page_num += 10
    print(f"{num}. Page switched...")
    num += 1

# Visit each job page for salary and requirements.
for link in links:
    result = requests.get(link)
    source = result.content
    soup = BeautifulSoup(source, "lxml")
    salaries = soup.find("span", {"class": "icl-u-xs-mr--xs attribute_snippet"})
    salary.append(salaries.text.strip() if salaries else "None")
    description = soup.find("div", {"id": "jobDescriptionText", "class": "jobsearch-jobDescriptionText"})
    requirement = description.ul if description else None
    if requirement:
        requirements_text = "| ".join(li.text.strip() for li in requirement.find_all("li"))
    else:
        requirements_text = "None"
    requirements.append(requirements_text)

my_file = [job_title, company_name, location_name, job_skill, salary, links, date, requirements]
exported = zip_longest(*my_file)

with open("/Users/Rich/Desktop/testing/indeed.csv", "w", encoding="utf-8", newline="") as myfile:
    writer = csv.writer(myfile)
    writer.writerow(["Job titles", "Company names", "Location names", "Job skills", "Salaries", "Links", "Dates", "Requirements"])
    writer.writerows(exported)
but I don't know what encoding="utf-8" is for. Any idea?
Python : UnicodeEncodeError: 'charmap' codec can't encode character '\u2190' in position 2936
You have to specify a proper encoding for your file object at opening time. This can be done by passing the desired encoding to the encoding keyword argument of the open() function.
For example:
with open(file_name, 'w', encoding='utf8') as text_file:
    pass
Python Youtube API: UnicodeEncodeError: 'charmap' codec can't encode character '\u279c' in position 7741: character maps to undefined
I would like to add an answer to this question for anyone who has a similar problem. The simplest solution is (as stvar answered):
Try
print(json.dumps(response, ensure_ascii=True))
instead. (Of course, have import json too.)
The Windows terminal is unable to display certain characters; ensuring ASCII output fixed this issue.
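A short illustration of the effect, with a small dict standing in for the API response: ensure_ascii=True escapes every non-ASCII character as a \uXXXX sequence, so the output is safe for terminals and files that only handle ASCII.

```python
import json

# A stand-in for the API response; the real one is much larger.
response = {"title": "caf\u00e9 \u279c"}

# ensure_ascii=True (the default) escapes non-ASCII as \uXXXX.
safe = json.dumps(response, ensure_ascii=True)
print(safe)  # {"title": "caf\u00e9 \u279c"} - pure ASCII

# ensure_ascii=False keeps the literal characters and may crash
# a cp1252 terminal when printed.
raw = json.dumps(response, ensure_ascii=False)
```

The escaping is lossless: json.loads() on the escaped string recovers the original characters.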
UnicodeEncodeError: 'charmap' codec can't encode character '\u011f' in position 1087: character maps to undefined
It seems like you have multiple misunderstandings here.
soup.prettify().encode('cp1252', errors='ignore')
This does nothing useful: you create a string representing the HTML source (with .prettify()), encode it as bytes (.encode()), and then do nothing with the resulting object. The soup is unmodified.
Fortunately, you don't need or want to do anything about the encoding at this point in the process anyway. But it would be better to remove this line entirely, to avoid misleading yourself.
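To see why the line is a no-op, here is a sketch at the string level (a literal stands in for soup.prettify()): strings are immutable, so .encode() returns a new bytes object and leaves the original untouched unless you keep the return value.

```python
# A string standing in for soup.prettify(); '\u011f' is the 'g-breve'
# character from the error message, which cp1252 cannot represent.
html = "<p>\u011f</p>"

# .encode() returns NEW bytes; with errors="ignore" the unencodable
# character is silently dropped from the result.
encoded = html.encode("cp1252", errors="ignore")
print(encoded)   # b'<p></p>'
print(html)      # unchanged - the original string still has the character
```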
for e in soup.select("p"):
    corpus.append(e.text)
return corpus
You will produce and return a list of strings, which later you try to convert to a string forcibly using str. The result will show the representation of the list: i.e., it will be enclosed in [] and have commas separating the items, plus quotes and escape sequences for each string. This is probably not what you wanted.
I assume you wanted to join the strings together, for example with '\n'.join(corpus). However, multi-line data like this is not appropriate to store in a CSV. (An escaped list representation is also rather awkward to store in a CSV. You should probably think more about how you want to format the data.)
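The difference between the two conversions, with a small example list:

```python
corpus = ["First paragraph.", "Second paragraph."]

# str() on a list gives its repr: brackets, quotes, commas.
as_repr = str(corpus)
print(as_repr)   # ['First paragraph.', 'Second paragraph.']

# '\n'.join() produces the concatenated text you usually want.
joined = "\n".join(corpus)
print(joined)
```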
review = str(scraper.extract_corpus(scraper.take_source(str(row.__getitem__('url')))))
First off, you should not call double-underscore methods like __getitem__ directly. I know they are written that way in the documentation; that is just an artifact of how Python works in general. You are meant to use __getitem__ like this: row['url'].
You should expect the result to be a string already, so the inner str call is useless. Then you use take_source, which has this error:
if 'http://' or 'https://' in url:
This does not do what you want; the function will always think the URL is "valid".
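To spell out the bug: the expression parses as ('http://') or ('https://' in url), and a non-empty string literal is always truthy, so the whole test always passes. str.startswith accepts a tuple of prefixes for the intended check; the helper name below is just for illustration.

```python
# Hypothetical helper showing the corrected check.
def looks_like_http_url(url):
    return url.startswith(("http://", "https://"))

print(looks_like_http_url("https://example.com"))    # True
print(looks_like_http_url("ftp://example.com"))      # False

# The original buggy expression is truthy for ANY input:
print(bool("http://" or "https://" in "not a url"))  # True - the bug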
Anyway, once you manage to extract_corpus and forcibly produce a string from it, the actual problem you are asking about occurs:
with open('reviews.csv','a') as csv_f:
You cannot simply write any arbitrary string to a file in the cp1252 encoding (you know this is the one being used because of the mention of cp1252.py in your stack trace; it is the default for your platform). This is the place where you are supposed to specify a file encoding. For example, you could specify that the file should be written using encoding='utf-8', which can handle any string. (You will also need to specify this explicitly when you open the file again for any other purpose.)
If you insist on doing the encoding manually, then you would need to .encode() the thing you are .write()ing to the file. However, because .encode() produces the raw encoded bytes, you would then need to open the file in a binary mode (like 'ab'), and that would also mean you have to handle universal newline encoding yourself. It is not a pleasant task. Please just use the library according to how it was designed to be used.
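A sketch contrasting the two approaches, writing to a throwaway file (the path here is a temporary stand-in for reviews.csv). Letting open() handle the encoding is the easy path; encoding manually forces you into binary mode, where text-mode conveniences like newline translation no longer apply.

```python
import os
import tempfile

text = "r\u00e9sum\u00e9 \u279c\n"
path = os.path.join(tempfile.mkdtemp(), "reviews.csv")

# Recommended: text mode with an explicit encoding.
with open(path, "a", encoding="utf-8") as f:
    f.write(text)

# Manual alternative: encode yourself and append raw bytes ('ab').
# Note: binary mode does no newline translation for you.
with open(path, "ab") as f:
    f.write(text.encode("utf-8"))

# Reading it back requires naming the same encoding again.
with open(path, encoding="utf-8") as f:
    print(f.read())
```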
When it comes to handling text encodings etc. properly, you cannot write correct code of decent quality simply by trying to fix each error as it comes up, doing a web search for each error or silencing a type error with a forced conversion. You must actually understand what is going on. I cannot stress this enough. Please start here, and then also read here. Read both top to bottom, aiming to understand what is being said rather than trying to solve any specific problem.