Python Pandas Error tokenizing data
You could also try:
data = pd.read_csv('file1.csv', on_bad_lines='skip')
Do note that this will cause the offending lines to be skipped.
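Edit: for pandas < 1.3.0, try
data = pd.read_csv("file1.csv", error_bad_lines=False)
as per the pandas API reference. Note that error_bad_lines was deprecated in pandas 1.3.0 and removed in 2.0, so use on_bad_lines on newer versions.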
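To see what skipping actually does, here is a minimal sketch on an in-memory file. The sample rows are made up for illustration (they are not from the question), and it assumes pandas 1.3.0 or later so that on_bad_lines is available.

```python
import io

import pandas as pd

# Hypothetical sample: the third line has one field too many.
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# The over-long line is dropped; the other rows are parsed normally.
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(df)
```

Only the two well-formed rows survive, so check how many lines were dropped before trusting the result.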
Pandas: How to workaround error tokenizing data?
Thank you @ALollz for the "very fresh" link (lucky coincidence) and @Rich Andrews for pointing out that my example actually is not "strictly correct" CSV data.
So, the way it works for me for the time being is adapted from @ALollz' compact solution (https://stackoverflow.com/a/55129746/7295599)
### reading an "incorrect" CSV to dataframe having a variable number of columns/tokens
import pandas as pd
df = pd.read_csv('Test.csv', header=None, sep='\n')
df = df[0].str.split(',', expand=True)
# ... do some modifications with df
### end of code
df contains empty string '' for the missing entries at the beginning and the middle, and None for the missing tokens at the end.
   0  1  2  3     4     5     6
0  1  2  3  4     5  None  None
1  1  2  3  4     5     6  None
2        3  4     5  None  None
3  1  2  3  4     5     6     7
4     2     4  None  None  None
If you write this again to a file via:
df.to_csv("Test.tab",sep="\t",header=False,index=False)
1 2 3 4 5
1 2 3 4 5 6
3 4 5
1 2 3 4 5 6 7
2 4
None will be converted to empty string '' and everything is fine.
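One caveat: some newer pandas releases refuse sep='\n' with a ValueError, because the line terminator is not accepted as a separator. A minimal sketch of the same idea without read_csv follows. The sample rows are an assumption chosen to resemble the output above, and io.StringIO stands in for the real file.

```python
import io

import pandas as pd

# Hypothetical rows with a variable number of tokens per line.
raw = "1,2,3,4,5\n1,2,3,4,5,6\n,,3,4,5\n1,2,3,4,5,6,7\n,2,,4\n"

# Read the lines ourselves, then let pandas split each one on the comma.
lines = io.StringIO(raw).read().splitlines()
df = pd.Series(lines).str.split(",", expand=True)

# Writing back out: missing trailing tokens become empty fields.
out = df.to_csv(sep="\t", header=False, index=False)
print(out)
```

With a real file, open('Test.csv').read().splitlines() would replace the StringIO call.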
The next level would be to account for data strings in quotes which contain the separator, but that's another topic.
1,2,3,4,5
,,3,"Hello, World!",5,6
1,2,3,4,5,6,7
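For that quoted case, a plain str.split(',') would cut "Hello, World!" in two. One possible sketch, not part of the original solution, is to let the standard csv module do the tokenizing and build the dataframe from the resulting rows:

```python
import csv
import io

import pandas as pd

# The three sample lines from above; io.StringIO stands in for the file.
raw = '1,2,3,4,5\n,,3,"Hello, World!",5,6\n1,2,3,4,5,6,7\n'

# csv.reader honours the quotes, so the comma inside them is kept.
rows = list(csv.reader(io.StringIO(raw)))

# Rows of unequal length are padded with missing values at the end.
df = pd.DataFrame(rows)
print(df)
```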
How can I solve pandas error tokenizing data?
read_csv by default takes your first line as the header, so append and 6.0 become your two column names. It then expects two columns in every subsequent row. In line 3 it finds 4 values and raises the tokenizing error.
You need another approach to handle this data where each line is a key-value pair with multiple values present.
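The failure is easy to reproduce on a small in-memory sample. The rows below are invented to match the description (a two-field first line, four fields on line 3); they are not the asker's data.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample: 2 fields on line 1, 4 fields on line 3.
raw = "append\t6.0\nremove\t1.0\nappend\t2.0\t3.0\t4.0\n"

message = None
try:
    pd.read_csv(io.StringIO(raw), sep="\t")
except pd.errors.ParserError as exc:
    # The message names the offending line and the field counts.
    message = str(exc)

print(message)
```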
Per your comment, if you just want to read it all in anyway, here's how you can do that:
import pandas as pd
import numpy as np
df = pd.read_csv("something/something.tsv", sep='\t', header=None, names=np.arange(20))
names=np.arange(20) is the key. The 20 can be any number larger than the maximum number of values you will have in a row. Then you can do whatever you need to do to get the data the way you want it.
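As a rough sketch of that approach on made-up data (the rows and the width of 8 are assumptions for illustration), the unused columns come back entirely empty and can be dropped afterwards:

```python
import io

import pandas as pd

# Hypothetical tab-separated sample with a ragged number of values per line.
raw = "append\t6.0\nremove\t1.0\nappend\t2.0\t3.0\t4.0\n"

# Supply more column names than any row has fields; short rows are padded with NaN.
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=list(range(8)))

# Drop the columns that no row ever filled.
df = df.dropna(axis=1, how="all")
print(df)
```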
Python Pandas Error while Tokenizing Data
In the pandas documentation, pd.read_csv says:
error_bad_lines bool, optional, default None
Lines with too many fields (e.g. a csv line with too many commas) will
by default cause an exception to be raised, and no DataFrame will be
returned. If False, then these “bad lines” will be dropped from the
DataFrame that is returned.
So you may pass error_bad_lines=False to pd.read_csv (on pandas 1.3.0 and later, the equivalent is on_bad_lines='skip'):
df = pd.read_csv('ZCS006A_16_23AUG_ALL_20220804020843.csv', delimiter = ',', error_bad_lines=False)
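If silently dropping rows is a concern, newer pandas can hand each bad line to a function instead. The sketch below assumes pandas 1.4 or later, where on_bad_lines accepts a callable with engine="python"; the sample data and the remember_and_skip helper are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample: the third line has one field too many.
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

skipped = []

def remember_and_skip(bad_line):
    # bad_line arrives as a list of strings; returning None drops the line.
    skipped.append(bad_line)
    return None

df = pd.read_csv(io.StringIO(raw), on_bad_lines=remember_and_skip, engine="python")
print(df)
print("skipped:", skipped)
```

This way the dropped lines can be inspected or logged rather than lost.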
How to fix ParserError: Error tokenizing data for CSV pandas
After resetting the row index, it works!
df = df.reset_index()
output:
  index  vers 2.1.0
0  info  days     6
1  info     x     a
2  info     y     b
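A likely explanation, sketched on data reconstructed from the output above (the original file is not shown, so the raw text here is an assumption): when the data rows have exactly one more field than the header, read_csv uses the first column as the index, and reset_index turns it back into an ordinary column.

```python
import io

import pandas as pd

# Hypothetical file: the header has 2 fields, every data row has 3.
raw = "vers,2.1.0\ninfo,days,6\ninfo,x,a\ninfo,y,b\n"

# The extra leading field of each row becomes the (unnamed) index.
df = pd.read_csv(io.StringIO(raw))
index_before = list(df.index)

# Move the index back into a regular column named "index".
df = df.reset_index()
print(df)
```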
Pandas.read_csv() Decoding Error tokenizing data because of a comma in data
You could try sep with a regex, but that forces the python engine instead of the c engine, which can cost more memory and time. Here is the solution if you would like to go with this:
1,2,3,4,5,6,7,8
'true',47,'y','descriptive_evidence','n','true',66,[81,65]
pd.read_csv("./file_name.csv",sep=r",(?![^[]*\])",engine="python")
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | ------ | --- | --- | ---------------------- | --- | ------ | --- | ------- |
| 0 | 'true' | 47 | 'y' | 'descriptive_evidence' | 'n' | 'true' | 66 | [81,65] |
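The same call can be tried without a file on disk. This sketch feeds the two sample lines above through io.StringIO; the regex splits on a comma only when it is not followed by a closing bracket with no opening bracket in between, so the comma inside [81,65] is left alone.

```python
import io

import pandas as pd

# The two sample lines from above, held in memory instead of a file.
raw = (
    "1,2,3,4,5,6,7,8\n"
    "'true',47,'y','descriptive_evidence','n','true',66,[81,65]\n"
)

# A regex separator requires the python engine.
df = pd.read_csv(io.StringIO(raw), sep=r",(?![^[]*\])", engine="python")
print(df)
```

Note that the bracketed value is still a plain string; it would need a further parsing step to become a list.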
Pandas, ParserError: Error tokenizing data
Using the raw CSV file, I can read the values:
import pandas as pd
csv_url = "https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/examples/stock_px_2.csv"
close_px = pd.read_csv(csv_url, parse_dates=True, index_col=0)
print(close_px.head())
Output:
AAPL MSFT XOM SPX
2003-01-02 7.40 21.11 29.22 909.03
2003-01-03 7.45 21.14 29.24 908.59
2003-01-06 7.45 21.52 29.96 929.01
2003-01-07 7.43 21.93 28.95 922.93
2003-01-08 7.28 21.31 28.83 909.93