Python Pandas Error Tokenizing Data

Python Pandas Error tokenizing data

You could also try:

data = pd.read_csv('file1.csv', on_bad_lines='skip')

Do note that this will cause the offending lines to be skipped.

Edit

For pandas < 1.3.0, try

data = pd.read_csv("file1.csv", error_bad_lines=False)

as per the pandas API reference.
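If you also want to see what is being skipped, pandas 1.4+ accepts a callable for on_bad_lines (Python engine only). A minimal sketch, reusing the same file1.csv:

import pandas as pd

bad_lines = []

def collect_bad_line(fields):
    # called once per malformed line with its list of parsed fields;
    # returning None still skips the line
    bad_lines.append(fields)
    return None

data = pd.read_csv('file1.csv', on_bad_lines=collect_bad_line, engine='python')
print(bad_lines)  # inspect what was dropped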

Pandas: How to work around error tokenizing data?

Thank you @ALollz for the "very fresh" link (lucky coincidence) and @Rich Andrews for pointing out that my example actually is not "strictly correct" CSV data.

So, the approach that works for me for the time being is adapted from @ALollz' compact solution (https://stackoverflow.com/a/55129746/7295599):

### reading an "incorrect" CSV with a variable number of columns/tokens into a DataFrame
import pandas as pd

# read each physical line into a single column of raw strings ...
df = pd.read_csv('Test.csv', header=None, sep='\n')
# ... then split that column on the separator; expand=True pads short rows
df = df[0].str.split(',', expand=True)
# ... do some modifications with df
### end of code

df contains the empty string '' for entries missing at the beginning or in the middle of a row, and None for tokens missing at the end.

   0  1  2  3     4     5     6
0  1  2  3  4     5  None  None
1  1  2  3  4     5     6  None
2        3  4     5  None  None
3  1  2  3  4     5     6     7
4     2     4  None  None  None
If you write this again to a file via:

df.to_csv("Test.tab",sep="\t",header=False,index=False)

1   2   3   4   5
1   2   3   4   5   6
        3   4   5
1   2   3   4   5   6   7
    2       4

None will be converted to empty string '' and everything is fine.
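If your pandas version rejects sep='\n' as a separator, a minimal alternative sketch (assuming the same Test.csv) is to read the lines yourself and let pandas pad the short rows:

import pandas as pd

# split each line manually; missing leading/middle fields become '',
# and pd.DataFrame pads the shorter rows at the end with None
with open('Test.csv') as f:
    rows = [line.rstrip('\n').split(',') for line in f]

df = pd.DataFrame(rows)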

The next level would be to account for quoted data strings which contain the separator, such as the sample below; a short sketch for that case comes right after it.

1,2,3,4,5
,,3,"Hello, World!",5,6
1,2,3,4,5,6,7
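Python's built-in csv module already honors quoting, so a minimal sketch for this case (assuming the lines above are saved as Test.csv) replaces the plain str.split(','):

import csv
import pandas as pd

# csv.reader keeps "Hello, World!" together instead of splitting on its comma
with open('Test.csv', newline='') as f:
    rows = list(csv.reader(f))

df = pd.DataFrame(rows)  # ragged rows are again padded with None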

How can I solve pandas error tokenizing data?

By default, read_csv takes your first line as the header, so append and 6.0 become your two column names. The parser then expects two columns in every subsequent row; on line 3 it finds 4 values and throws the tokenizing error.

You need another approach to handle this data, where each line is a key followed by a variable number of values.

Per your comment - just read it all anyway

Here's how you can do that:

import pandas as pd
import numpy as np

df = pd.read_csv("something/something.tsv", sep='\t', header=None, names=np.arange(20))

names=np.arange(20) is the key; 20 can be any number larger than the maximum number of values you will have in a row. The extra columns are simply filled with NaN, and you can then do whatever you need to get the data into the shape you want.
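For instance, a small follow-up sketch (continuing from the snippet above) drops the padding columns that came back entirely empty:

# columns beyond the widest row contain only NaN; drop them
df = df.dropna(axis=1, how='all')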

Python Pandas Error while Tokenizing Data

In the pandas documentation, pd.read_csv says:

error_bad_lines bool, optional, default None

Lines with too many fields (e.g. a csv line with too many commas) will
by default cause an exception to be raised, and no DataFrame will be
returned. If False, then these “bad lines” will be dropped from the
DataFrame that is returned.

So you can pass error_bad_lines=False to pd.read_csv (note that this parameter was deprecated in pandas 1.3.0 and removed in 2.0; on newer versions use on_bad_lines='skip' instead):

df = pd.read_csv('ZCS006A_16_23AUG_ALL_20220804020843.csv', delimiter=',', error_bad_lines=False)

How to fix ParserError: Error tokenizing data for CSV pandas

After resetting the row index, it works!

df = df.reset_index()

Output:

  index  vers 2.1.0
0  info  days     6
1  info     x     a
2  info     y     b

Pandas.read_csv() Decoding Error tokenizing data because of a comma in data

You could try a regex sep, but that forces the Python engine rather than the C engine, which can be slower and use more memory. Here is the solution if you would like to go with this, given a file like:

1,2,3,4,5,6,7,8
'true',47,'y','descriptive_evidence','n','true',66,[81,65]
pd.read_csv("./file_name.csv",sep=r",(?![^[]*\])",engine="python")
|     | 1      | 2   | 3   | 4                      | 5   | 6      | 7   | 8       |
| --- | ------ | --- | --- | ---------------------- | --- | ------ | --- | ------- |
| 0 | 'true' | 47 | 'y' | 'descriptive_evidence' | 'n' | 'true' | 66 | [81,65] |
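To see what the lookahead does, you can test the pattern with re.split: it matches a comma only when no ']' appears before the next '[', i.e., only commas that are not inside square brackets, which keeps [81,65] in one field. A quick check:

import re

pattern = r",(?![^[]*\])"  # a comma NOT followed by ']' before any '['
line = "'true',47,'y','descriptive_evidence','n','true',66,[81,65]"
print(re.split(pattern, line))
# ["'true'", '47', "'y'", "'descriptive_evidence'", "'n'", "'true'", '66', '[81,65]']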

Pandas, ParserError: Error tokenizing data

Using the raw CSV file (the raw.githubusercontent.com URL rather than the GitHub HTML page), I can read the values:

import pandas as pd

csv_url = "https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/examples/stock_px_2.csv"
close_px = pd.read_csv(csv_url, parse_dates=True, index_col=0)
print(close_px.head())

Output:

            AAPL   MSFT    XOM     SPX
2003-01-02  7.40  21.11  29.22  909.03
2003-01-03  7.45  21.14  29.24  908.59
2003-01-06  7.45  21.52  29.96  929.01
2003-01-07  7.43  21.93  28.95  922.93
2003-01-08  7.28  21.31  28.83  909.93
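Because parse_dates=True together with index_col=0 yields a DatetimeIndex, you can then slice by date labels, e.g.:

# label-based date slicing works on the parsed DatetimeIndex
print(close_px.loc["2003-01-03":"2003-01-07"])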

