Import CSV with Different Number of Columns Per Row Using Pandas


Supplying a list of column names to read_csv() should do the trick.

ex: names=['a', 'b', 'c', 'd', 'e']

https://github.com/pydata/pandas/issues/2981
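For instance, a minimal sketch with hypothetical inline data (the StringIO stand-in and values are mine):

```python
import io
import pandas as pd

# Hypothetical inline data: rows with 2, 3, and 5 fields.
data = io.StringIO("1,2\n1,2,3\n1,2,3,4,5\n")

# Supplying five column names up front lets pandas pad shorter rows
# with NaN instead of raising a tokenizer error.
df = pd.read_csv(data, names=['a', 'b', 'c', 'd', 'e'])
```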

Edit: if you don't want to supply column names then do what Nicholas suggested

Detecting records with different number of columns in a CSV

If you are okay with processing the bad records later, you can use error_bad_lines and warn_bad_lines while reading the CSV file and save the row numbers of skipped records to a log file like this:

import contextlib
import pandas as pd

with open('bad_lines.txt', 'w') as log:
    with contextlib.redirect_stderr(log):
        df = pd.read_csv('output.csv', warn_bad_lines=True, error_bad_lines=False)

The above code will skip all bad lines and redirect the error lines to the log file which you can then use for reprocessing.
Let me know if that helped.
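Note that error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in 2.0; the replacement is the single on_bad_lines parameter. A minimal sketch with inline stand-in data:

```python
import io
import pandas as pd

# Inline stand-in for output.csv: the second data row has an extra field.
data = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n")

# on_bad_lines='skip' drops malformed rows; 'warn' would also report them.
df = pd.read_csv(data, on_bad_lines='skip')
```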

Edit:

Here's a somewhat hacky solution if you don't want to reprocess the bad records. The idea is to read the three columns of the CSV as a single column by using a different separator, and then, for every row where the element count exceeds the number of columns (3 in this case), keep the last two values as-is and concatenate everything before them into one field. That way, no matter how many commas appear in the Name field, it should work:

import pandas as pd

df = pd.read_csv('output.csv', sep=';')  # Notice the sep here

col_count = 3

def str_check(x):
    x = x.split(',')
    if len(x) > col_count:
        x = [', '.join(x[:-(col_count-1)])] + x[-(col_count-1):]
        # Here col_count is 3, so with hardcoded values it would be
        # [', '.join(x[:-2])] + x[-2:]
        # i.e. join everything before the last two elements into one element
    return ';'.join(x)

df['Name, Address, Phone'] = df['Name, Address, Phone'].apply(str_check)
df = df['Name, Address, Phone'].str.split(';', expand=True)
df.columns = ['Name', 'Address', 'Phone']
df
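A self-contained sketch of the same trick with hypothetical inline data (the name, address, and phone values are mine, not from the original):

```python
import io
import pandas as pd

col_count = 3

def str_check(x):
    x = x.split(',')
    if len(x) > col_count:
        # Keep the last two fields; merge everything before them into one.
        x = [', '.join(x[:-(col_count - 1)])] + x[-(col_count - 1):]
    return ';'.join(x)

# Hypothetical file contents: the Name field itself contains a comma.
data = io.StringIO("Name, Address, Phone\nDoe,John,12 Main St,555-0100\n")

df = pd.read_csv(data, sep=';')  # the whole line lands in a single column
df['Name, Address, Phone'] = df['Name, Address, Phone'].apply(str_check)
df = df['Name, Address, Phone'].str.split(';', expand=True)
df.columns = ['Name', 'Address', 'Phone']
```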

Import csv with inconsistent count of columns per row with original header using pandas

One approach would be to first read just the header row in and then pass these column names with your extra generic names as a parameter to pandas. For example:

import pandas as pd
import csv

filename = "input.csv"

with open(filename, newline="") as f_input:
    header = next(csv.reader(f_input))

header += [f'x{n}' for n in range(1, 10)]

tempfile = pd.read_csv(filename,
                       index_col=None,
                       sep=',',
                       skiprows=1,
                       names=header,
                       error_bad_lines=False,
                       encoding='unicode_escape',
                       warn_bad_lines=True,
                       )

skiprows=1 tells pandas to skip over the header row, and names holds the full list of column names to use.

The header would then contain:

['a', 'b', 'c', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9']
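The same idea can be sketched self-contained with inline data (the two extra x-columns here are an assumption sized to the widest row):

```python
import io
import pandas as pd

text = "a,b,c\n1,2,3\n4,5,6,7,8\n"

# Read just the header row, then pad it with generic extra names.
header = text.splitlines()[0].split(',')
header += [f'x{n}' for n in range(1, 3)]

# skiprows=1 jumps over the original header; names supplies the padded one.
df = pd.read_csv(io.StringIO(text), skiprows=1, names=header)
```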

Pandas read_csv create columns based on maximum number of delimiters in a row

If the problem is only the header, setting the header parameter to None will solve your problem.

pd.read_csv(file_data, header=None)

If the number of delimiters on each row is different, you need to read each line using the open() function.

import pandas as pd

with open('test.csv', 'r') as f:
    df = [i.strip().split(',') for i in f.readlines()]

df = pd.DataFrame(df)
print(df)

Output: (I added "1,2,3,4,5,6\n" and "11,22,33\n" after the last row)

         0        1        2       3     4     5
0  header1  header2  header3    None  None  None
1   value1   value2   value3  value4  None  None
2        1        2        3       4     5     6
3       11       22       33    None  None  None

How to read the csv file properly if each row contains different number of fields (number quite big)?

As suggested, DictReader could also be used as follows to create a list of rows. This could then be imported as a frame in pandas:

import pandas as pd
import csv

rows = []
csv_header = ['user', 'item', 'time', 'rating', 'review']
frame_header = ['user', 'item', 'rating', 'review']

with open('input.csv', newline='') as f_input:
    for row in csv.DictReader(f_input, delimiter=' ', fieldnames=csv_header[:-1],
                              restkey=csv_header[-1], skipinitialspace=True):
        try:
            rows.append([row['user'], row['item'], row['rating'], ' '.join(row['review'])])
        except KeyError:
            rows.append([row['user'], row['item'], row['rating'], ' '])

frame = pd.DataFrame(rows, columns=frame_header)
print(frame)

This would display the following:

         user      item rating                                  review
0  disjiad123  TYh23hs9      5  I love this phone as it is easy to use
1  hjf2329ccc  TGjsk123      3                         Suck restaurant

If the review appears at the start of the row, then one approach would be to parse the line in reverse as follows:

import pandas as pd

rows = []
frame_header = ['rating', 'time', 'item', 'user', 'review']

with open('input.csv') as f_input:
    for row in f_input:
        # Strip the line ending, reverse the line, split on spaces,
        # then re-reverse each field.
        cols = [col[::-1] for col in row.rstrip('\n')[::-1].split(' ') if len(col)]
        rows.append(cols[:4] + [' '.join(cols[4:][::-1])])

frame = pd.DataFrame(rows, columns=frame_header)
print(frame)

This would display:

  rating      time      item        user  \
0      5  13160032  TYh23hs9  disjiad123
1      3  14423321  TGjsk123  hjf2329ccc

                                   review
0  I love this phone as it is easy to use
1                         Suck restaurant

row.rstrip('\n')[::-1] removes the line ending and reverses the text of the whole line. Each line is then split on spaces, and a list comprehension re-reverses each split entry. Finally, a row is appended by first taking the four fixed column entries (now at the start of the list); the remaining entries are joined back together with a space and added as the final review column.

The benefit of this approach is that it does not rely on your input data being in an exactly fixed-width format, and you don't have to worry about the column widths changing over time.
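A self-contained sketch of the reverse-parsing idea, with inline lines modeled on the data above and rstrip() used to drop the line ending (a slight variation on the slicing in the answer):

```python
import pandas as pd

# Stand-in for input.csv: free-text review first, fixed fields at the end.
lines = [
    "I love this phone as it is easy to use disjiad123 TYh23hs9 13160032 5\n",
    "Suck restaurant hjf2329ccc TGjsk123 14423321 3\n",
]

rows = []
for row in lines:
    # Reverse the stripped line, split on spaces, re-reverse each field.
    cols = [col[::-1] for col in row.rstrip('\n')[::-1].split(' ') if len(col)]
    # Fixed fields come first in the reversed order; the rest is the review.
    rows.append(cols[:4] + [' '.join(cols[4:][::-1])])

frame = pd.DataFrame(rows, columns=['rating', 'time', 'item', 'user', 'review'])
```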

Is there a way to add a different number of columns in each row of a csv file?

This illustrates what I meant in my comment about supplying empty values for missing columns. It uses a csv.writer to write the rows.

import csv

filename = 'varying_rows.csv'

with open(filename, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['name', 'gender', 'age', 'website1', 'link1'])
    writer.writerow([None]*3 + ['website2', 'link2'])
    writer.writerow([None]*4 + ['link3'])

Resulting file (csv.writer writes None values as empty fields):

name,gender,age,website1,link1
,,,website2,link2
,,,,link3

You could also get the same result by using a csv.DictWriter instead which would allow just leaving out the empty columns:

filename = 'varying_rows2.csv'
with open(filename, 'w', newline='') as file:
    col_names = 'ABCDE'  # Works like ['A', 'B', 'C', 'D', 'E'].
    writer = csv.DictWriter(file, col_names)
    writer.writerow(dict(A='name', B='gender', C='age', D='website1', E='link1'))
    writer.writerow(dict(D='website2', E='link2'))
    writer.writerow(dict(E='link3'))
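Why leaving keys out works: DictWriter fills any missing fields with its restval argument, which defaults to the empty string. A quick check with an in-memory buffer:

```python
import csv
import io

# DictWriter pads missing keys with restval (default ''), so rows may
# legitimately populate different numbers of columns.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames='ABCDE')
writer.writerow(dict(D='website2', E='link2'))
```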

How to group data by count of columns in Pandas?

Since pandas doesn't allow rows of different lengths, just don't use it to import your data directly. Your goal is to create three separate df's, so first import the data as lists and then deal with the different lengths.

One way to solve this is to read the data with csv.reader and create the df's with list comprehensions together with a condition on the length of the lists.

import csv
import pandas as pd

with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

df1 = pd.DataFrame([item for item in data if len(item) == 3], columns='ID NAME AGE'.split())
df2 = pd.DataFrame([item for item in data if len(item) == 4], columns='ID NAME COUNTRY AGE'.split())
df3 = pd.DataFrame([item for item in data if len(item) == 5], columns='ID NAME STATE COUNTRY HOBBY'.split())

print(df1, df2, df3, sep='\n\n')

  ID  NAME AGE
0  1  NATA  18

  ID  NAME COUNTRY AGE
0  1  OLEG      FR  18

  ID  NAME STATE COUNTRY HOBBY
0  1  OLEG    US  FRANCE   BIG
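A self-contained version of the filtering step, with inline data mirroring the rows above:

```python
import csv
import io
import pandas as pd

# Inline stand-in for input.csv (space-delimited, ragged rows).
text = "1 NATA 18\n1 OLEG FR 18\n1 OLEG US FRANCE BIG\n"
data = list(csv.reader(io.StringIO(text), delimiter=' '))

# Keep only rows whose field count matches the target schema.
df1 = pd.DataFrame([r for r in data if len(r) == 3], columns=['ID', 'NAME', 'AGE'])
```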

If you need to hardcode too many lines for the same step (e.g. too many df's), then you should consider using a loop to create them and store each dataframe as a key/value pair in a dictionary.

EDIT
Here is a slightly more optimized way of creating those df's. I don't think you can get around defining a list of the columns you want to use for the separate df's, so you need to know which column-count variations occur in your data (unless you want to create the df's without naming the columns).

col_list = [['ID', 'NAME', 'AGE'],
            ['ID', 'NAME', 'COUNTRY', 'AGE'],
            ['ID', 'NAME', 'STATE', 'COUNTRY', 'HOBBY']]

with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

dict_of_dfs = {}
for cols in col_list:
    dict_of_dfs[f'df_{len(cols)}'] = pd.DataFrame(
        [item for item in data if len(item) == len(cols)], columns=cols)

for key, val in dict_of_dfs.items():
    print(f'{key=}: \n {val} \n')

key='df_3':
ID NAME AGE
0 1 NATA 18

key='df_4':
ID NAME COUNTRY AGE
0 1 OLEG FR 18

key='df_5':
ID NAME STATE COUNTRY HOBBY
0 1 OLEG US FRANCE BIG

Now you don't have variables for your df's; instead, you have them in a dictionary as keys. (I named each df after the number of columns it has: df_3 is the df with three columns.)

If you need to import the data with pandas, you could have a look at this post.


