import csv with different number of columns per row using Pandas
Supplying a list of column names via the names parameter of read_csv() should do the trick.
ex: names=['a', 'b', 'c', 'd', 'e']
https://github.com/pydata/pandas/issues/2981
Edit: if you don't want to supply column names then do what Nicholas suggested
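A minimal sketch of that trick, using a made-up in-memory file whose rows carry 3, 4 and 5 fields:

```python
import io
import pandas as pd

# Made-up ragged CSV for illustration.
data = io.StringIO("1,2,3\n1,2,3,4\n1,2,3,4,5\n")

# Passing names sized to the widest row makes pandas pad the
# shorter rows with NaN instead of raising a tokenizing error.
df = pd.read_csv(data, names=['a', 'b', 'c', 'd', 'e'])
print(df)
```

The key point is that the names list must be at least as long as the widest row in the file.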
Detecting records with different number of columns in a CSV
If you are okay with processing the bad records later, you can use the error_bad_lines and warn_bad_lines options (deprecated since pandas 1.3 in favour of on_bad_lines) while reading the CSV file, and save the row numbers of the skipped records to a log file like this:

import contextlib

import pandas as pd

with open('bad_lines.txt', 'w') as log:
    with contextlib.redirect_stderr(log):
        df = pd.read_csv('output.csv', warn_bad_lines=True, error_bad_lines=False)

The above code will skip all bad lines and redirect the warnings about them to the log file, which you can then use for reprocessing.
Let me know if that helped.
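In pandas 1.3+ the error_bad_lines/warn_bad_lines pair is replaced by a single on_bad_lines parameter, and the old names are removed entirely in pandas 2.0. A sketch with a made-up in-memory file:

```python
import io
import pandas as pd

# Made-up data: the second data row has one field too many.
data = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n")

# on_bad_lines='skip' silently drops offending rows;
# on_bad_lines='warn' also emits a warning for each one.
df = pd.read_csv(data, on_bad_lines='skip')
print(df)  # only the two well-formed data rows survive
```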
Edit:
Here is a bit of a hacky solution if you don't want to reprocess the bad records. The idea is to read the three columns of the CSV as a single column by using a different separator, and then, for every row where the count of elements is greater than the number of columns (3 in this case), keep the last two values as-is and concatenate everything before them. That way, no matter how many commas appear in the Name field, it should work:
df = pd.read_csv('output.csv', sep=';')  # Notice the sep here
col_count = 3

def str_check(x):
    x = x.split(',')
    if len(x) > col_count:
        # Join everything before the last two elements into one element.
        # With col_count = 3 hardcoded this is [', '.join(x[:-2])] + x[-2:]
        x = [', '.join(x[:-(col_count-1)])] + x[-(col_count-1):]
    return ';'.join(x)

df['Name, Address, Phone'] = df['Name, Address, Phone'].apply(str_check)
df = df['Name, Address, Phone'].str.split(';', expand=True)
df.columns = ['Name', 'Address', 'Phone']
df
Import csv with inconsistent count of columns per row, keeping the original header, using pandas
One approach would be to first read just the header row in and then pass these column names with your extra generic names as a parameter to pandas. For example:
import pandas as pd
import csv
filename = "input.csv"
with open(filename, newline="") as f_input:
    header = next(csv.reader(f_input))
header += [f'x{n}' for n in range(1, 10)]

tempfile = pd.read_csv(filename,
                       index_col=None,
                       sep=',',
                       skiprows=1,
                       names=header,
                       on_bad_lines='warn',  # replaces error_bad_lines/warn_bad_lines (removed in pandas 2.0)
                       encoding='unicode_escape',
                       )
skiprows=1 tells pandas to jump over the header row, and names holds the full list of column headers to use. The header would then contain:
['a', 'b', 'c', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9']
Pandas read_csv create columns based on maximum number of delimiters in a row
If the problem is only the header, setting the header parameter to None will solve it:
pd.read_csv(file_data, header=None)
If the number of delimiters on each row is different, you need to read each line using the open() function.
import pandas as pd

with open('test.csv', 'r') as f:
    df = [i.strip().split(',') for i in f.readlines()]

df = pd.DataFrame(df)
print(df)
Output (I added "1,2,3,4,5,6\n" and "11,22,33\n" after the last row):

         0        1        2       3     4     5
0  header1  header2  header3    None  None  None
1   value1   value2   value3  value4  None  None
2        1        2        3       4     5     6
3       11       22       33    None  None  None
How to read the csv file properly if each row contains different number of fields (number quite big)?
As suggested, DictReader could also be used as follows to create a list of rows, which can then be imported as a frame in pandas:
import pandas as pd
import csv

rows = []
csv_header = ['user', 'item', 'time', 'rating', 'review']
frame_header = ['user', 'item', 'rating', 'review']

with open('input.csv', newline='') as f_input:
    for row in csv.DictReader(f_input, delimiter=' ', fieldnames=csv_header[:-1],
                              restkey=csv_header[-1], skipinitialspace=True):
        try:
            rows.append([row['user'], row['item'], row['rating'], ' '.join(row['review'])])
        except KeyError:
            # No extra fields were read, so there is no review entry.
            rows.append([row['user'], row['item'], row['rating'], ' '])

frame = pd.DataFrame(rows, columns=frame_header)
print(frame)
This would display the following:
user item rating review
0 disjiad123 TYh23hs9 5 I love this phone as it is easy to use
1 hjf2329ccc TGjsk123 3 Suck restaurant
If the review appears at the start of the row, then one approach would be to parse the line in reverse as follows:
import pandas as pd

rows = []
frame_header = ['rating', 'time', 'item', 'user', 'review']

with open('input.csv') as f_input:
    for row in f_input:
        # Strip the line ending, reverse the whole line, split on spaces,
        # then re-reverse each entry.  (The Python 2 original opened the
        # file in binary mode and used row[::-1][2:] to skip the reversed
        # '\r\n' line ending.)
        cols = [col[::-1] for col in row.rstrip()[::-1].split(' ') if len(col)]
        rows.append(cols[:4] + [' '.join(cols[4:][::-1])])

frame = pd.DataFrame(rows, columns=frame_header)
print(frame)
This would display:
rating time item user \
0 5 13160032 TYh23hs9 isjiad123
1 3 14423321 TGjsk123 hjf2329ccc
review
0 I love this phone as it is easy to used
1 Suck restaurant
Reversing the line with [::-1] flips its text; stripping the line ending first keeps the data intact (the Python 2 original opened the file in binary mode and skipped the reversed '\r\n' with [2:]). Each line is then split on spaces, and a list comprehension re-reverses each split entry. The row appended to rows starts with the fixed four column entries (now at the front); the remaining entries are joined back together with a space and added as the final column.
The benefit of this approach is that it does not rely on your input data being in an exactly fixed width format, and you don't have to worry if the column widths being used change over time.
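In Python 3 the same right-to-left split can be had without reversing at all: str.rsplit with maxsplit only splits at the rightmost separators, so the free-text field survives intact. The sample lines below are made up to match the output shown above:

```python
import pandas as pd

# Made-up lines: free-text review first, then four fixed fields
# (user, item, time, rating) at the end.
lines = [
    "I love this phone as it is easy to use disjiad123 TYh23hs9 13160032 5",
    "Suck restaurant hjf2329ccc TGjsk123 14423321 3",
]

rows = []
for line in lines:
    # maxsplit=4 splits only the rightmost four spaces, leaving the
    # review intact regardless of how many words it contains.
    review, user, item, time, rating = line.rsplit(maxsplit=4)
    rows.append([rating, time, item, user, review])

frame = pd.DataFrame(rows, columns=['rating', 'time', 'item', 'user', 'review'])
print(frame)
```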
Is there a way to add a different number of columns in each row of a csv file?
This illustrates what I meant in my comment about supplying empty values for missing columns. It uses a csv.writer
to write the rows.
import csv

filename = 'varying_rows.csv'

with open(filename, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['name', 'gender', 'age', 'website1', 'link1'])
    writer.writerow([None]*3 + ['website2', 'link2'])
    writer.writerow([None]*4 + ['link3'])
Resulting file (csv.writer writes None values as empty fields):

name,gender,age,website1,link1
,,,website2,link2
,,,,link3
You could also get the same result by using a csv.DictWriter instead, which allows just leaving out the empty columns:
filename = 'varying_rows2.csv'

with open(filename, 'w', newline='') as file:
    col_names = 'ABCDE'  # Works like ['A', 'B', 'C', 'D', 'E'].
    writer = csv.DictWriter(file, col_names)
    writer.writerow(dict(A='name', B='gender', C='age', D='website1', E='link1'))
    writer.writerow(dict(D='website2', E='link2'))
    writer.writerow(dict(E='link3'))
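The reason the short rows come out padded is DictWriter's restval parameter, which is written for every missing key and defaults to ''. Writing to an in-memory buffer instead of a file makes that visible:

```python
import csv
import io

output = io.StringIO()
col_names = 'ABCDE'  # Works like ['A', 'B', 'C', 'D', 'E'].

# restval (default '') is substituted for every fieldname missing
# from a row's dict, which produces the leading empty columns.
writer = csv.DictWriter(output, col_names)
writer.writerow(dict(A='name', B='gender', C='age', D='website1', E='link1'))
writer.writerow(dict(D='website2', E='link2'))
writer.writerow(dict(E='link3'))

print(output.getvalue())
```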
How to group data by count of columns in Pandas?
Since pandas doesn't allow columns of different lengths, don't use it to import your data directly. Your goal is to create three separate df's, so first import the data as lists, then deal with their different lengths.
One way to solve this is to read the data with csv.reader and create the df's with a list comprehension plus a condition on the length of each list.
import csv
import pandas as pd

with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

df1 = pd.DataFrame([item for item in data if len(item) == 3], columns='ID NAME AGE'.split())
df2 = pd.DataFrame([item for item in data if len(item) == 4], columns='ID NAME COUNTRY AGE'.split())
df3 = pd.DataFrame([item for item in data if len(item) == 5], columns='ID NAME STATE COUNTRY HOBBY'.split())

print(df1, df2, df3, sep='\n\n')
ID NAME AGE
0 1 NATA 18
ID NAME COUNTRY AGE
0 1 OLEG FR 18
ID NAME STATE COUNTRY HOBBY
0 1 OLEG US FRANCE BIG
If you need to hardcode too many lines for the same step (e.g. too many df's), you should consider using a loop to create them and store each dataframe as a key/value pair in a dictionary.
EDIT
Here is a slightly optimized way of creating those df's. I don't think you can get around creating a list of the columns you want to use for the separate df's, so you need to know which variations of column counts occur in your data (unless you want to create those df's without naming the columns).
col_list = [['ID', 'NAME', 'AGE'],
            ['ID', 'NAME', 'COUNTRY', 'AGE'],
            ['ID', 'NAME', 'STATE', 'COUNTRY', 'HOBBY']]

with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

dict_of_dfs = {}
for cols in col_list:
    dict_of_dfs[f'df_{len(cols)}'] = pd.DataFrame(
        [item for item in data if len(item) == len(cols)], columns=cols)

for key, val in dict_of_dfs.items():
    print(f'{key=}: \n {val} \n')
key='df_3':
ID NAME AGE
0 1 NATA 18
key='df_4':
ID NAME COUNTRY AGE
0 1 OLEG FR 18
key='df_5':
ID NAME STATE COUNTRY HOBBY
0 1 OLEG US FRANCE BIG
Now you don't have separate variables for your df's; instead they live in a dictionary. (I named each df after the number of columns it has, so df_3 is the df with three columns.)
If you need to import the data with pandas, you could have a look at this post.