Memory Error When Using Pandas Read_Csv

MemoryError when opening CSV file with pandas

This may not be the most efficient way, but give it a go.
Reduce or increase the chunk size depending on how much RAM you have available.

import pandas as pd

chunks = pd.read_csv('report_OOP_Full.csv', chunksize=10000)
chunk_list = []
for chunk in chunks:
    chunk_list.append(chunk)
df = pd.concat(chunk_list, sort=True)

If this doesn't work, try this:

chunks = pd.read_csv('report_OOP_Full.csv', chunksize=10000)
chunk_list = []
for i, chunk in enumerate(chunks):
    if i >= 10:
        # stop after 10 chunks (100,000 rows)
        break
    chunk_list.append(chunk)
df1 = pd.concat(chunk_list, sort=True)

# Second pass: skip the rows the first pass already loaded (100,000 data rows plus
# the header line) and read the next 10 chunks. The column names are passed
# explicitly because the header line is skipped along with the data.
chunks = pd.read_csv('report_OOP_Full.csv', skiprows=100001, chunksize=10000,
                     header=None, names=df1.columns)
chunk_list = []
for i, chunk in enumerate(chunks):
    if i >= 10:
        break
    chunk_list.append(chunk)
df2 = pd.concat(chunk_list, sort=True)

df3 = pd.concat([df1, df2], sort=True)

skiprows was calculated from how many rows the previous pass had already read in (plus one line for the header, which gets skipped along with the data).

The first loop breaks after 10 chunks are loaded and stores the result as df1; the file is then read again starting where the first pass stopped, the next 10 chunks are stored as df2, and the two are concatenated.

I understand that you're working with some big data, so I encourage you to take a look at this function I found. The link below explains how it works.
Credit for this function is here:
credit

import numpy as np

def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    # Downcast each numeric column to the smallest dtype that can hold its range;
    # object (string) columns are left untouched.
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.uint8).min and c_max < np.iinfo(np.uint8).max:
                    df[col] = df[col].astype(np.uint8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.uint16).min and c_max < np.iinfo(np.uint16).max:
                    df[col] = df[col].astype(np.uint16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.uint32).min and c_max < np.iinfo(np.uint32).max:
                    df[col] = df[col].astype(np.uint32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
                elif c_min > np.iinfo(np.uint64).min and c_max < np.iinfo(np.uint64).max:
                    df[col] = df[col].astype(np.uint64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

This will make sure your dataframe uses as little memory as possible while you're working with it.
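A minimal sketch of how it could be combined with the chunked read above (same filename as before; reduce_mem_usage is the function defined above):

import pandas as pd

chunk_list = []
for chunk in pd.read_csv('report_OOP_Full.csv', chunksize=10000):
    # Downcast each chunk as it arrives so peak memory stays low.
    chunk_list.append(reduce_mem_usage(chunk))
df = pd.concat(chunk_list, sort=True)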

memory error while loading csv file?

The reason you get this low_memory warning might be that guessing dtypes for each column is very memory-demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.
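If you already know the schema, one way to sidestep that guessing is to pass the dtypes explicitly. This is only a sketch; the column names and dtypes below are made up, so replace them with your own:

import pandas as pd

# Hypothetical column names and dtypes -- replace them with your actual schema.
dtypes = {'id': 'int32', 'amount': 'float32', 'region': 'category'}
df = pd.read_csv('file_name.csv', dtype=dtypes)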

In case you are using a 32-bit system: memory errors happen a lot with Python when using the 32-bit version on Windows, because a 32-bit process only gets 2 GB of memory to play with by default.

Try this:

tp = pd.read_csv('file_name.csv', header=None, chunksize=1000)
df = pd.concat(tp, ignore_index=True)

pandas.read_csv gives memory error despite comparatively small dimensions

I was testing the file you shared, and the problem is that this CSV file has leading and trailing double quotes on every line (so pandas thinks the whole line is one column). They have to be removed before processing, for example with sed on Linux, by processing and re-saving the file in Python, or by replacing all double quotes in a text editor.
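A minimal sketch of the "re-save in Python" option; the filenames here are placeholders, so adjust them to your own paths:

import pandas as pd

# Strip one leading and one trailing double quote from every line, then re-save.
with open('original.csv', 'r', encoding='utf-8') as src, \
        open('cleaned.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line.rstrip('\n').strip('"') + '\n')

df = pd.read_csv('cleaned.csv')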

memory error reading big size csv in pandas

You have to iterate over the chunks:

csv_length = 0
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=10000):
    csv_length += chunk.count()
print(csv_length)

Python Pandas: read_csv with chunksize and concat still throws MemoryError

You can try writing in chunks. Roughly:

df = pd.read_csv("Data.csv", chunksize = 10000)
header = True
for chunk in df:
chunk=chunk[chunk['Geography']=='Ontario']
chunk.to_csv(outfilename, header=header, mode='a')
header = False

Idea from here.


