Python Out of Memory on Large CSV File (Numpy)

Python out of memory on large CSV file (numpy)

As other folks have mentioned, for a really large file, you're better off iterating.
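A minimal sketch of the iterative route (the file name and the per-row reduction are just placeholders); memory stays flat no matter how big the file is:

# Process one row at a time instead of loading everything at once;
# the running totals stand in for whatever per-row work you need to do.
total = 0.0
nrows = 0
with open('yourfile.txt') as infile:
    for line in infile:
        values = [float(x) for x in line.rstrip().split(',')]
        total += sum(values)
        nrows += 1
print(total, nrows)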

However, you do commonly want the entire thing in memory for various reasons.

genfromtxt is much less efficient than loadtxt (though it handles missing data, whereas loadtxt is more "lean and mean", which is why the two functions co-exist).

If your data is very regular (e.g. just simple delimited rows of all the same type), you can also improve on either by using numpy.fromiter.

If you have enough RAM, consider using np.loadtxt('yourfile.txt', delimiter=','). (You may also need to specify skiprows if the file has a header.)
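For instance, a minimal sketch (the file name and the single-line header are assumptions):

import numpy as np

# Skip a one-line header and parse comma-delimited numeric rows
# into a single 2-D float64 array.
data = np.loadtxt('yourfile.txt', delimiter=',', skiprows=1)
print(data.shape, data.dtype)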

As a quick comparison, loading a ~500MB text file with loadtxt uses ~900MB of RAM at peak, while loading the same file with genfromtxt uses ~2.5GB.

Loadtxt: [figure: memory and CPU usage of numpy.loadtxt while loading a ~500MB ASCII file]

Genfromtxt: [figure: memory and CPU usage of numpy.genfromtxt while loading a ~500MB ASCII file]


Alternatively, consider something like the following. It will only work for very simple, regular data, but it's quite fast. (loadtxt and genfromtxt do a lot of guessing and error-checking. If your data is very simple and regular, you can improve on them greatly.)

import numpy as np

def generate_text_file(length=1e6, ncols=20):
    # int() is needed because 1e6 is a float
    data = np.random.random((int(length), ncols))
    np.savetxt('large_text_file.csv', data, delimiter=',')

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        # Remember how many columns the last row had so we can reshape below.
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

#generate_text_file()
data = iter_loadtxt('large_text_file.csv')

Fromiter: [figure: memory and CPU usage when using fromiter to load the same ~500MB data file]

Large csv files: MemoryError: Unable to allocate 3.25 GiB for an array with shape (7, 62388743) and data type object

You can limit the number of columns with usecols, which reduces the memory footprint. You also seem to have some bad data in the CSV file that makes columns you expect to be int64 come out as object. These could be empty cells or any non-numeric value. Here is an example that reads the CSV and then scans for bad data. This example uses commas rather than tabs because that's a bit easier to demonstrate.

import pandas as pd
import numpy as np
import io
import re

test_csv = io.StringIO("""field1,field2,field3,other
1,2,3,this
4,what?,6,is
7,,9,extra""")

_numbers_re = re.compile(r"\d+$")

# error_bad_lines is deprecated in newer pandas; use on_bad_lines='skip' there.
df = pd.read_csv(test_csv, sep=",", error_bad_lines=False,
                 usecols=['field1', 'field2', 'field3'])
print(df)

# columns that aren't int64
bad_cols = list(df.dtypes[df.dtypes != np.dtype('int64')].index)
if bad_cols:
    print("bad cols", bad_cols)
    for bad_col in bad_cols:
        col = df[bad_col]
        bad = col[col.str.match(_numbers_re) != True]
        print(bad)
    exit(1)

major memory problems reading in a csv file using numpy

import re
import numpy as np
import pandas

def load_file(filename, num_cols, delimiter='\t'):
    data = None
    try:
        # Fast path: a previously serialized binary copy of the matrix.
        data = np.load(filename + '.npy')
    except (OSError, ValueError):
        splitter = re.compile(delimiter)

        def items(infile):
            for line in infile:
                for item in splitter.split(line):
                    yield item

        with open(filename, 'r') as infile:
            data = np.fromiter(items(infile), dtype=np.float64, count=-1)
        data = data.reshape((-1, num_cols))
        np.save(filename, data)

    return pandas.DataFrame(data)

This reads in the 2.5GB file and serializes the output matrix. The input file is read lazily, so no intermediate data structures are built and minimal memory is used. The initial load takes a long time, but each subsequent load (of the serialized file) is fast. Please let me know if you have tips!
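For reference, usage might look something like this (the file name and column count are assumptions); the first call does the slow text parse and writes the .npy cache, and later calls load the cache directly:

# Hypothetical usage: 'big_matrix.tsv' and num_cols=20 are placeholders.
df = load_file('big_matrix.tsv', num_cols=20)
print(df.shape)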

Why does np.genfromtxt() initially use up a large amount of memory for large datasets?

@Kasramvd made a good suggestion in the comments to look into the solutions proposed here. The iter_loadtxt() solution from that answer turned out to be the perfect solution for my issue:

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        # Remember how many columns the last row had so we can reshape below.
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

The reason genfromtxt() takes up so much memory is that it does not store the data in an efficient NumPy array while it is parsing the file; it builds up intermediate Python structures first, which explains the excessive memory usage while NumPy was parsing my large data file.
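A rough way to see the cost of those intermediate Python structures compared to the finished array (sizes are approximate and platform-dependent):

import sys
import numpy as np

n = 1_000_000
as_list = [float(i) for i in range(n)]      # one Python float object per item
as_array = np.arange(n, dtype=np.float64)   # one contiguous float64 buffer

# The list pays for a pointer per item plus a full float object per item;
# the array stores just 8 bytes per element.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
print(f"list : ~{list_bytes / 1e6:.0f} MB")
print(f"array: ~{as_array.nbytes / 1e6:.0f} MB")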

Processing a very very big data set in python - memory error

As noted by @DSM in the comments, the reason you're getting a memory error is that calling np.size on a list will copy the data into an array first and then get the size.

If you don't need to work with it as a numpy array, just don't call np.size. If you do want numpy-like indexing options and so on, you have a few options.

You could use pandas, which is meant for handling big not-necessarily-numerical datasets and has some great helpers and stuff for doing so.
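A minimal sketch of that route (the path is the same placeholder used in the snippet below; pandas infers the column dtypes):

import pandas as pd

# Read the mixed numeric/text CSV straight into a DataFrame and check
# what it actually costs in memory.
df = pd.read_csv(r'some_path\Train.csv')
print(df.dtypes)
print(df.memory_usage(deep=True).sum() / 1e6, "MB")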

If you don't want to do that, you could define a numpy structured array and populate it line by line in the first place, rather than making a list and copying into it. Something like:

import csv
import numpy as np

# String fields in a structured dtype need a fixed length (e.g. 'U20');
# a bare str would give a zero-length field that truncates everything.
fields = [('name1', 'U20'), ('name2', float), ...]
data = np.zeros((num_rows,), dtype=fields)

with open(r'some_path\Train.csv', 'r', newline='') as f:
    csv_file_object = csv.reader(f)
    header = next(csv_file_object)
    for i, row in enumerate(csv_file_object):
        data[i] = tuple(row)

You could also define fields based on header so you don't have to manually type out all 50 column names, though you'd have to do something about specifying the data types for each.
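For example, a sketch of that idea (treating every column as a fixed-width string, since the real per-column types aren't known here):

import csv
import numpy as np

# Build the structured dtype from the header row instead of typing out
# all 50 names by hand; 'U32' for every column is a placeholder, and you
# would swap in float or int for the numeric columns.
with open(r'some_path\Train.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
fields = [(name, 'U32') for name in header]
data = np.zeros((num_rows,), dtype=fields)   # num_rows assumed known, as above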

Memory error when using Numpy load text

First off, I'd check that you're actually using a 64-bit build of python. On Windows, it's common to wind up with 32-bit builds, even on 64-bit systems.

Try:

import platform
print(platform.architecture()[0])

If you see 32bit, that's your problem. A 32-bit executable can only address 2GB of memory, so you can never have an array (or other object) over 2GB.


However, loadtxt is rather inefficient because it works by building up a list and then converting it to a numpy array. Your example code does the same thing. (pandas.read_csv is much more efficient and very heavily optimized, if you happen to have pandas around.)
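If pandas is available, a sketch along these lines (the file name and delimiter are assumptions) gets you a NumPy array via pandas' much faster C parser:

import pandas as pd

# Parse a purely numeric, comma-delimited file with pandas' C parser,
# then hand back a plain float64 ndarray.
arr = pd.read_csv('yourfile.txt', delimiter=',', header=None).to_numpy()
print(arr.shape, arr.dtype)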

A list is a much less memory-efficient structure than a numpy array. It is essentially an array of pointers: each item costs at least an extra 8 bytes (64 bits) for the pointer, on top of the Python object it points to.

You can improve on this by using numpy.fromiter if you need "leaner" text I/O. See Python out of memory on large CSV file (numpy) for a more complete discussion (shameless plug).


Nonetheless, I don't think your problem is loadtxt. I think it's a 32-bit build of python.


