Large, Persistent Dataframe in Pandas

Large, persistent DataFrame in pandas

In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague but it's been known for a long time: http://github.com/pydata/pandas/issues/407).

At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row by row into a pre-allocated NumPy array or memory-mapped file, np.memmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000), then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
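
A minimal sketch of that chunked approach (the file name and chunk size are just placeholders):

import pandas as pd

# read the CSV in 1000-row pieces instead of one big slurp
reader = pd.read_csv('large_file.csv', iterator=True, chunksize=1000)

# stitch the pieces back together into a single DataFrame
df = pd.concat(reader, ignore_index=True)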

Loading big CSV file with pandas

I suggest that you install the 64-bit version of WinPython. Then you should be able to load a 250 MB file without problems.

How to reversibly store and load a Pandas dataframe to/from disk

The easiest way is to pickle it using to_pickle:

df.to_pickle(file_name)  # where to save it, usually as a .pkl

Then you can load it back using:

df = pd.read_pickle(file_name)

Note: before 0.11.1, save and load were the only ways to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).


Another popular choice is to use HDF5 (PyTables), which offers very fast access times for large datasets:

import pandas as pd
store = pd.HDFStore('store.h5')

store['df'] = df   # save it
df = store['df']   # load it
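
If you'd rather not manage an HDFStore object yourself, the same round trip can be done with the to_hdf / read_hdf convenience methods (a minimal sketch using the same file and key as above):

df.to_hdf('store.h5', key='df', mode='w')  # save it
df = pd.read_hdf('store.h5', 'df')         # load it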

More advanced strategies are discussed in the cookbook.


Since 0.13 there's also msgpack, which may be better for interoperability, as a faster alternative to JSON, or if you have Python object/text-heavy data (see this question).
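
A minimal sketch of the msgpack round trip (note: msgpack support was deprecated in pandas 0.25 and removed in 1.0, so this only works on older versions; the file name is a placeholder):

df.to_msgpack('frame.msg')         # save it
df = pd.read_msgpack('frame.msg')  # load it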

How to create a large pandas dataframe from an sql query without running out of memory?

Update: pandas now has built-in support for chunked loading via the chunksize argument of read_sql; see the sketch at the end of this answer.

You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:

import pandas as pd
import pandas.io.sql as psql

chunk_size = 10000
offset = 0
dfs = []
while True:
    # ORDER BY has to come before LIMIT/OFFSET
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(psql.read_frame(sql, cnxn))  # cnxn is an existing DB connection
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)

It might also be that the whole dataframe is simply too large to fit in memory; in that case you will have no option but to restrict the number of rows or columns you're selecting.
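
The built-in chunked loading mentioned in the update above is the chunksize argument of pd.read_sql. A minimal sketch, reusing the same table and connection placeholders as the code above:

import pandas as pd

# read_sql with chunksize returns an iterator of DataFrames
chunks = pd.read_sql("SELECT * FROM MyTable ORDER BY ID", cnxn, chunksize=10000)

# concatenate the pieces (or process them one at a time to keep memory flat)
full_df = pd.concat(chunks, ignore_index=True)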

Maximum size of pandas dataframe

I'm going to post this answer as it was discussed in the comments. I've seen this question come up numerous times without an accepted answer.

The MemoryError is intuitive: you are out of memory. But sometimes debugging this error is frustrating, because you seem to have enough memory and yet the error remains.

1) Check for code errors

This may be a "dumb step", but that's why it's first. Make sure there are no infinite loops or things that will knowingly take a long time (like using something in the os module that will search your entire computer and put the output in an Excel file).

2) Make your code more efficient

This goes along the same lines as Step 1. But if something simple is taking a long time, there's usually a module or a better way of doing it that is faster and more memory efficient. That's the beauty of Python and open-source languages!
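
As one concrete pandas example, a minimal sketch (the file, column names, and dtypes are placeholders): telling read_csv up front which columns and dtypes you need often shrinks the loaded DataFrame by a large factor.

import pandas as pd

# only load the columns you actually need, with compact dtypes
df = pd.read_csv(
    'large_file.csv',
    usecols=['id', 'price', 'category'],
    dtype={'id': 'int32', 'price': 'float32', 'category': 'category'},
)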

3) Check the total memory of the object

The first step is to check the memory footprint of an object. There are a ton of threads on Stack Overflow about this, so you can search for them.

To find the size of an object in bytes you can always use sys.getsizeof():

import sys
print(sys.getsizeof(OBJECT_NAME_HERE))

Now the error might happen before anything is created, but if you read the CSV in chunks you can see how much memory is being used per chunk.
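
A minimal sketch of that per-chunk check (file name and chunk size are placeholders). memory_usage(deep=True) also counts the bytes held by object/string columns, which sys.getsizeof on the DataFrame alone would miss:

import pandas as pd

for chunk in pd.read_csv('large_file.csv', chunksize=100000):
    # bytes actually held by this chunk, including string contents
    print(chunk.memory_usage(deep=True).sum())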

4) Check the memory while running

Sometimes you have enough memory, but the function you are running consumes a lot of memory at runtime. This causes usage to spike beyond the actual size of the finished object, causing the code/process to error. Checking memory in real time is lengthy, but it can be done. IPython is good for that (via the memory_profiler extension); check its documentation.

Use the code below to see the documentation right in a Jupyter notebook:

%mprun?
%memit?

Sample use:

%load_ext memory_profiler
def lol(x):
    return x
%memit lol(500)
# output: peak memory: 48.31 MiB, increment: 0.00 MiB

If you need help with magic functions, the IPython documentation on magics is a great place to start.

5) This one may belong first... but check for simple things like the bit version

As in your case, simply switching the bit version of Python you were running solved the issue.
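
A quick way to check which bit version of Python you're running (a minimal sketch using only the standard library):

import struct

# prints 32 on a 32-bit interpreter, 64 on a 64-bit one
print(struct.calcsize("P") * 8)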

Usually the above steps solve my issues.


