Large, persistent DataFrame in pandas
In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague, but it's been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row-by-row into a pre-allocated NumPy array or memory-mapped file, np.memmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000), then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
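A minimal sketch of the chunked approach described above; the small in-memory buffer stands in for a large CSV file on disk and is purely illustrative:

```python
import io
import pandas as pd

# An in-memory CSV stands in for the large file on disk (an assumption
# for this sketch; replace the buffer with your real file path).
csv_data = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# Read the file in pieces rather than in one big slurp,
# then concatenate the pieces into a single DataFrame.
chunks = pd.read_csv(io.StringIO(csv_data), chunksize=3)
df = pd.concat(chunks, ignore_index=True)

print(len(df))  # 10 rows, assembled from chunks of at most 3
```

With chunksize set, read_csv returns an iterator of DataFrames, so only one piece is ever held in memory before the final concat.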
Loading big CSV file with pandas
I suggest that you install the 64-bit version of WinPython. Then you should be able to load a 250 MB file without problems.
How to reversibly store and load a Pandas dataframe to/from disk
The easiest way is to pickle it using to_pickle
:
df.to_pickle(file_name) # where to save it, usually as a .pkl
Then you can load it back using:
df = pd.read_pickle(file_name)
Note: before 0.11.1, save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).
Another popular choice is to use HDF5 (pytables) which offers very fast access times for large datasets:
import pandas as pd
store = pd.HDFStore('store.h5')
store['df'] = df # save it
df = store['df'] # load it
More advanced strategies are discussed in the cookbook.
Since 0.13 there's also msgpack, which may be better for interoperability, as a faster alternative to JSON, or if you have python object/text-heavy data (see this question).
How to create a large pandas dataframe from an sql query without running out of memory?
Update: Make sure to check out the answer below, as Pandas now has built-in support for chunked loading.
You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:
import pandas as pd
import pandas.io.sql as psql

chunk_size = 10000
offset = 0
dfs = []
while True:
    # ORDER BY must come before LIMIT/OFFSET in standard SQL
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(psql.read_frame(sql, cnxn))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)
It might also be possible that the whole dataframe is simply too large to fit in memory, in that case you will have no other option than to restrict the number of rows or columns you're selecting.
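As the update above notes, pandas now supports chunked loading natively via the chunksize argument of read_sql. A minimal sketch using a throwaway in-memory SQLite table (the table name and connection here are illustrative stand-ins for your real database):

```python
import sqlite3
import pandas as pd

# A throwaway in-memory table stands in for MyTable (names are illustrative).
cnxn = sqlite3.connect(":memory:")
cnxn.execute("CREATE TABLE MyTable (ID INTEGER, value INTEGER)")
cnxn.executemany(
    "INSERT INTO MyTable (ID, value) VALUES (?, ?)",
    [(i, i * 10) for i in range(25)],
)

# With chunksize set, read_sql returns an iterator of DataFrames,
# so only one chunk is held in memory at a time while streaming.
chunks = pd.read_sql("SELECT * FROM MyTable ORDER BY ID", cnxn, chunksize=10)
full_df = pd.concat(chunks, ignore_index=True)

print(len(full_df))  # 25 rows, streamed in chunks of up to 10
```

This avoids the manual LIMIT/OFFSET loop entirely and lets the driver handle cursoring.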
Maximum size of pandas dataframe
I'm going to post this answer as it was discussed in comments; I've seen this question come up numerous times without an accepted answer.
The MemoryError is intuitive - out of memory. But sometimes the solution or the debugging of this error is frustrating, as you have enough memory but the error remains.
1) Check for code errors
This may be a "dumb step" but that's why it's first. Make sure there are no infinite loops or things that will knowingly take a long time (like using something in the os module that will search your entire computer and put the output in an Excel file).
2) Make your code more efficient
This goes along the lines of Step 1. But if something simple is taking a long time, there's usually a module or a better way of doing it that is faster and more memory efficient. That's the beauty of Python and/or open source languages!
3) Check The Total Memory of the object
The first step is to check the memory of an object. There are a ton of threads on Stack Overflow about this, so you can search them. Popular answers are here and here.
To find the size of an object in bytes you can always use sys.getsizeof():
import sys
print(sys.getsizeof(OBJECT_NAME_HERE))
Now the error might happen before anything is created, but if you read the csv in chunks you can see how much memory is being used per chunk.
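A sketch of watching per-chunk memory while reading a CSV in chunks; the in-memory buffer is an illustrative stand-in for a real file, and note that for DataFrames, memory_usage(deep=True) is generally more accurate than sys.getsizeof:

```python
import io
import pandas as pd

# An in-memory buffer stands in for a large CSV file (illustrative only).
csv_data = "x,y\n" + "\n".join(f"{i},{i}" for i in range(100))

chunk_bytes = []
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=25):
    # deep=True also counts the memory behind object (string) columns.
    chunk_bytes.append(chunk.memory_usage(deep=True).sum())

print(len(chunk_bytes))  # one measurement per chunk: 4
```

If one chunk's footprint is already close to your available memory, you know the full file cannot be loaded at once.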
4) Check the memory while running
Sometimes you have enough memory but the function you are running consumes a lot of memory at runtime. This causes memory to spike beyond the actual size of the finished object, causing the code/process to error. Checking memory in real time is lengthy, but can be done. IPython is good for that; check their documentation.
Use the code below to see the documentation straight in a Jupyter Notebook:
%mprun?
%memit?
Sample use:
%load_ext memory_profiler

def lol(x):
    return x

%memit lol(500)
# output --- peak memory: 48.31 MiB, increment: 0.00 MiB
If you need help with magic functions, this is a great post.
5) This one may be first.... but Check for simple things like bit version
As in your case, simply switching the version of Python you were running solved the issue.
Usually the above steps solve my issues.