python - Using pandas structures with large csv(iterate and chunksize)
Solution if you need to create one big DataFrame
If you need to process all the data at once (which is possible, but not recommended), use concat to combine all chunks into df, because the output of
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
isn't a DataFrame, but a pandas.io.parsers.TextFileReader (source).
tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(tp)
# <pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(tp, ignore_index=True)
I think it is necessary to pass the ignore_index parameter to concat, to avoid duplicate indexes.
EDIT:
But if you want to work with large data, e.g. for aggregations, it is much better to use dask, because it provides advanced parallelism.
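Even without dask, an aggregation can often be computed chunk-by-chunk in plain pandas by combining per-chunk partial results. A minimal sketch (the helper name chunked_sum_by_key and the column names are illustrative, not from the original answer):

```python
import pandas as pd

def chunked_sum_by_key(path, key_col, value_col, chunksize=1000):
    """Sum value_col per key_col without loading the whole CSV.

    Each chunk is grouped independently; the partial sums are then
    combined, so peak memory stays proportional to chunksize.
    """
    partials = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        partials.append(chunk.groupby(key_col)[value_col].sum())
    # Combine the per-chunk partial sums into one final Series.
    return pd.concat(partials).groupby(level=0).sum()
```

The same split/combine pattern works for count, min, max, and (with a little bookkeeping) mean; it is essentially what dask does for you automatically.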
Using pandas to efficiently read in a large CSV file without crashing
You should consider using the chunksize parameter in read_csv when reading in your dataframe, because it returns a TextFileReader object that you can then pass to pd.concat to concatenate your chunks.
chunksize = 100000
tfr = pd.read_csv('./movielens/ratings.csv', chunksize=chunksize, iterator=True)
df = pd.concat(tfr, ignore_index=True)
If you just want to process each chunk individually, use:
chunksize = 20000
for chunk in pd.read_csv('./movielens/ratings.csv',
                         chunksize=chunksize,
                         iterator=True):
    do_something_with_chunk(chunk)
I have a large CSV (53 GB) and I need to process it in chunks in parallel
I would read the CSV file line by line and feed the lines into a queue from which the worker processes pick their tasks. This way, you don't have to split the file first.
See this example here: https://stackoverflow.com/a/53847284/4141279
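The queue-based approach described above can be sketched with the standard multiprocessing module. Everything here is illustrative: the worker function, process_csv_parallel, and the per-row work (summing the numeric fields) are assumptions, not the linked answer's exact code:

```python
import csv
import multiprocessing as mp

def worker(task_q, result_q):
    # Pull rows off the queue until the sentinel (None) arrives.
    while True:
        row = task_q.get()
        if row is None:
            break
        # Hypothetical per-row work: sum the numeric fields.
        result_q.put(sum(float(x) for x in row))

def process_csv_parallel(path, n_workers=2):
    """Stream a CSV into a queue; workers consume rows in parallel."""
    task_q = mp.Queue()
    result_q = mp.Queue()
    workers = [mp.Process(target=worker, args=(task_q, result_q))
               for _ in range(n_workers)]
    for p in workers:
        p.start()
    n_rows = 0
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            task_q.put(row)
            n_rows += 1
    for _ in workers:
        task_q.put(None)  # one sentinel per worker
    # Drain results before joining, so the queues never fill up.
    results = [result_q.get() for _ in range(n_rows)]
    for p in workers:
        p.join()
    return results
```

Because the file is read line by line and rows are handed out on demand, no process ever holds more than its current row in memory, which is the point for a 53 GB input.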
open selected rows with pandas using chunksize and/or iterator
How can I read only rows from 512*n to 512*(n+1)?
df = pd.read_csv(fn, header=None, skiprows=512*n, nrows=512)
You can do it this way (and it's pretty useful):
for chunk in pd.read_csv(f, sep=' ', header=None, chunksize=512):
    # process your chunk here
Demo:
In [61]: fn = 'd:/temp/a.csv'
In [62]: pd.DataFrame(np.random.randn(30, 3), columns=list('abc')).to_csv(fn, index=False)
In [63]: for chunk in pd.read_csv(fn, chunksize=10):
....: print(chunk)
....:
a b c
0 2.229657 -1.040086 1.295774
1 0.358098 -1.080557 -0.396338
2 0.731741 -0.690453 0.126648
3 -0.009388 -1.549381 0.913128
4 -0.256654 -0.073549 -0.171606
5 0.849934 0.305337 2.360101
6 -1.472184 0.641512 -1.301492
7 -2.302152 0.417787 0.485958
8 0.492314 0.603309 0.890524
9 -0.730400 0.835873 1.313114
a b c
0 1.393865 -1.115267 1.194747
1 3.038719 -0.343875 -1.410834
2 -1.510598 0.664154 -0.996762
3 -0.528211 1.269363 0.506728
4 0.043785 -0.786499 -1.073502
5 1.096647 -1.127002 0.918172
6 -0.792251 -0.652996 -1.000921
7 1.582166 -0.819374 0.247077
8 -1.022418 -0.577469 0.097406
9 -0.274233 -0.244890 -0.352108
a b c
0 -0.317418 0.774854 -0.203939
1 0.205443 0.820302 -2.637387
2 0.332696 -0.655431 -0.089120
3 -0.884916 0.274854 1.074991
4 0.412295 -1.561943 -0.850376
5 -1.933529 -1.346236 -1.789500
6 1.652446 -0.800644 -0.126594
7 0.520916 -0.825257 -0.475727
8 -2.261692 2.827894 -0.439698
9 -0.424714 1.862145 1.103926
In which case can "iterator" be useful?
When using chunksize, all chunks will have the same length. With the iterator parameter you can instead define how much data you want to read in each iteration, using get_chunk(nrows):
In [66]: reader = pd.read_csv(fn, iterator=True)
Let's read the first 3 rows:
In [67]: reader.get_chunk(3)
Out[67]:
a b c
0 2.229657 -1.040086 1.295774
1 0.358098 -1.080557 -0.396338
2 0.731741 -0.690453 0.126648
Now we'll read the next 5 rows:
In [68]: reader.get_chunk(5)
Out[68]:
a b c
0 -0.009388 -1.549381 0.913128
1 -0.256654 -0.073549 -0.171606
2 0.849934 0.305337 2.360101
3 -1.472184 0.641512 -1.301492
4 -2.302152 0.417787 0.485958
And the next 7 rows:
In [69]: reader.get_chunk(7)
Out[69]:
a b c
0 0.492314 0.603309 0.890524
1 -0.730400 0.835873 1.313114
2 1.393865 -1.115267 1.194747
3 3.038719 -0.343875 -1.410834
4 -1.510598 0.664154 -0.996762
5 -0.528211 1.269363 0.506728
6 0.043785 -0.786499 -1.073502
How can I partially read a huge CSV file?
Use chunksize:
for df in pd.read_csv('matrix.txt', sep=',', header=None, chunksize=1):
    # do something
To answer your second part do this:
df = pd.read_csv('matrix.txt', sep=',', header=None, skiprows=1000, nrows=1000)
This skips the first 1000 rows and then reads only the next 1000, giving you rows 1000-1999 as a DataFrame. It's unclear whether you need the end points included, but you can fiddle the numbers to get what you want.
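The skiprows/nrows combination can be wrapped in a small helper for reading an arbitrary row slice of a headerless CSV (the function name read_row_slice is illustrative):

```python
import pandas as pd

def read_row_slice(path, start, stop, **kwargs):
    """Read only rows [start, stop) of a headerless CSV.

    skiprows skips the first `start` lines and nrows limits how many
    rows are parsed, so the rest of the file is never loaded.
    """
    return pd.read_csv(path, header=None, skiprows=start,
                       nrows=stop - start, **kwargs)
```

For the earlier question, rows 512*n to 512*(n+1) would be read_row_slice(fn, 512*n, 512*(n+1)).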
Pandas Processing Large CSV Data
This is probably easier without pandas:
import csv

with open(input_csv_file, newline='') as fin:
    with open(output_csv_file, 'w', newline='') as fout:
        writer = csv.writer(fout)
        seen_keys = set()
        header = True
        for row in csv.reader(fin):
            if header:
                writer.writerow(row)
                header = False
                continue
            # key_indices: the column positions that form the dedup key
            key = tuple(row[i] for i in key_indices)
            if not all(key):  # skip if key is empty
                continue
            if key not in seen_keys:
                writer.writerow(row)
                seen_keys.add(key)