Python - Using Pandas Structures with Large CSV (Iterate and Chunksize)

Solution, if you need to create one big DataFrame and process all the data at once (which is possible, but not recommended):

Then use concat to combine all the chunks into df, because the output of:

df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)

isn't a DataFrame, but a pandas.io.parsers.TextFileReader - source.

tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(tp)
# <pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(tp, ignore_index=True)

The ignore_index=True parameter is needed in concat, because it avoids duplicated index values coming from the individual chunks.
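
If the goal is to keep memory down rather than load everything, a common variation is to filter or reduce each chunk before concatenating. A minimal sketch, where the column name 'value' and the threshold are only illustrative assumptions:

import pandas as pd

# Collect only the rows of interest from each chunk; the column name
# 'value' and the condition are illustrative, not from the original question.
chunks = []
for chunk in pd.read_csv('Check1_900.csv', sep='\t', chunksize=1000):
    chunks.append(chunk[chunk['value'] > 0])

df = pd.concat(chunks, ignore_index=True)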

EDIT:

But if you want to work with large data and do things like aggregating, it is much better to use dask, because it provides advanced parallelism.
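
A minimal sketch of the dask approach, assuming dask is installed; the column name 'rating' is only an illustrative assumption:

import dask.dataframe as dd

# dask reads the CSV lazily in partitions and parallelizes the aggregation.
ddf = dd.read_csv('Check1_900.csv', sep='\t')
mean_rating = ddf['rating'].mean().compute()  # .compute() triggers the actual work
print(mean_rating)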

Using pandas to efficiently read in a large CSV file without crashing

You should consider using the chunksize parameter in read_csv when reading in your file, because it returns a TextFileReader object that you can then pass to pd.concat to concatenate your chunks.

chunksize = 100000
tfr = pd.read_csv('./movielens/ratings.csv', chunksize=chunksize, iterator=True)
df = pd.concat(tfr, ignore_index=True)

If you just want to process each chunk individually, use:

chunksize = 20000
for chunk in pd.read_csv('./movielens/ratings.csv',
                         chunksize=chunksize,
                         iterator=True):
    do_something_with_chunk(chunk)
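
For example, do_something_with_chunk could maintain a running aggregate, so the full file never has to fit in memory. A sketch assuming ratings.csv has a 'rating' column (as the MovieLens file does):

import pandas as pd

# Compute the overall mean rating chunk by chunk.
total, count = 0.0, 0
for chunk in pd.read_csv('./movielens/ratings.csv', chunksize=20000):
    total += chunk['rating'].sum()
    count += len(chunk)

print(total / count)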

I have a large CSV (53 GB) and I need to process it in chunks in parallel

I would read the csv file line by line and feed the lines into a queue from which the processes pick their tasks. This way, you don't have to split the file first.

See this example here: https://stackoverflow.com/a/53847284/4141279
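
A minimal sketch of that pattern with the standard multiprocessing module; the file name 'big.csv' and the process_line stub are placeholders for your own data and per-row work:

import multiprocessing as mp

def process_line(line):
    pass  # placeholder for the real per-row work

def worker(queue):
    # Each worker pulls lines off the queue until it sees the None sentinel.
    while True:
        line = queue.get()
        if line is None:
            break
        process_line(line)

if __name__ == '__main__':
    n_workers = 4
    queue = mp.Queue(maxsize=10000)  # bounded, so the reader can't outrun the workers
    workers = [mp.Process(target=worker, args=(queue,)) for _ in range(n_workers)]
    for w in workers:
        w.start()

    with open('big.csv') as f:
        next(f)              # skip the header line
        for line in f:
            queue.put(line)

    for _ in workers:
        queue.put(None)      # one sentinel per worker
    for w in workers:
        w.join()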

open selected rows with pandas using chunksize and/or iterator

How can I read only rows from 512*n to 512*(n+1)?

df = pd.read_csv(fn, header=None, skiprows=512*n, nrows=512)
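
Wrapped in a small helper for readability (a sketch; header=None matches the headerless file in the question):

def read_block(fn, n, block_size=512):
    # Rows block_size*n .. block_size*(n+1) - 1 of a headerless CSV.
    return pd.read_csv(fn, header=None,
                       skiprows=block_size * n, nrows=block_size)

block = read_block(fn, n=3)  # e.g. rows 1536-2047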

You can do it this way (and it's pretty useful):

for chunk in pd.read_csv(f, sep=' ', header=None, chunksize=512):
    # process your chunk here

Demo:

In [61]: fn = 'd:/temp/a.csv'

In [62]: pd.DataFrame(np.random.randn(30, 3), columns=list('abc')).to_csv(fn, index=False)

In [63]: for chunk in pd.read_csv(fn, chunksize=10):
....: print(chunk)
....:
a b c
0 2.229657 -1.040086 1.295774
1 0.358098 -1.080557 -0.396338
2 0.731741 -0.690453 0.126648
3 -0.009388 -1.549381 0.913128
4 -0.256654 -0.073549 -0.171606
5 0.849934 0.305337 2.360101
6 -1.472184 0.641512 -1.301492
7 -2.302152 0.417787 0.485958
8 0.492314 0.603309 0.890524
9 -0.730400 0.835873 1.313114
a b c
0 1.393865 -1.115267 1.194747
1 3.038719 -0.343875 -1.410834
2 -1.510598 0.664154 -0.996762
3 -0.528211 1.269363 0.506728
4 0.043785 -0.786499 -1.073502
5 1.096647 -1.127002 0.918172
6 -0.792251 -0.652996 -1.000921
7 1.582166 -0.819374 0.247077
8 -1.022418 -0.577469 0.097406
9 -0.274233 -0.244890 -0.352108
a b c
0 -0.317418 0.774854 -0.203939
1 0.205443 0.820302 -2.637387
2 0.332696 -0.655431 -0.089120
3 -0.884916 0.274854 1.074991
4 0.412295 -1.561943 -0.850376
5 -1.933529 -1.346236 -1.789500
6 1.652446 -0.800644 -0.126594
7 0.520916 -0.825257 -0.475727
8 -2.261692 2.827894 -0.439698
9 -0.424714 1.862145 1.103926
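
If you only need the n-th 512-row block from such a chunked reader, one way (a sketch) is to skip ahead with itertools.islice; note that the skipped chunks are still parsed, just not kept:

from itertools import islice

n = 3  # which 512-row block to extract (illustrative)
reader = pd.read_csv(f, sep=' ', header=None, chunksize=512)
block = next(islice(reader, n, n + 1))  # rows 512*n .. 512*(n+1) - 1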

In which cases can "iterator" be useful?

When using chunksize, all chunks have the same length. With the iterator parameter you decide how many rows to read in each iteration via get_chunk(nrows):

In [66]: reader = pd.read_csv(fn, iterator=True)

let's read first 3 rows

In [67]: reader.get_chunk(3)
Out[67]:
a b c
0 2.229657 -1.040086 1.295774
1 0.358098 -1.080557 -0.396338
2 0.731741 -0.690453 0.126648

now we'll read next 5 rows:

In [68]: reader.get_chunk(5)
Out[68]:
a b c
0 -0.009388 -1.549381 0.913128
1 -0.256654 -0.073549 -0.171606
2 0.849934 0.305337 2.360101
3 -1.472184 0.641512 -1.301492
4 -2.302152 0.417787 0.485958

next 7 rows:

In [69]: reader.get_chunk(7)
Out[69]:
a b c
0 0.492314 0.603309 0.890524
1 -0.730400 0.835873 1.313114
2 1.393865 -1.115267 1.194747
3 3.038719 -0.343875 -1.410834
4 -1.510598 0.664154 -0.996762
5 -0.528211 1.269363 0.506728
6 0.043785 -0.786499 -1.073502
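
In recent pandas versions (1.2 and later, if I remember correctly) the reader can also be used as a context manager, so the underlying file handle is released when you are done:

with pd.read_csv(fn, iterator=True) as reader:
    first = reader.get_chunk(3)
    rest = reader.get_chunk(5)
# the file is closed automatically here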

How can I partially read a huge CSV file?

Use chunksize:

for df in pd.read_csv('matrix.txt', sep=',', header=None, chunksize=1):
    # do something with df

To answer your second part do this:

df = pd.read_csv('matrix.txt', sep=',', header=None, skiprows=1000, nrows=1000)

This skips the first 1000 rows and then reads only the next 1000, giving you rows 1000-2000; it's unclear whether you need the end points included or not, but you can adjust the numbers to get exactly what you want.
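
If you only need the beginning of a huge file, you can also stop a chunked loop early; a sketch, where the number of rows wanted is illustrative:

n_rows_wanted = 5000  # illustrative
pieces, n_read = [], 0
for chunk in pd.read_csv('matrix.txt', sep=',', header=None, chunksize=1000):
    pieces.append(chunk)
    n_read += len(chunk)
    if n_read >= n_rows_wanted:
        break

df = pd.concat(pieces, ignore_index=True).head(n_rows_wanted)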

Pandas Processing Large CSV Data

This is probably easier to do without pandas:

import csv

with open(input_csv_file, newline='') as fin:
    with open(output_csv_file, 'w', newline='') as fout:
        writer = csv.writer(fout)
        seen_keys = set()
        header = True
        for row in csv.reader(fin):
            if header:
                writer.writerow(row)
                header = False
                continue

            # key_indices: the column positions that form the de-duplication key
            key = tuple(row[i] for i in key_indices)
            if not all(key):  # skip if any part of the key is empty
                continue

            if key not in seen_keys:  # write only the first occurrence of each key
                writer.writerow(row)
                seen_keys.add(key)
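
If you would rather stay in pandas, a chunked equivalent might look like this sketch; the key column names are illustrative assumptions, not from the original question:

import pandas as pd

key_cols = ['id', 'date']          # illustrative key columns
seen = set()
first = True
for chunk in pd.read_csv(input_csv_file, chunksize=100_000):
    # Drop rows with empty keys and duplicates within the chunk itself.
    chunk = chunk.dropna(subset=key_cols).drop_duplicates(subset=key_cols)
    # Keep only keys not seen in earlier chunks.
    keys = chunk[key_cols].apply(tuple, axis=1)
    keep = ~keys.isin(seen)
    seen.update(keys[keep])
    chunk[keep].to_csv(output_csv_file, mode='w' if first else 'a',
                       header=first, index=False)
    first = False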

