Read a Small Random Sample from a Big CSV File into a Python Data Frame

Reading random rows of a large csv file, python, pandas

I think this is faster than the other methods shown here and may be worth trying.

Say we have already chosen the rows to be skipped and stored them in a list skipped. First, convert it to a boolean lookup table.

# Some preparation:
import numpy as np

skipped = np.asarray(skipped)
# MAX >= number of rows in the file
bool_skipped = np.zeros(shape=(MAX,), dtype=bool)
bool_skipped[skipped] = True

Main stuff:

from io import StringIO
# in Python 2 use
# from StringIO import StringIO
import pandas as pd

def load_with_buffer(filename, bool_skipped, **kwargs):
    s_buf = StringIO()
    with open(filename) as file:
        count = -1
        for line in file:
            count += 1
            if bool_skipped[count]:
                continue
            s_buf.write(line)
    s_buf.seek(0)
    df = pd.read_csv(s_buf, **kwargs)
    return df

I tested it as follows:

df = pd.DataFrame(np.random.rand(100000, 100))
df.to_csv('test.csv')

df1 = load_with_buffer('test.csv', bool_skipped, index_col=0)

with 90% of rows skipped. It performs comparably to

pd.read_csv('test.csv', skiprows=skipped, index_col=0)

and is about 3-4 times faster than using dask or reading in chunks.
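For reference, here is one way the skipped list and lookup table for that test could be built; the seed, the MAX value, and the header handling are my assumptions, not part of the original answer:

import numpy as np

# Illustrative setup for the test above: 100000 data rows plus one header row.
MAX = 100001                              # >= number of lines in test.csv
rng = np.random.default_rng(0)

# choose 90% of the data rows (lines 1..MAX-1) to skip; line 0 is the header
skipped = rng.choice(np.arange(1, MAX), size=int(0.9 * (MAX - 1)), replace=False)

bool_skipped = np.zeros(shape=(MAX,), dtype=bool)
bool_skipped[skipped] = True

df1 = load_with_buffer('test.csv', bool_skipped, index_col=0)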

Selecting random rows (of data) from dataframe / csv file in Python after designating start and end row number?

I think the following code works:

import random

# sample 20000 distinct row labels between the designated start and end rows
a = random.sample(range(250000, 750000), 20000)
data = dataset.loc[a]
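If the data is still in the csv file rather than already loaded, the same idea can be expressed with skiprows; a minimal sketch, assuming a hypothetical 'dataset.csv' with a header row (note that file row numbers are offset by one from the DataFrame index):

import random
import pandas as pd

start, end, n = 250000, 750000, 20000

# sample 20000 file row numbers from the designated range and skip the rest
keep = set(random.sample(range(start, end), n))
data = pd.read_csv(
    'dataset.csv',
    skiprows=lambda i: i != 0 and i not in keep,  # i == 0 keeps the header
)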

Read random lines from huge CSV file

import random

filesize = 1500                 # size of the really big file
offset = random.randrange(filesize)

f = open('really_big_file')
f.seek(offset)                  # go to a random position
f.readline()                    # discard - bound to be a partial line
random_line = f.readline()      # bingo!

# extra to handle last/first line edge cases
if len(random_line) == 0:       # we have hit the end
    f.seek(0)
    random_line = f.readline()  # so we'll grab the first line instead

As @AndreBoos pointed out, this approach leads to biased selection: lines that follow long lines are picked more often. If you know the min and max line lengths, you can remove this bias by doing the following:

Let's assume (in this case) we have min=3 and max=15

1) Find the length (Lp) of the previous line.

If Lp = 3, the line is the most biased against, so we should take it 100% of the time.
If Lp = 15, the line is the most biased towards, so we should take it only 20% of the time, since it is 5× more likely to be selected.

We accomplish this by randomly keeping the line with probability X, where:

X = min / Lp

If we don't keep the line, we do another random pick until our dice roll comes good. :-)
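A minimal sketch of that rejection loop, under the assumptions above (the function name random_line_unbiased and its arguments are illustrative, not from the original answer):

import os
import random

def random_line_unbiased(path, min_line_len):
    """Pick a line uniformly at random via rejection sampling (a sketch).

    A random seek lands inside some line of length Lp, and the line that
    follows it becomes the candidate.  Long lines are landed in more often,
    so the candidate is kept only with probability min_line_len / Lp.
    """
    filesize = os.path.getsize(path)
    with open(path, 'rb') as f:
        while True:
            offset = random.randrange(filesize)
            # Scan backwards to find where the landed-in line starts,
            # so that we know its full length Lp.
            start = offset
            while start > 0:
                f.seek(start - 1)
                if f.read(1) == b'\n':
                    break
                start -= 1
            f.seek(offset)
            rest = f.readline()                # discard - bound to be a partial line
            lp = (offset - start) + len(rest)  # full length of the landed-in line
            candidate = f.readline()
            if not candidate:                  # hit the end: wrap to the first line
                f.seek(0)
                candidate = f.readline()
            if random.random() < min_line_len / lp:
                return candidate.decode()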

Reading chunks of large csv file with shuffled rows for classification with ML

You could solve the label-order issue by randomly shuffling the .csv on disk with a utility such as https://github.com/alexandres/terashuf, depending on your OS.

EDIT

A solution using only pandas and standard libraries can be implemented using the skiprows argument.

import pandas as pd
import random, math

def read_shuffled_chunks(filepath: str, chunk_size: int,
                         file_length: int, has_header=True):

    header = 0 if has_header else None
    first_data_idx = 1 if has_header else 0
    # create the list of data row indices
    index_list = list(range(first_data_idx, file_length))

    # shuffle the list in place
    random.shuffle(index_list)

    # iterate through the chunks and read them
    n_chunks = math.ceil(len(index_list) / chunk_size)
    for i in range(n_chunks):

        rows_to_keep = index_list[(i * chunk_size):((i + 1) * chunk_size)]
        if has_header:
            rows_to_keep += [0]  # always keep the header row
        # get the inverse selection
        rows_to_skip = list(set(index_list) - set(rows_to_keep))
        yield pd.read_csv(filepath, skiprows=rows_to_skip, header=header)

Please note that, although the rows included in each chunk are randomly sampled from the csv, pandas reads them in their original file order. If you are training your model with batches from each data chunk, you may want to shuffle each chunk's DataFrame as well to avoid running into the same issue.
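For example, a minimal sketch (the file name and sizes are placeholders):

for chunk in read_shuffled_chunks('big_file.csv', chunk_size=10000, file_length=1_000_001):
    # shuffle the rows within the chunk as well before batching
    chunk = chunk.sample(frac=1).reset_index(drop=True)
    # ... train on this chunk ...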

How to read only a slice of data stored in a big csv file in python

You can read the CSV file chunk by chunk and retain only the rows you want:

iter_csv = pd.read_csv('sample.csv', iterator=True, chunksize=10000, error_bad_lines=False)
data = pd.concat([chunk.loc[chunk['Column_name'] == 1] for chunk in iter_csv])
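If you are on a newer pandas (1.3+), error_bad_lines has been replaced by on_bad_lines; a sketch of the equivalent call:

import pandas as pd

iter_csv = pd.read_csv('sample.csv', iterator=True, chunksize=10000, on_bad_lines='skip')
data = pd.concat([chunk.loc[chunk['Column_name'] == 1] for chunk in iter_csv])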

