Reading random rows of a large csv file, python, pandas
I think this is faster than the other methods shown here and may be worth trying.
Say we have already chosen the rows to be skipped and stored them in a list, skipped. First, convert it to a boolean lookup table.
# Some preparation:
import numpy as np

skipped = np.asarray(skipped)
# MAX >= number of lines in the file (header included)
bool_skipped = np.zeros(shape=(MAX,), dtype=bool)
bool_skipped[skipped] = True
Main stuff:
import pandas as pd
from io import StringIO
# in Python 2 use
# from StringIO import StringIO

def load_with_buffer(filename, bool_skipped, **kwargs):
    s_buf = StringIO()
    with open(filename) as file:
        count = -1
        for line in file:
            count += 1
            if bool_skipped[count]:
                continue
            s_buf.write(line)
    s_buf.seek(0)
    df = pd.read_csv(s_buf, **kwargs)
    return df
I tested it as follows:
df = pd.DataFrame(np.random.rand(100000, 100))
df.to_csv('test.csv')
df1 = load_with_buffer('test.csv', bool_skipped, index_col=0)
with 90% of rows skipped. It performs comparably to
pd.read_csv('test.csv', skiprows=skipped, index_col=0)
and is about 3-4 times faster than using dask or reading in chunks.
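For reference, the skipped list assumed at the top can be built like this (a sketch matching the 90% figure above; the seed and variable n_rows are mine, not from the original):

```python
import numpy as np

n_rows = 100000                 # data rows written by to_csv above
MAX = n_rows + 1                # plus one header line
rng = np.random.default_rng(0)  # seeded only for reproducibility
# pick 90% of the data-row positions at random (line 0 is the header)
skipped = rng.choice(np.arange(1, MAX), size=int(0.9 * n_rows), replace=False)
bool_skipped = np.zeros(MAX, dtype=bool)
bool_skipped[skipped] = True
```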
Selecting random rows (of data) from dataframe / csv file in Python after designating start and end row number?
I think the following code works:
import random

a = random.sample(range(250000, 750000), 20000)
data = dataset.loc[a]
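If the file is too large to load up front, the same idea can be applied at read time with the skiprows argument of pd.read_csv. This is a sketch; sample_rows_csv, its parameters, and the 0-based data-row convention are mine, not from the answer above:

```python
import random

import pandas as pd

def sample_rows_csv(path, start, end, k, seed=None):
    """Read only k randomly chosen data rows whose 0-based positions
    lie in [start, end), without loading the whole file."""
    keep = set(random.Random(seed).sample(range(start, end), k))
    # skiprows with a callable: line 0 is the header, data rows start at 1
    return pd.read_csv(path, skiprows=lambda i: i != 0 and (i - 1) not in keep)
```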
Read random lines from huge CSV file
import random
filesize = 1500 #size of the really big file
offset = random.randrange(filesize)
f = open('really_big_file')
f.seek(offset) #go to random position
f.readline() # discard - bound to be partial line
random_line = f.readline() # bingo!
# extra to handle last/first line edge cases
if len(random_line) == 0:  # we have hit the end
    f.seek(0)
    random_line = f.readline()  # so we'll grab the first line instead
As @AndreBoos pointed out, this approach leads to biased selection. If you know the minimum and maximum line lengths, you can remove this bias by doing the following:
Let's assume (in this case) we have min=3 and max=15
1) Find the length (Lp) of the previous line.
If Lp = 3, the line is the one most biased against, so we should keep it 100% of the time.
If Lp = 15, the line is the one most biased towards; we should keep it only 20% of the time, since it is 5× as likely to be selected.
We accomplish this by randomly keeping the line X% of the time where:
X = min / Lp
If we don't keep the line, we do another random pick until our dice roll comes good. :-)
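One way to implement that rejection-sampling fix (a sketch; the function name and the min_len parameter are mine). A random byte offset lands inside a line with probability proportional to that line's length, so accepting the containing line with probability min/Lp makes the final pick uniform:

```python
import random

def random_line_unbiased(path, min_len, seed=None):
    """Return one line chosen uniformly at random. min_len must be a
    true lower bound on line length in bytes, newline included."""
    rng = random.Random(seed)
    with open(path, "rb") as f:
        f.seek(0, 2)                      # jump to the end of the file
        size = f.tell()
        while True:
            offset = rng.randrange(size)  # random byte in the file
            # scan back to the start of the line containing `offset`
            pos = offset
            while pos > 0:
                f.seek(pos - 1)
                if f.read(1) == b"\n":
                    break
                pos -= 1
            f.seek(pos)
            line = f.readline()
            # accept with probability min_len / Lp; otherwise roll again
            if rng.random() < min_len / len(line):
                return line.decode()
```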
Reading chunks of large csv file with shuffled rows for classification with ML
You could solve the label-order issue by randomly shuffling the .csv on disk with a utility such as https://github.com/alexandres/terashuf (availability depends on your OS).
EDIT
A solution using only pandas and the standard library can be implemented with the skiprows argument.
import math
import random

import pandas as pd

def read_shuffled_chunks(filepath: str, chunk_size: int,
                         file_length: int, has_header=True):
    header = 0 if has_header else None
    first_data_idx = 1 if has_header else 0
    # create the list of data-row indices
    index_list = list(range(first_data_idx, file_length))
    # shuffle the list in place
    random.shuffle(index_list)
    # iterate through the chunks and read them
    n_chunks = math.ceil(file_length / chunk_size)
    for i in range(n_chunks):
        rows_to_keep = index_list[i * chunk_size:(i + 1) * chunk_size]
        if has_header:
            rows_to_keep += [0]  # keep the header row
        # skiprows gets the inverse selection
        rows_to_skip = list(set(index_list) - set(rows_to_keep))
        yield pd.read_csv(filepath, skiprows=rows_to_skip, header=header)
Please note that, although the rows included in each chunk are randomly sampled from the csv, pandas reads them in their original file order. If you are training your model with batches from each chunk, you may want to shuffle each chunk's DataFrame as well to avoid running into the same issue.
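Each yielded chunk can be shuffled in memory before batching; a minimal sketch using DataFrame.sample (the helper name is mine):

```python
import pandas as pd

def shuffle_chunk(chunk: pd.DataFrame, seed=None) -> pd.DataFrame:
    """sample(frac=1) returns every row exactly once, in random order."""
    return chunk.sample(frac=1, random_state=seed).reset_index(drop=True)
```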
How to read only a slice of data stored in a big csv file in python
You can read the CSV file chunk by chunk and keep only the rows you want:
iter_csv = pd.read_csv('sample.csv', iterator=True, chunksize=10000, error_bad_lines=False)
data = pd.concat([chunk.loc[chunk['Column_name'] == 1] for chunk in iter_csv])
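If the slice you need is a contiguous row range, combining skiprows with nrows avoids iterating over the whole file; read_slice and its half-open [start, stop) convention are my own sketch:

```python
import pandas as pd

def read_slice(path, start, stop, **kwargs):
    """Read only data rows [start, stop) of a CSV, keeping the header:
    skip the data rows before `start`, then cap the read with nrows."""
    return pd.read_csv(path, skiprows=range(1, start + 1),
                       nrows=stop - start, **kwargs)
```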