How to Split a Huge CSV File Based on Content of First Column

How to split a huge csv file based on content of first column?

If the file is already sorted by group_id, you can do something like:

import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("foo.csv")),
                         lambda row: row[0]):
    with open("%s.txt" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
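If fields can themselves contain commas, the plain ",".join() above will not re-quote them; as a minor variant, a csv.writer takes care of the quoting (this replaces the inner loop of the snippet above):

with open("%s.txt" % key, "w", newline="") as output:
    csv.writer(output).writerows(rows)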

Splitting large csv file by column first letter according to alphabetic order linux

It's (perhaps surprisingly) straightforward:

awk -F, '{print > (substr($2,1,1) ".csv")}' large.csv
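Note that this keys on the first letter of the second field ($2); if the grouping column is the first field, use substr($1,1,1) instead.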

split large csv text file based on column value

C++ is fine if you know it best. Why would you try to load the entire file into memory anyway?

Since the output depends on the column being read, you can easily keep a buffer of output files and push each record into the appropriate file as you process it, cleaning up as you go to keep the memory footprint relatively small.

I do this (albeit in Java) when I need to take massive extracts from a database. The records are pushed into a buffered file stream and anything in memory is flushed out, so the program's footprint never grows beyond what it starts out at.

Fly by the seat of my pants pseudo-code:

  1. Create a list to hold your output file buffers
  2. Open stream on file and begin reading in the contents one line at a time
  3. Have we already opened a file stream for this record's content type?

    • Yes -
      • Get the stored file stream
      • store the record into that file
      • flush the stream
    • No -
      • create a stream and save it to our list of streams
      • store the record on the stream
      • flush the stream
  4. Rinse and repeat...

Continue this processing until we reach the end of the file.

Since we never hold more than handles to the streams, and we flush as soon as we write, nothing stays resident in memory except the current record from the input file, so the footprint stays manageable.
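A minimal Python sketch of this approach, assuming the grouping key is the first comma-separated field and is safe to use in a file name (large.csv is a placeholder):

# Keep one open handle per key and flush after every write,
# so only the current input line is ever held in memory.
handles = {}
try:
    with open("large.csv") as infile:
        for line in infile:
            key = line.split(",", 1)[0]
            if key not in handles:                  # no stream for this key yet
                handles[key] = open("%s.csv" % key, "w")
            handles[key].write(line)                # store the record
            handles[key].flush()                    # flush the stream
finally:
    for handle in handles.values():
        handle.close()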

Split a large csv file based on date in First column Python 3.4.3

Assuming your files should keep the pattern 2015.01.01, cleaning the key should work:

key = key.split()[0].replace('-', '.')

Full code:

import csv
from itertools import groupby

def shorten_key(key):
    return key.split()[0].replace('-', '.')

for key, rows in groupby(csv.reader(open("asdf2.txt", "r", encoding='utf-16')),
                         lambda row: shorten_key(row[0])):
    with open("%s.txt" % shorten_key(key), "a") as output:
        for row in rows:
            output.write(",".join(row) + "\n")

A quick test:

keys = ['2015-03-01 00:00:02.000',  '2015.01.01']

for key in keys:
    print(key.split()[0].replace('-', '.'))

Output:

2015.03.01
2015.01.01

split a csv file by content of first column without creating a copy?

I'm not really sure what you're asking, but if your question is: "Can I take a huge file on disk and split it 'in-place' so I get many smaller files without actually having to write those smaller files to disk?", then the answer is no.

You will need to iterate through the first file and write the "segments" back to disk as new files, regardless of whether you use awk, Python or a text editor for this. You do not need to make a copy of the first file beforehand, though.

How to split a huge .CSV file into n smaller files when an index in a particular column changes?

Something like this might be sufficient to split the csv file into smaller files, each one holding the rows for one value of the first column:

awk -F, '{ print >> ($1".part.csv") }' file.csv

Breakdown

# awk iterates over each line in the specified input file
awk -F, # tell awk to split the line into columns on ","
'{ print # print the whole line
>> # append to the output file
($1".part.csv") }' # output file name is the first column followed by ".part.csv"
file.csv # input file
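One caveat worth mentioning: awk keeps every file it has printed to open, so with a very large number of distinct first-column values a non-GNU awk may hit the open-file limit. If that happens, closing the file after each print with close($1".part.csv") works here, since the >> redirection appends rather than truncates.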

Python: Split CSV file according to first character of the first column

Here's a simple application of groupby:

df = pandas.read_csv('basename.csv', header=None)

def firstletter(index):
    # df.ix is removed in recent pandas; .loc works here because header=None
    # leaves the columns labelled 0, 1, ...
    firstentry = df.loc[index, 0]
    return firstentry[0]

for letter, group in df.groupby(firstletter):
    group.to_csv('basename_{}.csv'.format(letter))

Or, incorporating @jezrael's use of grouping by the explicit contents of the columns:

for letter, group in df.groupby(df[0].str[0]):
    group.to_csv('basename_{}.csv'.format(letter))
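Note that to_csv writes the DataFrame index by default; pass index=False if you only want the original columns in the split files, e.g.:

group.to_csv('basename_{}.csv'.format(letter), index=False)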

Split Large csv File into multiple files depending on row value python

Here is one approach.

fn = 'NEM12#000000000000001#CNRGYMDP#NEMMCO.csv'

cnt = 0
outfn = f'out_{cnt}.csv'

with open(fn, 'r') as f:
    for line in f:
        if line.startswith('100,'):    # header record - don't write
            continue
        elif line.startswith('900'):   # footer record - don't write
            continue
        elif line.startswith('200,'):  # start of a new block detected
            cnt += 1
            outfn = f'out_{cnt}.csv'   # new filename

        if line.startswith(('200,', '300,', '400,')):
            with open(outfn, 'a') as w:  # append the record
                w.write(line)

The output will be out_1.csv, out_2.csv, etc.
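Because each record is appended to outfn, re-running the script will append to any out_*.csv files left over from a previous run, so delete them first (or keep a single handle open per block rather than reopening per line) if that matters.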


