How to Split CSV Files as Per Number of Rows Specified

How to split CSV files as per number of rows specified?

Made it into a function. You can now call splitCsv <Filename> [chunkSize]

splitCsv() {
    HEADER=$(head -1 "$1")
    if [ -n "$2" ]; then
        CHUNK=$2
    else
        CHUNK=1000
    fi
    tail -n +2 "$1" | split -l "$CHUNK" - "$1"_split_
    for i in "$1"_split_*; do
        sed -i -e "1i$HEADER" "$i"
    done
}

Found on: http://edmondscommerce.github.io/linux/linux-split-file-eg-csv-and-keep-header-row.html
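For comparison, here is a minimal sketch of the same idea in Python. It assumes plain line-based records (no quoted newlines); the split_csv name and the data.csv file in the usage comment are illustrative, not part of the original answer:

import sys

def split_csv(filename, chunk_size=1000):
    """Split filename into chunks of chunk_size data rows, repeating the header."""
    with open(filename) as src:
        header = src.readline()          # keep the header line
        chunk, rows = None, 0
        for line in src:
            if rows % chunk_size == 0:   # time to start a new chunk file
                if chunk:
                    chunk.close()
                chunk = open(f"{filename}_split_{rows // chunk_size:05d}", "w")
                chunk.write(header)
            chunk.write(line)
            rows += 1
        if chunk:
            chunk.close()

if __name__ == "__main__":
    # e.g. python split_csv.py data.csv 500
    split_csv(sys.argv[1], int(sys.argv[2]) if len(sys.argv) > 2 else 1000)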

Split CSV files into smaller files but keeping the headers?

The answer to this question is yes, this is possible with AWK.

The idea is to store the header once and print all the remaining records into files with names of the form file.00001.csv:

awk -v l=11000 '(NR==1){header=$0;next}
    (NR%l==2){
        close(file)
        file=sprintf("%s.%0.5d.csv",FILENAME,++c)
        sub(/csv[.]/,"",file)
        print header > file
    }
    {print > file}' file.csv

This works in the following way:

  • (NR==1){header=$0;next}: If the record/line is the first line, save that line as the header.
  • (NR%l==2){...}: Every time we have written l=11000 records/lines, we need to start writing to a new file. This happens every time the record/line number modulo l equals 2, i.e. on lines 2, 2+l, 2+2l, 2+3l, ... (a quick numeric check of these boundaries follows after this list). When such a line is found we do:

    • close(file): close the file we just wrote to.
    • file=sprintf("%s.%0.5d.csv",FILENAME,++c); sub(/csv[.]/,"",file): build the new filename; the sub() strips the first csv. so that file.csv.00001.csv becomes file.00001.csv.
    • print header > file: open the new file and write the header to it.
  • {print > file}: write the records to the current file.
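As that quick check, a couple of lines of Python confirm where NR % l == 2 fires:

# line numbers where NR % l == 2, i.e. where a new output file starts
l = 11000
print([nr for nr in range(1, 50000) if nr % l == 2])
# -> [2, 11002, 22002, 33002, 44002], i.e. 2, 2+l, 2+2l, ...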

Note: if you don't care about the filename, you can use the following shorter version:

awk -v m=100 '
    (NR==1){h=$0;next}
    (NR%m==2){close(f); f=sprintf("%s.%0.5d",FILENAME,++c); print h > f}
    {print > f}' file.csv

How to split large csv files into 125MB-1000MB small csv files dynamically using split command in UNIX

You can get the file size in MB and divide by some ideal size that you need to predetermine (for my example I picked your minimum of 125MB), and that will give you the number of chunks.

You then get the row count (wc -l, assuming your CSV has no line breaks inside a cell) and divide that by the number of chunks to give your rows per chunk.

Rows per chunk is your "lines per chunk" count that you can finally pass to split.

Because we are doing integer division, which will most likely leave a remainder, you'll probably get one extra file holding the relatively few remainder rows (as you can see in the example).
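To make that arithmetic concrete, here is the same calculation in Python, using the numbers from the run shown further below:

file_size_mb = 4801        # du -ms output for the input file
max_size_mb = 125          # predetermined target size per chunk
n_rows = 60_000_000        # wc -l output

n_chunks = file_size_mb // max_size_mb           # 38
rows_per_chunk = n_rows // n_chunks              # 1578947
remainder = n_rows - n_chunks * rows_per_chunk   # 14 rows in one extra small file

print(n_chunks, rows_per_chunk, remainder)       # 38 1578947 14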

Here's how I coded this up. I'm using shellcheck, so I think this is pretty POSIX compliant:

csvFile=$1

maxSizeMB=125

rm -f chunked_*

fSizeMB=$(du -ms "$csvFile" | cut -f1)
echo "File size is $fSizeMB, max size per new file is $maxSizeMB"

nChunks=$(( fSizeMB / maxSizeMB ))
[ "$nChunks" -lt 1 ] && nChunks=1  # guard against inputs smaller than maxSizeMB
echo "Want $nChunks chunks"

nRows=$(wc -l < "$csvFile" | tr -d ' ')  # redirect into wc so it prints only the count
echo "File row count is $nRows"

nRowsPerChunk=$(( nRows / nChunks ))
echo "Need $nChunks files at around $nRowsPerChunk rows per file (plus one more file, maybe, for remainder)"


split -d -a 4 -l "$nRowsPerChunk" "$csvFile" "chunked_"


echo "Row (line) counts per file:"
wc -l chunked_00*

echo
echo "Size (MB) per file:"
du -ms chunked_00*

I created a mock CSV with 60_000_000 rows that is about 5GB:

ll -h gen_60000000x11.csv
-rw-r--r-- 1 zyoung staff 4.7G Jun 24 15:21 gen_60000000x11.csv

When I ran that script I got this output:

./main.sh gen_60000000x11.csv
File size is 4801MB, max size per new file is 125MB
Want 38 chunks
File row count is 60000000
Need 38 files at around 1578947 rows per file (plus one more file, maybe, for remainder)
Row (line) counts per file:
1578947 chunked_0000
1578947 chunked_0001
1578947 chunked_0002
...
1578947 chunked_0036
1578947 chunked_0037
14 chunked_0038
60000000 total

Size (MB) per file:
129 chunked_0000
129 chunked_0001
129 chunked_0002
...
129 chunked_0036
129 chunked_0037
1 chunked_0038

Split a CSV file into multiple files with header and given number of records

Here is a method similar to yours, using a for /F loop to read the input file. The performance is not great, however, because each output file is opened and closed for every single line written:

@echo off
setlocal EnableExtensions DisableDelayedExpansion

rem // Define constants here:
set "_FILE=%~1" & rem // (first command line argument is input file)
set /A "_LIMIT=5" & rem // (number of records or rows per output file)

rem // Split file name:
set "NAME=%~dpn1" & rem // (path and file name)
set "EXT=%~x1" & rem // (file name extension)

rem // Split file into multiple ones:
set "HEADER=" & set /A "INDEX=0, COUNT=0"
rem // Read file once:
for /F "usebackq delims=" %%L in ("%_FILE%") do (
rem // Read header if not done yet:
if not defined HEADER (
set "HEADER=%%L"
) else (
set "LINE=%%L"
rem // Compute line index, previous and current file count:
set /A "PREV=COUNT, COUNT=INDEX/_LIMIT+1, INDEX+=1"
rem // Write header once per output file:
setlocal EnableDelayedExpansion
if !PREV! lss !COUNT! (
> "!NAME!_!COUNT!!EXT!" echo/!HEADER!
)
rem // Write line:
>> "!NAME!_!COUNT!!EXT!" echo/!LINE!
endlocal
)
)

endlocal
exit /B

To accomplish your task you do not even need a for /F loop; rather you could use set /P, together with input redirection, in a for /L loop, like this (see all the explanatory comments):

@echo off
setlocal EnableExtensions DisableDelayedExpansion

rem // Define constants here:
set "_FILE=%~1" & rem // (first command line argument is input file)
set /A "_LIMIT=5" & rem // (number of records or rows per output file)

rem // Split file name:
set "NAME=%~dpn1" & rem // (path and file name)
set "EXT=%~x1" & rem // (file name extension)

rem // Determine number of lines excluding header:
for /F %%I in ('^< "%_FILE%" find /V /C ""') do set /A "COUNT=%%I-1"

rem // Split file into multiple ones:
setlocal EnableDelayedExpansion
rem // Read file once:
< "!_FILE!" (
rem // Read header (first line):
set /P HEADER=""
rem // Calculate number of output files:
set /A "DIV=(COUNT-1)/_LIMIT+1"
rem // Iterate over output files:
for /L %%J in (1,1,!DIV!) do (
rem // Write an output file:
> "!NAME!_%%J!EXT!" (
rem // Write header:
echo/!HEADER!
rem // Write as many lines as specified:
for /L %%I in (1,1,%_LIMIT%) do (
set "LINE=" & set /P LINE=""
if defined LINE echo/!LINE!
)
)
)
)
endlocal

endlocal
exit /B

The advantage of this method is that the input file, as well as each output file, is opened only once.

Split large csv file into multiple files and keep header in each part

Another awk. First some test records:

$ seq 1 1234567 > file

Then the awk:

$ awk 'NR==1{n=1000;h=$0}{print > n}NR==n+c{n*=10;c=NR-1;print h>n}' file

Explained:

$ awk '
NR==1 {         # first record:
    n=1000      # set the first output file size and
    h=$0        # store the header
}
{
    print > n   # output record to the current file
}
NR==n+c {       # once the target NR has been reached (close(n) goes here if needed):
    n*=10       # grow the target magnitude
    c=NR-1      # set the correction factor
    print h > n # write the header first
}' file

Count the records:

$ wc -l 1000*
   1000 1000
  10000 10000
 100000 100000
1000000 1000000
 123571 10000000
1234571 total

Split a large csv file through python

Oops! You are re-opening the output file on every row, which is an expensive operation... Your code could become:

...
with open(in_file, 'r') as file1:
    row_count = 0
    #reader = csv.reader(file1) # unused here
    out_file = open(filepath + "\\OutPut-file_" + str(filenum) + ".csv", "a")
    for line in file1:
        #print line
        if row_count >= split_size:
            out_file.close()
            filenum = filenum + 1
            out_file = open(filepath + "\\OutPut-file_" + str(filenum) + ".csv", "a")
            row_count = 0
        out_file.write(line)
        row_count = row_count + 1
        line_count = line_count + 1
...

Ideally, you should also initialize out_file = None before the try block and ensure a clean close in the except blocks with if out_file is not None: out_file.close().

Remark: this code only splits on line count (as yours did). That means it will give wrong output if the CSV file can contain newlines inside quoted fields...
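If embedded newlines are a real concern, one option is to count csv records rather than raw lines. Below is a minimal sketch using the standard csv module; it also applies the out_file = None / clean-close advice from above. The OutPut-file_ prefix is just carried over from the question's code, and repeating the header in each chunk is an extra not present in the original:

import csv

def split_csv(in_file, split_size=1000, prefix="OutPut-file_"):
    """Split on csv records, so newlines inside quoted fields stay intact."""
    with open(in_file, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)            # first record is the header
        out_file, writer, row_count, filenum = None, None, 0, 1
        try:
            for row in reader:
                if out_file is None or row_count >= split_size:
                    if out_file is not None:
                        out_file.close()
                        filenum += 1
                    out_file = open(f"{prefix}{filenum}.csv", "w", newline="")
                    writer = csv.writer(out_file)
                    writer.writerow(header)
                    row_count = 0
                writer.writerow(row)
                row_count += 1
        finally:
            if out_file is not None:
                out_file.close()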

Writing multiple csv files with specified number of rows for large dataframe

This should work

library( data.table )
# create sample data
dt = data.table( 1:2000 )
# split dt into a list, based on (f = ) the integer division (+ 1) of the 'rownumbers'
# by the preferred chunk size (99);
# use keep.by = TRUE to keep the integer-division (+ 1) result
# for naming the files when saving
l <- split( dt, f = dt[, .I] %/% 99 + 1, keep.by = TRUE )
# simple for loop, writing each list element, using its name in the filename
for (i in 1:length(l)) {
    write.csv( data.frame( l[[i]] ), file = paste0( "./testfile_", names(l[i]), ".csv" ) )
}

results in a series of files named testfile_1.csv, testfile_2.csv, and so on.
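For comparison, a similar fixed-row split can be sketched in Python with pandas (assuming pandas is available; the 99-row chunk size and testfile_ names mirror the R example, though the chunk boundaries differ by a row from the %/% + 1 trick):

import pandas as pd

# sample data mirroring the R example: 2000 rows
df = pd.DataFrame({"V1": range(1, 2001)})

chunk_size = 99
for i, start in enumerate(range(0, len(df), chunk_size), start=1):
    # write rows [start, start + chunk_size) to their own numbered file
    df.iloc[start:start + chunk_size].to_csv(f"./testfile_{i}.csv")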


