How to split CSV files as per number of rows specified?
Made it into a function. You can now call splitCsv <Filename> [chunkSize]
splitCsv() {
    HEADER=$(head -1 "$1")
    if [ -n "$2" ]; then
        CHUNK=$2
    else
        CHUNK=1000
    fi
    tail -n +2 "$1" | split -l "$CHUNK" - "$1"_split_
    for i in "$1"_split_*; do
        sed -i -e "1i$HEADER" "$i"
    done
}
Found on: http://edmondscommerce.github.io/linux/linux-split-file-eg-csv-and-keep-header-row.html
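For instance, the function can be exercised on a small generated file (the file name, column names, and chunk size of 2 below are made up for illustration; split's default suffixes produce data.csv_split_aa, data.csv_split_ab, and so on):

```shell
# The function from above (GNU sed assumed for the "1i" in-place insert).
splitCsv() {
    HEADER=$(head -1 "$1")
    if [ -n "$2" ]; then CHUNK=$2; else CHUNK=1000; fi
    tail -n +2 "$1" | split -l "$CHUNK" - "$1"_split_
    for i in "$1"_split_*; do
        sed -i -e "1i$HEADER" "$i"
    done
}

# A tiny sample CSV: header plus four data rows.
printf 'id,name\n1,a\n2,b\n3,c\n4,d\n' > data.csv

# Split into chunks of 2 data rows; every part keeps the header.
splitCsv data.csv 2

head -1 data.csv_split_aa    # the header row, repeated in each part
```

Each resulting part then has 3 lines: the repeated header plus 2 data rows.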
Split CSV files into smaller files but keeping the headers?
The answer to this question is yes; this is possible with awk. The idea is to keep the header in memory and write all remaining records to files named filename.00001.csv, filename.00002.csv, and so on:
awk -v l=11000 '(NR==1){header=$0;next}
     (NR%l==2) {
        close(file);
        file=sprintf("%s.%0.5d.csv",FILENAME,++c)
        sub(/csv[.]/,"",file)
        print header > file
     }
     {print > file}' file.csv
This works in the following way:
(NR==1){header=$0;next}
: If the record/line is the first line, save that line as the header.
(NR%l==2){...}
: Every time we have written l=11000 records/lines, we need to start writing to a new file. This happens every time the modulo of the record/line number hits 2, i.e. on lines 2, 2+l, 2+2l, 2+3l, .... When such a line is found, we do:
close(file)
: close the file we just wrote to.
file=sprintf("%s.%0.5d.csv",FILENAME,++c); sub(/csv[.]/,"",file)
: define the new filename as FILENAME.00XXX.csv; the sub() strips the original "csv." so the counter sits in front of the extension.
print header > file
: open the new file and write the header to it.
{print > file}
: write the current record to the file.
Note: if you don't care about the filename, you can use the following shorter version:
awk -v m=100 '
(NR==1){h=$0;next}
(NR%m==2) { close(f); f=sprintf("%s.%0.5d",FILENAME,++c); print h > f }
{print > f}' file.csv
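As a quick check, the shorter version can be run against a tiny generated file (the file name and the chunk size of 3 here are arbitrary):

```shell
# Generate a 10-row CSV with a header (sample data).
{ echo "id"; seq 1 10; } > file.csv

# Split into chunks of 3 data rows, repeating the header in each part.
awk -v m=3 '
  (NR==1){h=$0;next}
  (NR%m==2) { close(f); f=sprintf("%s.%0.5d",FILENAME,++c); print h > f }
  {print > f}' file.csv

# 4 parts: three with 4 lines (header + 3 rows) and a remainder with 2.
wc -l file.csv.0000*
```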
How to split large csv files into 125MB-1000MB small csv files dynamically using split command in UNIX
You can get the file size in MB and divide by some ideal size that you need to predetermine (for my example I picked your minimum of 125MB), and that will give you the number of chunks.
You then get the row count (wc -l, assuming your CSV has no line breaks inside a cell) and divide that by the number of chunks to give your rows per chunk.
Rows per chunk is your "lines per chunk" count that you can finally pass to split.
Because we are doing division, which will most likely leave a remainder, you'll probably get an extra file with relatively few remainder rows (which you can see in the example).
Here's how I coded this up. I'm using shellcheck, so I think this is pretty POSIX compliant:
csvFile=$1
maxSizeMB=125
rm -f chunked_*
fSizeMB=$(du -ms "$csvFile" | cut -f1)
echo "File size is $fSizeMB, max size per new file is $maxSizeMB"
nChunks=$(( fSizeMB / maxSizeMB ))
[ "$nChunks" -lt 1 ] && nChunks=1
echo "Want $nChunks chunks"
nRows=$(wc -l < "$csvFile")
echo "File row count is $nRows"
nRowsPerChunk=$(( nRows / nChunks ))
echo "Need $nChunks files at around $nRowsPerChunk rows per file (plus one more file, maybe, for remainder)"
split -d -a 4 -l $nRowsPerChunk "$csvFile" "chunked_"
echo "Row (line) counts per file:"
wc -l chunked_00*
echo
echo "Size (MB) per file:"
du -ms chunked_00*
I created a mock CSV with 60_000_000 rows that is about 5GB:
ll -h gen_60000000x11.csv
-rw-r--r-- 1 zyoung staff 4.7G Jun 24 15:21 gen_60000000x11.csv
When I ran that script I got this output:
./main.sh gen_60000000x11.csv
File size is 4801MB, max size per new file is 125MB
Want 38 chunks
File row count is 60000000
Need 38 files at around 1578947 rows per file (plus one more file, maybe, for remainder)
Row (line) counts per file:
1578947 chunked_0000
1578947 chunked_0001
1578947 chunked_0002
...
1578947 chunked_0036
1578947 chunked_0037
14 chunked_0038
60000000 total
Size (MB) per file:
129 chunked_0000
129 chunked_0001
129 chunked_0002
...
129 chunked_0036
129 chunked_0037
1 chunked_0038
Split a CSV file into multiple files with header and given number of records
Here is a method similar to yours, using a for /F loop to read the input file. The performance is not that good, however, because each output file is opened and closed for every single line written:
@echo off
setlocal EnableExtensions DisableDelayedExpansion
rem // Define constants here:
set "_FILE=%~1" & rem // (first command line argument is input file)
set /A "_LIMIT=5" & rem // (number of records or rows per output file)
rem // Split file name:
set "NAME=%~dpn1" & rem // (path and file name)
set "EXT=%~x1" & rem // (file name extension)
rem // Split file into multiple ones:
set "HEADER=" & set /A "INDEX=0, COUNT=0"
rem // Read file once:
for /F "usebackq delims=" %%L in ("%_FILE%") do (
    rem // Read header if not done yet:
    if not defined HEADER (
        set "HEADER=%%L"
    ) else (
        set "LINE=%%L"
        rem // Compute line index, previous and current file count:
        set /A "PREV=COUNT, COUNT=INDEX/_LIMIT+1, INDEX+=1"
        rem // Write header once per output file:
        setlocal EnableDelayedExpansion
        if !PREV! lss !COUNT! (
            > "!NAME!_!COUNT!!EXT!" echo/!HEADER!
        )
        rem // Write line:
        >> "!NAME!_!COUNT!!EXT!" echo/!LINE!
        endlocal
    )
)
endlocal
exit /B
To accomplish your task you do not even need a for /F loop; rather you could use set /P, together with input redirection, in a for /L loop, like this (see all the explanatory comments):
@echo off
setlocal EnableExtensions DisableDelayedExpansion
rem // Define constants here:
set "_FILE=%~1" & rem // (first command line argument is input file)
set /A "_LIMIT=5" & rem // (number of records or rows per output file)
rem // Split file name:
set "NAME=%~dpn1" & rem // (path and file name)
set "EXT=%~x1" & rem // (file name extension)
rem // Determine number of lines excluding header:
for /F %%I in ('^< "%_FILE%" find /V /C ""') do set /A "COUNT=%%I-1"
rem // Split file into multiple ones:
setlocal EnableDelayedExpansion
rem // Read file once:
< "!_FILE!" (
    rem // Read header (first line):
    set /P HEADER=""
    rem // Calculate number of output files:
    set /A "DIV=(COUNT-1)/_LIMIT+1"
    rem // Iterate over output files:
    for /L %%J in (1,1,!DIV!) do (
        rem // Write an output file:
        > "!NAME!_%%J!EXT!" (
            rem // Write header:
            echo/!HEADER!
            rem // Write as many lines as specified:
            for /L %%I in (1,1,%_LIMIT%) do (
                set "LINE=" & set /P LINE=""
                if defined LINE echo/!LINE!
            )
        )
    )
)
endlocal
endlocal
exit /B
The advantage of this method is that the input file as well as each output file is opened once only.
Split large csv file into multiple files and keep header in each part
Another awk. First some test records:
$ seq 1 1234567 > file
Then the awk:
$ awk 'NR==1{n=1000;h=$0}{print > n}NR==n+c{n*=10;c=NR-1;print h>n}' file
Explained:
$ awk '
NR==1 { # first record:
n=1000 # set first output file size and
h=$0 # store the header
}
{
print > n # output to file
}
NR==n+c { # once target NR has been reached. close(n) goes here if needed
n*=10 # grow target magnitude
c=NR-1 # set the correction factor.
print h > n # first the head
}' file
Count the records:
$ wc -l 1000*
1000 1000
10000 10000
100000 100000
1000000 1000000
123571 10000000
1234571 total
Split a large csv file through python
Oops! You are consistently re-opening the output file on each row, which is an expensive operation... Your code could become:
...
with open(in_file, 'r') as file1:
    row_count = 0
    #reader = csv.reader(file1) # unused here
    out_file = open(filepath + "\\OutPut-file_" + str(filenum) + ".csv", "a")
    for line in file1:
        #print line
        if row_count >= split_size:
            out_file.close()
            filenum = filenum + 1
            out_file = open(filepath + "\\OutPut-file_" + str(filenum) + ".csv", "a")
            row_count = 0
        out_file.write(line)
        row_count = row_count + 1
        line_count = line_count + 1
...
Ideally, you should even initialize out_file = None before the try block and ensure a clean close in the except blocks with if out_file is not None: out_file.close().
Remark: this code only splits on line count (as yours did). That means it will give wrong output if the CSV file can contain newlines in quoted fields...
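If quoted newlines are a concern, a variant built on the csv module can count logical records instead of physical lines, and also makes sure the last output file is closed. This is a minimal sketch, not the original poster's code: the split_csv name is made up, and the OutPut-file_ naming is kept from the question for familiarity.

```python
import csv

def split_csv(in_file, split_size, out_prefix="OutPut-file_"):
    """Split in_file into chunks of split_size records, repeating the
    header in each part. csv handles newlines inside quoted fields."""
    with open(in_file, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)           # first logical record is the header
        out_file = None
        writer = None
        filenum = 0
        row_count = 0
        for row in reader:
            # Start a new output file on the first record and whenever
            # the current one has reached split_size records.
            if out_file is None or row_count >= split_size:
                if out_file is not None:
                    out_file.close()
                filenum += 1
                out_file = open(f"{out_prefix}{filenum}.csv", "w", newline="")
                writer = csv.writer(out_file)
                writer.writerow(header)
                row_count = 0
            writer.writerow(row)
            row_count += 1
        if out_file is not None:        # close the last chunk, too
            out_file.close()
    return filenum
```

Because the csv reader yields one record per iteration even when a field spans several physical lines, the chunk boundaries fall between records rather than in the middle of a quoted field.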
Writing multiple csv files with specified number of rows for large dataframe
This should work
library( data.table )
#create sample data
dt = data.table( 1:2000 )
#split dt into a list, based on (f = ) the integer division (+ 1) of the 'rownumbers'
# by the preferred chunk size (99)
# use keep.by = TRUE to keep the integer division (+ 1) result
# for naming the files when saving
l <- split( dt, f = dt[, .I] %/% 99 + 1, keep.by = TRUE )
#simple for loop, writing each list element, using its name in the filename
for (i in 1:length(l)) {
write.csv( data.frame( l[[i]]), file = paste0( "./testfile_", names(l[i]), ".csv" ) )
}