Split Large Text File (Around 50 GB) into Multiple Files

Split large files by size limit without cutting lines

From the split man-page:

...
-C, --line-bytes=SIZE
put at most SIZE bytes of lines per output file
...

The description of this option may not be very obvious, but it does what you are asking for: each output file is cut at the last line break before it would exceed SIZE bytes.
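For example, to split a large file into pieces of at most 500 MB each without breaking lines (a sketch assuming a hypothetical input file named big.log and the output prefix part_):

$ split -C 500M big.log part_

This produces part_aa, part_ab, part_ac, and so on; each piece ends on a line boundary and is at most 500 MB, which you can verify with wc -c part_*.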

Split a large file into files with a set number of lines based on the 1st column value

By scanning the file twice, you can do:

$ awk -F\| -v size=5 '
    NR==FNR {a[$1]++; next}                        # pass 1: count records per key
    FNR==1 || p!=$1 {                              # pass 2: at each new key group
        if (count+a[$1] > size) {f++; count=a[$1]} # group will not fit: start a new file
        else count+=a[$1]
        p=$1}
    {print > ("_file_" f+0)}' file{,}              # file{,} expands to "file file": two passes

$ head _f*
==> _file_0 <==
A.B|100|20
A.B|101|20
A.X|101|30
A.X|1000|20

==> _file_1 <==
B.Y|1|1
B.Y|1|2

Note, however, that if one of the unique keys has more records than the desired file length, keeping a key's records together and enforcing the maximum file length will conflict. This script assumes that keeping the records together is more important. For example, run it on the same input file with size=1: the keys still won't be split into separate files, but the file lengths will exceed 1.
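As a quick sanity check (a sketch assuming the _file_* outputs produced above), the following pipeline lists any key that ended up in more than one file; empty output means no key group was split:

$ awk -F\| '{print FILENAME, $1}' _file_* | sort -u | awk '{print $2}' | sort | uniq -d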

Shell command to split a large file into 10 smaller files

Use split. For example, to split a file every 3.4 million lines (which should give you 10 files):

split -l 3400000

$ man split
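
If you'd rather not hard-code the line count, a short sketch (assuming a hypothetical input file named bigfile.txt and the output prefix part_) can compute it with ceiling division so you get exactly 10 files:

$ lines=$(wc -l < bigfile.txt)                         # total line count
$ split -l $(( (lines + 9) / 10 )) bigfile.txt part_   # ceil(lines/10) lines per file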


