Splitting a File in Linux Based on Content


Suppose you have a file mail.txt containing several mails:

$ cat mail.txt
<html>
mail A
</html>

<html>
mail B
</html>

<html>
mail C
</html>

Run csplit to split at each `<html>` line:

$ csplit mail.txt '/^<html>$/' '{*}'

- mail.txt => input file
- /^<html>$/ => pattern match every `<html>` line
- {*} => repeat the previous pattern as many times as possible

Check the output:

$ ls
mail.txt xx00 xx01 xx02 xx03
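
To verify this end to end, the steps above can be reproduced from scratch; xx00 is empty because nothing precedes the first `<html>` line:

```shell
# Recreate the sample input: three <html>...</html> blocks.
printf '<html>\nmail A\n</html>\n\n<html>\nmail B\n</html>\n\n<html>\nmail C\n</html>\n' > mail.txt

# Split at every <html> line; -s suppresses csplit's byte-count output.
csplit -s mail.txt '/^<html>$/' '{*}'

# xx00 holds whatever precedes the first match (empty here); xx01 is the first mail.
cat xx01
```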

If you want to do it in awk (each output file is named after the line number, NR, of its `<html>` line):

$ awk '/<html>/{filename=NR".txt"}; {print >filename}' mail.txt
$ ls
1.txt 5.txt 9.txt mail.txt

Split linux files based on condition

To build each output file's name, you need to trim the underscore and everything after it from each line; the parameter expansion `${line%%_*}` does that:

while read -r line ; do
    echo "$line" >> "${line%%_*}.txt"
done < split.txt

Explanation:

  • %: remove a matching suffix from the value
  • %%: use the longest possible match
  • _*: an underscore and everything after it
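
A runnable sketch of the loop, assuming a hypothetical split.txt whose lines start with the target file name followed by an underscore:

```shell
# Hypothetical input: the text before the first underscore names the output file.
printf 'alpha_one\nalpha_two\nbeta_one\n' > split.txt

while read -r line ; do
    # ${line%%_*} strips the longest suffix matching _* (first underscore onward)
    echo "$line" >> "${line%%_*}.txt"
done < split.txt

cat alpha.txt
```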

Split a file into several files based on a condition and an approximate number of lines

When each Order Header has a lot of records, you might consider the simple:

csplit -z sample.txt '/00000,/' '{*}'

This will make a file for each Order Header. It ignores the ~40K-line requirement and can produce a great many files, so it is only viable when you have a limited number (perhaps 40?) of different Order Headers.
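
For a concrete picture of that csplit call, here is a made-up sample.txt in which every Order Header line contains `00000,` (the field values are invented):

```shell
# Hypothetical data: "00000," marks an Order Header, other lines are records.
printf '00000,h1\nrec1\nrec2\n00000,h2\nrec3\n' > sample.txt

# Start clean, then split; -z elides the empty chunk that would
# otherwise precede the first header.
rm -f xx*
csplit -z sample.txt '/00000,/' '{*}'

ls xx*
```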

When you do want different headers combined in one file, consider:

awk -v max=40000 '
function flush() {
    if (last+nr>max || sample==0) {
        outfile="sample_" sample++ ".txt";
        last=0;
    }
    for (i=0;i<nr;i++) print a[i] >> outfile;
    last+=nr;
    nr=0;
}
BEGIN { sample=0 }
/00000,/ { flush(); }
{ a[nr++]=$0 }
END { flush() }
' sample.txt

Splitting a huge text file based on line content

Try:

awk 'BEGIN{FS="/"} {print > $1}' [your file name]

Output:

$ cat www.unix.com
www.unix.com/man-page/opensolaris/1/csplit/&hl=en
www.unix.com/shell-programming-and-scripting/126539-csplit-help.html/RK=0/RS=iGOr1SINnK126qZciYPZtBHpEmg-

$ cat www.linuxdevcenter.com
www.linuxdevcenter.com/cmd/cmd.csp?path=c/csplit+"csplit"&hl=en&ct=clnk

$ cat www.w3cschool.cc
www.w3cschool.cc/linux/linux-comm-csplit.html

`{print > $1}` redirects each line to a file named after `$1`, in this case the domain name.
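
A minimal self-contained run of the same one-liner, using made-up domains:

```shell
# Made-up URL list; with FS="/" the first field is the domain name.
printf 'example.com/a/b\nexample.com/c\nexample.org/x\n' > urls.txt

# Each line lands in a file named after its domain.
awk 'BEGIN{FS="/"} {print > $1}' urls.txt

cat example.com
```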

Most efficient method to split file into multiple files based on a column

Within awk you can redirect the output of each line to a different file whose name you build dynamically, (based on $2 in this case):

$ awk -F, '{print > ("some_prefix_" $2 "_some_suffix_date")}' file

$ ls *_date
some_prefix_345_some_suffix_date some_prefix_45_some_suffix_date some_prefix_645_some_suffix_date

$ cat some_prefix_345_some_suffix_date
rec1,345,field3,....field20
rec12,345,field3,....field20

$ cat some_prefix_645_some_suffix_date
rec1,645,field3,....field20
rec34,645,field3,....field20

$ cat some_prefix_45_some_suffix_date
frec23,45,field3,....field20

As pointed out in the comments, if you have many different values of $2 and you get an error for too many open files, you can close as you go:

$ awk -F, '{fname = "some_prefix_" $2 "_some_suffix_date"
    if (a[fname]++) print >> fname; else print > fname
    close(fname)}' file
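
A self-contained sketch of that close-as-you-go pattern, with a made-up out_ prefix and sample rows:

```shell
# Hypothetical CSV where $2 is the key column.
printf 'rec1,345,x\nrec23,45,x\nrec12,345,x\n' > file

awk -F, '{fname = "out_" $2 ".txt"
    # append after the first write, so close() does not lose earlier lines
    if (a[fname]++) print >> fname; else print > fname
    close(fname)}' file

cat out_345.txt
```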

Splitting a file into multiple files based on size and occurrence

The regexp may need a little tuning, since the result files do not match the input exactly. Run it as perl scriptname.pl < sample.txt and you get chunk files:

#!/usr/bin/perl -w

use strict;
use IO::File;

my $all = join('', (<STDIN>));

my (@pieces) = ($all =~ m%([IZO]\(.*?\)\{.*?\r\n\}\r\n)%gsx);

my $n = 1;
my $FH;
foreach my $P (@pieces) {
    if ($P =~ m%^I%) {
        undef $FH;
        $FH = IO::File->new(sprintf("> chunk%d", $n));
        $n++;
    }
    print $FH $P;
}

A less memory-hungry alternative in Python:

#!/usr/bin/env python

import sys

def split(filename, size=100, outputPrefix="xxx"):
    with open(filename) as I:
        n = 0
        FNM = "{}{}.txt"
        O = open(FNM.format(outputPrefix, n), "w")
        toWrite = size*1024*1024
        for line in I:
            toWrite -= len(line)
            if line[0] == 'I' and toWrite < 0:
                O.close()
                toWrite = size*1024*1024
                n += 1
                O = open(FNM.format(outputPrefix, n), "w")
            O.write(line)
        O.close()

if __name__ == "__main__":
    split(sys.argv[1])

Use: python scriptname.py sample.txt. The concatenation of all output files is equal to sample.txt.


