Splitting a File in Linux Based on Content


Suppose you have a file mail.txt containing several mails:

$ cat mail.txt
<html>
mail A
</html>

<html>
mail B
</html>

<html>
mail C
</html>

Run csplit to split at each `<html>` line:

$ csplit mail.txt '/^<html>$/' '{*}'

- mail.txt => input file
- /^<html>$/ => pattern match every `<html>` line
- {*} => repeat the previous pattern as many times as possible

Check the output:

$ ls
mail.txt xx00 xx01 xx02 xx03
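
To verify this end to end, the steps above can be reproduced from scratch; xx00 is empty because nothing precedes the first `<html>` line:

```shell
# Recreate the sample input: three <html>...</html> blocks.
printf '<html>\nmail A\n</html>\n\n<html>\nmail B\n</html>\n\n<html>\nmail C\n</html>\n' > mail.txt

# Split at every <html> line; -s suppresses csplit's byte-count output.
csplit -s mail.txt '/^<html>$/' '{*}'

# xx00 holds whatever precedes the first match (empty here); xx01 is the first mail.
cat xx01
```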

If you want to do it in awk (each output file is named after the line number, NR, of its `<html>` line):

$ awk '/<html>/{filename=NR".txt"}; {print >filename}' mail.txt
$ ls
1.txt 5.txt 9.txt mail.txt

Split linux files based on condition

To build each output file's name, you need to trim the underscore and everything after it from each line; the parameter expansion `${line%%_*}` does that:

while read -r line ; do
    echo "$line" >> "${line%%_*}.txt"
done < split.txt

Explanation:

  • %: remove a matching suffix from the value
  • %%: use the longest possible match
  • _*: an underscore and everything after it
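
A runnable sketch of the loop, assuming a hypothetical split.txt whose lines start with the target file name followed by an underscore:

```shell
# Hypothetical input: the text before the first underscore names the output file.
printf 'alpha_one\nalpha_two\nbeta_one\n' > split.txt

while read -r line ; do
    # ${line%%_*} strips the longest suffix matching _* (first underscore onward)
    echo "$line" >> "${line%%_*}.txt"
done < split.txt

cat alpha.txt
```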

Split a file into several files based on a condition and an approximate number of lines

When each Order Header has a lot of records, you might consider the simple:

csplit -z sample.txt '/00000,/' '{*}'

This will make a file for each Order Header. It ignores the ~40K-line requirement and can produce a great many files, so it is only viable when you have a limited number (perhaps 40?) of different Order Headers.
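
For a concrete picture of that csplit call, here is a made-up sample.txt in which every Order Header line contains `00000,` (the field values are invented):

```shell
# Hypothetical data: "00000," marks an Order Header, other lines are records.
printf '00000,h1\nrec1\nrec2\n00000,h2\nrec3\n' > sample.txt

# Start clean, then split; -z elides the empty chunk that would
# otherwise precede the first header.
rm -f xx*
csplit -z sample.txt '/00000,/' '{*}'

ls xx*
```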

When you do want different headers combined in one file, consider:

awk -v max=40000 '
function flush() {
    if (last+nr>max || sample==0) {
        outfile="sample_" sample++ ".txt";
        last=0;
    }
    for (i=0;i<nr;i++) print a[i] >> outfile;
    last+=nr;
    nr=0;
}
BEGIN { sample=0 }
/00000,/ { flush(); }
{ a[nr++]=$0 }
END { flush() }
' sample.txt

Splitting a huge text file based on line content

Try:

awk 'BEGIN{FS="/"} {print > $1}' [your file name]

Output:

$ cat www.unix.com
www.unix.com/man-page/opensolaris/1/csplit/&hl=en
www.unix.com/shell-programming-and-scripting/126539-csplit-help.html/RK=0/RS=iGOr1SINnK126qZciYPZtBHpEmg-

$ cat www.linuxdevcenter.com
www.linuxdevcenter.com/cmd/cmd.csp?path=c/csplit+"csplit"&hl=en&ct=clnk

$ cat www.w3cschool.cc
www.w3cschool.cc/linux/linux-comm-csplit.html

`{print > $1}` redirects each line to a file named after `$1`, in this case the domain name.
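
A minimal self-contained run of the same one-liner, using made-up domains:

```shell
# Made-up URL list; with FS="/" the first field is the domain name.
printf 'example.com/a/b\nexample.com/c\nexample.org/x\n' > urls.txt

# Each line lands in a file named after its domain.
awk 'BEGIN{FS="/"} {print > $1}' urls.txt

cat example.com
```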

Most efficient method to split file into multiple files based on a column

Within awk you can redirect the output of each line to a different file whose name you build dynamically, (based on $2 in this case):

$ awk -F, '{print > ("some_prefix_" $2 "_some_suffix_date")}' file

$ ls *_date
some_prefix_345_some_suffix_date some_prefix_45_some_suffix_date some_prefix_645_some_suffix_date

$ cat some_prefix_345_some_suffix_date
rec1,345,field3,....field20
rec12,345,field3,....field20

$ cat some_prefix_645_some_suffix_date
rec1,645,field3,....field20
rec34,645,field3,....field20

$ cat some_prefix_45_some_suffix_date
frec23,45,field3,....field20

As pointed out in the comments, if you have many different values of $2 and you get an error for too many open files, you can close as you go:

$ awk -F, '{fname = "some_prefix_" $2 "_some_suffix_date"
    if (a[fname]++) print >> fname; else print > fname
    close(fname)}' file
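
A self-contained sketch of that close-as-you-go pattern, with a made-up out_ prefix and sample rows:

```shell
# Hypothetical CSV where $2 is the key column.
printf 'rec1,345,x\nrec23,45,x\nrec12,345,x\n' > file

awk -F, '{fname = "out_" $2 ".txt"
    # append after the first write, so close() does not lose earlier lines
    if (a[fname]++) print >> fname; else print > fname
    close(fname)}' file

cat out_345.txt
```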

Splitting a file into multiple files based on size and occurrence

The regexp may need a little tuning, since the result files do not match the input exactly. Run it as perl scriptname.pl < sample.txt and you get chunk files:

#!/usr/bin/perl -w

use strict;
use IO::File;

my $all = join('', (<STDIN>));

my (@pieces) = ($all =~ m%([IZO]\(.*?\)\{.*?\r\n\}\r\n)%gsx);

my $n = 1;
my $FH;
foreach my $P (@pieces) {
    if ($P =~ m%^I%) {
        undef $FH;
        $FH = IO::File->new(sprintf("> chunk%d", $n));
        $n++;
    }
    print $FH $P;
}

A less memory-hungry alternative in Python:

#!/usr/bin/env python

import sys

def split(filename, size=100, outputPrefix="xxx"):
    with open(filename) as I:
        n = 0
        FNM = "{}{}.txt"
        O = open(FNM.format(outputPrefix, n), "w")
        toWrite = size*1024*1024
        for line in I:
            toWrite -= len(line)
            if line[0] == 'I' and toWrite < 0:
                O.close()
                toWrite = size*1024*1024
                n += 1
                O = open(FNM.format(outputPrefix, n), "w")
            O.write(line)
        O.close()

if __name__ == "__main__":
    split(sys.argv[1])

Use: python scriptname.py sample.txt. The concatenation of all output files is equal to sample.txt.


