Split Files Based on File Content and Pattern Matching

Split text file into parts based on a pattern taken from the text file

Here's a simple awk script that will do what you want:

BEGIN { fn=0 }
NR==1 { next }
NR==2 { delim=$1 }
$1 == delim {
    f=sprintf("test%02d.txt",fn++)
    print "Creating " f
}
{ print $0 > f }
  1. initialize output file number
  2. ignore the first line
  3. extract the delimiter from the second line
  4. for every input line whose first token matches the delimiter, set up the output file name
  5. for all lines, write to the current output file
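
To see the script in action, here is a hypothetical round trip. The input layout and the SECTION delimiter are assumptions, not taken from the question:

```shell
# Toy input (an assumed layout): line 1 is ignored, and the first token
# of line 2 ("SECTION") becomes the delimiter for all later splits.
cat > sample.txt <<'EOF'
ignore this header
SECTION one
alpha
SECTION two
beta
EOF

awk 'BEGIN { fn=0 }
NR==1 { next }
NR==2 { delim=$1 }
$1 == delim {
    f=sprintf("test%02d.txt",fn++)
    print "Creating " f
}
{ print $0 > f }' sample.txt
# prints "Creating test00.txt" and "Creating test01.txt";
# test00.txt holds the "SECTION one" block, test01.txt the "SECTION two" block.
```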

Split files based on matching string

You could try the following, written and tested with the samples provided.

awk '
/STOCKHOLM/{
    close(file)
    file=count=""
}
(/STOCKHOLM/ || !NF) && !file{
    val=(val?val ORS:"")$0
    count++
    next
}
count==2{
    count=""
    file=$NF"_full.txt"
    if(val){
        print val > (file)
        val=""
    }
    next
}
file{
    print >> (file)
}
' Input_file

Explanation: a detailed explanation of the above follows.

awk '                              ##Start the awk program.
/STOCKHOLM/{                       ##If the string STOCKHOLM is present in the line, do the following.
    close(file)                    ##Close the previously opened output file to avoid errors.
    file=count=""                  ##Nullify the variables file and count.
}
(/STOCKHOLM/ || !NF) && !file{     ##If the line contains STOCKHOLM OR has no fields, AND file is null, do the following.
    val=(val?val ORS:"")$0         ##Append the current line to val each time the cursor comes here.
    count++                        ##Increment the variable count by 1.
    next                           ##next skips all further statements.
}
count==2{                          ##If count is 2, do the following.
    count=""                       ##Nullify count.
    file=$NF"_full.txt"            ##Build the output file name from the last field plus a suffix.
    if(val){                       ##If val is NOT null, do the following.
        print val > (file)         ##Print val into the output file.
        val=""                     ##Nullify val.
    }
    next                           ##next skips all further statements.
}
file{                              ##If file is NOT null.
    print >> (file)                ##Append the current line to the output file.
}
' Input_file                       ##Mention the Input_file name.
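
As a sanity check, here is a run on a hypothetical two-record input in the same shape. The record layout and the RF accession IDs are assumptions; the question's real samples may differ. Note that the line supplying the file name (its last field) is consumed and not written to the output:

```shell
# Assumed input shape: a "# STOCKHOLM 1.0" header, a blank line, then a
# line whose last field names the record.
cat > Input_file <<'EOF'
# STOCKHOLM 1.0

#=GF AC RF00001
seqA 1111
//
# STOCKHOLM 1.0

#=GF AC RF00002
seqB 2222
//
EOF

awk '
/STOCKHOLM/{
    close(file)
    file=count=""
}
(/STOCKHOLM/ || !NF) && !file{
    val=(val?val ORS:"")$0
    count++
    next
}
count==2{
    count=""
    file=$NF"_full.txt"
    if(val){
        print val > (file)
        val=""
    }
    next
}
file{
    print >> (file)
}
' Input_file
# RF00001_full.txt and RF00002_full.txt each hold one record.
```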

Splitting text files based on a pattern, with dynamic output file names embedded in the file

I would open a new file on a line starting with Country and copy everything into it until Delimiter is found:

with open(input_file) as f:
    copy = False
    out = None
    for line in f:
        if copy:
            _ = out.write(line)
            if line.strip() == 'Delimiter':
                out.close()
                copy = False
        elif line.strip().startswith('Country'):
            file = line.split(':', 1)[1].split()[0]
            out = open(file + '.txt', 'w')
            _ = out.write(line)
            copy = True
    if out and not out.closed:
        out.close()

Splitting files based on some pattern and the information inside the chunk

Based on your comment: "I'd like to have a file with only chunks that have SEQ and another file with chunks of text that do not have SEQ"

In Perl, I'd do it like this:

#!/usr/bin/env perl

use strict;
use warnings;

open ( my $has_seq, '>', 'SEQ' ) or die $!;
open ( my $no_seq, '>', 'NO_SEQ' ) or die $!;
my $seq_count = 0;
my $no_seq_count = 0;

local $/ = 'END';

#iterate stdin or files specified on command line, just like sed/grep
while ( <> ) {
    #check if this chunk contains the word 'SEQ'.
    #regex match, so it'll match this text anywhere.
    #maybe need to tighten up to ^SEQ= or similar?
    if ( m/SEQ/ ) {
        #choose output filehandle
        $seq_count++;
        select $has_seq;
    }
    else {
        $no_seq_count++;
        select $no_seq;
    }
    #print current block to selected filehandle.
    print;
}

select *STDOUT;
print "SEQ: $seq_count\n";
print "No SEQ: $no_seq_count\n";

This'll create two files (called creatively "SEQ" and "NO_SEQ") and split the results from your source.

Split file into several files based on condition and also number of lines approximately

When each Order Header has a lot of records, you might consider the simple

csplit -z sample.txt '/00000,/' '{*}'

This will make a file for each Order Header. It ignores the ~40K requirement and may produce very many files, so it is only a viable solution when you have a limited number (perhaps 40?) of different Order Headers.
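
For example, with a toy sample.txt in an assumed layout (-s is added only to silence the byte counts csplit normally prints; GNU csplit numbers output files consecutively from xx00 even when -z elides the empty leading piece):

```shell
cat > sample.txt <<'EOF'
00000,HDR,A
1,detail
00000,HDR,B
2,detail
EOF

csplit -z -s sample.txt '/00000,/' '{*}'
# xx00 holds the first Order Header's block, xx01 the second.
```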

When you do want different headers combined in a file, consider

awk -v max=40000 '
function flush() {
    if (last+nr>max || sample==0) {
        outfile="sample_" sample++ ".txt"
        last=0
    }
    for (i=0;i<nr;i++) print a[i] >> outfile
    last+=nr
    nr=0
}
BEGIN { sample=0 }
/00000,/ { flush() }
{ a[nr++]=$0 }
END { flush() }
' sample.txt
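
A scaled-down run with max=4 shows the grouping behaviour (toy data; real input would use max=40000). Each header block here is 2 lines, so two blocks fit under the limit and the third spills into the next file:

```shell
cat > sample.txt <<'EOF'
00000,Order-A
a1
00000,Order-B
b1
00000,Order-C
c1
EOF

rm -f sample_0.txt sample_1.txt   # the script appends, so start clean
awk -v max=4 '
function flush() {
    if (last+nr>max || sample==0) {
        outfile="sample_" sample++ ".txt"
        last=0
    }
    for (i=0;i<nr;i++) print a[i] >> outfile
    last+=nr
    nr=0
}
BEGIN { sample=0 }
/00000,/ { flush() }
{ a[nr++]=$0 }
END { flush() }
' sample.txt
# sample_0.txt gets Order-A and Order-B (4 lines, within max);
# sample_1.txt gets Order-C, which would have pushed it past max.
```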

Split one file into multiple files based on pattern


#!/usr/bin/perl

undef $/;            # slurp mode: read the whole input at once
$_ = <>;
$n = 0;

# split before each occurrence of "3d3d"; the lookahead keeps the marker
# at the start of every piece
for $match (split(/(?=3d3d)/)) {
    open(O, '>temp' . ++$n);
    print O $match;
    close(O);
}

How to split a file into multiple files based on condition match with line numbers starting from beginning in all split files in UNIX?

If you need the line numbers to restart from 1 in each split file (matching your conditions), use the following.

awk '/H.*/{count=1;close(x);x="F"++i;next}{print count++ "," $0 > x;}' words.txt

close is added to avoid the "too many open files" error that otherwise sometimes occurs.

Explanation: an explanation of the above code follows.

awk '                            ##Start the awk program.
/H.*/{                           ##If the line matches H followed by anything, do the following.
    count=1                      ##Set the variable count to 1.
    close(x)                     ##Close the file (in case it is open) named by x, to avoid the "too many open files" error.
    x="F"++i                     ##Build the file name x: the character F plus i, which is incremented by 1 each time.
    next                         ##next skips all further statements.
}
{                                ##Executed when the above condition is NOT true.
    print count++ "," $0 > x     ##Print count, a comma and the current line into the file named by x.
}
' words.txt                      ##Mention the Input_file name.
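
A quick run on a made-up words.txt (header lines are assumed to start with H):

```shell
cat > words.txt <<'EOF'
H-one
apple
banana
H-two
cherry
EOF

awk '/H.*/{count=1;close(x);x="F"++i;next}{print count++ "," $0 > x;}' words.txt
# F1 holds "1,apple" and "2,banana"; F2 holds "1,cherry".
```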

How to split a text file into smaller files based on regex pattern?

You don't need to use a regex to do this because you can detect the gap between blocks easily by using the string strip() method.

input_file = 'Clean-Junction-Links1.txt'

with open(input_file, 'r') as file:
    i = 0
    output = None

    for line in file:
        if not line.strip():  # Blank line?
            if output:
                output.close()
                output = None
        else:
            if output is None:
                i += 1
                print(f'Creating file "{i}.txt"')
                output = open(f'{i}.txt', 'w')
            output.write(line)

    if output:
        output.close()

print('-fini-')

Another, cleaner and more modular way to implement it would be to divide the processing into two independent tasks that logically have little to do with each other:

  1. Reading the file and grouping the lines of each record together.
  2. Writing each group of lines to a separate file.

The first can be implemented as a generator function which iteratively collects and yields groups of lines comprising a record. It's the one named extract_records() below.

input_file = 'Clean-Junction-Links1.txt'

def extract_records(filename):
    with open(filename, 'r') as file:
        lines = []
        for line in file:
            if line.strip():  # Not blank?
                lines.append(line)
            elif lines:  # Blank line ends a record (skip repeated blanks).
                yield lines
                lines = []
        if lines:
            yield lines

for i, record in enumerate(extract_records(input_file), start=1):
    print(f'Creating file {i}.txt')
    with open(f'{i}.txt', 'w') as output:
        output.write(''.join(record))

print('-fini-')
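
To sanity-check the record-grouping approach on its own, feed a generator in the same spirit a throwaway two-record file (toy data; the name records.txt is arbitrary):

```shell
printf 'a\nb\n\nc\nd\n' > records.txt
python3 - <<'EOF'
# A generator in the same spirit as extract_records() above,
# exercised on the toy file: report the size of each record.
def extract_records(filename):
    with open(filename) as file:
        lines = []
        for line in file:
            if line.strip():
                lines.append(line)
            elif lines:
                yield lines
                lines = []
        if lines:
            yield lines

print([len(r) for r in extract_records('records.txt')])
EOF
# prints [2, 2] -- two records of two lines each
```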


