Split One File into Multiple Files Based on Delimiter

Most efficient method to split file into multiple files based on a column

Within awk you can redirect the output of each line to a different file whose name you build dynamically (based on $2 in this case):

$ awk -F, '{print > ("some_prefix_" $2 "_some_suffix_date")}' file

$ ls *_date
some_prefix_345_some_suffix_date some_prefix_45_some_suffix_date some_prefix_645_some_suffix_date

$ cat some_prefix_345_some_suffix_date
rec1,345,field3,....field20
rec12,345,field3,....field20

$ cat some_prefix_645_some_suffix_date
rec1,645,field3,....field20
rec34,645,field3,....field20

$ cat some_prefix_45_some_suffix_date
rec23,45,field3,....field20

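As pointed out in the comments, if you have many different values of $2 and you get a "too many open files" error, you can close each file as you go. Because a closed file is reopened on the next matching line, later writes must append (>>); a plain > would truncate the file each time it is reopened:

$ awk -F, '{fname = "some_prefix_" $2 "_some_suffix_date"
    if (a[fname]++) print >> fname; else print > fname;
    close(fname)}' file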
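
For comparison, here is a minimal Python sketch of the same close-as-you-go idea. The function name `split_by_column` and its parameters are hypothetical, chosen only for illustration; the prefix and suffix are copied from the awk example above, and the input is assumed to be plain comma-separated text without quoted fields.

```python
from pathlib import Path


def split_by_column(src, out_dir, col=1,
                    prefix="some_prefix_", suffix="_some_suffix_date"):
    """Write each line of `src` to a file named after field `col` (0-based)."""
    out_dir = Path(out_dir)
    seen = set()  # output paths already created during this run
    with open(src, newline="") as f:
        for line in f:
            fields = line.rstrip("\r\n").split(",")
            # Like awk, treat a missing field as the empty string.
            key = fields[col] if col < len(fields) else ""
            path = out_dir / f"{prefix}{key}{suffix}"
            # Truncate the first time a name is seen, append afterwards.
            mode = "a" if path in seen else "w"
            seen.add(path)
            # The with-block closes the file right away, so only one
            # output file is open at any moment.
            with open(path, mode, newline="") as out:
                out.write(line)
    return sorted(p.name for p in seen)
```

Calling `split_by_column("file", ".")` would write the output files into the current directory. Reopening a file for every line is slow on large inputs, so this trades speed for a bounded number of open file handles, just as the awk version does.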

Split a large text file into multiple files using delimiters

A solution that reads and writes at the same time, to avoid keeping anything in memory, could be:

with open('input.txt') as f:
    f_out = None
    for line in f:
        if line.startswith('[TEST]'):  # we need a new output file
            # strip() drops the trailing newline so it does not end up in the file name
            title = line.split(' ', 1)[1].strip()
            if f_out:
                f_out.close()
            f_out = open(f'{title}.txt', 'w')
        if f_out:
            f_out.write(line)
    if f_out:
        f_out.close()

How do I split a file into several files by a multi-character delimiter?

This will work robustly in any awk:

awk '/"codeView"/{close(out); out="_temp" ++c ".txt"} out!=""{print > out}' file
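
To make the logic of that one-liner easier to follow, here is a rough Python sketch of the same steps. The function name `split_on_marker` is hypothetical, and the sketch uses a plain substring test where awk uses a regular expression; for the literal marker "codeView" (including the quotes) the two behave the same.

```python
from pathlib import Path


def split_on_marker(src, out_dir, marker='"codeView"'):
    """Start a new _tempN.txt at every line that contains `marker`."""
    out_dir = Path(out_dir)
    out = None      # current output file, None until the first marker
    count = 0       # plays the role of the awk counter c
    created = []
    try:
        with open(src) as f:
            for line in f:
                if marker in line:
                    if out:
                        out.close()  # close the previous chunk first
                    count += 1
                    name = f"_temp{count}.txt"
                    out = open(out_dir / name, "w")
                    created.append(name)
                if out:  # lines before the first marker are skipped
                    out.write(line)
    finally:
        if out:
            out.close()
    return created
```

As in the awk version, the marker line itself is written as the first line of each new file, and only one output file is open at a time.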

How to split a file into multiple files based on a delimiter, and remove the delimiter also, in Unix

I think you may be trying to reinvent the wheel. awk is a great tool that can be used to split files on delimiters and perform other text processing. You may like to try the following:

awk '{ for(i=1;i<=NF;i++) print $i > "file_" i ".txt" }' RS= FS='\\$' file

Results:

Contents of file_1.txt:

{1:F195}{2:O5350646}{3:{1028:076}}{4:
:16R:GL
:16R:ADD
:19A::P//U9,1
:16S:AFO
-}{5:{MAC:00}{CHK:1C}}{S:{SAC:}{COP:S}{MAN:P2}}

Contents of file_2.txt:

{1:33339}{2:O53}{4:
:16S:G
:16R:A
:19A::H0,
:19A::H0,
:16S:ADDINFO
-}{5:{MAC:0}{CHK:4}}{S:{SAC:}{COP:S}{MAN:GP2}}

Explanation:

Setting the Record Separator to the empty string puts awk in 'paragraph mode' (by default RS is "\n", which gives line-by-line processing). Since your file doesn't look like it contains paragraphs (blocks separated by blank lines), awk will essentially treat the whole file as a single record. We then set the Field Separator to a dollar-sign character (which needs to be escaped). So for each record (and there should only be one) we loop over each field (NF is short for Number of Fields) and print it to a file named after the loop index. Note that you will get strange results if your input contains multiple paragraphs, because field i of every paragraph ends up in the same file_i.txt. Glenn's answer doesn't have this problem, but the last file it writes will contain a trailing newline. HTH.
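
The same idea can be sketched in Python under similar assumptions: the whole input is read as one record and split on the dollar sign. The function name `split_on_dollar` is hypothetical, and the sketch only approximates awk's paragraph mode by stripping leading and trailing newlines from the input.

```python
from pathlib import Path


def split_on_dollar(src, out_dir, delimiter="$"):
    """Write each delimiter-separated chunk of `src` to file_N.txt."""
    out_dir = Path(out_dir)
    # Read everything at once, as awk does with a single record.
    text = Path(src).read_text().strip("\n")
    created = []
    for i, chunk in enumerate(text.split(delimiter), start=1):
        path = out_dir / f"file_{i}.txt"
        # awk's print appends a newline after each field; do the same.
        path.write_text(chunk + "\n")
        created.append(path.name)
    return created
```

Because the whole file is held in memory, this suits inputs of moderate size; the delimiter itself never appears in the output files.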

split one file into multiple files according to columns using bash cut or awk

With awk:

awk -F '[\t;]' '{for(i=1; i<=NF; i++) print $i >> "column" i ".txt"}' file

This uses tab and semicolon as field separators. NF holds the number of fields (columns) in the current row, $i holds the content of column i, and i is the column number.

This creates 11 files. column11.txt contains:


k
p
k
k
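
A minimal Python sketch of the same column split, for readers who prefer to see the loop spelled out. The function name `split_columns` is hypothetical; the separators and the column<i>.txt naming follow the awk command above.

```python
import re
from pathlib import Path


def split_columns(src, out_dir):
    """Append field i of every line to column<i>.txt; return the column count."""
    out_dir = Path(out_dir)
    outputs = {}  # column number -> open file handle
    try:
        with open(src) as f:
            for line in f:
                text = line.rstrip("\n")
                if not text:
                    continue  # awk sees NF == 0 on an empty line
                fields = re.split(r"[\t;]", text)
                for i, value in enumerate(fields, start=1):
                    if i not in outputs:
                        # "a" mirrors the >> redirection in the awk command.
                        outputs[i] = open(out_dir / f"column{i}.txt", "a")
                    outputs[i].write(value + "\n")
    finally:
        for handle in outputs.values():
            handle.close()
    return len(outputs)
```

Since the files are opened in append mode, running it twice adds the columns a second time, which matches the behavior of >> in the awk version.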

PowerShell: Split a single file into multiple files - using string match criteria

I suggest using a switch statement, which offers both convenient and fast line-by-line reading of files via -File and regex-matching via -Regex:

$streamWriter = $null
switch -CaseSensitive -Regex -File "C:\COPIES.txt" {
  '^.(.{8}).{58}VER' { # Start of a new embedded file.
    if ($streamWriter) { $streamWriter.Close() } # Close previous output file.
    # Create a new output file.
    $fileName = $Matches[1].Trim() + '.txt'
    $streamWriter = [System.IO.StreamWriter] (Join-Path $PWD.ProviderPath $fileName)
    $streamWriter.WriteLine($_)
  }
  default { # Write subsequent lines to the same file.
    if ($streamWriter) { $streamWriter.WriteLine($_) }
  }
}
if ($streamWriter) { $streamWriter.Close() } # Close the last output file, if any.

Note: A solution using the .Substring() method of the [string] type is possible too, but would be more verbose.

  • The ^.(.{8}).{58} portion of the regex matches the first 67 characters on each line, while capturing those in (1-based) columns 2 through 9 (the file name) via capture group (.{8}), which makes the captured text available in index [1] of the automatic $Matches variable. The VER portion of the regex then ensures that the line only matches if VER is found at column position 68.

  • For efficient output-file creation, [System.IO.StreamWriter] instances are used, which is much faster than line-by-line Add-Content calls. Additionally, with Add-Content you'd have to ensure that a target file doesn't already exist, as the existing content would then be appended to.
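
To illustrate how the regex picks out header lines, here is a small Python sketch of the same approach. The function name `split_embedded` is hypothetical; the pattern is the one from the PowerShell answer above, and the sketch assumes the input is a text file in the platform's default encoding.

```python
import re
from pathlib import Path

# 1 char, 8 captured chars (columns 2-9), 58 more chars, then VER at column 68.
HEADER = re.compile(r"^.(.{8}).{58}VER")


def split_embedded(src, out_dir):
    """Start a new <name>.txt at every header line; return the names created."""
    out_dir = Path(out_dir)
    out = None
    created = []
    try:
        with open(src) as f:
            for line in f:
                m = HEADER.match(line)
                if m:
                    if out:
                        out.close()  # close the previous output file
                    name = m.group(1).strip() + ".txt"
                    out = open(out_dir / name, "w")
                    created.append(name)
                if out:  # header and following lines go to the current file
                    out.write(line)
    finally:
        if out:
            out.close()
    return created
```

Like the StreamWriter version, opening with "w" replaces any existing file of the same name instead of appending to it.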


