Most efficient method to split file into multiple files based on a column
Within awk you can redirect the output of each line to a different file whose name you build dynamically (based on $2 in this case):
$ awk -F, '{print > ("some_prefix_" $2 "_some_suffix_date")}' file
$ ls *_date
some_prefix_345_some_suffix_date some_prefix_45_some_suffix_date some_prefix_645_some_suffix_date
$ cat some_prefix_345_some_suffix_date
rec1,345,field3,....field20
rec12,345,field3,....field20
$ cat some_prefix_645_some_suffix_date
rec1,645,field3,....field20
rec34,645,field3,....field20
$ cat some_prefix_45_some_suffix_date
frec23,45,field3,....field20
As pointed out in the comments, if you have many different values of $2 and you get an error for too many open files, you can close each file as you go:
$ awk -F, '{fname = "xsome_prefix_" $2 "_some_suffix_date"
            if (a[fname]++) print >> fname; else print > fname;
            close(fname)}' file
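For comparison, the same close-as-you-go idea can be sketched in Python. This is only an illustration: the function name split_by_column and its parameters are hypothetical, the split is a plain comma split (no CSV quoting), and a line with too few fields would raise an IndexError.

```python
from pathlib import Path


def split_by_column(src, out_dir, col=1,
                    prefix="some_prefix_", suffix="_some_suffix_date"):
    """Write each line of src to a file named after its field `col` (0-based)."""
    out_dir = Path(out_dir)
    seen = set()  # output files already created during this run
    with open(src) as f:
        for line in f:
            key = line.rstrip("\n").split(",")[col]
            target = out_dir / f"{prefix}{key}{suffix}"
            # Truncate on first use, append afterwards (like > versus >> in awk).
            mode = "a" if target in seen else "w"
            seen.add(target)
            # The with-block closes the file immediately, so only one
            # output file is ever open at a time.
            with open(target, mode) as out:
                out.write(line)
    return sorted(p.name for p in seen)
```

Opening and closing a file for every line is slow on large inputs; it trades speed for never hitting the open-file limit, which is the same trade-off the awk version makes.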
Split a large text file into multiple files using delimiters
A solution that reads and writes at the same time, to avoid keeping anything in memory, could be:
with open('input.txt') as f:
    f_out = None
    for line in f:
        if line.startswith('[TEST]'):  # we need a new output file
            # strip() drops the trailing newline so it does not end up in the file name
            title = line.split(' ', 1)[1].strip()
            if f_out:
                f_out.close()
            f_out = open(f'{title}.txt', 'w')
        if f_out:
            f_out.write(line)
    if f_out:
        f_out.close()
How do I split a file into several files by a multi-character delimiter?
This will work robustly in any awk:
awk '/"codeView"/{close(out); out="_temp" ++c ".txt"} out!=""{print > out}' file
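A rough Python equivalent of the awk one-liner above, as a sketch only: the function name split_on_marker and its parameters are hypothetical, and the marker is matched as a plain substring rather than a regular expression. As in the awk version, lines before the first marker are discarded.

```python
from pathlib import Path


def split_on_marker(src, out_dir, marker='"codeView"'):
    """Start a new _tempN.txt file at every line that contains `marker`."""
    out_dir = Path(out_dir)
    out = None      # current output file, None until the first marker is seen
    count = 0       # number of output files started so far
    created = []
    with open(src) as f:
        for line in f:
            if marker in line:
                if out:
                    out.close()  # finish the previous chunk
                count += 1
                path = out_dir / f"_temp{count}.txt"
                out = open(path, "w")
                created.append(path.name)
            if out:
                out.write(line)  # the marker line itself is kept
    if out:
        out.close()
    return created
```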
How to split a file into multiple files based on a delimiter, and remove the delimiter also, in Unix
I think you may be trying to reinvent the wheel. awk is a great tool that can be used to split files on delimiters and perform other text processing. You may like to try the following:
awk '{ for(i=1;i<=NF;i++) print $i > "file_" i ".txt" }' RS= FS='\\$' file
Results:
Contents of file_1.txt:
{1:F195}{2:O5350646}{3:{1028:076}}{4:
:16R:GL
:16R:ADD
:19A::P//U9,1
:16S:AFO
-}{5:{MAC:00}{CHK:1C}}{S:{SAC:}{COP:S}{MAN:P2}}
Contents of file_2.txt:
{1:33339}{2:O53}{4:
:16S:G
:16R:A
:19A::H0,
:19A::H0,
:16S:ADDINFO
-}{5:{MAC:0}{CHK:4}}{S:{SAC:}{COP:S}{MAN:GP2}}
Explanation:
Setting the record separator to the empty string (RS=) puts awk in 'paragraph mode', where records are separated by blank lines (by default RS is "\n", which gives line-by-line processing). Since your file doesn't look like it contains paragraphs, this essentially treats the whole file as a single record. The field separator is then set to a literal dollar sign (which needs to be escaped). So for each record (and there should be only one) we loop over each field (NF is short for Number of Fields) and print it to a file named after the loop index. It's worth noting that you will get strange results if your input contains multiple paragraphs. Glenn's answer (above or below) doesn't have this problem, but the last file it produces will contain a trailing newline. HTH.
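The single-record idea can also be sketched in Python. This is an illustration under assumptions: the function name split_on_delimiter and its parameters are hypothetical, and the whole file is read into memory. Unlike awk's paragraph mode, blank lines have no special meaning here, so the multiple-paragraph caveat does not apply, but whitespace around each piece is written out unchanged.

```python
from pathlib import Path


def split_on_delimiter(src, out_dir, delimiter="$"):
    """Write each delimiter-separated piece of src to file_1.txt, file_2.txt, ..."""
    out_dir = Path(out_dir)
    text = Path(src).read_text()  # whole file treated as a single record
    names = []
    for i, piece in enumerate(text.split(delimiter), start=1):
        path = out_dir / f"file_{i}.txt"
        path.write_text(piece)    # the delimiter itself is not written
        names.append(path.name)
    return names
```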
split one file into multiple files according to columns using bash cut or awk
With awk:
awk -F '[\t;]' '{for(i=1; i<=NF; i++) print $i >> "column" i ".txt"}' file
This uses tab and semicolon as field separators. NF holds the number of fields (that is, the index of the last column) in the current row, $i holds the content of the current column, and i is the number of the current column.
This creates 11 files. column11.txt contains:
k
p
k
k
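The same column-per-file split can be sketched in Python. The function name split_columns and its parameters are hypothetical. Like the awk command, it appends to the columnN.txt files, so output left over from an earlier run is kept.

```python
import re
from pathlib import Path


def split_columns(src, out_dir):
    """Append field i of every line to columni.txt; fields split on tab or ';'."""
    out_dir = Path(out_dir)
    handles = {}  # column number -> open output file
    try:
        with open(src) as f:
            for line in f:
                text = line.rstrip("\n")
                if not text:
                    continue  # an empty line has no fields
                for i, value in enumerate(re.split(r"[\t;]", text), start=1):
                    if i not in handles:
                        handles[i] = open(out_dir / f"column{i}.txt", "a")
                    handles[i].write(value + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return len(handles)  # number of column files written
```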
Powershell: Split a single file into multiple files - using string match criteria
I suggest using a switch statement, which offers both convenient and fast line-by-line reading of files via -File and regex-matching via -Regex:
$streamWriter = $null
switch -CaseSensitive -Regex -File "C:\COPIES.txt" {
  '^.(.{8}).{58}VER' { # Start of a new embedded file.
    if ($streamWriter) { $streamWriter.Close() } # Close previous output file.
    # Create a new output file.
    $fileName = $Matches[1].Trim() + '.txt'
    $streamWriter = [System.IO.StreamWriter] (Join-Path $PWD.ProviderPath $fileName)
    $streamWriter.WriteLine($_)
  }
  default { # Write subsequent lines to the same file.
    if ($streamWriter) { $streamWriter.WriteLine($_) }
  }
}
if ($streamWriter) { $streamWriter.Close() } # Close the last output file, if any.
Note: A solution using the .Substring() method of the [string] type is possible too, but would be more verbose.
The ^.(.{8}).{58} portion of the regex matches the first 67 characters on each line, while capturing those in (1-based) columns 2 through 9 (the file name) via capture group (.{8}), which makes the captured text available in index [1] of the automatic $Matches variable. The VER portion of the regex then ensures that the line only matches if VER is found at column position 68.
For efficient output-file creation, [System.IO.StreamWriter] instances are used, which is much faster than line-by-line Add-Content calls. Additionally, with Add-Content you'd have to ensure that a target file doesn't already exist, as the existing content would then be appended to.
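To make the column arithmetic in that regex concrete, here is a small Python sketch that applies the same pattern to a single line. The helper name header_name is hypothetical, and Python's strip() stands in for .Trim().

```python
import re

# 1 character + 8 captured characters + 58 characters = 67 columns,
# so VER has to start at (1-based) column 68.
HEADER = re.compile(r"^.(.{8}).{58}VER")


def header_name(line):
    """Return the output file name for a header line, or None for other lines."""
    match = HEADER.match(line)
    if match is None:
        return None
    return match.group(1).strip() + ".txt"
```

A line whose VER starts one column later does not match, because every part of the pattern before VER has a fixed width.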