How to Split My File into Multiple Files

How can I split one text file into multiple *.txt files?

You can use the GNU coreutils command split:

split -b 1M -d file.txt file

Note that both M and MB are accepted, but they mean different sizes: MB is 1000 * 1000 bytes, while M is 1024 * 1024 bytes.
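For example (the part_ prefix here is arbitrary):

split -b 1M  file.txt part_    # pieces of 1,048,576 bytes each
split -b 1MB file.txt part_    # pieces of 1,000,000 bytes each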

If you want to split by lines instead, use the -l option.
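For example, to put 10000 lines in each piece, with numeric suffixes:

split -l 10000 -d file.txt file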

UPDATE

lines=$(( ($(wc -l < file.txt) + 11) / 12 )) ; split -l "$lines" -d file.txt file   # round up so you get 12 pieces

Another solution, as suggested by Kirill: you can do something like the following

split -n l/12 file.txt

Note that this is the letter l, not the number one. split -n accepts several chunk formats: N, K/N, l/N, l/K/N, r/N, and r/K/N.
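For illustration, with GNU split:

split -n 3 file.txt       # 3 pieces of roughly equal byte size (lines may be cut)
split -n l/3 file.txt     # 3 pieces, never splitting a line
split -n l/2/3 file.txt   # print only the 2nd of 3 line-based pieces to stdout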

How to split a large text file in Windows?

If you have installed Git for Windows, you should have Git Bash installed, since that comes with Git.

Use the split command in Git Bash to split a file:

  • into files of size 500MB each: split myLargeFile.txt -b 500m

  • into files with 10000 lines each: split myLargeFile.txt -l 10000

Tips:

  • If you don't have Git/Git Bash, download it from https://git-scm.com/download

  • If you lost the shortcut to Git Bash, you can run it using C:\Program Files\Git\git-bash.exe

That's it!


I always like examples though...

Example:

[Screenshot: terminal session showing the output files produced by a plain split command]

You can see in this image that the files generated by split are named xaa, xab, xac, etc.

These names are made up of a prefix and a suffix, which you can specify. Since I didn't specify what I want the prefix or suffix to look like, the prefix defaulted to x, and the suffix defaulted to a two-character alphabetical enumeration.

Another Example:

This example demonstrates

  • using a filename prefix of MySlice (instead of the default x),
  • the -d flag for using numerical suffixes (instead of aa, ab, ac, etc...),
  • and the option -a 5 to tell it I want the suffixes to be 5 digits long:

[Screenshot: terminal session showing the MySlice example described above]
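The exact command isn't preserved with the screenshot, but based on the bullets above it would be:

split -d -a 5 myLargeFile.txt MySlice

producing files named MySlice00000, MySlice00001, and so on.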

How can I split a large text file into smaller files with an equal number of lines?

Have a look at the split command:

$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help              display this help and exit
      --version           output version information and exit

You could do something like this:

split -l 200000 filename

which will create files named xaa, xab, xac, ..., each with 200000 lines.

Another option, split by size of output file (still splits on line breaks):

 split -C 20m --numeric-suffixes input_filename output_prefix

creates files like output_prefix01 output_prefix02 output_prefix03 ... each of maximum size 20 megabytes.

How to split a .rtf file into multiple files based on date line?

awk to the rescue!

$ awk '/^[0-9]{4}(-[0-9]{2}){2}$/ {close(f); f=$0".rtf"} f{print > f}' file

==> 2020-11-11.rtf <==
2020-11-11

Superheterodyne receiver - Wikipedia - A superheterodyne receiver, often shortened to superhet, is a type of radio receiver that uses frequency mixing to convert a received signal to a fixed intermediate frequency (IF) which can be more conveniently processed than the original carrier frequency. Protected Cell Unprotected Cell Logic Gates from Transistors: Transistors and Boolean Logic - YouTube -

==> 2020-11-15.rtf <==
2020-11-15

How to properly use a NanoVNA V2 Vector Network Analyzer (Tutorial) - YouTube - VNA NanoVNA

Impedence matching an antenna Reflection VNA vs Spectrum Analyzer? Usually impedence of 50 ohms VSWR Meter SWR - Standing Wave Ratio SWR = 1 means there is no reflected power
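For readability, here is the same program spread out, with comments (same logic as the one-liner above):

awk '
  /^[0-9]{4}(-[0-9]{2}){2}$/ {   # a line that is exactly a YYYY-MM-DD date
    close(f)                     # close the previous output file, if any
    f = $0 ".rtf"                # name the next output file after the date
  }
  f { print > f }                # once a file is open, copy every line into it
' file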

How to split a big 30GB bz2 file into multiple small bz2 files and add a header to each

You can use the bz2 module to open BZ2-compressed files and treat them as regular file objects. There is a minor performance advantage to reading and writing in binary. Assuming your data is either ASCII or UTF-8, and no tab characters in the data need to be escaped, you can just read the file line by line, opening and writing the outputs as new timestamps appear.

import bz2
import os

outfile = None
date = b""

with bz2.open("file") as fileobj:
    for line in fileobj:
        # get date from, ex. "2021-05-01 00:00:00", timestamp
        new_date = line.split(b"\t")[7].split(b" ")[0]
        # roll to new file as needed, appending, so existing data not overwritten
        if new_date != date:
            date = new_date
            if outfile:
                outfile.close()
            new_file = f"splitted_file_{new_date.decode()}.csv.bz2"
            exists = os.path.exists(new_file)
            outfile = bz2.open(new_file, "ab")
            if not exists:
                outfile.write(b"\t".join([b"a", b"b", b"c", b"d", b"e", b"f", b"timestamp"]) + b"\n")
        # write the row
        outfile.write(line)

if outfile:
    outfile.close()

You may be able to speed this up with a pipeline: hand the decompression and compression off to separate bzip2 processes, which will run in parallel on different cores. Instead of a shell pipeline, you can create the pipes and files in the script itself. Assuming bzip2 exists on your system, you could do the following. I added the tqdm module to print progress along the way.

#!/usr/bin/env python3

import subprocess as subp
from pathlib import Path
import sys
import tqdm

# TODO: Better command line
try:
    in_file_name = Path(sys.argv[1])
except IndexError:
    print("usage: unbzcsv.py filename")
    exit(1)

# build the format string used for generating output file names
out_file_name_fmt = "{}-{{}}.{}".format(*in_file_name.name.split(".", maxsplit=1))
out_file_name = None
date = b""
bzwriter = None
bzfile = None

# run bzip2 to decompress to stdout
bzreader = subp.Popen(["bzip2", "--decompress", "--stdout", in_file_name],
                      stdin=subp.DEVNULL, stdout=subp.PIPE)

# use tqdm to display progress as line count
progress = tqdm.tqdm(bzreader.stdout, desc="Lines", unit=" lines", unit_scale=True)

# read lines and fan out to files
try:
    for line in progress:
        # get date from, ex. "2021-05-01 00:00:00", timestamp
        new_date = line.split(b"\t")[7].split(b" ")[0]
        # roll to new file as needed, appending, so existing data not overwritten
        if new_date != date:
            date = new_date
            out_file_name = out_file_name_fmt.format(date.decode("utf-8"))
            if bzwriter is not None:
                bzwriter.stdin.close()
                bzwriter.wait()
                bzwriter = None
                bzfile.close()
            print("\nwriting", out_file_name)
            progress.refresh()
            bzfile = open(out_file_name, "wb")
            bzwriter = subp.Popen(["bzip2", "--compress"],
                                  stdin=subp.PIPE, stdout=bzfile)
        # write the row
        bzwriter.stdin.write(line)
finally:
    bzreader.terminate()  # in case of error
    if bzwriter:
        bzwriter.stdin.close()
        bzwriter.wait()
    if bzfile:
        bzfile.close()
    bzreader.wait()
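To run it, assuming the script is saved as unbzcsv.py (the name used in its usage message):

python3 unbzcsv.py file.csv.bz2

This writes files such as file-2021-05-01.csv.bz2 into the current directory, following the out_file_name_fmt pattern built above.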

Split one file into multiple files based on a pattern

A minor adjustment of the same solution should work.

Have you read https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html ?

for f in *.faa; do
  awk '/^>/ { file=substr($1,2,2) "." FILENAME }
       { print >> file; close(file) }' "$f"
done

Results:

$: grep . [0-9][0-9].*.faa
15.plate9.H6.phages.faa:>15_fragment_1_38 (32335..32991) 1 K01356 lexA; repressor LexA [EC:3.4.21.88]
15.plate9.H6.phages.faa:MIRRMNKWYEVARQVMDTQQISQEEMAERMGVTPGAVGHWLNGKREPKIEVINRLLGELGLPILTTSIPWNEPGQQNVAPTEQPSRFYRYPVISWVEAGGWNEAVEPYPVGYSDTFELSDYKAKGRAFWLVVRGDSMTAPAGQSIPEGMLILVDTGIEPTPGKLVIAKLPESNEATFKKLVEDAGRYFLKPLNPAYPTIAISEECKLIGVIRQMTMRL*
15.plate9.H6.phages.faa:>15_fragment_1_39 (33140..33397) 1 VOG04172 REFSEQ hypothetical protein
15.plate9.H6.phages.faa:MQNLRPEASQHDAYLALAQRIQDLITSPKAQIEHQVLLVREPGESPVHWEQIVEQISEAEGINVTRNFENGSVNVSWYVESADAY*
39.plate9.H6.phages.faa:>39_fragment_4_246 (275156..276328) -1 K14059 int; integrase
39.plate9.H6.phages.faa:MGRDGRGVRAVSDTSIEITFMYRGVRCRERITLKPSPTNLKKAEQHKAAIEHAISIGAFDYSVTFPGSPRAAKFAPEANRETVAGFLTRWLDGKKRHVSSSTFVGYRKLVELRLVPALGERMVVDLKRKDVRDWLSTLEVSNKTLSNIQSCLRSALNDAAEEELIEVNPLAGWTYSRKEAPAKDDDVDPFSPEEQQAVLAALNGQARNMMQFALWTGLRTSELVALDWGDIDWLREEVMVSRAMTQAAKGQAEVPKTAAGRRSVKLLRPAMEALKAQKAHTFLADAEVFQNPRTLQRWAGDEPIRKTMWVPAIKKAGVNYRRPYQTRHTYASMMLSAGEHPMWVAKQMGHSDWTMIARVYGRWMPYWDDIAGTKAVSQWAENAHESSDSK*

Split a file into multiple gzip files in one line

With awk:

awk '
  !d[$1]++ {
    system("mkdir -p " $1)
    c[$1] = "gzip -5 -c | split -d -a 3 -b 100000000 - " $1 "/" $1 ".gz."
  }
  { print | c[$1] }
' data.txt
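For example, lines whose first field is, say, apple end up in files like the following (the numeric suffixes come from the -d and -a 3 options):

apple/apple.gz.000
apple/apple.gz.001
...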

Assumes:

  • sufficiently few distinct values of $1 (there is an implementation-specific limit on how many pipes can be active simultaneously - e.g. popen() on my machine seems to allow 1020 pipes per process)
  • no problematic characters in $1

Incorporating improvements suggested by @EdMorton:

  • If you have a sort that supports -s (so-called "stable sort"), you can remove the first limit above as only a single pipe will need to be active.
  • You can remove the second limit by suitable testing and quoting before you use $1. In particular, unescaped single-quotes will interfere with quoting in the constructed command; and forward-slash is not valid in a filename. (NUL (\0) is not allowed in a filename either but should never appear in a text file.)

sort -s -k1,1 data.txt | awk '
  $1 ~ "/" {
    print "Warning: unsafe character(s). Ignoring line", FNR > "/dev/stderr"
    next
  }
  $1 != prev {
    close(cmd)
    prev = $1

    # escape single-quote (\047) for use below
    s = $1
    gsub(/\047/, "\047\\\047\047", s)

    system("mkdir -p -- \047" s "\047")
    cmd = "gzip -5 -c | split -d -a 3 -b 100000000 -- - \047" s "/" s ".gz.\047"
  }
  { print | cmd }
'

Note that the code above still has gotchas:

  • for a path d1/d2/f:
    • the total length can't exceed getconf PATH_MAX d1/d2; and
    • the name part (f) can't exceed getconf NAME_MAX d1/d2

Hitting the NAME_MAX limit can be surprisingly easy: for example copying files onto an eCryptfs filesystem could reduce the limit from 255 to 143 characters.
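For example, on a typical Linux system (the exact values vary by filesystem):

$ getconf PATH_MAX d1/d2
4096
$ getconf NAME_MAX d1/d2
255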


