How to split a file at a line number
file_name=test.log
# number of lines in the top file (K):
K=1000
# total line count (N):
N=$(wc -l < "$file_name")
# length of the bottom file:
L=$(( N - K ))
# create the top of the file:
head -n "$K" "$file_name" > "top_$file_name"
# create the bottom of the file:
tail -n "$L" "$file_name" > "bottom_$file_name"
Also, on second thought, split will work in your case, since the first split is larger than the second. split puts the balance of the input into the last split, so
split -l 300000 file_name
will output xaa with 300k lines and xab with 100k lines, for an input with 400k lines.
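For comparison, the same top/bottom split can be sketched in Python without counting the lines first. This is only a minimal sketch: the function name split_at_line and its parameters are made up for illustration, and it assumes the input is a text file that can be read line by line.

```python
from itertools import islice


def split_at_line(path, k, top_path, bottom_path):
    """Write the first k lines of path to top_path and the rest to bottom_path."""
    with open(path) as src, open(top_path, "w") as top, open(bottom_path, "w") as bottom:
        # islice consumes at most k lines from the file iterator
        top.writelines(islice(src, k))
        # the same iterator resumes where islice stopped, so this is the remainder
        bottom.writelines(src)
```

Because the file is streamed once, there is no need for a separate line count as in the wc/head/tail version.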
Splitting a text file using a separate file of line numbers
Bash and awk solution
# Assumption: You have a bash array named arr with the indices you want,
# like this
arr=( 1 7 14 )
counter=1
for ((i=0; i<${#arr[@]}-1; i++)); do
    # Get current index
    index="${arr[$i]}"
    # Get next index
    next_index="${arr[$i+1]}"
    awk "NR>=$index && NR<$next_index" file_to_chop.txt > "log${counter}.txt"
    (( counter++ ))
done
# If the array is non-empty, we also need to write the last set of lines
# to the last file
[ "${#arr[@]}" -gt 0 ] && {
    # Get last element in the array
    index="${arr[${#arr[@]}-1]}"
    awk "NR>=$index" file_to_chop.txt > "log${counter}.txt"
}
This script won't work with a narrowly POSIX-compliant shell, since it uses several "bashisms", including arithmetic within (( )).
This works primarily by using awk's NR, which gives the record number. The expression
NR>=3
for example tells awk to only perform actions on (or in our case, print) records (or in our case, lines) with record numbers greater than or equal to 3. More complex boolean expressions involving NR can be produced using &&, for example,
NR>=3 && NR<=7
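To make the record-number test concrete, here is a rough Python analogue of the awk filter NR>=index && NR<next_index. It is only a sketch: lines_in_range is a hypothetical helper name, and the 1-based numbering is chosen to mirror NR.

```python
def lines_in_range(lines, start, stop=None):
    """Yield lines whose 1-based number n satisfies start <= n, and n < stop if stop is given."""
    for number, line in enumerate(lines, start=1):
        if stop is not None and number >= stop:
            break  # past the range, no need to read further
        if number >= start:
            yield line
```

Calling it with stop=None corresponds to the final awk "NR>=$index" call that writes everything from the last index onward.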
If you do not already have the indices in a bash array, you can generate the array from a file like this:
arr=()
while read -r line; do arr+=( "$line" ); done < /path/to/your/file/here
Or if you want to generate the array from the output of a command:
arr=()
while read -r line; do arr+=( "$line" ); done < <(your_command_here)
Python solution
import sys


def write_lines(filename, lines):
    try:
        with open(filename, 'w') as f:
            f.write('\n'.join(lines))
    except OSError:
        print(f'Error: failed to write to "{filename}".', file=sys.stderr)
        sys.exit(1)


if len(sys.argv) != 2:
    print('Must pass path to input file.', file=sys.stderr)
    sys.exit(1)

input_file = sys.argv[1]
# The line numbers at which to split arrive on stdin, one per line
line_indices = [line.rstrip() for line in sys.stdin]

try:
    with open(input_file, 'r') as f:
        input_lines = [line.rstrip() for line in f]
except OSError:
    print(f'Error: failed to read from "{input_file}".', file=sys.stderr)
    sys.exit(1)

counter = 1
# Each pair of consecutive indices delimits one output file
while len(line_indices) > 1:
    index = int(line_indices.pop(0))
    next_index = int(line_indices[0])
    write_lines(f'log{counter}.txt', input_lines[index-1:next_index-1])
    counter += 1

# The last index runs to the end of the input
if line_indices:
    index = int(line_indices[0])
    write_lines(f'log{counter}.txt', input_lines[index-1:])
This is the usage, assuming you wanted to cut a file so lines 1-6 are output to log1.txt, lines 7-13 are output to log2.txt, and lines 14 and on are output to log3.txt:
printf '1\n7\n14\n' | python chop_file_script.py /path/to/file/to/chop
The way this operates is by reading stdin to see how to chop the input file into separate files. This is by design, so the required line numbers can be fed to the script from a parent shell script using a pipe (as in the usage example above).
This is not a fully robust script. It does not handle things like, for example:
- Line numbers in stdin not being in ascending order
- stdin containing non-numeric values
- Numbers in stdin exceeding the length of the input file
I believe that it is fine that this script is not fully robust, as it should work correctly as long as it is used in the intended way.
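If those cases do matter for your input, the three checks listed above could be sketched roughly as follows. This is not part of the script above: parse_split_points is a hypothetical helper, and raising ValueError on the first problem is just one possible policy.

```python
def parse_split_points(raw_values, total_lines):
    """Return the split points as ints, or raise ValueError on the first problem found."""
    points = []
    for raw in raw_values:
        text = raw.strip()
        try:
            value = int(text)
        except ValueError:
            raise ValueError(f"not an integer: {text!r}") from None
        # a split point must refer to an existing line of the input file
        if value < 1 or value > total_lines:
            raise ValueError(f"line number {value} is outside 1..{total_lines}")
        # split points must be strictly ascending
        if points and value <= points[-1]:
            raise ValueError(f"line number {value} is not greater than {points[-1]}")
        points.append(value)
    return points
```

It would be called with the lines read from stdin and the number of lines in the input file, before any output file is written.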
Split a file based upon line number
Maybe something like this:
#!/bin/bash
EVEN="even.log"
ODD="odd.log"
line_count=0
block_count=0
while read -r line
do
    # ignore blank lines
    if [ -n "$line" ]; then
        if [ $(( block_count % 2 )) -eq 0 ]; then
            # even
            echo "$line" >> "$EVEN"
        else
            # odd
            echo "$line" >> "$ODD"
        fi
        line_count=$(( line_count + 1 ))
        # after 4 non-blank lines, switch to the other output file
        if [ "$line_count" -eq 4 ]; then
            block_count=$(( block_count + 1 ))
            line_count=0
        fi
    fi
done < "$1"
The first argument is the source file: ./split.sh split_input
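The block logic of the script (blocks of four non-blank lines, dealt alternately to two outputs) can also be sketched in Python. The function name split_alternating_blocks and the block_size parameter are assumptions made for this illustration; it returns two lists instead of appending to even.log and odd.log.

```python
def split_alternating_blocks(lines, block_size=4):
    """Return (even, odd): non-blank lines dealt out alternately in blocks of block_size."""
    even, odd = [], []
    kept = 0  # number of non-blank lines seen so far
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines, as the shell script does
        block = kept // block_size  # index of the block this line belongs to
        (even if block % 2 == 0 else odd).append(line)
        kept += 1
    return even, odd
```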
How can I split a large text file into smaller files with an equal number of lines?
Have a look at the split command:
$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                          before each output file is opened
      --help              display this help and exit
      --version           output version information and exit
You could do something like this:
split -l 200000 filename
which will create files, each with 200000 lines, named xaa, xab, xac, ...
Another option is to split by the size of the output file (this still splits on line breaks):
split -C 20m --numeric-suffixes input_filename output_prefix
This creates files like output_prefix00, output_prefix01, output_prefix02, ..., each with a maximum size of 20 megabytes.
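For readers who want the same fixed-line-count behaviour without the split command, a minimal Python sketch follows. The names chunk_lines and split_file and the part_0000-style output names are assumptions for illustration and do not mirror split's xaa naming.

```python
from itertools import islice


def chunk_lines(lines, lines_per_file):
    """Yield lists of at most lines_per_file lines; the last list holds the remainder."""
    iterator = iter(lines)
    while chunk := list(islice(iterator, lines_per_file)):
        yield chunk


def split_file(path, lines_per_file, prefix="part_"):
    """Write each chunk of path to prefix0000, prefix0001, ... and return the file names."""
    names = []
    with open(path) as src:
        for number, chunk in enumerate(chunk_lines(src, lines_per_file)):
            name = f"{prefix}{number:04d}"
            with open(name, "w") as out:
                out.writelines(chunk)
            names.append(name)
    return names
```

Only one chunk is held in memory at a time, so this also works for large inputs as long as a single chunk fits in memory.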
How to split a text file at lines which begin with a number
You can try this awk if the condition is always 2021.
It generates files (and overwrites existing ones without asking) with names fileX, with X being the number of the split.
$ awk 'BEGIN{x=1} NR>1 && /^2021/{ close("file"x); x++ }
{ print > ("file"x) }' tosplit
$ for i in file[12];do echo $i; cat $i ;done
file1
20210101 blah blah
blah 20210101
blah 20210101
blah 20210101
file2
20210315 blah blah
blah 20210315
blah 20210315
blah 20210315
A more generalized version, which starts a new file at any line that begins with a digit:
$ awk 'BEGIN{x=1} NR>1 && /^[[:digit:]]/{ close("file"x); x++ }
{ print > ("file"x) }' tosplit
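The grouping rule used by the awk one-liner (start a new output whenever a line begins with a digit, except on the first line) can be sketched in Python as follows. The function name split_on_leading_digit is hypothetical, and the sketch returns lists of lines rather than writing file1, file2, ...

```python
import re

LEADING_DIGIT = re.compile(r"[0-9]")


def split_on_leading_digit(lines):
    """Group lines into sections; a new section starts at each line that begins with a digit."""
    sections = []
    for line in lines:
        # the first line always opens a section, like the NR>1 guard in the awk version
        if not sections or LEADING_DIGIT.match(line):
            sections.append([])
        sections[-1].append(line)
    return sections
```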
Read the file line-by-line and use the split() function to break the line into a list of integers using python
with this "tiny.txt":
5
0 1
1 2 1 2 1 3 1 3 1 4
2 3
3 0
4 0 4 2
and this code:
def adjMatrixFromFile(file):
    Our_numbers = []
    with open(file, 'r') as f:  # the file is closed when the block ends
        lines = f.readlines()
    for i in lines:
        i = i.replace('\n', '')  # remove all \n
        numbers = i.split(' ')
        numbers = filter(None, numbers)  # remove '' in the list
        Our_numbers.extend(numbers)  # add values from a list to another list
    Our_numbers = [int(i) for i in Our_numbers]  # convert all element str -> int
    return Our_numbers


print(adjMatrixFromFile("tiny.txt"))
I got this output:
[5, 0, 1, 1, 2, 1, 2, 1, 3, 1, 3, 1, 4, 2, 3, 3, 0, 4, 0, 4, 2]
split a txt file into multiple files with the number of lines in each file being able to be set by a user
Here's a way to do it using streams. This has the benefit of not needing to read it all into memory at once, allowing it to work on very large files.
Console.Write("> ");
var maxLines = int.Parse(Console.ReadLine());
var filename = ofd.FileName;
var nameBase = filename[0..^4]; // strip .txt
var parts = 0;
using (var readStream = new StreamReader(filename))
{
    // Checking EndOfStream first avoids creating an empty trailing part
    while (!readStream.EndOfStream)
    {
        parts++;
        // The writer overwrites any existing file and is disposed at the end of each iteration
        using var writer = new StreamWriter($"{nameBase}-{parts}.txt");
        for (int i = 0; i < maxLines && !readStream.EndOfStream; i++)
        {
            writer.WriteLine(readStream.ReadLine());
        }
    }
}
Console.WriteLine($"Done splitting the file into {parts} parts.");
Split text file to multiple files by specific string or line number using python
This can be done in a single pass of the input file as follows:
key = 'Weight Total'
outfile = None
fileno = 0
lineno = 0
with open('textfile_1.txt') as infile:
    while line := infile.readline():
        lineno += 1
        if outfile is None:
            fileno += 1
            outfile = open(f'E{fileno}.txt', 'w')
        outfile.write(line)
        if key in line:
            print(f'"{key}" found in line {lineno}')
            outfile.close()
            outfile = None
if outfile:
    outfile.close()
Note: This assumes that any line containing the key is to be included in the output file(s).
How to split the csv file with line numbers which pass as parameter and save into different files
And here's how you would do it in Python3.
import argparse
import time
from itertools import zip_longest


def grouper(n, iterable, fill_value=None):
    # n references to the same iterator, so zip_longest pulls n items per group
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fill_value, *args)


def splitter(n_lines, file):
    with open(file) as f:
        for i, payload in enumerate(grouper(n_lines, f, fill_value=''), 1):
            f_name = f"{time.strftime('%Y%m%d-%H%M%S')}_{i*n_lines}.log"
            with open(f_name, 'w') as out:
                out.writelines(payload)


def get_parser():
    parser = argparse.ArgumentParser(description="File splitter")
    parser.add_argument("file", metavar="FILE", type=str, help="Target file to be chopped up")
    # nargs="?" makes this positional optional, so that default=2 can take effect
    parser.add_argument("n_lines", type=int, nargs="?", default=2, help="Number of lines to output per file")
    return parser


def command_line_runner():
    parser = get_parser()
    args = vars(parser.parse_args())
    splitter(args['n_lines'], args['file'])


if __name__ == "__main__":
    command_line_runner()
Sample run: python3 main.py sample.csv 2 produces 3 files:
20200921-095943_2.log
20200921-095943_4.log
20200921-095943_6.log
The first two have two lines each and the last one, well, one line.
The contents of sample.csv are as in your example:
1,Network activity,ip-dst,80.179.42.44,,1,20160929
2,Payload delivery,md5,4ad2924ced722ab65ff978f83a40448e,,1,20160929
3,Network activity,domain,alkamaihd.net,,1,20160929
4,Payload delivery,md5,197c018922237828683783654d3c632a,,1,20160929
5,Network activity,domain,dnsrecordsolver.tk,,1,20160929