How to File Split at a Line Number

file_name=test.log

# set first K lines:
K=1000

# line count (N):
N=$(wc -l < "$file_name")

# length of the bottom file:
L=$(( N - K ))

# create the top of file:
head -n "$K" "$file_name" > "top_$file_name"

# create bottom of file:
tail -n "$L" "$file_name" > "bottom_$file_name"

Also, on second thought, split will work in your case, since the first split is larger than the second. Split puts the balance of the input into the last split, so

split -l 300000 file_name

will output xaa with 300k lines and xab with 100k lines, for an input with 400k lines.
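For comparison, the head/tail split above can be sketched in Python as a single pass over the input, which avoids counting the lines first. The function name split_at_line is an illustrative assumption, and the top_/bottom_ prefixes simply copy the shell snippet.

```python
from itertools import islice
from pathlib import Path

def split_at_line(path, k):
    """Write the first k lines of path to top_<name> and the rest to bottom_<name>.

    Hypothetical helper; the output names mirror the shell example.
    """
    src = Path(path)
    top = src.with_name(f"top_{src.name}")
    bottom = src.with_name(f"bottom_{src.name}")
    with src.open() as infile, top.open("w") as t, bottom.open("w") as b:
        # islice consumes exactly k lines (or fewer if the file is shorter)
        t.writelines(islice(infile, k))
        # the file iterator continues where islice stopped
        b.writelines(infile)
    return top, bottom
```

Calling split_at_line("test.log", 1000) would then play the role of the head and tail commands.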

Splitting a text file using a separate file of line numbers

Bash and awk solution

# Assumption: You have a bash array named arr with the indices you want,
# like this
arr=( 1 7 14 )

counter=1

for ((i=0; i<${#arr[@]}-1; i++)); do
    # Get current index
    index="${arr[$i]}"
    # Get next index
    next_index="${arr[$i+1]}"

    awk "NR>=$index && NR<$next_index" file_to_chop.txt > "log${counter}.txt"

    (( counter++ ))
done

# If the array is non-empty, we also need to write last set of lines
# to the last file
[ "${#arr[@]}" -gt 0 ] && {
    # Get last element in the array
    index="${arr[${#arr[@]}-1]}"

    awk "NR>=$index" file_to_chop.txt > "log${counter}.txt"
}

This script won't work with a narrowly POSIX-compliant shell, since it uses several "bashisms", including arithmetic within (()).

This functions primarily by using awk's NR, which gives the record number. The expression

NR>=3

for example tells awk to only perform actions on (or in our case, print) records (or in our case, lines) with record numbers greater than or equal to 3. More complex boolean expressions involving NR can be produced using &&, for example,

NR>=3 && NR<=7
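If awk's NR is unfamiliar, the same 1-based selection can be pictured with Python's enumerate. This is only an illustration of the condition, not part of the script above, and select_records is a made-up name.

```python
def select_records(lines, first, last=None):
    """Yield lines whose 1-based number n satisfies first <= n (and n <= last, if given)."""
    for nr, line in enumerate(lines, start=1):  # nr plays the role of awk's NR
        if nr >= first and (last is None or nr <= last):
            yield line

lines = [f"line {n}" for n in range(1, 10)]
print(list(select_records(lines, 3, 7)))
# prints ['line 3', 'line 4', 'line 5', 'line 6', 'line 7']
```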

If you do not already have the indices in a bash array, you can generate the array from a file like this:

arr=()
while read -r line; do arr+=( "$line" ); done < /path/to/your/file/here

Or if you want to generate the array from the output of a command:

arr=()
while read -r line; do arr+=( "$line" ); done < <(your_command_here)

Python solution

import sys

def write_lines(filename, lines):
    try:
        with open(filename, 'w') as f:
            # end every line, including the last, with a newline
            f.writelines(line + '\n' for line in lines)
    except OSError:
        print(f'Error: failed to write to "{filename}".', file=sys.stderr)
        sys.exit(1)

if len(sys.argv) != 2:
    print('Must pass path to input file.', file=sys.stderr)
    sys.exit(1)

input_file = sys.argv[1]
line_indices = [line.rstrip() for line in sys.stdin]

try:
    with open(input_file, 'r') as f:
        input_lines = [line.rstrip('\n') for line in f]
except OSError:
    print(f'Error: failed to read from "{input_file}".', file=sys.stderr)
    sys.exit(1)

counter = 1

while len(line_indices) > 1:
    index = int(line_indices.pop(0))
    next_index = int(line_indices[0])

    # line numbers are 1-based, list slices are 0-based
    write_lines(f'log{counter}.txt', input_lines[index-1:next_index-1])

    counter += 1

if line_indices:
    index = int(line_indices[0])

    write_lines(f'log{counter}.txt', input_lines[index-1:])

This is the usage, assuming you wanted to cut a file so lines 1-6 are output to log1.txt, lines 7-13 output to log2.txt, and lines 14 and on are output to log3.txt:

printf '1\n7\n14\n' | python chop_file_script.py /path/to/file/to/chop

The way this operates is by reading stdin to see how to chop the input file into separate files. This is by design, so the required line numbers can be fed to the script from a parent shell script using a pipe (as in the usage example above).

This is not a fully robust script. It does not handle things like, for example:

  • Line numbers in stdin not being in ascending order
  • stdin containing non-numeric values
  • Numbers in stdin exceeding the length of the input file

I believe that it is fine that this script is not fully robust, as it should work correctly as long as it is used in the intended way.
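If the script is ever fed less trusted input, the three cases listed above could be checked up front. The following is a rough sketch and not part of the original script: validate_indices is a hypothetical helper, and treating repeated line numbers as an error is an assumption.

```python
def validate_indices(raw_indices, n_lines):
    """Return the indices as ints, or raise ValueError describing the first problem.

    raw_indices: strings as read from stdin; n_lines: length of the input file.
    """
    indices = []
    for raw in raw_indices:
        try:
            value = int(raw)
        except ValueError:
            raise ValueError(f"not a line number: {raw!r}") from None
        if not 1 <= value <= n_lines:
            raise ValueError(f"line number {value} is outside 1..{n_lines}")
        if indices and value <= indices[-1]:
            raise ValueError(f"line number {value} is not in ascending order")
        indices.append(value)
    return indices
```

In the script this would run right after input_lines is read, for example as validate_indices(line_indices, len(input_lines)).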

Split a file based upon line number

Maybe something like this:

#!/bin/bash

EVEN="even.log"
ODD="odd.log"

line_count=0
block_count=0
while read -r line
do
    # ignore blank lines
    if [ -n "$line" ]; then
        if [ $(( block_count % 2 )) -eq 0 ]; then
            # even-numbered block of four lines
            echo "$line" >> "$EVEN"
        else
            # odd-numbered block of four lines
            echo "$line" >> "$ODD"
        fi
        line_count=$(( line_count + 1 ))
        if [ "$line_count" -eq 4 ]; then
            block_count=$(( block_count + 1 ))
            line_count=0
        fi
    fi
done < "$1"

The first argument is the source file: ./split.sh split_input
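The same alternating-block logic, sketched in Python for readers who prefer it to shell. split_blocks is an illustrative name; the block size of 4 and the blank-line rule follow the script above.

```python
def split_blocks(lines, block_size=4):
    """Distribute non-blank lines into two lists, alternating every block_size lines."""
    even, odd = [], []
    kept = 0  # counts the non-blank lines seen so far
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines, as the shell script does
        block = kept // block_size
        (even if block % 2 == 0 else odd).append(line)
        kept += 1
    return even, odd
```

The two returned lists correspond to even.log and odd.log.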

How can I split a large text file into smaller files with an equal number of lines?

Have a look at the split command:

$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help              display this help and exit
      --version           output version information and exit

You could do something like this:

split -l 200000 filename

which will create files each with 200000 lines named xaa xab xac ...

Another option, split by size of output file (still splits on line breaks):

split -C 20m --numeric-suffixes input_filename output_prefix

creates files like output_prefix00 output_prefix01 output_prefix02 ... each at most 20 megabytes in size.
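Where split is not available, the -l behaviour can be approximated in a few lines of Python. This sketch is a stand-in under stated assumptions, not a reimplementation of split: it numbers the pieces instead of producing the aa, ab, ... suffixes, and split_by_lines and its prefix parameter are invented names.

```python
from itertools import islice

def split_by_lines(path, lines_per_file, prefix="x"):
    """Write consecutive pieces of lines_per_file lines to prefix000, prefix001, ..."""
    names = []
    with open(path) as infile:
        while True:
            # only one piece is held in memory at a time
            chunk = list(islice(infile, lines_per_file))
            if not chunk:
                break  # input exhausted
            name = f"{prefix}{len(names):03d}"
            with open(name, "w") as out:
                out.writelines(chunk)
            names.append(name)
    return names
```

As with split, the last piece receives whatever lines remain.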

How to split a text file at lines which begin with a number

You can try this awk command if the lines to split at always begin with 2021.
It generates files named fileX, where X is the number of the split, and overwrites existing files without asking.

$ awk 'BEGIN{x=1} NR>1 && /^2021/{ close("file"x); x++ }
{ print > ("file"x) }' tosplit

$ for i in file[12];do echo $i; cat $i ;done
file1
20210101 blah blah
blah 20210101
blah 20210101
blah 20210101
file2
20210315 blah blah
blah 20210315
blah 20210315
blah 20210315

A more general version, which splits at any line beginning with a digit:

$ awk 'BEGIN{x=1} NR>1 && /^[[:digit:]]/{ close("file"x); x++ }
{ print > ("file"x) }' tosplit
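The grouping rule used by the awk commands above can be sketched in Python as follows. split_at_leading_digit is a hypothetical name, and the sketch returns the groups as lists rather than writing file1, file2, ... so the rule is easier to see.

```python
def split_at_leading_digit(lines):
    """Group lines into chunks; a line starting with an ASCII digit opens a new chunk."""
    chunks = []
    for line in lines:
        first = line[:1]
        starts_with_digit = first.isascii() and first.isdigit()
        # like the NR>1 guard in awk: the very first line always opens chunk 1
        if not chunks or starts_with_digit:
            chunks.append([])
        chunks[-1].append(line)
    return chunks

sample = ["20210101 blah blah", "blah 20210101", "20210315 blah blah", "blah 20210315"]
for number, chunk in enumerate(split_at_leading_digit(sample), start=1):
    print(f"file{number}: {len(chunk)} lines")
```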

Read the file line-by-line and use the split() function to break the line into a list of integers using python

With this "tiny.txt":

5 
0 1
1 2 1 2 1 3 1 3 1 4
2 3
3 0
4 0 4 2

and this code:

def adjMatrixFromFile(file):
    Our_numbers = []
    with open(file, 'r') as f:  # closes the file when the block ends
        lines = f.readlines()

    for i in lines:
        i = i.replace('\n', '')  # remove the trailing \n
        numbers = i.split(' ')
        numbers = filter(None, numbers)  # remove '' from the list
        Our_numbers.extend(numbers)  # add values from one list to another

    Our_numbers = [int(i) for i in Our_numbers]  # convert all elements str -> int
    return Our_numbers

print(adjMatrixFromFile("tiny.txt"))

I got this output:

[5, 0, 1, 1, 2, 1, 2, 1, 3, 1, 3, 1, 4, 2, 3, 3, 0, 4, 0, 4, 2]
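Calling split() with no argument already discards the newline and runs of spaces, so the replace and filter steps can be folded away. The sketch below also keeps one list per line, on the assumption that the line structure matters when building an adjacency matrix; read_int_rows is a hypothetical name.

```python
def read_int_rows(path):
    """Return one list of ints per line of the file, skipping lines without numbers."""
    with open(path) as f:
        rows = [[int(tok) for tok in line.split()] for line in f]
    return [row for row in rows if row]
```

Flattening the result with [n for row in read_int_rows("tiny.txt") for n in row] gives the same single list as the output above.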

Split a txt file into multiple files, with the number of lines per file set by the user

Here's a way to do it using streams. This has the benefit of not needing to read it all into memory at once, allowing it to work on very large files.

Console.Write("> ");
var maxLines = int.Parse(Console.ReadLine());

var filename = ofd.FileName;
using var readStream = new StreamReader(File.OpenRead(filename));

var nameBase = filename[0..^4]; // strip .txt

var parts = 0;
while (!readStream.EndOfStream)
{
    parts++;
    // File.Create truncates an existing part file; File.OpenWrite would not
    using var writer = new StreamWriter(File.Create($"{nameBase}-{parts}.txt"));
    for (int i = 0; i < maxLines && !readStream.EndOfStream; i++)
    {
        writer.WriteLine(readStream.ReadLine());
    }
}

// parts now equals the number of files actually written
Console.WriteLine($"Done splitting the file into {parts} parts.");

Split text file to multiple files by specific string or line number using python

This can be done in a single pass of the input file as follows:

key = 'Weight Total'

outfile = None
fileno = 0
lineno = 0

with open('textfile_1.txt') as infile:
    while line := infile.readline():
        lineno += 1
        if outfile is None:
            fileno += 1
            outfile = open(f'E{fileno}.txt', 'w')
        outfile.write(line)
        if key in line:
            print(f'"{key}" found in line {lineno}')
            outfile.close()
            outfile = None
if outfile:
    outfile.close()

Note:

This assumes that any line containing the key is to be included in the output file(s), as the last line of the file it closes.
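The question title also mentions splitting by line number. One way to cover both cases, sketched here under the assumption that a chunk always ends after the matching line, is to pass the closing rule in as a function. split_after and should_close are illustrative names, not part of the answer above.

```python
def split_after(lines, should_close):
    """Yield lists of lines; a chunk ends after any line for which
    should_close(lineno, line) is true (lineno is 1-based)."""
    chunk = []
    for lineno, line in enumerate(lines, start=1):
        chunk.append(line)
        if should_close(lineno, line):
            yield chunk
            chunk = []
    if chunk:  # trailing lines after the last closing line
        yield chunk

text = ["a", "Weight Total 1", "b", "c", "Weight Total 2", "d"]

# split after every line containing the key: chunks of 2, 3 and 1 lines
by_key = list(split_after(text, lambda n, line: "Weight Total" in line))

# split after lines 2 and 4: chunks of 2, 2 and 2 lines
by_number = list(split_after(text, lambda n, line: n in {2, 4}))
```

Each yielded chunk could then be written to its own E1.txt, E2.txt, ... file as in the loop above.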

How to split the csv file with line numbers which pass as parameter and save into different files

Here's how you would do it in Python 3.

import argparse
import time
from itertools import zip_longest

def grouper(n, iterable, fill_value=None):
    # the same iterator repeated n times, so each tuple takes n consecutive items
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fill_value)

def splitter(n_lines, file):
    with open(file) as f:
        for i, payload in enumerate(grouper(n_lines, f, fill_value=''), 1):
            f_name = f"{time.strftime('%Y%m%d-%H%M%S')}_{i*n_lines}.log"
            with open(f_name, 'w') as out:
                out.writelines(payload)

def get_parser():
    parser = argparse.ArgumentParser(description="File splitter")
    parser.add_argument("file", metavar="FILE", type=str, help="Target file to be chopped up")
    # nargs="?" makes the positional optional, so that default=2 can take effect
    parser.add_argument("n_lines", type=int, nargs="?", default=2, help="Number of lines to output per file")
    return parser

def command_line_runner():
    parser = get_parser()
    args = vars(parser.parse_args())
    splitter(args['n_lines'], args['file'])

if __name__ == "__main__":
    command_line_runner()

Sample run: python3 main.py sample.csv 2 produces 3 files:

20200921-095943_2.log
20200921-095943_4.log
20200921-095943_6.log

The first two have two lines each and the last one, well, one line.

The contents of sample.csv are as in your example:

1,Network activity,ip-dst,80.179.42.44,,1,20160929
2,Payload delivery,md5,4ad2924ced722ab65ff978f83a40448e,,1,20160929
3,Network activity,domain,alkamaihd.net,,1,20160929
4,Payload delivery,md5,197c018922237828683783654d3c632a,,1,20160929
5,Network activity,domain,dnsrecordsolver.tk,,1,20160929
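The grouper helper is probably the least obvious part of the script, so here it is run on its own. The five short strings stand in for the five lines of sample.csv; they are placeholder data, not the real rows.

```python
from itertools import zip_longest

def grouper(n, iterable, fill_value=None):
    # the same iterator repeated n times, so each tuple takes n consecutive items
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fill_value)

rows = ["1\n", "2\n", "3\n", "4\n", "5\n"]
groups = list(grouper(2, rows, fill_value=""))
print(groups)
# prints [('1\n', '2\n'), ('3\n', '4\n'), ('5\n', '')]
```

The last tuple is padded with an empty string, which writelines writes as nothing, so the third file ends up with a single line.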

