How to Split a File and Keep the First Line in Each of the Pieces

How to split a file and keep the first line in each of the pieces?

This is robhruska's script cleaned up a bit:

tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "$file"
done

I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.

If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard-coded one.
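A minimal sketch of the mktemp variant, wrapped in a function so the input file, piece size and prefix are not hard-coded. The name split_keep_header and its argument order are made up for this example:

```shell
# Hypothetical helper: split_keep_header INPUT LINES PREFIX
split_keep_header() {
    input=$1
    lines=$2
    prefix=$3
    # Split everything after the first line into pieces of LINES lines.
    tail -n +2 "$input" | split -l "$lines" - "$prefix"
    for piece in "$prefix"*; do
        [ -f "$piece" ] || continue      # nothing was split off
        tmp=$(mktemp) || return 1        # unique temporary file
        head -n 1 "$input" > "$tmp"      # header first
        cat "$piece" >> "$tmp"           # then the lines of this piece
        mv -f "$tmp" "$piece"
    done
}
```

Called as split_keep_header file.txt 4 split_, this should behave like the loop above. Note that files created by mktemp are usually readable only by their owner, so the pieces may end up with tighter permissions than split would have given them.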

Edit

Using GNU split it's possible to do this:

split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_

Broken out for readability:

split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_

When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.

A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat" for example.
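As a sketch of a filter that changes the filename, the following passes the header through an exported variable instead of an exported function and appends .csv to each piece. It assumes GNU split; the function name split_with_filter and the variable HEADER are invented for this example:

```shell
# Hypothetical helper: split_with_filter INPUT LINES PREFIX
split_with_filter() {
    input=$1
    lines=$2
    prefix=$3
    # The filter runs in a child shell, so the header must be exported.
    HEADER=$(head -n 1 "$input")
    export HEADER
    # SHELL=/bin/sh makes split run the filter with a POSIX shell.
    # $HEADER and $FILE are expanded by that shell, not by this one,
    # which is why the filter is in single quotes.
    tail -n +2 "$input" |
        SHELL=/bin/sh split --lines="$lines" \
            --filter='{ printf "%s\n" "$HEADER"; cat; } > "$FILE.csv"' \
            - "$prefix"
}
```

With --filter, split does not create the output files itself, so only the .csv names written by the filter should appear.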

Split CSV files into smaller files but keeping the headers?

Yes, this is possible with AWK.

The idea is to keep the header in mind and print all the rest in filenames of the form filename.00001.csv:

awk -v l=11000 '(NR==1){header=$0;next}
(NR%l==2) {
close(file);
file=sprintf("%s.%0.5d.csv",FILENAME,++c)
sub(/csv[.]/,"",file)
print header > file
}
{print > file}' file.csv

This works in the following way:

  • (NR==1){header=$0;next}: If the record/line is the first line, save that line as the header.
  • (NR%l==2){...}: After every l=11000 records/lines, we need to start writing to a new file. This happens whenever the record/line number modulo l equals 2, that is, on lines 2, 2+l, 2+2l, 2+3l, ... (this assumes l is greater than 2; for l of 1 or 2 the remainder is never 2). When such a line is found we do the following:

    • close(file): close the file you just wrote to.
    • file=sprintf("%s.%0.5d.csv",FILENAME,++c); sub(/csv[.]/,"",file): define the new filename. sprintf produces file.csv.00001.csv and sub removes the first "csv.", leaving file.00001.csv.
    • print header > file: open the file and write the header to that file.
  • {print > file}: write the entries to the file.
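One caveat with the NR%l==2 test is that it never fires when l is 1 or 2, because the remainder is then always less than 2. A sketch of a variant that tests (NR-2)%l==0 instead, wrapped in a shell function whose name and argument order are made up here:

```shell
# Hypothetical helper: split_csv_keep_header FILE [LINES_PER_PIECE]
split_csv_keep_header() {
    awk -v l="${2:-11000}" '
        NR == 1 { header = $0; next }     # remember the header line
        (NR - 2) % l == 0 {               # first data row of each piece
            if (file != "") close(file)   # finish the previous piece
            file = sprintf("%s.%05d.csv", FILENAME, ++c)
            sub(/csv[.]/, "", file)       # file.csv.00001.csv -> file.00001.csv
            print header > file
        }
        { print > file }                  # every data row goes to the current piece
    ' "$1"
}
```

For example, split_csv_keep_header data.csv 2 on a file with a header and five data rows should give data.00001.csv to data.00003.csv, the last one holding the header and a single row.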

Note: If you don't care about the filename, you can use the following shorter version:

awk -v m=100 '
(NR==1){h=$0;next}
(NR%m==2) { close(f); f=sprintf("%s.%0.5d",FILENAME,++c); print h > f }
{print > f}' file.csv

How can I split a large text file into smaller files with an equal number of lines?

Have a look at the split command:

$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N use suffixes of length N (default 2)
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes use numeric suffixes instead of alphabetic
-l, --lines=NUMBER put NUMBER lines per output file
--verbose print a diagnostic to standard error just
before each output file is opened
--help display this help and exit
--version output version information and exit

You could do something like this:

split -l 200000 filename

which will create files each with 200000 lines named xaa xab xac ...

Another option, split by size of output file (still splits on line breaks):

split -C 20m --numeric-suffixes input_filename output_prefix

creates files like output_prefix00 output_prefix01 output_prefix02 ... each at most 20 megabytes in size.

How to split CSV files as per number of rows specified?

Made it into a function. You can now call splitCsv <Filename> [chunkSize]

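splitCsv() {
# Header line of the input file
HEADER=$(head -n 1 "$1")
# Lines per piece: second argument, or 1000 by default
if [ -n "$2" ]; then
CHUNK=$2
else
CHUNK=1000
fi
# Split everything after the header, then put the header back on each piece
tail -n +2 "$1" | split -l "$CHUNK" - "$1_split_"
for i in "$1"_split_*; do
sed -i -e "1i$HEADER" "$i"
done
}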

Found on: http://edmondscommerce.github.io/linux/linux-split-file-eg-csv-and-keep-header-row.html

Split large csv file and keep header in each part

First you need to separate the header from the content:

header=$(head -n 1 "$file")
data=$(tail -n +2 "$file")

Then you want to split the data:

# Quote "$data": unquoted, the shell would join all lines into one
printf '%s\n' "$data" | split [options...] -

In the options you have to specify the size of the chunks and the pattern for the name of the resulting files. The trailing - must not be removed, as it tells split to read data from stdin.

Then you can insert the header at the top of each file

sed -i "1i$header" "$splitOutputFile"

You should obviously do that last part in a for loop, but its exact code will depend on the prefix chosen for the split operation.
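The loop might look something like the sketch below. It assumes GNU sed (for -i and the one-line form of the i command) and a header without backslashes, which sed would interpret. The function name add_header_to_pieces and the example prefix part_ are placeholders, not anything split requires:

```shell
# Hypothetical helper: add_header_to_pieces HEADER PREFIX
add_header_to_pieces() {
    header=$1    # header line to insert
    prefix=$2    # prefix that was passed to split, e.g. part_
    for piece in "$prefix"*; do
        [ -f "$piece" ] || continue    # the glob matched nothing
        sed -i "1i$header" "$piece"    # insert the header before line 1
    done
}
```

After something like tail -n +2 "$file" | split -l 1000 - part_, calling add_header_to_pieces "$header" part_ would put the header on every piece. Any older files with the same prefix would get a header too, so it is safest to run this in a clean directory.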

Split a large flat file by first two characters on each line

Using awk you can do:

awk -F, '{fn=$1 ".txt"; print > fn}' file

If you want to keep it clean by closing all file handles in the end use this awk:

awk -F, '!($1 in files){files[$1]=$1 ".txt"} {print > files[$1]}
END {for (f in files) close(files[f])}' file
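The commands above key on the first comma-separated field. If the key really is the first two characters of each line, a sketch along the same lines could use substr instead. The function name split_by_prefix is made up, and the sketch assumes those two characters are safe to use in a filename (no slashes, for instance):

```shell
# Hypothetical helper: split_by_prefix FILE
split_by_prefix() {
    awk '{
        fn = substr($0, 1, 2) ".txt"   # output name from the first two characters
        files[fn]                      # remember every file that was opened
        print > fn
    }
    END { for (fn in files) close(fn) }' "$1"
}
```

With lines AA1, BB2 and AA3 in the input, this should produce AA.txt holding AA1 and AA3, and BB.txt holding BB2.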

How to split a file into equal parts, without breaking individual lines?

If you mean an equal number of lines, split has an option for this:

split --lines=75

If you need to know what that 75 should really be for N equal parts, it's:

lines_per_part = int((total_lines + N - 1) / N)

where total_lines can be obtained with wc -l.

See the following script for an example:

#!/usr/bin/bash

# Configuration stuff

fspec=qq.c
num_files=6

# Work out lines per file.

total_lines=$(wc -l <${fspec})
((lines_per_file = (total_lines + num_files - 1) / num_files))

# Split the actual file, maintaining lines.

split --lines=${lines_per_file} ${fspec} xyzzy.

# Debug information

echo "Total lines = ${total_lines}"
echo "Lines per file = ${lines_per_file}"
wc -l xyzzy.*

This outputs:

Total lines = 70
Lines per file = 12
12 xyzzy.aa
12 xyzzy.ab
12 xyzzy.ac
12 xyzzy.ad
12 xyzzy.ae
10 xyzzy.af
70 total

More recent versions of split allow you to specify a number of CHUNKS with the -n/--number option. You can therefore use something like:

split --number=l/6 ${fspec} xyzzy.

(that's ell-slash-six, meaning lines, not one-slash-six).

That will give you roughly equal files in terms of size, with no mid-line splits.

I mention that last point because it doesn't give you roughly the same number of lines in each file, more the same number of characters.

So, if you have one 20-character line and 19 1-character lines (twenty lines in total) and split to five files, you most likely won't get four lines in every file.
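To try this out, here is a small sketch that builds exactly that file and splits it five ways. It assumes a GNU split recent enough to support --number; the function name and the file names sample.txt and chunk. are arbitrary:

```shell
# Hypothetical demo: writes sample.txt and chunk.* into the current directory
demo_split_by_chunks() {
    {
        printf '%s\n' 'xxxxxxxxxxxxxxxxxxxx'   # one 20-character line
        i=0
        while [ "$i" -lt 19 ]; do              # nineteen 1-character lines
            printf 'x\n'
            i=$((i + 1))
        done
    } > sample.txt
    split --number=l/5 sample.txt chunk.
    wc -l chunk.*                              # line count of each piece
}
```

The wc output should list five pieces whose line counts add up to 20 but are not four each, since split balances bytes rather than lines.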


