How to split a file and keep the first line in each of the pieces?
This is robhruska's script cleaned up a bit:
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "$file"
done
I removed wc, cut, ls, and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.
If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard-coded one.
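For example, the loop above could use mktemp like this (a sketch; file.txt and the split_ prefix are carried over from the original script):

```shell
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    tmp_file=$(mktemp)               # unique temporary file instead of a hard-coded name
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"
done
```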
Edit
Using GNU split it's possible to do this:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
Broken out for readability:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.
A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat".
Split CSV files into smaller files but keeping the headers?
Yes, this is possible with awk. The idea is to keep the header in memory and print all the remaining records to filenames of the form filename.00001.csv:
awk -v l=11000 '(NR==1){header=$0;next}
     (NR%l==2) {
        close(file)
        file=sprintf("%s.%0.5d.csv",FILENAME,++c)
        sub(/csv[.]/,"",file)
        print header > file
     }
     {print > file}' file.csv
This works in the following way:
- (NR==1){header=$0;next}: if the record/line is the first line, save that line as the header.
- (NR%l==2){...}: every time we have written l=11000 records/lines, we need to start writing to a new file. This happens every time the modulo of the record/line number hits 2, i.e. on lines 2, 2+l, 2+2l, 2+3l, .... When such a line is found we do:
  - close(file): close the file we just wrote to.
  - file=sprintf("%s.%0.5d.csv",FILENAME,++c); sub(/csv[.]/,"",file): define the new filename as FILENAME.00XXX.csv.
  - print header > file: open the new file and write the header to it.
- {print > file}: write the entries to the file.
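As a quick check, here is the same command run on a made-up toy CSV with a small chunk size (l=3; note the NR%l==2 trick assumes l is at least 3, since with l=2 the condition never fires):

```shell
# Sample input: a header plus four data rows
printf 'id,name\n1,a\n2,b\n3,c\n4,d\n' > file.csv

awk -v l=3 '(NR==1){header=$0;next}
     (NR%l==2) {
        close(file)
        file=sprintf("%s.%0.5d.csv",FILENAME,++c)
        sub(/csv[.]/,"",file)
        print header > file
     }
     {print > file}' file.csv

head -n 1 file.00001.csv     # id,name
```

This produces file.00001.csv (header plus rows 1-3) and file.00002.csv (header plus row 4).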
Note: if you don't care about the filename, you can use the following shorter version:
awk -v m=100 '
(NR==1){h=$0;next}
(NR%m==2) { close(f); f=sprintf("%s.%0.5d",FILENAME,++c); print h > f }
{print > f}' file.csv
How can I split a large text file into smaller files with an equal number of lines?
Have a look at the split command:
$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N use suffixes of length N (default 2)
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes use numeric suffixes instead of alphabetic
-l, --lines=NUMBER put NUMBER lines per output file
--verbose print a diagnostic to standard error just
before each output file is opened
--help display this help and exit
--version output version information and exit
You could do something like this:
split -l 200000 filename
which will create files each with 200000 lines named xaa, xab, xac, ...
Another option, split by size of output file (still splits on line breaks):
split -C 20m --numeric-suffixes input_filename output_prefix
creates files like output_prefix00 output_prefix01 output_prefix02 ...
each of maximum size 20 megabytes.
How to split CSV files as per number of rows specified?
Made it into a function. You can now call splitCsv <Filename> [chunkSize]
splitCsv() {
    HEADER=$(head -1 "$1")
    if [ -n "$2" ]; then
        CHUNK=$2
    else
        CHUNK=1000
    fi
    tail -n +2 "$1" | split -l "$CHUNK" - "$1"_split_
    for i in "$1"_split_*; do
        sed -i -e "1i$HEADER" "$i"
    done
}
Found on: http://edmondscommerce.github.io/linux/linux-split-file-eg-csv-and-keep-header-row.html
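A hypothetical usage run of the function above, with a made-up sample.csv and a chunk size of 2 (GNU sed is assumed for the 1i insertion):

```shell
# Header plus three data rows
printf 'id,val\n1,a\n2,b\n3,c\n' > sample.csv

splitCsv sample.csv 2

head -n 1 sample.csv_split_aa    # id,val
head -n 1 sample.csv_split_ab    # id,val
```

Each chunk now starts with the original header row.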
Split large csv file and keep header in each part
First you need to separate the header from the content:
header=$(head -1 "$file")
data=$(tail -n +2 "$file")
Then you want to split the data:
echo "$data" | split [options...] -
Note the quotes around $data: without them the shell would collapse all the newlines into spaces. In the options you have to specify the size of the chunks and the pattern for the names of the resulting files. The trailing - must not be removed, as it tells split to read its data from stdin.
Then you can insert the header at the top of each file:
sed -i "1i$header" "$splitOutputFile"
You should obviously do that last part in a for loop, but its exact code will depend on the prefix chosen for the split operation.
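Putting the pieces together in such a loop might look like this (a sketch: input.csv, the 1000-line chunks, and the part_ prefix are all arbitrary choices):

```shell
file=input.csv
header=$(head -1 "$file")
data=$(tail -n +2 "$file")

# quotes around $data preserve the newlines
echo "$data" | split -l 1000 - part_

# put the header back on top of every chunk (GNU sed)
for splitOutputFile in part_*; do
    sed -i "1i$header" "$splitOutputFile"
done
```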
Split a large flat file by first two characters on each line
Using awk you can do:
awk -F, '{fn=$1 ".txt"; print > fn}' file
If you want to keep it clean by closing all file handles at the end, use this awk:
awk -F, '!($1 in files){files[$1]=$1 ".txt"} {print > files[$1]}
END {for (f in files) close(files[f])}' file
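A quick illustration with made-up input (the first comma-separated field decides which output file each line goes to):

```shell
printf 'a,1\nb,2\na,3\n' > file
awk -F, '{fn=$1 ".txt"; print > fn}' file

cat a.txt    # a,1 and a,3
cat b.txt    # b,2
```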
How to split a file into equal parts, without breaking individual lines?
If you mean an equal number of lines, split has an option for this:
split --lines=75
If you need to know what that 75 should really be for N equal parts, it's:
lines_per_part = int((total_lines + N - 1) / N)
where total_lines can be obtained with wc -l.
See the following script for an example:
#!/usr/bin/bash
# Configuration stuff
fspec=qq.c
num_files=6
# Work out lines per file.
total_lines=$(wc -l <${fspec})
((lines_per_file = (total_lines + num_files - 1) / num_files))
# Split the actual file, maintaining lines.
split --lines=${lines_per_file} ${fspec} xyzzy.
# Debug information
echo "Total lines = ${total_lines}"
echo "Lines per file = ${lines_per_file}"
wc -l xyzzy.*
This outputs:
Total lines = 70
Lines per file = 12
12 xyzzy.aa
12 xyzzy.ab
12 xyzzy.ac
12 xyzzy.ad
12 xyzzy.ae
10 xyzzy.af
70 total
More recent versions of split allow you to specify a number of CHUNKS with the -n/--number option. You can therefore use something like:
split --number=l/6 ${fspec} xyzzy.
(that's ell-slash-six, meaning lines, not one-slash-six).
That will give you roughly equal files in terms of size, with no mid-line splits.
I mention that last point because it doesn't give you roughly the same number of lines in each file, rather the same number of characters.
So, if you have one 20-character line and 19 one-character lines (twenty lines in total) and split into five files, you most likely won't get four lines in every file.