How to Extract Every N Columns and Write into New Files

Terminal: extract the same columns of data and output into multiple files

EDIT: Since the OP needs output files in a numbered Output_file00 style format, the following may help. If you are only reading around 30 files you could remove the close(out) part, but keeping it protects you from a "too many open files" error in case you feed a large number of input files through this code.

awk 'FNR==1{close(out); out=sprintf("Output_%02d",++i)} {print $1, $2, $3, $4 > out}' File_*.txt 

Since you haven't provided any samples I couldn't test this; could you please try the following and let me know if it helps.

awk 'FNR==1{++i} {print $1,$2,$3,$4 > ("Output_file" i)}' File_*.txt

The above should create output files named Output_file1, Output_file2, and so on.
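For illustration only (no sample data was posted), two made-up input files would be split like this:

$ cat File_1.txt
a1 b1 c1 d1 e1
a2 b2 c2 d2 e2
$ cat File_2.txt
x1 y1 z1 w1 v1
$ awk 'FNR==1{++i} {print $1,$2,$3,$4 > ("Output_file" i)}' File_*.txt
$ cat Output_file1
a1 b1 c1 d1
a2 b2 c2 d2
$ cat Output_file2
x1 y1 z1 w1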

Extracting the same column from each file into one file

Here is a very quick and dirty way of doing it:

Assuming your files are in the same order:

$ awk '(FNR==NR){a[FNR]=$0;next}
{a[FNR]=a[FNR] FS $NF}
END{for(i=1;i<=FNR;++i) print a[i]}' file1 file2 file3 ... filen
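As a quick sanity check with made-up files (not the OP's data), each file after the first contributes only its last column:

$ cat file1
id val
1 10
2 20
$ cat file2
id val
1 30
2 40
$ awk '(FNR==NR){a[FNR]=$0;next}
{a[FNR]=a[FNR] FS $NF}
END{for(i=1;i<=FNR;++i) print a[i]}' file1 file2
id val val
1 10 30
2 20 40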

If you want the header a bit cleaner:

$ awk '(FNR==NR){a[FNR]=$0 (FNR==1?++c:"");next}
{a[FNR]=a[FNR] FS $NF (FNR==1?++c:"")}
END{for(i=1;i<=FNR;++i) print a[i]}' file1 file2 file3 ... filen

Assuming your files are not in the same order:

$ awk '{key=$1 FS $2}
(FNR==NR){a[key]=$0 (FNR==1?++c:"");next}
{a[key]=a[key] FS $NF (FNR==1?++c:"")}
END{for(i in a) print a[i]}' file1 file2 file3 ... filen

Copy set of every 3 columns to a new file in Unix

$ awk '{for (i=1;i<=NF;i+=3) {print $i,$(i+1),$(i+2) > ("output" ((i+2)/3) ".txt")}}' file.txt

# output1.txt
ID Name Place
19 john NY
20 Jen NY

# output2.txt
ID Name Place
23 Key NY
22 Jill NY

# output3.txt
ID Name Place
22 Tom Ny
22 Ki LA

# output4.txt
ID Name place
24 Jeff NY
34 Jack Roh

Skipping every nth row and copying/transforming data into another matrix

pd.read_csv is a function that returns a DataFrame.
To select specific rows from a DataFrame you can slice it with .loc:

df.loc[start:stop:step]
So it would look something like this:

import pandas as pd

df = pd.read_csv(your_file)
new_df = df.loc[11:140]
# transform it as you please

# convert it to Excel or CSV
new_df.to_excel("new_file.xlsx")
# or
new_df.to_csv("new_file.csv")
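If the goal is to actually skip every nth row, the same slice syntax takes a step; here is a minimal sketch with made-up bounds and filenames, assuming the default integer index:

import pandas as pd

df = pd.read_csv("your_file.csv")   # made-up filename
every_5th = df.loc[11:140:5]        # rows labelled 11, 16, 21, ... up to 140
every_5th.to_csv("new_file.csv", index=False)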

Split huge file into n files keeping first 7 columns + next 3 columns until column n

Here are a couple of solutions with bash tools.


1. bash

Using cut inside a bash loop. This will spawn n processes and parse the file n times.

Update for the case where the _ids in the column names are not just a sequence of letters but arbitrary string ids, repeating every 3 columns after the first 7 columns. We then have to read the file's header first and extract them; a quick solution is to use awk and print the id of every 8th, 11th, etc. column into a bash array.

#!/bin/bash
first=7
#ids=( {a..n} )   # simple case: the ids are just the letters a..n
# extract the id (second "_"-separated part) of every 3rd column name after the first $first columns
ids=( $( head -1 "$1" | awk -F"_" -v RS="," -v f="$first" 'NR>f && (NR+1)%3==0{print $2}' ) )

for i in "${!ids[@]}"; do
    cols="1-$first,$((first+1+3*i)),$((first+2+3*i)),$((first+3+3*i))"
    cut -d, -f"$cols" "$1" > "${ids[i]}.txt"
done

Usage: bash test.sh file
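For example, with a made-up header like the one below, the awk pipeline prints the second "_"-separated part of the first column name in each group of three after column 7, and those ids drive both the output filenames and the column ranges:

$ head -1 file
c1,c2,c3,c4,c5,c6,c7,x_a_1,x_a_2,x_a_3,x_b_1,x_b_2,x_b_3
$ head -1 file | awk -F"_" -v RS="," -v f=7 'NR>f && (NR+1)%3==0{print $2}'
a
b

With that header the script would write a.txt (columns 1-7,8,9,10) and b.txt (columns 1-7,11,12,13).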


2. awk

Or you can use awk. Here I parameterize only the number of output files, but the other values could also be handled like in the first solution.

BEGIN { FS=OFS=","; times=14 }
{
    for (i=1; i<=times; i++) {
        # i+96 maps 1 -> "a", 2 -> "b", ... so the outputs are a.txt through n.txt
        print $1,$2,$3,$4,$5,$6,$7,$(5+3*i),$(6+3*i),$(7+3*i) > sprintf("%c.txt", i+96)
    }
}

Usage: awk -f test.awk file.

This solution should be fast, as it parses the file only once. But it shouldn't be used as-is for a large number of output files, since it could throw a "too many open files" error; for the range of letters here it should be ok.
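If you really did need many more output files than the open-file limit allows, one possible workaround (a sketch, not part of the original answer) is to append and close each file after every write, trading speed for safety:

BEGIN { FS=OFS=","; times=14 }
{
    for (i=1; i<=times; i++) {
        out = sprintf("%c.txt", i+96)
        # ">>" appends, so delete old output files before re-running
        print $1,$2,$3,$4,$5,$6,$7,$(5+3*i),$(6+3*i),$(7+3*i) >> out
        close(out)
    }
}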


