Extracting Columns from Text File with Different Delimiters in Linux

If the command should work with both tabs and spaces as the delimiter, I would use awk, whose default field separator treats any run of spaces and tabs as a single separator:

awk '{print $100,$101,$102,$103,$104,$105}' myfile > outfile

As long as you only need to list six fields, it is imo fine to just type them out; for longer ranges you can use a for loop:

awk '{for(i=100;i<=105;i++)print $i}' myfile > outfile
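Note that this loop prints each field on its own line, unlike the comma-separated print above. If you want the six fields on one line, a small variation using printf with a conditional separator does it:

awk '{for(i=100;i<=105;i++) printf "%s%s", $i, (i<105 ? OFS : ORS)}' myfile > outfile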

If you want to use cut, you need to use the -f option:

cut -f100-105 myfile > outfile

If the field delimiter is different from TAB you need to specify it using -d:

cut -d' ' -f100-105 myfile > outfile
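One caveat: unlike awk, cut does not collapse runs of delimiters, so two consecutive spaces produce an empty field and shift the numbering. If your file may contain repeated spaces, a common workaround is to squeeze them with tr first:

tr -s ' ' < myfile | cut -d' ' -f100-105 > outfile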

Check the man page for more info on the cut command.

Extract Column(s) from text file having Multi Character Delimiter i.e. %$%

The symbol $ is a special character in a regex, so you need to escape it with a backslash; the backslash is itself a special character in the string literal, so it needs to be escaped again.

So, finally we have:

$ cat sample 
ghkjlj;lk%$%23e;k32poek%$%eqdje2oijd%$%xrgtdy5h

$ awk -F'%\\$%' '{print $1}' sample
ghkjlj;lk
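The escaped separator works for any field, of course; a quick way to see how the whole line was split is to loop over all NF fields:

$ awk -F'%\\$%' '{for(i=1;i<=NF;i++) print i": "$i}' sample
1: ghkjlj;lk
2: 23e;k32poek
3: eqdje2oijd
4: xrgtdy5h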

Extract specific columns from delimited file using Awk

As far as I know, awk has no built-in syntax for field ranges. You could do a for loop, but you would have to add handling to filter out the columns you don't want (see the sketch after the next command). It's probably easier to just list them:

awk -F, '{OFS=",";print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$20,$21,$22,$23,$24,$25,$30,$33}' infile.csv > outfile.csv
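For reference, the for-loop approach mentioned above could look like the following sketch, which reproduces the explicit print by testing each field index against the wanted ranges:

awk -F, -v OFS=',' '{
    line = ""; sep = ""
    for (i = 1; i <= NF; i++)
        if (i <= 10 || (i >= 20 && i <= 25) || i == 30 || i == 33) {
            line = line sep $i   # append only the wanted columns
            sep = OFS
        }
    print line
}' infile.csv > outfile.csv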

Something else to consider, and this is faster and more concise:

cut -d "," -f1-10,20-25,30-33 infile.csv > outfile.csv

As to the second part of your question, I would probably write a script in Perl that knows how to handle header rows, parsing the column names from stdin or a file and then doing the filtering; it's probably a tool I would want to have for other things anyway. I'm not sure about doing it in a one-liner, although I'm sure it can be done.
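I haven't written that script, but here is a rough awk sketch of the header-aware idea: read the header row, map column names to indices, then print the requested names for every line. The column names id and score are hypothetical stand-ins, and the sketch assumes every requested name actually appears in the header:

awk -F, -v OFS=',' -v want='id,score' '
BEGIN   { n = split(want, names, ",") }              # requested column names (hypothetical)
NR == 1 { for (i = 1; i <= NF; i++) pos[$i] = i }    # map header name -> column index
{
    line = ""; sep = ""
    for (j = 1; j <= n; j++) { line = line sep $(pos[names[j]]); sep = OFS }
    print line
}' infile.csv > outfile.csv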

How to extract a certain column in a space-delimited .txt file and store each unique value along with the number of times it appears [Unix - Bash]

awk '{print $2}' extracts the second column, not row.

You can indeed use sort and uniq to do this, and that's the traditional Unix 'toolbox' method, which a great many people before you have also thought of:

awk '{print $2}' file.txt | sort -n | uniq -c

(uniq -c counts adjacent duplicates instead of removing them. On any non-weird Unix system, you can use man {programname} to get documentation on a program, and man uniq shows you several options that can be useful for various things including -c.)
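A quick illustration with made-up numbers, just to show the shape of the output (count first, then value):

$ printf '%s\n' 3 1 3 2 3 | sort -n | uniq -c
      1 1
      1 2
      3 3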

But awk can also do the whole job (or nearly) by itself:

awk '{++c[$2]} END{for(v in c){print c[v],v}}' file.txt

awk has 'associative' arrays subscripted or 'keyed' by any values, not just more-or-less consecutive integers; this was the 1970s name for what nowadays is often called a dictionary. (And all array elements, and variables other than predefined ones like NR, NF, OFS, etc., are initialized to an empty value, which is treated numerically as zero.)

Since this is normally implemented as a hash table, the for..in statement in traditional awk can produce the values in an arbitrary order, and the POSIX standard codifies this. If you want them in numeric order (as the sort|uniq method produces), you can append ... | sort -nk2, or, on non-ancient versions of GNU awk only (gawk is now common but not universal), you can use:

awk '{++c[$2]} END{PROCINFO["sorted_in"]="@val_num_asc";for(v in c){print c[v],v}}' file.txt

Extract several space-delimited fields from file with varying delimiters into another file in Bash

I figured out a solution.

  1. Remove the header line.
  2. Strip leading spaces and replace runs of whitespace with commas to make the lines easier to deal with.
  3. Filter the lines down to those containing the word "rectangle" using grep.
  4. Iterate through each line, writing the wanted fields and a counter flag to the output file. (A single-awk alternative is sketched after the script.)

#!/bin/bash
#Code here to retrieve the file from command arguments and set it as $inputFile (removed for brevity)
sed -i 1d "$inputFile" #Remove header line (note: edits the input file in place).

sed 's/^ *//' < "$inputFile" > work.txt #Strip the leading spaces from each line.
tr -s ' ' < work.txt | tr ' ' ',' > work2.txt #Squeeze runs of spaces and switch them for commas.
grep "rectangle" work2.txt > work3.txt #Keep only lines containing "rectangle".
rm -f lineout.txt #Delete output file in case script was run previously; -f avoids an error if it is absent.
touch lineout.txt
count=0
while IFS='' read -r line || [[ -n "$line" ]]; do
    printf '%s' "$line" > line.txt #Literal format string so printf cannot misread the data.
    awk 'BEGIN { FS="," } { printf "%s", $1 >> "lineout.txt" }' line.txt
    printf "," >> lineout.txt
    awk 'BEGIN { FS="," } { printf "%s", $2 >> "lineout.txt" }' line.txt
    printf "," >> lineout.txt
    count=$((count + 1))
    if [[ $count = "1" ]]; then
        printf '%s\n' "$count" >> lineout.txt
    else
        printf '0\n' >> lineout.txt
        if [[ $count = "4" ]]; then
            count=0
        fi
    fi
done < work3.txt
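For what it's worth, the loop spawns two awk processes per input line, which gets slow on large files; a single awk pass over work3.txt can replace the whole loop. A sketch that should produce the same lineout.txt:

awk -F, '{
    count++
    print $1 "," $2 "," (count == 1 ? count : 0)   # third field is the running flag
    if (count == 4) count = 0                      # reset every fourth line, as above
}' work3.txt > lineout.txt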

Awk command to extract columns on dual delimiter

awk interprets the field separator as a regular expression, so you just need to escape each special character with a doubled backslash \\ to get the literals:

echo 'name[^legalName[^code[^type[^contactNumber1[^contactNumber2' | awk -F'\\[\\^' '{print $2}'
legalName
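The same command generalizes to any field; for instance, picking out the fifth column:

echo 'name[^legalName[^code[^type[^contactNumber1[^contactNumber2' | awk -F'\\[\\^' '{print $5}'
contactNumber1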

Ubuntu: How do I extract only specific columns from tab-delimited file if it contains a specific string?

Simplifying your code (with code borrowed from Extract column using grep)

grep -E "chr6.fa" FC305JN_s_1_eland_result.txt > out.txt
awk '{print $1, "\t", $2, "\t", $7, "\t", $8, "\t", $9}' out.txt > outfile.txt

produces output:

FC305JN_20080525:1:15:944:72     GATGACTTCCTTAATTTTCTTTATNNNN    chr6.fa     7200804     R
FC305JN_20080525:1:15:1799:100 TTCAGCTTATTGATAAAGAAGCACNNNN chr6.fa 20979453 R
FC305JN_20080525:1:15:771:1076 GAGTTCACTAAACAAAAGAGTGTCNNNN chr6.fa 136877852 R
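Two asides: the commas around "\t" in the awk call also emit the default OFS (a space), so each output delimiter is actually space-tab-space, and grep treats the dot in chr6.fa as 'match any character'. Both can be avoided by letting awk do the filtering with an exact comparison and setting OFS explicitly. A sketch, assuming the file is tab-delimited and chr6.fa sits in column 7 (as the output above suggests):

awk -F'\t' -v OFS='\t' '$7 == "chr6.fa" { print $1, $2, $7, $8, $9 }' FC305JN_s_1_eland_result.txt > outfile.txt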

