How to Parse a CSV File in Bash

How to parse a CSV file in Bash?

You need to use IFS instead of -d:

while IFS=, read -r col1 col2
do
    echo "I got:$col1|$col2"
done < myfile.csv

Note that for general-purpose CSV parsing you should use a specialized tool which can handle quoted fields with internal commas, among other issues that Bash can't handle by itself. Examples of such tools are csvtool and csvkit.
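
For example, a quoted field containing a comma gets split in the wrong place by the read loop above (a quick illustration of the failure mode):

$ echo '"Smith, John",42' | while IFS=, read -r col1 col2; do echo "I got:$col1|$col2"; done
I got:"Smith| John",42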

Bash / Shell: Parsing CSV file in bash script and skipping first line

OP hasn't (yet) provided any sample input data or the desired output, so some assumptions:

  • data values could be integers or reals, positive or negative
  • the user wants the average for each line (no need to calculate an average for the entire file)

Some sample data:

$ cat user-list.txt
a,b,c,d,e,f,g,h
1,id1,3,4,5,6,7
2,id2,13,14.233,15,16,17
3,id2,3.2,4.3,5.9233,6.0,7.32
4,id4,-3.2,4.3,-15.3,96.0,7.32

One awk solution:

$ awk -F"," 'FNR>=2 { printf "%s %10.3f\n", $2, ($3+$4+$5+$6+$7)/5.0 }' user-list.txt

Where:

  • -F"," - use comma as input field separator
  • FNR>=2 - skip the first line of the file
  • printf "%s %10.3f\n" - print field 2 using %s format; print the average using %10.3f format (a minimum width of 10 characters: 3 digits to the right of the decimal, the decimal point itself, and up to 6 characters to the left before the padding runs out); append a linefeed (\n) on the end

The above generates:

id1      5.000
id2     15.047
id2      5.349
id4     17.824
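
As a quick sanity check of that %10.3f formatting, the first output line can be reproduced directly:

$ printf "%s %10.3f\n" id1 5
id1      5.000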

OP has added a new requirement ... sort the output by the calculated averages. However, there are a few potential issues that need further input from the OP:

  • Can a userID show up more than once in the data file?
  • If a userID can show up more than once then do we need to generate a single line of output for each unique userID or do we generate separate lines for each occurrence of a userID?
  • Is the data to be sorted in ascending or descending order?

For now I'm going to assume:

  • A userID may show up more than once in the source data (e.g., as with id2 in my sample data set, above).
  • We will not combine multiple lines for a given userID (i.e., each line will stand on its own).
  • We'll show sorting in both ascending and descending order.

While the sorting can be done within awk, I'm going to opt for piping the awk output to sort, as this requires a bit less code and (imo) is a bit easier to understand; an in-awk sketch follows the descending example below.

Ascending sort:

$ awk -F"," 'FNR>=2 { printf "%s %10.3f\n", $2, ($3+$4+$5+$6+$7)/5.0 }' user-list.txt | sort -nk2
id1      5.000
id2      5.349
id2     15.047
id4     17.824

Where sort -nk2 says to sort by column #2 using a numeric sort.

Descending sort:

$ awk -F"," 'FNR>=2 { printf "%s %10.3f\n", $2, ($3+$4+$5+$6+$7)/5.0 }' user-list.txt | sort -rnk2
id4     17.824
id2     15.047
id2      5.349
id1      5.000

Where sort -rnk2 says to sort by column #2 using a numeric sort but to reverse the order.
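
For reference, here is how the sorting could be done entirely inside awk, as mentioned above; a minimal sketch using GNU awk's PROCINFO["sorted_in"] controlled array traversal (gawk-only, so less portable than piping to sort):

awk -F"," '
FNR>=2 { id[FNR]=$2; avg[FNR]=($3+$4+$5+$6+$7)/5.0 }
END {
    PROCINFO["sorted_in"]="@val_num_asc"   # gawk-only: visit entries ordered by value, ascending
    for (i in avg) printf "%s %10.3f\n", id[i], avg[i]
}' user-list.txt

Use "@val_num_desc" instead for descending order.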

How to parse a CSV in a Bash script?

First prototype using plain old grep and cut:

grep "${VALUE}" inputfile.csv | cut -d, -f"${INDEX}"

If that's fast enough and gives the proper output, you're done.
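
A hypothetical invocation, reusing the user-list.txt sample from the earlier question (VALUE and INDEX stand in for whatever your script actually receives):

$ VALUE="id2"; INDEX=3
$ grep "${VALUE}" user-list.txt | cut -d, -f"${INDEX}"
13
3.2

The usual caveat applies: grep matches anywhere on the line, not in a specific column, so a VALUE that can appear in other fields will produce false positives.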

Bash: Parse CSV and edit cell values

See Why is using a shell loop to process text considered bad practice?

As the question is tagged linux, I'm assuming GNU sed is available, and also that the input is actually CSV, not space/tab separated.

$ cat ip.csv 
ID,Location,Way,Day,DayTime,NightTime,StandNo
1,abc,Up,mon,6.00,18.00,6
2,xyz,down,TUE,2.32,5.23,4

$ sed '2,$ {s/[^,]*/\L\u&/4; s/[^,]*/\U&/3; s/[^,]*/\U&/2}' ip.csv
ID,Location,Way,Day,DayTime,NightTime,StandNo
1,ABC,UP,Mon,6.00,18.00,6
2,XYZ,DOWN,Tue,2.32,5.23,4

  • 2,$ to process input from 2nd line to end of file
  • s/[^,]*/\L\u&/4 lowercase the 4th field, then capitalize its first letter
  • s/[^,]*/\U&/3 uppercase all letters in the 3rd field
  • s/[^,]*/\U&/2 uppercase all letters in the 2nd field
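
A quick isolated demo of the \L\u case-conversion sequence (GNU sed extensions):

$ echo 'mOn' | sed 's/.*/\L\u&/'
Mon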

If the fields themselves can contain , within double quotes and so on, use perl, python, etc., which have proper CSV modules.
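
For instance, a minimal sketch of the same edit shelling out to Python's standard csv module (assumes python3 is installed; quoted commas survive intact):

python3 - ip.csv <<'EOF'
import csv, sys
with open(sys.argv[1], newline="") as f:
    rows = list(csv.reader(f))
out = csv.writer(sys.stdout, lineterminator="\n")
out.writerow(rows[0])              # header unchanged
for r in rows[1:]:
    r[1] = r[1].upper()            # Location -> all caps
    r[2] = r[2].upper()            # Way -> all caps
    r[3] = r[3].capitalize()       # Day -> first letter capitalized
    out.writerow(r)
EOF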

What method should I use to parse csv files in bash

This is repeating a deleted but correct answer.

IFS=";" read -r -a array < Input.csv
declare -p array

That reads the first line of the input file, splits it on semicolons, and stores the fields in the array variable named array.

The -r option for the read command means any backslashes in the input are handled as literal characters, not as introducing an escape sequence.
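
A quick illustration of what -r prevents (without it, read eats backslashes):

$ read -r line <<< 'a\nb'; echo "$line"
a\nb
$ read line <<< 'a\nb'; echo "$line"
anb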

The -a option reads the words from the input into an array named by the given variable name.

At a bash prompt, type help declare and help read.
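
To process every line rather than just the first, the same read can simply be wrapped in a loop (a minimal sketch, assuming the same semicolon-separated Input.csv):

while IFS=";" read -r -a fields; do
    printf '%d fields, first is %s\n' "${#fields[@]}" "${fields[0]}"
done < Input.csv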

Also find a bash tutorial that talks about the effect of IFS on the read command, for example BashGuide.

The bash tag info page has tons of resources.

Parsing multiple CSV files in bash by pattern with counter

Could you please try the following. Since no samples were given, I couldn't test it. But this should be faster than a for loop that traverses all the csv files and calls awk in each iteration.

The following points are taken care of in this program:

  • NO need to use a for loop to traverse the .csv files, since awk is capable of reading them all itself.
  • OP's code is NOT taking care of getting the x, y values from the file names; I have added that logic too (see the trace after the code).
  • One could set up the output file name in the BEGIN section of the code as per need too.


awk -v max=0 '
BEGIN{
  OFS=" , "
  output_file="output.txt"
}
FNR==1{                                    # first line of each new file
  if(want){                                # flush the result of the previous file
    print output":"ORS want > (output_file)
  }
  split(FILENAME,array,"[-.]")             # pull x, y from a name like prefix-x-y.csv (assumed layout)
  output=array[2] array[3]
  want=max=""                              # reset per-file state
}
{
  if($1>max){                              # track the row with the largest 1st field
    want=$2
    max=$1
  }
}
END{
  print output":"ORS want > (output_file)  # flush the result of the last file
}
' *.csv
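
As a hypothetical trace (no real samples to test against): given a file named report-3-7.csv whose rows are 5,alpha / 9,beta / 2,gamma, split(FILENAME,array,"[-.]") yields array[2]="3" and array[3]="7", so output becomes "37"; the largest first field is 9, so want ends up as "beta", and output.txt receives:

37:
beta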



