Sorting a CSV File by the 5th Column Using Bash

Sort CSV file by multiple columns using the sort command

You need to use two options for the sort command:

  • --field-separator (or -t)
  • --key=<start,end> (or -k), to specify a sort key, i.e. which range of columns (start through end index) to sort by. Since you want to sort on three columns, you'll need to specify -k three times, with the keys 2,2, 1,1, and 3,3.

To put it all together,

sort -t ';' -k 2,2 -k 1,1 -k 3,3

Note that sort can't handle the situation in which fields contain the separator, even if it's escaped or quoted.
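To see the three keys in action, here is a minimal sketch; the /tmp/people.csv path and its contents are made up for illustration:

```shell
# Made-up sample data: name;department;year
printf '%s\n' \
  'carol;sales;2021' \
  'alice;hr;2020' \
  'bob;sales;2019' > /tmp/people.csv

# Sort by field 2 first, then field 1, then field 3
sort -t ';' -k 2,2 -k 1,1 -k 3,3 /tmp/people.csv
# alice;hr;2020
# bob;sales;2019
# carol;sales;2021
```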



Old answer: depending on your system's version of sort, the following might also work:

sort --field-separator=';' --key=2,1,3

Or, you might get "stray character in field spec".

According to the sort manual, if you don't specify the end column of the sort key, it defaults to the end of the line.
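A quick way to see that default is to compare -k2 with -k2,2 on a tie. The two-line file below is invented, and -s (stable sort) is used so that sort's last-resort whole-line comparison doesn't hide the difference:

```shell
printf '%s\n' 'b;1;z' 'a;1;a' > /tmp/demo.csv

# Key runs from field 2 to the end of the line: "1;z" vs "1;a", so the "a" line sorts first
sort -s -t ';' -k2 /tmp/demo.csv
# a;1;a
# b;1;z

# Key is field 2 only: both keys are "1", so -s preserves the input order
sort -s -t ';' -k2,2 /tmp/demo.csv
# b;1;z
# a;1;a
```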

How to sort a CSV by a specific column

Use the following sort command:

sort -t, -k4,4 -nr temperature.csv

The output:

2017-06-24 14:25,22.21,19.0,17.5,0.197,4.774
2017-06-24 14:00,22.22,19.0,17.4,0.197,4.639
2017-06-24 16:00,22.42,19.0,17.3,0.134,5.93
2017-06-24 15:10,22.30,19.0,17.1,0.134,5.472
2017-06-24 13:00,21.92,19.0,17.1,0.096,4.229
2017-06-24 12:45,22.03,19.0,17.1,0.096,4.152
2017-06-24 17:45,22.07,21.0,17.0,0.144,6.472
2017-06-24 19:40,23.01,21.0,16.9,0.318,8.503
2017-06-24 18:25,21.90,21.0,16.9,0.15,6.814
2017-06-24 11:25,23.51,19.0,16.7,0.087,3.689
2017-06-24 11:20,23.57,19.0,16.7,0.087,3.615

  • -t, - field delimiter

  • -k4,4 - sort by 4th field only

  • -nr - sort numerically in reverse order

Sort CSV file based on first column

sort -k1 -n -t, filename should do the trick.

-k1 sorts by column 1.

-n sorts numerically instead of lexicographically (so "11" will not come before "2").

-t, sets the delimiter (what separates values in your file) to , since your file is comma-separated.
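The difference -n makes can be sketched with a made-up comma-separated file (the /tmp/nums.csv path and values are hypothetical):

```shell
printf '%s\n' '11,foo' '2,bar' '100,baz' > /tmp/nums.csv

# Numeric: 2 < 11 < 100
sort -t, -k1 -n /tmp/nums.csv
# 2,bar
# 11,foo
# 100,baz

# Lexicographic: "100" < "11" < "2"
sort -t, -k1 /tmp/nums.csv
# 100,baz
# 11,foo
# 2,bar
```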

Bash: Reading a CSV file and sorting column based on a condition

The problem with your script (and, I presume, with the "sorting isn't working") is the place where you redirect (and may have tried to sort). The following variant of your own script does the job:

#!/bin/bash
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
    if [ "$col3" -gt 0 ]
    then
        echo "$col2"
    fi
done < <(tail -n +2 user-list.txt) | sort > login.txt

Edit 1: Match new requirement

Sure, we can fix the sorting: pipe the output through sort -k1.5,1.7n > login.txt (a numeric sort on characters 5 through 7 of the first field).

Of course, that, too, will only work if your user IDs are all 4 alphas and n digits ...
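For IDs shaped like that (here assumed to be 4 letters followed by 3 digits; the file and user IDs below are invented), the character offsets pick out just the numeric part of the first field:

```shell
# Invented user IDs: 4 letters + 3 digits, then a name
printf '%s\n' 'user010,bob' 'user002,amy' 'user100,cy' > /tmp/ids.csv

# Sort numerically on characters 5 through 7 of field 1
sort -t, -k1.5,1.7n /tmp/ids.csv
# user002,amy
# user010,bob
# user100,cy
```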

How do I sort a CSV file by a specific column?

This command seems to yield correct output:

sort -t"," -k2,2 -k3,3n /tmp/test.csv

I use the comma form (-k2,2) to constrain the key to that column only, and append the numeric (n) modifier to the third column's key (-k3,3n).

It yields:

55aca6a1d2e33dc888ddeb31,525e3bf7d07b4377d31429d2:2,0.081034
55aca6a1d2e33dc888ddeb31,525e3bf7d07b4377d31429d2:2,0.081034
55b64670abb9c0663e77de84,525e3bfad07b4377dc142a24:9999,0.081032
5510b33ec720d80086865312,525e3bfad07b4377dc142a24:9999,0.081033
5514548ec720d80086bfec46,525e3bfad07b4377dc142a24:9999,0.081035
551d4e21c720d80086084f45,525e3bfad07b4377dc142a24:9999,0.081036
557bff5276bd54a8df83268a,525e3bfad07b4377dc142a24:9999,0.081036

How to sort this CSV file by date with the Unix sort command?

You don't need the n — indeed, it is counterproductive. The dates are in ISO 8601 format, and they sort in time order when sorted alphanumerically. Numeric sorting only pays attention to the 2013 part of the field; the rest isn't part of a single number. You also don't need to worry about subsetting the time information — the fact that only some parts change won't matter.

You've given a very minimal data set with the pickup-time information already in sorted order, so we have to get a little inventive. The heading information won't sort numerically; you can remove it, or let it float around. To show that the sorting works when the data is sorted, I specify r (reverse order). This puts the heading data at the top and reverses the two lines of actual data.

$ sort -t, -k6r data.file
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
B45D26A20BE724B0F752461C624233CB,B240D08915F9F593F219D9109127FF1A,VTS,1,,2013-01-16 19:26:00,2013-01-16 19:32:00,3,360,.67,-73.982338,40.768349,-73.981285,40.774017
A6699B6310BFDF8D1EE42C12622D94FA,66C6E65E8D6476B8DDA075A01D63E78A,VTS,1,,2013-01-16 19:21:00,2013-01-16 19:35:00,2,840,1.71,-73.986603,40.739986,-73.99221,40.719715
$

Or, in ascending order (the heading goes at the end):

$ sort -t, -k6 data.file
A6699B6310BFDF8D1EE42C12622D94FA,66C6E65E8D6476B8DDA075A01D63E78A,VTS,1,,2013-01-16 19:21:00,2013-01-16 19:35:00,2,840,1.71,-73.986603,40.739986,-73.99221,40.719715
B45D26A20BE724B0F752461C624233CB,B240D08915F9F593F219D9109127FF1A,VTS,1,,2013-01-16 19:26:00,2013-01-16 19:32:00,3,360,.67,-73.982338,40.768349,-73.981285,40.774017
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
$
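If you'd rather keep the heading pinned to the top instead of letting it float, one common pattern is to sort everything except the first line. This is a sketch against a tiny stand-in for data.file (the rows below are abbreviated, made-up versions of the real data):

```shell
# Recreate a tiny stand-in for data.file: header plus two rows, field 6 = pickup time
printf '%s\n' \
  'medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime' \
  'B45D,B240,VTS,1,,2013-01-16 19:26:00' \
  'A669,66C6,VTS,1,,2013-01-16 19:21:00' > /tmp/data.file

# Print the header as-is, then sort only the remaining lines by field 6
{ head -n 1 /tmp/data.file; tail -n +2 /tmp/data.file | sort -t, -k6; }
# medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime
# A669,66C6,VTS,1,,2013-01-16 19:21:00
# B45D,B240,VTS,1,,2013-01-16 19:26:00
```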

Also, you can decide which dates are relevant and modify this grep command to select the correct dates for the first week — which reduces the data size to about one quarter of its original size.

grep ',2013-01-0[1-7] [0-2][0-9]:[0-5][0-9]:[0-5][0-9],' data.file

That looks for dates in the range 2013-01-01 through 2013-01-07 (allowing any time for each day). You could omit the regex after the blank if you prefer; if the data is valid, it won't make any difference, but the regex avoids selecting some invalid data. Obviously, you can change the dates if you want the first week to run, for example, from the first Sunday through the first Saturday (Sunday 6th to Saturday 12th 2013):

grep -E ',2013-01-(0[6-9]|1[012]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9],' data.file

You could then run this reduced data set through the sort process.
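For example, the filter and the sort can be chained directly. The tiny /tmp/trips.csv stand-in below is invented, with field 6 holding the pickup time:

```shell
# Invented stand-in data: the third row falls outside the first week
printf '%s\n' \
  'id2,h2,VTS,1,,2013-01-03 09:00:00,2013-01-03 09:10:00' \
  'id1,h1,VTS,1,,2013-01-02 12:00:00,2013-01-02 12:15:00' \
  'id3,h3,VTS,1,,2013-01-09 08:00:00,2013-01-09 08:20:00' > /tmp/trips.csv

# Keep only the first week, then sort by pickup time (field 6)
grep ',2013-01-0[1-7] ' /tmp/trips.csv | sort -t, -k6
# id1,h1,VTS,1,,2013-01-02 12:00:00,2013-01-02 12:15:00
# id2,h2,VTS,1,,2013-01-03 09:00:00,2013-01-03 09:10:00
```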

In future, please give 5 lines or so for sample data — it's easier to demonstrate what's working and what's not.

Sorting csv columns in bash, reading bash output into python variables

If you are going to use Python, then use Python. Why intermix bash commands? That makes your code non-portable and dependent on a bash environment.

import os
import glob
import operator

os.chdir("/mypath")
for file in glob.glob("*.csv"):
    data = open(file).readlines()
    data = [i.strip().split(";") for i in data if i[:3] in ["637", "638", "639"]]
    # data = [i.strip().split(";") for i in data if i[:3] in ["637", "638", "639"] and isinstance(float(i[:6]), float)]
    sortedlist = sorted(data, key=operator.itemgetter(1), reverse=True)  # compares the 2nd field as a string
    print("Highest for file %s: %s" % (file, sortedlist[0]))

OR, if you are more interested in a bash+tools solution

find . -type f -name '*.csv' | while read -r FILE
do
    grep -a "63[789]\...;" "$FILE" | sort -n -r -t ';' -k 2 | head -1 >> output.txt
done

