Sort Command in Not Working Properly in Unix for Sorting a CSV File

Sort command in not working properly in unix for sorting a csv file

The below should work

 awk 'NR<2{print $_;next}{ print $_ | "sort -t, -k3.8,3.11rn -k3.1,3.3rM -k3.5,3.6rn -k3.12rd" }'

The 'awk' snippet passes all lines except header to the sort command.
The order of the keys is important here :

k3.8,3.11rn extracts the year part of the column and reverse sorts

k3.1,3.3rM extracts the first 3 characters in column three to be reverse monthly sorted and the rest we do a reverse dictionary sort

k3.5,3.6rn extracts the day and reverse sort and finally we do the same for time

How to sort this CSV file by date with the Unix sort command?

You don't need the n — indeed, it is counterproductive. The dates are in ISO 8601 format, and they sort in time order when sorted alphanumerically. Numeric sorting only pays attention to the 2013 part of the field; the rest isn't part of a single number. You also don't need to worry about subsetting the time information — the fact that only some parts change won't matter.

You've given a very minimal data set with the pickup-time information already in sorted order, so we have to get a little inventive. The heading information won't sort numerically; you can remove it, or let it float around. To show that the sorting works when the data is sorted, I specify r (reverse order). This puts the heading data at the top and reverses the two lines of actual data.

$ sort -t, -k6r data.file
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
B45D26A20BE724B0F752461C624233CB,B240D08915F9F593F219D9109127FF1A,VTS,1,,2013-01-16 19:26:00,2013-01-16 19:32:00,3,360,.67,-73.982338,40.768349,-73.981285,40.774017
A6699B6310BFDF8D1EE42C12622D94FA,66C6E65E8D6476B8DDA075A01D63E78A,VTS,1,,2013-01-16 19:21:00,2013-01-16 19:35:00,2,840,1.71,-73.986603,40.739986,-73.99221,40.719715
$

Or, in ascending order (the heading goes at the end):

$ sort -t, -k6 data.file
A6699B6310BFDF8D1EE42C12622D94FA,66C6E65E8D6476B8DDA075A01D63E78A,VTS,1,,2013-01-16 19:21:00,2013-01-16 19:35:00,2,840,1.71,-73.986603,40.739986,-73.99221,40.719715
B45D26A20BE724B0F752461C624233CB,B240D08915F9F593F219D9109127FF1A,VTS,1,,2013-01-16 19:26:00,2013-01-16 19:32:00,3,360,.67,-73.982338,40.768349,-73.981285,40.774017
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
$

Also, you can decide which dates are relevant and modify this grep command to select the correct dates for the first week — which reduces the data size to about one quarter of its original size.

grep ',2013-01-0[1-7] [0-2][0-9]:[0-5][0-9]:[0-5][0-9],' data.file

That looks for dates in the range 2013-01-01 through 2013-01-07 (allowing any time for each day). You could omit the regex after the blank if you prefer; if the data is valid, it won't make any difference, but the regex avoids selecting some invalid data. Obviously, you can change the dates if you want the first week to run, for example, from the first Sunday through the first Saturday (Sunday 6th to Saturday 12th 2013):

grep -E ',2013-01-(0[6-9]|1[012]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9],' data.file

You could then run this reduced data set through the sort process.

In future, please give 5 lines or so for sample data — it's easier to demonstrate what's working and what's not.

How do I sort a csv file in linux according to timestamps?

Because you are using a sensible timestamp format, you can simply use lexical sorting:

sort -t, -k3,3 file

Sort CSV file by multiple columns using the sort command

You need to use two options for the sort command:

  • --field-separator (or -t)
  • --key=<start,end> (or -k), to specify the sort key, i.e. which range of columns (start through end index) to sort by. Since you want to sort on 3 columns, you'll need to specify -k 3 times, for columns 2,2, 1,1, and 3,3.

To put it all together,

sort -t ';' -k 2,2 -k 1,1 -k 3,3

Note that sort can't handle the situation in which fields contain the separator, even if it's escaped or quoted.

Also note: this is an old question, which belongs on UNIX.SE, and was also asked there a year later.


Old answer: depending on your system's version of sort, the following might also work:

sort --field-separator=';' --key=2,1,3

Or, you might get "stray character in field spec".

According to the sort manual, if you don't specify the end column of the sort key, it defaults to the end of the line.

Differences between Unix commands for Sorting CSV

See also the answer by @morido for some other pointers, but here's a description of exactly what those two sort invocations do:

sort -k 1n -o output.csv

This assumes that the "fields" in your file are delimited by a transition from non-whitespace to whitespace (i.e. leading whitespace is included in each field, not stripped, as many might expect/assume), and tells sort to order things by a key that starts with the first field and extends to the end of the line, and assumes that the key is formatted as a numeric value. The output is sent explicitly to a specific file.

sort -t "," -k1 -n -k2

This defines the field separator as a comma, and then defines two keys to sort on. The first key again starts at the first field and extends to the end of the line and is lexicographic (dictionary order), not numeric, and the second key, which will be used when values of the first key are identical, starts with the second field and extends to the end of the line, and because of the intervening -n, will be assumed to be numeric data as well. However, because your first key entails the entire line, essentially, the second key is not likely to ever be needed (if the first key of two separate lines is identical, the second key most likely will be too).

Since you didn't provide sample data, it's unknown whether the data in the first two fields is numeric or not, but I suspect you want something like what was suggested in the answer by @morido:

sort -t, -k1,1 -k2,2

or

sort -t, -k1,1n -k2,2n          (alternatively sort -t, -n -k1,1 -k2,2)

if the data is numeric.

How to sort csv by specific column

Use the following sort command:

sort -t, -k4,4 -nr temperature.csv

The output:

2017-06-24 14:25,22.21,19.0,17.5,0.197,4.774
2017-06-24 14:00,22.22,19.0,17.4,0.197,4.639
2017-06-24 16:00,22.42,19.0,17.3,0.134,5.93
2017-06-24 15:10,22.30,19.0,17.1,0.134,5.472
2017-06-24 13:00,21.92,19.0,17.1,0.096,4.229
2017-06-24 12:45,22.03,19.0,17.1,0.096,4.152
2017-06-24 17:45,22.07,21.0,17.0,0.144,6.472
2017-06-24 19:40,23.01,21.0,16.9,0.318,8.503
2017-06-24 18:25,21.90,21.0,16.9,0.15,6.814
2017-06-24 11:25,23.51,19.0,16.7,0.087,3.689
2017-06-24 11:20,23.57,19.0,16.7,0.087,3.615

  • -t, - field delimiter

  • -k4,4 - sort by 4th field only

  • -nr - sort numerically in reverse order



Related Topics



Leave a reply



Submit