Linux Bash Commands to Remove Duplicates from a CSV File

You need to provide more details for question 3, but for questions 1 and 2 the following awk one-liner will work.

awk 'seen[$0]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv

And with some whitespace added for clarity:

awk 'seen[$0]++ {
    print $0 > "dups.csv"; next
}
{
    print $0 > "new.csv"
}' mycsv

This will not print anything to STDOUT but will create two files: dups.csv will contain every removed duplicate (that is, if the same line appears 5 times, this file will contain the 4 copies that were removed), and new.csv will contain all unique rows.

seen[$0]++ is a test run for each line. If the line is already present in our array, it is written to the dups.csv file and we move to the next line using next. If the line is not present, we add it to the array and write it to the new.csv file.

$0 means the entire line. If you want to key on fewer columns, you can; you just need to set the input field separator to match your delimiter. You mentioned CSV, but I don't see any comma delimiters, so I am using the default separator, which is [[:space:]]+.
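
To see the behaviour end to end, here is a minimal hypothetical session (the input data is made up for illustration):

$ printf 'a,b\na,b\na,b\nc,d\n' > mycsv
$ awk 'seen[$0]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv
$ cat new.csv
a,b
c,d
$ cat dups.csv
a,b
a,b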

Also, it is comma separated; I was just putting up sample data. So if I want to use the above example but test only columns 3 and 4 (in the seen check), how would I do that with a comma-separated file?

For a true CSV just set the field separator to ,. Note that seen is not a command; it is an associative array whose keys are the column values. So you would modify the above command to:

awk -F, 'seen[$3,$4]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv

Update:

Once you have a list without dups using the commands stated above, we are left with:

$ cat new.csv 
john,kerry
adam,stephenson
ashley,hudson


$ cat remove.txt 
adam


$ awk -F, 'NR==FNR{remove[$1]++;next}!($1 in remove)' remove.txt new.csv 
john,kerry
ashley,hudson
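
The two-file idiom works because FNR resets for each input file while NR does not, so NR==FNR is true only while remove.txt is being read. The same command, expanded with comments:

awk -F, '
NR==FNR {           # first file (remove.txt): collect the names to drop
    remove[$1]++
    next            # skip the filter below while reading the first file
}
!($1 in remove)     # second file (new.csv): print rows whose first column
                    # never appeared in remove.txt
' remove.txt new.csv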

csv remove all rows with duplicate values in one column

This awk can do that in a single command, removing every row whose first-column value occurs more than once:

awk -F, '{arr[$1]=$0} seen[$1]++{delete arr[$1]} END{for (i in arr) print arr[i]}' file.csv

For the sample input in the question this prints:

id2, value2
id3, value3
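
Expanded with comments (same logic):

awk -F, '
{ arr[$1] = $0 }                      # remember the latest row for this key
seen[$1]++ { delete arr[$1] }         # key seen before: discard it entirely
END { for (i in arr) print arr[i] }   # only keys seen exactly once remain
' file.csv

Note that for (i in arr) iterates in an unspecified order, so the output order is not guaranteed to match the input order.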

For your edited question, where the fields are quoted and contain embedded commas, use:

awk -F'","' '{arr[$1]=$0} seen[$1]++{delete arr[$1]} END{for (i in arr) print arr[i]}' file.csv

"06e04,0429","405872,8637110"
"06e04,0430","405872,8637110"
"06e04,0431","405872,8637111"

Bash/Shell: How to remove duplicates from csv file by columns?

Using awk:

awk -F';' '!seen[$2,$3]++{data[$2,$3]=$0}
END{for (i in seen) if (seen[i]==1) print data[i]}' file

For the input in the question this prints:

irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant

Explanation: If the $2,$3 combination does not yet exist in the seen array, the whole record is stored in the data array under the key $2,$3. Every time a $2,$3 combination is found, its counter in seen is incremented. At the end, only the entries whose counter equals 1, i.e. combinations that occurred exactly once, are printed.
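
For instance, with made-up data where one $2,$3 pair repeats (not the question's actual input):

$ cat file
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data1;data2;other;other
irrelevant;data5;data6;irrelevant;irrelevant
$ awk -F';' '!seen[$2,$3]++{data[$2,$3]=$0}
END{for (i in seen) if (seen[i]==1) print data[i]}' file
irrelevant;data5;data6;irrelevant;irrelevant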

MERGE 2 csv files removing duplicates using BASH or Awk

If you don't care about output order, then use GNU sort with -u on the key field; -s (stable) ensures that when a key appears in both files, the line from the file listed first (small_set.csv) is the one kept:

$ sort -t, -k3,3 -u -s small_set.csv large_set.csv
0,116070743,34603492144,false,2020-02-02 15:24:31,2020-02-02 15:24:31
0,116070741,34617138590,false,2020-02-02 15:24:30,2020-02-02 15:24:30
0,116070422,34617138999,false,2020-02-02 15:24:30,2020-02-02 15:24:30
0,116070777,34991221138,false,2020-02-02 15:24:30,2020-02-02 15:24:30

If you do care then you'd have to tell us what that output order should be as it's not obvious in your question.
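
If input order should be preserved, one common alternative (a sketch, assuming column 3 is still the deduplication key) is an awk pass that keeps only the first occurrence of each key:

awk -F, '!seen[$3]++' small_set.csv large_set.csv

Listing small_set.csv first again gives its rows precedence over duplicates in large_set.csv.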

Removing specific lines/rows or duplicates from a file in Linux

@zinovyev's code works fine for me. @Tabbi, to resolve your issue, run the remove_dup.sh script below (code and DATA in one file):

#!/bin/bash

# Strip everything up to and including the __DATA__ marker,
# then print the first remaining line (the header)
sed '0,/^__DATA__$/d' "$0" | head -n 1

# Skip the header, de-duplicate on columns 3-4 (LABNO and Oth_ID),
# then sort the result in natural (version) order
sed '0,/^__DATA__$/d' "$0" | tail -n +2 | sort -k 3,4 -u | sort -V

exit

__DATA__
ID SAMPLE LABNO Oth_ID sex age bmi ca_1 pd_7
1003341 21863 21863 NA 1 48.68 22.42 0 0
1003343 22697 22697 NA 1 48.98 23.25 0 0
1003347 4421 4421 NA 1 48.70 25.56 NA NA
1003348 1642 1642 NA 1 48.72 16.57 NA NA
1003349 4163 4163 6069 1 49.02 23.47 1 NA
1003349 6069 4163 6069 1 49.02 23.47 1 NA
1003356 5347 9053 5347 1 49.08 24.81 0 0
1003356 9053 9053 5347 1 49.08 24.81 0 0
1003357 695 695 NA 1 49.08 22.32 NA NA
1003360 19833 19833 NA 1 48.55 22.48 0 0
1003365 5392 6843 5392 1 48.70 23.08 0 0
1003365 6843 6843 5392 1 48.70 23.08 0 0

and you should see this result:

ID SAMPLE LABNO Oth_ID sex age bmi ca_1 pd_7
1003341 21863 21863 NA 1 48.68 22.42 0 0
1003343 22697 22697 NA 1 48.98 23.25 0 0
1003347 4421 4421 NA 1 48.70 25.56 NA NA
1003348 1642 1642 NA 1 48.72 16.57 NA NA
1003349 4163 4163 6069 1 49.02 23.47 1 NA
1003356 5347 9053 5347 1 49.08 24.81 0 0
1003357 695 695 NA 1 49.08 22.32 NA NA
1003360 19833 19833 NA 1 48.55 22.48 0 0
1003365 5392 6843 5392 1 48.70 23.08 0 0
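
The de-duplication is done by sort -k 3,4 -u, which treats columns 3-4 (LABNO and Oth_ID) as the uniqueness key. Isolated on the two 1003349 rows (GNU sort keeps the first line of each equal run):

$ printf '1003349 4163 4163 6069\n1003349 6069 4163 6069\n' | sort -k 3,4 -u
1003349 4163 4163 6069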

Unix Delete Duplicate rows from csv based on 2 columns

@Mr Smith: Could you please try the following and let me know if it helps.

awk -F"[[:space:]]+,[[:space:]]+"  'FNR==NR{A[$NF]=$1>A[$NF]?$1:A[$NF];next} (($NF) in A) && $1 == A[$NF] && A[$NF]{print}'   Input_file  Input_file

EDIT: Try:

awk -F","  'FNR==NR{A[$NF]=$1>A[$NF]?$1:A[$NF];next} (($NF) in A) && $1 == A[$NF] && A[$NF]{print}' Input_file   Input_file

EDIT2: Here is the explanation, as requested by the OP:

awk -F"," '        # start awk with the field delimiter set to comma (,)
FNR==NR {          # TRUE only while Input_file is read the FIRST time.
                   # NR and FNR both count lines, but FNR resets for each
                   # new input file while NR keeps increasing across all
                   # files, so this condition holds only during the first
                   # pass over Input_file.
    A[$NF] = $1 > A[$NF] ? $1 : A[$NF]
                   # build array A indexed by the last field ($NF). The
                   # ternary operator (?:) keeps the HIGHEST first field
                   # seen so far for each last-field key: if the current
                   # $1 is greater than the stored A[$NF], store $1,
                   # otherwise keep A[$NF] unchanged (per the requirement
                   # we need the highest value).
    next           # skip all further statements and read the next line,
                   # so nothing below runs while the first pass is active.
}
($NF in A) && $1 == A[$NF] && A[$NF] {
                   # executed only during the SECOND pass over Input_file:
                   # the last field must be a key in A, the first field
                   # must equal the stored maximum, and the stored value
                   # must be non-empty.
    print          # if all conditions are TRUE, print the current line.
}' Input_file Input_file   # the same Input_file is read twice.
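
A worked example with made-up data (two rows share the key abc; the one with the higher first field wins):

$ cat Input_file
5,abc
3,abc
9,xyz
$ awk -F"," 'FNR==NR{A[$NF]=$1>A[$NF]?$1:A[$NF];next} (($NF) in A) && $1 == A[$NF] && A[$NF]{print}' Input_file Input_file
5,abc
9,xyz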

