Linux Bash commands to remove duplicates from a CSV file
You need to provide more details for question 3, but for questions 1 and 2 the following awk
one-liner will work.
awk 'seen[$0]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv
And with some whitespace added for clarity:
awk 'seen[$0]++ {
print $0 > "dups.csv"; next
}
{
print $0 > "new.csv"
}' mycsv
This will not print anything to STDOUT but will create two files. dups.csv
will contain all the duplicates that were removed (that is, if there are 5 entries of the same line, this file will contain the 4 copies that were removed as dups) and new.csv
will contain all unique rows.
seen[$0]++
is a test we do for each line. If the line is already present in our array, it is written to the dups.csv
file and we move on to the next line using next
. If the line is not present, we add it to the array and write it to the new.csv
file.
Use of $0
means the entire line. If you want to compare fewer columns, you can do so; you just need to set the input field separator based on the delimiter. You have mentioned csv
but I don't see any comma
delimiters, so I am using the default separator, which is [[:space:]]+
.
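As a quick sanity check, here is the one-liner run on a tiny made-up input (the sample lines are an assumption, not from the question):

```shell
# Hypothetical sample input: one line repeated three times
printf 'a 1\nb 2\na 1\nc 3\na 1\n' > mycsv

awk 'seen[$0]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv

cat new.csv    # first occurrence of each line: a 1, b 2, c 3
cat dups.csv   # the two removed copies: a 1, a 1
```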
Also, it is comma separated; I was just putting sample data up. So if I want to use the above example but only want to test columns 3 & 4 (using the seen command), how would I do that in a comma-separated file?
For true csv
just set the field separator to ,
. seen
is not a command; it is a hash that keeps the chosen columns as keys. So you would modify the above command to:
awk -F, 'seen[$3,$4]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv
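For instance, on hypothetical comma-separated data where two rows share the same values in columns 3 and 4:

```shell
# Made-up input: rows 1 and 2 share X,Y in columns 3-4
printf 'a,b,X,Y,1\nc,d,X,Y,2\ne,f,Z,W,3\n' > mycsv

awk -F, 'seen[$3,$4]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv

cat new.csv    # keeps a,b,X,Y,1 and e,f,Z,W,3
cat dups.csv   # c,d,X,Y,2 (later row with the same X,Y pair)
```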
Update:
Once you have a list without dups using the commands stated above, we are left with:
$ cat new.csv
john,kerry
adam,stephenson
ashley,hudson
$ cat remove.txt
adam
$ awk -F, 'NR==FNR{remove[$1]++;next}!($1 in remove)' remove.txt new.csv
john,kerry
ashley,hudson
csv remove all rows with duplicate values in one column
This awk
can do that in a single command:
awk -F, '{arr[$1]=$0} seen[$1]++{delete arr[$1]} END{for (i in arr) print arr[i]}' file.csv
id2, value2
id3, value3
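The output above fits an input along these lines (the file contents are an assumption, not from the question):

```shell
# Hypothetical file.csv: id1 appears twice, so it is deleted from arr
# and all of its rows are dropped from the output
cat > file.csv <<'EOF'
id1, value1
id2, value2
id1, value4
id3, value3
EOF

awk -F, '{arr[$1]=$0} seen[$1]++{delete arr[$1]} END{for (i in arr) print arr[i]}' file.csv
```

Note that `for (i in arr)` iterates in an unspecified order, so the surviving rows may print in any order.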
For your edited question use:
awk -F'","' '{arr[$1]=$0} seen[$1]++{delete arr[$1]} END{for (i in arr) print arr[i]}' file.csv
"06e04,0429","405872,8637110"
"06e04,0430","405872,8637110"
"06e04,0431","405872,8637111"
Bash/Shell: How to remove duplicates from csv file by columns?
Using awk
:
awk -F';' '!seen[$2,$3]++{data[$2,$3]=$0}
END{for (i in seen) if (seen[i]==1) print data[i]}' file
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
Explanation: If the $2,$3
combination doesn't exist in the seen
array, a new entry with the key $2,$3
and the whole record is stored in the data
array. Every time a $2,$3
entry is found, the counter for $2,$3
is incremented. Then at the end, those entries with counter==1
are printed.
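A self-contained run on invented data (the row contents are assumptions) shows that rows whose $2,$3 pair repeats are dropped entirely, not merely deduplicated:

```shell
# Made-up input: the data1;data2 pair appears twice, data5;data6 once
cat > file <<'EOF'
x;data1;data2;x;x
y;data1;data2;y;y
z;data5;data6;z;z
EOF

awk -F';' '!seen[$2,$3]++{data[$2,$3]=$0}
END{for (i in seen) if (seen[i]==1) print data[i]}' file
# only the data5;data6 row survives; both data1;data2 rows are removed
```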
MERGE 2 csv files removing duplicates using BASH or Awk
If you don't care about output order, then using GNU sort for -s
(stable sort):
$ sort -t, -k3,3 -u -s small_set.csv large_set.csv
0,116070743,34603492144,false,2020-02-02 15:24:31,2020-02-02 15:24:31
0,116070741,34617138590,false,2020-02-02 15:24:30,2020-02-02 15:24:30
0,116070422,34617138999,false,2020-02-02 15:24:30,2020-02-02 15:24:30
0,116070777,34991221138,false,2020-02-02 15:24:30,2020-02-02 15:24:30
If you do care, then you'd have to tell us what that output order should be, as it's not obvious from your question.
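The point of -s is that when both files contain a line with the same key in field 3, equal-key lines keep their input order, so sort -u keeps the line from the file listed first. A minimal sketch with hypothetical two-line files:

```shell
# Both files carry a KEY1 row; small_set.csv is listed first
printf '1,small,KEY1\n' > small_set.csv
printf '2,large,KEY1\n3,large,KEY2\n' > large_set.csv

# -s (stable) preserves input order among equal keys,
# so -u keeps small_set.csv's KEY1 line
sort -t, -k3,3 -u -s small_set.csv large_set.csv
# 1,small,KEY1
# 3,large,KEY2
```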
Removing specific lines/rows or duplicates from a file in Linux
@zinovyev's code works fine for me. @Tabbi, to resolve your issue, run the remove_dup.sh script (Code + DATA in one file) below:
#!/bin/bash
# Print the header line (sed deletes everything up to and including __DATA__)
sed '0,/^__DATA__$/d' "$0" | head -n 1
# Deduplicate the remaining lines on columns 3-4, then sort naturally
sed '0,/^__DATA__$/d' "$0" | tail -n +2 | sort -k 3,4 -u | sort -V
exit
__DATA__
ID SAMPLE LABNO Oth_ID sex age bmi ca_1 pd_7
1003341 21863 21863 NA 1 48.68 22.42 0 0
1003343 22697 22697 NA 1 48.98 23.25 0 0
1003347 4421 4421 NA 1 48.70 25.56 NA NA
1003348 1642 1642 NA 1 48.72 16.57 NA NA
1003349 4163 4163 6069 1 49.02 23.47 1 NA
1003349 6069 4163 6069 1 49.02 23.47 1 NA
1003356 5347 9053 5347 1 49.08 24.81 0 0
1003356 9053 9053 5347 1 49.08 24.81 0 0
1003357 695 695 NA 1 49.08 22.32 NA NA
1003360 19833 19833 NA 1 48.55 22.48 0 0
1003365 5392 6843 5392 1 48.70 23.08 0 0
1003365 6843 6843 5392 1 48.70 23.08 0 0
and you should see the result
ID SAMPLE LABNO Oth_ID sex age bmi ca_1 pd_7
1003341 21863 21863 NA 1 48.68 22.42 0 0
1003343 22697 22697 NA 1 48.98 23.25 0 0
1003347 4421 4421 NA 1 48.70 25.56 NA NA
1003348 1642 1642 NA 1 48.72 16.57 NA NA
1003349 4163 4163 6069 1 49.02 23.47 1 NA
1003356 5347 9053 5347 1 49.08 24.81 0 0
1003357 695 695 NA 1 49.08 22.32 NA NA
1003360 19833 19833 NA 1 48.55 22.48 0 0
1003365 5392 6843 5392 1 48.70 23.08 0 0
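The same header-preserving dedup works on an ordinary data file, without the __DATA__ self-reading trick (the file name data.txt is hypothetical):

```shell
# Keep the header line untouched
head -n 1 data.txt
# Dedup the body on columns 3-4, then restore a natural (version) sort
tail -n +2 data.txt | sort -k 3,4 -u | sort -V
```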
Unix Delete Duplicate rows from csv based on 2 columns
@Mr Smith: Could you please try the following and let me know if it helps you.
awk -F"[[:space:]]+,[[:space:]]+" 'FNR==NR{A[$NF]=$1>A[$NF]?$1:A[$NF];next} (($NF) in A) && $1 == A[$NF] && A[$NF]{print}' Input_file Input_file
EDIT: Try:
awk -F"," 'FNR==NR{A[$NF]=$1>A[$NF]?$1:A[$NF];next} (($NF) in A) && $1 == A[$NF] && A[$NF]{print}' Input_file Input_file
EDIT2: Following is the explanation, as per the OP's request:
awk -F","                  ##### Start awk here and set the field delimiter to comma (,).
'FNR==NR{                  ##### The FNR==NR condition is TRUE only while Input_file is being read for the first time.
                                 We want to save the value of the last field as an index in array A, whose value is $1.
                                 FNR and NR are awk's built-in variables; both count lines, but FNR is RESET each time
                                 a new Input_file starts, while NR keeps increasing until all the Input_files are done.
                                 So this condition is TRUE only while the first Input_file is being read.
A[$NF]=                    ##### Populate array A, whose index is $NF (the last field of the line), with the result of a condition:
$1>A[$NF]                  ##### the condition is whether the current line's $1 is greater than A[$NF]'s value (of course only lines
                                 sharing the same last field are compared, so if $1's value is greater than A[$NF]'s value then...)
?                          ##### ? is the ternary (conditional) operator: if the condition is TRUE, use the value that follows,
$1                         ##### i.e. set A[$NF] to $1 (because, as per your requirement, we need the HIGHEST value);
:                          ##### if the condition is FALSE, the : operator introduces the value to use instead:
A[$NF];                    ##### keep the value of A[$NF] as it is, no change to it.
next}                      ##### next is an awk built-in keyword; it skips all further statements and restarts at the
                                 very first statement with the next line, so the later block never runs during the first read.
(($NF) in A) && $1 == A[$NF] && A[$NF]{ ##### These conditions run only during the second read of Input_file: check that
                                 $NF (the last field of the current line) is a key in array A, that A's value equals the first field,
                                 and that A's value is NOT NULL.
print                      ##### If all the above conditions are TRUE, print the current line of Input_file.
}' Input_file Input_file   ##### Mention Input_file twice so it is read twice.
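Putting the explanation together, a small run on invented data (the values are assumptions) keeps, for each last-field value, only the row whose first field equals the highest first field seen for that value:

```shell
# Made-up input: two apple rows, one banana row
printf '5,apple\n3,apple\n7,banana\n' > Input_file

# First pass records the max $1 per fruit (5 for apple, 7 for banana);
# second pass prints only the rows matching those maxima
awk -F"," 'FNR==NR{A[$NF]=$1>A[$NF]?$1:A[$NF];next} (($NF) in A) && $1 == A[$NF] && A[$NF]{print}' Input_file Input_file
# 5,apple
# 7,banana
```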