How to Remove Duplicates from a CSV File

How to remove duplicates from a csv file

UPDATE: 2016

If you are happy to use the helpful more_itertools external library:

from more_itertools import unique_everseen
with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    out_file.writelines(unique_everseen(f))

A more efficient version of @IcyFlame's solution

with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen:
            continue  # skip duplicate

        seen.add(line)
        out_file.write(line)

To edit the same file in place, you could use this (old Python 2 code):

import fileinput
seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen:
        continue  # skip duplicate

    seen.add(line)
    print line,  # standard output is now redirected to the file
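
For reference, a minimal Python 3 sketch of the same in-place edit (my addition, assuming the file is still named 1.csv); fileinput redirects standard output into the file when inplace=True:

import fileinput

seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.input('1.csv', inplace=True):
    if line in seen:
        continue  # skip duplicate
    seen.add(line)
    print(line, end='')  # stdout is redirected into the file, so this rewrites 1.csv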

Removing duplicate rows from a CSV file using a Python script and updating the CSV file

Simple and short with the pandas module:

import pandas as pd

df = pd.read_csv('myfile.csv')
df.drop_duplicates(inplace=True)
df.to_csv('myfile.csv', index=False)

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

How to remove duplicate/repeated rows in csv with python?

If you want to do it with pandas:

import pandas as pd

# 1. Read the CSV
df = pd.read_csv("data.csv")

# 2(a). For complete row duplicates
df.drop_duplicates(inplace=True)

# 2(b). For partial duplicates, deduplicate on a subset of columns
df.drop_duplicates(subset=['Date', 'Time', <other_fields>], inplace=True)

# 3. Save it back
df.to_csv("data.csv", index=False)

Removing duplicate records from CSV file using Python Pandas

You don't need numpy or anything else; you can de-duplicate in one line while reading the CSV with pandas:

import pandas as pd
df = pd.read_csv('lookup_scales.csv', usecols=['minzoom', 'maxzoom']).drop_duplicates(keep='first').reset_index(drop=True)

output:

   minzoom  maxzoom
0        0       15
1        8       15
2       15       18

Then to write it out to csv:

df.to_csv(file_name, index=False)  # no need to set sep: to_csv writes comma-delimited output by default

So the whole code:

import pandas as pd
df = pd.read_csv('lookup_scales.csv', usecols=['minzoom', 'maxzoom']).drop_duplicates(keep='first').reset_index(drop=True)
file_name = "C:/Marine/lookup/distinct_lookup_scales.csv"
df.to_csv(file_name, index=False)  # no need to set sep: to_csv writes comma-delimited output by default

Find and remove duplicates in a CSV file

Note: this answer was written before the OP changed the question's python tag to the awk tag.

If you don't mind the order of the elements you might do:

with open("in.csv", "r") as file:
lines = set()
for line in file:
lines.add(frozenset(line.strip("\n").split(",")))

with open("out.csv", "w") as file:
for line in lines:
file.write(",".join(line)+"\n")

Output:

Col2,COL1,Col3
EFG,454,ABC
DEF,123,ABC

Note that you might want to treat the first line (the header with the column titles) in a special way so as not to lose its order, for example:
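
A minimal sketch of that idea (my own addition, using the same in.csv/out.csv names): read the header aside first, de-duplicate only the data rows, and write the header back before them.

with open("in.csv", "r") as infile:
    header = infile.readline()  # keep the title row aside, untouched
    rows = set()
    for line in infile:
        rows.add(frozenset(line.strip("\n").split(",")))

with open("out.csv", "w") as outfile:
    outfile.write(header)  # the header goes back first, in its original order
    for row in rows:
        outfile.write(",".join(row) + "\n")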

But if the order matters, you could use the code from Maintaining the order of the elements in a frozen set:

from itertools import filterfalse

def unique_everseen(iterable, key=None):
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

with open("in.csv", "r") as file:
    lines = []
    for line in file:
        lines.append(line.strip("\n").split(","))

with open("out.csv", "w") as file:
    for line in unique_everseen(lines, key=frozenset):
        file.write(",".join(line) + "\n")

Output:

COL1,Col2,Col3
ABC,DEF,123
ABC,EFG,454

The OP said that both snippets seem not to work on large files (1.8 GB). I think that is because both store the whole file in memory, and a 1.8 GB file may exhaust the available RAM.

To solve that, I made a few more attempts. Sadly, I must say that all of them are extremely slow compared to the first ones. The first snippets sacrifice RAM for speed, while the following ones sacrifice speed, CPU and hard-drive usage for lower RAM consumption (instead of holding the whole file in RAM, they use less than 50 MB).

Since all of these examples make heavy use of the hard drive, it's advisable to keep the input and output files on different drives.

My first attempt using less RAM is with the shelve module:

import shelve, os

with shelve.open("tmp") as db:
    with open("in.csv", "r") as file:
        for line in file:
            l = line.strip("\n").split(",")
            l.sort()
            db[",".join(l)] = l

    with open("out.csv", "w") as file:
        for v in db.values():
            file.write(",".join(v) + "\n")

# clean up the shelve files (the exact names/extensions depend on the dbm backend)
os.remove("tmp.bak")
os.remove("tmp.dat")
os.remove("tmp.dir")

Sadly, this code takes hundreds of times longer than the first two snippets, which work in RAM.

Another attempt is:

with open("in.csv", "r") as fileRead:
# total = sum(1 for _ in fileRead)
# fileRead.seek(0)
# i = 0
with open("out.csv", "w") as _:
pass
with open("out.csv", "r+") as fileWrite:
for lineRead in fileRead:
# i += 1
line = lineRead.strip("\n").split(",")
lineSet = set(line)
write = True
fileWrite.seek(0)
for lineWrite in fileWrite:
if lineSet == set(lineWrite.strip("\n").split(",")):
write = False
if write:
pass
fileWrite.write(",".join(line)+"\n")
# if i / total * 100 % 1 == 0: print(f"{i / total * 100}% ({i} / {total})")

This is slightly faster but not much.

If your computer has several cores, you could try to use multiprocessing:

from multiprocessing import Process, Queue, cpu_count
from os import remove

def slave(number, qIn, qOut):
    name = f"slave-{number}.csv"
    with open(name, "w") as file:
        pass  # create/truncate this worker's scratch file
    with open(name, "r+") as file:
        while True:
            if not qIn.empty():
                get = qIn.get()
                if get == False:
                    qOut.put(name)
                    break
                else:
                    write = True
                    file.seek(0)
                    for line in file:
                        if set(line.strip("\n").split(",")) == get[1]:
                            write = False
                            break
                    if write:
                        file.write(get[0])

def master():
    qIn = Queue(1)
    qOut = Queue()
    slaves = cpu_count()
    slavesList = []

    for n in range(slaves):
        slavesList.append(Process(target=slave, daemon=True, args=(n, qIn, qOut)))
    for s in slavesList:
        s.start()

    with open("in.csv", "r") as file:
        for line in file:
            lineSet = set(line.strip("\n").split(","))
            qIn.put((line, lineSet))
    for _ in range(slaves):
        qIn.put(False)

    for s in slavesList:
        s.join()

    slavesList = []

    with open(qOut.get(), "r+") as fileMaster:
        for x in range(slaves - 1):
            file = qOut.get()
            with open(file, "r") as fileSlave:
                for lineSlave in fileSlave:
                    lineSet = set(lineSlave.strip("\n").split(","))
                    write = True
                    fileMaster.seek(0)
                    for lineMaster in fileMaster:
                        if set(lineMaster.strip("\n").split(",")) == lineSet:
                            write = False
                            break
                    if write:
                        fileMaster.write(lineSlave)

            slavesList.append(Process(target=remove, daemon=True, args=(file,)))
            slavesList[-1].start()

    for s in slavesList:
        s.join()

if __name__ == "__main__":  # required so child processes can import this module safely
    master()

As you can see, I have the disappointing task of telling you that both of my attempts are really slow. I hope you find a better approach; otherwise it will take hours, if not days, to run on 1.8 GB of data (the actual time depends mostly on the number of repeated values, which reduces the work).

A new attempt: instead of storing everything in memory, this attempt keeps only the active portion in RAM and periodically writes it to a file so that chunks can be processed faster. The chunks then have to be de-duplicated again with one of the methods above:

lines = set()
maxLines = 1000  # number of lines kept in RAM at the same time; higher is faster but requires more RAM
perfect = True
with open("in.csv", "r") as fileRead:
    total = sum(1 for _ in fileRead)
    fileRead.seek(0)
    i = 0
    with open("tmp.csv", "w") as fileWrite:
        for line in fileRead:
            if len(lines) < maxLines:
                lines.add(frozenset(line.strip("\n").split(",")))
                i += 1
                if i / total * 100 % 1 == 0: print(f"Reading {i / total * 100}% ({i} / {total})")
            else:
                perfect = False
                for stored in lines:
                    fileWrite.write(",".join(stored) + "\n")
                if i / total * 100 % 1 == 0: print(f"Storing {i / total * 100}% ({i} / {total})")
                lines = set()
                lines.add(frozenset(line.strip("\n").split(",")))  # keep the line that triggered the flush
                i += 1
        for stored in lines:  # write out whatever is still held in RAM
            fileWrite.write(",".join(stored) + "\n")

if not perfect:
    use_one_of_the_above_methods()  # Remember to read the tmp.csv and not the in.csv

This might boost the speed. You can set maxLines to any number you like; remember that the higher the number, the greater the speed (though very large values may have the opposite effect) and the higher the RAM consumption.
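
For completeness, here is a minimal sketch (my own assumption, not part of the original answer) of what use_one_of_the_above_methods() could look like if the reduced tmp.csv now fits in memory, reusing the order-insensitive frozenset approach from above; for a still-huge tmp.csv you would plug in the shelve or file-scanning variants instead.

def use_one_of_the_above_methods():
    # Hypothetical helper: de-duplicate tmp.csv into out.csv with the frozenset approach.
    rows = set()
    with open("tmp.csv", "r") as file:
        for line in file:
            rows.add(frozenset(line.strip("\n").split(",")))

    with open("out.csv", "w") as file:
        for row in rows:
            file.write(",".join(row) + "\n")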

Remove duplicate values in csv files

Depending on the size of your input, a naive approach could be fine:

$ cat test
68,68
70,70
80,90
$ cat readvals.py
#! /usr/bin/env python
import csv
vals = []  # a list for the entire file
with open('test') as infile:
    lines = csv.reader(infile, delimiter=',')
    for i, line in enumerate(lines):
        vals.append([])  # append a sub-list for this row
        for val in line:
            if val not in vals[i]:
                vals[i].append(val)  # add values for the row
print(vals)
$ python readvals.py
[['68'], ['70'], ['80', '90']]

remove the duplicate rows in CSV file

Below is a standalone example that shows how to filter out duplicates. The idea is to take the values of each dict and convert them to a tuple; using a set of those tuples, we can skip rows we have already written.

import csv

csv_columns = ['No', 'Name', 'Country']
dict_data = [
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 2, 'Name': 'Ben', 'Country': ['USA']},
]
csv_file = "Names.csv"

with open(csv_file, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
    writer.writeheader()
    entries = set()
    for data in dict_data:
        val = tuple(','.join(v) if isinstance(v, list) else v for v in data.values())
        if val not in entries:
            writer.writerow(data)
            entries.add(val)
print('done')

Names.csv

No,Name,Country
1,Alex,['India']
2,Ben,['USA']
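
One thing worth noting (my own observation, not part of the original answer): because Country holds a list, DictWriter writes its repr, which is why ['India'] appears in the CSV. A small variant of the same example, flattening list values before writing, would produce plain text instead:

import csv

csv_columns = ['No', 'Name', 'Country']
dict_data = [
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 2, 'Name': 'Ben', 'Country': ['USA']},
]

with open("Names_flat.csv", 'w', newline='') as csvfile:  # hypothetical output file name
    writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
    writer.writeheader()
    entries = set()
    for data in dict_data:
        # join list values into plain strings before duplicate-checking and writing
        row = {k: ','.join(v) if isinstance(v, list) else v for k, v in data.items()}
        val = tuple(row.values())
        if val not in entries:
            writer.writerow(row)
            entries.add(val)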

