Compare Two CSV Files and Search for Similar Items

Compare two CSV files and search for similar items

Edit: While my solution works correctly, check out Martijn's answer below for a more efficient solution.

The documentation for Python's csv module describes the csv.reader and csv.writer objects used below.

What you're looking for is something like this:

import csv

# Python 3: open() replaces the old file() built-in, and the csv module
# expects files to be opened with newline=''
with open('hosts.csv', newline='') as f1, \
     open('masterlist.csv', newline='') as f2, \
     open('results.csv', 'w', newline='') as f3:
    c1 = csv.reader(f1)
    c2 = csv.reader(f2)
    c3 = csv.writer(f3)

    # Read the master list once so it can be scanned for every host row
    masterlist = list(c2)

    for hosts_row in c1:
        found = False
        # Row numbers are reported 1-based
        for row, master_row in enumerate(masterlist, start=1):
            # Compare column 4 of hosts.csv with column 2 of masterlist.csv
            if hosts_row[3] == master_row[1]:
                hosts_row.append('FOUND in master list (row ' + str(row) + ')')
                found = True
                break
        if not found:
            hosts_row.append('NOT FOUND in master list')
        c3.writerow(hosts_row)
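
The nested loop above rescans the master list for every host row. A minimal sketch of one way to avoid that, assuming the same column positions (index 3 in the hosts file, index 1 in the master list), is to build a dictionary index first. The function name flag_hosts and the sample rows are hypothetical, and this is not necessarily the approach taken in the answer mentioned in the edit note.

```python
import csv
import io

def flag_hosts(hosts_rows, master_rows, host_col=3, master_col=1):
    """Append a FOUND/NOT FOUND note to each host row using a dict index."""
    # Map each master value to the first 1-based row number where it appears
    index = {}
    for row_number, master_row in enumerate(master_rows, start=1):
        index.setdefault(master_row[master_col], row_number)

    results = []
    for hosts_row in hosts_rows:
        row_number = index.get(hosts_row[host_col])
        if row_number is None:
            note = 'NOT FOUND in master list'
        else:
            note = 'FOUND in master list (row ' + str(row_number) + ')'
        results.append(hosts_row + [note])
    return results

# Small in-memory stand-ins for hosts.csv and masterlist.csv (made-up data)
hosts = list(csv.reader(io.StringIO("a,b,c,web01\na,b,c,db07\n")))
master = list(csv.reader(io.StringIO("x,db01\ny,web01\n")))
flagged = flag_hosts(hosts, master)
```

Each lookup is then a single dictionary access instead of a scan over the whole master list.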

Compare two csv files and write the matching entries in third file python

You are overwriting the output file each time you open it, because mode "w" truncates the file.
Change "w" to "a+" so that new rows are appended instead:

with open('file3.csv', "a+", encoding....
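
To illustrate the difference, here is a small sketch that writes the same rows twice, once reopening the file with "w" and once with "a+". The temporary path and the sample rows are made up for the demo.

```python
import csv
import os
import tempfile

# Hypothetical output path in a temporary directory, used only for this demo
out_path = os.path.join(tempfile.mkdtemp(), 'file3.csv')
matches = [['alice', '1'], ['bob', '2'], ['carol', '3']]

# Reopening with "w" inside the loop truncates the file each time,
# so only the last row survives
for row in matches:
    with open(out_path, 'w', newline='') as f:
        csv.writer(f).writerow(row)
with open(out_path, newline='') as f:
    rows_with_w = list(csv.reader(f))

os.remove(out_path)

# Reopening with "a+" appends, so every row is kept
for row in matches:
    with open(out_path, 'a+', newline='') as f:
        csv.writer(f).writerow(row)
with open(out_path, newline='') as f:
    rows_with_a = list(csv.reader(f))
```

Opening the file once with "w" before the loop also keeps every row, and it avoids piling up rows from earlier runs of the script.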

Compare two CSV files and look for matches Python

Try this:

import csv

alist, blist = [], []

# Python 3: open in text mode with newline='' instead of "rb"
with open("csv1.csv", newline='') as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        for row_str in row:
            # Split each cell on whitespace so multi-word cells become separate items
            alist += row_str.strip().split()

with open("organs.csv", newline='') as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist += row

first_set = set(alist)
second_set = set(blist)

print(first_set.intersection(second_set))

Iterating over a csv reader yields one row at a time, and each row is a list of strings such as ['arm', 'biopsy', 'forearm'], so the rows are concatenated with += to collect every item in a single list.

Converting each list with set() removes the duplicates, and the intersection method returns a new set containing only the elements that appear in both.
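
The same steps can be tried without any files. In this sketch the two lists of rows are made-up stand-ins for what csv.reader would yield:

```python
# Rows as csv.reader would yield them (made-up data for illustration)
rows_a = [['arm biopsy', 'forearm'], ['liver', 'arm']]
rows_b = [['arm', 'liver', 'kidney'], ['liver']]

alist, blist = [], []
for row in rows_a:
    for cell in row:
        # Split multi-word cells into individual words
        alist += cell.strip().split()
for row in rows_b:
    blist += row

# The & operator is equivalent to set(alist).intersection(blist)
common = set(alist) & set(blist)
print(sorted(common))
```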

how to compare two csv file in python and flag the difference?

The idea here is to flatten your dataframe with melt to compare each value:

import numpy as np
import pandas as pd

# Load your csv files
df1 = pd.read_csv('file1.csv', ...)
df2 = pd.read_csv('file2.csv', ...)

# Select columns (not mandatory, it depends on your 'Sn' column)
cols = ['Name', 'Subject', 'Marks']

# Flatten your dataframes
out1 = df1[cols].melt('Name', var_name='Item', value_name='Old')
out2 = df2[cols].melt('Name', var_name='Item', value_name='New')
out = pd.merge(out1, out2, on=['Name', 'Item'], how='outer')

# Flag the state of each item. np.select uses the first condition that matches,
# so the NaN checks must come before the inequality test (NaN != value is True).
condlist = [out['Old'].isna(),
            out['New'].isna(),
            out['Old'] != out['New']]

out['State'] = np.select(condlist, choicelist=['added', 'deleted', 'changed'],
                         default='unchanged')

Output:

>>> out
     Name     Item      Old       New      State
0     Ram  Subject    Maths  computer    changed
1    sita  Subject  Engilsh   Engilsh  unchanged
2  vishnu  Subject  science   science  unchanged
3  balaji  Subject   social    social  unchanged
4     Ram    Marks       85        85  unchanged
5    sita    Marks       66        66  unchanged
6  vishnu    Marks       50        90    changed
7  balaji    Marks       60        60  unchanged
8  kishor  Subject      NaN      chem      added
9  kishor    Marks      NaN        99      added

Compare two csv files and output changes

My solution is to turn each csv into a dictionary with the first column as the keys and the second column as the values. After that, I can loop through the keys and determine if the corresponding values were changed, removed, or added.

import csv
import re

def csv2dict(filename):
    with open(filename) as file_handle:
        reader = csv.reader(file_handle)
        dict_object = dict(reader)
    return dict_object

def separate_text_and_number(value):
    text, number = re.match(r'(\D+)(\d+)', value).groups()
    number = int(number)
    return (text, number)

def main():
    """ Entry """
    csv1 = csv2dict('file1.csv')
    csv2 = csv2dict('file2.csv')
    all_keys = csv1.keys() | csv2.keys()

    for key in sorted(all_keys, key=separate_text_and_number):
        if key not in csv2:
            print(f'{key} value removed')
        elif key not in csv1:
            print(f'{key} value added')
        elif csv1[key] != csv2[key]:
            print(f'{key} value changed from {csv1[key]} to {csv2[key]}')

if __name__ == '__main__':
    main()

Output

name1 value changed from 2.0001 to 3.0000
name3 value added
name4 value removed
name5 value changed from 1.0000 to 1.0901
name7 value added
name8 value removed
name10 value removed
name11 value added
name12 value added

Notes

  • The function csv2dict opens a file and converts the contents into a dictionary
  • The function separate_text_and_number splits name14 into ('name', 14) to help with sorting the keys
  • In Python 3, the dict.keys() method returns a set-like object which contains all the keys. I use the | operator to find a union of two sets of keys.
  • For a more readable output, I sort the keys with the help of separate_text_and_number
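
To see why the sort key matters, here is a short sketch comparing a plain sort with the numeric-aware sort. The two dictionaries are made-up stand-ins for what csv2dict would return:

```python
import re

def separate_text_and_number(value):
    # 'name14' -> ('name', 14)
    text, number = re.match(r'(\D+)(\d+)', value).groups()
    return (text, int(number))

# Made-up dictionaries standing in for the two parsed csv files
old = {'name1': '2.0001', 'name4': '1.5', 'name10': '7'}
new = {'name1': '3.0000', 'name3': '0.5'}

# Union of the two set-like key views
all_keys = old.keys() | new.keys()

# Plain string sorting puts 'name10' before 'name3'
plain_order = sorted(all_keys)
# Sorting on (text, number) restores the numeric order
natural_order = sorted(all_keys, key=separate_text_and_number)
```

With these keys, plain_order is ['name1', 'name10', 'name3', 'name4'], while natural_order is ['name1', 'name3', 'name4', 'name10'].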

