Pandas - How to Compare 2 CSV Files and Output Changes

Compare 2 different csv files and output all the changes into a new csv

First, read each CSV file into its own dictionary, using the longName values (taken here from the first column) as keys.

import csv

# old_csv_file and new_csv_file hold the paths of the two files to compare
with open(old_csv_file, "r", newline="") as fh:
    reader = csv.reader(fh)
    old_csv = {row[0]: row for row in reader}

with open(new_csv_file, "r", newline="") as fh:
    reader = csv.reader(fh)
    new_csv = {row[0]: row for row in reader}
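The row[0] lookup above depends on longName being the first column, and a header row, if present, ends up as an ordinary dictionary entry. As a rough sketch, assuming the files have a header row with a column literally named longName (the sample data and the helper name load_keyed are made up for illustration), csv.DictReader can key the rows by column name instead:

```python
import csv
import io

# Hypothetical sample data standing in for a real file on disk
SAMPLE = "longName,value\nalpha,1\nbeta,2\n"


def load_keyed(fh, key_column="longName"):
    """Return {key: row_dict} for every data row; DictReader consumes the header."""
    return {row[key_column]: row for row in csv.DictReader(fh)}


rows = load_keyed(io.StringIO(SAMPLE))
print(rows["beta"]["value"])  # prints 2
```

With this variant each row is a dict rather than a list, so the later comparison steps would compare dicts instead of lists.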

Then, it's easy to find newly added and deleted keys using set operations.

old_longNames = set(old_csv.keys())
new_longNames = set(new_csv.keys())

# common: set intersection
common_longNames = old_longNames.intersection(new_longNames)
# removed: whatever's in old but not in new
removed_longNames = old_longNames - new_longNames
# added: whatever's in new but not in old
added_longNames = new_longNames - old_longNames

Finally, iterate over the common set to find where there are changes:

changed_longNames = []
for key in common_longNames:
    old_row = old_csv[key]
    new_row = new_csv[key]
    # if any(o != n for o, n in zip(old_row, new_row)):
    if old_row != new_row:
        # this row has at least one column changed. Do whatever
        print(f"LongName {key} has changes")
        changed_longNames.append(key)

Or, as a list comprehension:

changed_longNames = [key for key in common_longNames if old_csv[key] != new_csv[key]]
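If you also want to know which columns changed, not just which rows, something along these lines might help. It is only a sketch: changed_columns is a hypothetical helper and the rows are made-up data. Note that zip stops at the shorter row, so columns present in only one row are handled separately:

```python
def changed_columns(old_row, new_row):
    """Return the indices of columns whose values differ between two rows."""
    # Compare the columns both rows have in common
    diffs = [i for i, (o, n) in enumerate(zip(old_row, new_row)) if o != n]
    # Columns that exist in only one of the rows also count as changed
    shorter, longer = sorted((len(old_row), len(new_row)))
    diffs.extend(range(shorter, longer))
    return diffs


# Hypothetical rows for illustration
old_row = ["alpha", "1", "x"]
new_row = ["alpha", "2", "x", "extra"]
print(changed_columns(old_row, new_row))  # prints [1, 3]
```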

Writing the results to new CSV files is then straightforward. Note that sets are unordered, so the rows may not be written in the same order as they appear in the input files.

# newline="" lets the csv module control line endings itself
with open("deleted.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    for key in removed_longNames:
        writer.writerow(old_csv[key])

with open("inserted.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    for key in added_longNames:
        writer.writerow(new_csv[key])

with open("changed.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    for key in changed_longNames:
        old_row = old_csv[key]
        new_row = new_csv[key]
        # interleave the columns: old value, new value, old value, new value, ...
        merged_row = []
        for oi, ni in zip(old_row, new_row):
            merged_row.append(oi)
            merged_row.append(ni)
        writer.writerow(merged_row)
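If the output order matters, one option is to skip the sets and iterate over the dictionaries directly, since dict preserves insertion order in Python 3.7 and later. A minimal sketch with hypothetical in-memory data in place of the parsed files:

```python
# Hypothetical parsed files: key -> row, in file order
old_csv = {"z": ["z", "0"], "c": ["c", "1"], "a": ["a", "2"], "b": ["b", "3"]}
new_csv = {"b": ["b", "9"], "d": ["d", "4"], "c": ["c", "1"]}

# Iterating over a dict yields its keys in insertion (file) order
removed = [key for key in old_csv if key not in new_csv]
added = [key for key in new_csv if key not in old_csv]
changed = [key for key in old_csv
           if key in new_csv and old_csv[key] != new_csv[key]]

print(removed)  # prints ['z', 'a']
print(added)    # prints ['d']
print(changed)  # prints ['b']
```

The lists can then be used in the writing loops in place of the sets.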

Compare two csv files and output changes

My solution is to turn each csv into a dictionary with the first column as the keys and the second column as the values. After that, I can loop through the keys and determine if the corresponding values were changed, removed, or added.

import csv
import re


def csv2dict(filename):
    with open(filename) as file_handle:
        reader = csv.reader(file_handle)
        dict_object = dict(reader)
    return dict_object


def separate_text_and_number(value):
    text, number = re.match(r'(\D+)(\d+)', value).groups()
    number = int(number)
    return (text, number)


def main():
    """ Entry """
    csv1 = csv2dict('file1.csv')
    csv2 = csv2dict('file2.csv')
    all_keys = csv1.keys() | csv2.keys()

    for key in sorted(all_keys, key=separate_text_and_number):
        if key not in csv2:
            print(f'{key} value removed')
        elif key not in csv1:
            print(f'{key} value added')
        elif csv1[key] != csv2[key]:
            print(f'{key} value changed from {csv1[key]} to {csv2[key]}')


if __name__ == '__main__':
    main()

Output

name1 value changed from 2.0001 to 3.0000
name3 value added
name4 value removed
name5 value changed from 1.0000 to 1.0901
name7 value added
name8 value removed
name10 value removed
name11 value added
name12 value added

Notes

  • The function csv2dict opens a file and converts the contents into a dictionary
  • The function separate_text_and_number splits name14 into ('name', 14) to help with sorting the keys
  • In Python 3, the dict.keys() method returns a set-like object which contains all the keys. I use the | operator to find a union of two sets of keys.
  • For a more readable output, I sort the keys with the help of separate_text_and_number
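To see why the sort helper is needed, here is a small illustration with made-up keys. A plain string sort puts name10 before name2, while sorting on the (text, number) tuple gives the numeric order:

```python
import re


def separate_text_and_number(value):
    # Same helper as in the answer: "name14" -> ("name", 14)
    text, number = re.match(r"(\D+)(\d+)", value).groups()
    return (text, int(number))


keys = ["name10", "name2", "name1"]
print(sorted(keys))                                # prints ['name1', 'name10', 'name2']
print(sorted(keys, key=separate_text_and_number))  # prints ['name1', 'name2', 'name10']
```

Keep in mind that the helper raises an AttributeError for a key that does not match the text-then-digits pattern (a header cell such as name, for example), because re.match returns None in that case.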

Python compare two csv

Using pandas, you can merge two DataFrames on a shared column, so that information from one DataFrame is attached to the matching rows of the other. Here's an example:

import pandas as pd

csv1 = pd.DataFrame({"name":["test1","test2","test3","test4","test5"],"type":["A","B","C","A","D"]})

csv2 = pd.DataFrame({"type":["A","B","C"],"value":[1,2,3]})

pd.merge(csv1, csv2, on="type", how='outer')

And the output would be:

name   type  value
test1  A     1.0
test4  A     1.0
test2  B     2.0
test3  C     3.0
test5  D     NaN
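The same outer merge can also be used to compare two versions of a file. As a hedged sketch (the DataFrames and column names below are invented for illustration), passing indicator=True adds a _merge column that marks each row as left_only, right_only, or both, which maps to removed, added, and common rows:

```python
import pandas as pd

# Hypothetical stand-ins for two versions of the same CSV
old = pd.DataFrame({"name": ["a", "b", "c"], "value": [1, 2, 3]})
new = pd.DataFrame({"name": ["b", "c", "d"], "value": [2, 30, 4]})

# indicator=True adds a "_merge" column: "left_only", "right_only", or "both"
merged = pd.merge(old, new, on="name", how="outer",
                  suffixes=("_old", "_new"), indicator=True)

removed = merged.loc[merged["_merge"] == "left_only", "name"].tolist()
added = merged.loc[merged["_merge"] == "right_only", "name"].tolist()

# Among rows present in both files, keep those whose value differs
both = merged[merged["_merge"] == "both"]
changed = both.loc[both["value_old"] != both["value_new"], "name"].tolist()

print(removed, added, changed)  # prints ['a'] ['d'] ['c']
```

With real files, old and new would come from pd.read_csv, and the comparison in the last step would need to cover every non-key column.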

Comparing two CSV files and exporting the differences and similarities in Python?

You could merge the netscan and computer DataFrames, then fill the missing values in the Serial column with SerialN/A.

import pandas as pd
netscan = pd.read_csv('netscan.csv')
computer = pd.read_csv('computer_list.csv', usecols=['Name'])
for df in [netscan, computer]:
    df['Name'] = df['Name'].str.rstrip()
result = pd.merge(netscan, computer, on='Name', how='outer')
result['Serial'] = result['Serial'].fillna('SerialN/A')
result.to_csv('result.csv', index=False)
print(result)

This produces a CSV file (result.csv) containing:

Name,Serial,Models
computer1,serial1,model1
computer2,serial2,model2
computer3,serial3,model3
computer4,serial4,model4
computerH,SerialN/A,
computerP,SerialN/A,

