Compare 2 different csv files and output all the changes into a new csv
First, read each CSV file into its own dictionary, using the longName
values (taken here from the first column, row[0]) as keys.
import csv

with open(old_csv_file, "r") as fh:
    reader = csv.reader(fh)
    old_csv = {row[0]: row for row in reader}

with open(new_csv_file, "r") as fh:
    reader = csv.reader(fh)
    new_csv = {row[0]: row for row in reader}
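If the files have a header row, the approach above also stores the header under its own key. A minimal sketch of one alternative, assuming the key column is literally named longName in the header: csv.DictReader consumes the header and lets you key each row by column name. The helper name load_by_key and the in-memory sample data are made up for illustration.

```python
import csv
import io

def load_by_key(fh, key_column="longName"):
    """Read CSV rows from an open file object into a dict keyed by one column."""
    reader = csv.DictReader(fh)  # the first line is consumed as the header
    return {row[key_column]: row for row in reader}

# In-memory sample standing in for a real file (illustrative data only).
sample = io.StringIO("longName,value\nalpha,1\nbeta,2\n")
rows = load_by_key(sample)
print(rows["beta"]["value"])  # prints 2
```

Each value is then a dict of column name to string, so later comparisons with != still work row by row.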
Then, it's easy to find newly added and deleted keys using set operations.
old_longNames = set(old_csv.keys())
new_longNames = set(new_csv.keys())
# common: set intersection
common_longNames = old_longNames.intersection(new_longNames)
# removed: whatever's in old but not in new
removed_longNames = old_longNames - new_longNames
# added: whatever's in new but not in old
added_longNames = new_longNames - old_longNames
Finally, iterate over the common set to find where there are changes:
changed_longNames = []
for key in common_longNames:
    old_row = old_csv[key]
    new_row = new_csv[key]
    # if any(o != n for o, n in zip(old_row, new_row)):
    if old_row != new_row:
        # this row has at least one column changed. Do whatever
        print(f"LongName {key} has changes")
        changed_longNames.append(key)
Or, as a list comprehension:
changed_longNames = [key for key in common_longNames if old_csv[key] != new_csv[key]]
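To see the three categories side by side, here is a small self-contained sketch of the same set logic run on in-memory dictionaries. The function name diff_rows and the sample rows are hypothetical; the results are sorted only to make the output deterministic.

```python
def diff_rows(old, new):
    """Return (added, removed, changed) key lists for two row dictionaries."""
    old_keys, new_keys = set(old), set(new)
    removed = sorted(old_keys - new_keys)   # in old but not in new
    added = sorted(new_keys - old_keys)     # in new but not in old
    # keys present in both whose rows differ in at least one column
    changed = sorted(k for k in old_keys & new_keys if old[k] != new[k])
    return added, removed, changed

# Illustrative data: each value is a full row, keyed by its first column.
old = {"a": ["a", "1"], "b": ["b", "2"], "c": ["c", "3"]}
new = {"a": ["a", "1"], "b": ["b", "20"], "d": ["d", "4"]}

added, removed, changed = diff_rows(old, new)
print(added, removed, changed)  # prints ['d'] ['c'] ['b']
```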
Writing everything to new CSV files is also fairly straightforward. Note that sets are unordered, so the rows may not come out in the same order as in the input files.
# newline="" is what the csv module recommends for files passed to csv.writer;
# without it, extra blank lines can appear between rows on Windows.
with open("deleted.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    for key in removed_longNames:
        writer.writerow(old_csv[key])

with open("inserted.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    for key in added_longNames:
        writer.writerow(new_csv[key])

with open("changed.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    for key in changed_longNames:
        old_row = old_csv[key]
        new_row = new_csv[key]
        # interleave old and new values: old0, new0, old1, new1, ...
        merged_row = []
        for oi, ni in zip(old_row, new_row):
            merged_row.append(oi)
            merged_row.append(ni)
        writer.writerow(merged_row)
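One thing to be aware of: zip stops at the shorter of the two rows, so if a row gained or lost columns, the extra values are silently dropped from changed.csv. A possible variation, assuming you want to keep them, is itertools.zip_longest with a fill value. The helper name interleave and the empty-string fill are choices made for this sketch, not part of the original answer.

```python
from itertools import zip_longest

def interleave(old_row, new_row, fill=""):
    """Interleave two rows as old0, new0, old1, new1, ... padding the shorter one."""
    merged = []
    for old_value, new_value in zip_longest(old_row, new_row, fillvalue=fill):
        merged.extend((old_value, new_value))
    return merged

# The new row has one more column than the old row.
print(interleave(["k", "1"], ["k", "2", "extra"]))
# prints ['k', 'k', '1', '2', '', 'extra']
```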
Compare two csv files and output changes
My solution is to turn each csv into a dictionary with the first column as the keys and the second column as the values. After that, I can loop through the keys and determine if the corresponding values were changed, removed, or added.
import csv
import re

def csv2dict(filename):
    with open(filename) as file_handle:
        reader = csv.reader(file_handle)
        dict_object = dict(reader)
    return dict_object

def separate_text_and_number(value):
    text, number = re.match(r'(\D+)(\d+)', value).groups()
    number = int(number)
    return (text, number)

def main():
    """ Entry """
    csv1 = csv2dict('file1.csv')
    csv2 = csv2dict('file2.csv')
    all_keys = csv1.keys() | csv2.keys()
    for key in sorted(all_keys, key=separate_text_and_number):
        if key not in csv2:
            print(f'{key} value removed')
        elif key not in csv1:
            print(f'{key} value added')
        elif csv1[key] != csv2[key]:
            print(f'{key} value changed from {csv1[key]} to {csv2[key]}')

if __name__ == '__main__':
    main()
Output
name1 value changed from 2.0001 to 3.0000
name3 value added
name4 value removed
name5 value changed from 1.0000 to 1.0901
name7 value added
name8 value removed
name10 value removed
name11 value added
name12 value added
Notes
- The function csv2dict opens a file and converts the contents into a dictionary
- The function separate_text_and_number splits name14 into ('name', 14) to help with sorting the keys
- In Python 3, the dict.keys() method returns a set-like object which contains all the keys. I use the | operator to find a union of two sets of keys.
- For a more readable output, I sort the keys with the help of separate_text_and_number
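To illustrate why the sort key matters, the short sketch below compares a plain string sort with a sort that uses separate_text_and_number. The function is repeated so the snippet runs on its own, and the sample keys are made up. It assumes every key is text followed by digits; a key with no digits would make re.match return None and raise an AttributeError.

```python
import re

def separate_text_and_number(value):
    """Split 'name14' into ('name', 14) so the numeric part sorts as a number."""
    text, number = re.match(r'(\D+)(\d+)', value).groups()
    return (text, int(number))

keys = ["name10", "name2", "name1"]

plain = sorted(keys)                                  # compares character by character
natural = sorted(keys, key=separate_text_and_number)  # compares (text, int) tuples

print(plain)    # prints ['name1', 'name10', 'name2']
print(natural)  # prints ['name1', 'name2', 'name10']
```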
Python compare two csv
Using pandas, you can merge two DataFrames where one contains relevant information which will be used in the other DataFrame. Here's an example:
import pandas as pd
csv1 = pd.DataFrame({"name":["test1","test2","test3","test4","test5"],"type":["A","B","C","A","D"]})
csv2 = pd.DataFrame({"type":["A","B","C"],"value":[1,2,3]})
pd.merge(csv1, csv2, on="type", how='outer')
And the output would be:
name type value
test1 A 1.0
test4 A 1.0
test2 B 2.0
test3 C 3.0
test5 D NaN
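The same outer merge can also be used to classify differences between two files. The following is a rough sketch, assuming both files share a name key column and a single value column (both column names and the sample data are invented here). Passing indicator=True to pd.merge adds a _merge column whose values are left_only, right_only, or both.

```python
import pandas as pd

# Illustrative stand-ins for two CSV files loaded with pd.read_csv.
old = pd.DataFrame({"name": ["name1", "name2", "name3"], "value": [1.0, 2.0, 3.0]})
new = pd.DataFrame({"name": ["name1", "name3", "name4"], "value": [1.0, 3.5, 4.0]})

merged = pd.merge(old, new, on="name", how="outer",
                  suffixes=("_old", "_new"), indicator=True)

# Rows found only in the old frame were removed; only in the new frame, added.
removed = merged.loc[merged["_merge"] == "left_only", "name"].tolist()
added = merged.loc[merged["_merge"] == "right_only", "name"].tolist()

# Rows found in both frames count as changed when the two values differ.
both = merged[merged["_merge"] == "both"]
changed = both.loc[both["value_old"] != both["value_new"], "name"].tolist()

print(removed, added, changed)  # prints ['name2'] ['name4'] ['name3']
```

With more than one value column, each pair of _old and _new columns would need its own comparison.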
Comparing two CSV files and exporting the differences and similarities in Python?
You could merge the netscan and computer DataFrames, then fill missing values in the Serial column with SerialN/A.
import pandas as pd
netscan = pd.read_csv('netscan.csv')
computer = pd.read_csv('computer_list.csv', usecols=['Name'])
for df in [netscan, computer]:
    df['Name'] = df['Name'].str.rstrip()
result = pd.merge(netscan, computer, on='Name', how='outer')
result['Serial'] = result['Serial'].fillna('SerialN/A')
result.to_csv('result.csv', index=False)
print(result)
produces a CSV file (result.csv) containing
Name,Serial,Models
computer1,serial1,model1
computer2,serial2,model2
computer3,serial3,model3
computer4,serial4,model4
computerH,SerialN/A,
computerP,SerialN/A,