Comparing Two Columns in a Data Frame Across Many Rows

Compare multiple columns within same row and highlight differences in pandas

The simplest (and naïve) approach is to use Series.eq to test each row against the first value. Setting an appropriate subset is very important here, as we only want to compare against other similar values.

def highlight_row(s: pd.Series) -> List[str]:
bg_color = 'red'
if s.eq(s[0]).all():
bg_color = 'green'
return [f'background-color:{bg_color}'] * len(s)

df.style.apply(
func=highlight_row,
subset=['DB1', 'DB2', 'DB3', 'DB4'],
axis=1
)

Styled table with naïve styling (considers empty string and nan when doing comparison)


We can be a bit less naïve by excluding empty string and null values (and any other invalid values) from each row with a boolean indexing before doing the equality comparison with just the filtered array:

def highlight_row(s: pd.Series) -> List[str]:
filtered_s = s[s.notnull() & ~s.eq('')]
# Check for completely empty row (prevents index error from filtered_s[0])
if filtered_s.empty:
# No valid values in row
css_str = ''
elif filtered_s.eq(filtered_s[0]).all():
# All values are the same
css_str = 'background-color: green'
else:
# Row Values Differ
css_str = 'background-color: red'
return [css_str] * len(s)

We can also leverage an IndexSlice to more dynamically select the columns for the subset instead of manually passing a list of column names:

df.style.apply(
func=highlight_row,
subset=pd.IndexSlice[:, 'DB1':],
axis=1
)

Styled table that considers only "valid" values for equality comparison


Lastly, it is possible to instead pass the idx/cols to the styling function instead of subsetting if wanting the entire row to be highlighted:

def highlight_row(s: pd.Series, idx: pd.IndexSlice) -> List[str]:
css_str = 'background-color: red'
# Filter Columns
filtered_s = s[idx]
# Filter Values
filtered_s = filtered_s[filtered_s.notnull() & ~filtered_s.eq('')]
# Check for completely empty row
if filtered_s.empty:
css_str = '' # Empty row Styles
elif filtered_s.eq(filtered_s[0]).all():
css_str = 'background-color: green'
return [css_str] * len(s)

df.style.apply(
func=highlight_row,
idx=pd.IndexSlice['DB1':], # 1D IndexSlice!
axis=1
)

Styled table with entire row highlighting


Setup and Imports:

from typing import List

import pandas as pd # version 1.4.2

df = pd.DataFrame({
'NAME': ['WORKFLOW_1', 'WORKFLOW_2', 'WORKFLOW_3', 'WORKFLOW_4'],
'DB1': ['workflow1-1.jar', 'workflow2-1.jar', 'workflow3-2.jar', ''],
'DB2': ['workflow1-2.jar', 'workflow2-1.jar', 'workflow3-1.jar',
'workflow4-1.jar'],
'DB3': ['workflow1-1.jar', 'workflow2-1.jar', 'workflow3-1.jar', ''],
'DB4': ['workflow1-3.jar', 'workflow2-1.jar', 'workflow3-1.jar', '']
})

how to compare two columns in dataframe and update a column based on matching fields

import pandas as pd

d1={
"a":(1,4,7),
"b":(2,5,8),
"c":(0,0,0)
}

d2={
"a_1": (1, 4, 7),
"b_1": (5, 2, 8)
}

df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)

# Iterate through each entry in a and compare it to a_1
for i in range(len(df1["a"])):
for j in range(len(df2["a_1"])):
if df1["a"][i] == df2["a_1"][j]:
df1["c"][i] = df2["b_1"][j]

Trying to compare two dataframes with many columns in R row by row and label the incorrect rows

Without seeing any data it is kind of hard to answer you.

Using the which function can tell you which rows match some criteria.
Below is an example how to use which.
You can change it to say which(df2$answers %in% df1$answer_key) or something similar

# Load the data
data(iris)

# Take a look
head(iris)
which_example <- c(5.4, 4.6)

# The way I think of which is to ask R "which rows in iris$Sepal.Length are 5.4?"
which(iris$Sepal.Length %in% 5.4)
which(iris$Sepal.Length %in% which_example)

# Once you have the rows, you can display only those specific rows and all or some columns
# The format is df[row,column]
# Which gives the rows. You can leave column blank to get all or enter specific ones
iris[which(iris$Sepal.Length %in% 5.4),]
iris[which(iris$Sepal.Length %in% 5.4),c(2,4)]

Pandas better method to compare two dataframes and find entries that only exist in one

Looks like using 'outer' as the how was the solution

z = pd.merge(ORIGINAL, NEW, on=cols, how = 'outer', indicator=True)
z = z[z._merge != 'both'] # Filter out records from both

Output looks like this (after only showing the columns I care about)

  Name       Site   _merge
Charlie A left_only
Doug B right_only

Compare two dataframe column values and join with condition in python?

While this isn't a highly efficient solution, you can use some sets to solve this problem.

matches = df1["Id"].apply(set) <= df2["Id"].apply(set)

out = df1.copy()
out.loc[matches, df2.columns.difference(["Id"])] = df2

print(out)
Id Value Product_Name
0 [101, 102, 103] 10001 Shoe
1 [101, 102, 104] 10000 jeans
2 [101, 102, 105] 10002 make-up
3 [101, 107, 105] 10003 NaN

In the above snippet:

  1. matches = df1["Id"].apply(set) <= df2["Id"].apply(set) returns a boolean Series that is True where the contents of each row in df1['Id'] is in the corresponding row in df2['Id'], and False otherwise
  2. Instead of performing an actual merge we can simply align the 2 DataFrames on the aforementioned boolean Series

If you want to test Ids against eachother in both dataframes, you can take the cartesian product of both DataFrames, filter it down to the inner join via the set criteria, and then append back any missing left join keys.

out = (
pd.merge(df1, df2, how="cross")
.loc[lambda df: df["Id_x"].map(set) <= df["Id_y"].map(set)]
.pipe(
lambda df: df.append(
df1.loc[~df1["Id"].isin(df["Id_x"])].rename(columns={"Id": "Id_x"})
)
)
.reset_index(drop=True)
)

print(out)
Id_x Value Id_y Product_Name
0 [101, 102, 103] 10001 [101, 102, 103, 104] Shoe
1 [101, 102, 104] 10000 [101, 102, 103, 104] Shoe
2 [101, 102, 104] 10000 [101, 102, 109, 104] jeans
3 [101, 102, 105] 10002 [101, 105, 102, 108] make-up
4 [101, 107, 105] 10003 NaN NaN

Comparing two columns in a data frame across many rows

A nested ifelse should do it (take a look at help(ifelse) for usage):

ifelse(dat$Genotype==dat$S288C,1,ifelse(dat$Genotype==dat$SK1,0,NA))

With this test data:

> dat
Genotype S288C SK1
[1,] "G" "A" "G"
[2,] "G" "A" "G"
[3,] "C" "T" "C"
[4,] "G" "A" "G"
[5,] "G" "G" "T"
[6,] "G" "A" "A"

We get:

> ifelse(dat$Genotype==dat$S288C,1,ifelse(dat$Genotype==dat$SK1,0,NA))
[1] 0 0 0 0 1 NA

(Note: If you have trouble using this, you'll want to make sure that the columns are vectors, and are not treated by R as factors...a simple for loop should do it: for (i in 1:ncol(dat)){dat[,i]=as.vector(dat[,i])}).

Pandas - Compare 2 columns in a dataframe and return count

Compare values first by Series.eq and count by Series.value_counts, then replace True, False indices:

s = (df.input_number.eq(df.org_number)
.value_counts()
.rename({True:'mach', False: 'no match'}))

If need DataFrame:

sdf1 = (df.input_number.eq(df.org_number)
.value_counts()
.rename({True:'mach', False: 'no match'})
.rename_axis('state')
.reset_index(name='count'))


Related Topics



Leave a reply



Submit