Find Match of Two Data Frames and Rewrite The Answer as Data Frame

Matching two data frames and changing values in one of them

A less-than-elegant brute force approach

cols <- names(df1)[!names(df1) %in% df2$StmtNo]
df <- data.frame( matrix(NA, ncol = length(cols), nrow = nrow(df1)) )
names(df) <- cols
df <- cbind(df, df1[, df2$StmtNo])

df[, order(as.numeric(names(df)))]

# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
# 3 0 1 NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA -1
# 6 1 0 NA NA NA NA -1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0
# 12 1 -1 NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0

Comparing two dataframes and getting the differences

The df1 != df2 approach works only for dataframes with identical row and column labels. All dataframe axes are compared with the _indexed_same method, and an exception is raised if any difference is found, even in the order of the columns or index.

If I understand you correctly, you want the symmetric difference rather than the changes. One approach is to concatenate the dataframes:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

group by

>>> df_gpby = df.groupby(list(df.columns))

get index of unique records

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

filter

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
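The steps above can be put together as one runnable sketch. The two small frames below are made-up stand-ins for the question's data (column names and values are assumptions), and the sorted() call is only there to make the row order predictable:

```python
import pandas as pd

# Made-up example frames; only the first row is shared.
df1 = pd.DataFrame({"Fruit": ["Apple", "Orange"], "Num": [22.1, 8.6]})
df2 = pd.DataFrame({"Fruit": ["Apple", "Banana"], "Num": [22.1, 3.0]})

# Stack both frames and give every row a fresh, unique label.
df = pd.concat([df1, df2]).reset_index(drop=True)

# Group identical rows; a group of size 1 means the row occurs in only one frame.
groups = df.groupby(list(df.columns)).groups
idx = [labels[0] for labels in groups.values() if len(labels) == 1]

# Keep only the rows that have no twin.
sym_diff = df.reindex(sorted(idx))
print(sym_diff)
```

Note that a row duplicated inside a single frame also forms a group larger than 1 and is therefore dropped, so this sketch assumes each frame has no internal duplicates.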

Pandas dataframes - Match two columns in the two dataframes to change the value of a third column

You can use MultiIndex.isin:

c = ['x', 'y']
df1.loc[df1.set_index(c).index.isin(df2.set_index(c).index), 'knn'] = 0


   x  y  knn
0  1  1    0
1  1  2    0
2  1  3    0
3  1  4    0
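For a version that runs on its own, here is a minimal sketch of the same idea. The frames and the starting knn values are invented for illustration; only the column names x, y and knn come from the answer:

```python
import pandas as pd

# Invented data: df2 lists the (x, y) pairs whose knn should be reset.
df1 = pd.DataFrame({"x": [1, 1, 2], "y": [1, 2, 3], "knn": [5, 6, 7]})
df2 = pd.DataFrame({"x": [1, 2], "y": [2, 3]})

keys = ["x", "y"]

# Turn the key columns into a MultiIndex on both sides and test membership.
mask = df1.set_index(keys).index.isin(df2.set_index(keys).index)

# Overwrite knn only where the (x, y) pair also occurs in df2.
df1.loc[mask, "knn"] = 0
print(df1)
```

The mask is a plain boolean array aligned with df1's rows, so it can be reused for other updates or for filtering.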

Find difference between two data frames

By using drop_duplicates

pd.concat([df1,df2]).drop_duplicates(keep=False)

Update :

The above method only works for those data frames that don't already have duplicates themselves. For example:

df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})

This produces the output below, which is wrong:

Wrong Output :

pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
A B
1 2 3

Correct Output

Out[656]: 
A B
1 2 3
2 3 4
3 3 4


How to achieve that?

Method 1: Using isin with tuple

df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
Out[657]:
A B
1 2 3
2 3 4
3 3 4
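To see the difference between the two approaches side by side, here is a self-contained sketch using the same df1 and df2 as above. The variable names naive and only_in_df1 are mine, not part of the original answer:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 3], "B": [2, 3, 4, 4]})
df2 = pd.DataFrame({"A": [1], "B": [2]})

# drop_duplicates(keep=False) also discards the rows duplicated inside df1.
naive = pd.concat([df1, df2]).drop_duplicates(keep=False)

# Represent each row as a tuple so whole rows can be compared with isin.
row_keys = df1.apply(tuple, axis=1)
other_keys = df2.apply(tuple, axis=1)

# Keep the rows of df1 whose tuple does not occur in df2.
only_in_df1 = df1[~row_keys.isin(other_keys)]

print(naive)
print(only_in_df1)
```

Because apply runs a Python function per row, this is likely to be slow on large frames.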

Method 2: merge with indicator

df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]:
A B _merge
1 2 3 left_only
2 3 4 left_only
3 3 4 left_only
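The merge approach can be wrapped in a small helper. This is a sketch under my own assumptions: rows_only_in_left is a hypothetical name, and the drop_duplicates() call on the right frame is an addition of mine so that repeated rows in it do not multiply the matches:

```python
import pandas as pd

def rows_only_in_left(left, right):
    """Return the rows of `left` that have no identical row in `right`."""
    # A left merge on all shared columns; indicator=True adds a `_merge` column.
    merged = left.merge(right.drop_duplicates(), how="left", indicator=True)
    # Keep the unmatched rows and drop the helper column again.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

df1 = pd.DataFrame({"A": [1, 2, 3, 3], "B": [2, 3, 4, 4]})
df2 = pd.DataFrame({"A": [1], "B": [2]})

result = rows_only_in_left(df1, df2)
print(result)
```

Unlike the tuple approach, the merge builds a new index, so the row labels of the result are not guaranteed to match those of the left frame in general.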

How can I match characters values of two data frames and apply a function corresponding to this match in R?

df1 %>%
separate(symbol, c("first", "second"), fill = "right", remove = FALSE)%>%
left_join(rbind(df2, transform(df2, first = second, second = first)))%>%
group_by(symbol)%>%
summarise(calc = if(is.na(value[1])) max(c_across(A:C))
else pmin(c_across(A:C))[value[1]])

# A tibble: 3 × 2
symbol calc
<chr> <dbl>
1 A 26
2 B,C 7
3 D,A 10

Find Partial matching elements between two dataframe columns in r

We could do it with an ifelse statement:

library(dplyr)
library(stringr)

Input %>%
mutate(Matched = ifelse(str_detect(A, paste(Lookup$Matches, collapse = "|")), "Yes", "No"))
                 A  B Matched
1 Green|Red|Yellow 23     Yes
2             Blue 41      No
3     Orange|Peach 65     Yes
4           Violet 89      No

Trying to compare two dataframes with many columns in R row by row and label the incorrect rows

Without seeing any data, it is hard to give a precise answer.

The which function tells you which rows match some criterion.
Below is an example of how to use which.
You can adapt it to something like which(df2$answers %in% df1$answer_key).

# Load the data
data(iris)

# Take a look
head(iris)
which_example <- c(5.4, 4.6)

# The way I think of which is to ask R "which rows in iris$Sepal.Length are 5.4?"
which(iris$Sepal.Length %in% 5.4)
which(iris$Sepal.Length %in% which_example)

# Once you have the rows, you can display only those specific rows and all or some columns
# The format is df[row,column]
# Which gives the rows. You can leave column blank to get all or enter specific ones
iris[which(iris$Sepal.Length %in% 5.4),]
iris[which(iris$Sepal.Length %in% 5.4),c(2,4)]

Merge two data frames based on common column values in Pandas

Two data frames can be merged in several ways. The most common way in Python is the merge operation in pandas.

import pandas
dfinal = df1.merge(df2, on="movie_title", how = 'inner')

To merge on columns that have different names in the two dataframes, specify the left and right column names explicitly, for example when 'movie_title' in one dataframe corresponds to 'movie_name' in the other.

dfinal = df1.merge(df2, how='inner', left_on='movie_title', right_on='movie_name')
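As a quick runnable illustration of the left_on / right_on form, here is a sketch with toy data. The titles and the year and rating columns are placeholders I made up; only movie_title and movie_name come from the answer:

```python
import pandas as pd

# Toy frames: the key column is named differently on each side.
df1 = pd.DataFrame({"movie_title": ["A", "B"], "year": [2001, 2002]})
df2 = pd.DataFrame({"movie_name": ["B", "C"], "rating": [1, 2]})

# An inner join keeps only the titles that occur in both frames.
dfinal = df1.merge(df2, how="inner",
                   left_on="movie_title", right_on="movie_name")
print(dfinal)
```

Both key columns are kept in the result, so you may want to drop one of them afterwards.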

If you want to be even more specific, you may read the documentation of pandas merge operation.

How to use grep or any other method to compare different no of row in two data frame and get the match and mismatch?

You can use this:

ABData1 <- data.frame(a = c(1,2,3,4,5))
ABData2 <- data.frame(b = c(1,4,3,4))

equLength <- function(x, y) {
if (length(x)>length(y)) length(y) <- length(x) else length(x) <- length(y)
data.frame(a=x, b=y)
}

ABData <- equLength(ABData1$a, ABData2$b)

... and then use your working code for one dataframe.

library("dplyr")
resultMatch <- ABData %>% rowwise() %>% filter(grepl(a,b, fixed = TRUE))
resultMismatch <- ABData %>% rowwise() %>% filter(!grepl(a,b))

For the extended question:

library("dplyr")

ABData1 <- data.frame(id=c(11,12,13,14,15), a = c(1,2,3,4,5))
ABData2 <- data.frame(id=c(11,12,13,14), b = c(1,4,3,4))

equLength <- function(x, y) {
if (length(x)>length(y)) length(y) <- length(x) else length(x) <- length(y)
data.frame(a=x, b=y)
}

if (nrow(ABData1)>nrow(ABData2)) ABData <- data.frame(ABData1, b=equLength(ABData1$a, ABData2$b)$b) else
ABData <- data.frame(ABData2, a=equLength(ABData1$a, ABData2$b)$a)

resultMatch <- ABData %>% rowwise() %>% filter(grepl(a,b, fixed = TRUE))
resultMismatch <- ABData %>% rowwise() %>% filter(!grepl(a,b))

Compare two DataFrames and output their differences side-by-side

The first part is similar to Constantine's answer: you can get a boolean mask of which rows have changed*:

In [21]: ne = (df1 != df2).any(axis=1)

In [22]: ne
Out[22]:
0 False
1 True
2 True
dtype: bool

Then we can see which entries have changed:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

Here the first index level is the row label and the second is the column that has changed.

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation

* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index.intersection(df2.index) (older pandas versions also accepted df1.index & df2.index), but I think I'll leave that as an exercise.
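For reference, here is a compact, self-contained variation of the steps above. It is a sketch, not the exact code of the answer: the frames are made up, and the (id, col) index is built directly from the np.where positions with MultiIndex.from_arrays instead of stack():

```python
import numpy as np
import pandas as pd

# Made-up frames with identical index and columns.
df1 = pd.DataFrame({"score": [1.11, 2.0], "flag": [True, True]})
df2 = pd.DataFrame({"score": [1.21, 2.0], "flag": [True, False]})

# Positions (row, column) of every cell that differs.
rows, cols = np.where(df1 != df2)

# Label each difference with its row label and column name.
index = pd.MultiIndex.from_arrays(
    [df1.index[rows], df1.columns[cols]], names=["id", "col"]
)

# Put the old and new values side by side.
report = pd.DataFrame(
    {"from": df1.values[rows, cols], "to": df2.values[rows, cols]},
    index=index,
)
print(report)
```

Keep in mind that NaN != NaN is True, so cells that are missing in both frames are reported as differences by this comparison.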


