Matching two data frames and changing values in one of the data frames
A less-than-elegant brute force approach
cols <- names(df1)[!names(df1) %in% df2$StmtNo]
df <- data.frame( matrix(NA, ncol = length(cols), nrow = 3) )
names(df) <- cols
df <- cbind(df, df1[, df2$StmtNo])
df[, order(as.numeric(names(df)))]
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
# 3 0 1 NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA -1
# 6 1 0 NA NA NA NA -1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0
# 12 1 -1 NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0
Comparing two dataframes and getting the differences
The approach df1 != df2 works only for dataframes with identical row and column labels. In fact, the axes of both dataframes are compared with the _indexed_same method, and an exception is raised if any difference is found, even if it is only in the order of the columns or the index.
If I understand you correctly, you do not want to find changed values but the symmetric difference: the rows that appear in only one of the two dataframes. One approach is to concatenate the dataframes:
>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)
group by
>>> df_gpby = df.groupby(list(df.columns))
get index of unique records
>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
filter
>>> df.reindex(idx)
Date Fruit Num Color
9 2013-11-25 Orange 8.6 Orange
8 2013-11-25 Apple 22.1 Red
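The steps above can be put together as a self-contained sketch. The sample rows are hypothetical (only the column names follow the output shown), and the sketch assumes the key columns contain no NaN values, since groupby drops NaN keys by default.

```python
import pandas as pd

# Hypothetical sample data; only the column names mirror the output above.
df1 = pd.DataFrame({
    "Date": ["2013-11-24", "2013-11-24", "2013-11-25"],
    "Fruit": ["Banana", "Orange", "Apple"],
    "Num": [22.1, 8.6, 7.6],
    "Color": ["Yellow", "Orange", "Green"],
})
extra = pd.DataFrame({
    "Date": ["2013-11-25"], "Fruit": ["Orange"], "Num": [8.6], "Color": ["Orange"],
})
# df2 repeats every row of df1 and adds one new row.
df2 = pd.concat([df1, extra], ignore_index=True)

# Stack both frames and give every row a unique integer label.
combined = pd.concat([df1, df2]).reset_index(drop=True)

# Group on all columns: identical rows land in the same group.
groups = combined.groupby(list(combined.columns)).groups

# Keep the label of every row whose group has exactly one member.
idx = sorted(labels[0] for labels in groups.values() if len(labels) == 1)
result = combined.loc[idx]
print(result)
```

Only the row that exists in one of the two frames survives; rows present in both form groups of size two and are dropped.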
Pandas dataframes - Match two columns in the two dataframes to change the value of a third column
You can use MultiIndex.isin:
c = ['x', 'y']
df1.loc[df1.set_index(c).index.isin(df2.set_index(c).index), 'knn'] = 0
x y knn
0 1 1 0
1 1 2 0
2 1 3 0
3 1 4 0
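A minimal runnable sketch of the same idea follows. The two frames are made up for illustration; only the column names x, y and knn come from the answer above.

```python
import pandas as pd

# Hypothetical frames: df1 carries a 'knn' value per (x, y) point,
# df2 lists the (x, y) pairs whose 'knn' should be reset.
df1 = pd.DataFrame({"x": [1, 1, 1, 2], "y": [1, 2, 3, 1], "knn": [5, 6, 7, 8]})
df2 = pd.DataFrame({"x": [1, 1], "y": [1, 3]})

keys = ["x", "y"]
# Boolean array: True where df1's (x, y) pair also occurs in df2.
mask = df1.set_index(keys).index.isin(df2.set_index(keys).index)

# Overwrite the third column only on the matching rows.
df1.loc[mask, "knn"] = 0
print(df1)
```

Rows whose (x, y) pair is absent from df2 keep their original knn value.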
Find difference between two data frames
By using drop_duplicates
pd.concat([df1,df2]).drop_duplicates(keep=False)
Update:
The above method only works when neither data frame already contains duplicate rows. For example:
df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})
This produces the output below, which is wrong because the row (3, 4), duplicated within df1, is dropped as well:
Wrong output:
pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
A B
1 2 3
Correct Output
Out[656]:
A B
1 2 3
2 3 4
3 3 4
How to achieve that?
Method 1: Using isin with tuple
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]:
A B
1 2 3
2 3 4
3 3 4
Method 2: merge with indicator
df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]:
A B _merge
1 2 3 left_only
2 3 4 left_only
3 3 4 left_only
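Both methods return the rows of df1 that are missing from df2. If rows unique to either frame are wanted, the indicator idea can plausibly be extended with an outer merge. This is a sketch rather than part of the original answer, and df2 here has a hypothetical extra row (9, 9) added for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 3], "B": [2, 3, 4, 4]})
# Hypothetical df2: the row (9, 9) is added so that df2 has a row of its own.
df2 = pd.DataFrame({"A": [1, 9], "B": [2, 9]})

# An outer merge keeps rows from both sides; '_merge' records where each came from.
merged = df1.merge(df2, how="outer", indicator=True)

# Drop the rows found in both frames; what remains is unique to one side.
diff = merged.loc[merged["_merge"] != "both"]
print(diff)
```

The _merge column then distinguishes left_only rows (only in df1) from right_only rows (only in df2), and duplicates inside df1 are preserved.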
How can I match characters values of two data frames and apply a function corresponding to this match in R?
df1 %>%
separate(symbol, c("first", "second"), fill = "right", remove = FALSE)%>%
left_join(rbind(df2, transform(df2, first = second, second = first)))%>%
group_by(symbol)%>%
summarise(calc = if(is.na(value[1])) max(c_across(A:C))
else pmin(c_across(A:C))[value[1]])
# A tibble: 3 × 2
symbol calc
<chr> <dbl>
1 A 26
2 B,C 7
3 D,A 10
Find Partial matching elements between two dataframe columns in r
We could do it with an ifelse statement:
library(dplyr)
library(stringr)
Input %>%
mutate(Matched = ifelse(str_detect(A, paste(Lookup$Matches, collapse = "|")), "Yes", "No"))
A B Matched
1 Green|Red|Yellow 23 Yes
2 Blue 41 No
3 Orange|Peach 65 Yes
4 Violet 89 No
Trying to compare two dataframes with many columns in R row by row and label the incorrect rows
Without seeing any data, it is hard to give a precise answer.
The which function tells you which rows match some criterion.
Below is an example of how to use which.
You can change it to which(df2$answers %in% df1$answer_key) or something similar.
# Load the data
data(iris)
# Take a look
head(iris)
which_example <- c(5.4, 4.6)
# The way I think of which is to ask R "which rows in iris$Sepal.Length are 5.4?"
which(iris$Sepal.Length %in% 5.4)
which(iris$Sepal.Length %in% which_example)
# Once you have the rows, you can display only those specific rows and all or some columns
# The format is df[row,column]
# Which gives the rows. You can leave column blank to get all or enter specific ones
iris[which(iris$Sepal.Length %in% 5.4),]
iris[which(iris$Sepal.Length %in% 5.4),c(2,4)]
Merge two data frames based on common column values in Pandas
Two data frames can be merged in several ways. The most common way in Python is the merge operation in pandas.
import pandas
dfinal = df1.merge(df2, on="movie_title", how = 'inner')
To merge on columns that are named differently in the two dataframes, specify the left and right column names explicitly, for example when the same column is called 'movie_title' in one dataframe and 'movie_name' in the other.
dfinal = df1.merge(df2, how='inner', left_on='movie_title', right_on='movie_name')
If you want to be even more specific, you may read the documentation of the pandas merge operation.
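As a small illustration of merging on differently named key columns, here is a sketch with made-up data. Only the column names movie_title and movie_name come from the answer; the values and the other columns are placeholders.

```python
import pandas as pd

# Placeholder data; titles, years and ratings are invented for illustration.
df1 = pd.DataFrame({"movie_title": ["A", "B", "C"], "year": [2001, 2002, 2003]})
df2 = pd.DataFrame({"movie_name": ["B", "C", "D"], "rating": [1, 2, 3]})

# Inner join: keep only titles present in both frames.
dfinal = df1.merge(df2, how="inner", left_on="movie_title", right_on="movie_name")

# With left_on/right_on both key columns are kept, so drop the redundant one.
dfinal = dfinal.drop(columns="movie_name")
print(dfinal)
```

Dropping the second key column is optional; it is done here only because the two columns hold identical values after an inner join.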
How to use grep or any other method to compare different no of row in two data frame and get the match and mismatch?
You can use this:
ABData1 <- data.frame(a = c(1,2,3,4,5))
ABData2 <- data.frame(b = c(1,4,3,4))
equLength <- function(x, y) {
if (length(x)>length(y)) length(y) <- length(x) else length(x) <- length(y)
data.frame(a=x, b=y)
}
ABData <- equLength(ABData1$a, ABData2$b)
... and then use your working code for one dataframe.
library("dplyr")
resultMatch <- ABData %>% rowwise() %>% filter(grepl(a,b, fixed = TRUE))
resultMismatch <- ABData %>% rowwise() %>% filter(!grepl(a,b))
For the extended question:
library("dplyr")
ABData1 <- data.frame(id=c(11,12,13,14,15), a = c(1,2,3,4,5))
ABData2 <- data.frame(id=c(11,12,13,14), b = c(1,4,3,4))
equLength <- function(x, y) {
if (length(x)>length(y)) length(y) <- length(x) else length(x) <- length(y)
data.frame(a=x, b=y)
}
if (nrow(ABData1)>nrow(ABData2)) ABData <- data.frame(ABData1, b=equLength(ABData1$a, ABData2$b)$b) else
ABData <- data.frame(ABData2, a=equLength(ABData1$a, ABData2$b)$a)
resultMatch <- ABData %>% rowwise() %>% filter(grepl(a,b, fixed = TRUE))
resultMismatch <- ABData %>% rowwise() %>% filter(!grepl(a,b))
Compare two DataFrames and output their differences side-by-side
The first part is similar to Constantine's answer: you can get a boolean Series marking which rows have changed*:
In [21]: ne = (df1 != df2).any(axis=1)
In [22]: ne
Out[22]:
0 False
1 True
2 True
dtype: bool
Then we can see which entries have changed:
In [23]: ne_stacked = (df1 != df2).stack()
In [24]: changed = ne_stacked[ne_stacked]
In [25]: changed.index.names = ['id', 'col']
In [26]: changed
Out[26]:
id col
1 score True
2 isEnrolled True
Comment True
dtype: bool
Here the first index level is the row label and the second is the column that has changed.
In [27]: difference_locations = np.where(df1 != df2)
In [28]: changed_from = df1.values[difference_locations]
In [29]: changed_to = df2.values[difference_locations]
In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
from to
id col
1 score 1.11 1.21
2 isEnrolled True False
Comment None On vacation
* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index.intersection(df2.index), but I think I'll leave that as an exercise.
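One possible way to approach that exercise is sketched below. The sample frames are hypothetical (the column names score and isEnrolled are borrowed from the output above), and both frames are assumed to have the same columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frames whose indexes only partly overlap (labels 0 and 1 are shared).
df1 = pd.DataFrame(
    {"score": [1.11, 2.22, 3.33], "isEnrolled": [True, True, False]},
    index=[0, 1, 2],
)
df2 = pd.DataFrame(
    {"score": [1.11, 2.50, 9.99], "isEnrolled": [True, False, True]},
    index=[0, 1, 3],
)

# Restrict both frames to the row labels they have in common.
shared = df1.index.intersection(df2.index)
left, right = df1.loc[shared], df2.loc[shared]

# Positions (row, column) of every cell that differs.
rows, cols = np.where((left != right).to_numpy())

# Side-by-side table of old and new values, indexed by row label and column name.
diff = pd.DataFrame(
    {"from": left.to_numpy()[rows, cols], "to": right.to_numpy()[rows, cols]},
    index=pd.MultiIndex.from_arrays(
        [left.index[rows], left.columns[cols]], names=["id", "col"]
    ),
)
print(diff)
```

Rows that exist in only one frame (labels 2 and 3 here) are ignored by this comparison and would need to be reported separately.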