Join Two Dataframes from a Conditional Row

How can I merge two dataframes with some conditional requirements?

Does this work for you?

library(dplyr)
library(data.table)  # provides the %between% operator

# Merge on the shared key column(s), then keep only the rows whose TestDate
# falls between Date1 and Date2.
merge(x = df1,
      y = df2) %>%
  filter(TestDate %between% list(Date1, Date2))
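
For concreteness, here is a minimal sketch with made-up data showing the shape of the approach (df1, df2, the ID key, and the date columns are assumptions standing in for the question's real data):

library(dplyr)
library(data.table)

df1 <- data.frame(ID = 1:3,
                  TestDate = as.Date(c("2021-01-10", "2021-02-20", "2021-03-05")))
df2 <- data.frame(ID = 1:3,
                  Date1 = as.Date(c("2021-01-01", "2021-03-01", "2021-03-01")),
                  Date2 = as.Date(c("2021-01-31", "2021-03-31", "2021-03-31")))

merge(x = df1, y = df2) %>%                      # joins on the shared ID column
  filter(TestDate %between% list(Date1, Date2))  # keeps the ID 1 and ID 3 rows only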

Merging two dataframes - conditional rows

I think you can just use merge(), unless I'm missing something:

ndf = pd.merge(ndf, df, how='inner', on='Market')

Here's a full code example with a test case:

import pandas as pd

ndf = pd.DataFrame({
    'Date': ['1994-01-15'] * 5 + ['2021-09-15'] * 5,
    'Market': 'Delhi,Delhi,Delhi,Delhi,Ahmedabad,Kharagpur,Kharagpur,Kharagpur,Kharagpur,Kharagpur'.split(','),
    'Category': 'cereals and tubers,cereals and tubers,miscellaneous food,oil and fats,cereals and tubers,pulses and nuts,pulses and nuts,pulses and nuts,vegetables and fruits,vegetables and fruits'.split(','),
})

df = pd.DataFrame({
    'Market': 'Delhi,Ahmedabad,Shimla,Bengaluru,Bhopal,Kharagpur'.split(','),
    'geocoded': [(28.6517178, 77.2219388), (23.0216238, 72.5797068), (31.1041526, 77.1709729),
                 (12.9767936, 77.590082), (23.2584857, 77.401989), (22.22, 73.73)],
})

# Inner join on the shared 'Market' column; markets missing from either side are dropped.
ndf = pd.merge(ndf, df, how='inner', on='Market')
print(ndf)

Output:

         Date     Market               Category                  geocoded
0  1994-01-15      Delhi     cereals and tubers  (28.6517178, 77.2219388)
1  1994-01-15      Delhi     cereals and tubers  (28.6517178, 77.2219388)
2  1994-01-15      Delhi     miscellaneous food  (28.6517178, 77.2219388)
3  1994-01-15      Delhi           oil and fats  (28.6517178, 77.2219388)
4  1994-01-15  Ahmedabad     cereals and tubers  (23.0216238, 72.5797068)
5  2021-09-15  Kharagpur        pulses and nuts            (22.22, 73.73)
6  2021-09-15  Kharagpur        pulses and nuts            (22.22, 73.73)
7  2021-09-15  Kharagpur        pulses and nuts            (22.22, 73.73)
8  2021-09-15  Kharagpur  vegetables and fruits            (22.22, 73.73)
9  2021-09-15  Kharagpur  vegetables and fruits            (22.22, 73.73)

Pandas join two dataframes with condition

Based on your clarification, I suggest the following solution:

1) concatenate (not join) the 2 dataframes.

df12 = pd.concat([df1, df2], axis=1)

I assume that the indices match. If not, reindex on id or join on id.

2) filter the rows that match the criteria

df12 = df12[df12['date2'] > df12['date1']]
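
For illustration, here is a minimal end-to-end sketch with made-up data (the id, date1, and date2 names are assumptions, not the asker's actual columns):

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'date1': pd.to_datetime(['2021-01-01', '2021-02-01', '2021-03-01'])})
df2 = pd.DataFrame({'id': [1, 2, 3],
                    'date2': pd.to_datetime(['2021-01-15', '2021-01-20', '2021-03-10'])})

# 1) concatenate column-wise (indices are assumed to line up)
df12 = pd.concat([df1, df2], axis=1)

# 2) keep only the rows where date2 falls after date1
df12 = df12[df12['date2'] > df12['date1']]
print(df12)   # the id 1 and id 3 rows survive; the id 2 row is dropped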

Compare two dataframe column values and join with condition in python?

While this isn't a highly efficient solution, you can use some sets to solve this problem.

# Row-wise subset test: True where all Ids in a df1 row appear in the corresponding df2 row
matches = df1["Id"].apply(set) <= df2["Id"].apply(set)

out = df1.copy()
# Copy df2's non-Id columns onto the matching rows (aligned by index)
out.loc[matches, df2.columns.difference(["Id"])] = df2

print(out)
                Id  Value Product_Name
0  [101, 102, 103]  10001         Shoe
1  [101, 102, 104]  10000        jeans
2  [101, 102, 105]  10002      make-up
3  [101, 107, 105]  10003          NaN

In the above snippet:

  1. matches = df1["Id"].apply(set) <= df2["Id"].apply(set) returns a boolean Series that is True where the set of Ids in each row of df1['Id'] is a subset of the Ids in the corresponding row of df2['Id'], and False otherwise.
  2. Instead of performing an actual merge, we can simply align the two DataFrames on the aforementioned boolean Series.
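
For reference, the snippet above assumes inputs shaped roughly like the following; this is a hypothetical reconstruction inferred from the printed output (the asker's real data isn't shown, and the last df2 row is invented):

import pandas as pd

df1 = pd.DataFrame({
    "Id": [[101, 102, 103], [101, 102, 104], [101, 102, 105], [101, 107, 105]],
    "Value": [10001, 10000, 10002, 10003],
})

df2 = pd.DataFrame({
    "Id": [[101, 102, 103, 104], [101, 102, 109, 104],
           [101, 105, 102, 108], [102, 103, 110]],
    "Product_Name": ["Shoe", "jeans", "make-up", "socks"],
})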

If you want to test the Ids against each other across both dataframes, you can take the Cartesian product of the two DataFrames, filter it down to the inner join via the set criterion, and then append back any missing left join keys.

out = (
    pd.merge(df1, df2, how="cross")
    # keep only the pairs where the left Id set is contained in the right Id set
    .loc[lambda df: df["Id_x"].map(set) <= df["Id_y"].map(set)]
    # append back the df1 rows that found no match (left-join behaviour);
    # note that DataFrame.append was removed in pandas 2.0, where
    # pd.concat([df, ...]) is the replacement
    .pipe(
        lambda df: df.append(
            df1.loc[~df1["Id"].isin(df["Id_x"])].rename(columns={"Id": "Id_x"})
        )
    )
    .reset_index(drop=True)
)


print(out)
              Id_x  Value                  Id_y Product_Name
0  [101, 102, 103]  10001  [101, 102, 103, 104]         Shoe
1  [101, 102, 104]  10000  [101, 102, 103, 104]         Shoe
2  [101, 102, 104]  10000  [101, 102, 109, 104]        jeans
3  [101, 102, 105]  10002  [101, 105, 102, 108]      make-up
4  [101, 107, 105]  10003                   NaN          NaN

Joining two dataframes on a condition (grepl)

TLDR

You just need to fix match_fun:

# ...
match_fun = list(`==`, stringr::str_detect),
# ...


Background

You had the right idea, but you went wrong in your interpretation of the match_fun parameter in fuzzyjoin::fuzzy_join(). Per the documentation, match_fun should be a

Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in by (if a named list, it uses the names in x). If only one function is given it is used on all column pairs.

Solution

A simple correction will do the trick, with further formatting by dplyr. For conceptual clarity, I've typographically aligned the by columns with the functions used to match them:

library(dplyr)
library(fuzzyjoin)

# ...
# Existing code
# ...

joined_dfs <- fuzzy_join(
  df1, df2,

  by        = c("ages", "fullnames" = "lastnames"),
  #              |----|  |-----------------------|
  match_fun = list(`==`, stringr::str_detect     ),
  #                |--|  |-------------------|
  # Match by equality ^  ^ Match by detection of `lastnames` in `fullnames`

  mode = "left"
) %>%
  # Format the resulting dataset as you requested.
  select(fullnames, ages = ages.x, homestate)

Result

Given your sample data reproduced here

df1 <- data.frame(
  fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
  ages      = c(30, 51, 45, 38, 20)
)

df2 <- data.frame(
  lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
  ages      = c(30, 45, 20, 28, 51, 38),
  homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)

this solution should produce the following data.frame for joined_dfs, formatted as requested:

       fullnames ages homestate
1       Jane Doe   30        NJ
2 Mr. John Smith   51        MA
3 Nate Cox, Esq.   45        CT
4   Bill Lee III   38        NY
5 Ms. Kate Smith   20        MA

Note

Because each value of ages happens to be a unique key, the following join on the *names columns alone

fuzzy_join(
  df1, df2,
  by        = c("fullnames" = "lastnames"),
  match_fun = stringr::str_detect,
  mode      = "left"
)

will better illustrate the behavior of matching on substrings:

       fullnames ages.x lastnames ages.y homestate
1       Jane Doe     30       Doe     30        NJ
2 Mr. John Smith     51     Smith     20        MA
3 Mr. John Smith     51     Smith     51        MA
4 Nate Cox, Esq.     45       Cox     45        CT
5   Bill Lee III     38       Lee     38        NY
6 Ms. Kate Smith     20     Smith     20        MA
7 Ms. Kate Smith     20     Smith     51        MA

Where You Went Wrong
Error in Type

The value passed to match_fun should be either (the symbol for) a function

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = grepl
  # ...
)

or a list of such (symbols for) functions:

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = list(`=`, grepl)
  # ...
)

Instead of providing a list of symbols

match_fun = list(`=`, grepl)

you incorrectly provided a vector of character strings:

match_fun = c("=", "grepl()")
Error in Syntax

You should name the functions

`=`
grepl

yet you incorrectly attempted to call them:

=
grepl()

Naming them will pass the functions themselves to match_fun, as intended, whereas calling them will pass their return values*. In R, an operator like = is named using backticks: `=`.

* Assuming the calls didn't fail with errors. Here, they would fail.
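
To see the distinction in isolation, consider this toy sketch (a hypothetical helper, not part of the question's code):

apply_match <- function(fun, x, y) fun(x, y)

apply_match(`==`,  c(30, 51), c(30, 45))   # TRUE FALSE  (the function itself is passed)
apply_match(grepl, "Doe", "Jane Doe")      # TRUE        (likewise for grepl)
# apply_match(grepl(), "Doe", "Jane Doe")  # error: this calls grepl() with no
#                                          # arguments instead of passing the function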

Inappropriate Functions

To compare two values for equality, here the numeric vectors df1$ages and df2$ages, you should use the relational operator ==; yet you incorrectly supplied the assignment operator =.

Furthermore, grepl() is not vectorized in quite the way match_fun desires. While its second argument (x) is indeed a vector

a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.

its first argument (pattern) is (treated as) a single character string:

character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr, gregexpr and regexec.

Thus, grepl() is not a

Vectorized function given two columns...

but rather a function given one string (scalar) and one column (vector) of strings.

The answer to your prayers is not grepl() but rather something like stringr::str_detect(), which is

Vectorised over string and pattern. Equivalent to grepl(pattern, x).

and which wraps stringi::stri_detect().
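
To make the difference concrete, here is a small illustration with toy vectors (not the asker's data):

fullnames <- c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.")
lastnames <- c("Doe",      "Smith",          "Cox")

# grepl() uses only the first pattern and warns:
grepl(lastnames, fullnames)
#> Warning: argument 'pattern' has length > 1 and only the first element will be used
#> [1]  TRUE FALSE FALSE

# str_detect() pairs the two vectors element-wise:
stringr::str_detect(fullnames, lastnames)
#> [1] TRUE TRUE TRUE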

Note

Since you're simply trying to detect whether a literal string in df1$fullnames contains a literal string in df2$lastnames, you don't want to accidentally treat the strings in df2$lastnames as regular expression patterns. Admittedly, your df2$lastnames column is unlikely to contain names with special regex characters; the lone exception is -, which is interpreted literally outside of [ ], and square brackets are very unlikely to be found in a name.

If you're still worried about accidental regex, you might want to consider alternative search methods with stringi::stri_detect_fixed() or stringi::stri_detect_coll(). These perform literal matching, respectively by either byte or "canonical equivalence"; the latter adjusts for locale and special characters, in keeping with natural language processing.
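
If you go that route, the call might look like the following sketch (reusing df1, df2, and the loaded fuzzyjoin package from above; stringi::stri_detect_fixed() simply swaps in for str_detect()):

fuzzy_join(
  df1, df2,
  by        = c("fullnames" = "lastnames"),
  # fixed (literal) matching: lastnames are never parsed as regular expressions
  match_fun = stringi::stri_detect_fixed,
  mode      = "left"
)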


