How can I merge two dataframes together with some conditional requirements?
Does this work for you?
library(dplyr)
library(data.table)
merge(x = df1,
      y = df2) %>%
  filter(TestDate %between% list(Date1, Date2))
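For readers working in pandas rather than R, roughly the same merge-then-filter pattern can be sketched as below. The TestDate, Date1 and Date2 column names come from the R snippet above; the id join key and all sample values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: `id` is an assumed join key, not from the question.
df1 = pd.DataFrame({
    "id": [1, 2, 3],
    "TestDate": pd.to_datetime(["2021-01-10", "2021-02-20", "2021-03-05"]),
})
df2 = pd.DataFrame({
    "id": [1, 2, 3],
    "Date1": pd.to_datetime(["2021-01-01", "2021-03-01", "2021-03-05"]),
    "Date2": pd.to_datetime(["2021-01-31", "2021-03-31", "2021-03-10"]),
})

# Step 1: ordinary merge on the shared key.
merged = df1.merge(df2, on="id")

# Step 2: keep rows whose TestDate lies in [Date1, Date2], bounds included.
in_range = merged["TestDate"].between(merged["Date1"], merged["Date2"])
result = merged[in_range]
print(result)
```

Series.between is inclusive on both ends by default, so a TestDate equal to Date1 or Date2 is kept.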
Merging two dataframes - conditional rows
I think you can just use merge(), unless I'm missing something:
ndf = pd.merge(ndf, df, how='inner', on='Market')
Here's a full code example with test case:
import pandas as pd
ndf = pd.DataFrame({'Date':['1994-01-15']*5 + ['2021-09-15']*5, 'Market':'Delhi,Delhi,Delhi,Delhi,Ahmedabad,Kharagpur,Kharagpur,Kharagpur,Kharagpur,Kharagpur'.split(','),
'Category':'cereals and tubers,cereals and tubers,miscellaneous food,oil and fats,cereals and tubers,pulses and nuts,pulses and nuts,pulses and nuts,vegetables and fruits,vegetables and fruits'.split(',')})
df = pd.DataFrame({'Market':'Delhi,Ahmedabad,Shimla,Bengaluru,Bhopal,Kharagpur'.split(','),
'geocoded':[(28.6517178, 77.2219388),(23.0216238, 72.5797068),(31.1041526, 77.1709729),(12.9767936, 77.590082),(23.2584857, 77.401989),(22.22, 73.73)]})
ndf = pd.merge(ndf, df, how='inner', on='Market')
print(ndf)
Output:
         Date     Market               Category                  geocoded
0  1994-01-15      Delhi     cereals and tubers  (28.6517178, 77.2219388)
1  1994-01-15      Delhi     cereals and tubers  (28.6517178, 77.2219388)
2  1994-01-15      Delhi     miscellaneous food  (28.6517178, 77.2219388)
3  1994-01-15      Delhi           oil and fats  (28.6517178, 77.2219388)
4  1994-01-15  Ahmedabad     cereals and tubers  (23.0216238, 72.5797068)
5  2021-09-15  Kharagpur        pulses and nuts            (22.22, 73.73)
6  2021-09-15  Kharagpur        pulses and nuts            (22.22, 73.73)
7  2021-09-15  Kharagpur        pulses and nuts            (22.22, 73.73)
8  2021-09-15  Kharagpur  vegetables and fruits            (22.22, 73.73)
9  2021-09-15  Kharagpur  vegetables and fruits            (22.22, 73.73)
Pandas join two dataframes with condition
Based on your clarification, I suggest the following solution:
1) Concatenate (not join) the 2 dataframes.
df12 = pd.concat([df1, df2], axis=1)
I assume that the indices match. If not, reindex on id or join on id.
2) Filter the rows that match the criteria.
df12 = df12[df12['date2'] > df12['date1']]
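The two steps above can be sketched end to end on toy data. The date1 and date2 column names follow the snippet; the id and value columns and the dates themselves are made up for illustration, and both frames are assumed to share the same default index.

```python
import pandas as pd

# Hypothetical inputs with matching default indices (0, 1, 2).
df1 = pd.DataFrame({
    "id": [1, 2, 3],
    "date1": pd.to_datetime(["2021-01-01", "2021-06-01", "2021-03-01"]),
})
df2 = pd.DataFrame({
    "value": [10, 20, 30],
    "date2": pd.to_datetime(["2021-02-01", "2021-05-01", "2021-03-01"]),
})

# 1) Place the frames side by side; rows are paired by index label.
df12 = pd.concat([df1, df2], axis=1)

# 2) Keep only rows where date2 is strictly later than date1.
df12 = df12[df12["date2"] > df12["date1"]]
print(df12)
```

Because the comparison is strict, the row where both dates are equal is dropped along with the row where date2 is earlier.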
Compare two dataframe column values and join with condition in python?
While this isn't a highly efficient solution, you can use some sets to solve this problem.
matches = df1["Id"].apply(set) <= df2["Id"].apply(set)
out = df1.copy()
out.loc[matches, df2.columns.difference(["Id"])] = df2
print(out)
                Id  Value  Product_Name
0  [101, 102, 103]  10001          Shoe
1  [101, 102, 104]  10000         jeans
2  [101, 102, 105]  10002       make-up
3  [101, 107, 105]  10003           NaN
In the above snippet:
matches = df1["Id"].apply(set) <= df2["Id"].apply(set)
returns a boolean Series that is True where the contents of each row in df1['Id'] are all in the corresponding row in df2['Id'], and False otherwise.
Instead of performing an actual merge, we can simply align the 2 DataFrames on the aforementioned boolean Series.
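To make the row-by-row subset test concrete, here is a minimal sketch that spells it out with plain Python sets and then fills the new column with Series.where. The two small frames are hypothetical stand-ins for df1 and df2, and the explicit list comprehension is a swapped-in, more verbose form of the comparison, not the exact expression used above.

```python
import pandas as pd

# Hypothetical data shaped like the example; values are for illustration only.
df1 = pd.DataFrame({
    "Id": [[101, 102, 103], [101, 107, 105]],
    "Value": [10001, 10003],
})
df2 = pd.DataFrame({
    "Id": [[101, 102, 103, 104], [101, 105, 102, 108]],
    "Product_Name": ["Shoe", "make-up"],
})

# Row i of df1 is compared only with row i of df2 (same position).
# For sets, `a <= b` means "a is a subset of b".
matches = pd.Series(
    [set(left) <= set(right) for left, right in zip(df1["Id"], df2["Id"])],
    index=df1.index,
)

out = df1.copy()
# Keep the product name where the subset test passed, NaN elsewhere.
out["Product_Name"] = df2["Product_Name"].where(matches)
print(out)
```

Note that this only compares rows at the same position, so it assumes both frames have the same length and ordering.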
If you want to test Ids against each other in both dataframes, you can take the Cartesian product of both DataFrames, filter it down to the inner join via the set criteria, and then append back any missing left join keys.
out = (
    pd.merge(df1, df2, how="cross")
    .loc[lambda df: df["Id_x"].map(set) <= df["Id_y"].map(set)]
    .pipe(
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
        lambda df: pd.concat(
            [df, df1.loc[~df1["Id"].isin(df["Id_x"])].rename(columns={"Id": "Id_x"})]
        )
    )
    .reset_index(drop=True)
)
print(out)
              Id_x  Value                  Id_y  Product_Name
0  [101, 102, 103]  10001  [101, 102, 103, 104]          Shoe
1  [101, 102, 104]  10000  [101, 102, 103, 104]          Shoe
2  [101, 102, 104]  10000  [101, 102, 109, 104]         jeans
3  [101, 102, 105]  10002  [101, 105, 102, 108]       make-up
4  [101, 107, 105]  10003                   NaN           NaN
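A self-contained variant of the cross-join approach might look like the sketch below. The input frames are reconstructed from the printed output above, so treat them as assumptions. The left_row helper column is hypothetical: it tracks which left rows found a match using a plain integer, which sidesteps any hashing of list values.

```python
import pandas as pd

# Inputs reconstructed from the printed result; assumed, not given in the answer.
df1 = pd.DataFrame({
    "Id": [[101, 102, 103], [101, 102, 104], [101, 102, 105], [101, 107, 105]],
    "Value": [10001, 10000, 10002, 10003],
})
df2 = pd.DataFrame({
    "Id": [[101, 102, 103, 104], [101, 102, 109, 104], [101, 105, 102, 108]],
    "Product_Name": ["Shoe", "jeans", "make-up"],
})

# Tag each left row with a hashable position (hypothetical helper column).
left = df1.assign(left_row=range(len(df1)))

# Cartesian product; the shared "Id" column becomes Id_x / Id_y.
pairs = left.merge(df2, how="cross")

# Keep pairs where the left Ids are a subset of the right Ids.
keep = [set(a) <= set(b) for a, b in zip(pairs["Id_x"], pairs["Id_y"])]
inner = pairs[keep]

# Left rows with no surviving pair are added back with NaN on the right side.
unmatched = left[~left["left_row"].isin(inner["left_row"])].rename(columns={"Id": "Id_x"})

out = pd.concat([inner, unmatched], ignore_index=True).drop(columns="left_row")
print(out)
```

The cross join materialises len(df1) * len(df2) rows before filtering, so this is only practical for modestly sized frames.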
Joining two dataframes on a condition (grepl)
TLDR
You just need to fix match_fun:
# ...
match_fun = list(`==`, stringr::str_detect),
# ...
Background
You had the right idea, but you went wrong in your interpretation of the match_fun parameter in fuzzyjoin::fuzzy_join(). Per the documentation, match_fun should be a
"Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in by (if a named list, it uses the names in x). If only one function is given it is used on all column pairs."
Solution
A simple correction will do the trick, with further formatting by dplyr. For conceptual clarity, I've typographically aligned the by columns with the functions used to match them:
library(dplyr)
# ...
# Existing code
# ...
joined_dfs <- fuzzy_join(
  df1, df2,
  by        =    c("ages", "fullnames" = "lastnames"),
  #                |----|  |-----------------------|
  match_fun = list(`==`  , stringr::str_detect      ),
  #                |--|    |-----------------|
  # Match by equality ^    ^ Match by detection of `lastnames` in `fullnames`
  mode = "left"
) %>%
  # Format resulting dataset as you requested.
  select(fullnames, ages = ages.x, homestate)
Result
Given your sample data reproduced here
df1 <- data.frame(
fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
ages = c(30, 51, 45, 38, 20)
)
df2 <- data.frame(
lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
ages = c(30, 45, 20, 28, 51, 38),
homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)
this solution should produce the following data.frame for joined_dfs, formatted as requested:
       fullnames ages homestate
1       Jane Doe   30        NJ
2 Mr. John Smith   51        MA
3 Nate Cox, Esq.   45        CT
4   Bill Lee III   38        NY
5 Ms. Kate Smith   20        MA
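For comparison, a rough pandas analogue of the same two-condition match is sketched below, using the sample data from this answer. It is an inner-join approximation (exact merge on ages, then a literal substring filter), not a port of fuzzyjoin: unlike mode = "left", left rows without a match would be dropped.

```python
import pandas as pd

df1 = pd.DataFrame({
    "fullnames": ["Jane Doe", "Mr. John Smith", "Nate Cox, Esq.",
                  "Bill Lee III", "Ms. Kate Smith"],
    "ages": [30, 51, 45, 38, 20],
})
df2 = pd.DataFrame({
    "lastnames": ["Doe", "Cox", "Smith", "Jung", "Smith", "Lee"],
    "ages": [30, 45, 20, 28, 51, 38],
    "homestate": ["NJ", "CT", "MA", "RI", "MA", "NY"],
})

# Condition 1: exact equality on ages, handled by an ordinary merge.
candidates = df1.merge(df2, on="ages", how="inner")

# Condition 2: literal (non-regex) substring test, evaluated pairwise per row.
contains = [
    last in full
    for full, last in zip(candidates["fullnames"], candidates["lastnames"])
]

joined = candidates.loc[contains, ["fullnames", "ages", "homestate"]].reset_index(drop=True)
print(joined)
```

Python's `in` operator on strings never interprets its operand as a pattern, so no regex escaping is needed here.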
Note
Because each ages is coincidentally a unique key, the following join on only *names
fuzzy_join(
df1, df2,
by = c("fullnames" = "lastnames"),
match_fun = stringr::str_detect,
mode = "left"
)
will better illustrate the behavior of matching on substrings:
       fullnames ages.x lastnames ages.y homestate
1       Jane Doe     30       Doe     30        NJ
2 Mr. John Smith     51     Smith     20        MA
3 Mr. John Smith     51     Smith     51        MA
4 Nate Cox, Esq.     45       Cox     45        CT
5   Bill Lee III     38       Lee     38        NY
6 Ms. Kate Smith     20     Smith     20        MA
7 Ms. Kate Smith     20     Smith     51        MA
Where You Went Wrong
Error in Type
The value passed to match_fun should be either (the symbol for) a function
fuzzyjoin::fuzzy_join(
# ...
match_fun = grepl
# ...
)
or a list of such (symbols for) functions:
fuzzyjoin::fuzzy_join(
# ...
match_fun = list(`=`, grepl)
# ...
)
Instead of providing a list of symbols
match_fun = list(`=`, grepl)
you incorrectly provided a vector of character strings:
match_fun = c("=", "grepl()")
Error in Syntax
You should name the functions
`=`
grepl
yet you incorrectly attempted to call them:
=
grepl()
Naming them will pass the functions themselves to match_fun, as intended, whereas calling them will pass their return values*. In R, an operator like = is named using backticks: `=`.
* Assuming the calls didn't fail with errors. Here, they would fail.
Inappropriate Functions
To compare two values for equality, here the numeric vectors df1$ages and df2$ages, you should use the relational operator ==; yet you incorrectly supplied the assignment operator =.
Furthermore, grepl() is not vectorized in quite the way match_fun desires. While its second argument (x) is indeed a vector
"a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported."
its first argument (pattern) is (treated as) a single character string:
"character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr, gregexpr and regexec."
Thus, grepl() is not a
"Vectorized function given two columns..."
but rather a function given one string (scalar) and one column (vector) of strings.
The answer to your prayers is not grepl() but rather something like stringr::str_detect(), which is
"Vectorised over string and pattern. Equivalent to grepl(pattern, x)."
and which wraps stringi::stri_detect().
Note
Since you're simply trying to detect whether a literal string in df1$fullnames contains a literal string in df2$lastnames, you don't want to accidentally treat the strings in df2$lastnames as regular expression patterns. Now your df2$lastnames column is statistically unlikely to contain names with special regex characters, with the lone exception of -, which is interpreted literally outside of [], which are very unlikely to be found in a name.
If you're still worried about accidental regex, you might want to consider alternative search methods with stringi::stri_detect_fixed() or stringi::stri_detect_coll(). These perform literal matching, respectively by either byte or "canonical equivalence"; the latter adjusts for locale and special characters, in keeping with natural language processing.