Create group based on fuzzy criteria

Approach

Here's a solution with data.table, as preferred:

I would prefer a solution with data.table but any solutions at all are much appreciated!

While dplyr and fuzzyjoin might appear more elegant, they might also prove less efficient with sufficiently large datasets.

Credit goes to ThomasIsCoding for beating me to the punch on this other question, with an answer that harnesses igraph to index networks in graphs. Here, the networks are the separate "chains" (Wanted groups) composed of "links" (data.frame rows), which are joined by their "closeness" (between their Start_Dates and End_Dates). Such an approach seemed necessary to model the transitive relationship ℛ requested here

I am trying to create the chain of "close" links so that I can map A's movements over time.

with care to also preserve the symmetry of ℛ (see Further Reading).

Per that same request

So I would ideally like to flag situations where one observation's start date (2016-01-01) is being "fuzzily grouped" with two different end dates (2015-01-02, and 2016-12-31) and vice versa.

and your further clarification

...I would want another column that indicates that [flag].

I have also included a Flag column, to flag each row whose Start_Date is matched by the End_Dates of at least flag_at other rows; or vice versa.
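To make the idea of "chains" and "links" concrete before the R solution, here is a minimal Python sketch of the same logic. It is not the data.table/igraph code: it assumes a small union-find as a stand-in for igraph's components, and a calendar-year shift as an approximation of lubridate::years(1). The names shift_years and group_rows are hypothetical; the sample rows are copied from the question.

```python
from datetime import date

# Sample rows copied from the question: (Name, Start_Date, End_Date or None).
rows = [
    ("A", date(2015, 1, 1), date(2019, 12, 29)),
    ("A", date(2017, 3, 25), None),
    ("A", date(2019, 10, 17), None),
    ("A", date(2012, 4, 16), date(2015, 1, 9)),
    ("A", date(2002, 6, 1), date(2006, 2, 1)),
    ("A", date(2005, 12, 24), None),
    ("B", date(2018, 1, 23), None),
]


def shift_years(d, n):
    """Shift a date by n calendar years (assumption: Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + n)
    except ValueError:
        return d.replace(year=d.year + n, day=28)


def group_rows(rows, tolerance_years=1, flag_at=2):
    """Return (groups, flags): a group index per row (None if unlinked) and a flag per row."""
    n = len(rows)
    parent = list(range(n))  # union-find forest, standing in for igraph components

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    start_matched = [0] * n  # times a row's Start_Date is matched by another End_Date
    end_matched = [0] * n    # times a row's End_Date is matched by another Start_Date
    linked = [False] * n
    for x, (x_name, x_start, _) in enumerate(rows):
        for y, (y_name, _, y_end) in enumerate(rows):
            if x == y or x_name != y_name or y_end is None:
                continue
            low = shift_years(y_end, -tolerance_years)
            high = shift_years(y_end, tolerance_years)
            if low <= x_start <= high:
                start_matched[x] += 1
                end_matched[y] += 1
                linked[x] = linked[y] = True
                parent[find(x)] = find(y)  # merge the two chains (undirected link)

    labels = {}
    groups = [
        labels.setdefault(find(i), len(labels) + 1) if linked[i] else None
        for i in range(n)
    ]
    flags = [s >= flag_at or e >= flag_at
             for s, e in zip(start_matched, end_matched)]
    return groups, flags


groups, flags = group_rows(rows)
groups_1, flags_1 = group_rows(rows, flag_at=1)
```

The nested loop is quadratic in the number of rows, which is fine for illustration; the non-equi join in the solution below is the part that scales.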

Solution

Using your sample data.frame, reproduced here as my_data_frame

# Generate dataset as data.frame.
my_data_frame <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "B"),
                                Start_Date = structure(c(16436, 17250, 18186, 15446, 11839, 13141, 17554),
                                                       class = "Date"),
                                End_Date = structure(c(18259, NA, NA, 16444, 13180, NA, NA),
                                                     class = "Date")),
                           row.names = c(NA, -7L),
                           class = "data.frame")

we apply data.table and igraph (among other packages) as follows:

library(tidyverse)
library(data.table)
library(lubridate)
library(igraph)

# ...
# Code to generate your data.frame 'my_data_frame'.
# ...

# Treat dataset as a data.table.
my_data_table <- my_data_frame %>% data.table::as.data.table()

# Define the tolerance threshold as a (lubridate) "period": 1 year.
tolerance <- lubridate::years(1)

# Set the minimum number of matches for a row to be flagged: 2.
flag_at <- 2

#####################################
# BEGIN: Start Indexing the Groups. #
#####################################

# Begin indexing the "chain" (group) to which each "link" (row) belongs:
output <- my_data_table %>%
  
  ########################################################
  # STEP 1: Link the Rows That Are "Close" to Each Other #
  ########################################################
  
  # Prepare data.table for JOIN, by adding appropriate helper columns.
  .[, `:=`(# Uniquely identify each row (by row number).
           ID = .I,
           # Boundary columns for tolerance threshold.
           End_Low = End_Date - tolerance,
           End_High = End_Date + tolerance)] %>%
    
  # JOIN rows to each other, to obtain pairings.
  .[my_data_table,
    # Clearly describe the relation R: x R y whenever the 'Start_Date' of x is
    # close enough to (within the boundary columns for) the 'End_Date' of y.
    .(x.ID = i.ID, x.Name = i.Name, x.Start_Date = i.Start_Date, x.End_Date = i.End_Date,
      y.End_Low = x.End_Low, y.End_High = x.End_High, y.ID = x.ID, y.Name = x.Name),
    # JOIN criteria:
    on = .(# Only pair rows having the same name.
           Name,
           # Only pair rows whose start and end dates are within the tolerance
           # threshold of each other.
           End_Low <= Start_Date,
           End_High >= Start_Date),
    # Make it an OUTER JOIN, to include those rows without a match.
    nomatch = NA] %>%
  
  # Prepare pairings for network analysis.
  .[# Ensure no row is reflexively paired with itself.
    #   NOTE: This keeps the graph clean by trimming extraneous loops, and it
    #   prevents an "orphan" row from contributing to its own tally of matches.
    !(x.ID == y.ID) %in% TRUE,
    # Simplify the dataset to only the pairings (by ID) of linked rows.
    .(from = x.ID, to = y.ID)]

#############################
# PAUSE: Count the Matches. #
#############################

# Count how many times each row has its 'End_Date' matched by a 'Start_Date'.
my_data_table$End_Matched <- output %>%
  
  # Include again the missing IDs for y that were never matched by the JOIN.
  .[my_data_table[, .(ID)], on = .(to = ID)] %>%
  
  # For each row y, count every other row x where x R y.
  .[, .(Matches = sum(!is.na(from))), by = to] %>%
  
  # Extract the count column.
  .$Matches

# Count how many times each row has its 'Start_Date' matched by an 'End_Date'.
my_data_table$Start_Matched <- output %>%
  
  # For each row x, count every other row y where x R y.
  .[, .(Matches = sum(!is.na(to))), by = from] %>%
  
  # Extract the count column.
  .$Matches

#########################################
# RESUME: Continue Indexing the Groups. #
#########################################

# Resume indexing:
output <- output %>%
  
  # Ignore nonmatches (NAs) which are annoying to process into a graph.
  .[from != to, ] %>%
  
  ###############################################################
  # STEP 2: Index the Separate "Chains" Formed By Those "Links" #
  ###############################################################
  
  # Convert pairings (by ID) of linked rows into an undirected graph.
  igraph::graph_from_data_frame(directed = FALSE) %>%
  
  # Find all groups (subgraphs) of transitively linked IDs.
  igraph::components() %>%
  
  # Pair each ID with its group index.
  igraph::membership() %>%
  
  # Tabulate those pairings...
  utils::stack() %>% utils::type.convert(as.is = TRUE) %>%
  
  # ...in a properly named data.table.
  data.table::as.data.table() %>% .[, .(ID = ind, Group_Index = values)] %>%
  
  
  
  #####################################################
  # STEP 3: Match the Original Rows to their "Chains" #
  #####################################################
  
  # LEFT JOIN (on ID) to match each original row to its group index (if any).
  .[my_data_table, on = .(ID)] %>%
  
  # Transform output into final form.
  .[# Sort into original order.
    order(ID),
    .(# Select existing columns.
      Name, Start_Date, End_Date,
      # Rename column having the group indices.
      Wanted = Group_Index,
      # Calculate column(s) to flag rows with sufficient matches.
      Flag = (Start_Matched >= flag_at) | (End_Matched >= flag_at))]

# View results.
output

Result

The resulting output is the following data.table:

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1 FALSE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1 FALSE
4:    A 2012-04-16 2015-01-09      1 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      2 FALSE
7:    B 2018-01-23       <NA>     NA FALSE

Keep in mind that the Flags are all FALSE simply because your data lacks any Start_Date matched by (at least) two End_Dates, and likewise lacks any End_Date matched by (at least) two Start_Dates.

Hypothetically, if we lowered flag_at to 1, then the output would Flag every row with even a single match (in either direction):

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1  TRUE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1  TRUE
4:    A 2012-04-16 2015-01-09      1  TRUE
5:    A 2002-06-01 2006-02-01      2  TRUE
6:    A 2005-12-24       <NA>      2  TRUE
7:    B 2018-01-23       <NA>     NA FALSE

Warning

Because some data.table operations modify by reference (or "in-place"), the value of my_data_table changes throughout the workflow. After Step 1, my_data_table becomes

   Name Start_Date   End_Date ID    End_Low   End_High
1:    A 2015-01-01 2019-12-29  1 2018-12-29 2020-12-29
2:    A 2017-03-25       <NA>  2       <NA>       <NA>
3:    A 2019-10-17       <NA>  3       <NA>       <NA>
4:    A 2012-04-16 2015-01-09  4 2014-01-09 2016-01-09
5:    A 2002-06-01 2006-02-01  5 2005-02-01 2007-02-01
6:    A 2005-12-24       <NA>  6       <NA>       <NA>
7:    B 2018-01-23       <NA>  7       <NA>       <NA>

a structural departure from the my_data_frame it initially copied.

Since dplyr (among other packages) assigns by value rather than by reference, a dplyr solution would sidestep this issue entirely.

As it is, however, you must take care when modifying the workflow, because the version of my_data_table available before Step 1 cannot be recovered afterwards.

Further Reading

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      4 FALSE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      3 FALSE
4:    A 2012-04-16 2015-01-09      5 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      1 FALSE
7:    B 2018-01-23       <NA>     NA FALSE

By contrast, the rows can be properly grouped via the use of an undirected graph (directed = FALSE); or via more lenient criteria (mode = "weak"). Either of these approaches will effectively simulate the presence of y ℛ x whenever x ℛ y is present in the JOINed data.

This symmetric property is particularly important when modeling the behavior you describe here:

...one observation's start date (2016-01-01) is being "fuzzily grouped" with two different end dates (2015-01-02, and 2016-12-31)...

In this situation, you want the model to recognize that any two rows y and z must be in the same group (y ℛ z), whenever their End_Dates match the same Start_Date of some other row x: x ℛ y and x ℛ z.

So suppose we know that x ℛ y and x ℛ z. Because our model has preserved symmetry, we can say from x ℛ y that y ℛ x too. Since we now know that y ℛ x and x ℛ z, transitivity implies that y ℛ z. Thus, our model recognizes that y ℛ z whenever x ℛ y and x ℛ z! Similar logic will suffice for "vice versa".
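The role of symmetry can be illustrated with a small Python sketch. It assumes the links found by the JOIN on the sample data are the (from, to) pairs 3→1, 1→4 and 6→5 (by row ID), and it compares direction-respecting components against components computed after adding each reverse link. The helpers reachable and components are hypothetical names, and the brute-force reachability is only suitable for tiny graphs.

```python
def reachable(links, start):
    """Return every node reachable from `start` by following links forward."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for src, dst in links:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def components(nodes, links, symmetric):
    """Group nodes that can reach each other; optionally treat links as symmetric."""
    if symmetric:
        # Simulate symmetry: whenever src -> dst exists, add dst -> src too.
        links = links + [(dst, src) for src, dst in links]
    reach = {node: reachable(links, node) for node in nodes}
    groups = []
    for node in nodes:
        group = frozenset(other for other in reach[node] if node in reach[other])
        if group not in groups:
            groups.append(group)
    return groups


nodes = [1, 3, 4, 5, 6]
links = [(3, 1), (1, 4), (6, 5)]  # assumed (from, to) pairs by row ID

directed_groups = components(nodes, links, symmetric=False)
undirected_groups = components(nodes, links, symmetric=True)
```

Without symmetry, every linked row ends up alone in its own group; with symmetry, rows 1, 3 and 4 form one chain and rows 5 and 6 another.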

We can verify this outcome by using

my_data_frame <- my_data_frame %>%
  rbind(list(Name = "A",
             Start_Date = as.Date("2010-01-01"),
             End_Date = as.Date("2015-01-05")))

to append an 8th row to my_data_frame, prior to the workflow:

    Name Start_Date   End_Date
  1    A 2015-01-01 2019-12-29
# ⋮    ⋮      ⋮           ⋮
  4    A 2012-04-16 2015-01-09
# ⋮    ⋮      ⋮           ⋮
  8    A 2010-01-01 2015-01-05

This 8th row serves as our z, where x is the 1st row and y is the 4th row, as before. Indeed, the output properly classifies x and y and z as belonging to the same group 1: y ℛ z.

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1  TRUE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1 FALSE
4:    A 2012-04-16 2015-01-09      1 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      2 FALSE
7:    B 2018-01-23       <NA>     NA FALSE
8:    A 2010-01-01 2015-01-05      1 FALSE

Likewise, the output properly Flags the 1st row, whose Start_Date is now matched by two End_Dates: in the 4th and 8th rows.
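As a quick cross-check of that Flag, the following Python sketch counts how many End_Dates from the other rows fall within one year of the 1st row's Start_Date. It is only an illustration: shift_years and count_end_matches are hypothetical helpers, and the calendar-year shift is an assumed approximation of lubridate::years(1).

```python
from datetime import date


def shift_years(d, n):
    """Shift a date by n calendar years (assumption: Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + n)
    except ValueError:
        return d.replace(year=d.year + n, day=28)


def count_end_matches(start, ends, tolerance_years=1):
    """Count the End_Dates whose tolerance window contains `start`."""
    return sum(
        shift_years(end, -tolerance_years) <= start <= shift_years(end, tolerance_years)
        for end in ends
        if end is not None
    )


# Start_Date of the 1st row, and the End_Dates of rows 2 through 8 (None = NA).
start_row_1 = date(2015, 1, 1)
other_ends = [None, None, date(2015, 1, 9), date(2006, 2, 1), None, None,
              date(2015, 1, 5)]

flag_at = 2
matches = count_end_matches(start_row_1, other_ends)
flag = matches >= flag_at
```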

Cheers!

Assign rows to a group based on spatial neighborhood and temporal criteria in R
I think this task requires something along the lines of hierarchical clustering.
Note, however, that there will necessarily be some degree of arbitrariness in the ids. This is because it is entirely possible for a cluster of fires to span more than 4 days even though every fire in it is less than 4 days away from some other fire in that cluster (and thus should have the same id).
library(dplyr)

# Create the distances
fire_dist <- fire_df %>%
  # Normalize dates
  mutate( norm_dates = as.numeric(dates)/4) %>%
  # Only keep the three variables of interest
  select( rows, cols, norm_dates ) %>%
  # Compute distance using the L-infinity norm (maximum)
  dist( method="maximum" )

# Do hierarchical clustering with the "single" agglomeration method
fire_clust <- hclust(fire_dist, method="single")

# Cut the tree at height 1 and obtain groups
group_id <- cutree(fire_clust, h=1)

# First attach the group ids back to the data frame
fire_df2 <- cbind( fire_df, group_id ) %>%
  # Then sort the data
  arrange( group_id, dates, rows, cols )

# Print the first 10 records
fire_df2[1:10,]
(Make sure you have the dplyr library installed; if not, you can run install.packages("dplyr",dep=TRUE). It is a really good and very popular library for data manipulation.)
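To see why this works, it helps to know that cutting a single-linkage tree at height h yields the connected components of the graph that joins any two points at distance h or less. The Python sketch below applies that equivalence directly instead of calling a clustering library. The names normalize and single_linkage_groups are hypothetical, and the group numbering (by first appearance) is an assumption of this sketch.

```python
from datetime import date


def normalize(row, col, day, window_days=4):
    """Scale the date so that `window_days` days count as one unit of distance."""
    return (row, col, day.toordinal() / window_days)


def single_linkage_groups(points, height=1.0):
    """Label points by connected component under Chebyshev distance <= height."""
    n = len(points)
    parent = list(range(n))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # L-infinity (maximum) distance, as with dist(method="maximum").
            dist = max(abs(a - b) for a, b in zip(points[i], points[j]))
            if dist <= height:
                parent[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(n)]


# One fire moving one cell per day for six days.
moving_fire = [normalize(i, i, date(2000, 1, i)) for i in range(1, 7)]
moving_ids = single_linkage_groups(moving_fire)

# Two fires on the same day but far apart in space.
separate_fires = [normalize(1, 1, date(2000, 1, 1)),
                  normalize(40, 40, date(2000, 1, 1))]
separate_ids = single_linkage_groups(separate_fires)
```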
A couple of simple tests:
Test #1. The same forest fire moving.
rows<-1:6
cols<-1:6
dates<-seq(from=as.Date("2000/01/01"), to=as.Date("2000/01/06"), by="day")
fire_df<-data.frame(rows, cols, dates)
gives me this:
  rows cols      dates group_id
1    1    1 2000-01-01        1
2    2    2 2000-01-02        1
3    3    3 2000-01-03        1
4    4    4 2000-01-04        1
5    5    5 2000-01-05        1
6    6    6 2000-01-06        1
Test #2. 6 different random forest fires.
set.seed(1234)
rows<-sample(seq(1,50,1),6, replace=TRUE)
cols<-sample(seq(1,50,1),6, replace=TRUE)
dates<-sample(seq(from=as.Date("2000/01/01"), to=as.Date("2000/02/01"), by="day"),6, replace=TRUE)
fire_df<-data.frame(rows, cols, dates)
output:
  rows cols      dates group_id
1    6    1 2000-01-10        1
2   32   12 2000-01-30        2
3   31   34 2000-01-10        3
4   32   26 2000-01-27        4
5   44   35 2000-01-10        5
6   33   28 2000-01-09        6
Test #3: one expanding forest fire
dates <- seq(from=as.Date("2000/01/01"), to=as.Date("2000/01/06"), by="day")
rows_start <- 50
cols_start <- 50

fire_df <- data.frame(dates = dates) %>%
  rowwise() %>%
  do({
    diff = as.numeric(.$dates - as.Date("2000/01/01"))
    expand.grid(rows=seq(rows_start-diff,rows_start+diff),
                cols=seq(cols_start-diff,cols_start+diff),
                dates=.$dates)
  })
gives me:
   rows cols      dates group_id
1    50   50 2000-01-01        1
2    49   49 2000-01-02        1
3    49   50 2000-01-02        1
4    49   51 2000-01-02        1
5    50   49 2000-01-02        1
6    50   50 2000-01-02        1
7    50   51 2000-01-02        1
8    51   49 2000-01-02        1
9    51   50 2000-01-02        1
10   51   51 2000-01-02        1
and so on. (All records identified correctly to belong to the same forest fire.)

How can I match entries in pandas dataframes using multiple criteria and fuzzy logic?
This task is fairly difficult and involves a number of steps, but I can at least lay out some general principles.
Start by tidying up the state column: wherever the full name of a state appears, replace it with the state code. It may also be worth taking some time to clarify the "No state" cases in df1, as a further data-cleaning step.
Then, for each row in df1, attempt to find the best-matching row in df2, using the following procedure:
1. Using process.extract, find in df2 a pool of the best matches by name for the current row, assuming some values for limit and score_cutoff. If the row contains a state, check only the df2 rows from that state. Save the match ratio for each match found as name_ratio.
2. For each item in that pool, compute WRatio on the city column, saving it as city_ratio.
3. Use some aggregation formula to compute total_ratio for each match from name_ratio and city_ratio. I'm not sure what this formula should look like.
4. Take the match with the maximal total_ratio, but if this (best) ratio is below some total_ratio_cutoff, assume that the current row has no match.
Of course, it is up to you to experiment with the particular parameter values and see how changing them affects the final result.
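The procedure above can be sketched in plain Python. This is a rough illustration under several assumptions, not a tested recipe: it uses the standard library's difflib.SequenceMatcher as a stand-in for process.extract and WRatio, represents each DataFrame row as a dict with name, city and state keys, and picks a weighted average as the aggregation formula. The names ratio, best_match and name_weight, the default cutoffs, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity score on a 0-100 scale (stand-in for a fuzzy-matching ratio)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(row, candidates, limit=3, score_cutoff=60,
               total_ratio_cutoff=65, name_weight=0.7):
    """Return the best candidate for `row`, or None if nothing scores well enough."""
    # If the row has a state, only consider candidates from that state.
    pool = [c for c in candidates
            if not row.get("state") or c["state"] == row["state"]]

    # Step 1: pool of the best matches by name.
    scored = [(ratio(row["name"], c["name"]), c) for c in pool]
    scored = [s for s in scored if s[0] >= score_cutoff]
    scored = sorted(scored, key=lambda s: s[0], reverse=True)[:limit]

    # Steps 2-3: score the city and aggregate (assumed formula: weighted average).
    best, best_total = None, 0.0
    for name_ratio, cand in scored:
        city_ratio = ratio(row["city"], cand["city"])
        total_ratio = name_weight * name_ratio + (1 - name_weight) * city_ratio
        if total_ratio > best_total:
            best, best_total = cand, total_ratio

    # Step 4: reject the best match if it falls below the overall cutoff.
    return best if best_total >= total_ratio_cutoff else None


# Hypothetical records standing in for rows of df2.
df2_rows = [
    {"name": "Acme Corporation", "city": "Springfield", "state": "IL"},
    {"name": "Acme Corp", "city": "Portland", "state": "OR"},
    {"name": "Globex", "city": "Springfield", "state": "IL"},
]

found = best_match({"name": "ACME Corp.", "city": "Springfeild", "state": "IL"},
                   df2_rows)
missing = best_match({"name": "Initech", "city": "Austin", "state": "TX"},
                     df2_rows)
```

The weights and cutoffs here are placeholders; as noted above, they need to be tuned against real data.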

is it possible to do fuzzy match merge with python pandas?
Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:
In [23]: import difflib

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
Out[26]:
      letter
one        a
two        b
three      c
four       d
five       e

In [31]: df1.join(df2)
Out[31]:
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e
If these were columns, in the same vein you could apply to the column then merge:
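import difflib
from pandas import DataFrame

df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

# Replace each name in df2 with its closest match in df1.
# Note: [0] raises an IndexError when there is no close match.
df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)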
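One caveat with indexing the result by [0] is that get_close_matches returns an empty list when nothing clears the cutoff. A minimal sketch of a more defensive variant follows, using only the standard library; the helper name closest_or_self and the choice to fall back to the original value are assumptions, and the extra "zzz" entry is added purely to exercise the fallback.

```python
import difflib


def closest_or_self(value, choices, cutoff=0.6):
    """Return the closest choice, or the value unchanged when nothing is close."""
    matches = difflib.get_close_matches(value, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else value


names_df1 = ["one", "two", "three", "four", "five"]
names_df2 = ["one", "too", "three", "fours", "five", "zzz"]

aligned = [closest_or_self(name, names_df1) for name in names_df2]
```

With pandas, the same helper could be passed to apply in place of the lambda above, so that unmatched names simply fail to merge instead of raising.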

scala merge tuples using fuzzy string match
Here's an approach to preprocess your input with fuzzy-match, which will then be used as input by your existing code.
The idea is to first generate 2-combinations of your input tuples, fuzzy-match them to create a Map of distinct Sets consisting of the matched values per key, and finally use the Map to fuzzy-match your original input.
To make sure more arbitrary cases are covered, I've expanded your input:
val input = List(
  ("a", "10 in"), ("a", "15 in"), ("a", "10 inches"), ("a", "15 Inches"), ("a", "15.00 inches"),
  ("b", "2 cm"), ("b", "4 cm"), ("b", "2.00 CM"),
  ("c", "7 cm"), ("c", "7 in")
)

// Trivialized fuzzy match
def fuzzyMatch(s1: String, s2: String): Boolean = {
  val st1 = s1.toLowerCase.replace(".00", "").replace("inches", "in")
  val st2 = s2.toLowerCase.replace(".00", "").replace("inches", "in")
  st1 == st2
}

// Create a Map of Sets of fuzzy-matched values from all 2-combinations per key
val fuzMap = input.combinations(2).foldLeft( Map[String, Seq[Set[String]]]() ){
  case (m, Seq(t1: Tuple2[String, String], t2: Tuple2[String, String])) =>
    if (fuzzyMatch(t1._2, t2._2)) {
      val fuzSets = m.getOrElse(t1._1, Seq(Set(t1._2, t2._2))).map(
        x => if (x.contains(t1._2) || x.contains(t2._2)) x ++ Set(t1._2, t2._2) else x
      )
      if (!fuzSets.flatten.contains(t1._2) && !fuzSets.flatten.contains(t2._2))
        m + (t1._1 -> (fuzSets :+ Set(t1._2, t2._2)))
      else
        m + (t1._1 -> fuzSets)
    }
    else
      m
}

// fuzMap: scala.collection.immutable.Map[String,Seq[Set[String]]] = Map(
//   a -> List(Set(10 in, 10 inches), Set(15 in, 15 Inches, 15.00 inches)),
//   b -> List(Set(2 cm, 2.00 CM))
// )
Note that for large input, it might make sense to first groupBy key and generate 2-combinations per key.
Next step would be to fuzzy-match the original input using the created Map:
// Fuzzy-match original input using fuzMap
val fuzInput = input.map{ case (k, v) =>
  if (fuzMap.get(k).isDefined) {
    val fuzValues = fuzMap(k).map{
      case x => if (x.contains(v)) Some(x.min) else None
    }.flatten
    if (!fuzValues.isEmpty)
      (k, fuzValues.head)
    else
      (k, v)
  }
  else
    (k, v)
}

// fuzInput: List[(String, String)] = List(
//   (a,10 in), (a,15 Inches), (a,10 in), (a,15 Inches), (a,15 Inches),
//   (b,2 cm), (b,4 cm), (b,2 cm),
//   (c,7 cm), (c,7 in)
// )
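For comparison, the same idea can be sketched in Python. This version is a simplification and rests on one assumption: because the trivialized matcher reduces to equality of normalized strings, values can be grouped per key by their normalized form directly, without generating 2-combinations. With a genuinely fuzzy (non-transitive) matcher this shortcut would not apply. The names normalize and canonicalize are hypothetical.

```python
from collections import defaultdict

pairs = [
    ("a", "10 in"), ("a", "15 in"), ("a", "10 inches"), ("a", "15 Inches"),
    ("a", "15.00 inches"),
    ("b", "2 cm"), ("b", "4 cm"), ("b", "2.00 CM"),
    ("c", "7 cm"), ("c", "7 in"),
]


def normalize(value):
    """Same trivial normalization as fuzzyMatch: case, '.00' and 'inches'."""
    return value.lower().replace(".00", "").replace("inches", "in")


def canonicalize(pairs):
    """Replace each value with the smallest member of its fuzzy-equal group."""
    groups = defaultdict(set)
    for key, value in pairs:
        groups[(key, normalize(value))].add(value)
    return [(key, min(groups[(key, normalize(value))])) for key, value in pairs]


fuz_input = canonicalize(pairs)
```

As in the Scala code, the representative of each group is its minimum under ordinary string ordering.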
