R - simple Record Linkage - the next step ?
Taken from this post, here's an example that should work for you:
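library(RecordLinkage)
library(magrittr)  # provides the %>% pipe

# Toy data: near-duplicate names in a single column called "name"
tv3 <- data.frame(name = c("TOURDEFRANCE", "TOURDEFRANCE", "TOURDE FRANCE",
                           "TOURDE FRANZ", "GET FRESH"))
tv3 %>% compare.dedup(strcmp = TRUE) %>%  # compare record pairs using a string comparator
  epiWeights() %>%                        # compute a weight for each pair
  epiClassify(0.5) %>%                    # classify pairs using 0.5 as the threshold
  getPairs(show = "links", single.rows = TRUE) -> matches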
As a result, the matches data frame should help you determine a suitable threshold (set in epiClassify()).
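To get a feel for how a threshold separates links from non-links, here is a rough Python sketch on the same five names. It is only an illustration: it uses difflib's SequenceMatcher ratio as a stand-in similarity, not the weights that epiWeights() computes, and the helper names score_pairs and links are made up for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations

names = ["TOURDEFRANCE", "TOURDEFRANCE", "TOURDE FRANCE",
         "TOURDE FRANZ", "GET FRESH"]


def score_pairs(values):
    """Return (i, j, similarity) for every unordered pair, best score first."""
    scored = [
        (i, j, SequenceMatcher(None, values[i], values[j]).ratio())
        for i, j in combinations(range(len(values)), 2)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


def links(values, threshold):
    """Keep only the pairs whose similarity reaches the threshold."""
    return [(i, j) for i, j, score in score_pairs(values) if score >= threshold]


# Inspect the scores to see where a sensible cut-off might lie.
for i, j, score in score_pairs(names):
    print(f"{names[i]!r} ~ {names[j]!r}: {score:.2f}")
```

Raising or lowering the threshold passed to links() and checking which pairs survive is the same kind of inspection the matches data frame allows in R.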
Generating a unique ID column for large dataset with the RecordLinkage package
First, import the following libraries:
library(RecordLinkage)
library(dplyr)
library(magrittr)
Consider these example datasets from the RecordLinkage package:
data(RLdata500)
data(RLdata10000)
Assume we care about these matching variables and threshold:
matching_variables <- c("fname_c1", "lname_c1", "by", "bm", "bd")
threshold <- 0.5
The record linkage for SMALL datasets is as follows:
RLdata <- RLdata500
df_names <- data.frame(RLdata[, matching_variables])
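df_names %>%
  compare.dedup() %>%          # build comparison patterns for the record pairs
  epiWeights() %>%             # compute a weight for each pair
  epiClassify(threshold) %>%   # classify pairs against the threshold
  getPairs(show = "links", single.rows = TRUE) -> matching_data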
Here, the following SMALL data manipulation may be applied to append the appropriate IDs to the given dataset (same code from here):
RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
                       select(matching_data, id1, id2) %>%
                         arrange(id1) %>% filter(!duplicated(id2)),
                       by = c("ID" = "id2")) %>%
  mutate(ID = ifelse(is.na(id1), ID, id1)) %>%
  select(-id1)
RLdata$ID <- RLdata_ID$ID
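The join above is compact, so it may help to see its logic spelled out. The following Python sketch mirrors it with plain data structures, assuming 1-based row numbers and a list of linked (id1, id2) pairs; the function name assign_ids and the sample pairs are made up for illustration.

```python
def assign_ids(n_records, linked_pairs):
    """Give each record its own row number as ID, unless it appears as the
    second member of a linked pair; then it takes the smallest first-member id."""
    smallest_partner = {}
    # Sorting by id1 and keeping the first hit per id2 corresponds to
    # arrange(id1) followed by filter(!duplicated(id2)).
    for id1, id2 in sorted(linked_pairs):
        smallest_partner.setdefault(id2, id1)
    # Records without a partner keep their own row number (the is.na() branch).
    return [smallest_partner.get(row, row) for row in range(1, n_records + 1)]


print(assign_ids(5, [(2, 3), (1, 3), (1, 2)]))
```

Note that this logic looks only one step back: with the pairs (1, 2) and (2, 3) but no direct (1, 3) link, record 3 receives ID 2 while record 2 receives ID 1.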
The equivalent code for LARGE datasets is as follows:
RLdata <- RLdata10000
df_names <- data.frame(RLdata[, matching_variables])
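df_names %>%
  RLBigDataDedup() %>%         # big-data counterpart of compare.dedup()
  epiWeights() %>%             # compute a weight for each pair
  epiClassify(threshold) %>%   # classify pairs against the threshold
  getPairs(filter.link = "link", single.rows = TRUE) -> matching_data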
Here, the following LARGE data manipulation may be applied to append the appropriate IDs to the given dataset (similar to code from here):
RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
                       select(matching_data, id.1, id.2) %>%
                         arrange(id.1) %>% filter(!duplicated(id.2)),
                       by = c("ID" = "id.2")) %>%
  mutate(ID = ifelse(is.na(id.1), ID, id.1)) %>%
  select(-id.1)
RLdata$ID <- RLdata_ID$ID
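If links are chained (A matches B, B matches C, but A and C are not linked directly), the join-based assignment can leave one group of duplicates with more than one ID. One possible alternative, not part of the answer above, is to merge linked records with a union-find structure so that every record in a chain shares the smallest row number. This is a sketch under that assumption; cluster_ids is a hypothetical helper name.

```python
def cluster_ids(n_records, linked_pairs):
    """Assign each record the smallest row number in its linked cluster.

    Row numbers are assumed to be 1-based, as in the R code.
    """
    parent = list(range(n_records + 1))  # slot 0 is unused

    def find(x):
        # Walk up to the cluster representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller row number as the representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(row) for row in range(1, n_records + 1)]


print(cluster_ids(5, [(1, 2), (2, 3)]))
```

Whether chained links should be merged this way depends on the data: a low threshold can chain together records that are not true duplicates.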
Retrieving matched record ids in the recordlinkage library
Don't forget that a Pandas data frame has an "index" in addition to its data columns. Usually this is a single "extra" column of integers or strings, but more complex indices are possible, e.g. a "multi-index" consisting of more than one column.
You can see this if you print(matches.head()). The first two columns have names that are slightly offset, because they aren't data columns; they are columns in the index itself. This data frame index is in fact a multi-index containing two columns: rec_id_1 and rec_id_2.
The result from load_febrl encodes record ID as the index of dfA. Compare.compute preserves the indices of the input data: you can always expect the indices from the original data to be preserved as a multi-index.
The index of a data frame by itself can be accessed with the DataFrame.index attribute. This returns an Index object (of which MultiIndex is a subclass) that can in turn be converted as follows:
- Index.tolist(): convert to a list of its elements; MultiIndex becomes a list of tuples
- Index.to_series(): convert to a Series of its elements; MultiIndex becomes a Series of tuples
- Index.values: access underlying data as a NumPy ndarray; MultiIndex becomes an ndarray of tuples
- Index.to_frame(): convert to a DataFrame, with index columns as data frame columns
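As a quick illustration of these conversions, here is a minimal sketch on a hand-built frame. The record IDs and the data columns given_name and surname are made up for the example; only the index level names rec_id_1 and rec_id_2 come from the text above.

```python
import pandas as pd

# Hand-built stand-in for a comparison result: a two-level index of record
# IDs plus two made-up feature columns.
pair_index = pd.MultiIndex.from_tuples(
    [("rec-1-org", "rec-1-dup-0"), ("rec-2-org", "rec-2-dup-0")],
    names=["rec_id_1", "rec_id_2"],
)
matches = pd.DataFrame({"given_name": [1, 1], "surname": [1, 0]}, index=pair_index)

as_list = matches.index.tolist()                # list of (rec_id_1, rec_id_2) tuples
as_series = matches.index.to_series()           # Series of tuples
as_array = matches.index.values                 # object ndarray of tuples
as_frame = matches.index.to_frame(index=False)  # DataFrame with one column per level

print(as_list)
print(as_frame)
```

Passing index=False to to_frame() gives the new frame a plain integer index instead of repeating the multi-index.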
So you can quickly access the record IDs with matches.index, or export them to a list with matches.index.tolist().
You can also use matches.reset_index() to turn Index columns back into regular data columns.
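A small sketch of what reset_index() does to such a frame, again on made-up data (the record IDs and the score column are illustrative, not output of the recordlinkage library):

```python
import pandas as pd

pair_index = pd.MultiIndex.from_tuples(
    [("rec-1-org", "rec-1-dup-0"), ("rec-2-org", "rec-2-dup-0")],
    names=["rec_id_1", "rec_id_2"],
)
scored = pd.DataFrame({"score": [2, 1]}, index=pair_index)

# Both index levels become ordinary columns; the new index is 0, 1, ...
flat = scored.reset_index()
print(flat)
```

After this, the record IDs can be filtered, merged, or exported like any other column.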
Show all matched pairs in a single dataframe - Python Record Linkage
Using stack with iloc or reindex:
df.iloc[m.to_frame().stack()].assign(key=m.to_frame().reset_index(drop=True).stack().index.get_level_values(0))
Out[205]:
    col_1  col_2  key
1       2      3    0
10     20     21    0
1       2      3    1
11     22     23    1
2       4      5    2
10     20     21    2
2       4      5    3
11     22     23    3
3       6      7    4
10     20     21    4
3       6      7    5
11     22     23    5
8      16     17    6
10     20     21    6
8      16     17    7
11     22     23    7
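The one-liner is easier to follow when split into steps. The sketch below rebuilds inputs that are consistent with the output shown (a 12-row df and a MultiIndex m of candidate pairs); these inputs are reconstructed for illustration, not taken from the original question. It uses label-based .loc instead of .iloc, since .iloc only gives the same rows when index labels coincide with positions.

```python
import pandas as pd

# Assumed inputs: row i holds (2*i, 2*i + 1); m pairs records 1, 2, 3, 8 with 10 and 11.
df = pd.DataFrame({"col_1": range(0, 24, 2), "col_2": range(1, 25, 2)})
m = pd.MultiIndex.from_product([[1, 2, 3, 8], [10, 11]])

# One row per pair, numbered 0..n-1; that row number serves as the pair key.
pairs = m.to_frame(index=False)

# Stack both members of each pair into one long Series:
# index = (pair key, level), values = record labels.
stacked = pairs.stack()

# Look up every record by label and attach the key of the pair it belongs to.
result = df.loc[stacked.to_numpy()].assign(
    key=stacked.index.get_level_values(0).to_numpy()
)
print(result)
```

Rows that share a key value form one matched pair, so result.groupby("key") iterates over the pairs.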