Using Recordlinkage to Add a Column with a Number for Each Person

R - simple Record Linkage - the next step ?

Taken from this post, here's an example that should work for you:

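library(RecordLinkage)
library(magrittr)  # provides the %>% pipe

# Small example data: near-duplicate names in a single "name" column
tv3 <- data.frame(name = c("TOURDEFRANCE", "TOURDEFRANCE", "TOURDE FRANCE",
                           "TOURDE FRANZ", "GET FRESH"))

# Compare all record pairs with a string comparator, weight them,
# classify pairs above the threshold as links, and keep the links
tv3 %>% compare.dedup(strcmp = TRUE) %>%
  epiWeights() %>%
  epiClassify(0.5) %>%
  getPairs(show = "links", single.rows = TRUE) -> matches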

As a result, the matches data frame should help you determine a suitable threshold (the value passed to epiClassify()).

Generating a unique ID column for large dataset with the RecordLinkage package

First, load the following libraries:

library(RecordLinkage)
library(dplyr)
library(magrittr)

Consider these example datasets from the RecordLinkage package:

data(RLdata500)
data(RLdata10000)

Assume we care about these matching variables and threshold:

matching_variables <- c("fname_c1", "lname_c1", "by", "bm", "bd")
threshold <- 0.5

The record linkage for SMALL datasets is as follows:

RLdata <- RLdata500
df_names <- data.frame(RLdata[, matching_variables])
df_names %>%
compare.dedup() %>%
epiWeights() %>%
epiClassify(threshold) %>%
getPairs(show = "links", single.rows = TRUE) -> matching_data

Here, the following SMALL data manipulation may be applied to append the appropriate IDs to the given dataset (same code from here):

RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
select(matching_data, id1, id2) %>%
arrange(id1) %>% filter(!duplicated(id2)),
by = c("ID" = "id2")) %>%
mutate(ID = ifelse(is.na(id1), ID, id1)) %>%
select(-id1)
RLdata$ID <- RLdata_ID$ID
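
For readers who want to see what the join above does step by step, here is a minimal sketch of the same ID-assignment logic in pandas rather than R. The pairs table and the record count are hypothetical stand-ins for the output of getPairs(), not data from RLdata500.

```python
import pandas as pd

# Hypothetical link pairs (id1 < id2), standing in for matching_data
pairs = pd.DataFrame({"id1": [1, 1, 3], "id2": [4, 5, 5]})
n_records = 6  # hypothetical number of rows in the dataset

# For each id2, keep only the smallest id1 it is linked to
first_link = (
    pairs.sort_values("id1")
    .drop_duplicates("id2")
    .set_index("id2")["id1"]
)

# Every record starts with its own row number as ID ...
own_id = pd.Series(range(1, n_records + 1), index=range(1, n_records + 1))

# ... and is overwritten by the linked id1 where a link exists
ids = first_link.reindex(own_id.index).fillna(own_id).astype(int).rename("ID")
print(ids)
```

In this sketch, records 4 and 5 take the ID of record 1. Record 3 keeps its own ID even though it is linked to record 5, because links are only followed one step and are not resolved transitively; the R code above behaves the same way.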

The equivalent code for LARGE datasets is as follows:

RLdata <- RLdata10000
df_names <- data.frame(RLdata[, matching_variables])
df_names %>%
RLBigDataDedup() %>%
epiWeights() %>%
epiClassify(threshold) %>%
getPairs(filter.link = "link", single.rows = TRUE) -> matching_data

Here, the following LARGE data manipulation may be applied to append the appropriate IDs to the given dataset (similar to code from here):

RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
select(matching_data, id.1, id.2) %>%
arrange(id.1) %>% filter(!duplicated(id.2)),
by = c("ID" = "id.2")) %>%
mutate(ID = ifelse(is.na(id.1), ID, id.1)) %>%
select(-id.1)
RLdata$ID <- RLdata_ID$ID

Retrieving matched record ids in the recordlinkage library

Don't forget that a Pandas data frame has an "index" in addition to its data columns. Usually this is a single "extra" column of integers or strings, but more complex indices are possible, e.g. a "multi-index" consisting of more than one column.

You can see this if you print(matches.head()). The first two columns have names that are slightly offset, because they aren't data columns; they are columns in the index itself. This data frame index is in fact a multi-index containing two columns: rec_id_1 and rec_id_2.

The result from load_febrl encodes record ID as the index of dfA. Compare.compute preserves the indices of the input data: you can always expect the indices from the original data to be preserved as a multi-index.

The index of a data frame by itself can be accessed with the DataFrame.index attribute. This returns an Index object (of which MultiIndex is a subclass) that can in turn be converted as follows:

  • Index.tolist(): convert to a list of its elements; MultiIndex becomes a list of tuples
  • Index.to_series(): convert to a Series of its elements; MultiIndex becomes a Series of tuples
  • Index.values: access underlying data as a NumPy ndarray; MultiIndex becomes an ndarray of tuples
  • Index.to_frame(): convert to a DataFrame, with index columns as data frame columns
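
The conversions listed above can be tried on a small frame. This is only a sketch: the matches frame below is a hypothetical stand-in built by hand, with made-up record IDs and comparison columns, not the real output of Compare.compute.

```python
import pandas as pd

# Hypothetical stand-in for `matches`: a two-level index of record IDs
index = pd.MultiIndex.from_tuples(
    [("rec-1-org", "rec-1-dup-0"), ("rec-2-org", "rec-2-dup-0")],
    names=["rec_id_1", "rec_id_2"],
)
matches = pd.DataFrame({"given_name": [1, 1], "surname": [1, 0]}, index=index)

pairs_list = matches.index.tolist()                # list of (id_1, id_2) tuples
pairs_series = matches.index.to_series()           # Series of tuples
pairs_array = matches.index.values                 # ndarray of tuples
pairs_frame = matches.index.to_frame(index=False)  # DataFrame with two ID columns

print(pairs_list)
print(pairs_frame)
```

Passing index=False to to_frame() gives the resulting frame a plain integer index instead of repeating the multi-index.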

So you can quickly access the record IDs with matches.index, or export them to a list of tuples with matches.index.tolist().

You can also use matches.reset_index() to turn Index columns back into regular data columns.
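
As a short illustration, assuming a hand-built matches frame with a hypothetical score column, reset_index() moves both ID levels into ordinary columns, and get_level_values() pulls out a single level:

```python
import pandas as pd

# Hypothetical stand-in for `matches`, indexed by (rec_id_1, rec_id_2)
index = pd.MultiIndex.from_tuples(
    [("rec-1-org", "rec-1-dup-0"), ("rec-2-org", "rec-2-dup-0")],
    names=["rec_id_1", "rec_id_2"],
)
matches = pd.DataFrame({"score": [2, 1]}, index=index)

# Index levels become regular data columns
flat = matches.reset_index()

# IDs from one side of each pair only
left_ids = matches.index.get_level_values("rec_id_1")

print(flat)
print(left_ids.tolist())
```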

Show all matched pairs in a single dataframe - Python Record Linkage

Using stack with iloc or reindex

df.iloc[m.to_frame().stack()].assign(key=m.to_frame().reset_index(drop=True).stack().index.get_level_values(0))
    col_1  col_2  key
1       2      3    0
10     20     21    0
1       2      3    1
11     22     23    1
2       4      5    2
10     20     21    2
2       4      5    3
11     22     23    3
3       6      7    4
10     20     21    4
3       6      7    5
11     22     23    5
8      16     17    6
10     20     21    6
8      16     17    7
11     22     23    7
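
A more explicit way to build the same kind of result is to melt the pairs into long form and look the records up by label. This is a sketch under assumptions: df and m below are hypothetical reconstructions (a frame with a default integer index and a MultiIndex of matched pairs), and the level names are made up.

```python
import pandas as pd

# Hypothetical data: 12 records and three matched pairs
df = pd.DataFrame({"col_1": range(0, 24, 2), "col_2": range(1, 24, 2)})
m = pd.MultiIndex.from_tuples(
    [(1, 10), (1, 11), (2, 10)], names=["rec_id_1", "rec_id_2"]
)

# One row per pair, numbered by a running key
pairs = m.to_frame(index=False)
pairs["key"] = range(len(pairs))

# Long form: one row per (pair, side), ordered so both members of a pair are adjacent
long = pairs.melt(id_vars="key", var_name="side", value_name="rec_id")
long = long.sort_values(["key", "side"])

# Look up each record by label and tag it with the key of its pair
result = df.loc[long["rec_id"]].assign(key=long["key"].to_numpy())
print(result)
```

Unlike the iloc version, this looks records up by index label with loc, so it should also work when the index of df is not the default 0..n-1 range.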

