Using Recordlinkage to Add a Column with a Number for Each Person

R - simple Record Linkage - the next step ?

Taken from this post, here's an example that should work for you:

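library(RecordLinkage)
library(dplyr)  # provides the %>% pipe

tv3 <- data.frame(name = c("TOURDEFRANCE", "TOURDEFRANCE", "TOURDE FRANCE",
                           "TOURDE FRANZ", "GET FRESH"))

# Compare all record pairs with a string comparator, weight them,
# classify pairs at or above the threshold as links, and keep only the links.
matches <- tv3 %>%
  compare.dedup(strcmp = TRUE) %>%
  epiWeights() %>%
  epiClassify(0.5) %>%
  getPairs(show = "links", single.rows = TRUE)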

As a result, the matches data frame lists the linked pairs with their weights, which should help you determine a suitable threshold (set in epiClassify()).

Generating a unique ID column for large dataset with the RecordLinkage package

First, load the following libraries:

library(RecordLinkage)
library(dplyr)
library(magrittr)

Consider these example datasets from the RecordLinkage package:

data(RLdata500)
data(RLdata10000)

Assume we care about these matching variables and threshold:

matching_variables <- c("fname_c1", "lname_c1", "by", "bm", "bd")
threshold <- 0.5

The record linkage for SMALL datasets is as follows:

RLdata <- RLdata500
df_names <- data.frame(RLdata[, matching_variables])
df_names %>%
  compare.dedup() %>%
  epiWeights() %>%
  epiClassify(threshold) %>%
  getPairs(show = "links", single.rows = TRUE) -> matching_data

For SMALL datasets, the following data manipulation appends the appropriate IDs to the given dataset. Each record receives the smallest id1 it was linked to, or keeps its own row number if it has no link:

RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
                       select(matching_data, id1, id2) %>%
                         arrange(id1) %>% filter(!duplicated(id2)),
                       by = c("ID" = "id2")) %>%
  mutate(ID = ifelse(is.na(id1), ID, id1)) %>%
  select(-id1)
RLdata$ID <- RLdata_ID$ID

The equivalent code for LARGE datasets is as follows:

RLdata <- RLdata10000
df_names <- data.frame(RLdata[, matching_variables])
df_names %>%
  RLBigDataDedup() %>%
  epiWeights() %>%
  epiClassify(threshold) %>%
  getPairs(filter.link = "link", single.rows = TRUE) -> matching_data

For LARGE datasets, the same data manipulation appends the appropriate IDs to the given dataset. The only difference is that the pair columns are named id.1 and id.2 instead of id1 and id2:

RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
                       select(matching_data, id.1, id.2) %>%
                         arrange(id.1) %>% filter(!duplicated(id.2)),
                       by = c("ID" = "id.2")) %>%
  mutate(ID = ifelse(is.na(id.1), ID, id.1)) %>%
  select(-id.1)
RLdata$ID <- RLdata_ID$ID
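
One caveat with the joins above: each record gets the smallest id it was directly linked to, so a chain such as 1-2 and 2-3 (with no direct 1-3 link) leaves record 2 with ID 1 but record 3 with ID 2. A minimal sketch of one way around this, written in plain Python rather than R and not part of the RecordLinkage package: treat the linked pairs as graph edges and give every connected group a single ID. The function name assign_person_ids and its arguments are hypothetical, chosen only for illustration.

```python
def assign_person_ids(n_records, linked_pairs):
    """Return one ID per record (1-based); records joined by any chain
    of links share the smallest record number in their group."""
    parent = list(range(n_records + 1))  # index 0 is unused

    def find(x):
        # Walk up to the group's representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller record number as the group's representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(i) for i in range(1, n_records + 1)]


# Records 1-2 and 2-3 are linked, so 1, 2 and 3 end up with the same ID.
ids = assign_person_ids(5, [(1, 2), (2, 3)])
print(ids)
```

The pairs would come from the id1/id2 (or id.1/id.2) columns of matching_data; whether merging chains like this is appropriate depends on how much you trust the individual links.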

Retrieving matched record ids in the recordlinkage library

Don't forget that a Pandas data frame has an "index" in addition to its data columns. Usually this is a single "extra" column of integers or strings, but more complex indices are possible, e.g. a "multi-index" consisting of more than one column.

You can see this if you print(matches.head()). The first two columns have names that are slightly offset, because they aren't data columns; they are columns in the index itself. This data frame index is in fact a multi-index containing two columns: rec_id_1 and rec_id_2.

The result from load_febrl encodes record ID as the index of dfA. Compare.compute preserves the indices of the input data: you can always expect the indices from the original data to be preserved as a multi-index.

The index of a data frame by itself can be accessed with the DataFrame.index attribute. This returns an Index object (of which MultiIndex is a subclass) that can in turn be converted as follows:

Index.tolist(): convert to a list of its elements; MultiIndex becomes a list of tuples
Index.to_series(): convert to a Series of its elements; MultiIndex becomes a Series of tuples
Index.values: access underlying data as a NumPy ndarray; MultiIndex becomes an ndarray of tuples
Index.to_frame(): convert to a DataFrame, with index columns as data frame columns

So you can quickly access the record ids with matches.index, or export them to a list of (rec_id_1, rec_id_2) tuples with matches.index.tolist().

You can also use matches.reset_index() to turn Index columns back into regular data columns.
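
To make these conversions concrete, here is a small sketch. The matches frame below is a hand-built stand-in for the output of Compare.compute, not real recordlinkage output; the record ids and the comparison columns given_name and surname are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a comparison result: one row per candidate
# pair, indexed by a two-level MultiIndex of record ids.
index = pd.MultiIndex.from_tuples(
    [("rec-1", "rec-7"), ("rec-2", "rec-9")],
    names=["rec_id_1", "rec_id_2"],
)
matches = pd.DataFrame({"given_name": [1, 1], "surname": [1, 0]}, index=index)

# List of (rec_id_1, rec_id_2) tuples.
pair_list = matches.index.tolist()

# Just the ids, as ordinary data frame columns.
pair_frame = matches.index.to_frame(index=False)

# Ids and comparison columns side by side in one flat frame.
flat = matches.reset_index()

print(pair_list)
print(flat.columns.tolist())
```

Use index.to_frame() when only the ids are needed, and reset_index() when the ids should sit next to the comparison results.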

Show all matched pairs in a single dataframe - Python Record Linkage

Using stack with iloc or reindex

df.iloc[m.to_frame().stack()].assign(key=m.to_frame().reset_index(drop=True).stack().index.get_level_values(0))
Out[205]: 
    col_1  col_2  key
1       2      3    0
10     20     21    0
1       2      3    1
11     22     23    1
2       4      5    2
10     20     21    2
2       4      5    3
11     22     23    3
3       6      7    4
10     20     21    4
3       6      7    5
11     22     23    5
8      16     17    6
10     20     21    6
8      16     17    7
11     22     23    7
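
For readers who want to reproduce a result like the one above, here is a small self-contained sketch. The data frame df and the pair index m are reconstructions chosen to match the printed values (the original inputs are not shown), and the level names rec_id_1 and rec_id_2 are assumed. It follows the same stack idea, split into steps.

```python
import pandas as pd

# Assumed inputs: 12 rows with a default 0..11 index, and an index of matched pairs.
df = pd.DataFrame({"col_1": range(0, 24, 2), "col_2": range(1, 25, 2)})
m = pd.MultiIndex.from_tuples(
    [(1, 10), (1, 11), (2, 10)], names=["rec_id_1", "rec_id_2"]
)

# One row per pair, numbered 0..n-1, with the two record ids as columns.
pairs = m.to_frame(index=False)

# Stack into one long Series: pair 0 left, pair 0 right, pair 1 left, ...
long = pairs.stack()

# Pull the referenced rows by position and tag each with its pair number.
# Positions equal labels here only because df has a default RangeIndex.
result = df.iloc[long.to_numpy()].assign(key=long.index.get_level_values(0))
print(result)
```

Rows that share a key value belong to the same matched pair, so result.groupby("key") walks through the pairs one at a time.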
