Match Dataframes Excluding Last Non-NA Value and Disregarding Order

This is quite tricky because the purchases of n customers have to be compared to a set of m rules. Besides this, there are two points which add to the complexity:

  1. The last non-NA RULE column in df2 is semantically different from the others: it holds the recommended item. Unfortunately, the given data structure doesn't reflect this, so df2 is missing an explicit recommended column.

  2. Finally, it has to be determined whether a partner has already purchased the recommended item.
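
To make the first point concrete, here is a minimal base R sketch of splitting a single rule into its conditions and its recommendation. The helper name split_rule is hypothetical and not part of the answer; the sample values are those of the first rule in the result table further down.

```r
# Hypothetical helper: split one rule into conditions and recommendation.
split_rule <- function(rule) {
  vals <- rule[!is.na(rule)]        # drop the NA padding
  list(
    conditions  = head(vals, -1),   # everything except the last non-NA value
    recommended = tail(vals, 1)     # the last non-NA value
  )
}

# One rule laid out like a row of df2 (RULE1..RULE4)
rule <- c(RULE1 = "B", RULE2 = "A", RULE3 = "G", RULE4 = NA)
parts <- split_rule(rule)
parts$conditions    # the items B and A
parts$recommended   # the item G
```

The data.table code below does the same thing for all rules at once by working on the long format instead of row by row.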

The approach below relies on the melt(), dcast(), and join operations of the data.table package for performance reasons. However, to avoid creating a Cartesian cross product of n * m rows, a loop over the rules is used.

EDIT: The dcast() call has been moved out of the lapply() function.

Prepare data for n:m join

library(data.table)
# convert to data.table and add row numbers
# here, a copy is used instead of setDT() in order to rename the data.tables
purchases <- as.data.table(df1)[, rnp := seq_len(.N)]
rules <- as.data.table(df2)[, rnr := seq_len(.N)]

# prepare purchases for joins
lp <- melt(purchases, id.vars = c("rnp", "Partner"), na.rm = TRUE)
wp <- dcast(lp, rnp ~ value, drop = FALSE)
wp
#   rnp  A  B  C  D  F  K  M
#1:   1  A  B  C  D NA NA NA
#2:   2 NA NA  C  D  F NA NA
#3:   3 NA NA NA NA NA  K  M

# prepare rules
lr <- melt(rules, id.vars = c("rnr", "lift"), na.rm = TRUE)
# identify last column of each rule which becomes the recommendation
rn_of_last_col <- lr[, last(.I), by = rnr][, V1]
# reshape from long to wide without recommendation
wr <- dcast(lr[-rn_of_last_col], rnr ~ value)
# add column with recommendations (kind of cbind, no join)
wr[, recommended := lr[rn_of_last_col, value]]
wr
#   rnr  A  B  C  D  K  M recommended
#1:   1  A  B NA NA NA NA           G
#2:   2  A  B NA NA NA NA           D
#3:   3 NA NA  C  D NA  M           K
#4:   4  A  B  C NA NA NA           D
#5:   5  A NA  C NA NA NA           M
#6:   6 NA NA NA NA  K  M           E
#7:   7 NA NA NA NA NA  M           T
#8:   8 NA NA NA NA  K NA           M

Combine rules and purchases

combi <- rbindlist(
# implied loop over rules to find matching purchases for each rule
lapply(seq_len(nrow(rules)), function(i) {
# get col names except last col which is the recommendation
cols <- lr[rnr == i, value[-.N]]
# join single rule with all partners on relevant cols for this rule
wp[wr[i, .SD, .SDcols = c(cols, "rnr", "recommended")], on = cols, nomatch = 0]
})
)
# check if recommendation was purchased already
combi[, already_purchased := Reduce(`|`, lapply(.SD, function(x) x == recommended)),
.SDcols = -c("rnp", "rnr", "recommended")]
# clean up already purchased
combi[is.na(already_purchased), already_purchased := FALSE
][, already_purchased := ifelse(already_purchased, "Yes", "No")]
combi
#   rnp  A  B  C  D  F  K  M rnr recommended already_purchased
#1:   1  A  B  C  D NA NA NA   1           G                No
#2:   1  A  B  C  D NA NA NA   2           D               Yes
#3:   1  A  B  C  D NA NA NA   4           D               Yes
#4:   1  A  B  C  D NA NA NA   5           M                No
#5:   3 NA NA NA NA NA  K  M   6           E                No
#6:   3 NA NA NA NA NA  K  M   7           T                No
#7:   3 NA NA NA NA NA  K  M   8           M               Yes

In creating combi, the trick is to join only on those columns which are included in each rule. This is why the join needs to be done for each rule separately.
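
As a minimal sketch of that per-rule join, the snippet below matches one rule against a small wide purchases table. The tables are simplified stand-ins built for illustration (assumed data, not the df1 and df2 of the question), and the data.table package is assumed to be installed.

```r
library(data.table)

# Wide purchases: one column per item, holding the item name or NA
wp_demo <- data.table(rnp = 1:3,
                      A = c("A", NA, NA),
                      B = c("B", NA, NA),
                      C = c("C", "C", NA),
                      D = c("D", "D", NA))

# One rule in the same wide layout: items A and B lead to recommendation G
rule_demo <- data.table(A = "A", B = "B", rnr = 1L, recommended = "G")

# Join only on the columns this particular rule uses
cols <- c("A", "B")
matches <- wp_demo[rule_demo, on = cols, nomatch = 0]
matches[, .(rnp, rnr, recommended)]
```

Only the first purchase row contains both A and B, so it is the only one returned. Columns C and D take no part in the comparison because they are not listed in cols.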

Essentially, we are done now. However, it doesn't look like the desired output.

Final joins

tmp_rules <- rules[combi[, .(rnp, rnr, recommended, already_purchased)], on = "rnr"]
tmp_purch <- purchases[combi[, .(rnp, rnr)], on = "rnp"]
result <- tmp_purch[tmp_rules, on = c("rnp", "rnr")]
result[, (c("rnp", "rnr")) := NULL]
result
#   Partner COL1 COL2 COL3 COL4 lift RULE1 RULE2 RULE3 RULE4 recommended already_purchased
#1:   Alpha    A    B    C    D    9     B     A     G    NA           G                No
#2:   Alpha    A    B    C    D   10     B     A     D    NA           D               Yes
#3:   Alpha    A    B    C    D   12     A     B     C     D           D               Yes
#4:   Alpha    A    B    C    D   12     C     A     M    NA           M                No
#5:    Zeta    M    K   NA   NA   23     K     M     E    NA           E                No
#6:    Zeta    M    K   NA   NA   12     M     T    NA    NA           T                No
#7:    Zeta    M    K   NA   NA   24     K     M    NA    NA           M               Yes

Remove Last Non-NA Value from Dataframe

We can use max.col to find the position of the last non-NA element in each row and then, with row/column (matrix) indexing, set those elements to NA in the original dataset:

df1[cbind(1:nrow(df1), max.col(!is.na(df1), 'last'))] <- NA
df1
#  Col1 Col2 Col3 Col4 Col5 Col6 Col7
#1   10    A    B <NA> <NA> <NA> <NA>
#2   12    B    P    V <NA> <NA> <NA>
#3   14    C    I    K    H    M <NA>
#4   55    N <NA> <NA> <NA> <NA> <NA>
#5   34    M    N    O    P <NA> <NA>
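
To see the intermediate step, the sketch below runs the same idea on a small made-up data frame (demo and its values are assumptions, not the data from the question) and inspects the column positions that max.col returns before overwriting anything.

```r
# Hypothetical example data
demo <- data.frame(id = c(10, 12, 14),
                   c1 = c("A", "B", "C"),
                   c2 = c("B", NA, "I"),
                   c3 = c(NA, NA, "K"),
                   stringsAsFactors = FALSE)

# !is.na(demo) is a logical matrix; ties.method = "last" picks the
# right-most TRUE in each row, i.e. the last non-NA column
last_col <- max.col(!is.na(demo), ties.method = "last")
last_col   # 3 2 4

# A two-column (row, column) matrix addresses exactly one cell per row
demo[cbind(seq_len(nrow(demo)), last_col)] <- NA
demo
```

Note that a row with a single non-NA value loses that value, whichever column it sits in, so it may be worth excluding key columns before applying this.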

Match in R while disregarding order

I get one more B match than you, but this solution is very close to what you want. First add an id column to each data frame, because the ids are used to reconstruct the data afterwards. To perform the match, reshape both data frames to long format with gather from tidyr, drop the last non-NA value of each df1 row, and combine the two with inner_join from dplyr. Finally, cbind the matching rows of the original data.frames, selected by their ids.

library(tidyr);library(dplyr)

df1 <- read.table(text="Partner Col1 Col2 Col3 Col4
A A1 A2 NA NA
B A2 B9 NA NA
C B7 V9 C1 N9
D Q1 Q3 Q4 NA",header=TRUE, stringsAsFactors=FALSE)

df2 <- read.table(text="lift rule1 rule2 rule3
11 A2 A1 A9
10 A1 A3 NA
11 B9 A2 D7
10 Q4 Q1 NA
11 A2 B9 B1",header=TRUE, stringsAsFactors=FALSE)

df1 <- cbind(df1_id=1:nrow(df1),df1)
df2 <- cbind(df2_id=1:nrow(df2),df2)

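# melt with gather
d11 <- df1 %>% gather(Col, Value, starts_with("C")) # long format
d11 <- d11 %>% na.omit() %>% group_by(df1_id) %>% slice(-n()) # remove last non-NA value per row

d22 <- df2 %>% gather(Rule, Value, starts_with("r")) # long format

# Value is the only column the two long tables share; naming it makes the join key explicit
res <- inner_join(d11, d22, by = "Value")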

cbind(df1[res$df1_id,],df2[res$df2_id,])

    df1_id Partner Col1 Col2 Col3 Col4 df2_id lift rule1 rule2 rule3
1        1       A   A1   A2 <NA> <NA>      2   10    A1    A3  <NA>
1.1      1       A   A1   A2 <NA> <NA>      1   11    A2    A1    A9
2        2       B   A2   B9 <NA> <NA>      1   11    A2    A1    A9
2.1      2       B   A2   B9 <NA> <NA>      5   11    A2    B9    B1
2.2      2       B   A2   B9 <NA> <NA>      3   11    B9    A2    D7
4        4       D   Q1   Q3   Q4 <NA>      4   10    Q4    Q1  <NA>

Seeing if all values in one dataframe row exist in another dataframe

Here is a possible solution using base R.

Make sure everything is a character before continuing, i.e.

df[-1] <- lapply(df[-1], as.character)
df1[-c(1:2)] <- lapply(df1[-c(1:2)], as.character)

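First we create two lists which contain vectors of the rowwise non-NA elements of each data frame. We then build a matrix that holds, for every pair of rows, the number of elements of the l2 vector that are not found in the l1 vector, i.e. the length of their setdiff. If that length is 0, every element of the df1 row occurs in the df row, so they match, i.e.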

l1 <- lapply(split(df[-1], seq(nrow(df))), function(i) i[!is.na(i)])
l2 <- lapply(split(df1[-c(1:2)], seq(nrow(df1))), function(i) i[!is.na(i)])

m1 <- sapply(l1, function(i) sapply(l2, function(j) length(setdiff(j, i))))
m1
#  1 2 3 4 5
#1 0 2 2 2 2
#2 2 2 2 2 2
#3 3 3 2 2 0
#4 0 1 0 1 1
#5 2 3 0 3 2
#6 1 0 1 1 1
#7 1 1 1 2 2
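
The reason setdiff works for an order-independent comparison can be seen on a single pair of vectors. The vectors below are made up for illustration and merely stand in for one element of l1 and two elements of l2.

```r
# Hypothetical vectors standing in for one row of each data frame
purchased <- c("A", "B", "L")   # non-NA values of a df row
rule_ab   <- c("B", "A")        # same items, different order
rule_pl   <- c("P", "L")        # "P" is not among the purchased items

length(setdiff(rule_ab, purchased))   # 0: every rule item was found
length(setdiff(rule_pl, purchased))   # 1: one rule item is missing

# An equivalent test that does not count the missing items
all(rule_ab %in% purchased)           # TRUE
```

setdiff treats its arguments as sets, so neither the order nor the position of the items matters.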

We then use that matrix to create a couple of columns in our original df. The first column, rpt, counts the zeros in each column of m1, i.e. how many rows of df1 match that row of df, and serves as the number of repeats for that row. We also use it to filter out the rows where rpt is 0 (i.e. the rows that do not have a match in df1). After expanding the data frame we create another variable, ATTR (same name as ATTR in df1), in order to use it for a merge, i.e.

df$rpt <- colSums(m1 == 0)
df <- df[df$rpt != 0,]
df <- df[rep(row.names(df), df$rpt),]
df$ATTR <- which(m1 == 0, arr.ind = TRUE)[,1]
df
#    ColA ColB ColC ColD rpt ATTR
#1     10    A    B    L   2    1
#1.1   10    A    B    L   2    4
#2     11    N    Q <NA>   1    6
#3     12    P    J    L   2    4
#3.1   12    P    J    L   2    5
#5     89    O    J    T   1    3

We then merge and order the two data frames,

final_df <- merge(df, df1, by = 'ATTR')

final_df[order(final_df$ColA),]
#  ATTR ColA ColB ColC ColD rpt Att R1   R2   R3   R4
#1    1   10    A    B    L   2  45  A    B <NA> <NA>
#3    4   10    A    B    L   2  65  L <NA> <NA> <NA>
#6    6   11    N    Q <NA>   1  23  Q <NA> <NA> <NA>
#4    4   12    P    J    L   2  65  L <NA> <NA> <NA>
#5    5   12    P    J    L   2  20  P    L    J <NA>
#2    3   89    O    J    T   1  33  T    J    O <NA>

DATA

dput(df)
structure(list(ColA = c(10L, 11L, 12L, 43L, 89L), ColB = c("A",
"N", "P", "M", "O"), ColC = c("B", "Q", "J", "T", "J"), ColD = c("L",
NA, "L", NA, "T")), .Names = c("ColA", "ColB", "ColC", "ColD"
), row.names = c(NA, -5L), class = "data.frame")

dput(df1)
structure(list(ATTR = 1:7, Att = c(45L, 40L, 33L, 65L, 20L, 23L,
38L), R1 = c("A", "C", "T", "L", "P", "Q", "Q"), R2 = c("B",
"D", "J", NA, "L", NA, "L"), R3 = c(NA, NA, "O", NA, "J", NA,
NA), R4 = c(NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_)), .Names = c("ATTR",
"Att", "R1", "R2", "R3", "R4"), row.names = c(NA, -7L), class = "data.frame")

Find the index position of the first non-NA value in an R vector?

Use a combination of is.na and which to find the non-NA index locations.

NonNAindex <- which(!is.na(z))
firstNonNA <- min(NonNAindex)

# set the next 3 observations to NA
is.na(z) <- seq(firstNonNA, length.out=3)
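
One edge case is worth guarding against: if z contains no non-NA value, which() returns an empty vector and min() of it gives Inf with a warning. The sketch below is one possible variant on a made-up vector; taking the first element of which() instead yields NA in that case, which is easy to test for.

```r
z <- c(NA, NA, 5, 7, NA, 9, 11)       # hypothetical input

first_non_na <- which(!is.na(z))[1]   # NA if z has no non-NA value
first_non_na                          # 3

if (!is.na(first_non_na)) {
  # the first non-NA position and the two after it, capped at length(z)
  idx <- seq(first_non_na, min(first_non_na + 2, length(z)))
  z[idx] <- NA
}
z   # positions 3 to 5 are now NA; 9 and 11 are untouched
```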

How to subset data in R without losing NA rows?

If we use the subset function, we need to watch out for how it handles NA. Its documentation states:

For ordinary vectors, the result is simply ‘x[subset & !is.na(subset)]’.

So elements for which the condition evaluates to NA are dropped.

To keep the NA cases, add an explicit is.na() test with a logical or, which tells R not to drop them:

subset(df1, Height < 40 | is.na(Height))
# or `df1[df1$Height < 40 | is.na(df1$Height), ]`

Don't index directly with the condition (explained below):

df2 <- df1[df1$Height < 40, ]

Example

df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)

subset(df1, Height < 40 | is.na(Height))

#  Height y
#1     NA 1
#2      2 2
#3      4 3
#4     NA 4

df1[df1$Height < 40, ]

#     Height  y
#NA       NA NA
#2         2  2
#3         4  3
#NA.1     NA NA

The latter fails because indexing by NA gives NA. Consider this simple example with a vector:

x <- 1:4
ind <- c(NA, TRUE, NA, FALSE)
x[ind]
# [1] NA 2 NA

We need to somehow replace those NA with TRUE. The most straightforward way is to add another "or" condition is.na(ind):

x[ind | is.na(ind)]
# [1] 1 2 3

This is exactly what happens in your situation. If Height contains NA, the logical operation Height < 40 yields a mix of TRUE / FALSE / NA, so we need to replace NA by TRUE as above.
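
The replacement works because of how R combines NA with the logical or operator: NA | TRUE is TRUE, while NA | FALSE stays NA. A short sketch on a made-up vector (height is an assumed example, not the asker's data) shows each step.

```r
height <- c(NA, 2, 50)   # hypothetical values

height < 40              # NA TRUE FALSE
is.na(height)            # TRUE FALSE FALSE

# NA | TRUE is TRUE, so the NA element is kept; FALSE | FALSE stays FALSE
keep <- height < 40 | is.na(height)
keep                     # TRUE TRUE FALSE
height[keep]             # NA 2
```

Since is.na() never returns NA itself, the combined condition contains only TRUE and FALSE and is safe to use as an index.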

Getting last non na value across rows in a pandas dataframe

You need last_valid_index with a custom function, because for a row in which all values are NaN it returns None, and using that as a key raises a KeyError:

import numpy as np

def f(x):
    # last_valid_index() is None when the whole row is NaN
    if x.last_valid_index() is None:
        return np.nan
    else:
        return x[x.last_valid_index()]

df['status'] = df.apply(f, axis=1)
print (df)
                1      2      3      4      5      6      7      8      9  \
0
2016-06-02  7.080  7.079  7.079  7.079  7.079  7.079    NaN    NaN    NaN
2016-06-08  7.053  7.053  7.053  7.053  7.053  7.054    NaN    NaN    NaN
2016-06-09  7.061  7.061  7.060  7.060  7.060  7.060    NaN    NaN    NaN
2016-06-14    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2016-06-15  7.066  7.066  7.066  7.066    NaN    NaN    NaN    NaN    NaN
2016-06-16  7.067  7.067  7.067  7.067  7.067  7.067  7.068  7.068    NaN
2016-06-21  7.053  7.053  7.052    NaN    NaN    NaN    NaN    NaN    NaN
2016-06-22  7.049  7.049    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2016-06-28  7.058  7.058  7.059  7.059  7.059  7.059  7.059  7.059  7.059

            status
0
2016-06-02   7.079
2016-06-08   7.054
2016-06-09   7.060
2016-06-14     NaN
2016-06-15   7.066
2016-06-16   7.068
2016-06-21   7.052
2016-06-22   7.049
2016-06-28   7.059

Alternative solution: forward-fill along each row with ffill and select the last column with iloc:

df['status'] = df.ffill(axis=1).iloc[:, -1]
print (df)
            status
0
2016-06-02   7.079
2016-06-08   7.054
2016-06-09   7.060
2016-06-14     NaN
2016-06-15   7.066
2016-06-16   7.068
2016-06-21   7.052
2016-06-22   7.049
2016-06-28   7.059

Getting means of matched columns in different dataframes

# element-wise mean of three data frames with identical dimensions
setNames(round((df1 + df2 + df3) / 3, digits = 2), paste0('c', 1:3, '.mean'))
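
The one-liner relies on arithmetic between data frames being element-wise, which requires all three to have the same dimensions. A small sketch with made-up numbers (d1, d2 and d3 are assumptions, not the asker's data) shows the result.

```r
# Hypothetical data frames with matching dimensions
d1 <- data.frame(a = c(1, 2), b = c(3, 4), c = c(5, 6))
d2 <- data.frame(a = c(2, 3), b = c(4, 5), c = c(6, 7))
d3 <- data.frame(a = c(3, 5), b = c(5, 6), c = c(7, 9))

# cell-by-cell sum, divided by the number of data frames, then renamed
means <- setNames(round((d1 + d2 + d3) / 3, digits = 2),
                  paste0("c", 1:3, ".mean"))
means
```

The first column of the result holds 2 and 3.33, the rounded means of (1, 2, 3) and (2, 3, 5).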