Select Only the First Row When Merging Data Frames with Multiple Matches

Using data.table along with mult = "first" and nomatch = 0L:

library(data.table)
setDT(scores); setDT(data) # convert to data.tables by reference

scores[data, mult = "first", on = "id", nomatch=0L]
#    id score state
# 1:  1    66    KS
# 2:  2    86    MN
# 3:  3    76    AL

For each row of data, the rows of scores with a matching id are found, and only the first of them is kept (because mult = "first"). Rows of data that have no match in scores are dropped (because nomatch = 0L).

How to join data to only the first matching row with {data.table} in R

One way is to do a regular join and then set d to NA for every row except the first within each (a, b) group.

library(data.table)

d3 <- d2[d1, on = c("a", "b")]
# keep d only on the first row of each (a, b) group
d3[, d := replace(d, seq_len(.N) != 1, NA), by = .(a, b)]
d3

#    a b    d c
# 1: 1 1 TRUE 4
# 2: 1 1   NA 8
# 3: 1 2   NA 2

Join data frames and select random row when there are multiple matches

Use 'd2' to look up rows in 'd1' that match on 'gender', 'year' and 'code' (d1[d2, on = .(gender, year, code), ...]). For each row of 'd2' (by = .EACHI), sample one of the matching rows (sample(.N, 1L)) and use that index to pick 'amount' and 'status'.

d1[d2, on = .(gender, year, code), {
  ri <- sample(.N, 1L)  # index of one randomly chosen match
  .(amount = amount[ri], status = status[ri])
}, by = .EACHI]

# sample based on set.seed(1)
#    gender year code amount status
# 1:      M 2011    A     15    EMX
# 2:      M 2011    A     15    EMX
# 3:      F 2018    A     12    NOX
# 4:      F 2015    B     11    NOX

Note that there is an open issue on enhanced functionality of the mult argument, i.e. how to handle cases where multiple rows in x match a row in i. Currently, the valid options are "all" (default), "first" and "last". If and when the issue is implemented, mult = "random" (sample(.N, size = 1L)) may become available to select a random row among the matches.

Concatenating matches in a merge with multiple matches

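We can number the matches within each ID/year, spread them into columns with pivot_wider, and then left_join the result to input_A:

library(tidyr)
library(dplyr)
library(data.table)  # for rowid()

input_B %>%
  mutate(rn = rowid(ID, year)) %>%  # running number of each match per ID/year
  pivot_wider(names_from = rn, values_from = c(Type, Subtype, Value)) %>%
  left_join(input_A)  # joins on the columns the two tables share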

Output:

# A tibble: 6 × 12
     ID  year Type_1 Type_2 Type_3 Subtype_1 Subtype_2 Subtype_3 Value_1 Value_2 Value_3 some_var
  <dbl> <dbl> <chr>  <chr>  <chr>      <dbl>     <dbl>     <dbl>   <dbl>   <dbl>   <dbl> <chr>
1     1  2001 A      B      <NA>           2         1        NA   0.481   0.139      NA bla
2     1  2002 A      B      <NA>           2         1        NA   0.910   0.900      NA bla
3     1  2003 A      B      <NA>           2         1        NA   0.685   0.536      NA bla
4     2  2001 A      B      C              1         1         2  0.0712   0.469   0.194 more bla
5     2  2002 A      B      C              1         1         2   0.656   0.295  0.0715 more bla
6     2  2003 A      B      C              1         1         2   0.695   0.210   0.627 more bla

Merging data frames in R without duplicating rows in x due to repeated values in y

I came up with the following, but it requires adding a 'stimuli' column to the EMOJ data frame first:

EMOJ$stimuli <- 'A'

df1 <- merge(EMOJ, EYETRACK, by = c('session','stimuli'), all = TRUE)

Merge dataframes based on column, only keeping first match

Use drop_duplicates to keep only the first row per key in df_2 before merging:

df = df_1.merge(df_2.drop_duplicates('Fruit'), how='left', on='Fruit')
print(df)
   Index   Fruit   Taste
0      1   Apple   Tasty
1      2  Banana   Tasty
2      3   Peach  Rotten

If you only need to add a single column, using map is faster:

s = df_2.drop_duplicates('Fruit').set_index('Fruit')['Taste']
df_1['Taste'] = df_1['Fruit'].map(s)
print(df_1)
   Index   Fruit   Taste
0      1   Apple   Tasty
1      2  Banana   Tasty
2      3   Peach  Rotten
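To see why the de-duplication step matters, here is a minimal, self-contained sketch. The contents of df_1 and df_2 are hypothetical (the original inputs are not shown above); they are chosen only so that some fruits appear more than once in df_2. The validate argument of merge is an optional extra check, not part of the original answer.

```python
import pandas as pd

# Hypothetical inputs: df_2 lists some fruits more than once.
df_1 = pd.DataFrame({"Index": [1, 2, 3], "Fruit": ["Apple", "Banana", "Peach"]})
df_2 = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Banana", "Peach", "Peach"],
    "Taste": ["Tasty", "Sour", "Tasty", "Rotten", "Tasty"],
})

# A plain left merge repeats each df_1 row once per matching row in df_2.
naive = df_1.merge(df_2, how="left", on="Fruit")

# Dropping duplicate keys first leaves exactly one match per Fruit.
first_only = df_1.merge(df_2.drop_duplicates("Fruit"), how="left", on="Fruit")

# Optional safety net: validate="many_to_one" makes pandas raise a MergeError
# if the right-hand keys are not unique.
checked = df_1.merge(
    df_2.drop_duplicates("Fruit"), how="left", on="Fruit", validate="many_to_one"
)

print(len(naive), len(first_only))
```

With this data, naive has one row per match while first_only keeps the row count of df_1.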

Merge two python dataframes and add first match and stop before proceeding

Reduce df2 to one row per value of col_4 with GroupBy.first() before merging it with df1:

pd.merge(df1, df2.groupby('col_4', as_index=False).first(), on='col_4')

Result:

   ID_x  col_1_x  col_2_x  col_3_x  col_4  ID_y  col_1_y  col_2_y  col_3_y
0     1        1        6       11  apple     1        8       12       12
1     2        2        7       12  apple     1        8       12       12
2     3        3        8       13  apple     1        8       12       12
3     5        4        9       14  apple     1        8       12       12
4     9        5       10       15  apple     1        8       12       12
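One caveat worth checking on your own data: GroupBy.first() returns the first non-null value of each column within a group, so when the data contains missing values the result can combine values from different rows. The sketch below uses a small hypothetical df2 (not the one from the question) to compare it with drop_duplicates and groupby(...).head(1), which both keep the first row as it is.

```python
import numpy as np
import pandas as pd

# Hypothetical right-hand table: the first 'apple' row has a missing col_1.
df2 = pd.DataFrame({
    "col_4": ["apple", "apple", "pear"],
    "col_1": [np.nan, 8.0, 3.0],
    "col_2": [12, 99, 7],
})

# First non-null value per column: col_1 comes from the second 'apple' row,
# col_2 from the first one.
by_first = df2.groupby("col_4", as_index=False).first()

# First row per key, kept intact (the missing value stays missing).
by_dedup = df2.drop_duplicates("col_4")
by_head = df2.groupby("col_4").head(1)

print(by_first)
print(by_dedup)
```

If the columns have no missing values, all three approaches select the same rows, so the choice only matters when NaN can occur.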

Select only participants with multiple rows

We can use a frequency-based approach: group by 'ID' and keep only the 'ID's that have more than one observation

library(dplyr)
df1 %>%
  group_by(ID) %>%
  filter(n() > 1) %>%
  ungroup()

Or, in base R, use subset to keep the rows whose 'ID' has at least one Time value greater than 1 (this relies on Time numbering the observations within each 'ID', as in the data below)

subset(df1, ID %in% ID[Time > 1])

data

df1 <- structure(list(
  ID = c(1L, 2L, 2L, 3L, 3L, 4L, 5L, 5L),
  score = c(1000000L, 1000000L, 1000000L, 1000000L,
            1000000L, 1000000L, 1000000L, 1000000L),
  Time = c(1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L)),
  class = "data.frame", row.names = c(NA, -8L))

