Select Only the First Row When Merging Data Frames with Multiple Matches


Using data.table along with mult = "first" and nomatch = 0L:

require(data.table)
setDT(scores); setDT(data) # convert to data.tables by reference

scores[data, mult = "first", on = "id", nomatch=0L]
#    id score state
# 1:  1    66    KS
# 2:  2    86    MN
# 3:  3    76    AL

For each row of data, the rows of scores with a matching id are found, and only the first of them is kept (because mult = "first"). Rows of data that have no match in scores are dropped (because nomatch = 0L).
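The lookup semantics described above can be sketched in plain Python. This is not data.table code, only an illustration of the "first match, drop unmatched" logic. The helper name first_match_join and the sample rows are hypothetical: the rows are made up so that id 2 has two candidate matches and id 4 has none.

```python
def first_match_join(x_rows, i_rows, key):
    """For each row of i_rows, keep only the first matching row of x_rows.

    Mirrors the idea of x[i, on = key, mult = "first", nomatch = 0L]:
    rows of i_rows without a match are dropped.
    """
    first_by_key = {}
    for row in x_rows:
        # setdefault stores a row only the first time a key value is seen
        first_by_key.setdefault(row[key], row)

    joined = []
    for row in i_rows:
        match = first_by_key.get(row[key])
        if match is not None:  # unmatched rows are skipped (nomatch = 0L)
            joined.append({**match, **row})
    return joined


# Hypothetical data
scores = [
    {"id": 1, "score": 66},
    {"id": 2, "score": 86},
    {"id": 2, "score": 75},  # second match for id 2, ignored
    {"id": 3, "score": 76},
]
data = [
    {"id": 1, "state": "KS"},
    {"id": 2, "state": "MN"},
    {"id": 3, "state": "AL"},
    {"id": 4, "state": "TX"},  # no match in scores, dropped
]

result = first_match_join(scores, data, "id")
for row in result:
    print(row)
```

Building the dictionary once keeps the lookup to a single pass over each input, which is the same reason a keyed join is cheaper than filtering after a full merge.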

How to join data to only the first matching row with {data.table} in R

One way is to keep every matched row but set d to NA in all but the first row of each (a, b) group after the join.

library(data.table)

d3 <- d2[d1, on = c("a", "b")]
d3[, d := replace(d, seq_len(.N) != 1, NA), by = .(a, b)]
d3

#   a b    d c
#1: 1 1 TRUE 4
#2: 1 1   NA 8
#3: 1 2   NA 2

Join data frames and select random row when there are multiple matches

Use 'd2' to look up rows in 'd1' that match on 'gender', 'year' and 'code' (d1[d2, on = .(gender, year, code), ...]). For each row of 'd2' (by = .EACHI), sample one of its matching rows (sample(.N, 1L)), and use that index to pick 'amount' and 'status'.

d1[d2, on = .(gender, year, code),
  {ri <- sample(.N, 1L)
  .(amount = amount[ri], status = status[ri])}, by = .EACHI]

# sample based on set.seed(1)
#    gender year code amount status
# 1:      M 2011    A     15    EMX
# 2:      M 2011    A     15    EMX
# 3:      F 2018    A     12    NOX
# 4:      F 2015    B     11    NOX

Note that there is an open issue on enhanced functionality of the mult argument, i.e. how to handle cases where multiple rows in x match a row in i. Currently, the valid options are "all" (default), "first" and "last". If the issue is implemented, mult = "random" (equivalent to sample(.N, size = 1L)) could be used to select a random row among the matches.
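For readers working in pandas, the same "one random match per lookup row" idea can be sketched as follows. This is an assumption-laden illustration, not a translation of the answer above: the two frames are hypothetical, and groupby(...).sample requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical lookup table with repeated key combinations
d1 = pd.DataFrame({
    "gender": ["M", "M", "F", "F", "F"],
    "year": [2011, 2011, 2018, 2018, 2015],
    "code": ["A", "A", "A", "A", "B"],
    "amount": [15, 20, 12, 18, 11],
    "status": ["EMX", "NOX", "NOX", "EMX", "NOX"],
})
# Hypothetical rows to look up (the first two are identical on purpose)
d2 = pd.DataFrame({
    "gender": ["M", "M", "F", "F"],
    "year": [2011, 2011, 2018, 2015],
    "code": ["A", "A", "A", "B"],
})
keys = ["gender", "year", "code"]

# reset_index() tags each d2 row with an "index" column, so that rows with
# identical keys are still sampled independently of each other
matches = d2.reset_index().merge(d1, on=keys, how="inner")

# Draw one matching row per original d2 row
picked = matches.groupby("index").sample(n=1, random_state=1)
print(picked)
```

Grouping on the row tag rather than on the keys plays the role of by = .EACHI: each lookup row gets its own draw.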

Concatenating matches in a merge with multiple matches

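We can use pivot_wider with left_join: number the rows within each ID/year group with rowid (from data.table), spread them into one set of columns per match, and then join input_A.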

library(tidyr)
library(dplyr)
library(data.table)
input_B %>% 
  mutate(rn = rowid(ID, year)) %>%
  pivot_wider(names_from = rn, values_from = c(Type, Subtype, Value)) %>%
  left_join(input_A)

Output:

# A tibble: 6 × 12
     ID  year Type_1 Type_2 Type_3 Subtype_1 Subtype_2 Subtype_3 Value_1 Value_2 Value_3 some_var
  <dbl> <dbl> <chr>  <chr>  <chr>      <dbl>     <dbl>     <dbl>   <dbl>   <dbl>   <dbl> <chr>   
1     1  2001 A      B      <NA>           2         1        NA  0.481    0.139 NA      bla     
2     1  2002 A      B      <NA>           2         1        NA  0.910    0.900 NA      bla     
3     1  2003 A      B      <NA>           2         1        NA  0.685    0.536 NA      bla     
4     2  2001 A      B      C              1         1         2  0.0712   0.469  0.194  more bla
5     2  2002 A      B      C              1         1         2  0.656    0.295  0.0715 more bla
6     2  2003 A      B      C              1         1         2  0.695    0.210  0.627  more bla

Merging data frames in R without duplicating rows in x due to repeated values in y

I came up with this, but I had to add a 'stimuli' column to the EMOJ data frame first:

EMOJ$stimuli <- 'A'

df1 <- merge(EMOJ, EYETRACK, by = c('session','stimuli'), all = TRUE)

Merge dataframes based on column, only keeping first match

Use drop_duplicates to keep only the first row per Fruit in df_2 before merging:

df = df_1.merge(df_2.drop_duplicates('Fruit'), how='left', on='Fruit')
print(df)
   Index   Fruit   Taste
0      1   Apple   Tasty
1      2  Banana   Tasty
2      3   Peach  Rotten

If you only want to add one column, map is faster:

s = df_2.drop_duplicates('Fruit').set_index('Fruit')['Taste']
df_1['Taste'] = df_1['Fruit'].map(s)
print(df_1)
   Index   Fruit   Taste
0      1   Apple   Tasty
1      2  Banana   Tasty
2      3   Peach  Rotten
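A self-contained sketch makes the effect of drop_duplicates visible. The inputs below are hypothetical, built to resemble the printed output, with 'Apple' and 'Peach' each appearing twice in df_2.

```python
import pandas as pd

# Hypothetical inputs: 'Apple' and 'Peach' appear more than once in df_2
df_1 = pd.DataFrame({"Index": [1, 2, 3],
                     "Fruit": ["Apple", "Banana", "Peach"]})
df_2 = pd.DataFrame({"Fruit": ["Apple", "Apple", "Banana", "Peach", "Peach"],
                     "Taste": ["Tasty", "Rotten", "Tasty", "Rotten", "Tasty"]})

# A plain left merge repeats each df_1 row once per match
naive = df_1.merge(df_2, how="left", on="Fruit")

# Dropping duplicate keys first leaves at most one match per df_1 row
first_only = df_1.merge(df_2.drop_duplicates("Fruit"), how="left", on="Fruit")

# The map-based variant yields the same Taste values
lookup = df_2.drop_duplicates("Fruit").set_index("Fruit")["Taste"]
mapped = df_1["Fruit"].map(lookup)

print(len(naive), len(first_only))
print(first_only)
```

If the right-hand keys are expected to be unique, passing validate="many_to_one" to merge raises an error when they are not, which can catch a forgotten drop_duplicates.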

Merge two python dataframes and add first match and stop before proceeding

Get the first entry for each distinct value of col_4 in df2 with GroupBy.first() before merging with df1:

pd.merge(df1, df2.groupby('col_4', as_index=False).first(), on='col_4')

Result:

   ID_x  col_1_x  col_2_x  col_3_x  col_4  ID_y  col_1_y  col_2_y  col_3_y
0     1        1        6       11  apple     1        8       12       12
1     2        2        7       12  apple     1        8       12       12
2     3        3        8       13  apple     1        8       12       12
3     5        4        9       14  apple     1        8       12       12
4     9        5       10       15  apple     1        8       12       12
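One caveat worth checking on your own data: GroupBy.first() returns the first non-null value of each column, so with missing values it can combine cells from different rows. The small frame below is hypothetical and only illustrates the difference from drop_duplicates, which keeps the first row as a whole.

```python
import pandas as pd

# Hypothetical df2: the first 'apple' row has a missing col_1
df2 = pd.DataFrame({"col_4": ["apple", "apple", "pear"],
                    "col_1": [float("nan"), 8.0, 3.0],
                    "col_2": [12, 99, 5]})

# first() takes the first non-null value per column, so the 'apple' result
# pairs col_1 from the second row with col_2 from the first row
by_first = df2.groupby("col_4", as_index=False).first()

# drop_duplicates keeps the first row intact, NaN included
by_row = df2.drop_duplicates("col_4")

print(by_first)
print(by_row)
```

When the data has no missing values the two approaches select the same rows, so the choice only matters if NaN can occur in the columns being merged in.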

Select only participants with multiple rows

We can use a frequency-based approach: group by 'ID' and keep the groups that have more than one observation.

library(dplyr)
df1 %>%
   group_by(ID) %>%
   filter(n() > 1) %>%
   ungroup()

Or, in base R, use subset to keep the rows whose 'ID' has at least one Time value greater than 1:

subset(df1, ID %in% ID[Time > 1])

data

df1 <- structure(list(ID = c(1L, 2L, 2L, 3L, 3L, 4L, 5L, 5L), score = c(1000000L, 
1000000L, 1000000L, 1000000L, 1000000L, 1000000L, 1000000L, 1000000L
), Time = c(1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L)), 
class = "data.frame", row.names = c(NA, 
-8L))
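For comparison, both filters can be sketched in pandas using the same values as the df1 structure above. This is an illustrative equivalent, not part of the original answer.

```python
import pandas as pd

# Same values as the df1 structure above
df1 = pd.DataFrame({
    "ID": [1, 2, 2, 3, 3, 4, 5, 5],
    "score": [1000000] * 8,
    "Time": [1, 1, 2, 1, 2, 1, 1, 2],
})

# Frequency-based: keep IDs that occur more than once
by_count = df1[df1.groupby("ID")["ID"].transform("size") > 1]

# Time-based: keep IDs that have any Time value above 1
by_time = df1[df1["ID"].isin(df1.loc[df1["Time"] > 1, "ID"])]

print(by_count)
```

The two filters agree here because Time numbers the rows within each ID. They would differ if, for example, an ID had a single row with Time equal to 2, so the frequency-based version is the safer choice when that cannot be ruled out.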
