Select only the first row when merging data frames with multiple matches
Using data.table along with mult = "first" and nomatch = 0L:
require(data.table)
setDT(scores); setDT(data) # convert to data.tables by reference
scores[data, mult = "first", on = "id", nomatch=0L]
# id score state
# 1: 1 66 KS
# 2: 2 86 MN
# 3: 3 76 AL
For each row of data's id column, the matching rows in scores' id column are found, and only the first one is retained (because mult = "first"). Rows of data with no match are dropped (because of nomatch = 0L).
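The join semantics described above can be sketched in plain Python, chosen here only so the logic runs without any package. The tables `scores` and `data` below are hypothetical stand-ins (not the original data), and `first_match_join` is an illustrative helper, not part of data.table.

```python
# Hypothetical stand-ins for the two tables (not the original data).
scores = [
    {"id": 1, "score": 66},
    {"id": 1, "score": 75},   # second match for id 1, ignored
    {"id": 2, "score": 86},
    {"id": 3, "score": 76},
    {"id": 3, "score": 81},   # second match for id 3, ignored
]
data = [
    {"id": 1, "state": "KS"},
    {"id": 2, "state": "MN"},
    {"id": 3, "state": "AL"},
    {"id": 4, "state": "TX"},  # no match in scores, so it is dropped
]


def first_match_join(x, i, key):
    """Illustrative helper: for each row of i, keep only the first matching row of x.

    Mirrors the idea of mult = "first" (first match only) combined with
    nomatch = 0L (rows of i without a match are dropped).
    """
    first_seen = {}
    for row in x:
        # setdefault stores a row only the first time its key appears
        first_seen.setdefault(row[key], row)

    joined = []
    for row in i:
        match = first_seen.get(row[key])
        if match is not None:  # drop unmatched rows
            joined.append({**match, **row})
    return joined


result = first_match_join(scores, data, "id")
for row in result:
    print(row)
```

The lookup dictionary is built once, so each row of `data` costs a single dictionary access instead of a scan over `scores`.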
How to join data to only the first matching row with {data.table} in R
One way is to join first and then set d to NA in every row except the first one of each (a, b) group.
library(data.table)
d3 <- d2[d1, on = c("a", "b")]
d3[, d:= replace(d, seq_len(.N) != 1, NA), .(a, b)]
d3
# a b d c
#1: 1 1 TRUE 4
#2: 1 1 NA 8
#3: 1 2 NA 2
Join data frames and select random row when there are multiple matches
Use 'd2' to look up rows in 'd1' based on matches in 'gender', 'year' and 'code' (d1[d2, on = .(gender, year, code), ...]). For each row of 'd2' (by = .EACHI), sample one of the matching rows (sample(.N, 1L)) and use that index to pick 'amount' and 'status'.
d1[d2, on = .(gender, year, code),
   {ri <- sample(.N, 1L)
    .(amount = amount[ri], status = status[ri])}, by = .EACHI]
# sample based on set.seed(1)
# gender year code amount status
# 1: M 2011 A 15 EMX
# 2: M 2011 A 15 EMX
# 3: F 2018 A 12 NOX
# 4: F 2015 B 11 NOX
Note that there is an open issue, "Enhanced functionality of mult argument", about how to handle cases where multiple rows in x match a row in i. Currently, the valid options are "all" (default), "first" or "last". If/when the issue is implemented, mult = "random" (sample(.N, size = 1L)) may be usable to select a random row (or rows) among the matches.
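The "pick one match at random" idea can be sketched in plain Python as follows. The rows in `d1` and `d2` are hypothetical, and `random_match_join` is an illustrative helper rather than anything data.table provides. Lookup rows without a match are simply skipped here, which is a simplification of this sketch.

```python
import random

# Hypothetical stand-ins for 'd1' (rows to sample from) and 'd2' (keys to look up).
d1 = [
    {"gender": "M", "year": 2011, "code": "A", "amount": 15, "status": "EMX"},
    {"gender": "M", "year": 2011, "code": "A", "amount": 10, "status": "NOX"},
    {"gender": "F", "year": 2018, "code": "A", "amount": 12, "status": "NOX"},
    {"gender": "F", "year": 2018, "code": "A", "amount": 13, "status": "EMX"},
]
d2 = [
    {"gender": "M", "year": 2011, "code": "A"},
    {"gender": "F", "year": 2018, "code": "A"},
]
KEYS = ("gender", "year", "code")


def random_match_join(x, i, keys, rng):
    """Illustrative helper: for each row of i, pick one matching row of x at random."""
    groups = {}
    for row in x:
        # collect all rows of x that share the same key tuple
        groups.setdefault(tuple(row[k] for k in keys), []).append(row)

    out = []
    for row in i:
        matches = groups.get(tuple(row[k] for k in keys), [])
        if matches:
            out.append(dict(rng.choice(matches)))  # one sampled row per lookup row
    return out


rng = random.Random(1)  # seeded only to make the sketch repeatable
sampled = random_match_join(d1, d2, KEYS, rng)
for row in sampled:
    print(row)
```

Passing the random generator in as an argument keeps the sampling reproducible without touching global state.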
Concatenating matches in a merge with multiple matches
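We can use pivot_wider with left_join. The rowid() helper used to number the matches within each (ID, year) group comes from data.table, which is why that package is loaded as well.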
library(tidyr)
library(dplyr)
library(data.table)
input_B %>%
mutate(rn = rowid(ID, year)) %>%
pivot_wider(names_from = rn, values_from = c(Type, Subtype, Value)) %>%
left_join(input_A)
Output:
# A tibble: 6 × 12
ID year Type_1 Type_2 Type_3 Subtype_1 Subtype_2 Subtype_3 Value_1 Value_2 Value_3 some_var
<dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 2001 A B <NA> 2 1 NA 0.481 0.139 NA bla
2 1 2002 A B <NA> 2 1 NA 0.910 0.900 NA bla
3 1 2003 A B <NA> 2 1 NA 0.685 0.536 NA bla
4 2 2001 A B C 1 1 2 0.0712 0.469 0.194 more bla
5 2 2002 A B C 1 1 2 0.656 0.295 0.0715 more bla
6 2 2003 A B C 1 1 2 0.695 0.210 0.627 more bla
Merging data frames in R without duplicating rows in x due to repeated values in y
I came up with this, but I had to add a 'stimuli' column to the EMOJ data frame first:
EMOJ$stimuli <- 'A'
df1 <- merge(EMOJ, EYETRACK, by = c('session','stimuli'), all = TRUE)
Merge dataframes based on column, only keeping first match
Use drop_duplicates to keep only the first row per key in df_2 before merging:
df = df_1.merge(df_2.drop_duplicates('Fruit'), how='left', on='Fruit')
print(df)
Index Fruit Taste
0 1 Apple Tasty
1 2 Banana Tasty
2 3 Peach Rotten
If you only need to add one column, using map is faster:
s = df_2.drop_duplicates('Fruit').set_index('Fruit')['Taste']
df_1['Taste'] = df_1['Fruit'].map(s)
print(df_1)
Index Fruit Taste
0 1 Apple Tasty
1 2 Banana Tasty
2 3 Peach Rotten
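What the drop_duplicates plus map combination does can be sketched without pandas. The rows below are hypothetical (including the duplicate and unmatched entries), and the plain dictionaries only imitate the two steps: keep the first 'Taste' per 'Fruit', then look each 'Fruit' up.

```python
# Hypothetical stand-ins for df_1 and df_2 as lists of rows.
df_1 = [
    {"Index": 1, "Fruit": "Apple"},
    {"Index": 2, "Fruit": "Banana"},
    {"Index": 3, "Fruit": "Peach"},
    {"Index": 4, "Fruit": "Kiwi"},          # no entry in df_2
]
df_2 = [
    {"Fruit": "Apple", "Taste": "Tasty"},
    {"Fruit": "Apple", "Taste": "Sour"},    # duplicate key, ignored
    {"Fruit": "Banana", "Taste": "Tasty"},
    {"Fruit": "Peach", "Taste": "Rotten"},
    {"Fruit": "Peach", "Taste": "Sweet"},   # duplicate key, ignored
]

# Step 1: keep the first Taste per Fruit (the drop_duplicates step).
first_taste = {}
for row in df_2:
    first_taste.setdefault(row["Fruit"], row["Taste"])

# Step 2: look up each Fruit (the map step); missing keys become None.
merged = [{**row, "Taste": first_taste.get(row["Fruit"])} for row in df_1]

for row in merged:
    print(row)
```

Because the lookup holds exactly one value per key, the result always has the same number of rows as `df_1`, which is the property the left merge on de-duplicated data is meant to guarantee.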
Merge two python dataframes and add first match and stop before proceeding
Take the first entry for each distinct value of col_4 in df2 with GroupBy.first() before merging with df1:
pd.merge(df1, df2.groupby('col_4', as_index=False).first(), on='col_4')
Result:
ID_x col_1_x col_2_x col_3_x col_4 ID_y col_1_y col_2_y col_3_y
0 1 1 6 11 apple 1 8 12 12
1 2 2 7 12 apple 1 8 12 12
2 3 3 8 13 apple 1 8 12 12
3 5 4 9 14 apple 1 8 12 12
4 9 5 10 15 apple 1 8 12 12
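One detail worth checking before relying on GroupBy.first(): by default it returns the first non-missing value of each column within a group, so with missing data the result can combine values from different rows, whereas drop_duplicates keeps the first row intact. The plain-Python sketch below imitates both behaviours on hypothetical rows. The helper names are illustrative and `None` stands in for a missing value.

```python
# Hypothetical rows for df2; None stands in for a missing value (NaN).
df2 = [
    {"col_4": "apple", "col_1": None, "col_2": 12},
    {"col_4": "apple", "col_1": 8, "col_2": 30},
    {"col_4": "pear", "col_1": 5, "col_2": None},
]


def first_row_per_key(rows, key):
    """First whole row per key (the drop_duplicates-style result)."""
    out = {}
    for row in rows:
        out.setdefault(row[key], dict(row))
    return list(out.values())


def first_non_null_per_key(rows, key):
    """First non-missing value per column within each key group."""
    out = {}
    for row in rows:
        acc = out.setdefault(row[key], {key: row[key]})
        for col, value in row.items():
            # fill a column only while it is still missing
            if col != key and acc.get(col) is None:
                acc[col] = value
    return list(out.values())


by_row = first_row_per_key(df2, "col_4")
by_column = first_non_null_per_key(df2, "col_4")
print(by_row)
print(by_column)
```

If the data has no missing values, the two approaches give the same rows, and the choice is mostly a matter of taste.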
Select only participants with multiple rows
We can use a frequency-based approach: after grouping by 'ID', filter the 'ID's that have more than one observation.
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(n() > 1) %>%
ungroup
Or, in base R, use subset to keep the rows whose 'ID' has at least one Time value greater than 1:
subset(df1, ID %in% ID[Time > 1])
Data:
df1 <- structure(list(ID = c(1L, 2L, 2L, 3L, 3L, 4L, 5L, 5L), score = c(1000000L,
1000000L, 1000000L, 1000000L, 1000000L, 1000000L, 1000000L, 1000000L
), Time = c(1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L)),
class = "data.frame", row.names = c(NA,
-8L))