R - Keep first observation per group identified by multiple variables (Stata equivalent "bys var1 var2: keep if _n == 1")

I would first order the data.frame, at which point you can look into using by:

mydata <- mydata[with(mydata, do.call(order, list(id, day, value))), ]

do.call(rbind, by(mydata, list(mydata$id, mydata$day),
                  FUN = function(x) head(x, 1)))
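
Both snippets assume a data.frame mydata with columns id, day, and value. A small made-up example you can use to try them (not the asker's data; the output shown further down was produced from the asker's own data, not this toy example):

# toy data with several rows per (id, day) combination
mydata <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  day   = c(1, 1, 2, 1, 2, 2),
  value = c(12, 10, 15, 40, 30, 33)
)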

Alternatively, look into the "data.table" package. Continuing with the ordered data.frame from above:

library(data.table)

DT <- data.table(mydata, key = "id,day")
DT[, head(.SD, 1), by = key(DT)]
# id day value
# 1: 1 1 10
# 2: 1 2 15
# 3: 1 3 20
# 4: 2 1 40
# 5: 2 2 30
# 6: 3 2 22
# 7: 3 3 24
# 8: 4 1 11
# 9: 4 2 11
# 10: 4 3 12

Or, starting from scratch, you can use data.table in the following way:

DT <- data.table(id, day, value, key = "id,day")
DT[, n := rank(value, ties.method="first"), by = key(DT)][n == 1]

And, by extension, in base R:

Ranks <- with(mydata, ave(value, id, day, FUN = function(x)
  rank(x, ties.method = "first")))
mydata[Ranks == 1, ]
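
As a side note (not from the original answer), data.table's unique() method can also keep one row per group of the columns given in by. Like head(.SD, 1), it keeps the first row in the data's current order rather than the row with the smallest value. A sketch reusing the DT built above:

library(data.table)

# keeps the first occurrence of each (id, day) combination
unique(DT, by = c("id", "day"))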

Stata: Keep the first observation by group

The if (condition) { do something } programming syntax is for control flow, not for selecting observations; as written, Stata only evaluates whether A==1 in the first row. Instead, use the if qualifier and combine conditions with and (&) or or (|) operators. Try this:

bys id firm: keep if (_n==1 & A==0) | A==1

Track first observation based on several criteria in R

A solution using dplyr, tidyr, and data.table. case_when is handy for assigning the condition labels. After that, drop rows where Flag is NA, assign a run-length ID as Flag2, keep the first row of each run, number the events with a cumulative count of the start flags, drop duplicated end flags within each event, and finally spread the data frame to wide format.

library(dplyr)
library(tidyr)
library(data.table)

dat2 <- dat %>%
  mutate(Flag = case_when(
    Enable == 0 & Deviation > Threshold & Status == 0 ~ "Start Event Time",
    Enable == 1 & Deviation > Threshold & Status == 0 ~ "End Event x Time",
    Enable == 1 & Deviation > Threshold & Status == 1 ~ "End Event z Time",
    TRUE ~ NA_character_
  )) %>%
  drop_na(Flag) %>%
  mutate(Flag2 = rleid(Flag)) %>%
  group_by(Flag2) %>%
  slice(1) %>%
  ungroup() %>%
  mutate(x = cumsum(Flag == "Start Event Time")) %>%
  group_by(x) %>%
  filter(!(duplicated(Flag) & (Flag == 'End Event x Time' | Flag == 'End Event z Time'))) %>%
  spread(Flag, Timestamp, x) %>%
  select(ID, `Start Event Time`, `End Event x Time`, `End Event z Time`)
dat2
# # A tibble: 3 x 4
#   ID    `Start Event Time` `End Event x Time` `End Event z Time`
# * <chr> <chr>              <chr>              <chr>
# 1 a     6/10/2015 10:20    6/10/2015 10:25    6/10/2015 10:40
# 2 a     6/10/2015 10:55    6/10/2015 11:00    6/10/2015 11:10
# 3 b     7/11/2015 11:25    7/12/2015 11:30    7/13/2015 11:35

Creating group ids by comparing values of two variables across rows: in R

You could try adapting this answer to group by ranges that are adjacent to each other. Credit goes entirely to @r2evans.

In this case, you would use expand.grid to get combinations of start and end. Instead of labels you would have row numbers rn to reference.

In the end, you can number the groups based on which rows appear together in the list. The last few lines starting with enframe use tibble/tidyverse (and the \(x) lambda shorthand, which requires R 4.1 or later). To match the group numbers, I also re-sorted the results.

I hope this might be helpful.

library(tidyverse)

toy_data <- data.frame(start = c(1, 5, 6, 10, 16),
                       end   = c(10, 9, 11, 15, 17))

toy_data$rn = 1:nrow(toy_data)

eg <- expand.grid(a = seq_len(nrow(toy_data)), b = seq_len(nrow(toy_data)))
eg <- eg[eg$a < eg$b, ]

together <- cbind(
  setNames(toy_data[eg$a, ], paste0(names(toy_data), "1")),
  setNames(toy_data[eg$b, ], paste0(names(toy_data), "2"))
)

together <- subset(together, end1 == start2)

groups <- split(together$rn2, together$rn1)

for (i in toy_data$rn) {
  ind <- (i == names(groups)) | sapply(groups, `%in%`, x = i)
  vals <- groups[ind]
  groups <- c(
    setNames(list(unique(c(i, names(vals), unlist(vals)))), i),
    groups[!ind]
  )
}

min_row <- as.numeric(sapply(groups, min))
ctr <- seq_along(groups)

lapply(ctr[order(match(min_row, ctr))], \(x) toy_data[toy_data$rn %in% groups[[x]], ]) %>%
  enframe() %>%
  unnest(col = value) %>%
  select(-rn)

Output

   name start   end
  <int> <dbl> <dbl>
1     1     1    10
2     1    10    15
3     2     5     9
4     3     6    11
5     4    16    17

R: Removing (or grouping) redundant observations/duplicates

You could try

df %>%
  pivot_wider(names_from = genre,
              names_prefix = "genre_",
              values_from = genre) %>%
  mutate(across(starts_with("genre"), ~ +!is.na(.)))

instead of dummy_cols. This returns

# A tibble: 2 x 5
  song_and_artist        popularity genre_pop genre_rock genre_hiphop
  <chr>                       <dbl>     <int>      <int>        <int>
1 song_by_band                   85         1          1            1
2 othersong_by_otherband         58         1          1            0
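
For reference, an input df consistent with that output would look something like the following (hypothetical long-format data, one row per song-genre pair; not taken from the original question):

library(tibble)

df <- tribble(
  ~song_and_artist,         ~popularity, ~genre,
  "song_by_band",                    85, "pop",
  "song_by_band",                    85, "rock",
  "song_by_band",                    85, "hiphop",
  "othersong_by_otherband",          58, "pop",
  "othersong_by_otherband",          58, "rock"
)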

Check if for each unique value of var1, there is one observation where its value equals either var2 or var3 by group (var4) in R

Consider building the marriage column with ave (in-line aggregation by group), where max is used to return a 1 whenever any value in the group is TRUE.
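
Since the original data frame is not shown, here is a households reconstructed from the listing below (input columns only; marriage is computed next):

# households reconstructed from the output listing further down
households <- data.frame(
  cluster = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  ID      = c(1, 7, 18, 3, 3, 9, 10, 19, 25),
  IDman   = c(1, 2, 3, 3, 3, 4, 10, 6, 10),
  IDwoman = c(5, 7, 9, 11, 12, 14, 19, 5, 19)
)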

households <- within(households, {
  man <- ave(IDman %in% ID, cluster, FUN = max)
  woman <- ave(IDwoman %in% ID, cluster, FUN = max)
  marriage <- man == 1 & woman == 1

  rm(man, woman)
})

households
# cluster ID IDman IDwoman marriage
# 1 a 1 1 5 TRUE
# 2 a 7 2 7 TRUE
# 3 a 18 3 9 TRUE
# 4 b 3 3 11 FALSE
# 5 b 3 3 12 FALSE
# 6 b 9 4 14 FALSE
# 7 c 10 10 19 TRUE
# 8 c 19 6 5 TRUE
# 9 c 25 10 19 TRUE

And for the unique cluster-level combinations, filter the data frame by rows and columns accordingly, then run unique:

unique(households[households$marriage == TRUE,
                  c("cluster", "marriage")])

# cluster marriage
# 1 a TRUE
# 7 c TRUE
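
For comparison, the same logic as a dplyr sketch (not part of the original answer; it assumes the same households columns and, like the ave version, checks membership against the full ID column):

library(dplyr)

households %>%
  group_by(cluster) %>%
  mutate(marriage = any(IDman %in% households$ID) &
                    any(IDwoman %in% households$ID)) %>%
  ungroup()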

Stata: Keep only observations with minimum, maximum and median value of a given variable

This is not in the first instance a precision problem.

It is an inevitable problem when (1) the number of values is even and (2) the median is the mean of two central values that are different. Then the median itself is not a value in the dataset and will not be found by keep.

Consider a data set 1, 2, 3, 4. The median 2.5 is not in the data. This is very common; indeed it is what is expected with all values distinct and the number of observations even.

Other problems can arise because two or even three of the minimum, median and maximum could be equal to each other. This is not your present problem, but it can bite with other variables (e.g. indicator variables).

Precision problems are possible.

Here is a general solution intended to avoid all these difficulties.

If you collapse to min, median, max and then reshape, you can avoid the problem. You will always get three results, even if they are numerically equal and/or not present in the data.

In the trivial example below, the identifier is needed only to appease reshape. In other problems, you might want to collapse using by() and then your identifier comes ready-made. However, you will be less likely to want to reshape in that case.

. clear

. set obs 4
number of observations (_N) was 0, now 4

. gen y = _n

. collapse (min)ymin=y (max)ymax=y (median)ymedian=y

. gen id = _n

. reshape long y, i(id) j(statistic) string
(note: j = max median min)

Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 1 -> 3
Number of variables 4 -> 3
j variable (3 values) -> statistic
xij variables:
ymax ymedian ymin -> y
-----------------------------------------------------------------------------

. list

+---------------------+
| id statis~c y |
|---------------------|
1. | 1 max 4 |
2. | 1 median 2.5 |
3. | 1 min 1 |
+---------------------+

All that said, having (lots of?) datasets with just three observations sounds poor data management strategy. Perhaps this is extracted from some larger question.

UPDATE

Here is another way to keep precisely 3 observations. Apart from the minimum and maximum, we use the rule that we keep the "low median", i.e. the lower of two values averaged for the median, when the number of observations is even, and a single value that is the median otherwise. (In Stephen Stigler's agreeable terminology, we can talk of "comedians" in the first case.)

. sysuse auto, clear
(1978 Automobile Data)

. sort mpg

. drop if missing(mpg)
(0 observations deleted)

. keep if inlist(_n, 1, cond(mod(_N, 2), ceil(_N/2), floor(_N/2)), _N)
(71 observations deleted)

. l mpg

+-----+
| mpg |
|-----|
1. | 12 |
2. | 20 |
3. | 41 |
+-----+

mod(_N, 2) is 1 if _N is odd and 0 if _N is even. The expression in cond() therefore selects ceil(_N/2) if the number of observations is odd and floor(_N/2) if it is even. With the auto data (_N = 74, even), this keeps observations 1, 37, and 74 of the data sorted on mpg.

Aggregating variables with similar first digit

clear 
input str1 person item11 item21 item14 item15 item25 item17
a 2 3 5 1 3 50
end

* item1* matches every variable whose name starts with "item1" (item11, item14, item15, item17)
egen item1 = rowtotal(item1*)
egen item2 = rowtotal(item2*)

* ? matches exactly one character, so this drops item11, item21, ..., but keeps item1 and item2
drop item1? item2?
list

+------------------------+
| person item1 item2 |
|------------------------|
1. | a 58 6 |
+------------------------+

Tracking changes per observation in a continuous variable

One way of analyzing this would be to do as Nick suggested and use destring area pricesqm. Note that in the following code, I added four lines to your data example so that there was an example of a masterplan-lotnumber changing over time:

clear

input str109 masterplan str66 lotnumber str40 area str30 pricesqm str42 transactiondate
"/2022" " 46" "625" "260.8" "2004/01/24"
"/2485" " 261/2" "435" "103.4483" "2004/01/29"
"/2485" " 63" "625" "75.2" "2004/01/23"
"/3152" " 114" "500" "170" "2004/01/28"
"/3152" " 134" "993.05" "160.6163" "2004/01/06"
"/3152" " 141" "600" "131.44" "2005/01/28"
"/3152" " 159" "500" "154" "2003/01/28"
"/3152" " 161" "500" "155" "2002/01/29"
"/9998" " 999" "800" "155" "2003/02/28"
"/9998" " 999" "400" "155" "2004/03/15"
"/9999" " 999" "800" "155" "2004/02/28"
"/9999" " 999" "800" "155" "2005/03/15"

end

compress
destring area pricesqm, replace

*create a clean date from the transaction date and format for ease
gen trans_date_clean = daily(transactiondate, "YMD")
format trans_date_clean %tdnn/dd/YY

*create an id for each masterplan-lotnumber
sort masterplan lotnumber trans_date_clean
egen id = group(masterplan lot)

*create a flag that equals 1 if the id is the same as the previous
*id, the previous area is greater than 500, and the area of this
*observation is less than 500. This approach depends on the sort
*before the egen command above so that masterplan-lotnumbers are
*grouped together.
gen flag = 0
replace flag = 1 if id == id[_n-1] & area[_n-1] > 500 & area < 500

In addition, this is just one way to approach this problem. After this, you might need to take steps to ensure that you don't double count any lots if they change areas frequently, or you might need to add extra conditions to the flag if the area changes need to be within a certain time period. Alternatively, you could also consider reshaping the data wide by the id (the masterplan-lotnumber) and the transaction year (from the transaction date) and comparing the differences in housing areas between two years.


