Remove IDs That Occur X Times in R

Remove IDs that occur x times in R

You can use table() like this:

df[df$names %in% names(table(df$names))[table(df$names) >= 5],]
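
For a quick, runnable illustration, here is a made-up data frame (the column name names and the values are assumptions, not from the question):

# Toy data: "a" appears 6 times, "b" 2 times, "c" 5 times
df <- data.frame(names = c(rep("a", 6), rep("b", 2), rep("c", 5)))

# Keep only rows whose name appears at least 5 times
# (drop = FALSE only matters here because the toy data has a single column)
df[df$names %in% names(table(df$names))[table(df$names) >= 5], , drop = FALSE]
# returns the six "a" rows and the five "c" rows; the two "b" rows are gone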

Omit IDs which occur fewer than x times with a combination of vectors

Using base R, we calculate the number of unique IndID values per SpeciesID and keep only those SpeciesID with at least 5 of them.

df[ave(df$IndID, df$SpeciesID, FUN = function(x) length(unique(x))) >= 5, ]

# SpeciesID IndID
#6 100 14-005
#7 100 14-005
#8 100 14-005
#9 100 14-006
#10 100 14-007
#11 100 14-007
#12 100 14-008
#13 100 14-009
#14 500 16-001
#15 500 16-001
#16 500 16-002
#17 500 16-002
#18 500 16-002
#19 500 16-003
#20 500 16-003
#21 500 16-004
#22 500 16-004
#23 500 16-005
#24 500 16-006
#25 500 16-006
#26 500 16-007

length(unique(x)) can also be replaced by n_distinct from dplyr:

library(dplyr)
df[ave(df$IndID, df$SpeciesID, FUN = n_distinct) >= 5, ]

Or a complete, slightly more verbose dplyr solution could be:

library(dplyr)
df %>%
  group_by(SpeciesID) %>%
  filter(n_distinct(IndID) >= 5)
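
Note that filter() leaves the result grouped by SpeciesID; if you want a plain, ungrouped result back, append ungroup(), e.g.

df %>%
  group_by(SpeciesID) %>%
  filter(n_distinct(IndID) >= 5) %>%
  ungroup()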

Remove IDs with only one observation in time in R

We can do this in a couple of ways. With data.table, convert the 'data.frame' to a 'data.table' (setDT(df)), then, grouped by 'id', get the number of rows (.N) and, if that is greater than 1, return the Subset of Data.table (.SD).
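
df is not shown in the question, but a small data frame of this shape (the id-1 row is a guess) reproduces the outputs below:

# id 1 has a single observation; ids 2, 3 and 4 have two each
df <- data.frame(id   = c(1, 2, 2, 3, 3, 4, 4),
                 time = c(1, 1, 2, 1, 2, 1, 2))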

library(data.table)
setDT(df)[, if(.N>1) .SD, by = id]
# id time
#1: 2 1
#2: 2 2
#3: 3 1
#4: 3 2
#5: 4 1
#6: 4 2

We can use the same methodology with dplyr.

library(dplyr)
df %>%
  group_by(id) %>%
  filter(n() > 1)
# id time
# (dbl) (dbl)
#1 2 1
#2 2 2
#3 3 1
#4 3 2
#5 4 1
#6 4 2

Or with base R: get the table of the 'id' column, check whether each count is greater than 1, subset the names with the logical index ('i1'), and use them to subset the 'data.frame' with %in%.

i1 <- table(df$id) > 1
subset(df, id %in% names(i1)[i1])

Removing rows of subsetted data that occur only once

One way would be the following. First, you subset observations in y using the ids in x. Then, you group the data by id and code and remove any groups which have only one observation.
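
x and y are not shown in the question; one made-up pair that reproduces the output at the end of this answer is:

x <- data.frame(id = c(12345, 90029))
y <- data.frame(
  id   = c(12345, 12345, 12345, 90029, 90029, 90029, 90029, 90029, 55555),
  code = c(1092,  1092,  3000,  1092,  1092,  1092,  5521,  5521,  1092)
)
# id 55555 is absent from x, and code 3000 occurs only once within id 12345,
# so both are removed by the pipelines below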

library(dplyr)

filter(y, id %in% x$id) %>%
  group_by(id, code) %>%
  filter(n() != 1) %>%
  ungroup()

Another way would be the following.

filter(y, id %in% x$id) %>%
  group_by(id) %>%
  filter(!(!duplicated(code) & !duplicated(code, fromLast = TRUE)))


# id code
# <int> <int>
#1 12345 1092
#2 12345 1092
#3 90029 1092
#4 90029 1092
#5 90029 1092
#6 90029 5521
#7 90029 5521
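
To see what the duplicated() trick in the second approach is doing, here is a minimal sketch on a plain vector (the values are made up):

code <- c(1092, 1092, 5521, 7010)
!duplicated(code) & !duplicated(code, fromLast = TRUE)
# [1] FALSE FALSE  TRUE  TRUE
# TRUE marks values that occur exactly once (they are both a first and a last
# occurrence), so negating the expression keeps only values that appear more than once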

Remove IDs based on their max value with dplyr in R

We can use a grouped operation with any:

library(dplyr)
test %>%
  group_by(ID) %>%
  filter(any(value > 0.1)) %>%
  ungroup()

Output:

# A tibble: 4 x 3
# value time ID
# <dbl> <dbl> <dbl>
#1 0.2 0 3
#2 0.4 0 4
#3 0.05 1 3
#4 0.5 1 4
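
test is not shown in the question; one made-up tibble that gives exactly this output (the ID 1 and ID 2 rows, whose values never exceed 0.1, are invented) is:

# tibble() is re-exported by dplyr, so library(dplyr) above is enough
test <- tibble(
  value = c(0.01, 0.03, 0.2, 0.4, 0.02, 0.08, 0.05, 0.5),
  time  = c(0,    0,    0,   0,   1,    1,    1,    1),
  ID    = c(1,    2,    3,   4,   1,    2,    3,    4)
)
# any(value > 0.1) is FALSE for IDs 1 and 2, so their rows are dropped as whole groups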

Deleting rows in a dataframe that reference IDs that do not exist in another (R)?

Here's a base R solution:

elementdf[apply(elementdf[,-1], 1, function(x) all(x %in% nodedf$nid)),]

Explanation:

The apply works by "applying" a function (a custom one in this case) to each row (the variable x in the function) of the object elementdf. If we wanted to do this by columns we would change the 1 to a 2.

The function we are using looks at each element of x (a row of elementdf) and tests whether it is also in nodedf. %in% returns a vector of logicals, one element for each element of x. The all function returns TRUE if every element is TRUE (meaning all of them are in nodedf) and FALSE otherwise.

So in the end, the apply statement returns a vector of logicals, one per row, indicating whether all of that row's elements are found in nodedf.
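
To make this concrete, here is a self-contained sketch; the column names (nid, coord, eid, n1 to n3) and the values are invented, chosen so that the second row of elementdf references a node (7) that nodedf does not contain, matching the situation described below:

nodedf    <- data.frame(nid = 1:6, coord = c(0, 1, 2, 3, 4, 5))
elementdf <- data.frame(eid = 1:2,
                        n1  = c(1, 5),
                        n2  = c(2, 6),
                        n3  = c(3, 7))

apply(elementdf[, -1], 1, function(x) all(x %in% nodedf$nid))
# [1]  TRUE FALSE

elementdf[apply(elementdf[, -1], 1, function(x) all(x %in% nodedf$nid)), ]
#   eid n1 n2 n3
# 1   1  1  2  3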


To get the values in each row that are not in nodedf, you could do

apply(elementdf[,-1], 1, function(x) x[!(x %in% nodedf$nid)])

which you'll notice is already pretty similar to the line of code above, except that in this case the apply statement returns a list. From the example you gave, it will be a list of length 2 where the first element is numeric(0) and the second element is a vector containing 7. If you have multiple offenders in one row, each will be shown.


To remove the rows in nodedf which do not have references in elementdf, you could do

nodedf[nodedf$nid %in% unique(unlist(elementdf[,-1])),]

The unique(unlist(...)) part just grabs all the unique values in elementdf[,-1], converting them to a numeric vector.
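
With the toy nodedf and elementdf sketched earlier, this keeps every node that some element refers to (nid 4 is dropped because nothing references it, and 7 has no nodedf row to keep in the first place):

nodedf[nodedf$nid %in% unique(unlist(elementdf[, -1])), ]
#   nid coord
# 1   1     0
# 2   2     1
# 3   3     2
# 5   5     4
# 6   6     5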

Remove IDs with fewer than 9 unique observations

We can use n_distinct from dplyr.
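
For illustration, assume data of roughly this shape (a guess that matches the per-ID counts shown at the end of this answer):

df <- data.frame(ID         = rep(c(2, 4, 5, 7), times = c(12, 12, 2, 1)),
                 data.month = c(1:12, 1:12, 1:2, 1))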

To remove IDs with fewer than 9 unique observations:

library(dplyr)

df %>%
  group_by(ID) %>%
  filter(n_distinct(data.month) >= 9) %>%
  pull(ID) %>%
  unique()

#[1] 2 4

Or

df %>%
  group_by(ID) %>%
  filter(n_distinct(data.month) >= 9) %>%
  distinct(ID)

# ID
# <int>
#1 2
#2 4

For the unique counts per ID:

df %>%
  group_by(ID) %>%
  summarise(count = n_distinct(data.month))


# ID count
# <int> <int>
#1 2 12
#2 4 12
#3 5 2
#4 7 1

Delete rows conditional on frequency of char variable in R

There's probably a host of solutions, but here's one using base R's ave:

mydata[with(mydata, !(ave(case == "a", id, FUN = sum) >= 3)), ]

# id case value
#6 2 a 1
#7 2 a 1
#8 2 c 2
#9 2 c 2
#14 4 a 1
#15 4 b 1
#16 4 c 2
#17 4 a 2
#18 4 b 2
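
ave(case == "a", id, FUN = sum) returns, for every row, the number of "a" cases within that row's id, so the outer ! keeps the rows of ids with fewer than three of them. mydata itself is not shown; one made-up data frame that reproduces the output above (the id-1 and id-3 rows are guesses) is:

mydata <- data.frame(
  id    = rep(1:4, times = c(5, 4, 4, 5)),
  case  = c("a", "a", "a", "b", "c",   # id 1: three "a"s -> removed
            "a", "a", "c", "c",        # id 2: two "a"s   -> kept
            "a", "a", "a", "b",        # id 3: three "a"s -> removed
            "a", "b", "c", "a", "b"),  # id 4: two "a"s   -> kept
  value = c(1, 1, 2, 2, 2,
            1, 1, 2, 2,
            1, 1, 2, 2,
            1, 1, 2, 2, 2)
)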

