Remove IDs that occur fewer than x times in R
You can use table like this:
df[df$names %in% names(table(df$names))[table(df$names) >= 5],]
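A minimal, self-contained sketch of the same idea, using hypothetical data and a threshold of 3 instead of 5 (the mechanics are identical): table counts each name, names(...)[...] keeps the names meeting the threshold, and %in% subsets the rows.

```r
# Hypothetical example data: "a" appears 3 times, "b" once, "c" twice
df <- data.frame(names = c("a", "a", "a", "b", "c", "c"))

# Names occurring at least 3 times
keep <- names(table(df$names))[table(df$names) >= 3]

# Keep only rows whose name met the threshold
result <- df[df$names %in% keep, , drop = FALSE]
result$names
# "a" "a" "a"
```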
Omit IDs which occur fewer than x times with a combination of vectors
Using base R, we calculate the number of unique IndID values per SpeciesID and select only those SpeciesID groups with 5 or more unique values.
df[ave(df$IndID, df$SpeciesID, FUN = function(x) length(unique(x))) >= 5, ]
# SpeciesID IndID
#6 100 14-005
#7 100 14-005
#8 100 14-005
#9 100 14-006
#10 100 14-007
#11 100 14-007
#12 100 14-008
#13 100 14-009
#14 500 16-001
#15 500 16-001
#16 500 16-002
#17 500 16-002
#18 500 16-002
#19 500 16-003
#20 500 16-003
#21 500 16-004
#22 500 16-004
#23 500 16-005
#24 500 16-006
#25 500 16-006
#26 500 16-007
length(unique(x)) can also be replaced by n_distinct from dplyr:
library(dplyr)
df[ave(df$IndID, df$SpeciesID, FUN = n_distinct) >= 5, ]
Or a complete dplyr solution, which is more verbose, could be:
library(dplyr)
df %>%
group_by(SpeciesID) %>%
filter(n_distinct(IndID) >= 5)
Remove IDs with only one observation in time in R
We can do this using a couple of options. With data.table, convert the 'data.frame' to a 'data.table' (setDT(df)); then, grouped by 'id', we get the number of rows (.N) and, if that is greater than 1, return the Subset of the Data.table (.SD).
library(data.table)
setDT(df)[, if(.N>1) .SD, by = id]
# id time
#1: 2 1
#2: 2 2
#3: 3 1
#4: 3 2
#5: 4 1
#6: 4 2
We can use the same methodology with dplyr.
library(dplyr)
df %>%
group_by(id) %>%
filter(n()>1)
# id time
# (dbl) (dbl)
#1 2 1
#2 2 2
#3 3 1
#4 3 2
#5 4 1
#6 4 2
Or with base R: get the table of the 'id' column, check whether each count is greater than 1, subset the names based on the logical index ('i1'), and use that to subset the 'data.frame' with %in%.
i1 <- table(df$id)>1
subset(df, id %in% names(i1)[i1] )
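A self-contained sketch of the same steps, using hypothetical data shaped like the output above, to show what 'i1' actually holds:

```r
# Hypothetical data: id 1 occurs once; ids 2, 3, 4 occur twice
df <- data.frame(id   = c(1, 2, 2, 3, 3, 4, 4),
                 time = c(1, 1, 2, 1, 2, 1, 2))

i1 <- table(df$id) > 1   # named logical vector: FALSE TRUE TRUE TRUE
names(i1)[i1]            # the ids to keep, as character: "2" "3" "4"

subset(df, id %in% names(i1)[i1])   # drops the single-observation id 1
```

Note that names(i1) is character, so %in% relies on R coercing id to character for the comparison.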
Removing rows of subsetted data that occur only once
One way would be the following. First, subset the observations in y using the ids in x. Then group the data by id and code and remove any groups that have only one observation.
library(dplyr)
filter(y, id %in% x$id) %>%
group_by(id, code) %>%
filter(n() != 1) %>%
ungroup
Another way would be the following.
filter(y, id %in% x$id) %>%
group_by(id) %>%
filter(!(!duplicated(code) & !duplicated(code, fromLast = TRUE)))
# id code
# <int> <int>
#1 12345 1092
#2 12345 1092
#3 90029 1092
#4 90029 1092
#5 90029 1092
#6 90029 5521
#7 90029 5521
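The double-duplicated trick in the second approach flags elements that occur exactly once: an element is kept only if it is a duplicate scanning from the front or from the back. A minimal base R sketch with hypothetical values:

```r
x <- c(1092, 1092, 5521, 7777)

# TRUE exactly where the value occurs only once in x:
# not a duplicate from the front AND not a duplicate from the back
once <- !duplicated(x) & !duplicated(x, fromLast = TRUE)
once
# FALSE FALSE  TRUE  TRUE
```

Negating this vector, as the filter above does, keeps only the values that appear more than once within each group.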
Remove IDs based on their max value with dplyr in R
We can use a group-by operation with any:
library(dplyr)
test %>%
group_by(ID) %>%
filter(any(value > 0.1)) %>%
ungroup
-output
# A tibble: 4 x 3
# value time ID
# <dbl> <dbl> <dbl>
#1 0.2 0 3
#2 0.4 0 4
#3 0.05 1 3
#4 0.5 1 4
Deleting rows in a dataframe that reference IDs that do not exist in another (R)?
Here's a base R solution:
elementdf[apply(elementdf[,-1], 1, function(x) all(x %in% nodedf$nid)),]
Explanation:
The apply works by "applying" a function (a custom one in this case) to each row (the variable x in the function) of the object elementdf. If we wanted to do this by columns, we would change the 1 to a 2.
The function we are using looks at each element in x (a row in elementdf) and tests whether it is also in nodedf. The %in% operator is a special function which returns a vector of logicals, one for each element of x. The all function returns TRUE if all elements are TRUE (meaning all of them are in nodedf) and FALSE otherwise. So in the end, the apply statement will return a vector of logicals, one per row, depending on whether every element of that row is found in nodedf.
To get the values in each row that are not in nodedf, you could do
apply(elementdf[,-1], 1, function(x) x[!(x %in% nodedf$nid)])
which you'll notice is already pretty similar to the line of code above. Except in this case, the apply statement will return a list. From the example you gave, it will be a list of length 2 where the first element is numeric(0) and the second element is a vector containing 7. If you have multiple offenders in one row, each will be shown.
To remove the rows in nodedf which do not have references in elementdf, you could do
nodedf[nodedf$nid %in% unique(unlist(elementdf[,-1])),]
The unique(unlist(...)) part just grabs all the unique values in elementdf[,-1], converting them to a numeric vector.
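A minimal, self-contained sketch of the filtering approach, where elementdf and nodedf are hypothetical stand-ins for the question's data (an id column followed by node-reference columns):

```r
# Hypothetical node table and element table
nodedf    <- data.frame(nid = c(1, 2, 3, 4))
elementdf <- data.frame(eid = c(10, 20),
                        n1  = c(1, 2),
                        n2  = c(3, 7))   # 7 has no match in nodedf

# TRUE for rows whose node references all exist in nodedf
valid <- apply(elementdf[, -1], 1, function(x) all(x %in% nodedf$nid))

elementdf[valid, ]
#   eid n1 n2
# 1  10  1  3
```

Row 2 is dropped because its reference 7 does not appear in nodedf$nid.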
Remove IDs with fewer than 9 unique observations
We can use n_distinct to remove IDs with fewer than 9 unique observations:
library(dplyr)
df %>%
group_by(ID) %>%
filter(n_distinct(data.month) >= 9) %>%
pull(ID) %>% unique
#[1] 2 4
Or
df %>%
group_by(ID) %>%
filter(n_distinct(data.month) >= 9) %>%
distinct(ID)
# ID
# <int>
#1 2
#2 4
For unique counts of each ID
df %>%
group_by(ID) %>%
summarise(count = n_distinct(data.month))
# ID count
# <int> <int>
#1 2 12
#2 4 12
#3 5 2
#4 7 1
Delete rows conditional on frequency of char variable in R
There's probably a host of solutions, but here's one using base R's ave:
mydata[with(mydata, !(ave(case == "a", id, FUN = sum) >= 3)), ]
# id case value
#6 2 a 1
#7 2 a 1
#8 2 c 2
#9 2 c 2
#14 4 a 1
#15 4 b 1
#16 4 c 2
#17 4 a 2
#18 4 b 2
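Here ave sums the logical vector case == "a" within each id, which is simply a per-group count of "a" rows, repeated for every row of the group. A minimal sketch with hypothetical vectors:

```r
# Hypothetical data: id 1 has three "a" cases, id 2 has one
id   <- c(1, 1, 1, 2, 2)
case <- c("a", "a", "a", "a", "b")

# Per-row count of "a" cases within the row's id group
ave(case == "a", id, FUN = sum)
# 3 3 3 1 1
```

Comparing that count against the threshold (here 3) then yields a row-level logical index suitable for subsetting.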