How to select all factor variables in R
Some data:
insurance <- data.frame(
  int = 1:5,
  fact1 = letters[1:5],
  fact2 = factor(1:5),
  fact3 = LETTERS[3:7],
  stringsAsFactors = TRUE # needed in R >= 4.0.0, where the default is FALSE
)
I would use sapply like you did, but combined with is.factor to return a logical vector:
is.fact <- sapply(insurance, is.factor)
# int fact1 fact2 fact3
# FALSE TRUE TRUE TRUE
Then use [ to extract these columns:
factors.df <- insurance[, is.fact]
# fact1 fact2 fact3
# 1 a 1 C
# 2 b 2 D
# 3 c 3 E
# 4 d 4 F
# 5 e 5 G
Finally, to get the levels, use lapply:
lapply(factors.df, levels)
# $fact1
# [1] "a" "b" "c" "d" "e"
#
# $fact2
# [1] "1" "2" "3" "4" "5"
#
# $fact3
# [1] "C" "D" "E" "F" "G"
You might also find str(insurance) interesting as a short summary.
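As a compact variant of the steps above, the sketch below uses vapply (which always returns a logical vector) and drop = FALSE (which keeps the result a data frame even when only one column is a factor). The data frame toy is a made-up example, not taken from the question.

```r
# Hypothetical example data with a single factor column
toy <- data.frame(
  int = 1:3,
  fct = factor(c("x", "y", "x"))
)

# vapply() always returns a logical vector, one element per column
is_fact <- vapply(toy, is.factor, logical(1))

# drop = FALSE keeps the result a data frame when only one column is selected
factors_df <- toy[, is_fact, drop = FALSE]

# Levels of every factor column, as a named list
lapply(factors_df, levels)
```

Without drop = FALSE, selecting a single column with [ would return a bare factor, and lapply(..., levels) would then loop over its elements rather than over columns.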
R - select only factor columns of dataframe
#DATA
df = mtcars
colnames(df) = gsub("mpg","id",colnames(df))
df$am = as.factor(df$am)
df$gear = as.factor(df$gear)
df$id = as.factor(df$id)
#Filter out 'id' after selecting factors
df[,sapply(df, is.factor) & colnames(df) != "id"]
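The same filter can be wrapped in a small helper. This is only a sketch: select_factors and its exclude argument are hypothetical names introduced here for illustration. It also renames the column by exact match, since gsub would rewrite any column name that merely contains "mpg".

```r
# Hypothetical helper: keep factor columns, except those named in `exclude`
select_factors <- function(d, exclude = character(0)) {
  keep <- vapply(d, is.factor, logical(1)) & !(names(d) %in% exclude)
  d[, keep, drop = FALSE]
}

df <- mtcars
names(df)[names(df) == "mpg"] <- "id"  # exact-match rename
df$am <- as.factor(df$am)
df$gear <- as.factor(df$gear)
df$id <- as.factor(df$id)

# Factor columns other than 'id'
names(select_factors(df, exclude = "id"))
```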
list all factor levels of a data.frame
Here are some options. We loop through the columns of 'data' with sapply and get the levels of each column (assuming that all the columns are of class factor):
sapply(data, levels)
Or, if we need to pipe (%>%) it, this can be done as:
library(dplyr)
data %>%
sapply(levels)
Another option is summarise_each from dplyr, where we specify levels within funs. Note that summarise_each() and funs() are deprecated in current dplyr releases.
data %>%
summarise_each(funs(list(levels(.))))
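When not every column is a factor, one possible approach is to filter the factor columns first. The sketch below uses a made-up data frame dat. The dplyr part runs only if dplyr (version 1.0.0 or later is assumed) is installed, and shows across() as the current replacement for summarise_each()/funs().

```r
# Hypothetical data: two factors with different numbers of levels, one numeric
dat <- data.frame(
  f1 = factor(c("a", "b", "a")),
  f2 = factor(c("x", "y", "z")),
  n  = 1:3
)

# Keep only factor columns, then collect levels; lapply() always returns a list
lev <- lapply(Filter(is.factor, dat), levels)
lev

# With dplyr >= 1.0.0 (assumed), across() replaces summarise_each()/funs()
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  dat %>%
    summarise(across(where(is.factor), ~ list(levels(.x))))
}
```

lapply is used instead of sapply because sapply simplifies its result to a matrix whenever every factor happens to have the same number of levels, which makes the return type depend on the data.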
How to apply prop.table() to all the factor variables in R
Perhaps with sapply?
sapply(df, function(x) if("factor" %in% class(x)) {prop.table(table(x))})
E.g. when prop.table(table(df)) throws an error:
library(palmerpenguins)
prop.table(table(penguins))
#Error in table(penguins) : attempt to make a table with >= 2^31 elements
sapply(penguins, function(x) if("factor" %in% class(x)) {prop.table(table(x))})
#$species
#x
# Adelie Chinstrap Gentoo
#0.4418605 0.1976744 0.3604651
#$island
#x
# Biscoe Dream Torgersen
#0.4883721 0.3604651 0.1511628
#$bill_length_mm
#NULL
#$bill_depth_mm
#NULL
#$flipper_length_mm
#NULL
#$body_mass_g
#NULL
#$sex
#x
# female male
#0.4954955 0.5045045
#$year
#NULL
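To avoid the NULL entries for non-factor columns, one option is to keep only the factor columns before computing the proportions. A minimal sketch on a made-up data frame d:

```r
# Hypothetical data with one factor and one numeric column
d <- data.frame(
  grp = factor(c("a", "a", "b", "b")),
  val = c(1.5, 2.5, 3.5, 4.5)
)

# Filter() keeps the factor columns, so no NULL entries appear in the result
props <- lapply(Filter(is.factor, d), function(x) prop.table(table(x)))
props
```

Note that table() leaves out NA values by default, so the proportions are computed over non-missing rows only; pass useNA = "ifany" to table() if missing values should be counted as their own category.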
Subset dataset with several levels of a categorical variable
You can use %in%. This is a membership operator that you can use with a vector of the levels of cat.var for which you would like to retain rows.
new_df <- subset(df, df$cat.var %in% c("level.1", "level.2"))
For example
df <- data.frame(fct = rep(letters[1:3], times = 2), nums = 1:6)
df
# This is our example data.frame
# fct nums
# 1 a 1
# 2 b 2
# 3 c 3
# 4 a 4
# 5 b 5
# 6 c 6
subset(df, df$fct %in% c("a", "b"))
# Subsetting on a factor using %in% returns the following output:
# fct nums
# 1 a 1
# 2 b 2
# 4 a 4
# 5 b 5
Note: Another option is to use the filter function from dplyr, as follows:
library(dplyr)
filter(df, fct %in% c("a", "b"))
This returns the same filtered (subsetted) dataframe.
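Two small variations on the same idea, sketched with the example data from above: inside subset() the column can be referenced by name without the df$ prefix, and negating the %in% test drops the listed levels instead of keeping them. The names keep_ab and drop_ab are illustrative only.

```r
df <- data.frame(fct = rep(letters[1:3], times = 2), nums = 1:6)

# Inside subset(), columns can be referenced by name without the df$ prefix
keep_ab <- subset(df, fct %in% c("a", "b"))

# Negate the test to drop the listed levels instead of keeping them
drop_ab <- subset(df, !(fct %in% c("a", "b")))

keep_ab
drop_ab
```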
Sampling data frames maintaining all levels of factor variables
There is nothing wrong with your code or approach; you just do not have enough observations. There are a lot of groups with only 1 row in them, and sampling such a group with a proportion of 0.7 rounds the sample size down to 0. If you change the sample to 1000 rows, the same code works fine without error.
library(dplyr)
data <- tibble(y = rnorm(1000), x1 = rnorm(1000),
  x2 = sample(letters, 1000, replace = TRUE), x3 = sample(LETTERS, 1000, replace = TRUE))
train_data <- data %>%
group_by(x2, x3) %>%
slice_sample(prop = 0.7)
test_data <- data %>% anti_join(train_data)
reg <- lm(y ~ x1 + x2 + x3, train_data)
predict(reg, newdata = test_data)
If your real data has groups with as few as 1 row, you can sample so that each group contributes the larger of 1 and (0.7 * number of rows in the group):
train_data <- data %>% group_by(x2, x3) %>% sample_n(max(0.7*n(), 1))
(I used sample_n here since I couldn't use n() in slice_sample.)
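The rounding problem and the at-least-one-row rule can also be sketched in base R. This is an illustration under assumptions, not the code from the question: the data frame d and the helper sample_group are hypothetical, and the sample size is rounded down with floor().

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical data: group "a" has 10 rows, group "b" has a single row
d <- data.frame(g = c(rep("a", 10), "b"), y = rnorm(11))

# A 0.7 proportion of a 1-row group rounds down to 0 rows
floor(0.7 * 1)

# Hypothetical helper: sample 70% of the row indices, but never fewer than 1
sample_group <- function(idx, prop = 0.7) {
  size <- max(floor(prop * length(idx)), 1)
  idx[sample.int(length(idx), size)]
}

# Sample row indices within each group, then split into train and test
train_idx <- unlist(lapply(split(seq_len(nrow(d)), d$g), sample_group))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

table(train$g)
```

Every group then appears in the training data, so a model fitted on it has seen all factor levels; the trade-off is that single-row groups never reach the test set.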
Select specific levels of factor in R
From your text, it seems like you want to keep only those three levels -- if that's so, then you want:
CatchbySpecies <- CatchbySpecies[CatchbySpecies$Trap %in% c("Weka", "Rat", "Stoat"), ]
Key differences from your attempt:
- No exclamation mark at the front, unless you want the other ones.
- Wrap the list of things you do want in the concatenate function, c().
- Use %in% instead of ==, since you don't want to check whether the factors are equal to the whole vector c("Weka", "Rat", "Stoat"), but rather whether each factor value is one of the elements contained in it.
- Add a comma at the end. You're imposing a condition on the rows, so the comma marks where the row condition ends, and leaving the column condition empty keeps all columns.
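The difference between == and %in% can be seen on a small made-up factor (trap is a hypothetical stand-in for the Trap column):

```r
# Hypothetical stand-in for the Trap column
trap <- factor(c("Rat", "Weka", "Stoat", "Possum", "Rat", "Weka"))
wanted <- c("Weka", "Rat", "Stoat")

# == recycles `wanted` and compares position by position
eq_result <- trap == wanted
eq_result
# FALSE FALSE TRUE FALSE TRUE FALSE

# %in% tests each value for membership in `wanted`
in_result <- trap %in% wanted
in_result
# TRUE TRUE TRUE FALSE TRUE TRUE
```

With ==, a row only matches when its value happens to line up with the recycled position in wanted, so most of the wanted rows are silently missed.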
Let me know if you have questions!
EDIT: You mentioned not wanting to use droplevels() and I wasn't quite sure why, but Ben Bolker helpfully pointed out that you may want to use it after doing this operation, unless you want to retain the discarded factor levels in this variable for some reason. You can just edit the line as:
CatchbySpecies <-
  droplevels(CatchbySpecies[CatchbySpecies$Trap %in% c("Weka", "Rat", "Stoat"), ])
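To see what droplevels() changes, here is a minimal sketch on a made-up data frame (catches is a hypothetical stand-in for CatchbySpecies, with invented counts):

```r
# Hypothetical stand-in for CatchbySpecies
catches <- data.frame(
  Trap  = factor(c("Rat", "Weka", "Stoat", "Possum")),
  Count = c(3, 1, 2, 5)
)

kept <- catches[catches$Trap %in% c("Weka", "Rat", "Stoat"), ]

levels(kept$Trap)              # "Possum" is still a level, with zero rows
levels(droplevels(kept)$Trap)  # only the levels that remain in the data
```

Unused levels matter later on: they still show up as empty categories in table() output and in plot legends unless they are dropped.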