How to Select All Factor Variables in R


Some data:

insurance <- data.frame(
  int = 1:5,
  fact1 = letters[1:5],
  fact2 = factor(1:5),
  fact3 = LETTERS[3:7],
  stringsAsFactors = TRUE  # needed in R >= 4.0, where strings stay character by default
)

I would use sapply like you did, but combined with is.factor to return a logical vector:

is.fact <- sapply(insurance, is.factor)
#   int fact1 fact2 fact3
# FALSE  TRUE  TRUE  TRUE

Then use [ to extract these columns:

factors.df <- insurance[, is.fact]
#   fact1 fact2 fact3
# 1     a     1     C
# 2     b     2     D
# 3     c     3     E
# 4     d     4     F
# 5     e     5     G

Finally, to get the levels, use lapply:

lapply(factors.df, levels)
# $fact1
# [1] "a" "b" "c" "d" "e"
#
# $fact2
# [1] "1" "2" "3" "4" "5"
#
# $fact3
# [1] "C" "D" "E" "F" "G"

You might also find str(insurance) interesting as a short summary.
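
As a more compact variant of the steps above, base R's Filter() can pick the factor columns in a single call. This is only a sketch: the data frame below is a made-up stand-in for insurance, and factors_df is a hypothetical name.

```r
# Hypothetical example data; stringsAsFactors = TRUE turns the
# character column into a factor (not the default in R >= 4.0).
insurance <- data.frame(
  int = 1:5,
  fact1 = letters[1:5],
  fact2 = factor(1:5),
  stringsAsFactors = TRUE
)

# Filter() keeps the columns for which the predicate returns TRUE
# and returns them as a data frame.
factors_df <- Filter(is.factor, insurance)

# Levels of every remaining column, as a named list.
factor_levels <- lapply(factors_df, levels)
```

Because Filter() subsets with single brackets internally, the result stays a data frame even when only one column matches.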

R - select only factor columns of dataframe

#DATA
df = mtcars
colnames(df) = gsub("mpg","id",colnames(df))
df$am = as.factor(df$am)
df$gear = as.factor(df$gear)
df$id = as.factor(df$id)

#Filter out 'id' after selecting factors
df[,sapply(df, is.factor) & colnames(df) != "id"]
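
One thing to watch with this kind of bracket subsetting: if only a single column survives the filter, the result collapses to a plain vector unless drop = FALSE is passed. A minimal sketch, assuming a cut-down version of the data above in which only am and a hypothetical id column are factors:

```r
df <- mtcars
df$am <- as.factor(df$am)
df$id <- as.factor(df$mpg)  # hypothetical id column for illustration

# Logical mask: factor columns that are not called "id".
# vapply() guarantees one logical value per column.
keep <- vapply(df, is.factor, logical(1)) & colnames(df) != "id"

as_vector <- df[, keep]                # single match: collapses to a factor
as_frame  <- df[, keep, drop = FALSE]  # stays a one-column data frame
```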

list all factor levels of a data.frame

Here are some options. We loop through 'data' with sapply and get the levels of each column (assuming that all the columns are factors).

sapply(data, levels)

Or if we need to pipe (%>%) it, this can be done as

library(dplyr)
data %>%
sapply(levels)

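Another option is summarise from dplyr. Older dplyr versions wrote this as summarise_each(funs(list(levels(.)))), but summarise_each and funs are both deprecated; with current dplyr the equivalent uses across:

data %>%
  summarise(across(everything(), ~ list(levels(.x))))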
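
A small caveat on the sapply version: sapply simplifies its result when it can, so if every factor happens to have the same number of levels you get a character matrix instead of a list. The sketch below uses a made-up two-column data frame to show the difference; lapply always returns a list.

```r
# Hypothetical data: two factors that both have exactly two levels.
data <- data.frame(
  a = factor(c("x", "y", "x")),
  b = factor(c("p", "q", "q"))
)

levels_simplified <- sapply(data, levels)  # simplified to a 2 x 2 character matrix
levels_list       <- lapply(data, levels)  # always a named list
```

If downstream code expects a list, lapply (or sapply with simplify = FALSE) is the safer choice.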

How to apply prop.table() to all the factor variables in R

Perhaps with sapply?

sapply(df, function(x) if("factor" %in% class(x)) {prop.table(table(x))})

For example, calling prop.table(table(df)) on a whole data frame can throw an error, whereas applying it column by column works:

library(palmerpenguins)

prop.table(table(penguins))
#Error in table(penguins) : attempt to make a table with >= 2^31 elements

sapply(penguins, function(x) if("factor" %in% class(x)) {prop.table(table(x))})
#$species
#x
# Adelie Chinstrap Gentoo
#0.4418605 0.1976744 0.3604651

#$island
#x
# Biscoe Dream Torgersen
#0.4883721 0.3604651 0.1511628

#$bill_length_mm
#NULL

#$bill_depth_mm
#NULL

#$flipper_length_mm
#NULL

#$body_mass_g
#NULL

#$sex
#x
# female male
#0.4954955 0.5045045

#$year
#NULL
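
If the NULL entries for the non-factor columns get in the way, one possible variant is to keep only the factor columns first and then apply prop.table. This is a sketch on a small made-up data frame (grp, val, and props are hypothetical names), not on penguins:

```r
# Hypothetical data: one factor column and one numeric column.
df <- data.frame(
  grp = factor(c("a", "a", "b", "c")),
  val = c(1.5, 2.0, 3.5, 4.0)
)

# Filter() drops the non-factor columns, so the result has no NULL entries.
props <- lapply(Filter(is.factor, df), function(x) prop.table(table(x)))
```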

Subset dataset with several levels of a categorical variable

You can use %in%.

This is a membership operator: give it a vector of the levels of cat.var whose rows you want to keep.

new_df <- subset(df, df$cat.var %in% c("level.1", "level.2"))

For example

df <- data.frame(fct = factor(rep(letters[1:3], times = 2)), nums = 1:6)

df

# This is our example data.frame
#   fct nums
# 1   a    1
# 2   b    2
# 3   c    3
# 4   a    4
# 5   b    5
# 6   c    6

subset(df, df$fct %in% c("a", "b"))

# Subsetting on a factor using %in% returns the following output:
#   fct nums
# 1   a    1
# 2   b    2
# 4   a    4
# 5   b    5

Note: Another option is to use the filter function from dplyr, as follows:

library(dplyr)

filter(df, fct %in% c("a", "b"))

This returns the same filtered (subsetted) dataframe.

Sampling data frames maintaining all levels of factor variables

There is nothing wrong with your code or approach; you simply do not have enough observations. A lot of groups have only 1 row, and sampling such a group with prop = 0.7 rounds the sample size down to 0. If you increase the data to 1000 rows, the same code works fine without error.

library(dplyr)
data <- tibble(y = rnorm(1000), x1 = rnorm(1000),
               x2 = sample(letters, 1000, TRUE), x3 = sample(LETTERS, 1000, TRUE))
train_data <- data %>%
  group_by(x2, x3) %>%
  slice_sample(prop = 0.7)

test_data <- data %>% anti_join(train_data)

reg <- lm(y ~ x1 + x2 + x3, train_data)
predict(reg, newdata = test_data)

If your real data has groups with as few as 1 row, you can sample each group so that it keeps the larger of 1 and 0.7 * (number of rows in the group).

train_data <- data %>% group_by(x2, x3) %>% sample_n(max(0.7*n(), 1))

(Used sample_n here since I couldn't use n() in slice_sample).
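
The same "at least one row per group" idea can also be sketched in base R, without dplyr. The data frame and the names idx_by_group, train_idx, train, and test below are made up for illustration, and the sample size is floored explicitly rather than left to implicit truncation.

```r
set.seed(1)

# Hypothetical data: group "a" has a single row.
data <- data.frame(
  y = rnorm(20),
  g = rep(c("a", "b", "c"), times = c(1, 9, 10))
)

# Row indices of each group.
idx_by_group <- split(seq_len(nrow(data)), data$g)

# From each group take max(1, floor(0.7 * group size)) rows.
# Indexing with sample.int() avoids sample()'s special case for
# a length-one vector.
train_idx <- unlist(lapply(idx_by_group, function(idx) {
  size <- max(1, floor(0.7 * length(idx)))
  idx[sample.int(length(idx), size)]
}))

train <- data[train_idx, ]
test  <- data[-train_idx, ]
```

Every group is guaranteed to appear in train, so a model fitted on train has seen every level of g.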

Select specific levels of factor in R

From your text, it seems like you want to keep only those three levels -- if that's so, then you want:

CatchbySpecies <- CatchbySpecies[CatchbySpecies$Trap %in% c("Weka", "Rat", "Stoat"), ]

Key differences from your attempt:

  1. No exclamation at the front, unless you want the other ones
  2. Wrap the list of things you do want in the concatenate function, c()
  3. Use %in% instead of ==. You don't want to compare the factor element by element against the whole vector c("Weka", "Rat", "Stoat"); you want to check whether each value is one of the elements in that vector.
  4. Add a comma at the end. The condition applies to the rows, so the comma marks where the row condition ends, and the empty slot after it means all columns are kept.
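
To make the third point concrete, here is a small sketch of how == and %in% differ. The vector trap is made-up data standing in for CatchbySpecies$Trap, and targets, by_equality, and by_membership are hypothetical names.

```r
# Hypothetical stand-in for CatchbySpecies$Trap.
trap <- factor(c("Weka", "Rat", "Stoat", "Cat", "Rat", "Weka"))
targets <- c("Weka", "Rat", "Stoat")

# == recycles targets to Weka, Rat, Stoat, Weka, Rat, Stoat and
# compares position by position, so the final "Weka" is missed.
by_equality <- trap == targets

# %in% asks, for each element of trap, whether it occurs anywhere in targets.
by_membership <- trap %in% targets

sum(by_equality)    # 4
sum(by_membership)  # 5
```

Note that == gives no warning here because the length of trap is a multiple of the length of targets, which makes the mistake easy to overlook.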

Let me know if you have questions!


EDIT: You mentioned not wanting to use droplevels() and I wasn't quite sure why, but Ben Bolker helpfully pointed out that you may want to use it after this operation, unless you want to retain the discarded levels in this variable for some reason. You can just edit the line as

CatchbySpecies <-
  droplevels(CatchbySpecies[CatchbySpecies$Trap %in% c("Weka", "Rat", "Stoat"), ])
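
To see what droplevels() changes, compare the levels before and after. The sketch below uses a small made-up data frame (catches) in place of CatchbySpecies:

```r
# Hypothetical stand-in for CatchbySpecies.
catches <- data.frame(
  Trap = factor(c("Weka", "Rat", "Stoat", "Cat", "Rat")),
  n = 1:5
)

kept <- catches[catches$Trap %in% c("Weka", "Rat", "Stoat"), ]

levels_before <- levels(kept$Trap)              # "Cat" is still a level
levels_after  <- levels(droplevels(kept)$Trap)  # unused "Cat" level removed
```

The leftover level matters mostly for functions such as table() or plotting code, which would otherwise still report an empty "Cat" category.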

