What Are R Levels

What are R levels?

When you use read.csv to read a file in R versions before 4.0.0, the argument stringsAsFactors defaults to TRUE (since R 4.0.0 the default is FALSE). Set it to FALSE explicitly to avoid this result. This should work:

data = read.csv("Documents/bet/I1.csv", sep=",", stringsAsFactors=FALSE)

When stringsAsFactors is TRUE, columns (variables) in the file that contain strings are converted to factors. A factor is a categorical variable that can take only one of a fixed, finite set of values. Those possible categories are the levels. Factors are covered in more detail in the manual An Introduction to R.
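As a minimal sketch of what a factor and its levels look like (the vector sizes is a made-up example, not part of the original question):

```r
# Hypothetical example vector; the values are made up for illustration.
sizes <- c("small", "large", "medium", "small")

sizes_f <- factor(sizes)

levels(sizes_f)      # the fixed set of categories, sorted alphabetically
nlevels(sizes_f)     # how many categories there are
as.integer(sizes_f)  # the integer codes that index into levels()
```

Internally a factor stores only the integer codes; the levels are the lookup table that turns those codes back into strings.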

In addition, since you are using read.csv, adding the sep="," is redundant. It doesn't harm anything, but the comma is taken as the separator by default.
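The effect of stringsAsFactors can be checked without a file on disk. The sketch below assumes a small inline CSV (csv_text and its contents are hypothetical) and relies on the default comma separator:

```r
# Hypothetical inline CSV standing in for a file on disk.
csv_text <- "team,goals\nAjax,2\nPSV,1\nAjax,3"

# No sep argument is needed: read.csv splits on commas by default.
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$team)   # "factor"
levels(as_factor$team)  # "Ajax" "PSV"
class(as_char$team)     # "character"
levels(as_char$team)    # NULL: character vectors have no levels
```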

How is levels() different from labels() with respect to factors in R?

A label gives a variable a readable, descriptive name. Calculations use the variable name, not the label of the variable. For example, the variable los could have the label "Length of stay".

Levels matter for factor variables. Each distinct category that a factor variable can take is a level.
For example, the factor variable eye_color can have the levels "blue", "brown" and "green".
So a factor variable may have a label, but it always has a set of levels.
Example:
The factor variable eye_color may have the label "Eye Color" and has the three levels mentioned above ("blue", "brown", and "green").
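A rough sketch of the difference, using the eye_color example. Base R has no dedicated variable-label mechanism, so storing the label in a "label" attribute is an assumed convention here, not something the original answer prescribes:

```r
eye_color <- factor(c("brown", "blue", "green", "brown"))

# A variable label is only descriptive metadata. Keeping it in a "label"
# attribute is an assumed convention, used here purely for illustration.
attr(eye_color, "label") <- "Eye Color"

levels(eye_color)         # "blue" "brown" "green"
attr(eye_color, "label")  # "Eye Color"
table(eye_color)          # counts are computed per level, not per label
```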

Why is levels() assigning the wrong level to my data?

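Update:

Note that assigning to levels() after as.factor() only renames the existing levels, which as.factor() has sorted alphabetically; it does not reorder them. Instead, you could use factor rather than as.factor and set the levels directly: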

data$state_factor <- factor(data$state, levels=c("Not Started", "Just Beginning",
"25% Complete", "40% Complete", "Halfway Done",
"75% Complete", "Mostly Done", "Completed",
"Follow Up", "Final Follow Up"))

Output:

> head(data, 20)
# A tibble: 20 × 4
   group state           count state_factor
   <fct> <chr>           <dbl> <fct>
 1 A     Not Started       100 Not Started
 2 A     Just Beginning      5 Just Beginning
 3 A     25% Complete        4 25% Complete
 4 A     40% Complete      445 40% Complete
 5 A     Halfway Done       67 Halfway Done
 6 A     75% Complete       44 75% Complete
 7 A     Mostly Done        25 Mostly Done
 8 A     Completed         877 Completed
 9 A     Follow Up         240 Follow Up
10 A     Final Follow Up   353 Final Follow Up
11 B     Not Started        48 Not Started
12 B     Just Beginning     51 Just Beginning
13 B     25% Complete       48 25% Complete
14 B     40% Complete       40 40% Complete
15 B     Halfway Done      141 Halfway Done
16 B     75% Complete       34 75% Complete
17 B     Mostly Done        50 Mostly Done
18 B     Completed          45 Completed
19 B     Follow Up          34 Follow Up
20 B     Final Follow Up    35 Final Follow Up

Now the levels follow the order you specified instead of alphabetical order:

> levels(data$state_factor)
[1] "Not Started" "Just Beginning" "25% Complete" "40% Complete" "Halfway Done" "75% Complete" "Mostly Done" "Completed"
[9] "Follow Up" "Final Follow Up"
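To see why setting levels through factor() matters, here is a small sketch contrasting it with assigning to levels() afterwards. The three-value vector state is a shortened, hypothetical version of the column above:

```r
# Hypothetical three-value column, shortened from the states above.
state  <- c("Not Started", "Halfway Done", "Completed", "Halfway Done")
wanted <- c("Not Started", "Halfway Done", "Completed")

# as.factor() sorts levels alphabetically: Completed, Halfway Done, Not Started
renamed <- as.factor(state)
levels(renamed) <- wanted   # relabels positions 1..3, so the data change

# factor(levels = ...) keeps each value and only fixes the level order
ordered_ok <- factor(state, levels = wanted)

as.character(renamed)     # no longer matches `state`
as.character(ordered_ok)  # identical to `state`
```

In the first variant "Not Started" and "Completed" silently swap, because the assignment replaces level names by position rather than by value.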

Transition from R to Python: Where did my levels go?

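R's levels() function lists all possible values of a factor, even if some of those values are not present in the data (for example, after subsetting). A pandas column of plain strings doesn't behave this way, because it has no fixed set of categories (a pandas categorical column does keep its categories, available via .cat.categories).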

> df <- data.table(moreLabels = c('D', 'E', 'F'), numbers = c(1, 2, 3))
> df[, moreLabels := as.factor(moreLabels)]
> df[, levels(moreLabels)]
[1] "D" "E" "F"

> df[numbers > 1, ] # if we subset, we only see values "E" and "F"
moreLabels numbers
1: E 2
2: F 3

> df[numbers > 1, levels(moreLabels)]
[1] "D" "E" "F" # even though we would expect only "E" and "F"

If you are looking for the unique values that actually appear in the column, use the pd.Series.unique() method.

>>> df['moreLabels'].unique()
['D', 'E', 'F']
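For comparison, the same "only the values that are present" result can be obtained on the R side. This is a sketch in base R, using a plain data.frame instead of the data.table from the example above:

```r
# Base R version of the example above (a data.frame instead of a data.table).
df  <- data.frame(moreLabels = factor(c("D", "E", "F")), numbers = c(1, 2, 3))
sub <- df[df$numbers > 1, ]

levels(sub$moreLabels)                # "D" "E" "F": unused level "D" is kept
as.character(unique(sub$moreLabels))  # "E" "F": values actually present
levels(droplevels(sub$moreLabels))    # "E" "F": unused levels removed
```

droplevels() is the usual choice when the subset should forget categories it no longer contains.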

R: Get ranking of factor levels by group

Use dplyr::dense_rank, or as.numeric(factor(Days, ordered = TRUE)) in base R:

library(dplyr)

df %>%
  group_by(Number) %>%
  mutate(Ranking  = dense_rank(Days),
         Ranking2 = as.numeric(factor(Days, ordered = TRUE)))

output

# A tibble: 15 × 4
# Groups:   Number [3]
   Number  Days Ranking Ranking2
    <dbl> <dbl>   <int>    <dbl>
 1      1     5       1        1
 2      1     5       1        1
 3      1    10       2        2
 4      1    10       2        2
 5      1    15       3        3
 6      2     3       1        1
 7      2     3       1        1
 8      2     3       1        1
 9      2     5       2        2
10      2     5       2        2
11      3    11       1        1
12      3    11       1        1
13      3    13       2        2
14      3    13       2        2
15      3    13       2        2
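If you want to avoid dplyr entirely, the factor trick can be applied per group with ave(). This is a sketch of one possible base R route, with the data frame reconstructed from the printed output above:

```r
# Data reconstructed from the printed output above.
df <- data.frame(
  Number = rep(1:3, each = 5),
  Days   = c(5, 5, 10, 10, 15, 3, 3, 3, 5, 5, 11, 11, 13, 13, 13)
)

# Within each Number group, convert Days to a factor; its integer codes are
# the dense ranks because factor() sorts the distinct values.
df$Ranking <- ave(df$Days, df$Number,
                  FUN = function(x) as.integer(factor(x)))
df$Ranking
```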

Factor levels by group

A data.table solution:

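library(data.table)  # dt is assumed to be a data.table with Sex and Height columns

# Bin Height into left-closed intervals
dt[, height_cat := cut(Height, breaks = c(0, 165, 180, 300), right = FALSE)]

# CJ() builds every Sex x height_cat combination, so unused ones become levels too
dt[, height_f := factor(
  paste(Sex, height_cat, sep = ":"),
  levels = dt[, CJ(Sex, height_cat, unique = TRUE)][, paste(Sex, height_cat, sep = ":")]
)]

table(dt$height_f)
#   F:[0,165) F:[165,180) F:[180,300)   M:[0,165) M:[165,180) M:[180,300)
#           2           2           0           0           2           2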
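A base R alternative is interaction(), which also creates a level for every combination. The sketch below uses hypothetical data (the data frame d and its heights are made up so that the counts match the table above):

```r
# Hypothetical data chosen to match the counts in the table above.
d <- data.frame(
  Sex    = c("F", "F", "F", "F", "M", "M", "M", "M"),
  Height = c(150, 160, 170, 175, 170, 178, 185, 190)
)
d$height_cat <- cut(d$Height, breaks = c(0, 165, 180, 300), right = FALSE)

# interaction() creates a level for every Sex x height_cat combination,
# including combinations that never occur; lex.order = TRUE sorts by Sex first.
d$height_f <- interaction(d$Sex, d$height_cat, sep = ":", lex.order = TRUE)

table(d$height_f)
```

Empty combinations such as F:[180,300) still appear with a count of 0, which is the point of defining the levels up front.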

