What Are R Levels

When you read a file with read.csv in R versions before 4.0.0, the stringsAsFactors argument defaults to TRUE, so string columns are converted to factors. Set it to FALSE to avoid this result (from R 4.0.0 onward, FALSE is already the default). This should work:

data = read.csv("Documents/bet/I1.csv", sep=",", stringsAsFactors=FALSE)

With stringsAsFactors = TRUE, columns (variables) in the file that contain strings are assumed to be factors. A factor is a categorical variable that can take only one of a fixed, finite set of values. Those possible categories are the levels. Factors are covered in the R Intro manual ("An Introduction to R").
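
As a minimal sketch of how a factor relates to its levels (the sizes vector and its values are made up for illustration and are not from the question):

```r
# Hypothetical data: three observations, three allowed categories
sizes <- factor(c("small", "large", "small"),
                levels = c("small", "medium", "large"))

levels(sizes)      # the fixed set of possible categories
nlevels(sizes)     # how many levels there are
as.integer(sizes)  # each value is stored as an index into levels(sizes)
table(sizes)       # "medium" is counted (as 0) even though it never occurs
```

The last call shows why levels matter: a category can exist as a level without appearing in the data.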

In addition, since you are using read.csv, passing sep="," is redundant. It doesn't harm anything, but read.csv already uses the comma as its default separator.
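
The effect of stringsAsFactors can be checked without a file on disk. The sketch below assumes a small made-up CSV passed through the text argument; the column names team and goals are hypothetical, and sep is left at its default:

```r
# Hypothetical CSV content supplied inline instead of a file path
csv_text <- "team,goals\nArsenal,2\nChelsea,1\nArsenal,3"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$team)   # strings stay character
class(as_fct$team)   # strings become a factor
levels(as_fct$team)  # the distinct strings, sorted
```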

How is levels() different from labels() with respect to factors in R?

A label gives a variable a readable, descriptive name. Calculations use the variable name, not the label. For example, the variable los can have the label "Length of stay".

Levels matter for factor variables. Each distinct value a factor variable can take is a level.
For example, the factor variable eye_color can have the levels "blue", "brown" and "green".
So a factor variable may have a label, but it always has levels (at least one, unless the factor is empty or entirely NA).
Example:
The factor variable eye_color may have the label "Eye Color" and has the three levels mentioned above ("blue", "brown" and "green").
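
A rough sketch of the distinction, assuming the variable label is stored in a "label" attribute. That is a convention used by packages such as Hmisc and haven, not something base R defines, and the data values here are made up:

```r
# Hypothetical observations of eye colour
eye_color <- factor(c("blue", "brown", "green", "brown"))

# Assumption: keep the variable label in a "label" attribute
attr(eye_color, "label") <- "Eye Color"

levels(eye_color)         # the categories the factor can take
attr(eye_color, "label")  # the descriptive name of the variable
table(eye_color)          # computed from the levels; the label plays no role
```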

Why is levels() not assigning the wrong level to my data?

You can use factor() instead of as.factor() and set the levels directly, in the order you want:

data$state_factor <- factor(data$state, levels=c("Not Started", "Just Beginning",
                                                    "25% Complete", "40% Complete", "Halfway Done",
                                                    "75% Complete", "Mostly Done", "Completed",
                                                    "Follow Up", "Final Follow Up"))

Output:

> head(data, 20)  
# A tibble: 20 × 4
   group state           count state_factor   
   <fct> <chr>           <dbl> <fct>          
 1 A     Not Started       100 Not Started    
 2 A     Just Beginning      5 Just Beginning 
 3 A     25% Complete        4 25% Complete   
 4 A     40% Complete      445 40% Complete   
 5 A     Halfway Done       67 Halfway Done   
 6 A     75% Complete       44 75% Complete   
 7 A     Mostly Done        25 Mostly Done    
 8 A     Completed         877 Completed      
 9 A     Follow Up         240 Follow Up      
10 A     Final Follow Up   353 Final Follow Up
11 B     Not Started        48 Not Started    
12 B     Just Beginning     51 Just Beginning 
13 B     25% Complete       48 25% Complete   
14 B     40% Complete       40 40% Complete   
15 B     Halfway Done      141 Halfway Done   
16 B     75% Complete       34 75% Complete   
17 B     Mostly Done        50 Mostly Done    
18 B     Completed          45 Completed      
19 B     Follow Up          34 Follow Up      
20 B     Final Follow Up    35 Final Follow Up

The levels now follow the order given in the call rather than alphabetical order:

> levels(data$state_factor)
 [1] "Not Started"     "Just Beginning"  "25% Complete"    "40% Complete"    "Halfway Done"    "75% Complete"    "Mostly Done"     "Completed"      
 [9] "Follow Up"       "Final Follow Up"
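
One way the wrong level can end up attached to the data is assigning to levels() on a factor that as.factor() has already sorted alphabetically. Whether that is what happened in the question is an assumption; the sketch uses a shortened, made-up state vector to show the difference:

```r
state <- c("Not Started", "Halfway Done", "Completed", "Halfway Done")
wanted <- c("Not Started", "Halfway Done", "Completed")

f_alpha <- as.factor(state)
levels(f_alpha)            # sorted: "Completed" "Halfway Done" "Not Started"

# Pitfall: this renames the levels position by position; it does not reorder
f_renamed <- f_alpha
levels(f_renamed) <- wanted
as.character(f_renamed)    # values no longer match `state`

# factor(x, levels = ...) reorders the levels and leaves the values alone
f_ordered <- factor(state, levels = wanted)
as.character(f_ordered)    # identical to `state`
```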

Transition from R to Python: Where did my levels go?

R's levels() function lists all possible values of a factor, even those that are not present in the data frame after subsetting. A pandas column of plain strings (object dtype) doesn't behave in this way, because it carries no fixed set of categories.

> df <- data.table(moreLabels = c('D', 'E', 'F'), numbers = c(1, 2, 3))
> df[, moreLabels := as.factor(moreLabels)]
> df[, levels(moreLabels)]
[1] "D" "E" "F"

> df[numbers > 1, ]  # if we subset, we only see values "E" and "F"
   moreLabels numbers
1:          E       2
2:          F       3

> df[numbers > 1, levels(moreLabels)]
[1] "D" "E" "F"  # even though we would expect only "E" and "F"

If you are looking for the unique values that actually appear in the column, use the pd.Series.unique() method.

>>> df['moreLabels'].unique()
['D', 'E', 'F']
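
On the R side, the same "only what is present" view can be obtained with droplevels() or unique(). The sketch below uses a base data.frame instead of data.table so that it runs without extra packages; that substitution is mine, not part of the original answer:

```r
df <- data.frame(moreLabels = factor(c("D", "E", "F")),
                 numbers = c(1, 2, 3))
sub <- df[df$numbers > 1, ]

levels(sub$moreLabels)                # still lists all three levels
levels(droplevels(sub$moreLabels))    # unused level "D" removed
as.character(unique(sub$moreLabels))  # values that actually occur
```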

R: Get ranking of factor levels by group

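Use dplyr::dense_rank, or as.numeric(factor(Days, ordered = TRUE)) in base R. The second works because a factor's levels are the sorted distinct values, so the integer code of each value is its dense rank: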

library(dplyr)

# Rank Days within each Number group; equal Days share the same rank
df %>% 
  group_by(Number) %>% 
  mutate(Ranking = dense_rank(Days),
         Ranking2 = as.numeric(factor(Days, ordered = TRUE)))

Output:

# A tibble: 15 × 4
# Groups:   Number [3]
   Number  Days Ranking Ranking2
    <dbl> <dbl>   <int>    <dbl>
 1      1     5       1        1
 2      1     5       1        1
 3      1    10       2        2
 4      1    10       2        2
 5      1    15       3        3
 6      2     3       1        1
 7      2     3       1        1
 8      2     3       1        1
 9      2     5       2        2
10      2     5       2        2
11      3    11       1        1
12      3    11       1        1
13      3    13       2        2
14      3    13       2        2
15      3    13       2        2
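
If dplyr is not available, the same grouped ranking can be sketched in base R with ave(). The data frame below is rebuilt from the output table above; using ave() is my substitution, not part of the original answer:

```r
df <- data.frame(
  Number = rep(1:3, each = 5),
  Days   = c(5, 5, 10, 10, 15,  3, 3, 3, 5, 5,  11, 11, 13, 13, 13)
)

# Within each Number group, convert Days to a factor and take its integer codes
df$Ranking <- ave(df$Days, df$Number,
                  FUN = function(x) as.numeric(factor(x)))
df
```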

Factor levels by group

A data.table solution:

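library(data.table)

# Bin heights into the half-open intervals [0,165), [165,180), [180,300)
dt[, height_cat := cut(Height, breaks = c(0, 165, 180, 300), right = FALSE)]
# Build "Sex:height_cat" labels; CJ() enumerates every combination so that
# groups with no observations still appear as levels
dt[, height_f := 
       factor(
         paste(Sex, height_cat, sep = ":"), 
         levels = dt[, CJ(Sex, height_cat, unique = TRUE)][, paste(Sex, height_cat, sep = ":")]
       )]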

table(dt$height_f)
# F:[0,165) F:[165,180) F:[180,300)   M:[0,165) M:[165,180) M:[180,300) 
#         2           2           0           0           2           2
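
A base R sketch of the same idea uses interaction(), which keeps every Sex and height-bin combination as a level. The heights below are hypothetical values chosen only to reproduce the counts shown above; the original data is not given:

```r
# Hypothetical data matching the counts in the table above
d <- data.frame(
  Sex    = factor(c("F", "F", "F", "F", "M", "M", "M", "M")),
  Height = c(150, 160, 170, 175, 170, 178, 185, 190)
)

d$height_cat <- cut(d$Height, breaks = c(0, 165, 180, 300), right = FALSE)

# lex.order = TRUE makes Sex vary slowest, so all F levels precede all M levels
d$height_f <- interaction(d$Sex, d$height_cat, sep = ":", lex.order = TRUE)

table(d$height_f)  # empty combinations are kept with a count of 0
```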