Why Does Subsetting a Column from a Data Frame vs. a Tibble Give Different Results?

The underlying reason is that subsetting a tbl and a data frame produces different results when only one column is selected.

  • By default, [.data.frame will drop the dimensions if the result has only 1 column, similar to how matrix subsetting works. So the result is a vector.
  • [.tbl_df will never drop dimensions like this; it always returns a tbl.

In turn, as.character ignores the class of a tbl, treating it as a plain list. And as.character called on a list acts like deparse: the character representation it returns is R code that can be parsed and executed to reproduce the list.

The tbl behaviour is arguably the right thing to do in most circumstances, because dropping dimensions can easily lead to bugs: subsetting a data frame usually results in another data frame, but sometimes it doesn't. In this specific case it doesn't do what you want.

If you want to extract a column from a tbl as a vector, you can use list-style indexing: urls[[4]] or urls$final_domain.
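The behaviour described above can be reproduced with a small sketch. The urls tibble here is hypothetical data made up for illustration; only the column name final_domain comes from the text, and the tibble package is assumed to be installed.

```r
library(tibble)

# Hypothetical data; only the column name final_domain comes from the text above
urls <- tibble(
  id = 1:2,
  final_domain = c("example.com", "example.org")
)

one_col <- urls[, "final_domain"]  # single-column tibble, dimensions are kept
as_vec  <- urls[["final_domain"]]  # plain character vector

is_tibble(one_col)    # TRUE
is.character(as_vec)  # TRUE

# as.character() on the vector yields one string per row, while on the
# one-column tibble it yields a single string holding deparsed R code
length(as.character(as_vec))   # 2
length(as.character(one_col))  # 1
```

If the goal is a character vector, extracting with [[ or $ first and converting afterwards avoids the deparsed output.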

Difference between tibbles and data frames when using a self-written function

The main difference lies in how data frames and tibbles handle selecting a single column with [ by default.

When you select a single column from a data frame with [, it drops its dimensions and you get a vector back.

df1 <- data.frame(a, b)
df1[, 1]
#[1] "abc"

class(df1[, 1])
#[1] "character"

When you select a single column from a tibble with [, you get a tibble back.

df1 <- as_tibble(data.frame(a, b))
df1[, 1]
# A tibble: 1 x 1
# a
# <chr>
#1 abc
class(df1[, 1])
#[1] "tbl_df" "tbl" "data.frame"

Hence, for data frames the extracted vector carries no column names, so nothing is appended in the final output and you get only the list names back.

df1 <- data.frame(a, b)
df2 <- data.frame(c, d)
list1 <- list(hello = df1, bye = df2)
sapply(list1, identical_length_values)
#hello bye
# TRUE FALSE

However, for tibbles the column names are still there, so they get appended to the list names in the result.

df1 <- as_tibble(data.frame(a, b))
df2 <- as_tibble(data.frame(c, d))
list1 <- list(hello = df1, bye = df2)
sapply(list1, identical_length_values)
#hello.a bye.c
# TRUE FALSE

If you use [[ instead of [, you always get a vector back and will not face this issue.

identical_length_values <- function(x) {
  nchar(x[[1]]) == nchar(x[[2]])
}
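As a quick check, the [[ version can be run on both classes. The values below are hypothetical, since the original a, b, c and d are not shown; stringsAsFactors = FALSE is set so the sketch also behaves the same on R versions before 4.0.

```r
library(tibble)

identical_length_values <- function(x) {
  nchar(x[[1]]) == nchar(x[[2]])
}

# Hypothetical values; the originals are not shown in the text
df1 <- data.frame(a = "abc", b = "xyz", stringsAsFactors = FALSE)
df2 <- data.frame(c = "ab", d = "wxyz", stringsAsFactors = FALSE)

res_df <- sapply(list(hello = df1, bye = df2), identical_length_values)
res_tib <- sapply(
  list(hello = as_tibble(df1), bye = as_tibble(df2)),
  identical_length_values
)

names(res_df)               # "hello" "bye"
identical(res_df, res_tib)  # TRUE: same values and same names for both classes
```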

Selecting a single column from a tibble still returns a tibble instead of a vector

Try pull() from dplyr, which extracts a single column as a vector:

sen <- df %>%
  filter(my_dummy == 0) %>%
  pull(col_name)
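A self-contained version of the same idea, assuming dplyr is installed. The data is made up for illustration; df, my_dummy and col_name are the placeholder names used in the snippet above.

```r
library(dplyr)

# Hypothetical data using the placeholder names from the snippet above
df <- tibble(
  my_dummy = c(0, 1, 0),
  col_name = c("x", "y", "z")
)

sen <- df %>%
  filter(my_dummy == 0) %>%
  pull(col_name)

is.character(sen)  # TRUE: a plain vector, not a tibble

# Equivalent extraction with [[ after filtering rows
identical(sen, df[df$my_dummy == 0, ][["col_name"]])  # TRUE
```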

How to deal with tibbles when subsetting/indexing dataframe column in R?

Indexing a tibble is the same as indexing data.frames, except for the fact that data.frames attempt to return the lowest possible dimension, hence the following difference:

library(tibble)
df = data.frame(Measurement = c(2752,2756,2756,2740,2724,2536,2796,2800))
df_tib = as_tibble(df)  # as.tibble() is deprecated in favour of as_tibble()

index = c(2,3,6,7)

Indexing the data frame and the tibble:

df[index,]
# [1] 2756 2756 2536 2796

df_tib[index,]
# A tibble: 4 x 1
# Measurement
# <dbl>
# 1 2756
# 2 2756
# 3 2536
# 4 2796

Notice that df[index,] is coerced to a vector after indexing because the data frame has only one column. A tibble does not make this coercion. To override this behaviour for the data frame, you can use drop=FALSE:

df[index,, drop=FALSE]
# Measurement
# 2 2756
# 3 2756
# 6 2536
# 7 2796

To get a vector after indexing, you actually want to index the column Measurement. This is done exactly the same with either data.frame or tibble:

df$Measurement[index]
# [1] 2756 2756 2536 2796

df_tib$Measurement[index]
# [1] 2756 2756 2536 2796

Structure of variables not recognised when dataframe is a tibble

The answer is to use [[ to extract columns from a tibble or a data frame, which gives consistent results. To tell the two apart, let's call the tibble df_tib and the data frame df_dat.

df_tib <- df
df_dat <- data.frame(df)

is.numeric(df_tib[['a']])
#[1] TRUE
is.numeric(df_dat[['a']])
#[1] TRUE

is.factor(df_tib[['b']])
#[1] TRUE
is.factor(df_dat[['b']])
#[1] TRUE

The issue arises from how data frames and tibbles behave when subset with [.

df_tib[, 'a']

# A tibble: 5 x 1
# a
# <dbl>
#1 -0.6
#2 -0.2
#3 1.6
#4 0.1
#5 0.1

df_dat[, 'a']
#[1] -0.6 -0.2 1.6 0.1 0.1

df_tib[, 'a'] returns a tibble, whereas df_dat[, 'a'] returns a vector because only a single column is selected. is.factor and is.numeric always return FALSE when called on a data frame or tibble object.
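A minimal sketch of this difference follows. Because the original df is not shown, the data here is a hypothetical stand-in with a numeric column a and a factor column b.

```r
library(tibble)

# Hypothetical stand-in for df, which is not shown in the text
df_tib <- tibble(a = c(-0.6, -0.2, 1.6), b = factor(c("x", "y", "x")))
df_dat <- as.data.frame(df_tib)

is.numeric(df_tib[, "a"])  # FALSE: the argument is a one-column tibble
is.numeric(df_dat[, "a"])  # TRUE: the argument is a numeric vector
is.numeric(df_tib[["a"]])  # TRUE: [[ extracts the vector for both classes

# Checking every column at once behaves the same for both classes,
# because vapply() iterates over the columns themselves
col_is_numeric_tib <- vapply(df_tib, is.numeric, logical(1))
col_is_numeric_dat <- vapply(df_dat, is.numeric, logical(1))
```

Iterating over columns with vapply() or extracting with [[ sidesteps the question of what [ returns.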

What can a data frame do that a tibble cannot?

From "the trouble with tibbles", you can read:

there isn’t really any trouble with tibbles

However,

Some older packages don’t work with tibbles because of their alternative subsetting method. They expect tib[,1] to return a vector, when in fact it will now return another tibble.

This is what @Henrik pointed out in comments.

As an example, the length function won't return the same result:

library(tibble)
tibblecars <- as_tibble(mtcars)
tibblecars[,"cyl"]
#> # A tibble: 32 x 1
#> cyl
#> <dbl>
#> 1 6
#> 2 6
#> 3 4
#> 4 6
#> 5 8
#> 6 6
#> 7 8
#> 8 4
#> 9 4
#> 10 6
#> # ... with 22 more rows
length(tibblecars[,"cyl"])
#> [1] 1
mtcars[,"cyl"]
#> [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
length(mtcars[,"cyl"])
#> [1] 32

Another example:

  • base::reshape not working with tibbles

The tibble vignette "Invariants for subsetting and subassignment" explains where tibble's behaviour differs from data.frame's.

These limitations being known, the solution given by Hadley in interacting with legacy code is:

A handful of functions don’t work with tibbles because they expect df[, 1] to return a vector, not a data frame. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data frame:
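The snippet that followed the quoted advice is not reproduced here, so the following is only a minimal sketch of that advice, reusing the mtcars example from above. The name legacy_cars is made up for illustration.

```r
library(tibble)

tibblecars <- as_tibble(mtcars)

# Convert back before handing the data to code that expects df[, 1] to be a vector
legacy_cars <- as.data.frame(tibblecars)

class(legacy_cars)            # "data.frame"
length(legacy_cars[, "cyl"])  # 32: a vector with one element per row
length(tibblecars[, "cyl"])   # 1: a tibble with one column
```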


