Why does subsetting a column from a data frame vs. a tibble give different results
The underlying reason is that subsetting a tbl and a data frame produces different results when only one column is selected.
By default, [.data.frame will drop the dimensions if the result has only 1 column, similar to how matrix subsetting works, so the result is a vector. [.tbl_df will never drop dimensions like this; it always returns a tbl.
In turn, as.character ignores the class of a tbl, treating it as a plain list. And as.character called on a list acts like deparse: the character representation it returns is R code that can be parsed and executed to reproduce the list.
The tbl behaviour is arguably the right thing to do in most circumstances, because dropping dimensions can easily lead to bugs: subsetting a data frame usually results in another data frame, but sometimes it doesn't. In this specific case it doesn't do what you want.
If you want to extract a column from a tbl as a vector, you can use list-style indexing: urls[[4]] or urls$final_domain.
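A minimal sketch of the as.character behaviour described above, using base R only. The data are hypothetical (only the column name final_domain comes from the answer), and drop = FALSE is used here to mimic what a tibble does when a single column is selected with [.

```r
# Hypothetical data; only the column name final_domain comes from the text above.
urls <- data.frame(final_domain = c("a.com", "b.org"), stringsAsFactors = FALSE)

as_vector <- urls[, 1]                # default: dimensions dropped, character vector
as_frame  <- urls[, 1, drop = FALSE]  # one-column data frame, as a tibble would return

chr_vec   <- as.character(as_vector)  # one string per element
chr_frame <- as.character(as_frame)   # one deparse-style string for the whole column

length(chr_vec)
length(chr_frame)

# List-style indexing returns the plain vector either way.
identical(urls[[1]], urls$final_domain)
```

The lengths differ because the second call converts each list element (here, the whole column) into a single string.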
Difference between tibbles and data frames when using a self-written function
The main difference lies in how data frames and tibbles handle selecting a single column with [ by default.
When you select a single column from a data frame with [, it drops its dimensions and you get a vector back.
df1 <- data.frame(a, b)
df1[, 1]
#[1] "abc"
class(df1[, 1])
#[1] "character"
When you select a single column from a tibble, you get a tibble back.
df1 <- as_tibble(data.frame(a, b))
df1[, 1]
# A tibble: 1 x 1
# a
# <chr>
#1 abc
class(df1[, 1])
#[1] "tbl_df" "tbl" "data.frame"
Hence, for data frames the extracted columns carry no names that could be appended to the final output, so you get only the list names back.
df1 <- data.frame(a, b)
df2 <- data.frame(c, d)
list1 <- list(hello = df1, bye = df2)
sapply(list1, identical_length_values)
#hello bye
# TRUE FALSE
However, for tibbles the column names are still there, so they get appended to the list names in the result.
df1 <- as_tibble(data.frame(a, b))
df2 <- as_tibble(data.frame(c, d))
list1 <- list(hello = df1, bye = df2)
sapply(list1, identical_length_values)
#hello.a bye.c
# TRUE FALSE
If you use [[ instead of [, you always get a vector back and avoid this issue.
identical_length_values <- function(x) {
  nchar(x[[1]]) == nchar(x[[2]])
}
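As a rough check of the [[ version, the sketch below runs it on both data frames and tibbles. The one-row values are hypothetical stand-ins for a, b, c and d, which are not defined in the answer, and the tibble package is assumed to be installed.

```r
library(tibble)  # assumed to be installed

# Hypothetical one-row inputs standing in for a, b, c and d
df1 <- data.frame(a = "abc", b = "def", stringsAsFactors = FALSE)
df2 <- data.frame(c = "abc", d = "de", stringsAsFactors = FALSE)

identical_length_values <- function(x) {
  # [[ returns a plain vector for data frames and tibbles alike
  nchar(x[[1]]) == nchar(x[[2]])
}

res_df  <- sapply(list(hello = df1, bye = df2), identical_length_values)
res_tbl <- sapply(list(hello = as_tibble(df1), bye = as_tibble(df2)),
                  identical_length_values)

names(res_df)
names(res_tbl)
identical(res_df, res_tbl)
```

Because the function never hands a one-column tibble to nchar, no column names are carried into the result, and both calls are named by the list elements only.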
Selecting a single column from a tibble still returns a tibble instead of a vector
Try pull() from dplyr, which extracts a single column as a vector:
sen <- df %>%
filter(my_dummy == 0) %>%
pull(col_name)
How to deal with tibbles when subsetting/indexing dataframe column in R?
Indexing a tibble is the same as indexing a data.frame, except that data.frames attempt to return the lowest possible dimension, hence the following difference:
library(tibble)
df = data.frame(Measurement = c(2752,2756,2756,2740,2724,2536,2796,2800))
df_tib = as_tibble(df)  # as.tibble() is deprecated in favour of as_tibble()
index = c(2,3,6,7)
Indexing dataframe:
df[index,]
# [1] 2756 2756 2536 2796
df_tib[index,]
# A tibble: 4 x 1
# Measurement
# <dbl>
# 1 2756
# 2 2756
# 3 2536
# 4 2796
Notice that df[index,] is coerced to a vector after indexing because [.data.frame sees that the result has only one column. A tibble does not make this coercion. To keep the data frame structure, you can use drop=FALSE:
df[index,, drop=FALSE]
# Measurement
# 2 2756
# 3 2756
# 6 2536
# 7 2796
To get a vector after indexing, you actually want to index the column Measurement. This is done in exactly the same way with either a data.frame or a tibble:
df$Measurement[index]
# [1] 2756 2756 2536 2796
df_tib$Measurement[index]
# [1] 2756 2756 2536 2796
Structure of variables not recognised when dataframe is a tibble
The answer is to use [[ to subset columns from a tibble or a data frame, which gives consistent results. To differentiate between the two, let's call the tibble variable df_tib and the data frame variable df_dat.
df_tib <- df
df_dat <- data.frame(df)
is.numeric(df_tib[['a']])
#[1] TRUE
is.numeric(df_dat[['a']])
#[1] TRUE
is.factor(df_tib[['b']])
#[1] TRUE
is.factor(df_dat[['b']])
#[1] TRUE
The issue occurs because data frames and tibbles react differently when subset with [.
df_tib[, 'a']
# A tibble: 5 x 1
# a
# <dbl>
#1 -0.6
#2 -0.2
#3 1.6
#4 0.1
#5 0.1
df_dat[, 'a']
#[1] -0.6 -0.2 1.6 0.1 0.1
df_tib returns a tibble when you subset with [, whereas df_dat returns a vector because only a single column is selected. is.factor and is.numeric always return FALSE on a data frame or tibble object, so the checks fail for the tibble.
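The point about is.numeric and is.factor can be reproduced in base R alone. In this sketch the data are hypothetical, and drop = FALSE stands in for the one-column object that df_tib[, 'a'] would return.

```r
# Hypothetical data with one numeric and one factor column
df_dat <- data.frame(a = c(-0.6, -0.2, 1.6), b = factor(c("x", "y", "x")))

# One-column data frame, comparable to what a tibble returns for [, "a"]
one_col <- df_dat[, "a", drop = FALSE]

frame_is_numeric  <- is.numeric(one_col)         # the container is a list, not numeric
column_is_numeric <- is.numeric(one_col[["a"]])  # the column itself is numeric
column_is_factor  <- is.factor(df_dat[["b"]])

# Checking every column in one pass
col_is_numeric <- vapply(df_dat, is.numeric, logical(1))
```

Using vapply over the columns avoids [ altogether, so it behaves the same for data frames and tibbles.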
What can a data frame do that a tibble cannot?
In "The trouble with tibbles", you can read:
there isn’t really any trouble with tibbles
However,
Some older packages don’t work with tibbles because of their alternative subsetting method. They expect tib[,1] to return a vector, when in fact it will now return another tibble.
This is what @Henrik pointed out in comments.
As an example, the length function won't return the same result:
library(tibble)
tibblecars <- as_tibble(mtcars)
tibblecars[,"cyl"]
#> # A tibble: 32 x 1
#> cyl
#> <dbl>
#> 1 6
#> 2 6
#> 3 4
#> 4 6
#> 5 8
#> 6 6
#> 7 8
#> 8 4
#> 9 4
#> 10 6
#> # ... with 22 more rows
length(tibblecars[,"cyl"])
#> [1] 1
mtcars[,"cyl"]
#> [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
length(mtcars[,"cyl"])
#> [1] 32
Another example: base::reshape not working with tibbles.
"Invariants for subsetting and subassignment" explains where the behaviour of tibble differs from that of data.frame.
Given these known limitations, the solution Hadley gives for interacting with legacy code is:
A handful of functions don’t work with tibbles because they expect df[, 1] to return a vector, not a data frame. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data frame:
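A short sketch of that conversion, reusing the mtcars example from above. It assumes the tibble package is installed; the variable names plaincars, cyl_tbl and cyl_df are illustrative only.

```r
library(tibble)  # assumed to be installed

tibblecars <- as_tibble(mtcars)
plaincars  <- as.data.frame(tibblecars)  # back to a plain data frame

cyl_tbl <- tibblecars[, "cyl"]  # still a one-column tibble
cyl_df  <- plaincars[, "cyl"]   # dropped to a numeric vector

length(cyl_tbl)  # counts columns
length(cyl_df)   # counts elements
class(plaincars)
```

Converting once, right before calling the legacy function, keeps the rest of the pipeline on tibbles.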