Retain Attributes When Using Gather from Tidyr (Attributes Are Not Identical)

You could just convert your dates to character then convert them back to dates at the end:

library(dplyr)
library(tidyr)
library(lubridate)  # for ymd()

(person <- df %>%
  select(hh_id, bday_01:gender_02) %>%
  mutate_each(funs(as.character), contains('bday')) %>%
  gather(key, value, -hh_id) %>%
  separate(key, c("key", "per_num"), sep = "_") %>%
  spread(key, value) %>%
  mutate(bday=ymd(bday)))

  hh_id per_num       bday gender
1     1      01 2015-03-09      M
2     1      02 1985-09-11      F
3     2      01 1989-02-11      F
4     2      02 2000-08-15      F

Alternatively, if you use Date instead of POSIXct, you could do something like this:

library(stringr)  # for str_extract()

(person <- df %>%
  select(hh_id, bday_01:gender_02) %>%
  gather(per_num1, gender, contains('gender'), convert=TRUE) %>%
  gather(per_num2, bday, contains('bday'), convert=TRUE) %>%
  mutate(bday=as.Date(bday)) %>%
  mutate_each(funs(str_extract(., '\\d+')), per_num1, per_num2) %>%
  filter(per_num1 == per_num2) %>%
  rename(per_num=per_num1) %>%
  select(-per_num2))

Edit

The warning you're seeing:

Warning: attributes are not identical across measure variables; they will be dropped

arises from gathering the gender columns, which are factors with different level vectors (see str(df)). If you convert the gender columns to character, or synchronize their levels with something like

df <- mutate(df, gender_02 = factor(gender_02, levels=levels(gender_01)))

then the warning goes away when you execute

person <- df %>%
  select(hh_id, bday_01:gender_02) %>%
  gather(key, value, contains('gender'))
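
The two fixes described above can be sketched on a small made-up data frame. The `toy` data and its values are assumptions for illustration, not the asker's df:

```r
library(tidyr)

# Hypothetical data: two factor columns whose level sets differ
toy <- data.frame(
  hh_id     = 1:2,
  gender_01 = factor(c("M", "F")),  # levels: "F", "M"
  gender_02 = factor(c("F", "F"))   # levels: "F" only
)

# The level vectors differ, so the column attributes differ
same_levels <- identical(levels(toy$gender_01), levels(toy$gender_02))

# Fix 1: give both columns the same levels before gathering
toy_sync <- toy
toy_sync$gender_02 <- factor(toy_sync$gender_02,
                             levels = levels(toy_sync$gender_01))
long_sync <- gather(toy_sync, key, value, gender_01, gender_02)

# Fix 2: convert the factor columns to character before gathering
toy_chr <- toy
gender_cols <- c("gender_01", "gender_02")
toy_chr[gender_cols] <- lapply(toy_chr[gender_cols], as.character)
long_chr <- gather(toy_chr, key, value, gender_01, gender_02)
```

Either way the measure columns carry identical attributes by the time gather() sees them, which is the condition the warning is about.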

Using gather to tidy dataset in R: attributes are not identical

I think you were close; you just misplaced the sep argument, which belongs in separate():

gather(df9, pt.num.type, value, 2:17) %>%
  separate(pt.num.type, c("type", "pt.num"), sep=1)
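
When sep is a number, separate() treats it as a character position to split at rather than as a pattern. A small sketch of that behaviour, assuming hypothetical column names D1 and T1 (a type letter followed by a patient number):

```r
library(tidyr)

# Hypothetical wide data; the column names D1 and T1 are assumptions
toy <- data.frame(
  GeneID = c("A2M", "ABL1"),
  D1     = c(887, 212),
  T1     = c(88, 16)
)

long <- gather(toy, pt.num.type, value, D1, T1)

# sep = 1 splits after the first character: "D1" becomes "D" and "1"
out <- separate(long, pt.num.type, c("type", "pt.num"), sep = 1)
out
```

A positional split like this only works while the type prefix is exactly one character long; a longer prefix would need a different position or a pattern.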

Using dplyr you could do something like:

df9 %>%
  gather(pt.num.type, value, 2:5) %>%
  separate(pt.num.type, c("type", "pt.num"), sep=1) %>%
  group_by(GeneID, type) %>%
  summarise(sum = sum(value))

#   GeneID type  sum
# 1    A2M    D  989
# 2    A2M    T 1033
# 3   ABL1    D  464
# 4   ABL1    T  170
# 5   ACP1    D 1036
# 6   ACP1    T  738

Then if you're trying to get the ratio (depending on how you are separating), you could do something like:

df9 %>%
  gather(pt.num.type, value, 2:5) %>%
  separate(pt.num.type, c("type", "pt.num"), sep=1) %>%
  spread(type, value) %>%
  mutate(Ratio = D/T)

#   GeneID pt.num   D   T      Ratio
# 1    A2M      1 887  88 10.0795455
# 2    A2M      2 102 945  0.1079365
# 3   ABL1      1 212  16 13.2500000
# 4   ABL1      2 252 154  1.6363636
# 5   ACP1      1 126  13  9.6923077
# 6   ACP1      2 910 725  1.2551724

Error with tidyr::gather() when I have unique names

The second and third arguments are the names of the key and value columns to be created in the output. Having two columns with the same name is odd and doesn't work well with other tidyr or dplyr functions, so I suggest giving the new columns different names. For example, you can try:

sample2 <- gather(sample, period, value, Y2012:Y2016)
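
As a quick check of what this produces, here is a sketch on made-up data. Only the column names Y2012 to Y2016 come from the answer above; `sample_wide`, the id column, and the values are assumptions:

```r
library(tidyr)

# Hypothetical wide data with one column per year
sample_wide <- data.frame(
  id    = c("a", "b"),
  Y2012 = c(1, 2),
  Y2013 = c(3, 4),
  Y2014 = c(5, 6),
  Y2015 = c(7, 8),
  Y2016 = c(9, 10)
)

# Distinct names for the key column ("period") and the value column ("value")
sample_long <- gather(sample_wide, period, value, Y2012:Y2016)
names(sample_long)
```

The result has one row per id and year, with the former column names stored in period and the cell contents in value.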

Tidyr's gather() with NAs

The data is not being converted to strings. It is dropping back to the underlying numeric representation, the number of seconds since 1970-01-01, which is how the original POSIXct values in df are stored:

x <- df$bday_01
x
#[1] "2015-03-09 UTC" "2015-03-09 UTC"
attributes(x) <- NULL
x
#[1] 1425859200 1425859200

The warning message gives you a hint to a way around it:

attributes are not identical across measure variables; they will be dropped

So, try:

attributes(df$bday_03) <- attributes(df$bday_02)
gather(df, person_num, bday, starts_with("bday_0"))

#  hh_id person_num       bday
#1     1    bday_01 2015-03-09
#2     2    bday_01 2015-03-09
#3     1    bday_02 1985-09-11
#4     2    bday_02 1985-09-11
#5     1    bday_03       <NA>
#6     2    bday_03       <NA>
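
If the attributes have already been dropped, the bare numbers can still be turned back into date-times, because they are seconds since 1970-01-01. A base R sketch, assuming the original values were in UTC (the variable names here are illustrative):

```r
# A POSIXct value like the ones in df$bday_01
x <- as.POSIXct("2015-03-09", tz = "UTC")

# What is left once class and tzone attributes are gone
secs <- as.numeric(x)

# Rebuild the date-time from the raw seconds
restored <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
format(restored, "%Y-%m-%d")
```

The time zone has to be supplied again by hand, since it was one of the attributes that got dropped.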

Using gather from tidyr changes my regression results

The underlying reason for this unexpected change is that dplyr (dplyr, not tidyr) changes the default method of the lag function. The gather function calls dplyr::select_vars, which loads dplyr via namespace and overwrites lag.default.

The dynlm function internally calls lag when you use L in the formula. The method dispatch then finds lag.default. When dplyr is loaded via namespace (it does not even need to be attached), the lag.default from dplyr is found.

The two lag functions are fundamentally different. In a new R session, you will find the following difference:

lag(1:3, 1)
## [1] 1 2 3
## attr(,"tsp")
## [1] 0 2 1
invisible(dplyr::mutate) # side effect: loads dplyr via namespace...
lag(1:3, 1)
## [1] NA 1 2
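
Calling each function through its namespace shows the same difference without depending on what happens to be loaded in the session. This is only an illustration of the two behaviours; `x` is a made-up series:

```r
# stats::lag shifts the time index of a series and leaves the values alone
x <- ts(1:3)
shifted <- stats::lag(x, 1)
tsp(shifted)          # start, end, frequency after the shift
as.numeric(shifted)   # values are unchanged

# dplyr::lag shifts the values themselves and pads with NA
padded <- dplyr::lag(1:3, 1)
padded
```

Because dynlm relies on the time index being shifted, the dplyr version silently produces a different regression.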

So the solution is fairly simple. Just overwrite the lag.default function yourself.

lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
## Start = 1952, End = 1993
##
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
##
## Coefficients:
##  (Intercept)        log(X)     log(L(X))  log(L(X, 2))
##     -0.05476       0.83870       0.01818       0.13928

lag.default <- dplyr:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
## Start = 1951, End = 1993
##
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
##
## Coefficients:
##  (Intercept)        log(X)     log(L(X))  log(L(X, 2))
##     -0.05669       0.82128       0.17484            NA

lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
## Start = 1952, End = 1993
##
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
##
## Coefficients:
##  (Intercept)        log(X)     log(L(X))  log(L(X, 2))
##     -0.05476       0.83870       0.01818       0.13928

Tidying dataset by gathering multiple columns?

With melt from data.table (see ?patterns):

library(data.table)

melt(setDT(df), measure = patterns("^qID", "^time_taken"),
     value.name = c("qID", "time_taken"))

Result:

   age gender     education previous_comp_exp tutorial_time variable  qID time_taken
1:  18   Male Undergraduate      casual_gamer      62.17926        1 sor9   39.61206
2:  24   Male Undergraduate      casual_gamer      85.01288        1 sor9   50.92343
3:  18   Male Undergraduate      casual_gamer      62.17926        2 sor8   19.48920
4:  24   Male Undergraduate      casual_gamer      85.01288        2 sor8   16.15616

or with tidyr:

library(dplyr)
library(tidyr)

df %>%
  gather(variable, value, qID.1:time_taken.2) %>%
  mutate(variable = sub("\\.\\d$", "", variable)) %>%
  group_by(variable) %>%
  mutate(ID = row_number()) %>%
  spread(variable, value, convert = TRUE) %>%
  select(-ID)

Result:

# A tibble: 4 x 7
    age gender     education previous_comp_exp tutorial_time   qID time_taken
  <int> <fctr>        <fctr>            <fctr>         <dbl> <chr>      <dbl>
1    18   Male Undergraduate      casual_gamer      62.17926  sor9   39.61206
2    18   Male Undergraduate      casual_gamer      62.17926  sor8   19.48920
3    24   Male Undergraduate      casual_gamer      85.01288  sor9   50.92343
4    24   Male Undergraduate      casual_gamer      85.01288  sor8   16.15616

Note:

For the tidyr method, convert=TRUE is used to convert time_taken back to numeric, since it was coerced to character when gathered with the qID columns.
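
The effect of convert = TRUE can be seen in isolation on a small long-format table. This is a sketch with made-up rows; `long` and its id column are assumptions, not part of the data above:

```r
library(tidyr)

# Hypothetical long data: numbers stored as text next to real strings
long <- data.frame(
  id       = c(1, 1, 2, 2),
  variable = c("qID", "time_taken", "qID", "time_taken"),
  value    = c("sor9", "39.6", "sor8", "19.5"),
  stringsAsFactors = FALSE
)

# Without convert, every spread column stays character
wide_chr <- spread(long, variable, value)

# With convert = TRUE, each new column is re-typed where possible
wide_conv <- spread(long, variable, value, convert = TRUE)
str(wide_conv)
```

Only time_taken becomes numeric here; qID stays character because its values cannot be read as numbers.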

Data:

df = structure(list(age = c(18L, 24L), gender = structure(c(1L, 1L
), .Label = "Male", class = "factor"), education = structure(c(1L,
1L), .Label = "Undergraduate", class = "factor"), previous_comp_exp = structure(c(1L,
1L), .Label = "casual_gamer", class = "factor"), tutorial_time = c(62.17926,
85.01288), qID.1 = structure(c(1L, 1L), .Label = "sor9", class = "factor"),
time_taken.1 = c(39.61206, 50.92343), qID.2 = structure(c(1L,
1L), .Label = "sor8", class = "factor"), time_taken.2 = c(19.4892,
16.15616)), .Names = c("age", "gender", "education", "previous_comp_exp",
"tutorial_time", "qID.1", "time_taken.1", "qID.2", "time_taken.2"
), class = "data.frame", row.names = c(NA, -2L))

