Retain Attributes When Using Gather from Tidyr (Attributes Are Not Identical)

You could just convert your dates to character then convert them back to dates at the end:

library(dplyr)
library(tidyr)
library(lubridate)  # for ymd()

(person <- df %>% 
      select(hh_id, bday_01:gender_02) %>% 
      mutate_each(funs(as.character), contains('bday')) %>%
      gather(key, value, -hh_id) %>%
      separate(key, c("key", "per_num"), sep = "_") %>%
      spread(key, value) %>%
      mutate(bday=ymd(bday)))

  hh_id per_num       bday gender
1     1      01 2015-03-09      M
2     1      02 1985-09-11      F
3     2      01 1989-02-11      F
4     2      02 2000-08-15      F
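As a minimal base R sketch of why the round trip is safe (the value is hypothetical, and as.Date() stands in for ymd() here): plain text carries no class or time zone attributes that could clash while gathering, so it can be reshaped freely and parsed again afterwards.

```r
# Hypothetical birthday stored as POSIXct, like the bday columns in the question.
bday <- as.POSIXct("2015-03-09", tz = "UTC")

# Step 1: turn the date-time into plain text before reshaping.
bday_text <- format(bday, "%Y-%m-%d")

# Step 2: parse the text back into a date once the reshaping is done.
bday_back <- as.Date(bday_text)

print(bday_text)
print(class(bday_back))
```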

Alternatively, if you use Date instead of POSIXct, you could do something like this:

library(dplyr)
library(tidyr)
library(stringr)  # for str_extract()

(person <- df %>% 
      select(hh_id, bday_01:gender_02) %>% 
      gather(per_num1, gender, contains('gender'), convert=TRUE) %>%
      gather(per_num2, bday, contains('bday'), convert=TRUE) %>%
      mutate(bday=as.Date(bday)) %>%
      mutate_each(funs(str_extract(., '\\d+')), per_num1, per_num2) %>%
      filter(per_num1 == per_num2) %>%
      rename(per_num=per_num1) %>%
      select(-per_num2))

Edit

The warning you're seeing:

Warning: attributes are not identical across measure variables; they will be dropped

arises from gathering the gender columns, which are factors with different level vectors (see str(df)). If you convert the gender columns to character, or synchronize their levels with something like,

df <- mutate(df, gender_02 = factor(gender_02, levels=levels(gender_01)))

then you will see that the warning goes away when you execute

person <- df %>% 
        select(hh_id, bday_01:gender_02) %>% 
        gather(key, value, contains('gender'))
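You can check the mismatch directly in base R. The sketch below uses two small hypothetical factors (not the data from the question) and compares their attributes before and after aligning the levels:

```r
# Hypothetical gender columns; the second one happens to contain only "F".
gender_01 <- factor(c("M", "F"))   # levels: "F", "M"
gender_02 <- factor(c("F", "F"))   # levels: "F"

# Different level vectors mean the attributes differ.
same_before <- identical(attributes(gender_01), attributes(gender_02))

# Rebuild the second factor on the first factor's levels.
gender_02_synced <- factor(gender_02, levels = levels(gender_01))
same_after <- identical(attributes(gender_01), attributes(gender_02_synced))

print(same_before)
print(same_after)
```

Once the attributes compare as identical, gather has nothing to drop, which would explain why the warning disappears.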

Using gather to tidy dataset in R - attributes are not identical

I think you were close; you just misplaced the sep argument. It belongs in separate, not in gather:

gather(df9, pt.num.type, value, 2:17) %>%
  separate(pt.num.type, c("type", "pt.num"), sep=1)
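To see what a numeric sep does in isolation, here is a small sketch. The key names "D1", "T1" and so on are an assumption about how the columns in df9 are named; with sep = 1 the split happens after the first character.

```r
library(tidyr)

# Hypothetical key column, as it might look after gathering.
keys <- data.frame(pt.num.type = c("D1", "T1", "D2", "T2"),
                   stringsAsFactors = FALSE)

# A numeric sep is a position: split after character 1.
parts <- separate(keys, pt.num.type, c("type", "pt.num"), sep = 1)

print(parts)
```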

Using dplyr you could do something like:

df9 %>% 
  gather(pt.num.type, value, 2:5) %>%
  separate(pt.num.type, c("type", "pt.num"), sep=1) %>%
  group_by(GeneID, type) %>%
  summarise(sum = sum(value))

#   GeneID type  sum
# 1    A2M    D  989
# 2    A2M    T 1033
# 3   ABL1    D  464
# 4   ABL1    T  170
# 5   ACP1    D 1036
# 6   ACP1    T  738

Then if you're trying to get the ratio (depending on how you are separating), you could do something like:

df9 %>% 
  gather(pt.num.type, value, 2:5) %>%
  separate(pt.num.type, c("type", "pt.num"), sep=1) %>%
  spread(type, value) %>%
  mutate(Ratio = D/T)

#   GeneID pt.num   D   T      Ratio
# 1    A2M      1 887  88 10.0795455
# 2    A2M      2 102 945  0.1079365
# 3   ABL1      1 212  16 13.2500000
# 4   ABL1      2 252 154  1.6363636
# 5   ACP1      1 126  13  9.6923077
# 6   ACP1      2 910 725  1.2551724

Error with tidyr::gather() when I have unique names

The second and third arguments are the names of the key and value columns to be created in the output. Having two columns with the same name is odd and doesn't work well with other tidyr or dplyr functions, so I suggest giving the new columns different names. For example, you can try:

sample2 <- gather(sample, period, value, Y2012:Y2016)
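A self-contained version of the same call, on a small made-up data frame (the id column and the values are hypothetical, and only two year columns are used to keep it short):

```r
library(tidyr)

# Hypothetical wide data: one row per id, one column per year.
sample <- data.frame(id = 1:2,
                     Y2012 = c(10, 20),
                     Y2013 = c(11, 21))

# "period" and "value" are distinct names for the new key and value columns.
sample2 <- gather(sample, period, value, Y2012:Y2013)

print(sample2)
```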

Tidyr's gather() with NAs

The data is not being converted to strings. It is dropping back to the underlying numeric representation, the number of seconds since 1970-01-01, which is how the original POSIXct values in df are stored:

x <- df$bday_01
x
#[1] "2015-03-09 UTC" "2015-03-09 UTC"
attributes(x) <- NULL
x
#[1] 1425859200 1425859200
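The stripped number still holds all the information. As a minimal sketch (assuming the time zone is UTC, as in the printed output above), you can rebuild the date-time from the raw seconds:

```r
# A date-time like the ones in df$bday_01.
x <- as.POSIXct("2015-03-09", tz = "UTC")

# as.numeric() drops the class and tzone attributes, leaving raw seconds.
secs <- as.numeric(x)

# Supplying the origin and time zone again restores the date-time.
restored <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

print(secs)
print(restored)
```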

The warning message gives you a hint to a way around it:

attributes are not identical across measure variables; they will be dropped

So, try:

attributes(df$bday_03) <- attributes(df$bday_02)
gather(df, person_num, bday, starts_with("bday_0"))

#  hh_id person_num       bday
#1     1    bday_01 2015-03-09
#2     2    bday_01 2015-03-09
#3     1    bday_02 1985-09-11
#4     2    bday_02 1985-09-11
#5     1    bday_03       <NA>
#6     2    bday_03       <NA>
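The effect of that assignment can be checked without tidyr. In this sketch the vectors are hypothetical stand-ins for the bday_02 and bday_03 columns, with bday_03 assumed to be an all-NA numeric column:

```r
# Stand-ins for the two columns: one real POSIXct, one bare NA column.
bday_02 <- as.POSIXct(c("1985-09-11", "1985-09-11"), tz = "UTC")
bday_03 <- c(NA_real_, NA_real_)

# The bare NA column has no attributes at all, so the two differ.
same_before <- identical(attributes(bday_02), attributes(bday_03))

# Copy class and time zone over, as in the fix above.
attributes(bday_03) <- attributes(bday_02)
same_after <- identical(attributes(bday_02), attributes(bday_03))

print(same_before)
print(same_after)
print(class(bday_03))
```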

Using gather from tidyr changes my regression results

The underlying reason for this unexpected change is that dplyr (dplyr, not tidyr) changes the default method of the lag function. The gather function calls dplyr::select_vars, which loads dplyr via namespace and overwrites lag.default.

The dynlm function internally calls lag when you use L in the formula. The method dispatch then finds lag.default. When dplyr is loaded via namespace (it does not even need to be attached), the lag.default from dplyr is found.

The two lag functions are fundamentally different. In a new R session, you will find the following difference:

lag(1:3, 1)
## [1] 1 2 3
## attr(,"tsp")
## [1] 0 2 1
invisible(dplyr::mutate) # side effect: loads dplyr via namespace...
lag(1:3, 1)
## [1] NA  1  2
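To check the stats behaviour regardless of which packages are loaded, you can call it by its fully qualified name. This is only a base R sketch to show the difference; it does not change which lag dynlm picks up internally.

```r
# A tiny time series with time index 1, 2, 3.
x <- ts(1:3)

# stats::lag() keeps the values and shifts only the time index.
shifted <- stats::lag(x, 1)

print(as.numeric(shifted))  # values are unchanged
print(tsp(shifted))         # start, end, frequency of the shifted index
```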

So the solution is fairly simple. Just overwrite the lag.default function yourself.

lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
##   Start = 1952, End = 1993
## 
## Call:
##   dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## 
## Coefficients:
##   (Intercept)        log(X)     log(L(X))  log(L(X, 2))  
## -0.05476       0.83870       0.01818       0.13928      

lag.default <- dplyr:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
## Start = 1951, End = 1993
## 
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## 
## Coefficients:
##  (Intercept)        log(X)     log(L(X))  log(L(X, 2))  
##     -0.05669       0.82128       0.17484            NA  

lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)

## Time series regression with "ts" data:
##   Start = 1952, End = 1993
## 
## Call:
##   dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## 
## Coefficients:
##   (Intercept)        log(X)     log(L(X))  log(L(X, 2))  
## -0.05476       0.83870       0.01818       0.13928

Tidying dataset by gathering multiple columns?

With melt from data.table (see ?patterns):

library(data.table)

melt(setDT(df), measure = patterns("^qID", "^time_taken"),
     value.name = c("qID", "time_taken"))

Result:

   age gender     education previous_comp_exp tutorial_time variable  qID time_taken
1:  18   Male Undergraduate      casual_gamer      62.17926        1 sor9   39.61206
2:  24   Male Undergraduate      casual_gamer      85.01288        1 sor9   50.92343
3:  18   Male Undergraduate      casual_gamer      62.17926        2 sor8   19.48920
4:  24   Male Undergraduate      casual_gamer      85.01288        2 sor8   16.15616

or with tidyr:

library(dplyr)
library(tidyr)

df %>%
  gather(variable, value, qID.1:time_taken.2) %>%
  mutate(variable = sub("\\.\\d$", "", variable)) %>%
  group_by(variable) %>%
  mutate(ID = row_number()) %>%
  spread(variable, value, convert = TRUE) %>%
  select(-ID)

Result:

# A tibble: 4 x 7
    age gender     education previous_comp_exp tutorial_time   qID time_taken
  <int> <fctr>        <fctr>            <fctr>         <dbl> <chr>      <dbl>
1    18   Male Undergraduate      casual_gamer      62.17926  sor9   39.61206
2    18   Male Undergraduate      casual_gamer      62.17926  sor8   19.48920
3    24   Male Undergraduate      casual_gamer      85.01288  sor9   50.92343
4    24   Male Undergraduate      casual_gamer      85.01288  sor8   16.15616

Note:

For the tidyr method, convert=TRUE is used to convert time_taken back to numeric, since it was coerced to character when gathered with the qID columns.
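As far as I know, convert=TRUE amounts to running type.convert() with as.is = TRUE on the new columns. A minimal base R sketch of that step, using the values from the result above as plain character vectors:

```r
# After gathering, both kinds of values are stored as character.
time_taken_chr <- c("39.61206", "50.92343")
qid_chr <- c("sor9", "sor8")

# type.convert() picks the simplest type that fits each vector.
time_taken <- type.convert(time_taken_chr, as.is = TRUE)
qid <- type.convert(qid_chr, as.is = TRUE)

print(class(time_taken))
print(class(qid))
```

The numbers come back as numeric, while the qID labels stay character because as.is = TRUE prevents conversion to factor.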

Data:

df = structure(list(age = c(18L, 24L), gender = structure(c(1L, 1L
), .Label = "Male", class = "factor"), education = structure(c(1L, 
1L), .Label = "Undergraduate", class = "factor"), previous_comp_exp = structure(c(1L, 
1L), .Label = "casual_gamer", class = "factor"), tutorial_time = c(62.17926, 
85.01288), qID.1 = structure(c(1L, 1L), .Label = "sor9", class = "factor"), 
    time_taken.1 = c(39.61206, 50.92343), qID.2 = structure(c(1L, 
    1L), .Label = "sor8", class = "factor"), time_taken.2 = c(19.4892, 
    16.15616)), .Names = c("age", "gender", "education", "previous_comp_exp", 
"tutorial_time", "qID.1", "time_taken.1", "qID.2", "time_taken.2"
), class = "data.frame", row.names = c(NA, -2L))
