Retain attributes when using gather from tidyr (attributes are not identical)
You could just convert your dates to character then convert them back to dates at the end:
(person <- df %>%
    select(hh_id, bday_01:gender_02) %>%
    mutate_each(funs(as.character), contains('bday')) %>%
    gather(key, value, -hh_id) %>%
    separate(key, c("key", "per_num"), sep = "_") %>%
    spread(key, value) %>%
    mutate(bday=ymd(bday)))
  hh_id per_num       bday gender
1     1      01 2015-03-09      M
2     1      02 1985-09-11      F
3     2      01 1989-02-11      F
4     2      02 2000-08-15      F
Alternatively, if you use Date instead of POSIXct, you could do something like this:
(person <- df %>%
    select(hh_id, bday_01:gender_02) %>%
    gather(per_num1, gender, contains('gender'), convert=TRUE) %>%
    gather(per_num2, bday, contains('bday'), convert=TRUE) %>%
    mutate(bday=as.Date(bday)) %>%
    mutate_each(funs(str_extract(., '\\d+')), per_num1, per_num2) %>%
    filter(per_num1 == per_num2) %>%
    rename(per_num=per_num1) %>%
    select(-per_num2))
Edit
The warning you're seeing:
Warning: attributes are not identical across measure variables; they will be dropped
arises from gathering the gender columns, which are factors and have different level vectors (see str(df)). If you were to convert the gender columns to character, or if you were to synchronize their levels with something like
df <- mutate(df, gender_02 = factor(gender_02, levels=levels(gender_01)))
then you will see that the warning goes away when you execute
person <- df %>%
select(hh_id, bday_01:gender_02) %>%
gather(key, value, contains('gender'))
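One way to synchronize the levels without favouring either column is to rebuild both factors on the union of their levels, so no value is turned into NA because it is missing from the other column's levels. The sketch below uses base R only and a small made-up data frame (`toy`, its values, and the name `shared` are assumptions for illustration, not the asker's data):

```r
# Hypothetical household data: the gender columns are factors whose
# level sets differ, as in the question.
toy <- data.frame(
  hh_id     = 1:2,
  gender_01 = factor(c("M", "F")),
  gender_02 = factor(c("F", "F"))
)

# Build one shared level vector and re-create both factors on it.
shared <- union(levels(toy$gender_01), levels(toy$gender_02))
toy$gender_01 <- factor(toy$gender_01, levels = shared)
toy$gender_02 <- factor(toy$gender_02, levels = shared)

# Check whether the attributes (levels and class) of the two columns match.
same_attrs <- identical(attributes(toy$gender_01), attributes(toy$gender_02))
same_attrs
```

Once the attributes match, gathering the two columns should no longer have anything to drop.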
Using gather to tidy dataset in R- attributes are not identical
I think you were close; you just misplaced the sep argument:
gather(df9, pt.num.type, value, 2:17) %>%
  separate(pt.num.type, c("type", "pt.num"), sep=1)
Using dplyr you could do something like:
df9 %>%
  gather(pt.num.type, value, 2:5) %>%
  separate(pt.num.type, c("type", "pt.num"), sep=1) %>%
  group_by(GeneID, type) %>%
  summarise(sum = sum(value))
#   GeneID type  sum
# 1    A2M    D  989
# 2    A2M    T 1033
# 3   ABL1    D  464
# 4   ABL1    T  170
# 5   ACP1    D 1036
# 6   ACP1    T  738
Then if you're trying to get the ratio (depending on how you are separating), you could do something like:
df9 %>%
  gather(pt.num.type, value, 2:5) %>%
  separate(pt.num.type, c("type", "pt.num"), sep=1) %>%
  spread(type, value) %>%
  mutate(Ratio = D/T)
#   GeneID pt.num   D   T      Ratio
# 1    A2M      1 887  88 10.0795455
# 2    A2M      2 102 945  0.1079365
# 3   ABL1      1 212  16 13.2500000
# 4   ABL1      2 252 154  1.6363636
# 5   ACP1      1 126  13  9.6923077
# 6   ACP1      2 910 725  1.2551724
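To see what a numeric sep does in isolation: when sep is a number, separate() splits at that character position rather than on a pattern, so sep=1 cuts after the first character. A minimal sketch, assuming the gathered key looks like "D1" or "T2" (the data frame `toy` and its key format are hypothetical; only the four values are borrowed from the A2M rows above):

```r
library(tidyr)

# Hypothetical long-format data with a combined type + patient-number key.
toy <- data.frame(
  pt.num.type = c("D1", "T1", "D2", "T2"),
  value       = c(887, 88, 102, 945)
)

# sep = 1 splits each key after its first character:
# "D1" becomes type "D" and pt.num "1".
out <- separate(toy, pt.num.type, c("type", "pt.num"), sep = 1)
out
```

Note that pt.num comes back as character unless convert = TRUE is passed.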
error with tidyr::gather() when I have unique names
The second and third arguments are the names of the key and value columns to be created in the output. Having two columns with the same name is odd and doesn't work well with other functions of tidyr or dplyr. I suggest giving the new columns different names. Therefore, you can try:
sample2 <- gather(sample, period, value, Y2012:Y2016)
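As a quick check of what that call produces, here is a small sketch on made-up data (the data frame `wide`, its `id` column, and its values are assumptions; only the Y-prefixed year columns mirror the question):

```r
library(tidyr)

# Hypothetical wide data with one column per year.
wide <- data.frame(
  id    = c("a", "b"),
  Y2012 = c(1, 2),
  Y2013 = c(3, 4)
)

# "period" receives the old column names, "value" receives the cell values.
long <- gather(wide, period, value, Y2012:Y2013)
long
```

Each original row contributes one output row per gathered column, so two rows and two year columns give four rows.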
Tidyr's gather() with NAs
The data is not being converted to strings; it is dropping back to the numeric representation of the seconds since 1970-01-01, which is what the original POSIXct values in df represent:
x <- df$bday_01
x
#[1] "2015-03-09 UTC" "2015-03-09 UTC"
attributes(x) <- NULL
x
#[1] 1425859200 1425859200
The warning message gives you a hint to a way around it:
attributes are not identical across measure variables; they will be dropped
So, try:
attributes(df$bday_03) <- attributes(df$bday_02)
gather(df, person_num, bday, starts_with("bday_0"))
# hh_id person_num bday
#1 1 bday_01 2015-03-09
#2 2 bday_01 2015-03-09
#3 1 bday_02 1985-09-11
#4 2 bday_02 1985-09-11
#5 1 bday_03 <NA>
#6 2 bday_03 <NA>
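If the attributes have already been dropped, the bare numbers can still be turned back into date-times. A minimal base-R sketch, assuming the values are seconds since the epoch in UTC as in the output above (the names `secs` and `restored` are illustrative):

```r
# Values as they appear after the attributes are stripped:
# seconds since 1970-01-01 00:00:00 UTC.
secs <- c(1425859200, 1425859200)

# Re-attach the POSIXct class and time zone.
restored <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
format(restored, "%Y-%m-%d")
```

This only recovers the values; the time zone has to be supplied again because it was stored in the dropped attributes.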
Using gather from tidyr changes my regression results
The underlying reason for this unexpected change is that dplyr (dplyr, not tidyr) changes the default method of the lag function. The gather function calls dplyr::select_vars, which loads dplyr via namespace and overwrites lag.default.
The dynlm function internally calls lag when you use L in the formula. The method dispatch then finds lag.default. When dplyr is loaded via namespace (it does not even need to be attached), the lag.default from dplyr is found.
The two lag functions are fundamentally different. In a new R session, you will find the following difference:
lag(1:3, 1)
## [1] 1 2 3
## attr(,"tsp")
## [1] 0 2 1
invisible(dplyr::mutate) # side effect: loads dplyr via namespace...
lag(1:3, 1)
## [1] NA 1 2
So the solution is fairly simple. Just overwrite the lag.default function yourself.
lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## Time series regression with "ts" data:
## Start = 1952, End = 1993
##
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
##
## Coefficients:
## (Intercept) log(X) log(L(X)) log(L(X, 2))
## -0.05476 0.83870 0.01818 0.13928
lag.default <- dplyr:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## Time series regression with "ts" data:
## Start = 1951, End = 1993
##
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
##
## Coefficients:
## (Intercept) log(X) log(L(X)) log(L(X, 2))
## -0.05669 0.82128 0.17484 NA
lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## Time series regression with "ts" data:
## Start = 1952, End = 1993
##
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
##
## Coefficients:
## (Intercept) log(X) log(L(X)) log(L(X, 2))
## -0.05476 0.83870 0.01818 0.13928
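To make the difference between the two behaviours concrete: the stats version keeps the values and shifts the time index, which is why only the tsp attribute changes, whereas the dplyr version moves the values and pads with NA. A small base-R sketch of the stats side, assuming a fresh session in which no package has replaced the lag method (the names `x` and `shifted` are illustrative):

```r
# A three-point series with times 1, 2, 3.
x <- ts(1:3)

# stats::lag moves the time base back by one period;
# the data values themselves are untouched.
shifted <- stats::lag(x, 1)

tsp(shifted)        # start, end, frequency of the shifted series
as.vector(shifted)  # the underlying values
```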
Tidying dataset by gathering multiple columns?
With melt from data.table (see ?patterns):
library(data.table)
melt(setDT(df), measure = patterns("^qID", "^time_taken"),
     value.name = c("qID", "time_taken"))
Result:
   age gender     education previous_comp_exp tutorial_time variable  qID time_taken
1:  18   Male Undergraduate      casual_gamer      62.17926        1 sor9   39.61206
2:  24   Male Undergraduate      casual_gamer      85.01288        1 sor9   50.92343
3:  18   Male Undergraduate      casual_gamer      62.17926        2 sor8   19.48920
4:  24   Male Undergraduate      casual_gamer      85.01288        2 sor8   16.15616
or with tidyr:
library(dplyr)
library(tidyr)
df %>%
  gather(variable, value, qID.1:time_taken.2) %>%
  mutate(variable = sub("\\.\\d$", "", variable)) %>%
  group_by(variable) %>%
  mutate(ID = row_number()) %>%
  spread(variable, value, convert = TRUE) %>%
  select(-ID)
Result:
# A tibble: 4 x 7
    age gender     education previous_comp_exp tutorial_time   qID time_taken
  <int> <fctr>        <fctr>            <fctr>         <dbl> <chr>      <dbl>
1    18   Male Undergraduate      casual_gamer      62.17926  sor9   39.61206
2    18   Male Undergraduate      casual_gamer      62.17926  sor8   19.48920
3    24   Male Undergraduate      casual_gamer      85.01288  sor9   50.92343
4    24   Male Undergraduate      casual_gamer      85.01288  sor8   16.15616
Note:
For the tidyr method, convert=TRUE is used to convert time_taken back to numeric, since it was coerced to character when gathered with the qID columns.
Data:
df = structure(list(age = c(18L, 24L), gender = structure(c(1L, 1L
), .Label = "Male", class = "factor"), education = structure(c(1L,
1L), .Label = "Undergraduate", class = "factor"), previous_comp_exp = structure(c(1L,
1L), .Label = "casual_gamer", class = "factor"), tutorial_time = c(62.17926,
85.01288), qID.1 = structure(c(1L, 1L), .Label = "sor9", class = "factor"),
time_taken.1 = c(39.61206, 50.92343), qID.2 = structure(c(1L,
1L), .Label = "sor8", class = "factor"), time_taken.2 = c(19.4892,
16.15616)), .Names = c("age", "gender", "education", "previous_comp_exp",
"tutorial_time", "qID.1", "time_taken.1", "qID.2", "time_taken.2"
), class = "data.frame", row.names = c(NA, -2L))