R: Pivoting using 'spread' function
Not knowing what you've tried, I would suggest:
spread(df1, Type, Points)
# ID NUM A B C D E F
# 1 DJ45 1 9.2 NA NA NA NA NA
# 2 DJ45 2 NA NA NA NA NA 60.8
# 3 DJ45 3 NA NA 22.9 NA NA NA
# 4 DJ46 1 NA 1012.7 NA NA NA NA
# 5 DJ46 2 NA NA NA 18.7 NA NA
# 6 DJ46 3 11.1 NA NA NA NA NA
# 7 DJ47 1 NA NA NA NA 67.2 NA
# 8 DJ47 2 NA NA 63.1 NA NA NA
# 9 DJ47 3 NA NA NA NA NA 16.7
# 10 DJ48 1 NA NA NA 58.4 NA NA
If you are getting an error about duplicate identifiers, it is because the combination of "ID" and "Num" in your actual data have one or more duplicate entries (in your sample data, they don't).
If that is the case, you need to add another column to make them unique.
Adding dplyr
into the chain, it might be something like:
df1 %>%
group_by(ID, NUM) %>%
mutate(id2 = sequence(n())) %>%
spread(Type, Points)
Demo of assumed error:
df2 <- rbind(df1, df1[1:3, ]) ## Duplicate the first three rows
spread(df2, Type, Points)
# Error: Duplicate identifiers for rows (1, 11), (3, 13), (2, 12)
library(dplyr)
df2 %>%
group_by(ID, NUM) %>%
mutate(id2 = sequence(n())) %>%
spread(Type, Points)
# Source: local data frame [13 x 9]
#
# ID NUM id2 A B C D E F
# 1 DJ45 1 1 9.2 NA NA NA NA NA
# 2 DJ45 1 2 9.2 NA NA NA NA NA
# 3 DJ45 2 1 NA NA NA NA NA 60.8
# 4 DJ45 2 2 NA NA NA NA NA 60.8
# 5 DJ45 3 1 NA NA 22.9 NA NA NA
# 6 DJ45 3 2 NA NA 22.9 NA NA NA
# 7 DJ46 1 1 NA 1012.7 NA NA NA NA
# 8 DJ46 2 1 NA NA NA 18.7 NA NA
# 9 DJ46 3 1 11.1 NA NA NA NA NA
# 10 DJ47 1 1 NA NA NA NA 67.2 NA
# 11 DJ47 2 1 NA NA 63.1 NA NA NA
# 12 DJ47 3 1 NA NA NA NA NA 16.7
# 13 DJ48 1 1 NA NA NA 58.4 NA NA
Replacing spread() with pivot_wider()
Based on the dropbox input data, some of the steps were already done. We can make some steps more compact by utilizing the select_helpers
i.e. if we have a range of columns to select, use :
, similarly in pivot_longer
, we can also specify the columns not to be selected with -
. With pivot_wider
, make sure to specify the arguments (names_from
, values_from
) as there are other arguments as well and without specifying arguments, it could match the arguments in the order of occurence
library(dplyr)
library(tidyr)
hiphop %>%
group_by(sex)%>%
summarise_at(vars(intl:unclassifiable), mean) %>%
pivot_longer(cols = -sex) %>%
pivot_wider(names_from = sex, values_from = value) %>%
mutate(genredifference = abs(Female-Male))%>%
arrange(genredifference)%>%
top_n(3)
# A tibble: 3 x 4
# name Female Male genredifference
# <chr> <dbl> <dbl> <dbl>
#1 country 0.786 0.392 0.394
#2 vocal 0.880 1.57 0.688
#3 rock 1.93 3.06 1.13
turn pivot_wider() into spread()
One option is to remove the columns 'n', 'prop' before the spread
statement as including them would create unique rows with that column values as well
library(dplyr)
library(tidyr)
test %>%
count(x, y) %>%
group_by(x) %>%
mutate(prop = prop.table(n)) %>%
mutate(v1 = paste0(n, ' (', round(prop, 2), ')')) %>%
select(-n, -prop) %>%
spread(y, v1)
# A tibble: 4 x 3
# Groups: x [4]
# x a b
# <dbl> <chr> <chr>
#1 1 2 (1) <NA>
#2 2 3 (0.75) 1 (0.25)
#3 3 <NA> 3 (1)
#4 4 <NA> 1 (1)
Or using base R
tbl <- table(test)
tbl[] <- paste0(tbl, "(", prop.table(tbl, 1), ")")
Transpose, spread or pivot a table with 0 or 1 depending on the value of a column
We can use fastDummies::dummy_cols()
library(fastDummies)
dummy_cols(a, select_columns = 'B')
When using a pivot function can I return a sum of values that were spread?
We can create a column after the summarise
with mutate
and then do the pivot_wider
library(dplyr)
library(tidyr)
mtcars %>%
group_by(cyl) %>%
summarise(Count1 = n()) %>%
mutate(Count = sum(Count1)) %>%
pivot_wider(names_from = cyl, values_from = Count1)
# A tibble: 1 x 4
# Count `4` `6` `8`
# <int> <int> <int> <int>
#1 32 11 7 14
Spread in SparklyR / pivot in Spark
In this specific case (in general where all columns have the same type, although if you're interested only in missing data statistics, this can be further relaxed) you can use much simpler structure than this.
With data defined like this:
df <- copy_to(sc, iris, overwrite = TRUE)
gather
the columns (below I assume a function as defined in my answer to Gather in sparklyr)
long <- df %>%
select(Sepal_Length, Sepal_Width) %>%
sdf_gather("key", "value", "Sepal_Length", "Sepal_Width")
and then group and aggregate:
long %>%
group_by(key) %>%
summarise(n = n(), nmiss = sum(as.numeric(is.na(value)), na.rm=TRUE))
with result as:
# Source: spark<?> [?? x 3]
key n nmiss
<chr> <dbl> <dbl>
1 Sepal_Length 150 0
2 Sepal_Width 150 0
Given reduced size of the output it is also fine to collect the result after aggregation
agg <- df %>%
select(Sepal_Length,Sepal_Width) %>%
summarize_all(funs(
n = n(),
nmiss=sum(as.numeric(is.na(.))) # MissingCount
)) %>% collect()
and apply your gather
- spread
logic on the result:
agg %>%
tidyr::gather(variable, value) %>%
tidyr::separate(variable, c("var", "stat"), sep = "_(?=[^_]*$)") %>%
tidyr::spread(stat, value)
# A tibble: 2 x 3
var n nmiss
<chr> <dbl> <dbl>
1 Sepal_Length 150 0
2 Sepal_Width 150 0
In fact the latter approach should be superior performance-wise in this particular case.
R: Pivot the rows into columns and use N/A's for missing values
You are looking for a basic "long" to "wide" reshaping process.
In base R, you can use the notorious reshape
. For this type of data, the syntax is quite straightforward:
reshape(df1, direction = "wide", idvar = "NUM", timevar = "Type")
# NUM Points.A Points.F Points.C Points.B Points.D Points.E
# 1 45 9.2 60.8 22.9 1012.7 NA NA
# 5 48 NA NA NA NA 18.7 NA
# 6 50 11.1 NA NA NA NA NA
# 7 66 NA 16.7 63.1 NA NA 67.2
# 10 68 NA NA NA NA 58.4 NA
You can also use the "tidyr" package, for several functions just wrap reshape2
but uses different syntax. In this case, the syntax would be:
> library(tidyr)
> spread(df1, Type, Points)
Trouble pivoting in pandas (spread in R)
Your last try (with unstack
) works fine for me, I'm not sure why it gave you a problem. FWIW, I think it's more readable to use the index names rather than levels, so I did it like this:
>>> df.set_index(['dt','site_id','eu']).unstack('eu')
kw
eu FGE WSH
dt site_id
1 a 8 5
b 3 7
c 1 5
2 a 2 3
b 5 7
c 2 5
But again, your way looks fine to me and is pretty much the same as what @piRSquared did (except their answer adds some more code to get rid of the multi-index).
I think the problem with pivot
is that you can only pass a single variable, not a list? Anyway, this works for me:
>>> df.set_index(['dt','site_id']).pivot(columns='eu')
For pivot_table
, the main issue is that 'kw' is an object/character and pivot_table
will attempt to aggregate with numpy.mean
by default. You probably got the error message: "DataError: No numeric types to aggregate".
But there are a couple of workarounds. First, you could just convert to a numeric type and then use your same pivot_table command
>>> df['kw'] = df['kw'].astype(int)
>>> df.pivot_table(index = ['dt','site_id'], values = 'kw', columns = 'eu')
Alternatively you could change the aggregation function:
>>> df.pivot_table(index = ['dt','site_id'], values = 'kw', columns = 'eu',
aggfunc=sum )
That's using the fact that strings can be summed (concatentated) even though you can't take a mean of them. Really, you can use most functions here (including lambdas) that operate on strings.
Note, however, that pivot_table's
aggfunc
requires some sort of reduction operation here even though you only have a single value per cell, so there actually isn't anything to reduce! But there is a check in the code that requires a reduction operation, so you have to do one.
Related Topics
How to Split a Vector by Delimiter
How to Determine If a Url Object Returns '404 Not Found'
How to Merge Two Data Frame Based on Partial String Match with R
Puzzled by Xlim/Ylim Behavior in R
Using Tidy Eval for Multiple Dplyr Filter Conditions
R:Function to Generate a Mixture Distribution
How to Edit Column Names in Datatable Function When Running R Shiny App
Assign Color to 2 Different Geoms and Get 2 Different Legends
Placement of Error Bars in Barplot Using Ggplot2
Understanding Bandwidth Smoothing in Ggplot2
How to Know a Dimension of Matrix or Vector in R
R: How to Get a Sum of Two Distributions
Twitter Throws Forbidden Error After Entering Twitter API Pin