R: Pivoting Using 'Spread' Function

Not knowing what you've tried, I would suggest:

library(tidyr)
spread(df1, Type, Points)
# ID NUM A B C D E F
# 1 DJ45 1 9.2 NA NA NA NA NA
# 2 DJ45 2 NA NA NA NA NA 60.8
# 3 DJ45 3 NA NA 22.9 NA NA NA
# 4 DJ46 1 NA 1012.7 NA NA NA NA
# 5 DJ46 2 NA NA NA 18.7 NA NA
# 6 DJ46 3 11.1 NA NA NA NA NA
# 7 DJ47 1 NA NA NA NA 67.2 NA
# 8 DJ47 2 NA NA 63.1 NA NA NA
# 9 DJ47 3 NA NA NA NA NA 16.7
# 10 DJ48 1 NA NA NA 58.4 NA NA

If you are getting an error about duplicate identifiers, it is because one or more combinations of "ID" and "NUM" in your actual data are duplicated (in your sample data, they are not).

If that is the case, you need to add another column to make them unique.

Adding dplyr into the chain, it might be something like:

df1 %>%
  group_by(ID, NUM) %>%
  mutate(id2 = sequence(n())) %>%
  spread(Type, Points)

Demo of assumed error:

df2 <- rbind(df1, df1[1:3, ]) ## Duplicate the first three rows
spread(df2, Type, Points)
# Error: Duplicate identifiers for rows (1, 11), (3, 13), (2, 12)

library(dplyr)

df2 %>%
  group_by(ID, NUM) %>%
  mutate(id2 = sequence(n())) %>%
  spread(Type, Points)
# Source: local data frame [13 x 9]
#
# ID NUM id2 A B C D E F
# 1 DJ45 1 1 9.2 NA NA NA NA NA
# 2 DJ45 1 2 9.2 NA NA NA NA NA
# 3 DJ45 2 1 NA NA NA NA NA 60.8
# 4 DJ45 2 2 NA NA NA NA NA 60.8
# 5 DJ45 3 1 NA NA 22.9 NA NA NA
# 6 DJ45 3 2 NA NA 22.9 NA NA NA
# 7 DJ46 1 1 NA 1012.7 NA NA NA NA
# 8 DJ46 2 1 NA NA NA 18.7 NA NA
# 9 DJ46 3 1 11.1 NA NA NA NA NA
# 10 DJ47 1 1 NA NA NA NA 67.2 NA
# 11 DJ47 2 1 NA NA 63.1 NA NA NA
# 12 DJ47 3 1 NA NA NA NA NA 16.7
# 13 DJ48 1 1 NA NA NA 58.4 NA NA

Replacing spread() with pivot_wider()

Based on the Dropbox input data, some of the steps were already done. We can make the remaining steps more compact by using the select helpers: if we have a range of columns to select, use :, and similarly, in pivot_longer we can specify the columns not to be selected with -. With pivot_wider, make sure to name the arguments (names_from, values_from); there are other arguments as well, and without names they would be matched by position.

library(dplyr)
library(tidyr)
hiphop %>%
  group_by(sex) %>%
  summarise_at(vars(intl:unclassifiable), mean) %>%
  pivot_longer(cols = -sex) %>%
  pivot_wider(names_from = sex, values_from = value) %>%
  mutate(genredifference = abs(Female - Male)) %>%
  arrange(genredifference) %>%
  top_n(3)
# A tibble: 3 x 4
# name Female Male genredifference
# <chr> <dbl> <dbl> <dbl>
#1 country 0.786 0.392 0.394
#2 vocal 0.880 1.57 0.688
#3 rock 1.93 3.06 1.13

Turn pivot_wider() into spread()

One option is to remove the columns 'n' and 'prop' before the spread statement, as including them would create unique rows keyed by those column values as well:

library(dplyr)
library(tidyr)
test %>%
  count(x, y) %>%
  group_by(x) %>%
  mutate(prop = prop.table(n)) %>%
  mutate(v1 = paste0(n, ' (', round(prop, 2), ')')) %>%
  select(-n, -prop) %>%
  spread(y, v1)
# A tibble: 4 x 3
# Groups: x [4]
# x a b
# <dbl> <chr> <chr>
#1 1 2 (1) <NA>
#2 2 3 (0.75) 1 (0.25)
#3 3 <NA> 3 (1)
#4 4 <NA> 1 (1)

Or using base R (rounding the proportions to match the output above):

tbl <- table(test)
tbl[] <- paste0(tbl, " (", round(prop.table(tbl, 1), 2), ")")

Transpose, spread or pivot a table with 0 or 1 depending on the value of a column

We can use fastDummies::dummy_cols()

library(fastDummies)

dummy_cols(a, select_columns = 'B')

When using a pivot function can I return a sum of values that were spread?

We can create a column after the summarise with mutate and then do the pivot_wider:

library(dplyr)
library(tidyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(Count1 = n()) %>%
  mutate(Count = sum(Count1)) %>%
  pivot_wider(names_from = cyl, values_from = Count1)
# A tibble: 1 x 4
# Count `4` `6` `8`
# <int> <int> <int> <int>
#1 32 11 7 14

Spread in SparklyR / pivot in Spark

In this specific case (and in general whenever all columns have the same type; if you are interested only in missing-data statistics, even that requirement can be relaxed) you can use a much simpler structure than this.

With data defined like this:

df <- copy_to(sc, iris, overwrite = TRUE) 

gather the columns (below I assume an sdf_gather() function as defined in my answer to Gather in sparklyr):

long <- df %>%
  select(Sepal_Length, Sepal_Width) %>%
  sdf_gather("key", "value", "Sepal_Length", "Sepal_Width")

and then group and aggregate:

long %>%
  group_by(key) %>%
  summarise(n = n(), nmiss = sum(as.numeric(is.na(value)), na.rm = TRUE))

with the result:

# Source: spark<?> [?? x 3]
key n nmiss
<chr> <dbl> <dbl>
1 Sepal_Length 150 0
2 Sepal_Width 150 0

Given the reduced size of the output, it is also fine to collect the result after aggregation

agg <- df %>%
  select(Sepal_Length, Sepal_Width) %>%
  summarize_all(funs(
    n = n(),
    nmiss = sum(as.numeric(is.na(.)))  # missing count
  )) %>%
  collect()

and apply your gather/spread logic to the result:

agg %>%
  tidyr::gather(variable, value) %>%
  tidyr::separate(variable, c("var", "stat"), sep = "_(?=[^_]*$)") %>%
  tidyr::spread(stat, value)
# A tibble: 2 x 3
var n nmiss
<chr> <dbl> <dbl>
1 Sepal_Length 150 0
2 Sepal_Width 150 0

In fact, the latter approach should be superior performance-wise in this particular case.

R: Pivot the rows into columns and use N/A's for missing values

You are looking for a basic "long" to "wide" reshaping process.

In base R, you can use the notorious reshape. For this type of data, the syntax is quite straightforward:

reshape(df1, direction = "wide", idvar = "NUM", timevar = "Type")
# NUM Points.A Points.F Points.C Points.B Points.D Points.E
# 1 45 9.2 60.8 22.9 1012.7 NA NA
# 5 48 NA NA NA NA 18.7 NA
# 6 50 11.1 NA NA NA NA NA
# 7 66 NA 16.7 63.1 NA NA 67.2
# 10 68 NA NA NA NA 58.4 NA

You can also use the "tidyr" package; several of its functions simply wrap "reshape2" but use a different syntax. In this case, it would be:

library(tidyr)
spread(df1, Type, Points)

Trouble pivoting in pandas (spread in R)

Your last try (with unstack) works fine for me; I'm not sure why it gave you a problem. FWIW, I think it's more readable to use the index names rather than the levels, so I did it like this:

>>> df.set_index(['dt','site_id','eu']).unstack('eu')

kw
eu FGE WSH
dt site_id
1 a 8 5
b 3 7
c 1 5
2 a 2 3
b 5 7
c 2 5

But again, your way looks fine to me and is pretty much the same as what @piRSquared did (except their answer adds some more code to get rid of the multi-index).

I think the problem with pivot is that you can only pass a single variable, not a list? Anyway, this works for me:

>>> df.set_index(['dt','site_id']).pivot(columns='eu')

For pivot_table, the main issue is that 'kw' is an object/character and pivot_table will attempt to aggregate with numpy.mean by default. You probably got the error message: "DataError: No numeric types to aggregate".

But there are a couple of workarounds. First, you could just convert to a numeric type and then use your same pivot_table command

>>> df['kw'] = df['kw'].astype(int)
>>> df.pivot_table(index = ['dt','site_id'], values = 'kw', columns = 'eu')

Alternatively you could change the aggregation function:

>>> df.pivot_table(index = ['dt','site_id'], values = 'kw', columns = 'eu',
...                aggfunc = sum)

That's using the fact that strings can be summed (concatenated) even though you can't take a mean of them. Really, you can use most functions here (including lambdas) that operate on strings.

Note, however, that pivot_table's aggfunc requires some sort of reduction operation, even though you only have a single value per cell here, so there isn't actually anything to reduce. A check in the code requires a reduction operation regardless, so you have to supply one.
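To illustrate that last point, here is a minimal sketch with a small hypothetical frame in the shape described in the question (a string 'kw' column, exactly one value per cell); the "reduction" just passes each lone value through:

```python
import pandas as pd

# Hypothetical data in the shape discussed above: 'kw' is a string
# column with exactly one value per (dt, site_id, eu) combination.
df = pd.DataFrame({
    "dt":      [1, 1, 2, 2],
    "site_id": ["a", "a", "a", "a"],
    "eu":      ["FGE", "WSH", "FGE", "WSH"],
    "kw":      ["8", "5", "2", "3"],
})

# "sum" on a one-element group of strings is just that string, so the
# required reduction is effectively a no-op here.
wide = df.pivot_table(index=["dt", "site_id"], values="kw",
                      columns="eu", aggfunc="sum")
print(wide)
```

A lambda such as aggfunc=lambda s: s.iloc[0] would work just as well, since each group only ever contains one value.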


