Reshaping Data Frame with Duplicates

Pandas: reshaping a dataframe with duplicate entries

Use groupby with aggregate by join, then reshape by unstack:

d = lambda x: ', '.join(x.astype(str))
df = df.groupby(['deathtype', 'deaths'])['height'].agg(d).unstack(fill_value='0')
print (df)
deaths                     1     4     14    17
deathtype                                      
AMS          4900, 5150, 5300     0     0     0
Avalanche                5700  5600  5350  5800
Unexplained        8500, 8560     0     0     0

Detail:

print (df.groupby(['deathtype', 'deaths'])['height'].agg(lambda x: ', '.join(x.astype(str))))
deathtype    deaths
AMS          1         4900, 5150, 5300
Avalanche    1                     5700
             4                     5600
             14                    5350
             17                    5800
Unexplained  1               8500, 8560
Name: height, dtype: object

Another solution with pivot_table:

df = df.pivot_table(index='deathtype', 
                    columns='deaths', 
                    values='height', 
                    fill_value='0', 
                    aggfunc=lambda x: ', '.join(x.astype(str)))

Reshape Pandas Dataframe with duplicate Index

You can use pivot_table which accepts multiple columns as index, values and columns:

df.pivot_table("Value", ["CountryName", "Year"], "IndicatorCode").reset_index()

Sample Image

Some explanation:

The parameters passed here are by positions, i.e, they are in the order of values, index, and columns or:

df.pivot_table(values = "Value", index = ["CountryName", "Year"], columns = "IndicatorCode").reset_index()

The values are what fill the cells of the final data frame, the index are the columns that get deduplicated and remain as columns in the result, the columns variables are ones that get pivoted to column headers in the result.

Pandas how to reshape a dataframe containing duplicated values for columns

Looks like the perfect application for pandas pivot_table.

Worth highlighting that pivot_table uses numpy mean as aggregation function (in case there are multiple observations with same index & column. So it implicitly requires numbers (int/floats) as values by default.

Let frame be the pandas dataframe containing your data:

import pandas as pd

cc = ['chr', 'value', 'region']
vals = [['chr22', 1, '21-77'],
       ['chr6',     3,   '12-65'],
       ['chr3',     5,   '73-81'],
       ['chr3',     8,   '91-96']]

frame = pd.DataFrame(vals, columns = cc)

result = pd.pivot_table(frame,
                        values = 'value', index = ['chr'], columns = ['region'],
                        fill_value = 0)

Pandas: Reshaping Long Data to Wide with duplicated columns

Option 1:

df.astype(str).groupby(['indx', 'param'])['value'].agg(','.join).unstack()

Output:

param    a      b      c    d
indx                         
11     100  54,65     65  NaN
12     789     24    NaN   98
13      24     27  75,35  NaN

Option 2

df_out = df.set_index(['indx', 'param', df.groupby(['indx','param']).cumcount()])['value'].unstack([1,2])
df_out.columns = [f'{i}_{j}' if j != 0 else f'{i}' for i, j in df_out.columns]
df_out.reset_index()

Output:

   indx      a     b   b_1     c     d   c_1
0    11  100.0  54.0  65.0  65.0   NaN   NaN
1    12  789.0  24.0   NaN   NaN  98.0   NaN
2    13   24.0  27.0   NaN  75.0   NaN  35.0

Transforming a dataframe from long to wide with Duplicate identifiers in R

If you check the original data frame, there are duplicates with respect to Station and Day:

df.summary <- group_by(df, Station, Date) %>% count()
df.summary[which(df.summary$n > 1), ]

# A tibble: 396 x 3
# Groups:   Station, Date [396]
   Station       Date     n
    <fctr>     <fctr> <int>
 1 DEBW001 2001-01-01     2
 2 DEBW001 2001-02-01     2
 3 DEBW001 2001-03-01     2
 4 DEBW001 2001-04-01     2
 5 DEBW001 2001-05-01     2
 6 DEBW001 2001-06-01     2
 7 DEBW001 2001-07-01     2
 8 DEBW001 2001-08-01     2
 9 DEBW001 2001-09-01     2
10 DEBW001 2001-10-01     2
# ... with 386 more rows

It depends on how you want to treat these duplicates. Suppose you want to take the mean of the duplicated values:

df2 <- reshape2::melt(df, id.vars=c("Station", "Date"), variable.name="Day")
df3 <- reshape2::dcast(df2, Date+Day~Station, value.var="value", fun.aggregate=mean)

The resulting data frame looks like this:

df3[1:10, 1:10]
         Date   Day AT0ACH1 AT0ENK1 AT0ILL1 AT0PIL1 AT0SIG1 AT0SON1 AT0STO1 AT0VOR1
1  2001-01-01 Day01  53.696  44.727  47.826  40.955  85.500  94.455  64.739  62.455
2  2001-01-01 Day02  42.048  28.609  39.435  42.435  78.000  89.261      NA  71.348
3  2001-01-01 Day03  38.565  28.957  19.522  28.304  72.500  88.625      NA  47.130
4  2001-01-01 Day04  39.304  23.739  16.522  20.870  85.625  95.870      NA  52.913
5  2001-01-01 Day05  67.375  29.864  22.421  21.174  82.875  93.087      NA  61.652
6  2001-01-01 Day06  58.478  32.478  28.708  26.870  67.043  79.391      NA  55.957
7  2001-01-01 Day07  49.652  21.217  29.042  48.609  55.870  76.174      NA  52.435
8  2001-01-01 Day08  48.217  16.739  27.591  41.217  59.522  79.435      NA  55.696
9  2001-01-01 Day09  52.000  30.391  44.542  46.783  67.609  82.583      NA  54.455
10 2001-01-01 Day10  37.087  33.174  28.522  30.182  80.750  94.478      NA  52.818

Reshape/Transpose duplicated rows to column in R studio

you can use pivot_wider from tidyr package

library(dplyr)
library(tidyr)

# your data
df <- tribble(
  ~ID, ~Name, ~Adress, ~Score, ~Requirement, ~Status,
  1, "John", "CA", 1, "Internet", "OK", 
  1, "John", "CA", 1, "TV", "Not OK", 
  1, "John", "CA", 1, "Household", "OK", 
  2, "Ann", "LA", 3, "Internet", "Not OK", 
  2, "Ann", "LA", 3, "TV", "Follow up")

df <- df %>% 
  pivot_wider(names_from = Requirement, values_from = Status)

Pandas Dataframe Reshape/Pivot - Duplicate Values in Index Error

It seems in your real data are duplicates, see sample:

print (df)
         Date   name  objects
0  2005-11-17   Pete        6
1  2014-02-04    Rob        3
2  2012-02-13    Rob        2
3  2004-12-16  Julia        4
4  2012-02-13   Mike        9 <-duplicates for 2012-02-13 and Mike
5  2012-02-13   Mike       18 <-duplicates for 2012-02-13 and Mike

Solution are pivot_table with some aggregate function, default is np.mean but can be changed to sum, 'meadian', first, last.

df = df.pivot_table(index='Date', columns='name', values='objects', aggfunc=np.mean)
print (df)
name        Julia  Mike  Pete  Rob
Date                              
2004-12-16    4.0   NaN   NaN  NaN
2005-11-17    NaN   NaN   6.0  NaN
2012-02-13    NaN  13.5   NaN  2.0 <-13.5 is mean
2014-02-04    NaN   NaN   NaN  3.0

Another solution with groupby, aggregate function and unstack:

df = df.groupby(['Date','name'])['objects'].mean().unstack()
print (df)
name        Julia  Mike  Pete  Rob
Date                              
2004-12-16    4.0   NaN   NaN  NaN
2005-11-17    NaN   NaN   6.0  NaN
2012-02-13    NaN  13.5   NaN  2.0
2014-02-04    NaN   NaN   NaN  3.0

For checking duplicated is possible use duplicated with boolean indexing:

df = df[df.duplicated(subset=['Date','name'], keep=False)]
print (df)
         Date  name  objects
4  2012-02-13  Mike        9
5  2012-02-13  Mike       18

reshape2 cast data frame with some duplicate values

The rows are being aggregated as dcast can't distinguish them based upon the formula provided. If you want to maintain the original values then you'll need to include a field to identify the duplicates uniquely. To continue your code...

df2$group <- rep(1:2,12)
dcast(df2,site+year+group~variable)

Clearly this code is a bit simplistic (in particular your data must be sorted by 'group' with no missing values) but it should serve to demonstrate how to preserve the original values.

How to Reshape Long to Wide While Preserving Some Variables in R

Another option only using tidyverse:

library(tidyverse)
#Code
df %>% group_by(id) %>% mutate(idv=paste0('type.',1:n())) %>%
  pivot_wider(names_from = idv,values_from=type)

Output:

# A tibble: 2 x 5
# Groups:   id [2]
  id    dates     type.1 type.2 type.3
  <chr> <chr>     <chr>  <chr>  <chr> 
1 1000  10/5/2019 A      B      B     
2 1001  9/17/2020 C      C      A

Or using row_number() (credits to @r2evans):

#Code 2
df %>% group_by(id) %>% mutate(idv=paste0('type.',row_number())) %>%
  pivot_wider(names_from = idv,values_from=type)

Output:

# A tibble: 2 x 5
# Groups:   id [2]
  id    dates     type.1 type.2 type.3
  <chr> <chr>     <chr>  <chr>  <chr> 
1 1000  10/5/2019 A      B      B     
2 1001  9/17/2020 C      C      A

Reshaping Data Frame with Duplicates

Pandas: reshaping a dataframe with duplicate entries

Reshape Pandas Dataframe with duplicate Index

Pandas how to reshape a dataframe containing duplicated values for columns

Pandas: Reshaping Long Data to Wide with duplicated columns

Option 1:

Option 2

Transforming a dataframe from long to wide with Duplicate identifiers in R

Reshape/Transpose duplicated rows to column in R studio

Pandas Dataframe Reshape/Pivot - Duplicate Values in Index Error

reshape2 cast data frame with some duplicate values

How to Reshape Long to Wide While Preserving Some Variables in R

Related Topics

Leave a reply