Reshaping Dataset

How to reshape a dataset in order to make it sequential?

It is more appropriate to do this kind of processing in the dataset layer. What you are looking to implement there is "given a dataset index, return the corresponding input and its label". In your case you are dealing with a sequence as input, so it makes sense for your __getitem__ to return a sequence of images.

The data loader will automatically collate the data such that you get (batch_size, seq_len, channel, height, width) for your input, and (batch_size, seq_len) for your label (or (batch_size,) if there is meant to be a single label per sequence).
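A minimal sketch of such a dataset, using plain numpy arrays in place of real images and omitting the torch.utils.data.Dataset base class for brevity (the frame count, sequence length, and image size are all made-up example values):

```python
import numpy as np

class SequenceDataset:
    """Returns a contiguous sequence of frames and their labels per index."""

    def __init__(self, frames, labels, seq_len):
        self.frames = frames      # (num_frames, channel, height, width)
        self.labels = labels      # (num_frames,)
        self.seq_len = seq_len

    def __len__(self):
        # number of complete, non-overlapping sequences
        return len(self.frames) // self.seq_len

    def __getitem__(self, index):
        # slice out one sequence of frames and its per-frame labels
        start = index * self.seq_len
        stop = start + self.seq_len
        return self.frames[start:stop], self.labels[start:stop]

frames = np.random.rand(100, 3, 32, 32)      # 100 fake RGB "images"
labels = np.random.randint(0, 10, size=100)
ds = SequenceDataset(frames, labels, seq_len=5)

x, y = ds[0]
print(x.shape, y.shape)   # (5, 3, 32, 32) (5,)
```

With a real torch DataLoader on top, batching such items produces exactly the (batch_size, seq_len, channel, height, width) layout described above.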

reshaping the dataset in python

One efficient option is to transform to long form with pivot_longer from pyjanitor, using the .value placeholder, which determines which parts of the column names remain as headers:

# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(
    index=['Account', 'lookup'],
    names_to=('Year', '.value'),
    names_pattern=r"(FY\d+)(.+)")

  Account lookup  Year   USD  local
0   Sales     CA  FY11  1000    800
1   Sales     JP  FY11  5000     10
2   Sales     CA  FY12  5000   4800
3   Sales     JP  FY12  6500     15
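For reference, the snippets in this answer assume a wide-format df along these lines (reconstructed from the long-format output above, so the exact column names are an inference):

```python
import pandas as pd

# Wide-format input inferred from the expected output: one column per
# fiscal-year/currency pair, identifier columns Account and lookup.
df = pd.DataFrame({
    "Account": ["Sales", "Sales"],
    "lookup": ["CA", "JP"],
    "FY11USD": [1000, 5000],
    "FY11local": [800, 10],
    "FY12USD": [5000, 6500],
    "FY12local": [4800, 15],
})
print(df)
```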

Another option is to use stack:

temp = df.set_index(['Account', 'lookup'])
temp.columns = temp.columns.str.split(r'(FY\d+)', expand=True).droplevel(0)
temp.columns.names = ['Year', None]
temp.stack('Year').reset_index()

  Account lookup  Year   USD  local
0   Sales     CA  FY11  1000    800
1   Sales     CA  FY12  5000   4800
2   Sales     JP  FY11  5000     10
3   Sales     JP  FY12  6500     15

You can also pull it off with pd.wide_to_long after reshaping the columns:

index = ['Account', 'lookup']
temp = df.set_index(index)
temp.columns = (temp
                .columns
                .str.split(r'(FY\d+)')
                .str[::-1]
                .str.join(''))
(pd.wide_to_long(
    temp.reset_index(),
    stubnames=['USD', 'local'],
    i=index,
    j='Year',
    suffix='.+')
 .reset_index()
)

  Account lookup  Year   USD  local
0   Sales     CA  FY11  1000    800
1   Sales     CA  FY12  5000   4800
2   Sales     JP  FY11  5000     10
3   Sales     JP  FY12  6500     15

Reshaping dataset in Python

Using my as_strided recipe window_nd from here:

input = np.random.rand(15, 5)
current_output = input.reshape(-1, 5, 5)  # I think?
expected_output = window_nd(input, 5, steps=1, axis=0)

steps and axis parameters aren't technically needed in this case but are included for clarity.
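window_nd itself is a custom recipe; since NumPy 1.20 the same overlapping windows are available in the standard library via sliding_window_view, which can serve as a stand-in (assuming window_nd with steps=1 produces the usual step-1 windows):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.random.rand(15, 5)

# non-overlapping 5-row blocks -- what the question's reshape produces
blocks = data.reshape(-1, 5, 5)
print(blocks.shape)    # (3, 5, 5)

# overlapping windows of 5 rows with step 1; sliding_window_view appends
# the window axis last, so swapaxes restores each window to row order
windows = sliding_window_view(data, window_shape=5, axis=0).swapaxes(1, 2)
print(windows.shape)   # (11, 5, 5)
```

Here windows[i] equals data[i:i+5], whereas blocks[i] equals data[5*i:5*i+5].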

Reshape of dataset (Time Series) after filtering?

The signal has the same length before and after filtering, and you append multiple signals (each of length 9000), so you get a 167-element list of signals that are 9000 points long. Why would you expect a 1D array? You get a list of lists ...

import numpy as np
from sktime.transformations.series.outlier_detection import HampelFilter

# toy filter function
def hampel_filter(sig):  # it is good style to save upper- and camel-case names for classes
    return HampelFilter().fit_transform(sig)

# generate toy data
data = np.random.rand(1, 3, 90) # shape(data): (1, 3, 90)
print(f'np.shape(data.shape): {data.shape}')

mydata = data[0, :] # shape(mydata): (3, 90)
print(f'shape(mydata): {mydata.shape}')

mydata_filtered = []
for signal in mydata:  # shape(signal): (90,), each row is a 1D array
    print(f'shape(signal): {np.shape(signal)}')
    signal_filtered = hampel_filter(signal)
    mydata_filtered.append(signal_filtered)
print(f'shape(mydata_filtered): {np.shape(mydata_filtered)}')

and you'll get:

np.shape(data.shape): (1, 3, 90)
shape(mydata): (3, 90)
shape(signal): (90,)
shape(signal): (90,)
shape(signal): (90,)
shape(mydata_filtered): (3, 90, 1)

You can flatten the filtered signal inside hampel_filter if you need a plain array returned; then you would get:

np.shape(data.shape): (1, 3, 90)
shape(mydata): (3, 90)
shape(signal): (90,)
shape(signal): (90,)
shape(signal): (90,)
shape(mydata_filtered): (3, 90)
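To see where the trailing axis comes from without installing sktime, here is a numpy-only sketch; a stand-in identity filter returns a (n, 1) column vector, which is effectively what HampelFilter's fit_transform does per signal (as the (3, 90, 1) output above shows):

```python
import numpy as np

# stand-in for HampelFilter().fit_transform: returns a 2D (n, 1)
# column for a 1D input signal (assumption mimicking the shapes above)
def fake_filter(sig):
    return sig.reshape(-1, 1)

mydata = np.random.rand(3, 90)

# appending (90, 1) arrays yields shape (3, 90, 1)
filtered = np.array([fake_filter(s) for s in mydata])
print(filtered.shape)          # (3, 90, 1)

# flattening each result inside the filter removes the trailing axis
filtered_flat = np.array([fake_filter(s).ravel() for s in mydata])
print(filtered_flat.shape)     # (3, 90)
```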

How can I reshape a long dataset into a short data set with multiple variables

To make the case clearer, I created a second row of dummy data that follows the pattern of the first row:

dput(dat)
structure(list(UPDATEDID = c(16, 17), BRIEF_ID = c("04999120040277",
"14999120040277"), gamma = c(879.744, 779.744), LDR_SUM = c(0.15326902,
0.25326902), LDR_Topic = c("supervises collective followers very closely",
"does something else"), LDR_7Code = c(1, 2)), class = "data.frame", row.names = c(NA,
-2L))

dat
UPDATEDID BRIEF_ID gamma LDR_SUM LDR_Topic LDR_7Code
1 16 04999120040277 879.744 0.153269 supervises collective followers very closely 1
2 17 14999120040277 779.744 0.253269 does something else 2

A base R way

dat |>
  reshape(direction = "wide",
          idvar = "UPDATEDID",
          timevar = "LDR_Topic",
          v.names = "LDR_SUM") |>
  subset(select = -c(gamma, LDR_7Code))

# The result

# UPDATEDID BRIEF_ID LDR_SUM.supervises collective followers very closely LDR_SUM.does something else
#1 16 04999120040277 0.153269 NA
#2 17 14999120040277 NA 0.253269

A tidyverse way

library(tidyverse)

dat |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
  select(-c(gamma, LDR_7Code))

# The result

# A tibble: 2 × 4
# UPDATEDID BRIEF_ID `supervises collective followers very closely` `does something else`
# <dbl> <chr> <dbl> <dbl>
#1 16 04999120040277 0.153 NA
#2 17 14999120040277 NA 0.253

A data.table way (recommended for memory efficiency)

library(data.table)

dat.dt <- as.data.table(dat)
dcast(dat.dt, UPDATEDID + BRIEF_ID ~ LDR_Topic, value.var = 'LDR_SUM')

# The result

# UPDATEDID BRIEF_ID does something else supervises collective followers very closely
#1: 16 04999120040277 NA 0.153269
#2: 17 14999120040277 0.253269 NA
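For Python readers, a pandas analogue of the same wide pivot (data transcribed from the dput above; offered purely as a cross-language comparison):

```python
import pandas as pd

# same rows as the R data frame `dat`
dat = pd.DataFrame({
    "UPDATEDID": [16, 17],
    "BRIEF_ID": ["04999120040277", "14999120040277"],
    "gamma": [879.744, 779.744],
    "LDR_SUM": [0.15326902, 0.25326902],
    "LDR_Topic": ["supervises collective followers very closely",
                  "does something else"],
    "LDR_7Code": [1, 2],
})

# spread LDR_Topic into columns holding LDR_SUM, like pivot_wider/dcast
wide = dat.pivot(index=["UPDATEDID", "BRIEF_ID"],
                 columns="LDR_Topic",
                 values="LDR_SUM").reset_index()
print(wide)
```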

Updates

Based on your explanation, the tidyverse way basically works in the right direction. The only remaining problem is the duplicated rows that have NAs in some of their columns, which you want to collapse into a single row. This is easy to do with the fill() and distinct() functions. One wrinkle in your example is that UPDATEDID changed from 1, 2, 3, 4 to 1 with no explanation. Hence, for now, I assume we can ignore UPDATEDID (you can create a new column for it later) and only need to consider BRIEF_ID.

yourdf <- structure(list(UPDATEDID = 1:4, BRIEF_ID = c(1999110036250, 1999110036250, 
1999110036250, 1999110036250), acquired_resources = c(0.02843241,
NA, NA, 0.02843241), distributed_resources = c(NA, 0.010892233,
0.010892233, NA), enhanced = c(NA, NA, 0.006081761, 0.006081761
)), class = "data.frame", row.names = c(NA, -4L))

yourdf # I changed the spaces to '_' to make the names easier to work with

UPDATEDID BRIEF_ID acquired_resources distributed_resources enhanced
1 1 1.99911e+12 0.02843241 NA NA
2 2 1.99911e+12 NA 0.01089223 NA
3 3 1.99911e+12 NA 0.01089223 0.006081761
4 4 1.99911e+12 0.02843241 NA 0.006081761

yourdf[, -1] |>
  fill(acquired_resources, distributed_resources, enhanced,
       .direction = 'downup') |>
  distinct()


# The result
BRIEF_ID acquired_resources distributed_resources enhanced
1 1.99911e+12 0.02843241 0.01089223 0.006081761

Then, the complete step would be:

dat |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
  select(-c(gamma, LDR_7Code)) |>
  fill(acquired_resources, distributed_resources, enhanced,
       .direction = 'downup') |>
  distinct()
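The fill() + distinct() collapse has a compact pandas analogue: groupby().first() takes the first non-NA value per column within each BRIEF_ID group (data transcribed from yourdf above, again as a cross-language comparison):

```python
import numpy as np
import pandas as pd

# same rows as `yourdf`, minus UPDATEDID (ignored, as explained above)
yourdf = pd.DataFrame({
    "BRIEF_ID": [1999110036250] * 4,
    "acquired_resources": [0.02843241, np.nan, np.nan, 0.02843241],
    "distributed_resources": [np.nan, 0.010892233, 0.010892233, np.nan],
    "enhanced": [np.nan, np.nan, 0.006081761, 0.006081761],
})

# first() skips NaN, so the partially-filled rows collapse into one
collapsed = yourdf.groupby("BRIEF_ID", as_index=False).first()
print(collapsed)
```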

