Reshaping Dataset

How to reshape a dataset in order to make it sequential?

It is more appropriate to do this kind of processing in the dataset layer. What you are looking to implement there is "given a dataset index, return the corresponding input and its label". In your case you are dealing with a sequence as input, so it makes sense for your __getitem__ to return a sequence of images.

The data loader will automatically collate the data such that you get (batch_size, seq_len, channel, height, width) for your input, and (batch_size, seq_len) for your label (or (batch_size,) if there is meant to be a single label per sequence).
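A minimal sketch of such a dataset, using plain numpy arrays in place of real images and omitting the torch.utils.data.Dataset base class for brevity (the frame count, sequence length, and image size are all made-up example values):

```python
import numpy as np

class SequenceDataset:
    """Returns a contiguous sequence of frames and their labels per index."""

    def __init__(self, frames, labels, seq_len):
        self.frames = frames      # (num_frames, channel, height, width)
        self.labels = labels      # (num_frames,)
        self.seq_len = seq_len

    def __len__(self):
        # number of complete, non-overlapping sequences
        return len(self.frames) // self.seq_len

    def __getitem__(self, index):
        # slice out one sequence of frames and its per-frame labels
        start = index * self.seq_len
        stop = start + self.seq_len
        return self.frames[start:stop], self.labels[start:stop]

frames = np.random.rand(100, 3, 32, 32)      # 100 fake RGB "images"
labels = np.random.randint(0, 10, size=100)
ds = SequenceDataset(frames, labels, seq_len=5)

x, y = ds[0]
print(x.shape, y.shape)   # (5, 3, 32, 32) (5,)
```

With a real torch DataLoader on top, batching such items produces exactly the (batch_size, seq_len, channel, height, width) layout described above.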

reshaping the dataset in python

One efficient option is to transform to long form with pivot_longer from pyjanitor, using the .value placeholder, which determines which parts of the column names remain as headers:

# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(
    index=['Account', 'lookup'],
    names_to=('Year', '.value'),
    names_pattern=r"(FY\d+)(.+)")

  Account lookup  Year   USD  local
0   Sales     CA  FY11  1000    800
1   Sales     JP  FY11  5000     10
2   Sales     CA  FY12  5000   4800
3   Sales     JP  FY12  6500     15
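For reference, the snippets in this answer assume a wide-format df along these lines (reconstructed from the long-format output above, so the exact column names are an inference):

```python
import pandas as pd

# Wide-format input inferred from the expected output: one column per
# fiscal-year/currency pair, identifier columns Account and lookup.
df = pd.DataFrame({
    "Account": ["Sales", "Sales"],
    "lookup": ["CA", "JP"],
    "FY11USD": [1000, 5000],
    "FY11local": [800, 10],
    "FY12USD": [5000, 6500],
    "FY12local": [4800, 15],
})
print(df)
```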

Another option is to use stack:

temp = df.set_index(['Account', 'lookup'])
temp.columns = temp.columns.str.split(r'(FY\d+)', expand=True).droplevel(0)
temp.columns.names = ['Year', None]
temp.stack('Year').reset_index()

  Account lookup  Year   USD  local
0   Sales     CA  FY11  1000    800
1   Sales     CA  FY12  5000   4800
2   Sales     JP  FY11  5000     10
3   Sales     JP  FY12  6500     15

You can also pull it off with pd.wide_to_long after reshaping the columns:

index = ['Account', 'lookup']
temp = df.set_index(index)
temp.columns = (temp
                .columns
                .str.split(r'(FY\d+)')
                .str[::-1]
                .str.join(''))
(pd.wide_to_long(
    temp.reset_index(),
    stubnames=['USD', 'local'],
    i=index,
    j='Year',
    suffix='.+')
 .reset_index()
)

  Account lookup  Year   USD  local
0   Sales     CA  FY11  1000    800
1   Sales     CA  FY12  5000   4800
2   Sales     JP  FY11  5000     10
3   Sales     JP  FY12  6500     15

Reshaping dataset in Python

Using my as_strided recipe window_nd from here:

input = np.random.rand(15, 5)
current_output = input.reshape(-1, 5, 5)  # I think?
expected_output = window_nd(input, 5, steps=1, axis=0)

steps and axis parameters aren't technically needed in this case but are included for clarity.
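window_nd itself is a custom recipe; since NumPy 1.20 the same overlapping windows are available in the standard library via sliding_window_view, which can serve as a stand-in (assuming window_nd with steps=1 produces the usual step-1 windows):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.random.rand(15, 5)

# non-overlapping 5-row blocks -- what the question's reshape produces
blocks = data.reshape(-1, 5, 5)
print(blocks.shape)    # (3, 5, 5)

# overlapping windows of 5 rows with step 1; sliding_window_view appends
# the window axis last, so swapaxes restores each window to row order
windows = sliding_window_view(data, window_shape=5, axis=0).swapaxes(1, 2)
print(windows.shape)   # (11, 5, 5)
```

Here windows[i] equals data[i:i+5], whereas blocks[i] equals data[5*i:5*i+5].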

Reshape of dataset (Time Series) after filtering?

The signal has the same length before and after filtering, and you append multiple signals (each of length 9000), so you get a 167-element list of signals that are 9000 points long. Why would you expect a 1D array? You get a list of lists ...

import numpy as np
from sktime.transformations.series.outlier_detection import HampelFilter

# toy filter function
def hampel_filter(sig):  # it is good style to save upper- and camel-case names for classes
    return HampelFilter().fit_transform(sig)

# generate toy data
data = np.random.rand(1, 3, 90) # shape(data): (1, 3, 90)
print(f'np.shape(data.shape): {data.shape}')

mydata = data[0, :] # shape(mydata): (3, 90)
print(f'shape(mydata): {mydata.shape}')

mydata_filtered = []
for signal in mydata:  # shape(signal): (90,), each row is a 1D array
    print(f'shape(signal): {np.shape(signal)}')
    signal_filtered = hampel_filter(signal)
    mydata_filtered.append(signal_filtered)
print(f'shape(mydata_filtered): {np.shape(mydata_filtered)}')

and you'll get:

np.shape(data.shape): (1, 3, 90)
shape(mydata): (3, 90)
shape(signal): (90,)
shape(signal): (90,)
shape(signal): (90,)
shape(mydata_filtered): (3, 90, 1)

You can flatten the filtered signal inside hampel_filter if you need a plain array returned; then you would get:

np.shape(data.shape): (1, 3, 90)
shape(mydata): (3, 90)
shape(signal): (90,)
shape(signal): (90,)
shape(signal): (90,)
shape(mydata_filtered): (3, 90)
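To see where the trailing axis comes from without installing sktime, here is a numpy-only sketch; a stand-in identity filter returns a (n, 1) column vector, which is effectively what HampelFilter's fit_transform does per signal (as the (3, 90, 1) output above shows):

```python
import numpy as np

# stand-in for HampelFilter().fit_transform: returns a 2D (n, 1)
# column for a 1D input signal (assumption mimicking the shapes above)
def fake_filter(sig):
    return sig.reshape(-1, 1)

mydata = np.random.rand(3, 90)

# appending (90, 1) arrays yields shape (3, 90, 1)
filtered = np.array([fake_filter(s) for s in mydata])
print(filtered.shape)          # (3, 90, 1)

# flattening each result inside the filter removes the trailing axis
filtered_flat = np.array([fake_filter(s).ravel() for s in mydata])
print(filtered_flat.shape)     # (3, 90)
```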

How can I reshape a long dataset into a short data set with multiple variables

To make the case clearer, I created a second row of dummy data that follows the pattern of the first row:

dput(dat)
structure(list(UPDATEDID = c(16, 17), BRIEF_ID = c("04999120040277",
"14999120040277"), gamma = c(879.744, 779.744), LDR_SUM = c(0.15326902,
0.25326902), LDR_Topic = c("supervises collective followers very closely",
"does something else"), LDR_7Code = c(1, 2)), class = "data.frame", row.names = c(NA,
-2L))

dat
UPDATEDID BRIEF_ID gamma LDR_SUM LDR_Topic LDR_7Code
1 16 04999120040277 879.744 0.153269 supervises collective followers very closely 1
2 17 14999120040277 779.744 0.253269 does something else 2

A base R way

dat |>
  reshape(direction = "wide",
          idvar = "UPDATEDID",
          timevar = "LDR_Topic",
          v.names = "LDR_SUM") |>
  subset(select = -c(gamma, LDR_7Code))

# The result

# UPDATEDID BRIEF_ID LDR_SUM.supervises collective followers very closely LDR_SUM.does something else
#1 16 04999120040277 0.153269 NA
#2 17 14999120040277 NA 0.253269

A tidyverse way

library(tidyverse)

dat |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
  select(-c(gamma, LDR_7Code))

# The result

# A tibble: 2 × 4
# UPDATEDID BRIEF_ID `supervises collective followers very closely` `does something else`
# <dbl> <chr> <dbl> <dbl>
#1 16 04999120040277 0.153 NA
#2 17 14999120040277 NA 0.253

A data.table way (recommended for memory efficiency)

library(data.table)

dat.dt <- as.data.table(dat)
dcast(dat.dt, UPDATEDID + BRIEF_ID ~ LDR_Topic, value.var = 'LDR_SUM')

# The result

# UPDATEDID BRIEF_ID does something else supervises collective followers very closely
#1: 16 04999120040277 NA 0.153269
#2: 17 14999120040277 0.253269 NA
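For Python readers, a pandas analogue of the same wide pivot (data transcribed from the dput above; offered purely as a cross-language comparison):

```python
import pandas as pd

# same rows as the R data frame `dat`
dat = pd.DataFrame({
    "UPDATEDID": [16, 17],
    "BRIEF_ID": ["04999120040277", "14999120040277"],
    "gamma": [879.744, 779.744],
    "LDR_SUM": [0.15326902, 0.25326902],
    "LDR_Topic": ["supervises collective followers very closely",
                  "does something else"],
    "LDR_7Code": [1, 2],
})

# spread LDR_Topic into columns holding LDR_SUM, like pivot_wider/dcast
wide = dat.pivot(index=["UPDATEDID", "BRIEF_ID"],
                 columns="LDR_Topic",
                 values="LDR_SUM").reset_index()
print(wide)
```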

Updates

Based on your explanation, the tidyverse way basically works in the right direction. The only remaining problem is the duplicated rows that have NAs in some of their columns, which you want to collapse into a single row. This is easy to do with the fill() and distinct() functions. One wrinkle in your example is that UPDATEDID changed from 1, 2, 3, 4 to 1 with no explanation. Hence, for now, I assume we can ignore UPDATEDID (you can create a new column for it later) and only need to consider BRIEF_ID.

yourdf <- structure(list(UPDATEDID = 1:4, BRIEF_ID = c(1999110036250, 1999110036250, 
1999110036250, 1999110036250), acquired_resources = c(0.02843241,
NA, NA, 0.02843241), distributed_resources = c(NA, 0.010892233,
0.010892233, NA), enhanced = c(NA, NA, 0.006081761, 0.006081761
)), class = "data.frame", row.names = c(NA, -4L))

yourdf # I changed the spaces to '_' to make the names easier to work with

UPDATEDID BRIEF_ID acquired_resources distributed_resources enhanced
1 1 1.99911e+12 0.02843241 NA NA
2 2 1.99911e+12 NA 0.01089223 NA
3 3 1.99911e+12 NA 0.01089223 0.006081761
4 4 1.99911e+12 0.02843241 NA 0.006081761

yourdf[, -1] |>
  fill(acquired_resources, distributed_resources, enhanced,
       .direction = 'downup') |>
  distinct()


# The result
BRIEF_ID acquired_resources distributed_resources enhanced
1 1.99911e+12 0.02843241 0.01089223 0.006081761

Then, the complete step would be:

dat |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
  select(-c(gamma, LDR_7Code)) |>
  fill(acquired_resources, distributed_resources, enhanced,
       .direction = 'downup') |>
  distinct()
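The fill() + distinct() collapse has a compact pandas analogue: groupby().first() takes the first non-NA value per column within each BRIEF_ID group (data transcribed from yourdf above, again as a cross-language comparison):

```python
import numpy as np
import pandas as pd

# same rows as `yourdf`, minus UPDATEDID (ignored, as explained above)
yourdf = pd.DataFrame({
    "BRIEF_ID": [1999110036250] * 4,
    "acquired_resources": [0.02843241, np.nan, np.nan, 0.02843241],
    "distributed_resources": [np.nan, 0.010892233, 0.010892233, np.nan],
    "enhanced": [np.nan, np.nan, 0.006081761, 0.006081761],
})

# first() skips NaN, so the partially-filled rows collapse into one
collapsed = yourdf.groupby("BRIEF_ID", as_index=False).first()
print(collapsed)
```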

