How to reshape a dataset in order to make it sequential?
It makes more sense to do this kind of processing in the dataset layer. What you want to implement there is: "given a dataset index, return the corresponding input and its label". Since your input is a sequence, it makes sense for your __getitem__ to return a sequence of images.
The data loader will then automatically collate the data so that you get (batch_size, seq_len, channel, height, width) for your input and (batch_size, seq_len) for your label (or (batch_size,) if there is meant to be a single label per sequence).
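As a minimal sketch of the idea above, the class below follows the map-style dataset protocol (__len__ plus __getitem__) using only NumPy. The class name SequenceDataset, the seq_len argument, and the choice of non-overlapping sequences are assumptions made for illustration, not part of the original question. The np.stack calls at the end only mimic what a data loader's default collation would do, namely stacking samples along a new leading batch dimension.

```python
import numpy as np


class SequenceDataset:
    """Hypothetical map-style dataset: index -> (sequence of frames, per-frame labels)."""

    def __init__(self, frames, labels, seq_len):
        self.frames = frames    # shape: (num_frames, channel, height, width)
        self.labels = labels    # shape: (num_frames,)
        self.seq_len = seq_len

    def __len__(self):
        # Number of non-overlapping sequences that fit in the data.
        return len(self.frames) // self.seq_len

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.seq_len
        stop = start + self.seq_len
        # One sample is a whole sequence, not a single image.
        return self.frames[start:stop], self.labels[start:stop]


# Toy data: 20 frames of 3x8x8 "images" with one label per frame.
frames = np.zeros((20, 3, 8, 8), dtype=np.float32)
labels = np.arange(20)
dataset = SequenceDataset(frames, labels, seq_len=5)

# Mimic default collation for a batch of two samples.
batch = [dataset[i] for i in range(2)]
inputs = np.stack([x for x, _ in batch])
targets = np.stack([y for _, y in batch])
print(inputs.shape, targets.shape)
```

For a single label per sequence, __getitem__ could return one label (for example the last one in the slice) instead of the whole label slice.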
Reshaping the dataset in Python
One efficient option is to transform to long form with pivot_longer from pyjanitor, using the .value placeholder. The .value placeholder determines which parts of the column names remain as headers:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(
    index = ['Account', 'lookup'],
    names_to = ('Year', '.value'),
    names_pattern = r"(FY\d+)(.+)")
  Account lookup  Year   USD  local
0   Sales     CA  FY11  1000    800
1   Sales     JP  FY11  5000     10
2   Sales     CA  FY12  5000   4800
3   Sales     JP  FY12  6500     15
Another option is to use stack:
temp = df.set_index(['Account', 'lookup'])
# raw string so that \d is passed to the regex engine unchanged
temp.columns = temp.columns.str.split(r'(FY\d+)', expand = True).droplevel(0)
temp.columns.names = ['Year', None]
temp.stack('Year').reset_index()
  Account lookup  Year   USD  local
0   Sales     CA  FY11  1000    800
1   Sales     CA  FY12  5000   4800
2   Sales     JP  FY11  5000     10
3   Sales     JP  FY12  6500     15
You can also pull it off with pd.wide_to_long after reshaping the columns:
index = ['Account', 'lookup']
temp = df.set_index(index)
# move the year to the end of each name, e.g. 'FY11USD' -> 'USDFY11'
temp.columns = (temp
                .columns
                .str.split(r'(FY\d+)')
                .str[::-1]
                .str.join('')
                )
(pd.wide_to_long(
    temp.reset_index(),
    stubnames = ['USD', 'local'],
    i = index,
    j = 'Year',
    suffix = '.+')
 .reset_index()
)
  Account lookup  Year   USD  local
0   Sales     CA  FY11  1000    800
1   Sales     CA  FY12  5000   4800
2   Sales     JP  FY11  5000     10
3   Sales     JP  FY12  6500     15
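The snippets above never show how df is built, so here is a self-contained sketch of the pd.wide_to_long route. The input frame is an assumption, reconstructed from the printed outputs (column names such as FY11USD are inferred from the names_pattern). The renaming step uses re.sub rather than the split/join chain, and the final sort is added only to make the row order explicit.

```python
import re

import pandas as pd

# Hypothetical input, reconstructed from the outputs shown above.
df = pd.DataFrame({
    "Account": ["Sales", "Sales"],
    "lookup": ["CA", "JP"],
    "FY11USD": [1000, 5000],
    "FY11local": [800, 10],
    "FY12USD": [5000, 6500],
    "FY12local": [4800, 15],
})

# Move the year to the end of each name so the stubs ('USD', 'local') come first,
# e.g. 'FY11USD' -> 'USDFY11'. Columns without a year are left unchanged.
renamed = df.rename(columns=lambda c: re.sub(r"^(FY\d+)(.+)$", r"\2\1", c))

long_df = (
    pd.wide_to_long(
        renamed,
        stubnames=["USD", "local"],
        i=["Account", "lookup"],
        j="Year",
        suffix=r"FY\d+",
    )
    .reset_index()
    .sort_values(["lookup", "Year"], ignore_index=True)
)
print(long_df)
```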
Reshaping dataset in Python
Using my as_strided recipe window_nd from here:
input = np.random.rand(15, 5)
current_output = input.reshape(-1, 5, 5) #I think?
expected_output = window_nd(input, 5, steps = 1, axis = 0)
The steps and axis parameters aren't technically needed in this case but are included for clarity.
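Since window_nd itself is not reproduced here, the sketch below gets overlapping windows with NumPy's built-in sliding_window_view instead. This is a swapped-in technique, not the author's recipe, and it assumes the intended result is one window of 5 consecutive rows per starting row.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.random.rand(15, 5)

# reshape only cuts the array into 3 non-overlapping blocks of 5 rows.
blocks = data.reshape(-1, 5, 5)

# sliding_window_view puts the window axis last, giving (11, 5, 5)
# indexed as [start_row, column, offset_in_window].
windows = sliding_window_view(data, window_shape=5, axis=0)

# Swap the last two axes so that windows[i] equals data[i:i + 5].
windows = windows.transpose(0, 2, 1)
print(blocks.shape, windows.shape)
```

The result is a read-only view onto data, so copy it before modifying any window in place.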
Reshape of dataset (Time Series) after filtering?
The signal has the same length before and after filtering, and you append multiple signals (each of length 9000), so you end up with a list of 167 signals that are each 9000 points long. Why would you expect a 1D array? What you get is a list of lists ...
import numpy as np
from sktime.transformations.series.outlier_detection import HampelFilter
# toy filter function
def hampel_filter(sig):  # it is good style to reserve upper- and camel-case names for classes
    return HampelFilter().fit_transform(sig)
# generate toy data
data = np.random.rand(1, 3, 90)  # shape(data): (1, 3, 90)
print(f'np.shape(data.shape): {data.shape}')
mydata = data[0, :]  # shape(mydata): (3, 90)
print(f'shape(mydata): {mydata.shape}')
mydata_filtered = []
for signal in mydata:  # shape(signal): (90,), a 1D array; the filtered result is 2D with shape (90, 1)
    print(f'shape(signal): {np.shape(signal)}')
    signal_filtered = hampel_filter(signal)
    mydata_filtered.append(signal_filtered)
print(f'shape(mydata_filtered): {np.shape(mydata_filtered)}')
and you'll get:
np.shape(data.shape): (1, 3, 90)
shape(mydata): (3, 90)
shape(signal): (90,)
shape(signal): (90,)
shape(signal): (90,)
shape(mydata_filtered): (3, 90, 1)
You can flatten the filtered signal inside hampel_filter if you need a plain array back. You would then get:
np.shape(data.shape): (1, 3, 90)
shape(mydata): (3, 90)
shape(signal): (90,)
shape(signal): (90,)
shape(signal): (90,)
shape(mydata_filtered): (3, 90)
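The flattening step could look like the sketch below. To keep it runnable without sktime, toy_filter is a hypothetical stand-in that only reproduces the (n, 1) output shape seen above and does no actual filtering. With the real filter you would apply the same ravel call to the result of fit_transform (wrapping it in np.asarray first, in case a pandas object is returned).

```python
import numpy as np


def toy_filter(sig):
    # Hypothetical stand-in for HampelFilter().fit_transform(sig):
    # returns a column vector of shape (n, 1), like the output above.
    return np.asarray(sig).reshape(-1, 1)


def filter_flat(sig):
    # Flatten the (n, 1) result back to a 1D array of shape (n,).
    return toy_filter(sig).ravel()


mydata = np.random.rand(3, 90)

unflattened = np.array([toy_filter(s) for s in mydata])
flattened = np.array([filter_flat(s) for s in mydata])
print(unflattened.shape, flattened.shape)
```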
How can I reshape a long dataset into a short data set with multiple variables
To make the case clearer, I tried to create a second row with dummy data that follows the pattern of data in the first row:
dput(dat)
structure(list(UPDATEDID = c(16, 17), BRIEF_ID = c("04999120040277",
"14999120040277"), gamma = c(879.744, 779.744), LDR_SUM = c(0.15326902,
0.25326902), LDR_Topic = c("supervises collective followers very closely",
"does something else"), LDR_7Code = c(1, 2)), class = "data.frame", row.names = c(NA,
-2L))
dat
  UPDATEDID       BRIEF_ID   gamma  LDR_SUM                                    LDR_Topic LDR_7Code
1        16 04999120040277 879.744 0.153269 supervises collective followers very closely         1
2        17 14999120040277 779.744 0.253269                          does something else         2
A base R way
dat |>
  reshape(direction = "wide",
          idvar = "UPDATEDID",
          timevar = "LDR_Topic",
          v.names = "LDR_SUM") |>
  subset(select = -c(gamma, LDR_7Code))
# The result
# UPDATEDID BRIEF_ID LDR_SUM.supervises collective followers very closely LDR_SUM.does something else
#1 16 04999120040277 0.153269 NA
#2 17 14999120040277 NA 0.253269
A tidyverse way
library(tidyverse)
dat |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
  select(-c(gamma, LDR_7Code))
#The result
# A tibble: 2 × 4
# UPDATEDID BRIEF_ID `supervises collective followers very closely` `does something else`
# <dbl> <chr> <dbl> <dbl>
#1 16 04999120040277 0.153 NA
#2 17 14999120040277 NA 0.253
A data.table way (recommended for memory efficiency)
library(data.table)
dat.dt <- as.data.table(dat)
dcast(dat.dt, UPDATEDID + BRIEF_ID ~ LDR_Topic, value.var = 'LDR_SUM')
# The result
# UPDATEDID BRIEF_ID does something else supervises collective followers very closely
#1: 16 04999120040277 NA 0.153269
#2: 17 14999120040277 0.253269 NA
Updates
Based on your explanation, the tidyverse way basically works in the right direction. The remaining problem is the duplicated rows that have NAs in some of their columns, which you want to collapse into a single row. This is easy to do with the fill() and distinct() functions. The only oddity in your example is that UPDATEDID changed from 1,2,3,4 to 1 with no explanation. Hence, for now, I assume that we can ignore UPDATEDID (you can create a new column for it later) and only need to consider BRIEF_ID.
yourdf <- structure(list(UPDATEDID = 1:4, BRIEF_ID = c(1999110036250, 1999110036250,
1999110036250, 1999110036250), acquired_resources = c(0.02843241,
NA, NA, 0.02843241), distributed_resources = c(NA, 0.010892233,
0.010892233, NA), enhanced = c(NA, NA, 0.006081761, 0.006081761
)), class = "data.frame", row.names = c(NA, -4L))
yourdf # I changed the spaces to '_' to make the names easier to handle
  UPDATEDID    BRIEF_ID acquired_resources distributed_resources    enhanced
1         1 1.99911e+12         0.02843241                    NA          NA
2         2 1.99911e+12                 NA            0.01089223          NA
3         3 1.99911e+12                 NA            0.01089223 0.006081761
4         4 1.99911e+12         0.02843241                    NA 0.006081761
yourdf[,-1] |>
  fill(acquired_resources, distributed_resources, enhanced,
       .direction = 'downup') |>
  distinct()
# The result
     BRIEF_ID acquired_resources distributed_resources    enhanced
1 1.99911e+12         0.02843241            0.01089223 0.006081761
Then, the complete step would be:
dat |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
  select(-c(gamma, LDR_7Code)) |>
  fill(acquired_resources, distributed_resources, enhanced,
       .direction = 'downup') |>
  distinct()
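For readers working in Python, a rough pandas analogue of the collapse step is sketched below. It is not a translation of fill() plus distinct(): it relies on groupby(...).first(), which takes the first non-null value per column within each BRIEF_ID group. The toy frame mirrors yourdf from the R example.

```python
import numpy as np
import pandas as pd

# Same toy data as yourdf in the R example above.
yourdf = pd.DataFrame({
    "UPDATEDID": [1, 2, 3, 4],
    "BRIEF_ID": [1999110036250] * 4,
    "acquired_resources": [0.02843241, np.nan, np.nan, 0.02843241],
    "distributed_resources": [np.nan, 0.010892233, 0.010892233, np.nan],
    "enhanced": [np.nan, np.nan, 0.006081761, 0.006081761],
})

# Drop UPDATEDID, then keep the first non-null value of every column per BRIEF_ID.
collapsed = (
    yourdf.drop(columns="UPDATEDID")
    .groupby("BRIEF_ID", as_index=False)
    .first()
)
print(collapsed)
```

Unlike the ungrouped fill() call, this keeps values from different BRIEF_ID groups from leaking into each other.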