How to Get This Data Structure in R

data structure with vectors as elements in R

What you want is list-columns. It's a little difficult to build them from the start, but not so hard to add them later.

### won't work
dat <- data.frame(a=c("ICC-1","IIC-2"), range=list(1:10, 10:30))
# Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
# arguments imply differing number of rows: 10, 21

### this does work
dat <- data.frame(a=c("ICC-1","IIC-2"))
dat$range <- list(1:10, 10:30)
dat
#       a range
# 1 ICC-1 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
# 2 IIC-2 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30

(It is feasible to add 1:10 as a quoted expression if you'd prefer, but that takes more care in follow-on processing that I did not want to assume.)
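If you do want to build the list-column inside a single data.frame() call, one option is to wrap the list in I(), which stops data.frame() from trying to expand the list into separate columns. A minimal sketch (the names lens and second are only illustrative):

```r
# I() marks the list "as is", so it is stored as one list-column
dat <- data.frame(a = c("ICC-1", "IIC-2"), range = I(list(1:10, 10:30)))

# each cell of the list-column is an ordinary vector
lens   <- lengths(dat$range)   # number of elements per row
second <- dat$range[[2]]       # the vector stored in row 2
```

The column then carries the extra class "AsIs", which is usually harmless, but assigning the list after construction (as above) avoids it.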

Does R have a Set data structure?

To remove multiple occurrences of a value within a vector, use duplicated().

An example would be:

x <- c(1,2,3,3,4,5,5,6)
x[!duplicated(x)]
# [1] 1 2 3 4 5 6

This is returning all values of x which are not (!) duplicated.

This will also work for more complex data structures like data.frames. See ?duplicated for further information.

unique(x) provides all values occurring in the vector.

table(x) shows the unique values and their number of occurrences in vector x:

table(x)
# x
# 1 2 3 4 5 6
# 1 1 2 1 2 1
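Base R has no dedicated set type, but it does ship set operations that work on plain vectors. A short sketch of how a de-duplicated vector can be treated as a set (the second vector y is made up for illustration):

```r
x <- c(1, 2, 3, 3, 4, 5, 5, 6)
y <- c(5, 6, 7)              # hypothetical second "set"

s <- unique(x)               # de-duplicated values of x

u <- union(s, y)             # values in either vector
i <- intersect(s, y)         # values in both
d <- setdiff(s, y)           # values in s but not in y
has3 <- 3 %in% s             # membership test
```

Note that union(), intersect() and setdiff() drop duplicates themselves, so calling unique() first is optional.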

Is there any dictionary-like structure in R

In R, a named list is the nearest thing to a dictionary, hashed array, or whatever any other language calls it.

Construct with the list function and extract/assign elements with the $ operator:

> SLIST = list(S1=c("x_S1_x","x_S1_x"), S2="xx_S2_xx", S3="xx_S3_xx")
> SLIST$S1
[1] "x_S1_x" "x_S1_x"
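The $ operator needs the key typed literally. When the key is held in a variable, [[ ]] does the same job. A small sketch of the usual dictionary operations on a named list (the variable names d, key, has_s2 and keys are illustrative only):

```r
d <- list(S1 = c("x_S1_x", "x_S1_x"), S2 = "xx_S2_xx")

key <- "S3"
d[[key]] <- "xx_S3_xx"          # insert or update by a key stored in a variable

has_s2 <- "S2" %in% names(d)    # does the key exist?

d[["S1"]] <- NULL               # assigning NULL removes the entry

keys <- names(d)                # remaining keys
```

Lookup in a list is by name matching rather than hashing, so for very large key sets an environment may be the better fit.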

Check data structure and impute missing values

library(dplyr)
df %>% mutate_if(is.factor, as.character) -> df1

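# imputation function: returns NA for columns with more than 40% missing values,
# otherwise the most frequent value (character) or the mean (numeric)
impute <- function(x){
  missing_perc <- sum(is.na(x))/length(x) * 100
  return(ifelse(missing_perc > 40, NA,
    ifelse(is.character(x), names(sort(-table(x[!is.na(x)])))[1], mean(x[!is.na(x)]))))
}
# note: sapply() simplifies the mixed results to a character vector,
# so the numeric columns come back as character after the replacement below
impute_val <- sapply(df1, impute)

# impute missing values
df1[] <- Map(function(x, y) replace(x, is.na(x), y), df1, impute_val)
# drop rows that still contain NA, i.e. rows with NA in a column
# whose missing percentage is > 40
df1 <- na.omit(df1)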

#final data
df1

Output is:

  Comp Month  Sales             Qtr1             Qtr2             Qtr3             Qtr4             Qtr5             Qtr6
2    F   Feb Medium 65.4017299879342 66.0814035916701 13.8528823154047 21.5696093859151 18.2194353546947 82.1378237567842
4    S   Apr   High  89.403684460558 74.2279292317107 55.5751067353413  51.869766949676 9.31410894263536 27.7001850772649
6    S  June   High 11.7533272597939 11.6908136522397 12.5517533393577 95.4095394117758  36.061190161854 88.5877252323553
8    T   Aug    Low 7.48507694806904 77.5027731899172 42.0926807913929 11.0406906111166  17.137353355065 23.5045042354614

Sample data:

structure(list(Comp = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 
3L, 3L), .Label = c("F", "S", "T"), class = "factor"), Month = structure(c(4L,
3L, 7L, 1L, 8L, 6L, 5L, 2L, 9L), .Label = c("Apr", "Aug", "Feb",
"Jan", "July", "June", "March", "May", "Sept"), class = "factor"),
Sales = structure(c(2L, 3L, 1L, 1L, 2L, 1L, 3L, 2L, 2L), .Label = c("High",
"Low", "Medium"), class = "factor"), Qtr1 = c(43.4887288603932,
65.4017299879342, NA, 89.403684460558, NA, 11.7533272597939,
50.5520776147023, 7.48507694806904, NA), Qtr2 = c(NA, 66.0814035916701,
NA, 74.2279292317107, NA, 11.6908136522397, NA, 77.5027731899172,
NA), Qtr3 = c(5.68129089660943, 13.8528823154047, 35.6186878867447,
55.5751067353413, 6.98710139840841, 12.5517533393577, 8.91167896334082,
42.0926807913929, NA), Qtr4 = c(22.5347936619073, 21.5696093859151,
NA, 51.869766949676, NA, 95.4095394117758, 16.6109931422397,
11.0406906111166, 56.1983718769625), Qtr5 = c(5.67050215322524,
18.2194353546947, 88.5992815019563, 9.31410894263536, 77.7505977777764,
36.061190161854, 51.1230558156967, 17.137353355065, NA),
Qtr6 = c(27.9433359391987, 82.1378237567842, NA, 27.7001850772649,
NA, 88.5877252323553, 50.3849557833746, 23.5045042354614,
74.2521224310622)), .Names = c("Comp", "Month", "Sales",
"Qtr1", "Qtr2", "Qtr3", "Qtr4", "Qtr5", "Qtr6"), row.names = c(NA,
-9L), class = "data.frame")
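If the numeric columns should stay numeric after imputation, one possible variant is to collect the fill values with lapply() instead of sapply(), so that each value keeps its own type. This is a sketch on a small made-up data frame, not the code that produced the output above; impute_value, max_missing and the column names are assumptions for illustration:

```r
# small hypothetical data set with one character and two numeric columns
df <- data.frame(
  grp = c("a", "b", NA, "a", "b"),
  v1  = c(1, NA, 3, 4, 5),
  v2  = c(NA, NA, NA, 1, 2),
  stringsAsFactors = FALSE
)

# fill value for one column: NA if too many values are missing,
# else the most frequent value (character) or the mean (numeric)
impute_value <- function(x, max_missing = 40) {
  if (mean(is.na(x)) * 100 > max_missing) return(NA)
  if (is.character(x)) names(which.max(table(x))) else mean(x, na.rm = TRUE)
}

vals <- lapply(df, impute_value)   # a list, so no coercion to character

df[] <- Map(function(x, y) replace(x, is.na(x), y), df, vals)
imputed <- df                      # state after imputation

df <- na.omit(df)                  # drop rows still holding NA (from v2)
```

Here v2 is more than 40% missing, so it is left alone and the rows where it is NA are dropped at the end.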

Data structure for time-series analysis in R

There is no black-and-white answer: both object types have their strengths for different purposes (although I would almost always use data.table rather than data.frame, because it offers far more capabilities). I personally use both interchangeably in research, but generally keep the original raw underlying data in xts format to begin with (tick or OHLC bar data in xts objects).

Both object types are fast, with computationally intensive code written in C.

If the dimensions (length or width) of your time series are not large, you can easily transfer back and forth (e.g. data.table("index" = index(xtsobj), coredata(xtsobj))) at the security level, and then merge data.tables if you wish to combine securities for cross-sectional types of modeling. I typically switch back and forth between both object types for the time series that I work with.

xts objects must use all columns of the same type (numeric or character are the common types), which can be a limitation if you have categorical variables mixed with numerical data (you can map the categorical variables to numeric values to get around this, but that is extra work and can reduce clarity when modeling your data).

xts makes merging time series data (with merge), particularly series at different time frequencies, very straightforward. It also works very nicely with building moving window technical indicators in TTR and quantmod. You can also utilize quantmod (chart_Series and add_TA) and xts plotting tools (see ?plot.xts) to visualise candlestick/OHLC bar data out of the box. xts makes aggregating tick data into OHLC bar data, and changing the frequency of bar data series (e.g. from 5 min bars to 1 hour bars, or to daily bars), very straightforward with useful functions like to.period, period.apply and endpoints (and it is fast, because it uses C code).

If you are going to build prediction models (many linear regressions, or more complex models) with many categorical variables (e.g. sector of security, sentiment categories) that you do not want to map to numbers, it may be better to work with data.table. Many prediction models in R (and unsupervised methods like clustering) require data to be in data.frame format, in which case storing/saving/loading your data in data.table/data.frame format might make more sense if your end goal is prediction modelling. VAR models in the vars/urca R packages also use data.frame format. Note, however, that many prediction models (via caret etc.) require data to be in numeric matrix format, which you can easily extract from xts objects using coredata(xtsobj) (although converting data.frame data to matrix format is typically straightforward too).

If your data sets are really big (each security holds n GBs of price data in memory for large n) and you want to do repeated aggregations by group (e.g. computing the mean/sd of returns by month and symbol, or by month and sector), you'll probably find data.table both more natural to work with and more efficient: it is designed to handle large amounts of data in memory/RAM and will tend to do less copying than xts operations.
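To make the back-and-forth concrete, here is a rough sketch of the round trip and of a grouped aggregation. It assumes the xts and data.table packages are installed; the prices, the symbol "AAA" and all variable names are made up for illustration:

```r
library(xts)
library(data.table)

# hypothetical daily bars for one security
idx <- as.Date("2024-01-01") + 0:5
x <- xts(cbind(close  = c(10, 11, 12, 11, 13, 14),
               volume = c(100, 120, 90, 80, 150, 110)),
         order.by = idx)

# xts -> data.table: the time index becomes an ordinary column
dt <- data.table(index = index(x), coredata(x))

# categorical columns can now sit next to the numeric ones
dt[, symbol := "AAA"]
dt[, month := format(index, "%Y-%m")]

# grouped aggregation, e.g. mean/sd by symbol and month
stats <- dt[, .(mean_close = mean(close), sd_close = sd(close)),
            by = .(symbol, month)]

# data.table -> xts: keep only the numeric columns for the core data
x2 <- xts(as.matrix(dt[, .(close, volume)]), order.by = dt$index)
```

The categorical columns have to be left out on the way back, because the core data of an xts object is a single-type matrix.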

How to address data in a hierarchical data structure in R?

[[ can only return a single element. I thought [[ would have thrown an error because of that, not the error you are seeing, but reading ?"[" tells what R does with a call such as yours and explains the behaviour (from ?"["):

Recursive (list-like) objects:
....

 ‘[[’ can be applied recursively to lists, so that if the single
index ‘i’ is a vector of length ‘p’, ‘alist[[i]]’ is equivalent to
‘alist[[i1]]...[[ip]]’ providing all but the final indexing
results in a list.

The reason for your error is this:

> study$results[[c(1,2)]]
[1] -12 -1 3 10 23

which indicates that R really did this

> study$results[[1]][[2]]
[1] -12 -1 3 10 23

i.e. it returns the second component (column) of the first data frame, which is an atomic vector because R drops the empty dimension. $ cannot be used on atomic vectors, hence the error.

If you want to iterate over the list that is study$results, lapply() or sapply() are your friends:

> lapply(study$results, function(y) max(y[, "maxTemp"], na.rm = TRUE))
[[1]]
[1] 23

[[2]]
[1] 21

> sapply(study$results, function(y) max(y[, "maxTemp"], na.rm = TRUE))
[1] 23 21

If you popped names on the components in $results you'd get them in the output too:

> names(study$results) <- study$region
> lapply(study$results, function(y) max(y[, "maxTemp"], na.rm = TRUE))
$Hamburg
[1] 23

$Bremen
[1] 21

> sapply(study$results, function(y) max(y[, "maxTemp"], na.rm = TRUE))
Hamburg Bremen
23 21

which is easier to use, and you then no longer need the $region component unless you want to keep it.
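The behaviour is easy to reproduce on a stand-in object. The study list below is a guess at the structure in the question: only the maxTemp values of the first data frame and the two maxima come from the output above, everything else (minTemp, the remaining numbers) is invented for illustration:

```r
# hypothetical reconstruction of the nested structure
study <- list(
  region  = c("Hamburg", "Bremen"),
  results = list(
    data.frame(minTemp = c(-20, -9, -2, 4, 15),
               maxTemp = c(-12, -1, 3, 10, 23)),
    data.frame(minTemp = c(-18, -7, 0, 5, 14),
               maxTemp = c(-10, 0, 4, 11, 21))
  )
)

# a length-2 index to [[ is applied recursively ...
a <- study$results[[c(1, 2)]]
# ... so it is the same as indexing twice
b <- study$results[[1]][[2]]

# one value per data frame, named by region
names(study$results) <- study$region
maxes <- vapply(study$results,
                function(y) max(y$maxTemp, na.rm = TRUE),
                numeric(1))
```

vapply() is used here instead of sapply() only because it checks that every result is a single number.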

How to get this dcast'able long table in R?

The OP has edited his question and is supplying the data as a data.frame:

dat.df <- structure(list(ave_max = c(15L, 6L), ave = c(6L, NA), lepo = c(4L, NA)), 
.Names = c("ave_max", "ave", "lepo"), class = "data.frame",
row.names = c(NA, -2L))

dat.df
# ave_max ave lepo
#1 15 6 4
#2 6 NA NA
class(dat.df)
#[1] "data.frame"

He is now asking to transform this data.frame into a matrix which is similar to the one used as input data in this answer.

This can be achieved by using data.table:

library(data.table)   # CRAN version 1.10.4 used
# transpose the input data frame, use rowid() to create columns,
# remove a character column to ensure matrix will be of type integer,
# finally, coerce to matrix
dat.m2 <- as.matrix(
data.table::dcast(
data.table::melt(setDT(dat.df), measure.vars = names(dat.df)),
variable ~ rowid(variable)
)[, variable := NULL]
)
# add row names, remove column names
dimnames(dat.m2) <- list(names(dat.df), NULL)

dat.m2
# [,1] [,2]
#ave_max 15 6
#ave 6 NA
#lepo 4 NA

str(dat.m2)
# int [1:3, 1:2] 15 6 4 6 NA NA
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:3] "ave_max" "ave" "lepo"
# ..$ : NULL

class(dat.m2)
#[1] "matrix"
# (from R 4.0.0 onwards this prints "matrix" "array")

Edit: I've amended the above code to use the double colon operator to explicitly state the namespace from which melt() and dcast() should be taken. Normally, this wouldn't be necessary as data.table is already loaded. However, the OP is reporting issues which might be caused by package reshape2 being loaded after data.table. The data.table package has its own faster implementations of reshape2::dcast() and reshape2::melt(). When both packages have been loaded, name clashes might occur.
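For a data frame this small, where every column is an integer, a base R transpose appears to give the same matrix without any reshaping package. This is offered as an alternative sketch, not as the approach of the answer above; dat.base and m are illustrative names:

```r
dat.base <- data.frame(ave_max = c(15L, 6L),
                       ave     = c(6L, NA),
                       lepo    = c(4L, NA))

# as.matrix() keeps the integer type because all columns are integer;
# t() turns the column names into row names
m <- t(as.matrix(dat.base))
colnames(m) <- NULL    # make sure no column names remain
```

This only works cleanly because all columns share one type; with mixed column types as.matrix() would coerce everything to character.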

what data structure does model formula operator in R create?

Don't know if this helps, but: it's a language object (i.e. R parses the input but doesn't try to evaluate any of the components) with class "formula".

> f <- a ~ b + (c + d)
> str(f)
Class 'formula' language a ~ b + (c + d)
..- attr(*, ".Environment")=<environment: R_GlobalEnv>

If you want to work with these objects, you need to know that a formula is essentially stored as a tree: the parent node, an operator or function (~, +, or the parenthesis function "("), can be extracted as the first element, and the child nodes (as many as the 'arity' of the function/operator) can be extracted as elements 2..n, i.e.

  • f[[1]] is ~
  • f[[2]] is a (the first argument, i.e. the LHS of the formula)
  • f[[3]] is b + (c+d)
  • f[[3]][[1]] is +
  • f[[3]][[2]] is b

... and so on.
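The indexing above can be checked directly at the console. A brief sketch using the same formula (the names lhs, rhs, paren and vars are illustrative):

```r
f <- a ~ b + (c + d)

lhs <- f[[2]]             # the symbol a
rhs <- f[[3]]             # the call b + (c + d)

# the parentheses are themselves a function call in the tree
paren <- rhs[[3]]         # the call (c + d)
paren_fun <- paren[[1]]   # the symbol `(`
inner <- paren[[2]]       # the call c + d

# all.vars() walks the whole tree and collects the variable names
vars <- all.vars(f)
```

as.list(f) or as.list(rhs) shows one level of the tree at a time, which can be handy when exploring an unfamiliar formula.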

The chapter on Expressions in Hadley Wickham's Advanced R gives a more complete description.

This is also discussed (more opaquely) in the R Language Definition manual, e.g.

  • Expression objects
  • Direct manipulation of language objects

@user2554330 points out that formulas also typically have associated environments; that is, they carry along a pointer to the environment in which they were created.


