Data structure with vectors as elements in R
What you want is list-columns. It's a little difficult to build them from the start, but not so hard to add them later.
### won't work
dat <- data.frame(a=c("ICC-1","IIC-2"), range=list(1:10, 10:30))
# Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
# arguments imply differing number of rows: 10, 21
### this does work
dat <- data.frame(a=c("ICC-1","IIC-2"))
dat$range <- list(1:10, 10:30)
dat
# a range
# 1 ICC-1 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
# 2 IIC-2 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
(It is feasible to add 1:10 as a quoted expression if you'd prefer, but that takes more care in follow-on processing that I did not want to assume.)
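If you would rather build the list-column in a single call, one option is to wrap the list in I(), which asks data.frame() to store it as-is instead of expanding it into separate columns. This is a minimal sketch with the same values as above; it is an addition, not part of the original answer, and the range_max column is only there for illustration.

```r
# I() protects the list so data.frame() keeps it as a single list-column
dat <- data.frame(a = c("ICC-1", "IIC-2"), range = I(list(1:10, 10:30)))

# each cell of the list-column holds a full vector
dat$range[[2]]       # the vector stored in row 2
lengths(dat$range)   # number of elements in each row's vector

# per-row summaries are easiest with sapply()
dat$range_max <- sapply(dat$range, max)
```

Use [[ ]] rather than [ ] to pull out a single cell, since [ ] returns a list of length one.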
Does R have a Set data structure?
To remove multiple occurrences of a value within a vector, use duplicated(). An example would be:
x <- c(1,2,3,3,4,5,5,6)
x[!duplicated(x)]
# [1] 1 2 3 4 5 6
This returns all values of x which are not (!) duplicated.
This will also work for more complex data structures like data.frames. See ?duplicated for further information.
unique(x) provides all distinct values occurring in the vector.
table(x) shows the unique values and their number of occurrences in vector x:
table(x)
# x
# 1 2 3 4 5 6
# 1 1 2 1 2 1
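Base R has no dedicated set type, but it does ship set operations that treat ordinary vectors as sets. The sketch below uses two made-up vectors a and b to show them; it supplements the duplicated()/unique() approach above rather than replacing it.

```r
a <- c(1, 2, 3, 3)
b <- c(3, 4, 4)

union(a, b)       # elements found in either vector, duplicates dropped
intersect(a, b)   # elements found in both vectors
setdiff(a, b)     # elements of a that are not in b
2 %in% a          # membership test for a single value
setequal(c(1, 2, 2), c(2, 1))   # same set, ignoring order and duplicates
```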
Is there any dictionary-like structure in R?
In R, a named list is the nearest thing to a dictionary or hashed array or whatever any other language calls it.
Construct with the list function and extract/assign elements with the $ operator:
> SLIST = list(S1=c("x_S1_x","x_S1_x"), S2="xx_S2_xx", S3="xx_S3_xx")
> SLIST$S1
[1] "x_S1_x" "x_S1_x"
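The usual dictionary operations (lookup by a computed key, insert, membership test, delete) can be sketched on a named list as follows. The list d and its keys are made up for illustration; note that [[ ]] accepts a key held in a variable, which $ does not.

```r
d <- list(apple = 3, pear = 5)

key <- "pear"
d[[key]]               # lookup by a key stored in a variable
d[["plum"]] <- 7       # insert a new key (or overwrite an existing one)
"plum" %in% names(d)   # test whether a key exists
d[["apple"]] <- NULL   # delete a key
names(d)               # the remaining keys
```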
Check data structure and impute missing values
library(dplyr)
df %>% mutate_if(is.factor, as.character) -> df1
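# imputation function: returns the fill value for one column
impute <- function(x){
  missing_perc <- sum(is.na(x))/length(x) * 100
  # more than 40% missing: fill with NA so that na.omit() drops those rows later;
  # otherwise use the most frequent value (character) or the mean (numeric)
  return(ifelse(missing_perc > 40, NA,
                ifelse(is.character(x), names(sort(-table(x[!is.na(x)])))[1], mean(x[!is.na(x)]))))
}
# note: sapply() simplifies the mixed results to a character vector, so the
# numeric columns end up as character after imputation (hence the long digits in the output)
impute_val <- sapply(df1, impute)
# impute missing values
df1[] <- Map(function(x, y) replace(x, is.na(x), y), df1, impute_val)
# drop rows where a column has missing percentage > 40
df1 <- na.omit(df1)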
#final data
df1
Output is:
Comp Month Sales Qtr1 Qtr2 Qtr3 Qtr4 Qtr5
2 F Feb Medium 65.4017299879342 66.0814035916701 13.8528823154047 21.5696093859151 18.2194353546947
4 S Apr High 89.403684460558 74.2279292317107 55.5751067353413 51.869766949676 9.31410894263536
6 S June High 11.7533272597939 11.6908136522397 12.5517533393577 95.4095394117758 36.061190161854
8 T Aug Low 7.48507694806904 77.5027731899172 42.0926807913929 11.0406906111166 17.137353355065
Qtr6
2 82.1378237567842
4 27.7001850772649
6 88.5877252323553
8 23.5045042354614
Sample data:
structure(list(Comp = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L), .Label = c("F", "S", "T"), class = "factor"), Month = structure(c(4L,
3L, 7L, 1L, 8L, 6L, 5L, 2L, 9L), .Label = c("Apr", "Aug", "Feb",
"Jan", "July", "June", "March", "May", "Sept"), class = "factor"),
Sales = structure(c(2L, 3L, 1L, 1L, 2L, 1L, 3L, 2L, 2L), .Label = c("High",
"Low", "Medium"), class = "factor"), Qtr1 = c(43.4887288603932,
65.4017299879342, NA, 89.403684460558, NA, 11.7533272597939,
50.5520776147023, 7.48507694806904, NA), Qtr2 = c(NA, 66.0814035916701,
NA, 74.2279292317107, NA, 11.6908136522397, NA, 77.5027731899172,
NA), Qtr3 = c(5.68129089660943, 13.8528823154047, 35.6186878867447,
55.5751067353413, 6.98710139840841, 12.5517533393577, 8.91167896334082,
42.0926807913929, NA), Qtr4 = c(22.5347936619073, 21.5696093859151,
NA, 51.869766949676, NA, 95.4095394117758, 16.6109931422397,
11.0406906111166, 56.1983718769625), Qtr5 = c(5.67050215322524,
18.2194353546947, 88.5992815019563, 9.31410894263536, 77.7505977777764,
36.061190161854, 51.1230558156967, 17.137353355065, NA),
Qtr6 = c(27.9433359391987, 82.1378237567842, NA, 27.7001850772649,
NA, 88.5877252323553, 50.3849557833746, 23.5045042354614,
74.2521224310622)), .Names = c("Comp", "Month", "Sales",
"Qtr1", "Qtr2", "Qtr3", "Qtr4", "Qtr5", "Qtr6"), row.names = c(NA,
-9L), class = "data.frame")
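Because sapply() collapses the fill values into one character vector, the numeric columns in the answer's result become character. If the column types matter, a variant that imputes each column in place keeps them intact. This is a sketch under stated assumptions: the data frame toy and the function impute_typed are hypothetical names, and the 40% threshold mirrors the answer.

```r
# hypothetical toy data, not the sample data above
toy <- data.frame(
  grp = c("a", "b", "a", NA, "a"),
  x   = c(1, NA, 3, 4, 5),
  y   = c(NA, NA, NA, 2, 1),
  stringsAsFactors = FALSE
)

impute_typed <- function(x, max_missing_pct = 40) {
  # too sparse: leave the NAs so na.omit() can drop those rows
  if (mean(is.na(x)) * 100 > max_missing_pct) return(x)
  fill <- if (is.character(x)) {
    names(which.max(table(x)))   # most frequent non-missing value
  } else {
    mean(x, na.rm = TRUE)        # mean of the non-missing values
  }
  replace(x, is.na(x), fill)
}

filled <- toy
filled[] <- lapply(toy, impute_typed)   # lapply keeps each column's own type
complete <- na.omit(filled)             # drop rows still containing NA
```

Here x stays numeric and grp stays character, and only the rows where the sparse column y is missing are removed.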
Data structure for time-series analysis in R
There is no black and white answer: both object types have their strengths for different purposes (although I would almost always use data.table instead of the data.frame mentioned in your question, because you get far more capabilities). I personally use both interchangeably in research, but generally keep the original raw underlying data in xts format to begin with (tick or OHLC bar data in xts objects).
Both object types are fast, with computationally intensive code written in C.
If the dimensions (length or width) of your time series are not large, you can easily transfer back and forth (e.g. data.table("index" = index(xtsobj), coredata(xtsobj))) at the security level, and then merge data.tables if you wish to combine securities for cross-sectional types of modeling. I typically switch back and forth between both object types for the time series that I work with.
xts objects must use all columns of the same type (numeric or character are the common types), which can be a limitation if you have categorical variables mixed with numerical data (you can map the categorical variables to numeric values to get around this, but that is extra work and can reduce clarity when modeling your data).
xts makes merging time series data (with merge), particularly series at different time frequencies, very straightforward. It also works very nicely for building moving-window technical indicators with TTR and quantmod. You can also use quantmod (chart_Series and add_TA) and the xts plotting tools (see ?plot.xts) to visualise candlestick/OHLC bar data out of the box. xts makes aggregating tick data into OHLC bar data, and changing the frequency of bar data series (e.g. from 5 min bars to 1 hour bars, or to daily bars), very straightforward with useful functions like to.period, period.apply and endpoints (and it does this fast, using C code).
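As a rough illustration of the aggregation functions just mentioned, the sketch below turns a made-up one-minute price series into hourly OHLC bars and hourly means. It assumes the xts package is installed; the series prices and the column prefix "px" are hypothetical.

```r
library(xts)   # also attaches zoo, which provides index() and coredata()

# hypothetical data: 120 one-minute prices starting at 09:00 UTC
times  <- as.POSIXct("2024-01-02 09:00:00", tz = "UTC") + 60 * (0:119)
prices <- xts(100 + cumsum(rep(c(0.1, -0.05), 60)), order.by = times)

# aggregate the one-minute series into hourly OHLC bars
bars_1h <- to.period(prices, period = "hours", name = "px")

# endpoints() gives the row index of the last observation in each hour,
# which period.apply() uses to apply a function per hour
ep <- endpoints(prices, on = "hours")
hourly_mean <- period.apply(prices, INDEX = ep, FUN = mean)
```

The same endpoints can be reused with any summary function, which is handy when several statistics are needed per period.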
If you are going to build prediction models (many linear regressions, or more complex models) with many categorical variables (e.g. sector of security, sentiment categories) that you do not want to map to numbers, it may be better to work with data.table. Many prediction models in R (and unsupervised methods like clustering) require data to be in data.frame format, in which case storing/saving/loading your data in data.table/data.frame format might make more sense if your end goal is prediction modelling. VAR models in the vars/urca R packages also use data.frame format. That said, many prediction models (via caret etc.) require data to be in numeric matrix format, which you can easily extract from xts objects using coredata(xtsobj) (converting data.frame data to matrix format is typically straightforward too, though).
If your data sets are really big (each security holds n GBs of price data in memory for large n), and you want to do repeated aggregations by groups (e.g. computing the mean/sd of returns by month and symbol, or by month and sector), you'll probably find data.table more natural to work with and more efficient: it is designed to handle large amounts of data in memory/RAM and will tend to do less copying than xts operations.
How to address data in a hierarchical data structure in R?
[[ can only return a single element. I thought [[ would have thrown an error because of that, not the error you are seeing, but reading ?"[" tells what R does with a call such as yours and explains the behaviour (from ?"["):
Recursive (list-like) objects:
....‘[[’ can be applied recursively to lists, so that if the single
index ‘i’ is a vector of length ‘p’, ‘alist[[i]]’ is equivalent to
‘alist[[i1]]...[[ip]]’ providing all but the final indexing
results in a list.
The reason for your error is this:
> study$results[[c(1,2)]]
[1] -12 -1 3 10 23
which indicates that R really did this
> study$results[[1]][[2]]
[1] -12 -1 3 10 23
i.e. return the second component (column) of the first data frame, which is an atomic vector because R drops the empty dimension. $ cannot be used on atomic vectors, hence the error.
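The recursive behaviour of [[ can be reproduced on a small stand-in object. The list res below is hypothetical (the original study object is not shown in full), but it has the same shape: a list of data frames.

```r
# hypothetical nested structure: a list holding two data frames
res <- list(
  data.frame(minTemp = c(-5, 0), maxTemp = c(10, 23)),
  data.frame(minTemp = c(-3, 2), maxTemp = c(12, 21))
)

# a vector index inside [[ ]] is applied one level at a time
identical(res[[c(1, 2)]], res[[1]][[2]])

# the result is a plain numeric vector, so $ fails on it
col <- res[[c(1, 2)]]
msg <- tryCatch(col$maxTemp, error = function(e) conditionMessage(e))
msg   # the error message raised by $ on an atomic vector

# to keep working with a data frame, index one level and then use $
max(res[[1]]$maxTemp)
```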
If you want to iterate over the list that is study$results, lapply() or sapply() are your friends:
> lapply(study$results, function(y) max(y[, "maxTemp"], na.rm = TRUE))
[[1]]
[1] 23
[[2]]
[1] 21
> sapply(study$results, function(y) max(y[, "maxTemp"], na.rm = TRUE))
[1] 23 21
If you popped names on the components in $results you'd get them in the output too:
> names(study$results) <- study$region
> lapply(study$results, function(y) max(y[, "maxTemp"], na.rm = TRUE))
$Hamburg
[1] 23
$Bremen
[1] 21
> sapply(study$results, function(y) max(y[, "maxTemp"], na.rm = TRUE))
Hamburg Bremen
23 21
which is easier to use, and then you don't need the $region component if you wish.
How to get this dcast'able long table in R?
The OP has edited his question and is supplying the data as a data.frame:
dat.df <- structure(list(ave_max = c(15L, 6L), ave = c(6L, NA), lepo = c(4L, NA)),
.Names = c("ave_max", "ave", "lepo"), class = "data.frame",
row.names = c(NA, -2L))
dat.df
# ave_max ave lepo
#1 15 6 4
#2 6 NA NA
class(dat.df)
#[1] "data.frame"
He is now asking to transform this data.frame into a matrix which is similar to the one used as input data in this answer.
This can be achieved by using data.table:
library(data.table) # CRAN version 1.10.4 used
# transpose the input data frame, use rowid() to create columns,
# remove a character column to ensure matrix will be of type integer,
# finally, coerce to matrix
dat.m2 <- as.matrix(
data.table::dcast(
data.table::melt(setDT(dat.df), measure.vars = names(dat.df)),
variable ~ rowid(variable)
)[, variable := NULL]
)
# add row names, remove column names
dimnames(dat.m2) <- list(names(dat.df), NULL)
dat.m2
# [,1] [,2]
#ave_max 15 6
#ave 6 NA
#lepo 4 NA
str(dat.m2)
# int [1:3, 1:2] 15 6 4 6 NA NA
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:3] "ave_max" "ave" "lepo"
# ..$ : NULL
class(dat.m2)
#[1] "matrix"    (R >= 4.0.0 prints "matrix" "array")
Edit: I've amended the above code to use the double colon operator to explicitly state the namespace from which melt() and dcast() should be taken. Normally, this wouldn't be necessary as data.table is already loaded. However, the OP is reporting issues which might be caused by package reshape2 being loaded after data.table. The data.table package has its own faster implementations of reshape2::dcast() and reshape2::melt(). When both packages have been loaded, name clashes might occur.
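For a quick cross-check that avoids the melt()/dcast() name clash entirely, the same matrix can be produced in base R by transposing the data frame. This is an alternative sketch, not the method used in the answer above; dat.df is rebuilt here so the block runs on its own.

```r
# same input data as in the question
dat.df <- data.frame(ave_max = c(15L, 6L), ave = c(6L, NA), lepo = c(4L, NA))

# as.matrix() keeps the integer type; t() turns columns into rows
m <- t(as.matrix(dat.df))
colnames(m) <- NULL   # make sure there are row names only, as in dat.m2

m
str(m)
```

This only works so directly because every column has the same type; with mixed column types, as.matrix() would coerce everything to character.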
What data structure does the model formula operator in R create?
Don't know if this helps, but: it's a language object (i.e. R parses the input but doesn't try to evaluate any of the components) with class "formula":
> f <- a ~ b + (c + d)
> str(f)
Class 'formula' language a ~ b + (c + d)
..- attr(*, ".Environment")=<environment: R_GlobalEnv>
If you want to work with these objects, you need to know that it is essentially stored as a tree, where the parent node, an operator or function (one of ~, +, or the parenthesis function), can be extracted as the first element and the child nodes (as many as the 'arity' of the function/operator) can be extracted as elements 2..n, i.e.
- f[[1]] is ~
- f[[2]] is a (the first argument, i.e. the LHS of the formula)
- f[[3]] is b + (c+d)
- f[[3]][[1]] is +
- f[[3]][[2]] is b
... and so on.
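The tree structure described above can be explored with a small recursive walker. The helper show_tree is a hypothetical name introduced for this sketch; it prints each operator on its own line with its arguments indented beneath it.

```r
f <- a ~ b + (c + d)

length(f)        # the operator plus its two arguments
class(f[[1]])    # the ~ operator is stored as a name (symbol)
f[[3]][[3]]      # the parenthesised part, itself a call to `(`

# recursively print the tree: calls show their operator, leaves show the symbol
show_tree <- function(x, depth = 0) {
  pad <- strrep("  ", depth)
  if (is.call(x)) {
    cat(pad, deparse(x[[1]]), "\n", sep = "")
    for (child in as.list(x)[-1]) show_tree(child, depth + 1)
  } else {
    cat(pad, deparse(x), "\n", sep = "")
  }
}
show_tree(f)

all.vars(f)      # the variable names used anywhere in the formula
```

For everyday work, all.vars() and terms() are usually enough; walking the tree by hand is mostly useful when rewriting formulas programmatically.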
The chapter on Expressions in Hadley Wickham's Advanced R gives a more complete description.
This is also discussed (more opaquely) in the R Language Manual, e.g.
- Expression objects
- Direct manipulation of language objects
@user2554330 points out that formulas also typically have associated environments; that is, they carry along a pointer to the environment in which they were created.