How to loop through list and create separate dataframes in R
Your existing code creates an object called migr and assigns it a string with the name of the data.frame you want to create. Then you overwrite the migr object with the data.frame that you pull from Census. On each iteration of the loop you overwrite migr again, which is why only the data from the last iteration is saved, and then only as a data.frame named migr.
Instead, you need to use the assign command to bind the data you pull from Census to the name stored in migr, as follows:
library(censusapi)
library(stringr)  # provides str_glue()
states <- c("01","02")
for(i in seq_along(states)) {
region = str_glue("state:{states[i]}")
migr = str_glue("migr2010_{states[i]}")
assign(
x = migr,
value = getCensus(name = "acs/flows", vintage = 2010,
key = "*myAPIkey*",
vars = c("MOVEDNET", "MOVEDIN", "MOVEDOUT", "AGE"),
region = "county:*", regionin = region)
)
}
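The same assign pattern can be tried without a Census API key. The sketch below is only an illustration: the toy data frame stands in for the getCensus() call, and the names obj_name and toy_data are made up for the example.

```r
# Toy stand-in for the Census call: build one small data frame per state code.
states <- c("01", "02")

for (s in states) {
  obj_name <- paste0("migr2010_", s)              # e.g. "migr2010_01"
  toy_data <- data.frame(state = s, value = 1:3)  # placeholder for getCensus()
  assign(obj_name, toy_data)                      # creates a variable with that name
}

# Each object now exists under its own name and can be fetched with get().
exists("migr2010_01")
nrow(get("migr2010_02"))
```

get() is the counterpart of assign(): it looks an object up by the string that names it.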
Edit
As others have mentioned, it may be easier to work with a list of data.frames, rather than creating several in the global environment. The easiest way to create that is using lapply, as follows:
migr2010 <- lapply(
paste0("state:", c("01", "02")), # replaces region in the original
getCensus,
name = "acs/flows",
vintage = 2010,
key = "*myAPIkey*",
vars = c("MOVEDNET", "MOVEDIN", "MOVEDOUT", "AGE"),
region = "county:*"
)
Then, if you want to create a single data.frame out of those, you could use dplyr::bind_rows(migr2010), data.table::rbindlist(migr2010), or do.call(rbind, migr2010) (although do.call is much slower than the other two).
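When the list is collapsed into one data.frame, it helps to record which list element each row came from. The following is a minimal base-R sketch using made-up data in place of the Census results; the names migr_list, tagged and the source column are assumptions for illustration.

```r
# Toy list standing in for the lapply() result; names identify each state.
states <- c("01", "02")
migr_list <- lapply(states, function(s) data.frame(county = 1:2, moved = c(10, 20)))
names(migr_list) <- paste0("migr2010_", states)

# Tag each data frame with its list name, then stack them with base R.
tagged <- Map(function(d, nm) cbind(source = nm, d), migr_list, names(migr_list))
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL

combined
```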
Loop through a list of dataframes to create dataframes in R
You should definitely give your demo data frame an "ID" column as well. Then you do not have to hope that the demographics are correctly assigned to the observations, especially if the script is still changing during the work process. That is easily done using transform (I simply use the consecutive IDs 1:3 here in the example).
res <- lapply(list(df1, df2, df3, df4), merge, transform(demo, ID=1:3))
res
# [[1]]
# ID b c df sex age vital_sts
# 1 1 x gh z m 30 a
# 2 2 y fg x m 50 a
# 3 3 z xv y f 62 d
#
# [[2]]
# ID v hg fd sex age vital_sts
# 1 1 a yty z m 30 a
# 2 2 mm zc x m 50 a
# 3 3 xc cx y f 62 d
#
# [[3]]
# ID t j sd sex age vital_sts
# 1 1 ae ewr z m 30 a
# 2 2 yw zd x m 50 a
# 3 3 zs x y f 62 d
#
# [[4]]
# ID u k f sex age vital_sts
# 1 1 df df z m 30 a
# 2 2 y zs x m 50 a
# 3 3 z xf y f 62 d
If you have gazillions of data frames in your workspace, as it looks like, you may list them by pattern using mget(ls(pattern=)). (Or better yet, change your code to get them in a list in the first place.)
lapply(mget(ls(pattern='^df\\d+')), merge, transform(demo, ID=1:3))
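To see what mget(ls(pattern=)) returns on its own, here is a small self-contained sketch. The three data frames are invented for the example and only mimic the df1, df2, ... naming used above.

```r
# Three toy data frames whose names follow the pattern df<number>.
df1 <- data.frame(ID = 1:3, b = c("x", "y", "z"))
df2 <- data.frame(ID = 1:3, v = c("a", "mm", "xc"))
df3 <- data.frame(ID = 1:3, t = c("ae", "yw", "zs"))

# ls() returns the matching names; mget() fetches the objects as a named list.
df_names <- ls(pattern = "^df\\d+$")
df_list <- mget(df_names)

names(df_list)
```

Because the list is named after the original objects, the results of a later lapply() stay identifiable.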
Edit
If I understand you correctly, according to your comment you have a large data frame DAT from which you want to assemble smaller data frames of variable groups and merge demo to them. In this case I would put the variable names of these groups in a named list vgroups. Next, lapply over it to simultaneously subset DAT (with "ID" concatenated to each group's variable names) and merge the subset to demo.
demo should still have an "ID", because you don't want to trust that all rows are sorted in the same order; consider for example sort(c(3, 10, 1, 100)) vs. sort(as.character(c(3, 10, 1, 100))), or rows omitted for whatever reason.
demo <- transform(demo, ID=1:3) ## identify demo observations
vgroups <- list(g1=c("b", "c", "df"), g2=c("v", "hg", "fd"), g3=c("t", "j", "sd"),
g4=c("u", "k", "f"))
res1 <- lapply(vgroups, \(x) merge(demo, DAT[, c('ID', x)], by="ID"))
## specifying by="ID" explicitly is even safer --^
res1
# $g1
# ID sex age vital_sts b c df
# 1 1 m 30 a x gh z
# 2 2 m 50 a y fg x
# 3 3 f 62 d z xv y
#
# $g2
# ID sex age vital_sts v hg fd
# 1 1 m 30 a a yty z
# 2 2 m 50 a mm zc x
# 3 3 f 62 d xc cx y
#
# $g3
# ID sex age vital_sts t j sd
# 1 1 m 30 a ae ewr z
# 2 2 m 50 a yw zd x
# 3 3 f 62 d zs x y
#
# $g4
# ID sex age vital_sts u k f
# 1 1 m 30 a df df z
# 2 2 m 50 a y zs x
# 3 3 f 62 d z xf y
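The sorting pitfall mentioned above, and why an explicit key protects against it, can be checked with a few lines. This is a sketch with made-up values; left and right are hypothetical stand-ins for demo and a subset of DAT.

```r
ids <- c(3, 10, 1, 100)

sort(ids)                # numeric order: 1 3 10 100
sort(as.character(ids))  # character order: "1" "10" "100" "3"

# With an explicit key, merge() lines rows up even when the order differs.
left  <- data.frame(ID = c(3, 1, 2), sex = c("f", "m", "m"))
right <- data.frame(ID = c(1, 2, 3), b = c("x", "y", "z"))
merged <- merge(left, right, by = "ID")
merged
```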
Access individual data frames:
res1$g1
# ID sex age vital_sts b c df
# 1 1 m 30 a x gh z
# 2 2 m 50 a y fg x
# 3 3 f 62 d z xv y
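If you still want the individual data frames in your environment, use list2env with the target environment given explicitly (without envir, list2env creates a new environment instead of filling your workspace):
list2env(res1, envir=.GlobalEnv)
ls()
# [1] "DAT" "demo" "g1" "g2" "g3" "g4" "res1" "vgroups"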
Data:
DAT <- structure(list(ID = 1:3, b = c("x", "y", "z"), c = c("gh", "fg",
"xv"), df = c("z", "x", "y"), f = c("z", "x", "y"), fd = c("z",
"x", "y"), hg = c("yty", "zc", "cx"), j = c("ewr", "zd", "x"),
k = c("df", "zs", "xf"), sd = c("z", "x", "y"), t = c("ae",
"yw", "zs"), u = c("df", "y", "z"), v = c("a", "mm", "xc"
), x1 = c("gs", "gs", "gs"), x2 = c("cs", "cs", "cs"), x3 = c("tv",
"tv", "tv"), x4 = c("fb", "fb", "fb")), row.names = c(NA,
-3L), class = "data.frame")
demo <- data.frame(sex = c('m', 'm', 'f'), age = c('30', '50', '62'), vital_sts = c('a', 'a', 'd'))
Using a loop to create multiple data frames in R
You can save your data.frames into a list by setting up the function as follows:
library(jsonlite)  # assumed source of fromJSON(); the data-frame indexing below suggests it
getstats <- function(games) {
  listofdfs <- vector("list", length(games))  # list that will hold one data frame per game
  for (i in seq_along(games)) {  # loop over positions instead of the IDs themselves
    # Use games[i] instead of i to get the ID. paste0() builds the URL from
    # pieces, so no line breaks end up inside the query string.
    url <- paste0("http://stats.nba.com/stats/boxscoretraditionalv2?EndPeriod=10",
                  "&EndRange=14400&GameID=", games[i],
                  "&RangeType=2&Season=2015-16&SeasonType=Regular+Season",
                  "&StartPeriod=1&StartRange=0000")
    json_data <- fromJSON(paste(readLines(url), collapse = ""))
    df <- data.frame(json_data$resultSets[1, "rowSet"])
    names(df) <- unlist(json_data$resultSets[1, "headers"])
    listofdfs[[i]] <- df  # save the data frame into the list
  }
  return(listofdfs)  # return the list of data frames
}
# sprintf() keeps the leading zeros that as.character(0021500580:0021500593) would drop
gameids <- sprintf("%010d", 21500580:21500593)
getstats(games = gameids)
Please note that I could not test this because the URLs do not seem to be working properly. I get the connection error below:
Error in file(con, "r") : cannot open the connection
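Since a single failed connection aborts the whole loop, one option is to wrap each request in tryCatch so that the remaining IDs are still processed. The sketch below is an assumption-laden illustration: fetch_one is a hypothetical stand-in for the download step (it fails on purpose for one ID) and does not contact stats.nba.com.

```r
# Hypothetical stand-in for the download step; it fails for one ID on purpose.
fetch_one <- function(id) {
  if (id == "bad") stop("cannot open the connection")
  data.frame(GAME_ID = id, PTS = 100)
}

ids <- c("0021500580", "bad", "0021500581")

# tryCatch() turns a failed request into NULL instead of aborting the loop.
results <- lapply(ids, function(id) {
  tryCatch(fetch_one(id), error = function(e) {
    message("Skipping ", id, ": ", conditionMessage(e))
    NULL
  })
})
names(results) <- ids

# Drop the failures and keep only the successful data frames.
ok <- Filter(Negate(is.null), results)
names(ok)
```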
How to loop through a list and create a data frame
This worked for me with a bunch of CSV files with mock data.
Team <- list.files("c:\\Test\\Teams\\", full.names=TRUE)
Team_Split <- data.frame()
print(Team)
for (Team_File in Team) {
  xl <- read.csv(Team_File) # reads the csv at the current file path
  y <- ncol(xl) # number of columns in this file
  # Standardise the number of columns so the files can be bound together.
  # Add_5_Col(), Add_4_Col() and Add_3_Col() are helper functions defined elsewhere (not shown here).
  if (y == 37) {
    xl <- Add_5_Col(xl)
  } else if (y == 38) {
    xl <- Add_4_Col(xl)
  } else if (y == 39) {
    xl <- Add_3_Col(xl)
  }
  # Sets Team_Split to xl if it's the first set of data
  # or binds Team_Split and xl
  print(xl)
  if (nrow(Team_Split) == 0) {
    Team_Split <- xl
  } else {
    Team_Split <- rbind(Team_Split, xl)
  }
}
print(Team_Split)
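When all files already share the same columns, an alternative is to read them into a list and bind once at the end, which avoids growing the data frame inside the loop. This is a minimal sketch under that assumption; it writes two mock CSV files to a temporary directory, and the file and column names are invented for the example.

```r
# Write two small mock CSV files to a temporary directory.
tmp_dir <- file.path(tempdir(), "teams_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(team = "A", score = 1:2), file.path(tmp_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(team = "B", score = 3:4), file.path(tmp_dir, "b.csv"), row.names = FALSE)

# Read every file into a list, then bind all rows in a single call.
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, read.csv)
all_teams <- do.call(rbind, tables)

nrow(all_teams)
```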
Looping through list of data frames in R
> x <- 1:5
> y <- 1:5
> df1 <- data.frame("Row One"=x, "Row Two"=y)
> df2 <- data.frame("Row Two"=y,"Row One"=x)
> dfList <- list(df1,df2)
> lapply(dfList, function(x) {
names(x)[ grep("One", names(x))] <- "R1"
names(x)[ grep("Two", names(x))] <- "R2"
x} )
[[1]]
R1 R2
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
[[2]]
R2 R1
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
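Note that lapply returns a new list and leaves the original data frames untouched, so the result has to be assigned to keep the renamed columns. A short sketch of that point, assuming x and y are both 1:5 as the output above suggests; the name renamed is made up for the example.

```r
x <- 1:5
y <- 1:5
dfList <- list(data.frame("Row One" = x, "Row Two" = y),
               data.frame("Row Two" = y, "Row One" = x))

# lapply() works on copies, so the originals are unchanged until reassigned.
renamed <- lapply(dfList, function(d) {
  names(d)[grep("One", names(d))] <- "R1"
  names(d)[grep("Two", names(d))] <- "R2"
  d
})

names(dfList[[1]])   # still the original names
names(renamed[[2]])  # "R2" "R1"
```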
How do I create multiple dataframes from a result in a for loop in R?
Don't do it in a loop! It is done completely differently. I'll show you step by step.
My first step is to prepare a function that will generate data similar to yours.
library(tidyverse)
dens = function(year, n) tibble(
PLOT = paste("HI", sample(1:(n/7), n, replace = T)),
SIZE = runif(n, 0.1, 3),
DENSITY = sample(seq(50,200, by=50), n, replace = T),
SEEDYR = year-1,
SAMPYR = year,
AGE = sample(1:5, n, replace = T),
SHOOTS = runif(n, 0.1, 3)
)
Let's see how it works and generate some sample data frames
set.seed(123)
density.2007 = dens(2007, 120)
density.2008 = dens(2008, 88)
density.2009 = dens(2009, 135)
density.2010 = dens(2010, 156)
The density.2007 data frame looks like this
# A tibble: 120 x 7
PLOT SIZE DENSITY SEEDYR SAMPYR AGE SHOOTS
<chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 HI 15 1.67 200 2006 2007 4 1.80
2 HI 14 0.270 150 2006 2007 2 2.44
3 HI 3 0.856 50 2006 2007 3 0.686
4 HI 10 1.25 200 2006 2007 5 1.43
5 HI 11 0.673 50 2006 2007 5 1.40
6 HI 5 2.51 150 2006 2007 3 2.23
7 HI 14 0.543 150 2006 2007 2 2.17
8 HI 5 2.43 200 2006 2007 5 2.51
9 HI 9 1.69 100 2006 2007 4 2.67
10 HI 3 2.02 50 2006 2007 2 2.86
# ... with 110 more rows
Now they need to be combined into one frame
df = density.2007 %>%
bind_rows(density.2008) %>%
bind_rows(density.2009) %>%
bind_rows(density.2010)
output
# A tibble: 499 x 7
PLOT SIZE DENSITY SEEDYR SAMPYR AGE SHOOTS
<chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 HI 15 1.67 200 2006 2007 4 1.80
2 HI 14 0.270 150 2006 2007 2 2.44
3 HI 3 0.856 50 2006 2007 3 0.686
4 HI 10 1.25 200 2006 2007 5 1.43
5 HI 11 0.673 50 2006 2007 5 1.40
6 HI 5 2.51 150 2006 2007 3 2.23
7 HI 14 0.543 150 2006 2007 2 2.17
8 HI 5 2.43 200 2006 2007 5 2.51
9 HI 9 1.69 100 2006 2007 4 2.67
10 HI 3 2.02 50 2006 2007 2 2.86
# ... with 489 more rows
In the next step, count how many times each value of the PLOT variable occurs
PLOT.count = df %>%
group_by(PLOT) %>%
summarise(PLOT.n = n()) %>%
arrange(PLOT.n)
output
# A tibble: 22 x 2
PLOT PLOT.n
<chr> <int>
1 HI 20 3
2 HI 22 5
3 HI 21 7
4 HI 18 12
5 HI 2 19
6 HI 1 20
7 HI 15 20
8 HI 17 21
9 HI 6 22
10 HI 11 23
# ... with 12 more rows
In the penultimate step, let's append these counters to the original data frame
df = df %>% left_join(PLOT.count, by="PLOT")
output
# A tibble: 499 x 8
PLOT SIZE DENSITY SEEDYR SAMPYR AGE SHOOTS PLOT.n
<chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int>
1 HI 15 1.67 200 2006 2007 4 1.80 20
2 HI 14 0.270 150 2006 2007 2 2.44 32
3 HI 3 0.856 50 2006 2007 3 0.686 27
4 HI 10 1.25 200 2006 2007 5 1.43 25
5 HI 11 0.673 50 2006 2007 5 1.40 23
6 HI 5 2.51 150 2006 2007 3 2.23 38
7 HI 14 0.543 150 2006 2007 2 2.17 32
8 HI 5 2.43 200 2006 2007 5 2.51 38
9 HI 9 1.69 100 2006 2007 4 2.67 26
10 HI 3 2.02 50 2006 2007 2 2.86 27
# ... with 489 more rows
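For readers without the tidyverse, the count-then-join steps above can be approximated in base R with ave, which attaches the group size to every row in one call. This is a sketch on made-up data, not the tibble generated earlier; the toy values are assumptions for illustration.

```r
# Small illustrative data; the PLOT values are made up for the example.
toy <- data.frame(PLOT = c("HI 1", "HI 2", "HI 1", "HI 3", "HI 1", "HI 2"),
                  SIZE = c(1.2, 0.5, 2.1, 0.9, 1.7, 2.4))

# ave() applies length() within each PLOT group and returns one value per row.
toy$PLOT.n <- ave(seq_along(toy$PLOT), toy$PLOT, FUN = length)

# Keep only the rows that belong to the most frequent PLOT.
toy[toy$PLOT.n == max(toy$PLOT.n), ]
```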
Now filter it at will
df %>% filter(PLOT.n > 30)
output
# A tibble: 139 x 8
PLOT SIZE DENSITY SEEDYR SAMPYR AGE SHOOTS PLOT.n
<chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int>
1 HI 14 0.270 150 2006 2007 2 2.44 32
2 HI 5 2.51 150 2006 2007 3 2.23 38
3 HI 14 0.543 150 2006 2007 2 2.17 32
4 HI 5 2.43 200 2006 2007 5 2.51 38
5 HI 8 0.598 50 2006 2007 1 1.70 34
6 HI 7 1.94 50 2006 2007 4 1.61 35
7 HI 14 2.91 50 2006 2007 4 0.215 32
8 HI 7 0.846 150 2006 2007 4 0.506 35
9 HI 7 2.38 150 2006 2007 3 1.34 35
10 HI 7 2.62 100 2006 2007 3 0.167 35
# ... with 129 more rows
Or this way
df %>% filter(PLOT.n == min(PLOT.n))
df %>% filter(PLOT.n == median(PLOT.n))
df %>% filter(PLOT.n == max(PLOT.n))
output
# A tibble: 3 x 8
PLOT SIZE DENSITY SEEDYR SAMPYR AGE SHOOTS PLOT.n
<chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int>
1 HI 20 0.392 200 2009 2010 1 0.512 3
2 HI 20 0.859 150 2009 2010 5 2.62 3
3 HI 20 0.882 200 2009 2010 5 1.06 3
> df %>% filter(PLOT.n == median(PLOT.n))
# A tibble: 26 x 8
PLOT SIZE DENSITY SEEDYR SAMPYR AGE SHOOTS PLOT.n
<chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int>
1 HI 9 1.69 100 2006 2007 4 2.67 26
2 HI 9 2.20 50 2006 2007 4 1.49 26
3 HI 9 0.587 200 2006 2007 3 1.13 26
4 HI 9 1.27 50 2006 2007 1 2.55 26
5 HI 9 1.56 150 2006 2007 3 2.01 26
6 HI 9 0.198 100 2006 2007 3 2.08 26
7 HI 9 2.72 150 2007 2008 3 0.421 26
8 HI 9 0.251 200 2007 2008 2 0.328 26
9 HI 9 1.83 50 2007 2008 1 0.192 26
10 HI 9 1.97 100 2007 2008 1 0.900 26
# ... with 16 more rows
> df %>% filter(PLOT.n == max(PLOT.n))
# A tibble: 38 x 8
PLOT SIZE DENSITY SEEDYR SAMPYR AGE SHOOTS PLOT.n
<chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int>
1 HI 5 2.51 150 2006 2007 3 2.23 38
2 HI 5 2.43 200 2006 2007 5 2.51 38
3 HI 5 2.06 100 2006 2007 5 1.93 38
4 HI 5 1.25 150 2006 2007 4 2.29 38
5 HI 5 2.29 200 2006 2007 1 2.97 38
6 HI 5 0.789 150 2006 2007 2 1.59 38
7 HI 5 1.11 100 2007 2008 4 2.61 38
8 HI 5 2.38 150 2007 2008 4 2.95 38
9 HI 5 2.67 200 2007 2008 3 1.77 38
10 HI 5 2.63 100 2007 2008 1 1.90 38
# ... with 28 more rows
R for loop: creating data frames using split?
It is not recommended to create separate dataframes in the global environment, because they are difficult to keep track of. Put them in a list instead. You have started off well by using split and creating a list of dataframes. You can then iterate over each dataframe in the list and apply the function to each one of them.
Using by, this would look like:
by(tss, tss$created_at, function(x) {
bscore3 <- score.sentiment(x$cleaned_text,pos.words,neg.words,.progress='text')
score3 <- as.integer(bscore3$score[[1]])
return(score3)
}) -> result
result
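The split-then-iterate route described above can be sketched as follows. Everything here is illustrative: tss_demo is made-up data, and score_text is a hypothetical stand-in for score.sentiment, which is not reproduced here.

```r
# Made-up tweets; score_text() is a hypothetical stand-in for score.sentiment().
tss_demo <- data.frame(
  created_at   = c("2020-01-01", "2020-01-01", "2020-01-02"),
  cleaned_text = c("good day", "bad news", "good good")
)
score_text <- function(txt) sum(grepl("good", txt)) - sum(grepl("bad", txt))

# split() makes one data frame per date; sapply() scores each one.
by_day <- split(tss_demo, tss_demo$created_at)
scores <- sapply(by_day, function(d) score_text(d$cleaned_text))
scores
```

The result is a named vector with one score per date, so nothing needs to be created in the global environment.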
Usage of 'for loop' in R to split a dataframe into several dataframes
An easy way to do this is to create a factor vector by appending the string sys to the id numbers, and using it to split the data. There is no need to use a for() loop to produce the desired output, since the result of split() is a list of data frames when the input to be split is a data frame.
The value of the factor is used to name each element in the list generated by split(). In the case of the OP, since sysid is numeric and starts with 1, it's not obvious that the id numbers are being used to name the resulting data frames in the list, as explained in the help for split().
Using the data from the OP we'll illustrate how to use the sysid column to create a factor variable that combines the string sys with the id values, and split the data into a list of data frames that can be accessed by name.
rawData <- "Date sysid power temperature
1.1.2018 1 1000 14
2.1.2018 1 1200 16
3.1.2018 1 800 18
1.1.2018 2 1500 8
2.1.2018 2 800 18
3.1.2018 2 1300 11"
data <- read.table(text = rawData,header=TRUE)
sysidName <- paste0("sys",data$sysid)
splitData <- split(data,sysidName)
splitData
...and the output:
> splitData
$`sys1`
Date sysid power temperature
1 1.1.2018 1 1000 14
2 2.1.2018 1 1200 16
3 3.1.2018 1 800 18
$sys2
Date sysid power temperature
4 1.1.2018 2 1500 8
5 2.1.2018 2 800 18
6 3.1.2018 2 1300 11
>
At this point one can access individual data frames in the list by using the $ form of the extract operator:
> splitData$sys1
Date sysid power temperature
1 1.1.2018 1 1000 14
2 2.1.2018 1 1200 16
3 3.1.2018 1 800 18
>
Also, by using the names() function one can obtain a vector of all the named elements in the list of data frames.
> names(splitData)
[1] "sys1" "sys2"
>
Reiterating the main point from the top of the answer, when split() is used with a data frame, the resulting list is a list of objects of class data.frame. For example:
> str(splitData["sys1"])
List of 1
$ sys1:'data.frame': 3 obs. of 4 variables:
..$ Date : Factor w/ 3 levels "1.1.2018","2.1.2018",..: 1 2 3
..$ sysid : int [1:3] 1 1 1
..$ power : int [1:3] 1000 1200 800
..$ temperature: int [1:3] 14 16 18
>
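Because the result is an ordinary named list, the same operation can be applied to every data frame with sapply or lapply. A small sketch using the OP's sample data; the summary chosen (mean power per system) and the name mean_power are just illustrations.

```r
data <- read.table(text = "Date sysid power temperature
1.1.2018 1 1000 14
2.1.2018 1 1200 16
3.1.2018 1 800 18
1.1.2018 2 1500 8
2.1.2018 2 800 18
3.1.2018 2 1300 11", header = TRUE)

splitData <- split(data, paste0("sys", data$sysid))

# Apply the same summary to every data frame in the list.
mean_power <- sapply(splitData, function(d) mean(d$power))
mean_power
```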
If you must use a for() loop...
Since the OP asked whether the problem could be solved with a for() loop, the answer is "yes."
# create a vector containing unique values of sysid
ids <- unique(data$sysid)
# initialize output data frame list
dfList <- list()
# loop thru unique values and generate named data frames in list()
for(i in ids){
dfname <- paste0("sys",i)
dfList[[dfname]] <- data[data$sysid == i,]
}
dfList
...and the output:
> for(i in ids){
+ dfname <- paste0("sys",i)
+ dfList[[dfname]] <- data[data$sysid == i,]
+ }
> dfList
$`sys1`
Date sysid power temperature
1 1.1.2018 1 1000 14
2 2.1.2018 1 1200 16
3 3.1.2018 1 800 18
$sys2
Date sysid power temperature
4 1.1.2018 2 1500 8
5 2.1.2018 2 800 18
6 3.1.2018 2 1300 11
Choosing the "best" answer
Between split(), for() and the other answer using by(), how do we choose the best answer?
One way is to determine which version runs fastest, given that the real data will be much larger than the sample data from the original post.
We can use the microbenchmark package to compare the performance of the three different approaches.
split() performance
library(microbenchmark)
> microbenchmark(splitData <- split(data,sysidName),unit="us")
Unit: microseconds
expr min lq mean median uq max neval
splitData <- split(data, sysidName) 144.594 147.359 185.7987 150.1245 170.4705 615.507 100
>
for() performance
> microbenchmark(for(i in ids){
+ dfname <- paste0("sys",i)
+ dfList[[dfname]] <- data[data$sysid == i,]
+ },unit="us")
Unit: microseconds
expr min lq mean
for (i in ids) { dfname <- paste0("sys", i) dfList[[dfname]] <- data[data$sysid == i, ] } 2643.755 2857.286 3457.642
median uq max neval
3099.064 3479.311 8511.609 100
>
by() performance
> microbenchmark(df_list <- by(df, df$sysid, function(unique) unique),unit="us")
Unit: microseconds
expr min lq mean median uq max neval
df_list <- by(df, df$sysid, function(unique) unique) 256.791 260.5445 304.9296 275.9515 309.5325 1218.372 100
>
...and the winner is: split(), with an average runtime of 186 microseconds, versus 305 microseconds for by() and a whopping 3,458 microseconds for the for() loop approach.