How to Loop Through List and Create Separate Dataframes in R

how to loop through list and create separate dataframes in R

Your existing code creates an object called migr, and assigns it a string with the name of the data.frame you want to create. Then you overwrite the the migr object with the data.frame that you pull from Census. Each iteration of the loop, you overwrite migr, which is why only the data from the last iteration of the loop is saved, and then only as a data.frame named migr.

Instead, you need to use the assign command to assign the data you pull from Census to the value stored in migr, as follows:

library(censusapi)

states <- c("01","02")
for(i in 1:length(states)) {
region = str_glue("state:{states[i]}")
migr = str_glue("migr2010_{states[i]}")
assign(
x = migr,
value = getCensus(name = "acs/flows", vintage = 2010,
key = "*myAPIkey*",
vars = c("MOVEDNET", "MOVEDIN", "MOVEDOUT", "AGE"),
region = "county:*", regionin = region)
)
}

Edit

As others have mentioned, it may be easier to work with a list of data.frames, rather than creating several in the global environment. The easiest way to create that is using lapply, as follows:

 migr2010 <- lapply(
paste0("state:", c("01", "02")), # replaces region in the original
getCensus,
name = "acs/flows",
vintage = 2010,
key = "*myAPIkey*",
vars = c("MOVEDNET", "MOVEDIN", "MOVEDOUT", "AGE"),
region = "county:*"
)

Then, if you want to create a single data.frame out of those, you could use dplyr::bind_rows(migr2010), data.table::rbindlist(migr2010), or do.call(rbind, migr2010) (although do.call is much slower than the other two).

Loop through a list of dataframes to create dataframes in R

You should give your demo data frame definitely an "ID" column as well! Then you do not have to hope that the demographics are correctly assigned to the observations, especially if the script is still changing during the work process. That may easily be done using transform (I simply use the consecutive ID's 1:3 here in the example).

res <- lapply(list(df1, df2, df3, df4), merge, transform(demo, ID=1:3))
res
# [[1]]
# ID b c df sex age vital_sts
# 1 1 x gh z m 30 a
# 2 2 y fg x m 50 a
# 3 3 z xv y f 62 d
#
# [[2]]
# ID v hg fd sex age vital_sts
# 1 1 a yty z m 30 a
# 2 2 mm zc x m 50 a
# 3 3 xc cx y f 62 d
#
# [[3]]
# ID t j sd sex age vital_sts
# 1 1 ae ewr z m 30 a
# 2 2 yw zd x m 50 a
# 3 3 zs x y f 62 d
#
# [[4]]
# ID u k f sex age vital_sts
# 1 1 df df z m 30 a
# 2 2 y zs x m 50 a
# 3 3 z xf y f 62 d

If you have gazillions of data frames in your workspace, as it looks like, you may list by pattern using mget(ls(pattern=)). (Or better yet, change your code to get them in a list in the first place.)

lapply(mget(ls(pat='^df\\d+')), merge, transform(demo, ID=1:3))

Edit

If I understand you correctly, according to your comment you have a large data frame DAT from which you want to assemble smaller data frames of variable groups and merge the demo to them. In this case I would put the variable names of these groups in a named list vgroups. Next, lapply over it to simultaneously subset dat with "ID" concatenated and merge it to demo.

demo still should have an "ID", because you don't want to trust, all rows are sorted in the same order, just consider for example sort(c(3, 10, 1, 100)) vs. sort(as.character(c(3, 10, 1, 100))) or omitted rows for whatever reason etc.

demo <- transform(demo, ID=1:3)  ## identify demo observations

vgroups <- list(g1=c("b", "c", "df"), g2=c("v", "hg", "fd"), g3=c("t", "j", "sd"),
g4=c("u", "k", "f"))

res1 <- lapply(vgroups, \(x) merge(demo, DAT[, c('ID', x)], by="ID"))
## saying by ID is even more save --^
res1
# $g1
# ID sex age vital_sts b c df
# 1 1 m 30 a x gh z
# 2 2 m 50 a y fg x
# 3 3 f 62 d z xv y
#
# $g2
# ID sex age vital_sts v hg fd
# 1 1 m 30 a a yty z
# 2 2 m 50 a mm zc x
# 3 3 f 62 d xc cx y
#
# $g3
# ID sex age vital_sts t j sd
# 1 1 m 30 a ae ewr z
# 2 2 m 50 a yw zd x
# 3 3 f 62 d zs x y
#
# $g4
# ID sex age vital_sts u k f
# 1 1 m 30 a df df z
# 2 2 m 50 a y zs x
# 3 3 f 62 d z xf y

Access individual data frames:

res1$g1
# ID sex age vital_sts b c df
# 1 1 m 30 a x gh z
# 2 2 m 50 a y fg x
# 3 3 f 62 d z xv y

If you still want the individual data frames in your environment, use list2env:

list2env(res1)
ls()
# [1] "DAT" "demo" "res1" "vgroups"

Data:

DAT <- structure(list(ID = 1:3, b = c("x", "y", "z"), c = c("gh", "fg", 
"xv"), df = c("z", "x", "y"), f = c("z", "x", "y"), fd = c("z",
"x", "y"), hg = c("yty", "zc", "cx"), j = c("ewr", "zd", "x"),
k = c("df", "zs", "xf"), sd = c("z", "x", "y"), t = c("ae",
"yw", "zs"), u = c("df", "y", "z"), v = c("a", "mm", "xc"
), x1 = c("gs", "gs", "gs"), x2 = c("cs", "cs", "cs"), x3 = c("tv",
"tv", "tv"), x4 = c("fb", "fb", "fb")), row.names = c(NA,
-3L), class = "data.frame")

demo <- data.frame(sex = c('m', 'm', 'f'), age = c('30', '50', '62'), vital_sts = c('a', 'a', 'd'))

Using a loop to create multiple data frames in R

You can save your data.frames into a list by setting up the function as follows:

getstats<- function(games){

listofdfs <- list() #Create a list in which you intend to save your df's.

for(i in 1:length(games)){ #Loop through the numbers of ID's instead of the ID's

#You are going to use games[i] instead of i to get the ID
url<- paste("http://stats.nba.com/stats/boxscoretraditionalv2?EndPeriod=10&
EndRange=14400&GameID=",games[i],"&RangeType=2&Season=2015-16&SeasonType=
Regular+Season&StartPeriod=1&StartRange=0000",sep = "")
json_data<- fromJSON(paste(readLines(url), collapse=""))
df<- data.frame(json_data$resultSets[1, "rowSet"])
names(df)<-unlist(json_data$resultSets[1,"headers"])
listofdfs[[i]] <- df # save your dataframes into the list
}

return(listofdfs) #Return the list of dataframes.
}

gameids<- as.character(c(0021500580:0021500593))
getstats(games = gameids)

Please note that I could not test this because the URLs do not seem to be working properly. I get the connection error below:

Error in file(con, "r") : cannot open the connection

How to loop through a list and create a data frame

This worked for me with a bunch of CSV files with mock data.

Team <- list.files("c:\\Test\\Teams\\", full.names=TRUE)

Team_Split <- data.frame()
print(Team)

for (Team_File in Team) {
xl <-
read.csv(Team_File) #Reads the csv from the first file path
y <- ncol(x1) #creates object with number of columns
#If statement to standarise number of columns so can bind
if (y == "37") {
x1 <- Add_5_Col(x1)
} else if (y == "38") {
x1 <- Add_4_Col(x1)
} else if (y == "39") {
x1 <- Add_3_Col(x1)
}
# Sets Team_Split to xl if it's the first set of data
# or binds Team_Split and xl
print(xl)

if (nrow(Team_Split) == 0) {
Team_Split <- xl
} else {
Team_Split <- rbind(Team_Split, xl)
}
}

print(Team_Split)

Looping through list of data frames in R

> df1 <- data.frame("Row One"=x, "Row Two"=y)
> df2 <- data.frame("Row Two"=y,"Row One"=x)
> dfList <- list(df1,df2)
> lapply(dfList, function(x) {
names(x)[ grep("One", names(x))] <- "R1"
names(x)[ grep("Two", names(x))] <- "R2"
x} )
[[1]]
R1 R2
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5

[[2]]
R2 R1
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5

How do I create multiple dataframes from a result in a for loop in R?

Don't do it in a loop !! It is done completely different. I'll show you step by step.
My first step is to prepare a function that will generate data similar to yours.

library(tidyverse)

dens = function(year, n) tibble(
PLOT = paste("HI", sample(1:(n/7), n, replace = T)),
SIZE = runif(n, 0.1, 3),
DENSITY = sample(seq(50,200, by=50), n, replace = T),
SEEDYR = year-1,
SAMPYR = year,
AGE = sample(1:5, n, replace = T),
SHOOTS = runif(n, 0.1, 3)
)

Let's see how it works and generate some sample data frames

set.seed(123)
density.2007 = dens(2007, 120)
density.2008 = dens(2008, 88)
density.2009 = dens(2009, 135)
density.2010 = dens(2010, 156)

The density.2007 data frame looks like this

# A tibble: 120 x 7
PLOT SIZE DENSITY SEEDYR SAMPYR AGE SHOOTS
<chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 HI 15 1.67 200 2006 2007 4 1.80
2 HI 14 0.270 150 2006 2007 2 2.44
3 HI 3 0.856 50 2006 2007 3 0.686
4 HI 10 1.25 200 2006 2007 5 1.43
5 HI 11 0.673 50 2006 2007 5 1.40
6 HI 5 2.51 150 2006 2007 3 2.23
7 HI 14 0.543 150 2006 2007 2 2.17
8 HI 5 2.43 200 2006 2007 5 2.51
9 HI 9 1.69 100 2006 2007 4 2.67
10 HI 3 2.02 50 2006 2007 2 2.86
# ... with 110 more rows

Now they need to be combined into one frame

df = density.2007 %>% 
bind_rows(density.2008) %>%
bind_rows(density.2009) %>%
bind_rows(density.2010)

output

# A tibble: 499 x 7
PLOT SIZE DENSITY SEEDYR SAMPYR AGE SHOOTS
<chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 HI 15 1.67 200 2006 2007 4 1.80
2 HI 14 0.270 150 2006 2007 2 2.44
3 HI 3 0.856 50 2006 2007 3 0.686
4 HI 10 1.25 200 2006 2007 5 1.43
5 HI 11 0.673 50 2006 2007 5 1.40
6 HI 5 2.51 150 2006 2007 3 2.23
7 HI 14 0.543 150 2006 2007 2 2.17
8 HI 5 2.43 200 2006 2007 5 2.51
9 HI 9 1.69 100 2006 2007 4 2.67
10 HI 3 2.02 50 2006 2007 2 2.86
# ... with 489 more rows

In the next step, count how many times each value of the PLOT variable occurs

PLOT.count = df %>% 
group_by(PLOT) %>%
summarise(PLOT.n = n()) %>%
arrange(PLOT.n)

ouptut

# A tibble: 22 x 2
PLOT PLOT.n
<chr> <int>
1 HI 20 3
2 HI 22 5
3 HI 21 7
4 HI 18 12
5 HI 2 19
6 HI 1 20
7 HI 15 20
8 HI 17 21
9 HI 6 22
10 HI 11 23
# ... with 12 more rows

In the penultimate step, let's append these counters to the original data frame

df = df %>% left_join(PLOT.count, by="PLOT")

output

# A tibble: 499 x 8
PLOT SIZE DENSITY SEEDYR SAMPYR AGE SHOOTS PLOT.n
<chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int>
1 HI 15 1.67 200 2006 2007 4 1.80 20
2 HI 14 0.270 150 2006 2007 2 2.44 32
3 HI 3 0.856 50 2006 2007 3 0.686 27
4 HI 10 1.25 200 2006 2007 5 1.43 25
5 HI 11 0.673 50 2006 2007 5 1.40 23
6 HI 5 2.51 150 2006 2007 3 2.23 38
7 HI 14 0.543 150 2006 2007 2 2.17 32
8 HI 5 2.43 200 2006 2007 5 2.51 38
9 HI 9 1.69 100 2006 2007 4 2.67 26
10 HI 3 2.02 50 2006 2007 2 2.86 27
# ... with 489 more rows

Now filter it at will

df %>% filter(PLOT.n > 30)

ouptut

# A tibble: 139 x 8
PLOT SIZE DENSITY SEEDYR SAMPYR AGE SHOOTS PLOT.n
<chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int>
1 HI 14 0.270 150 2006 2007 2 2.44 32
2 HI 5 2.51 150 2006 2007 3 2.23 38
3 HI 14 0.543 150 2006 2007 2 2.17 32
4 HI 5 2.43 200 2006 2007 5 2.51 38
5 HI 8 0.598 50 2006 2007 1 1.70 34
6 HI 7 1.94 50 2006 2007 4 1.61 35
7 HI 14 2.91 50 2006 2007 4 0.215 32
8 HI 7 0.846 150 2006 2007 4 0.506 35
9 HI 7 2.38 150 2006 2007 3 1.34 35
10 HI 7 2.62 100 2006 2007 3 0.167 35
# ... with 129 more rows

Or this way

df %>% filter(PLOT.n == min(PLOT.n))
df %>% filter(PLOT.n == median(PLOT.n))
df %>% filter(PLOT.n == max(PLOT.n))

output

# A tibble: 3 x 8
PLOT SIZE DENSITY SEEDYR SAMPYR AGE SHOOTS PLOT.n
<chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int>
1 HI 20 0.392 200 2009 2010 1 0.512 3
2 HI 20 0.859 150 2009 2010 5 2.62 3
3 HI 20 0.882 200 2009 2010 5 1.06 3
> df %>% filter(PLOT.n == median(PLOT.n))
# A tibble: 26 x 8
PLOT SIZE DENSITY SEEDYR SAMPYR AGE SHOOTS PLOT.n
<chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int>
1 HI 9 1.69 100 2006 2007 4 2.67 26
2 HI 9 2.20 50 2006 2007 4 1.49 26
3 HI 9 0.587 200 2006 2007 3 1.13 26
4 HI 9 1.27 50 2006 2007 1 2.55 26
5 HI 9 1.56 150 2006 2007 3 2.01 26
6 HI 9 0.198 100 2006 2007 3 2.08 26
7 HI 9 2.72 150 2007 2008 3 0.421 26
8 HI 9 0.251 200 2007 2008 2 0.328 26
9 HI 9 1.83 50 2007 2008 1 0.192 26
10 HI 9 1.97 100 2007 2008 1 0.900 26
# ... with 16 more rows
> df %>% filter(PLOT.n == max(PLOT.n))
# A tibble: 38 x 8
PLOT SIZE DENSITY SEEDYR SAMPYR AGE SHOOTS PLOT.n
<chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int>
1 HI 5 2.51 150 2006 2007 3 2.23 38
2 HI 5 2.43 200 2006 2007 5 2.51 38
3 HI 5 2.06 100 2006 2007 5 1.93 38
4 HI 5 1.25 150 2006 2007 4 2.29 38
5 HI 5 2.29 200 2006 2007 1 2.97 38
6 HI 5 0.789 150 2006 2007 2 1.59 38
7 HI 5 1.11 100 2007 2008 4 2.61 38
8 HI 5 2.38 150 2007 2008 4 2.95 38
9 HI 5 2.67 200 2007 2008 3 1.77 38
10 HI 5 2.63 100 2007 2008 1 1.90 38
# ... with 28 more rows

R for loop: creating data frames using split?

It is not recommended to create separate dataframes in the global environment, they are difficult to keep track of. Put them in a list instead. You have started off well by using split and creating list of dataframes. You can then iterate over each dataframe in the list and apply the function on each one of them.

Using by this would look like as :

by(tss, tss$created_at, function(x) {
bscore3 <- score.sentiment(x$cleaned_text,pos.words,neg.words,.progress='text')
score3 <- as.integer(bscore3$score[[1]])
return(score3)
}) -> result

result

Usage of 'for loop' in R to split a dataframe into several dataframes

An easy way to do this is to create a factor vector by appending the string sys to the id numbers, and using it to split the data. There is no need to use a for() loop to produce the desired output, since the result of split() is a list of data frames when the input to be split is a data frame.

The value of the factor is used to name each element in the list generated by split(). In the case of the OP, since sysid is numeric and starts with 1, it's not obvious that the id numbers are being used to name the resulting data frames in the list, as explained in the help for split().

Using the data from the OP we'll illustrate how to use the sysid column to create a factor variable that combines the string sys with the id values, and split it into a list of data frames that can be accessed by name.

rawData <- "Date      sysid   power   temperature
1.1.2018 1 1000 14
2.1.2018 1 1200 16
3.1.2018 1 800 18
1.1.2018 2 1500 8
2.1.2018 2 800 18
3.1.2018 2 1300 11"

data <- read.table(text = rawData,header=TRUE)
sysidName <- paste0("sys",data$sysid)

splitData <- split(data,sysidName)

splitData

...and the output:

> splitData
$`sys1`
Date sysid power temperature
1 1.1.2018 1 1000 14
2 2.1.2018 1 1200 16
3 3.1.2018 1 800 18

$sys2
Date sysid power temperature
4 1.1.2018 2 1500 8
5 2.1.2018 2 800 18
6 3.1.2018 2 1300 11

>

At this point one can access individual data frames in the list by using the $ form of the extract operator:

> splitData$sys1
Date sysid power temperature sysidName
1 1.1.2018 1 1000 14 sys1
2 2.1.2018 1 1200 16 sys1
3 3.1.2018 1 800 18 sys1
>

Also, by using the names() function one can obtain a vector of all the named elements in the list of data frames.

> names(splitData)
[1] "sys1" "sys2"
>

Reiterating the main point from the top of the answer, when split() is used with a data frame, the resulting list is a list of objects of type data.frame(). For example:

> str(splitData["sys1"])
List of 1
$ sys1:'data.frame': 3 obs. of 4 variables:
..$ Date : Factor w/ 3 levels "1.1.2018","2.1.2018",..: 1 2 3
..$ sysid : int [1:3] 1 1 1
..$ power : int [1:3] 1000 1200 800
..$ temperature: int [1:3] 14 16 18
>

If you must use a for() loop...

Since the OP asked whether the problem could be solved with a for() loop, the answer is "yes."

# create a vector containing unique values of sysid
ids <- unique(data$sysid)
# initialize output data frame list
dfList <- list()
# loop thru unique values and generate named data frames in list()
for(i in ids){
dfname <- paste0("sys",i)
dfList[[dfname]] <- data[data$sysid == i,]
}
dfList

...and the output:

> for(i in ids){
+ dfname <- paste0("sys",i)
+ dfList[[dfname]] <- data[data$sysid == i,]
+ }
> dfList
$`sys1`
Date sysid power temperature
1 1.1.2018 1 1000 14
2 2.1.2018 1 1200 16
3 3.1.2018 1 800 18

$sys2
Date sysid power temperature
4 1.1.2018 2 1500 8
5 2.1.2018 2 800 18
6 3.1.2018 2 1300 11

Choosing the "best" answer

Between split(), for() and the other answer using by(), how do we choose the best answer?

One way is to determine which version runs fastest, given that the real data will be much larger than the sample data from the original post.

We can use the microbenchmark package to compare the performance of the three different approaches.

split() performance

library(microbenchmark)
> microbenchmark(splitData <- split(data,sysidName),unit="us")
Unit: microseconds
expr min lq mean median uq max neval
splitData <- split(data, sysidName) 144.594 147.359 185.7987 150.1245 170.4705 615.507 100
>

for() performance

> microbenchmark(for(i in ids){
+ dfname <- paste0("sys",i)
+ dfList[[dfname]] <- data[data$sysid == i,]
+ },unit="us")
Unit: microseconds
expr min lq mean
for (i in ids) { dfname <- paste0("sys", i) dfList[[dfname]] <- data[data$sysid == i, ] } 2643.755 2857.286 3457.642
median uq max neval
3099.064 3479.311 8511.609 100
>

by() performance

> microbenchmark(df_list <- by(df, df$sysid, function(unique) unique),unit="us")
Unit: microseconds
expr min lq mean median uq max neval
df_list <- by(df, df$sysid, function(unique) unique) 256.791 260.5445 304.9296 275.9515 309.5325 1218.372 100
>

...and the winner is:

split(), with an average runtime of 186 microseconds, versus 305 microseconds for by() and a whopping 3,458 microseconds for the for() loop approach.



Related Topics



Leave a reply



Submit