Creating a for Loop to Subset Data in R

Advice on a loop function to subset data according to variables

Based on your description, I assume your data looks something like this:

country_year <- c("Australia_2013", "Australia_2014", "Bangladesh_2013")
health <- matrix(nrow = 3, ncol = 3, data = runif(9))
dataset <- data.frame(rbind(country_year, health), row.names = NULL, stringsAsFactors = FALSE)

dataset
# X1 X2 X3
#1 Australia_2013 Australia_2014 Bangladesh_2013
#2 0.665947273839265 0.677187719382346 0.716064820764586
#3 0.499680359382182 0.514755881391466 0.178317369660363
#4 0.730102791683748 0.666969108628109 0.0719663293566555

First, move row 1 (Australia_2013, Australia_2014, and so on) to the column names, and then apply the loop to create country-based data frames.

library(dplyr)

# move header
dataset2 <- dataset %>%
  `colnames<-`(unlist(dataset[1, ], use.names = FALSE)) %>% # uses row 1 as column names
  slice(-1) %>% # removes row 1 from data
  mutate_all(type.convert, as.is = TRUE) # converts each column to an appropriate type

# apply loop
for (country in unique(gsub("_\\d+", "", colnames(dataset2)))) {
  assign(country, select(dataset2, starts_with(country))) # makes subsets
}

Regarding the loop,

gsub("_\\d+", "", colnames(dataset2)) extracts the country names by replacing "_[year]" with nothing (i.e., removing it), and unique() then keeps one copy of each country name.

assign(country, select(dataset2, starts_with(country))) creates a variable named after the country and this country variable only contains the columns from dataset2 that start with the country name.
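To see the two pieces of the loop in isolation, here is a minimal base R sketch. The toy data frame and the list name by_country are made up for illustration, and the subsets are collected in a named list instead of being created with assign().

```r
# Hypothetical column names in the "Country_Year" format
cols <- c("Australia_2013", "Australia_2014", "Bangladesh_2013")

# Strip the "_<year>" suffix, then keep one copy of each country name
countries <- unique(gsub("_\\d+", "", cols))

# Small stand-in for dataset2 (values are arbitrary)
toy <- data.frame(1:3, 4:6, 7:9)
names(toy) <- cols

# Collect the per-country subsets in a named list
by_country <- list()
for (country in countries) {
  by_country[[country]] <- toy[, startsWith(names(toy), country), drop = FALSE]
}

names(by_country)
ncol(by_country$Australia)
```

Each subset is then available as by_country$Australia, by_country$Bangladesh, and so on.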

Edit: Responding to Comment

The question in the comment was asking how to add row-wise summaries (e.g., rowSums(), rowMeans()) as new columns in the country-based data frames, while using this for-loop.

Here is one solution that requires minimal changes:

for (country in unique(gsub("_\\d+", "", colnames(dataset2)))) {
  assign(country,
         select(dataset2, starts_with(country)) %>% # makes subsets
           mutate( # creates new columns
             rowSums = rowSums(select(., starts_with(country))),
             rowMeans = rowMeans(select(., starts_with(country)))
           )
  )
}

mutate() adds new columns to a dataset.

select(., starts_with(country)) selects columns that start with the country name from the current object (represented as . in the function).
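The same row-wise summaries can be sketched in base R on a tiny, made-up data frame (the values and the name toy are assumptions for illustration). Selecting the country columns first keeps the new summary columns out of the calculation.

```r
# Hypothetical single-country subset with two year columns
toy <- data.frame(Australia_2013 = c(1, 2), Australia_2014 = c(3, 4))

# Pick the value columns once, before any summary column is added
vals <- toy[, startsWith(names(toy), "Australia"), drop = FALSE]

# Add row-wise summaries as new columns
toy$rowSums <- rowSums(vals)
toy$rowMeans <- rowMeans(vals)

toy
```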

How to write a loop in R to create multiple different subsets of data based on column names?

The base function combn is ideal for this. It generates every pair of the remaining column names and can call a function on each pair.

First, some data.

set.seed(1234)
df1 <- matrix(rnorm(5*(4+5)), nrow = 5)
df1 <- as.data.frame(df1)

Now the code. Note that I keep only the first 4 columns as the common ones, not 9. You should also change the default value of fun's argument from DF = df1 to DF = yourdata.

first_cols <- 1:4

fun <- function(nms, DF = df1, fc = first_cols){
  cols <- c(names(DF)[fc], nms)
  outfile <- paste(nms, collapse = 'x')
  outfile <- paste(outfile, 'txt', sep = '.')
  write.table(DF[cols], outfile,
              row.names = FALSE, col.names = FALSE,
              quote = FALSE, sep = ' ')
  cols
}
combn(names(df1)[-first_cols], 2, fun)
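As a quick illustration of what combn does with and without a function argument, here is a small sketch that writes no files. The column names in extra are hypothetical.

```r
# Hypothetical names of the non-common columns
extra <- c("V5", "V6", "V7")

# Without a function: a matrix with one pair per column
pairs <- combn(extra, 2)
dim(pairs)

# With a function: FUN is called once per pair; extra arguments
# (here collapse = "x") are passed on to FUN
labels <- combn(extra, 2, FUN = paste, collapse = "x")
labels
```

The labels built here follow the same pattern that fun uses for its output file names.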

How to create a loop which creates multiple subset dataframes from a larger data frame?

Your code works fine. Just remove list so you create a vector of color names and not a list. If you only want distinct values, use unique.

mydata <- data.frame(x = c(1,2,3), y = c('a','b','c'), z = c('red','red','yellow'))

colors <- unique(mydata$z)

for (i in seq_along(colors)) { # seq_along() is safe even if colors is empty
  assign(paste0("mydata_", i), subset(mydata, z == colors[[i]]))
}
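If separate variables are not strictly required, one alternative worth considering is split(), which returns all subsets in a single named list. This is a sketch of that alternative, not the approach from the question; the name by_color is made up.

```r
mydata <- data.frame(x = c(1, 2, 3),
                     y = c("a", "b", "c"),
                     z = c("red", "red", "yellow"))

# One data frame per distinct value of z, named after that value
by_color <- split(mydata, mydata$z)

names(by_color)
nrow(by_color$red)
```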

R: loop through data frame extracting subset of data depending on date

Is this what you want?
df_list <- split(data, as.factor(data$date))
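Once the data is split, each date's subset can be processed without an explicit loop over rows. A minimal sketch, assuming a made-up data frame with a date column and a value column:

```r
# Hypothetical data: two rows on one date, one row on another
data <- data.frame(date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02")),
                   value = c(10, 20, 5))

# One data frame per date, named by the date
df_list <- split(data, as.factor(data$date))

# Apply a function to every subset, here the sum of value per date
totals <- sapply(df_list, function(d) sum(d$value))
totals
```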

R: Subset data using for-loop

No!

That's not the way it works in R. ;) You want to use vectorized code because it's much more concise and faster (in R). Here are two solutions:

df = subset(CBS, `Wijken en buurten` %in% c("Oud-Overdie", "Overdie-West", "Overdie-Oost", "Oosterhout", "De Hoef III en IV"))

df = CBS[CBS$`Wijken en buurten` %in% c("Oud-Overdie", "Overdie-West", "Overdie-Oost", "Oosterhout", "De Hoef III en IV"),]
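Both solutions rely on %in% returning one TRUE or FALSE per row. A small sketch with invented data (the data frame toy and the column inwoners are assumptions, not part of the CBS data):

```r
# Hypothetical stand-in for the CBS data
toy <- data.frame(buurt = c("Oud-Overdie", "Centrum", "Oosterhout"),
                  inwoners = c(100, 200, 300))

# One logical value per row: is this row's buurt in the wanted set?
keep <- toy$buurt %in% c("Oud-Overdie", "Oosterhout")
keep

# The logical vector selects the matching rows in one step
toy[keep, ]
```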

Subsetting a data set inside for loop

It is generally not advisable to use assign in R. The function is available, but its use is not recommended. I believe the results you are looking for can be generated in a much simpler manner.

The lapply command performs the same function as the for loops above.

#out<- #your dataframe of data

#define a vector of string values
iter<-c("COD1", "COD2", "COD3")
#create a list of dataframes of the subsets
ans<-lapply(iter, function(x) {subset(out, TestId==x)})
#rename the list elements
names(ans)<-iter

#to access each subset, use any of the methods below
#([[ ]] and $ return the data frame itself; [ ] returns a list of length one):
ans[[1]]
ans["COD1"]
ans$COD1
ans[iter[1]]
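The access methods are not fully interchangeable, which a small sketch can show. The data frame out below is invented for illustration, and the filter uses bracket indexing rather than subset().

```r
# Hypothetical data with a TestId column
out <- data.frame(TestId = c("COD1", "COD2", "COD1"), score = c(1, 2, 3))
iter <- c("COD1", "COD2")

# One subset per TestId, stored in a named list
ans <- lapply(iter, function(x) out[out$TestId == x, ])
names(ans) <- iter

is.data.frame(ans[["COD1"]])  # [[ ]] extracts the element itself
is.data.frame(ans$COD1)       # $ does the same by name
is.data.frame(ans["COD1"])    # [ ] keeps the list wrapper
```

The first two calls return TRUE and the last returns FALSE, so ans["COD1"] needs one more extraction step before it can be used as a data frame.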

Subset data frame within a for loop

Don't use assign; use a list instead!

# for loop approach
results = list()
for(nm in names(data)[-1]) { # omit the first column
results[[nm]] = data[data[[nm]] %in% "Y", "Column I want", drop = FALSE]
}

# lapply approach
results = lapply(data[-1], function(col) data[col %in% "Y", "Column I want", drop = FALSE])

The drop = FALSE argument makes sure you get 1-column data frames, not vectors, as the result.

As for the issue in your approach, names[i] is just a string, so you're testing if, say, "var2" == "Y", which is false.
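The difference between comparing a column name and comparing the column's values can be seen in a short sketch; the data frame here is made up for illustration.

```r
# Hypothetical data with one flag column
data <- data.frame(id = 1:3, var2 = c("Y", "N", "Y"))
nm <- "var2"

# Compares the name itself with "Y": a single FALSE
name_test <- nm == "Y"

# Compares the column's values with "Y": one result per row
value_test <- data[[nm]] == "Y"

name_test
value_test
```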


