How to remove certain columns in multiple data frames in R?
Suppose you have several data.frames dat1
, dat2
, dat3
, etc. Instead of working with individual datasets, place them in a list
and do the processing. After the removal of first column, if you still need the original data.frame
object to reflect the change (not advised as you can do all the analysis within the list itself), use list2env
.
lst <- mget(ls(pattern='^dat\\d+'))
list2env(lapply(lst,`[`,-1), envir=.GlobalEnv)
dat1
If you have different dataset names D1
, C1
, datC
, newDat
etc with no clear common patterns (not clear from the question), then you could still create a list manually (extreme cases)
lst1 <- list(D1=D1, C1=C1, datc=datC, newDat=newDat)
and do the list2env(lapply(...
Or read all the files (if all the files are in the working directory) directly into a list and process it.
files <- list.files() #if you want to read all the files in working directory
lst2 <- lapply(files, function(x) read.table(x, header=TRUE))
lapply(lst2,`[`,-1)
data
set.seed(24)
dat1 <- as.data.frame(matrix(sample(1:40, 5*3, replace=TRUE), ncol=5))
dat2 <- as.data.frame(matrix(sample(20, 3*5, replace=TRUE), ncol=3))
dat3 <- as.data.frame(matrix(sample(80, 2*10, replace=TRUE), ncol=2))
Removing column/s from multiple data-frames
First it's not very wise to use variable names with whitespace. Use points instead, like "Group.1"
.
Second, while c("Group 1","Group 2")
works, -c("Group 1","Group 2")
won't, because you're attempting to do arithmetics with characters: - "Group.1"
doesn't make much sense, does it?
So we can use %in%
which is kind of a matching operator together with which()
which gives the indices of the elements.
remover <- function(input) {
input <- input[, -which(names(input) %in% c("Group.1", "Group.2"))]
return(input)
}
x1 <- remover(df1)
> x1
X1 X2
1 4 10
2 5 11
3 6 12
There's another option to achieve the negation of the colnames instead of -
. We can use !
.
remover2 <- function(input) {
input <- input[, which(!names(input) %in% c("Group.1", "Group.2"))]
return(input)
}
There might be a slight speed advantage using !
.
Unit: microseconds
expr min lq mean median uq max neval cld
remover(df1) 26.789 28.065 30.27720 28.321 28.831 18419.152 1e+05 a
remover2(df1) 26.534 27.810 30.09483 28.065 28.576 6056.713 1e+05 a
> all.equal(x1, x2)
[1] TRUE
Data:
df1 <- data.frame(matrix(1:12,
3, 4,
dimnames=list(NULL, c("Group.1", "X1",
"Group.2", "X2"))))
Drop data frame columns by name
There's also the subset
command, useful if you know which columns you want:
df <- data.frame(a = 1:10, b = 2:11, c = 3:12)
df <- subset(df, select = c(a, c))
UPDATED after comment by @hadley: To drop columns a,c you could do:
df <- subset(df, select = -c(a, c))
Removing specific columns from multiple data frames (.tab) and then merging them in R
First lets make a list of fake files
fakefile <- 'a\tb\tc\td
1\t2\t3\t4'
# In your case instead oof the string it would be the name of the file,
# and therefore it would not have the `text` argument
str(read.table(text = fakefile, header = TRUE))
## 'data.frame': 1 obs. of 4 variables:
## $ a: int 1
## $ b: int 2
## $ c: int 3
## $ d: int 4
# This list would be analogous to your `filenames` list
fakefile_list <- rep(fakefile, 20)
str(fakefile_list)
## chr [1:20] "a\tb\tc\td\n1\t2\t3\t4" "a\tb\tc\td\n1\t2\t3\t4" ...
In principle, all solutions will have the same underlying work as a list
and then merge concept (although the merge might be different here and there).
Solution 1 - If you can rely on the order of column 1
If you can rely on the ordering of the columns, then you dont really need to
read columns 1 and 4 of each file, but just col 4 and bind them.
# Reading column 1 once....
col1 <- read.table(text = fakefile_list[1], header = TRUE)[,1]
# Reading cols 4 in all files
# We first make a function that does our tasks (reading and removing cols)
reader_fun <- function(x) {
read.table(text = x, header = TRUE)[,4]
}
# Then we use lapply to use that function on each elment of our list
cols4 <- lapply(fakefile_list, FUN = reader_fun)
str(cols4)
## List of 20
## $ : int 4
## $ : int 4
## $ : int 4
## $ : int 4
# Then we use do.call and cbind to merge all of them as a matrix
cols4_mat <- do.call(cbind, cols4)
# And finally add column 1 to it
data.frame(col1, cols4_mat)
## col1 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19
## 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## X20
## 1 4
Solution 2 - If you can not rely in the order
The implementation is easier but it is a lot slower in most situations
# In your case it would be like this ...
# lapply(fakefile_list, FUN = function(x) read.table(x)[, c(1,4)], header = TRUE)
# But since im passing text and not file names ...
my_contents <- lapply(fakefile_list, FUN = function(x, ...) read.table(text = x, ...)[, c(1,4)], header = TRUE)
# And now we use full join and Reduce to merge everything
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)
## a d.x d.y d.x.x d.y.y d.x.x.x d.y.y.y d.x.x.x.x d.y.y.y.y d.x.x.x.x.x
## 1 1 4 4 4 4 4 4 4 4 4
## d.y.y.y.y.y d.x.x.x.x.x.x d.y.y.y.y.y.y d.x.x.x.x.x.x.x d.y.y.y.y.y.y.y
## 1 4 4 4 4 4
## d.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x
## 1 4 4 4
## d.y.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y.y.y
## 1 4 4 4
# you will need to modify the column names btw ...
Bonus - And the most concise solution ...
Depending on how big your data sets are, you might want to ignore the extra
columns from the start (instead of reading them and then removing them).
You can use fread from the data.table package to do that for you.
reader_function <- function(x) {
data.table::fread(x, select = c(1,4))
}
my_contents <- lapply(fakefile_list, FUN = reader_function)
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)
R remove common columns in dataframes
Using set operations viz. union
intersect
and setdiff
on names
of both dfs, we may do this
df1 <- read.table(header = T, text = 'a b c g
1 0 0 0
2 0 0 1')
df2 <- read.table(header = T, text = 'a b d e f
1 1 0 0 0
2 0 0 0 1')
# uncommon column names
x <- setdiff(union(names(df1), names(df2)), intersect(names(df1), names(df2)))
cbind(df1[names(df1) %in% x], df2[names(df2) %in% x])
#> c g d e f
#> 1 0 0 0 0 0
#> 2 0 1 0 0 1
Created on 2021-06-15 by the reprex package (v2.0.0)
Remove rows until columns are identical over multiple data frames
We assume that the question is how to get the values of Type
that are common to 4 data frames each of which has a Type
column containing unique values.
Form a list L
of the data frames, extract the Type
column using lapply
and [
and iterate merge
over that using Reduce
:
L <- list(w, x, y, z)
L.Type <- lapply(L, "[", TRUE, "Type", drop = FALSE) # list of DFs w only Type col
Reduce(merge, L.Type)$Type
## [1] "26809D" "28503C" "32708V" "360254" "69298N"
or replace last line with this giving the same result except for order:
Reduce(intersect, L.Type)$Type
## [1] "26809D" "28503C" "360254" "69298N" "32708V"
Another approach which is a bit tedious but does reduce the calulation to one line is to manually iterate intersect
:
intersect(w$Type, intersect(x$Type, intersect(y$Type, z$Type)))
## [1] "26809D" "28503C" "360254" "69298N" "32708V"
Another example
The example data is not very good to illustrate this because every data frame has the same values of Type so let us create another example. BOD
is a built-in data frame has 6 rows. We assign it to X
and rename the columns so that the first one has the name Type
. Then for i
equals 1, 2, 3, 4 we remove the i-th row giving 4 data frames with 5 rows each and 2 values of Type
common to all 4. The result correctly shows that 5 and 7 are the only common Type
values.
# set up input L, a list of 4 data frames
X <- BOD
names(X) <- c("Type", "X")
L <- lapply(1:4, function(i) X[-i, ])
L.Type <- lapply(L, "[", TRUE, "Type", drop = FALSE)
Reduce(merge, L.Type)$Type
## [1] 5 7
R: for-loop solution to deleting columns from multiple data frames
If you have a data frame, you index rows and columns with
data.frame[row, column]
So, data.frame[2,7])
will give you the value of the 2nd row in the 7th column. I guess you were looking for
temp <- temp[, c(2,7)]
or, if temp
is a list of data frames
temp <- lapply(temp, function(x) x[, c(2,7)])
So, if you want to use a vector of numbers as column- or row-indices, create this vector with c(...)
. If I understand your example right, you don't need any loop-command, if you use lapply
.
Related Topics
Convert to Local Time Zone Using Latitude and Longitude
How to Add Columnn Titles in a Sankey Chart Networkd3
Applying Function (Ks.Test) Between Two Data Frames Column-Wise in R
Fill Missing Values Rowwise (Right/Left)
Shiny Sliderinput from Max to Min
Trouble with Strings with <U+0092> Unicode Characters
Ggplot2 Violin Plot: Fill Central 95% Only
R Markdown Add Tag to Head of HTML Output
Extract Only Folder Name Right Before Filename from Full Path
Multiplication of Large Integers
R: Check If Value from Dataframe Is Within Range Other Dataframe
R:Binary Matrix for All Possible Unique Results
Barplot with Multiple Columns in R
How to Use Different Font Sizes in Ggplot Facet Wrap Labels
How to Get The Intersection Point of Two Vector
How to Set Contrasts for My Variable in Regression Analysis with R