How to Remove Certain Columns in Multiple Data Frames in R

How to remove certain columns in multiple data frames in R?

Suppose you have several data.frames dat1, dat2, dat3, etc. Instead of working with individual datasets, place them in a list and do the processing. After the removal of first column, if you still need the original data.frame object to reflect the change (not advised as you can do all the analysis within the list itself), use list2env.

 lst <- mget(ls(pattern='^dat\\d+'))
list2env(lapply(lst,`[`,-1), envir=.GlobalEnv)

dat1

If you have different dataset names D1, C1, datC, newDat etc with no clear common patterns (not clear from the question), then you could still create a list manually (extreme cases)

 lst1 <- list(D1=D1, C1=C1, datc=datC, newDat=newDat)

and do the list2env(lapply(...

Or read all the files (if all the files are in the working directory) directly into a list and process it.

 files <- list.files() #if you want to read all the files in working directory
lst2 <- lapply(files, function(x) read.table(x, header=TRUE))
lapply(lst2,`[`,-1)

data

set.seed(24)
dat1 <- as.data.frame(matrix(sample(1:40, 5*3, replace=TRUE), ncol=5))
dat2 <- as.data.frame(matrix(sample(20, 3*5, replace=TRUE), ncol=3))
dat3 <- as.data.frame(matrix(sample(80, 2*10, replace=TRUE), ncol=2))

Removing column/s from multiple data-frames

First it's not very wise to use variable names with whitespace. Use points instead, like "Group.1".

Second, while c("Group 1","Group 2") works, -c("Group 1","Group 2") won't, because you're attempting to do arithmetics with characters: - "Group.1" doesn't make much sense, does it?

So we can use %in% which is kind of a matching operator together with which() which gives the indices of the elements.

remover <- function(input) {
input <- input[, -which(names(input) %in% c("Group.1", "Group.2"))]
return(input)
}

x1 <- remover(df1)
> x1
X1 X2
1 4 10
2 5 11
3 6 12

There's another option to achieve the negation of the colnames instead of -. We can use !.

remover2 <- function(input) {
input <- input[, which(!names(input) %in% c("Group.1", "Group.2"))]
return(input)
}

There might be a slight speed advantage using !.

Unit: microseconds
expr min lq mean median uq max neval cld
remover(df1) 26.789 28.065 30.27720 28.321 28.831 18419.152 1e+05 a
remover2(df1) 26.534 27.810 30.09483 28.065 28.576 6056.713 1e+05 a

> all.equal(x1, x2)
[1] TRUE

Data:

df1 <- data.frame(matrix(1:12, 
3, 4,
dimnames=list(NULL, c("Group.1", "X1",
"Group.2", "X2"))))

Drop data frame columns by name

There's also the subset command, useful if you know which columns you want:

df <- data.frame(a = 1:10, b = 2:11, c = 3:12)
df <- subset(df, select = c(a, c))

UPDATED after comment by @hadley: To drop columns a,c you could do:

df <- subset(df, select = -c(a, c))

Removing specific columns from multiple data frames (.tab) and then merging them in R

First lets make a list of fake files

fakefile <- 'a\tb\tc\td
1\t2\t3\t4'

# In your case instead oof the string it would be the name of the file,
# and therefore it would not have the `text` argument
str(read.table(text = fakefile, header = TRUE))

## 'data.frame': 1 obs. of 4 variables:
## $ a: int 1
## $ b: int 2
## $ c: int 3
## $ d: int 4

# This list would be analogous to your `filenames` list
fakefile_list <- rep(fakefile, 20)
str(fakefile_list)

## chr [1:20] "a\tb\tc\td\n1\t2\t3\t4" "a\tb\tc\td\n1\t2\t3\t4" ...

In principle, all solutions will have the same underlying work as a list
and then merge concept (although the merge might be different here and there).

Solution 1 - If you can rely on the order of column 1

If you can rely on the ordering of the columns, then you dont really need to
read columns 1 and 4 of each file, but just col 4 and bind them.

# Reading column 1 once....
col1 <- read.table(text = fakefile_list[1], header = TRUE)[,1]

# Reading cols 4 in all files

# We first make a function that does our tasks (reading and removing cols)

reader_fun <- function(x) {
read.table(text = x, header = TRUE)[,4]
}

# Then we use lapply to use that function on each elment of our list

cols4 <- lapply(fakefile_list, FUN = reader_fun)
str(cols4)

## List of 20
## $ : int 4
## $ : int 4
## $ : int 4
## $ : int 4

# Then we use do.call and cbind to merge all of them as a matrix
cols4_mat <- do.call(cbind, cols4)

# And finally add column 1 to it
data.frame(col1, cols4_mat)

## col1 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19
## 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## X20
## 1 4

Solution 2 - If you can not rely in the order

The implementation is easier but it is a lot slower in most situations

# In your case it would be like this ...
# lapply(fakefile_list, FUN = function(x) read.table(x)[, c(1,4)], header = TRUE)

# But since im passing text and not file names ...
my_contents <- lapply(fakefile_list, FUN = function(x, ...) read.table(text = x, ...)[, c(1,4)], header = TRUE)

# And now we use full join and Reduce to merge everything
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)

## a d.x d.y d.x.x d.y.y d.x.x.x d.y.y.y d.x.x.x.x d.y.y.y.y d.x.x.x.x.x
## 1 1 4 4 4 4 4 4 4 4 4
## d.y.y.y.y.y d.x.x.x.x.x.x d.y.y.y.y.y.y d.x.x.x.x.x.x.x d.y.y.y.y.y.y.y
## 1 4 4 4 4 4
## d.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x
## 1 4 4 4
## d.y.y.y.y.y.y.y.y.y d.x.x.x.x.x.x.x.x.x.x d.y.y.y.y.y.y.y.y.y.y
## 1 4 4 4

# you will need to modify the column names btw ...

Bonus - And the most concise solution ...

Depending on how big your data sets are, you might want to ignore the extra
columns from the start (instead of reading them and then removing them).
You can use fread from the data.table package to do that for you.

reader_function <- function(x) {
data.table::fread(x, select = c(1,4))
}

my_contents <- lapply(fakefile_list, FUN = reader_function)
Reduce(function(x,y) dplyr::full_join(x,y, by = 'a') , my_contents)

R remove common columns in dataframes

Using set operations viz. union intersect and setdiff on names of both dfs, we may do this

df1 <- read.table(header = T, text = 'a b c g
1 0 0 0
2 0 0 1')

df2 <- read.table(header = T, text = 'a b d e f
1 1 0 0 0
2 0 0 0 1')

# uncommon column names
x <- setdiff(union(names(df1), names(df2)), intersect(names(df1), names(df2)))

cbind(df1[names(df1) %in% x], df2[names(df2) %in% x])
#> c g d e f
#> 1 0 0 0 0 0
#> 2 0 1 0 0 1

Created on 2021-06-15 by the reprex package (v2.0.0)

Remove rows until columns are identical over multiple data frames

We assume that the question is how to get the values of Type that are common to 4 data frames each of which has a Type column containing unique values.

Form a list L of the data frames, extract the Type column using lapply and [ and iterate merge over that using Reduce :

L <- list(w, x, y, z)
L.Type <- lapply(L, "[", TRUE, "Type", drop = FALSE) # list of DFs w only Type col
Reduce(merge, L.Type)$Type
## [1] "26809D" "28503C" "32708V" "360254" "69298N"

or replace last line with this giving the same result except for order:

Reduce(intersect, L.Type)$Type
## [1] "26809D" "28503C" "360254" "69298N" "32708V"

Another approach which is a bit tedious but does reduce the calulation to one line is to manually iterate intersect:

intersect(w$Type, intersect(x$Type, intersect(y$Type, z$Type)))
## [1] "26809D" "28503C" "360254" "69298N" "32708V"

Another example

The example data is not very good to illustrate this because every data frame has the same values of Type so let us create another example. BOD is a built-in data frame has 6 rows. We assign it to X and rename the columns so that the first one has the name Type. Then for i equals 1, 2, 3, 4 we remove the i-th row giving 4 data frames with 5 rows each and 2 values of Type common to all 4. The result correctly shows that 5 and 7 are the only common Type values.

# set up input L, a list of 4 data frames
X <- BOD
names(X) <- c("Type", "X")
L <- lapply(1:4, function(i) X[-i, ])

L.Type <- lapply(L, "[", TRUE, "Type", drop = FALSE)
Reduce(merge, L.Type)$Type
## [1] 5 7

R: for-loop solution to deleting columns from multiple data frames

If you have a data frame, you index rows and columns with

data.frame[row, column]

So, data.frame[2,7]) will give you the value of the 2nd row in the 7th column. I guess you were looking for

temp <- temp[, c(2,7)]

or, if temp is a list of data frames

temp <- lapply(temp, function(x) x[, c(2,7)])

So, if you want to use a vector of numbers as column- or row-indices, create this vector with c(...). If I understand your example right, you don't need any loop-command, if you use lapply.



Related Topics



Leave a reply



Submit