Strange Behaviour Dropping Column from Data.Frame in R

While your exact question has already been answered in the comments, an alternative that avoids this behaviour is to convert your data.frame to a tibble, a stripped-down version of a data.frame that, among other things, does not munge column names and does not partially match them with $:

library(tibble)
df_t <- as_tibble(a)  # as_data_frame() is deprecated in current tibble releases
df_t
# A tibble: 3 × 1
abc
<dbl>
1 3
2 2
3 1
> df_t$a
NULL
Warning message:
Unknown column 'a'
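
For comparison, the partial matching that the tibble refuses to do can be reproduced in base R. This is a minimal sketch; the one-column data frame is a hypothetical stand-in for the asker's a, which is not shown here.

```r
# Hypothetical one-column data frame standing in for the asker's `a`
a <- data.frame(abc = c(3, 2, 1))

a$a        # `$` partially matches "a" to "abc" and returns the column
a[["a"]]   # `[[` matches exactly by default, so this is NULL
a[["abc"]] # the full name always works

# Optionally ask R to warn whenever `$` falls back to partial matching
options(warnPartialMatchDollar = TRUE)
```

Using [[ with the full column name is one way to stay on a plain data.frame and still avoid the surprise.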

Unexpected behaviour: Removing rows from data frame converts to vector R

We can use the drop argument. The usage shown in ?Extract has it set to TRUE by default:

x[i, j, ... , drop = TRUE]

and the drop documentation says

drop - For matrices and arrays. If TRUE the result is coerced to the lowest possible dimension (see the examples). This only works for extracting elements, not for the replacement. See drop for further details.

With a data.frame, drop effectively defaults to TRUE when the result has a single column. That is not the case with subset, data.table or tibble. Setting drop = FALSE keeps the data.frame:

a[-(1:2),, drop = FALSE] 
# x
#3 3
#4 4
#5 5
#6 6
#7 7
#8 8
#9 9
#10 10

This matters when the result has a single column (or, for a matrix, a single row).
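
A small check of both behaviours side by side, as a sketch that assumes a is the one-column data frame data.frame(x = 1:10) implied by the output above:

```r
a <- data.frame(x = 1:10)

v <- a[-(1:2), ]                 # single column: result is simplified to a vector
d <- a[-(1:2), , drop = FALSE]   # dimensions kept: still a data.frame

class(v)  # "integer"
class(d)  # "data.frame"
dim(d)    # 8 rows, 1 column
```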


With tibble, it wouldn't drop the dimensions

library(dplyr)
tibble(x = 1:10) %>%
slice(-(1:2))
# A tibble: 8 x 1
# x
# <int>
#1 3
#2 4
#3 5
#4 6
#5 7
#6 8
#7 9
#8 10

Or

tibble(x = 1:10)[-(1:2),]

Or with data.table

library(data.table)
data.table(x = 1:10)[-(1:2)]

How to replace a column in R? strange behavior with dates

R stores dates as numbers, so I think you're getting some wacky behavior because you're operating on the date output (i.e., putting the dates back into a matrix, which makes them appear as the numbers they really are). Instead, you should explicitly use a data.frame with data.frame(). Also, you may save some time if you use vectorized operations (I think the apply family still uses loops):

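period2date <- function(period) {
  period <- as.character(period)
  half <- substr(period, 1, 1)   # first character: half of the year (1 or 2)
  year <- substr(period, 2, 3)   # next two characters: two-digit year
  # Vectorized: the first half maps to 1 January, the second half to 1 July
  dates <- as.Date(ifelse(half == "1", paste0(year, "0101"), paste0(year, "0701")), format = "%y%m%d")
  return(dates)
}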

data <- data.frame(data, period2date(data$dates))

You can also make this cleaner by replacing the period column rather than appending a new date column.
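
A sketch of the in-place replacement, assuming period codes such as 107 (first half of 2007) and 208 (second half of 2008). The sample data frame and its value column are hypothetical, and the function is a condensed version of the one above.

```r
period2date <- function(period) {
  period <- as.character(period)
  half <- substr(period, 1, 1)
  year <- substr(period, 2, 3)
  as.Date(paste0(year, ifelse(half == "1", "0101", "0701")), format = "%y%m%d")
}

# Hypothetical input: 107 = first half of 2007, 208 = second half of 2008
data <- data.frame(dates = c(107, 208), value = c(1.5, 2.5))

# Overwrite the period column in place rather than appending a new one
data$dates <- period2date(data$dates)

class(data$dates)  # "Date"
```

Because the column lives in a data.frame and never passes through a matrix, it keeps its Date class.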

R: losing column names when adding rows to an empty data frame

The rbind help page specifies that:

For ‘cbind’ (‘rbind’), vectors of zero
length (including ‘NULL’) are ignored
unless the result would have zero rows
(columns), for S compatibility.
(Zero-extent matrices do not occur in
S3 and are not ignored in R.)

So a is in fact ignored in your rbind call. It is not totally ignored, it seems: because a is a data frame, rbind dispatches to rbind.data.frame:

rbind.data.frame(c(5,6))
# X5 X6
#1 5 6

One way to insert the row could be:

a[nrow(a)+1,] <- c(5,6)
a
# one two
#1 5 6

But there may be a better way to do it depending on your code.
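
One such alternative, sketched under the assumption that a is an empty data frame with numeric columns one and two (as the output above suggests), is to bind a one-row data frame that carries the same names:

```r
# Hypothetical empty data frame with the column names from the output above
a <- data.frame(one = numeric(0), two = numeric(0))

# Binding a one-row data frame that uses the same names keeps them
a <- rbind(a, data.frame(one = 5, two = 6))

names(a)  # "one" "two"
nrow(a)   # 1
```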

Drop data frame columns by name

There's also the subset command, useful if you know which columns you want:

df <- data.frame(a = 1:10, b = 2:11, c = 3:12)
df <- subset(df, select = c(a, c))

UPDATED after comment by @hadley: To drop columns a,c you could do:

df <- subset(df, select = -c(a, c))
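
subset evaluates select with the column names as unquoted symbols, which is convenient interactively. When the names to drop are held in a character vector, plain indexing is a possible alternative. A minimal sketch (drop_cols and df2 are illustrative names):

```r
df <- data.frame(a = 1:10, b = 2:11, c = 3:12)

drop_cols <- c("a", "c")   # names held in a variable
df2 <- df[, !(names(df) %in% drop_cols), drop = FALSE]

names(df2)  # "b"
```

drop = FALSE is there so that the result stays a data.frame even when only one column survives.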

Strange behavior when using apply with rank and order on a data.frame with ordered factors

As requested by the OP, here is a detailed explanation which may help other R users to evade the traps.

Trap 1

As joran has pointed out, apply coerces the data frame into a matrix thereby replacing the ordered factors by characters. So, the original data.frame

data1
x y z
1 6 9 10
2 1 3 1
3 3 8 8
4 3 10 3

becomes

as.matrix(data1)
x y z
[1,] "6" "9" "10"
[2,] "1" "3" "1"
[3,] "3" "8" "8"
[4,] "3" "10" "3"

Trap 2

Characters are sorted lexically. Thus, sorting the y column as character returns

sort(c("9", "3", "8", "10"))
[1] "10" "3" "8" "9"

instead of

sort(c(9, 3, 8, 10))
[1] 3 8 9 10

This explains why apply returns a different result for the rank operation here.

Solution

You can use lapply to compute the rank of each column of the data frame.

as.data.frame(lapply(data1, rank))
x y z
1 4.0 3 4
2 1.0 1 1
3 2.5 2 3
4 2.5 4 2

lapply returns a list and a data frame is a special kind of list.

Avoid sapply because sapply takes the output of lapply and "simplifies" it to something it thinks is appropriate. Here,

sapply(data1, rank)
x y z
[1,] 4.0 3 4
[2,] 1.0 1 1
[3,] 2.5 2 3
[4,] 2.5 4 2

returns a matrix (again!) which needs to be coerced to a data frame. (See chapter 8.3.20 of The R Inferno by Patrick Burns. The text is a good read, anyway.)
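
The coercion and its effect on rank can be reproduced on a small scale. This sketch uses a hypothetical single-column data frame that holds only the y values as an ordered factor; it is not the OP's original data1.

```r
# Hypothetical single-column data frame holding the y values as an ordered factor
data1 <- data.frame(y = factor(c(9, 3, 8, 10), levels = 1:10, ordered = TRUE))

is.character(as.matrix(data1))   # TRUE: apply() would see strings

via_apply  <- apply(data1, 2, rank)              # ranks the strings "9", "3", "8", "10"
via_lapply <- as.data.frame(lapply(data1, rank)) # ranks the factor by its level order

via_lapply$y  # 3 1 2 4
```

The lapply route never leaves the data frame, so the factor levels, and with them the intended ordering, are preserved.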

Alternative Solution

The OP has not given an indication of why ordered factors are needed. If factors, ordered or not, are not essential to the OP's underlying problem, then apply would have worked as expected on plain integer columns:

set.seed(4)
x2 <- sample(1:10, size = 4, replace = TRUE)
y2 <- sample(1:10, size = 4, replace = TRUE)
z2 <- sample(1:10, size = 4, replace = TRUE)
data2 <- data.frame(x2, y2, z2)
data2
x2 y2 z2
1 6 9 10
2 1 3 1
3 3 8 8
4 3 10 3
apply(data2, 2, rank)
x2 y2 z2
[1,] 4.0 3 4
[2,] 1.0 1 1
[3,] 2.5 2 3
[4,] 2.5 4 2

(Nevertheless, better to use lapply instead of apply with a data frame).

Trap 3

When I started to learn R, I was misled by the name of the function ordered(). It took me a while to understand that it creates a special kind of factor. Likewise, it took me some time to figure out the difference between sort() and order() and when to use which function appropriately.
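
A short illustration of the three functions may help here. The vectors are made up for the example.

```r
x <- c(9, 3, 8, 10)

sort(x)        # the values themselves, in increasing order: 3 8 9 10
order(x)       # the positions that would sort x: 2 3 1 4
x[order(x)]    # indexing with order() reproduces sort(x)

f <- ordered(c("low", "high", "mid"), levels = c("low", "mid", "high"))
is.factor(f)   # TRUE: ordered() builds a factor, it does not sort anything
f < "high"     # comparisons follow the level order: TRUE FALSE TRUE
```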

Strange behavior when merging data frames

If you specify columns in by, merge combines them into a single column. In your attempt you add a new column sequentially, which gives incorrect output. Instead, we need to match both columns against one common set of values (here, LETTERS).

dfa$inds <- match(dfa$letter1, LETTERS)
dfb$inds <- match(dfb$letter2, LETTERS)

merge(dfa, dfb, all = TRUE)

# inds letter1 letter2
#1 1 <NA> A
#2 2 <NA> B
#3 3 C C
#4 4 D D
#5 5 E <NA>
#6 6 F <NA>
#7 7 G G
#8 8 H H
#9 9 I I
#10 10 J <NA>
#11 11 K <NA>
#12 12 L <NA>
#13 13 M <NA>
#14 14 N <NA>
#15 15 <NA> O
#16 16 <NA> P
#17 17 <NA> Q
#18 18 <NA> R
#19 19 <NA> S
#20 20 <NA> T
#21 21 <NA> U
#22 22 <NA> V
#23 23 <NA> W
#24 24 <NA> X
#25 25 <NA> Y

As a general case, we can build the common values by combining all the values that both columns can take (all_vals) and then matching against them.

all_vals <- unique(c(dfa$letter1, dfb$letter2))
dfa$inds <- match(dfa$letter1, all_vals)
dfb$inds <- match(dfb$letter2, all_vals)
merge(dfa, dfb, all = TRUE, by = "inds")

For multiple such data frames it is better to put them together in a list, assuming the first column is the one we want to match across all of them:

list_df <- list(dfa, dfb, dfc)
all_vals <- Reduce(union, lapply(list_df, `[[`, 1))
list_df <- lapply(list_df, function(x) {x$inds <- match(x[[1]], all_vals) ; x})
Reduce(function(x, y) merge(x, y, all = TRUE, by = "inds"), list_df)

data

dfa <- data.frame(letter1 = LETTERS[3:14], stringsAsFactors = FALSE)
dfb <- data.frame(letter2 = LETTERS[c(1:4,7:9,15:25)], stringsAsFactors = FALSE)
dfc <- data.frame(letter3 = LETTERS[1:4], stringsAsFactors = FALSE)
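
Putting the list-based approach together with this sample data gives a self-contained check. It is a sketch: merged is an illustrative name, and the merge is applied to the list whose elements carry the inds column.

```r
dfa <- data.frame(letter1 = LETTERS[3:14], stringsAsFactors = FALSE)
dfb <- data.frame(letter2 = LETTERS[c(1:4, 7:9, 15:25)], stringsAsFactors = FALSE)
dfc <- data.frame(letter3 = LETTERS[1:4], stringsAsFactors = FALSE)

list_df <- list(dfa, dfb, dfc)
# Every distinct value found in the first column of any data frame
all_vals <- Reduce(union, lapply(list_df, `[[`, 1))
# Give each data frame a shared key: the position of its value in all_vals
list_df <- lapply(list_df, function(x) { x$inds <- match(x[[1]], all_vals); x })
merged <- Reduce(function(x, y) merge(x, y, by = "inds", all = TRUE), list_df)

dim(merged)                  # 25 rows, 4 columns
sum(complete.cases(merged))  # 2 (only C and D occur in all three)
```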

Weird filtering/matching behavior in a data.frame in R

We can use apply to avoid a loop as follows:

apply(df,2,function(x) which.max(abs(x)))

If we want to use a loop (not recommended in most cases for computational reasons):

res <- vector()
for(i in 1:ncol(df)){
res[i]<-which.max(abs(df[,i]))
}
res

A variant of the for loop (which() returns every tied position, so this assumes a unique maximum in each column):

for(i in 1:ncol(df)){
res[i]<-which(abs(df[,i])==max(abs(df[,i])))
}
res

With sapply:

sapply(df,function(x) which.max(abs(x)))

As suggested by @akrun, we can also use max.col (shown with the results below).

Results:

With apply (more informative, since the column names are kept):

PC1 PC2 PC3 PC4 PC5 PC6 
6 5 2 6 3 3

Explicit loop:

[1] 6 5 2 6 3 3

With max.col:

max.col(t(abs(df)), 'first')
[1] 6 5 2 6 3 3

With sapply:

PC1 PC2 PC3 PC4 PC5 PC6 
6 5 2 6 3 3

With purrr:

purrr::map_dbl(df,function(x) which.max(abs(x)))
PC1 PC2 PC3 PC4 PC5 PC6
6 5 2 6 3 3
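
For a quick check on a smaller input, the sketch below uses a hypothetical two-column data frame of loadings (not the original df) and vapply, a stricter relative of sapply that declares the expected return type:

```r
# Hypothetical loadings: rows are variables, columns are components
df <- data.frame(PC1 = c(0.1, -0.9, 0.3),
                 PC2 = c(-0.5, 0.2, 0.4))

# One integer position per column; vapply errors if a result is not integer(1)
idx <- vapply(df, function(x) which.max(abs(x)), integer(1))
idx                             # PC1 = 2, PC2 = 1
max.col(t(abs(df)), "first")    # same positions, without names
```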

Behavior of - NULL on lists versus data.frames for removing data

DISCLAIMER : This is a relatively long answer, not very clear, and not very interesting, so feel free to skip it or to only read the (sort of) conclusion.

I've tried a bit of tracing on [<-.data.frame, as suggested by Ari B. Friedman. Debugging starts on line 162 of the function, where there is a test to determine whether value (the replacement value argument) is not a list.

Case 1 : value is not a list

Then it is considered as a vector. Matrices and arrays are considered as one vector, like the help page says :

Note that when the replacement value is an array (including a matrix)
it is not treated as a series of columns (as ‘data.frame’ and
‘as.data.frame’ do) but inserted as a single column.

If only one column of the data frame is selected in the LHS, then the only constraint is that the number of rows to be replaced must be equal to or a multiple of length(value). If this is the case, value is recycled with rep if necessary and converted to a list. If length(value)==0, there is no recycling (as it is impossible), and value is just converted to a list.

If several columns of the data frame are selected in the LHS, then the constraint is a bit more complex: length(value) must be a divisor of the total number of elements to be replaced, i.e. the number of rows * the number of columns, which includes being equal to it.

The exact test is the following :

(m < n * p && (m == 0L || (n * p)%%m))

Where n is the number of rows, p the number of columns, and m the length of value. If the condition is FALSE, then value is converted into an n x p matrix (thus recycled if necessary) and the matrix is split by columns into a list.
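
To see how the condition behaves for a few lengths, it can be wrapped in a small function. length_check_fails is a hypothetical helper written for this illustration, not part of R, and the remainder is compared with 0 explicitly for readability.

```r
# Hypothetical helper mirroring the quoted condition; TRUE means the assignment is rejected
length_check_fails <- function(m, n, p) {
  m < n * p && (m == 0L || (n * p) %% m != 0)
}

# 32 rows and 2 columns give 64 cells to fill
length_check_fails(0, 32, 2)    # TRUE: nothing to recycle
length_check_fails(3, 32, 2)    # TRUE: 64 is not a multiple of 3
length_check_fails(4, 32, 2)    # FALSE: 4 values recycle evenly into 64 cells
length_check_fails(64, 32, 2)   # FALSE: exact fit
```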

If value is NULL, then the condition is TRUE as m==0, and the function is stopped.
Note that the problem occurs for every value of length 0. For example,

cars1[,c("mpg")] <- numeric(0)

works, whereas :

cars1[,c("mpg","disp")] <- numeric(0)

fails in the same way as cars1[,c("mpg","disp")] <- NULL

Case 2 : value is a list

If value is a list, then it is used to replace several columns at the same time. For example :

cars1[,c("mpg","disp")] <- list(1,2)

will replace cars1$mpg with a vector of 1s, and cars1$disp with a vector of 2s.

There is a sort of "double recycling" which happens here :

  • first, the length of the value list must be less than or equal to the number of columns to be replaced. If it is less, then a classic recycling is done.
  • second, for each element of the value list, its length must be equal to, greater than or a multiple of the number of rows to be replaced. If it is less, another recycling is done for each list element to match the number of rows. If it is more, a warning is displayed.

When the value in the RHS is list(NULL), nothing really happens, as recycling is impossible (rep(NULL, 10) is always NULL). But the code continues and in the end each column to be replaced is assigned NULL, i.e. is removed.
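
Both list cases can be tried on a small example. This sketch assumes cars1 is a subset of the built-in mtcars data (which has mpg and disp columns); cars2 is a second copy introduced only for the illustration.

```r
cars1 <- mtcars[1:5, ]
cars1[, c("mpg", "disp")] <- list(1, 2)     # each list element is recycled down its column
cars1$mpg                                   # 1 1 1 1 1

cars2 <- mtcars[1:5, ]
cars2[, c("mpg", "disp")] <- list(NULL)     # list(NULL) is recycled across both columns
names(cars2)                                # "mpg" and "disp" are gone
```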

Summary and (sort of) conclusion

data.frame and list behave differently because of the specific constraint on data frames, where each element must be of the same length. Removing several columns by assigning NULL fails not because of the NULL value by itself, but because NULL is of length 0. The error comes from a test which verifies if the length of the assigned value is a multiple of the number of elements to be replaced (number of rows * number of columns).

Handling the case of value=NULL for multiple columns doesn't seem difficult (about four lines of simple code), but it requires treating NULL as a special case. I'm not able to determine whether it is left unhandled because it would break the logic of the function's implementation, or because it would have side effects I'm not aware of.


