Strange behaviour dropping column from data.frame in R
While your exact question has already been answered in the comments, an alternative way to avoid this behaviour is to convert your data.frame to a tibble, which is a stripped-down version of a data.frame, without column name munging or partial matching of column names with $, among other things:
library(tibble)
df_t <- as_tibble(a)  # as_data_frame() is deprecated in current tibble releases
df_t
# A tibble: 3 × 1
abc
<dbl>
1 3
2 2
3 1
> df_t$a
NULL
Warning message:
Unknown column 'a'
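For comparison, the partial matching that the tibble avoids can be reproduced with base R alone. This is a minimal sketch: the data frame a below is a made-up stand-in for the one in the question, assumed to have a single column named abc.

```r
# Hypothetical stand-in for the question's data: one column named "abc"
a <- data.frame(abc = c(3, 2, 1))

# `$` on a data.frame partially matches column names, so `a$a` finds `abc`
partial <- a$a

# `[[` matches exactly by default, so the same lookup finds nothing
exact <- a[["a"]]

print(partial)  # the values of column abc
print(exact)    # NULL
```

If converting to a tibble is not an option, using [[ instead of $ gives the same exact-match behaviour on a plain data.frame.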
Unexpected behaviour: Removing rows from data frame converts to vector R
We can use drop = FALSE, because the default in ?Extract is
x[i, j, ... , drop = TRUE]
and the documentation for drop says
drop - For matrices and arrays. If TRUE the result is coerced to the lowest possible dimension (see the examples). This only works for extracting elements, not for the replacement. See drop for further details.
drop defaults to TRUE, including for data.frame. That is not the case with subset, data.table, or tibble.
a[-(1:2),, drop = FALSE]
# x
#3 3
#4 4
#5 5
#6 6
#7 7
#8 8
#9 9
#10 10
The dimensions are dropped when the result has a single column or row. With tibble, the dimensions are not dropped:
library(dplyr)
tibble(x = 1:10) %>%
slice(-(1:2))
# A tibble: 8 x 1
# x
# <int>
#1 3
#2 4
#3 5
#4 6
#5 7
#6 8
#7 9
#8 10
Or
tibble(x = 1:10)[-(1:2),]
Or with data.table
library(data.table)
data.table(x = 1:10)[-(1:2)]
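To see the effect of drop side by side in base R, here is a small sketch. The data frame a is an assumed one-column example that mirrors the output shown above, not the asker's actual data.

```r
# Hypothetical one-column data frame, mirroring the `a` used in the output above
a <- data.frame(x = 1:10)

dropped <- a[-(1:2), ]                # default drop: collapses to a vector
kept    <- a[-(1:2), , drop = FALSE]  # stays a data.frame

class(dropped)  # "integer"
class(kept)     # "data.frame"
```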
How to replace a column in R? strange behavior with dates
R stores dates as numbers, so I think you're getting some wacky behavior because you're operating on the date output (i.e., putting the dates back into a matrix, which makes them appear as the numbers they really are). Instead, you should explicitly build a data.frame with data.frame(). Also, you may save some time if you use vectorized operations (I think the apply family still uses loops):
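period2date <- function(period) {
  period <- as.character(period)
  half <- substr(period, 1, 1)  # first character: which half of the year
  year <- substr(period, 2, 3)  # next two characters: two-digit year
  # ifelse() is vectorized, so the whole column is converted without a loop
  dates <- as.Date(ifelse(half == "1", paste(year, "0101", sep = ""), paste(year, "0701", sep = "")), format = "%y%m%d")
  return(dates)
}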
data <- data.frame(data, period2date(data$dates))
You can also make this cleaner by replacing the period column with the dates rather than appending a new column.
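A minimal sketch of the in-place replacement follows. The sample values and the value column are invented for illustration; the only assumption about the period format is the one the function itself makes (one digit for the half-year, then a two-digit year).

```r
period2date <- function(period) {
  period <- as.character(period)
  half <- substr(period, 1, 1)
  year <- substr(period, 2, 3)
  as.Date(ifelse(half == "1", paste0(year, "0101"), paste0(year, "0701")),
          format = "%y%m%d")
}

# Made-up sample: a half-year digit followed by a two-digit year
data <- data.frame(dates = c(107, 207, 108), value = c(10, 20, 30))

# Overwrite the column in place; it stays a Date because no matrix is involved
data$dates <- period2date(data$dates)

class(data$dates)  # "Date"
```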
R: losing column names when adding rows to an empty data frame
The rbind help page specifies that:
For ‘cbind’ (‘rbind’), vectors of zero
length (including ‘NULL’) are ignored
unless the result would have zero rows
(columns), for S compatibility.
(Zero-extent matrices do not occur in
S3 and are not ignored in R.)
So, in fact, a is ignored in your rbind instruction. Not totally ignored, it seems, because as it is a data frame the rbind function is called as rbind.data.frame:
rbind.data.frame(c(5,6))
# X5 X6
#1 5 6
Maybe one way to insert the row could be :
a[nrow(a)+1,] <- c(5,6)
a
# one two
#1 5 6
But there may be a better way to do it depending on your code.
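One possible alternative, sketched here under the assumption that a is an empty data frame with columns one and two (as the output above suggests), is to bind a one-row data frame that carries the same column names instead of a bare vector:

```r
# Assumed shape of the empty data frame from the question
a <- data.frame(one = numeric(0), two = numeric(0))

# Binding a one-row data frame with matching names keeps the column names
a <- rbind(a, data.frame(one = 5, two = 6))

names(a)  # "one" "two"
nrow(a)   # 1
```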
Drop data frame columns by name
There's also the subset command, useful if you know which columns you want:
df <- data.frame(a = 1:10, b = 2:11, c = 3:12)
df <- subset(df, select = c(a, c))
UPDATED after comment by @hadley: To drop columns a,c you could do:
df <- subset(df, select = -c(a, c))
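When the names to drop are held in a character vector rather than typed literally, subset's unquoted syntax is less convenient. A base R sketch of that case, where the vector drops is a hypothetical name introduced only for this example:

```r
df <- data.frame(a = 1:10, b = 2:11, c = 3:12)

# Hypothetical vector of column names to remove
drops <- c("a", "c")

# Keep every column whose name is not in `drops`;
# drop = FALSE keeps a data.frame even if only one column is left
df <- df[, !(names(df) %in% drops), drop = FALSE]

names(df)  # "b"
```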
Strange behavior when using apply with rank and order on a data.frame with ordered factors
As requested by the OP, here is a detailed explanation which may help other R users to evade the traps.
Trap 1
As joran has pointed out, apply coerces the data frame into a matrix, thereby replacing the ordered factors with character strings. So, the original data.frame
data1
x y z
1 6 9 10
2 1 3 1
3 3 8 8
4 3 10 3
becomes
as.matrix(data1)
x y z
[1,] "6" "9" "10"
[2,] "1" "3" "1"
[3,] "3" "8" "8"
[4,] "3" "10" "3"
Trap 2
Characters are sorted lexically. Thus, sorting the y column as character returns
sort(c("9", "3", "8", "10"))
[1] "10" "3" "8" "9"
instead of
sort(c(9, 3, 8, 10))
[1] 3 8 9 10
This explains why apply returns a different result for the rank operation here.
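Both traps can be reproduced together in a few lines. This sketch rebuilds data1 from the printed values and assumes every column was created as an ordered factor, which is a guess about the question's setup.

```r
# Rebuild data1 from the printed values, with every column an ordered factor
# (an assumption about how the question's data was created)
data1 <- data.frame(
  x = factor(c(6, 1, 3, 3), ordered = TRUE),
  y = factor(c(9, 3, 8, 10), ordered = TRUE),
  z = factor(c(10, 1, 8, 3), ordered = TRUE)
)

# apply() goes through as.matrix(), so rank() sees character strings
via_apply <- apply(data1, 2, rank)[, "y"]

# Ranking the column directly respects the factor's level order
direct <- rank(data1$y)

via_apply  # 4 2 3 1
direct     # 3 1 2 4
```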
Solution
You can use lapply to compute the rank of each column of the data frame.
as.data.frame(lapply(data1, rank))
x y z
1 4.0 3 4
2 1.0 1 1
3 2.5 2 3
4 2.5 4 2
lapply returns a list, and a data frame is a special kind of list.
Avoid sapply, because sapply takes the output of lapply and "simplifies" it to whatever it thinks is appropriate. Here,
sapply(data1, rank)
x y z
[1,] 4.0 3 4
[2,] 1.0 1 1
[3,] 2.5 2 3
[4,] 2.5 4 2
returns a matrix (again!) which needs to be coerced to a data frame. (See chapter 8.3.20 of The R Inferno by Patrick Burns. The text is a good read, anyway.)
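The difference in return types can be checked directly. The sketch below uses plain numeric columns with the same printed values (an assumption made to keep it short) and also shows the common data[] <- lapply(...) idiom, which is not part of the original answer but keeps the data frame container without an explicit as.data.frame call.

```r
# Plain numeric stand-in for the data; the values mirror the printed table
data1 <- data.frame(x = c(6, 1, 3, 3), y = c(9, 3, 8, 10), z = c(10, 1, 8, 3))

ranked_list   <- as.data.frame(lapply(data1, rank))  # data.frame
ranked_matrix <- sapply(data1, rank)                 # simplified to a matrix

# Assigning into data1[] keeps the data.frame container and replaces the columns
ranked_inplace <- data1
ranked_inplace[] <- lapply(data1, rank)

class(ranked_list)     # "data.frame"
is.matrix(ranked_matrix)  # TRUE
```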
Alternative Solution
The OP has not indicated why he needs to work with ordered factors. If factors, ordered or not, are not essential to the OP's underlying problem, then apply would have worked as expected.
set.seed(4)
x2 <- sample(1:10, size = 4, replace = TRUE)
y2 <- sample(1:10, size = 4, replace = TRUE)
z2 <- sample(1:10, size = 4, replace = TRUE)
data2 <- data.frame(x2, y2, z2)
data2
x2 y2 z2
1 6 9 10
2 1 3 1
3 3 8 8
4 3 10 3
apply(data2, 2, rank)
x2 y2 z2
[1,] 4.0 3 4
[2,] 1.0 1 1
[3,] 2.5 2 3
[4,] 2.5 4 2
(Nevertheless, it is better to use lapply instead of apply with a data frame.)
Trap 3
When I started to learn R, I was misled by the name of the function ordered(). It took me a while to understand that it creates a special kind of factor. Likewise, it took me some time to figure out the difference between sort() and order() and when to use which function appropriately.
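For readers facing the same confusion, a short sketch of the three functions on made-up values may help. Nothing here comes from the question's data.

```r
v <- c(9, 3, 8, 10)

sorted    <- sort(v)      # the values in increasing order: 3 8 9 10
positions <- order(v)     # the positions that would sort v: 2 3 1 4
same      <- v[order(v)]  # indexing with order() reproduces sort(v)

# ordered() does not sort anything: it builds a factor whose levels are ranked
sizes <- ordered(c("small", "large", "small"), levels = c("small", "large"))
below_large <- sizes < "large"  # comparisons follow the level order
```

order() is the one to reach for when the rows of a whole data frame should be rearranged by one column, since it returns row positions rather than values.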
Strange behavior when merging data frames
If you specify columns in by, merge combines them into one column. In your attempt you add a new column sequentially, which gives incorrect output. Instead, we need to match them against one common value (here using LETTERS).
dfa$inds <- match(dfa$letter1, LETTERS)
dfb$inds <- match(dfb$letter2, LETTERS)
merge(dfa, dfb, all = TRUE)
# inds letter1 letter2
#1 1 <NA> A
#2 2 <NA> B
#3 3 C C
#4 4 D D
#5 5 E <NA>
#6 6 F <NA>
#7 7 G G
#8 8 H H
#9 9 I I
#10 10 J <NA>
#11 11 K <NA>
#12 12 L <NA>
#13 13 M <NA>
#14 14 N <NA>
#15 15 <NA> O
#16 16 <NA> P
#17 17 <NA> Q
#18 18 <NA> R
#19 19 <NA> S
#20 20 <NA> T
#21 21 <NA> U
#22 22 <NA> V
#23 23 <NA> W
#24 24 <NA> X
#25 25 <NA> Y
As a general case, we can get the common value by combining all the values that both columns can take (all_vals) and then match with these values.
all_vals <- unique(c(dfa$letter1, dfb$letter2))
dfa$inds <- match(dfa$letter1, all_vals)
dfb$inds <- match(dfb$letter2, all_vals)
merge(dfa, dfb, all = TRUE, by = "inds")
For multiple such data frames it is better to put them together in a list, assuming the first column is the one we want to match across all the data frames:
list_df <- list(dfa, dfb, dfc)
all_vals <- Reduce(union, lapply(list_df, `[[`, 1))
list_df <- lapply(list_df, function(x) {x$inds <- match(x[[1]], all_vals) ; x})
Reduce(function(x, y) merge(x, y, all = TRUE, by = "inds"), list_df)
data
dfa <- data.frame(letter1 = LETTERS[3:14], stringsAsFactors = FALSE)
dfb <- data.frame(letter2 = LETTERS[c(1:4,7:9,15:25)], stringsAsFactors = FALSE)
dfc <- data.frame(letter3 = LETTERS[1:4], stringsAsFactors = FALSE)
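Putting the list-based approach and the data together gives the following self-contained sketch. It passes list_df (the list that carries the inds key) to Reduce and merges explicitly on inds; the name merged is introduced here only for illustration.

```r
dfa <- data.frame(letter1 = LETTERS[3:14], stringsAsFactors = FALSE)
dfb <- data.frame(letter2 = LETTERS[c(1:4, 7:9, 15:25)], stringsAsFactors = FALSE)
dfc <- data.frame(letter3 = LETTERS[1:4], stringsAsFactors = FALSE)

list_df <- list(dfa, dfb, dfc)

# Every value that appears in the first column of any data frame
all_vals <- Reduce(union, lapply(list_df, `[[`, 1))

# Add the shared key to each data frame
list_df <- lapply(list_df, function(x) {
  x$inds <- match(x[[1]], all_vals)
  x
})

# Full outer join of all data frames on the shared key
merged <- Reduce(function(x, y) merge(x, y, by = "inds", all = TRUE), list_df)

dim(merged)  # 25 rows, 4 columns
```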
Weird filtering/matching behavior in a data.frame in R
We can use apply to avoid a loop as follows:
apply(df,2,function(x) which.max(abs(x)))
If we want to use a loop (not recommended in most cases for computational reasons):
res<-vector()
for(i in 1:ncol(df)){
res[i]<-which.max(abs(df[,i]))
}
res
A variant of the for loop:
for(i in 1:ncol(df)){
res[i]<-which(abs(df[,i])==max(abs(df[,i])))
}
res
With sapply:
sapply(df,function(x) which.max(abs(x)))
As suggested by @akrun, we can also use max.col.
Results:
apply (more informative):
PC1 PC2 PC3 PC4 PC5 PC6
6 5 2 6 3 3
Explicit loop:
[1] 6 5 2 6 3 3
With max.col:
max.col(t(abs(df)), 'first')
[1] 6 5 2 6 3 3
With sapply:
PC1 PC2 PC3 PC4 PC5 PC6
6 5 2 6 3 3
With purrr:
purrr::map_dbl(df,function(x) which.max(abs(x)))
PC1 PC2 PC3 PC4 PC5 PC6
6 5 2 6 3 3
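Since the question's data is not reproduced here, the following sketch runs the same ideas on a tiny made-up data frame. It also includes vapply, which is not in the original answer; it is offered as a variant that declares the expected return type up front.

```r
# Made-up loadings; the column names only mimic the PC1, PC2, ... of the question
df <- data.frame(PC1 = c(0.1, -0.9, 0.3), PC2 = c(-0.5, 0.2, 0.4))

# Row index of the largest absolute value in each column
by_sapply <- sapply(df, function(x) which.max(abs(x)))
by_vapply <- vapply(df, function(x) which.max(abs(x)), integer(1))

# max.col() works row-wise, hence the transpose
by_maxcol <- max.col(t(abs(df)), "first")

by_sapply  # PC1 = 2, PC2 = 1
```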
Behavior of - NULL on lists versus data.frames for removing data
DISCLAIMER : This is a relatively long answer, not very clear, and not very interesting, so feel free to skip it or to only read the (sort of) conclusion.
I've tried a bit of tracing on [<-.data.frame, as suggested by Ari B. Friedman. Debugging starts on line 162 of the function, where there is a test to determine if value (the replacement value argument) is not a list.
Case 1 : value is not a list
Then it is considered as a vector. Matrices and arrays are considered as one vector, like the help page says :
Note that when the replacement value is an array (including a matrix)
it is not treated as a series of columns (as ‘data.frame’ and
‘as.data.frame’ do) but inserted as a single column.
If only one column of the data frame is selected in the LHS, then the only constraint is that the number of rows to be replaced must be equal to or a multiple of length(value). If this is the case, value is recycled with rep if necessary and converted to a list. If length(value)==0, there is no recycling (as it is impossible), and value is just converted to a list.
If several columns of the data frame are selected in the LHS, then the constraint is a bit more complex: the total number of elements to be replaced, i.e. the number of rows * the number of columns, must be equal to or a multiple of length(value).
The exact test is the following :
(m < n * p && (m == 0L || (n * p)%%m))
Where n is the number of rows, p the number of columns, and m the length of value. If the condition is FALSE, then value is converted into an n x p matrix (thus recycled if necessary) and the matrix is split by columns into a list.
If value is NULL, then the condition is TRUE as m==0, and the function is stopped.
Note that the problem occurs for every value of length 0. For example,
cars1[,c("mpg")] <- numeric(0)
works, whereas :
cars1[,c("mpg","disp")] <- numeric(0)
fails in the same way as cars1[,c("mpg","disp")] <- NULL
Case 2 : value is a list
If value is a list, then it is used to replace several columns at the same time. For example :
cars1[,c("mpg","disp")] <- list(1,2)
will replace cars1$mpg with a vector of 1s, and cars1$disp with a vector of 2s.
There is a sort of "double recycling" which happens here :
- first, the length of the value list must be less than or equal to the number of columns to be replaced. If it is less, then a classic recycling is done.
- second, for each element of the value list, its length must be equal to, greater than or a multiple of the number of rows to be replaced. If it is less, another recycling is done for each list element to match the number of rows. If it is more, a warning is displayed.
When the value in RHS is list(NULL), nothing really happens, as recycling is impossible (rep(NULL, 10) is always NULL). But the code continues and in the end each column to be replaced is assigned NULL, i.e. is removed.
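The two list cases described above can be tried on a small example. In this sketch cars1 is assumed to be a slice of the built-in mtcars data (the question's own definition is not shown here), and the names filled and removed are introduced only for illustration.

```r
# Assumed definition of cars1: a small slice of the built-in mtcars data
cars1 <- mtcars[1:5, c("mpg", "disp", "hp")]

# Each list element is recycled down the rows of its column
filled <- cars1
filled[, c("mpg", "disp")] <- list(1, 2)

# list(NULL) is recycled across the selected columns, and each one is removed
removed <- cars1
removed[, c("mpg", "disp")] <- list(NULL)

names(removed)  # "hp"
```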
Summary and (sort of) conclusion
data.frame and list behave differently because of the specific constraint on data frames, where each element must be of the same length. Removing several columns by assigning NULL fails not because of the NULL value by itself, but because NULL is of length 0. The error comes from a test which verifies if the length of the assigned value is a multiple of the number of elements to be replaced (number of rows * number of columns).
Handling the case of value=NULL for multiple columns doesn't seem difficult (by adding about four lines of simple code), but it requires treating NULL as a special case. I'm not able to determine if it is not handled because it would break the logic of the function implementation, or because it would have side effects I don't know about.