R: Why does mean(NA, na.rm = TRUE) return NaN
It is a pity that ?mean does not say anything about this. My comment only told you that applying mean to an empty "numeric" vector results in NaN, without further reasoning. Rui Barradas's comment tried to explain this but was not accurate, as division by 0 is not always NaN; it can also be Inf or -Inf. I once discussed this in "R: element-wise matrix division". However, we are getting close. Although mean(x) is not coded as sum(x) / length(x), this mathematical fact does explain the NaN.
From ?sum:
*NB:* the sum of an empty set is zero, by definition.
So sum(numeric(0)) is 0. As length(numeric(0)) is 0, mean(numeric(0)) is 0 / 0, which is NaN.
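The reasoning above can be checked directly at the console. This is a minimal sketch using base R only; the variable name x is just for illustration.

```r
# The sum of an empty numeric vector is 0, and so is its length
x <- numeric(0)
sum(x)                    # 0
length(x)                 # 0

# 0 / 0 is NaN, whereas a non-zero numerator gives Inf or -Inf
0 / 0                     # NaN
1 / 0                     # Inf
-1 / 0                    # -Inf

# Dropping the only NA leaves an empty vector, hence NaN
mean(numeric(0))          # NaN
mean(NA, na.rm = TRUE)    # NaN
```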
mean(,na.rm = TRUE) returns NA and Warning Message
We could change the ifelse to if/else:
for (i in seq_along(df_sw)) {
  # if the column is numeric
  if (is.numeric(df_sw[, i])) {
    # print the mean
    cat(colnames(df_sw)[i], ": numeric the mean is: ",
        mean(df_sw[, i], trim = 0, na.rm = TRUE), "\n")
  } else {
    # print that it is a character column
    cat(colnames(df_sw)[i], ": character: \n")
  }
}
#name : character:
#height : numeric the mean is: 174.358
#mass : numeric the mean is: 97.31186
#hair_color : character:
#skin_color : character:
#eye_color : character:
#birth_year : numeric the mean is: 87.56512
#sex : character:
#gender : character:
#homeworld : character:
#species : character:
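As a side note on where the NA and the warning in the title can come from: mean() returns NA with a warning when it is handed a non-numeric vector, and na.rm = TRUE does not change that. The sketch below illustrates this and the same is.numeric() guard used in the loop; safe_mean is a hypothetical helper name, not something from the original answer.

```r
x_chr <- c("a", "b")

# mean() on a character vector returns NA (with a warning),
# regardless of na.rm
res <- suppressWarnings(mean(x_chr, na.rm = TRUE))
is.na(res)                # TRUE

# Hypothetical helper: only call mean() on numeric input
safe_mean <- function(x) {
  if (is.numeric(x)) mean(x, na.rm = TRUE) else NA_real_
}

safe_mean(c(1, NA, 3))    # 2
safe_mean(x_chr)          # NA, without a warning
```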
Is there a difference between na.rm = FALSE and na.rm = na.rm?
Many base functions (base as in base R or base to any particular package) accept the argument na.rm=, where the default is often FALSE. (Some functions use useNA= or na.action, depending on different actions, but we'll ignore those.)
Higher-level functions (user-defined and/or other packages) might also define this argument and then pass it on to the other functions. For example:
parent_func <- function(x, ..., na.rm = FALSE) {
  # something important
  mu <- mean(x, na.rm = na.rm)
  sigma <- sd(x, na.rm = na.rm)
  (mu - x) / sigma
}
The premise is that if you intend to remove or ignore NA values in one part of the function, you probably want the same in other places (or all of them).
In this case, in the call to mean(x, na.rm = na.rm), the left na.rm refers to the argument named na.rm in the definition of mean. The right na.rm refers to the same-named argument of parent_func.
An alternative way to define this parent function (for the sake of differentiating variables) could be:
parent_func <- function(x, ..., NARM = FALSE) {
  # something important
  mu <- mean(x, na.rm = NARM)
  sigma <- sd(x, na.rm = NARM)
  (mu - x) / sigma
}
The advantage of using na.rm= instead of this NARM= is likely consistency (though that is not always one of R's strengths across all functions). Many users are likely more intuitively familiar with the na.rm= argument name, purpose, and effect than something else.
Edit:
I'm seeing that it is better practice to do function(x, na.rm = FALSE) {} in general to allow the user to change it and to be consistent with default settings for sum and mean. Is this correct?
I believe so. In general I find that removal of missing data should be an explicit act by the user, not a default by the function. That is, if having missing data indicates a larger problem, then defaulting to na.rm=FALSE will quickly indicate to the user that something is wrong; na.rm=TRUE will mask this problem and suggest valid results when perhaps there should be no NAs at all. This holds true for the "smaller" functions (e.g., mean, sum), and so its logic should be carried outwards to the encapsulating functions.
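To illustrate the point about defaults, here is a small sketch of a wrapper that passes its own na.rm through to mean() and sd(). The function name standardize and the sample vector are assumptions made up for this example.

```r
# Hypothetical wrapper: na.rm defaults to FALSE and is passed through
standardize <- function(x, na.rm = FALSE) {
  mu <- mean(x, na.rm = na.rm)      # left na.rm: argument of mean()
  sigma <- sd(x, na.rm = na.rm)     # right na.rm: argument of standardize()
  (x - mu) / sigma
}

x <- c(1, 2, 3, NA)

# Default: the NA propagates, signalling that something needs attention
standardize(x)                  # NA NA NA NA

# Explicit opt-in: NA is ignored when computing mean and sd
standardize(x, na.rm = TRUE)    # -1  0  1 NA
```

With the default, the all-NA result makes the missing value hard to overlook; the user has to ask for removal explicitly.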
na.rm = TRUE failing to remove NA on unite()
The issue is probably that the columns are factors. Try using:
library(dplyr)
library(tidyr)
Need_info %>%
  mutate_if(is.factor, as.character) %>%
  unite("lala", c(5, 6, 13, 14, 15, 16), na.rm = TRUE, remove = TRUE)
Using a reproducible example:
# stringsAsFactors = TRUE is needed in R >= 4.0 to get factor columns
df <- data.frame(a = c(letters[1:5], NA), b = c(NA, letters[11:15]),
                 stringsAsFactors = TRUE)
df %>% unite("lala", c(1, 2), na.rm = TRUE, remove = TRUE)
# lala
#1 1_NA
#2 2_1
#3 3_2
#4 4_3
#5 5_4
#6 NA_5
After converting to character:
df %>%
  mutate_all(as.character) %>%
  unite("lala", c(1, 2), na.rm = TRUE, remove = TRUE)
# lala
#1 a
#2 b_k
#3 c_l
#4 d_m
#5 e_n
#6 o
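The numbers in the first output are presumably the factors' underlying integer codes. A short base R sketch (independent of tidyr; the vector f is made up for illustration) shows the difference between the codes and the labels:

```r
f <- factor(c("a", "b", NA))

# A factor stores integer codes pointing into its levels
as.integer(f)      # 1 2 NA
levels(f)          # "a" "b"

# Converting to character recovers the labels and keeps NA as a real NA
as.character(f)    # "a" "b" NA
```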
Why does na.rm=TRUE not work for weighted SD in R?
The problem appears to be that weighted.sd() will not operate as you are expecting across rows of a data frame.
Running weighted.sd we can see the code:
weighted.sd <- function (x, wt, na.rm = TRUE)
{
  if (na.rm) {
    x <- na.omit(x)
    wt <- na.omit(wt)
  }
  wt <- wt/sum(wt)
  wm <- weighted.mean(x, wt)
  sqrt(sum(wt * (x - wm)^2))
}
In your example, you are not feeding in a vector for x, but rather a single row of a data frame. Because of the NA values, na.omit(x) will remove that entire row rather than individual elements of a vector.
You can try to convert the row to a vector with as.numeric(), but that will fail for this function as well because of how NA is removed from wt.
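Both failure modes can be seen on a tiny made-up example. This is only a sketch; row_df and wt are hypothetical stand-ins for one data frame row and its weights.

```r
# A one-row data frame containing an NA
row_df <- data.frame(a = 1, b = NA_real_, c = 3)

# na.omit() on a data frame drops whole rows, so nothing is left
nrow(na.omit(row_df))      # 0

# Converting the row to a plain vector first keeps the other values
x <- as.numeric(row_df)    # 1 NA 3
wt <- c(11, 11, 11)

# na.omit() now drops one element of x, but wt has no NA to drop,
# so x and wt end up with different lengths
length(na.omit(x))         # 2
length(na.omit(wt))        # 3
```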
Something like this is probably what you want. Of course, you have to be careful that you are feeding in valid columns for x.
weighted.sd2 <- function (x, wt, na.rm = TRUE) {
  x <- as.numeric(x)
  if (na.rm) {
    # drop the same positions from both x and wt
    is_na <- is.na(x)
    x <- x[!is_na]
    wt <- wt[!is_na]
  }
  wt <- wt/sum(wt)
  wm <- weighted.mean(x, wt)
  sqrt(sum(wt * (x - wm)^2))
}
weighted.sd2(mtcars[18, 1:11], rep(11, 11), na.rm = TRUE)  # worked before
# [1] 26.76086
weighted.sd2(mtcars[5, 1:11], rep(11, 11), na.rm = TRUE)   # row with NA values, previously the issue
# [1] 116.545
To apply this to all rows, you can use apply() with MARGIN = 1:
mtcars$weighted.sd <- apply(mtcars[, 1:11], 1, weighted.sd2, wt = rep(11, 11))
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb weighted.sd
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46 NA  1    4    4    52.61200
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02 NA  1    4    4    52.58011
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1    37.06108
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1 NA    3    1    78.36300
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02 NA NA    3    2   116.54503
...
Problem using na.rm=TRUE in summarize in R code
If we want to find the mode, use Mode:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
and now it should work:
Test %>%
  group_by(Week = tools::toTitleCase(Week)) %>%
  summarize(Mode = Mode(time), .groups = 'drop')
# A tibble: 2 × 2
  Week       Mode
  <chr>     <dbl>
1 Thursday      0
2 Wednesday     5
If we want to insert the na.rm, it should be an argument to the function, and the max should also have that argument:
Test1 <- function(t, rm_na) {
  s <- table(as.vector(t))
  names(s)[s %in% max(s, na.rm = rm_na)]
}
and use the function as:
Test %>%
  group_by(Week = tools::toTitleCase(Week)) %>%
  summarize(Mode = Test1(time, TRUE), .groups = 'drop')
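An alternative sketch, not taken from the answer above, is to give the earlier Mode-style function its own na.rm argument and drop missing values before counting. The name mode_value and the sample vector are assumptions for illustration.

```r
# Hypothetical variant of Mode() with an explicit na.rm argument
mode_value <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]            # drop NAs only when asked to
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]   # most frequent value, first on ties
}

x <- c(5, 5, NA, NA, NA, 0)

mode_value(x)                  # NA, because NA is the most frequent value
mode_value(x, na.rm = TRUE)    # 5
```

Unlike a table()-based version, this returns a single value of the same type as x, which keeps it convenient inside summarize().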