What Does na.rm = TRUE Actually Mean

R: Why does mean(NA, na.rm = TRUE) return NaN

It is a pity that ?mean says nothing about this. My comment only told you that applying mean to an empty "numeric" vector gives NaN, without further reasoning. Rui Barradas's comment tried to explain it but was not accurate: division by 0 is not always NaN; it can also be Inf or -Inf. I once discussed this in R: element-wise matrix division. However, we are getting close. Although mean(x) is not implemented as sum(x) / length(x), that mathematical identity does explain the NaN.

From ?sum:

 *NB:* the sum of an empty set is zero, by definition.

So sum(numeric(0)) is 0. As length(numeric(0)) is 0, mean(numeric(0)) is 0 / 0 which is NaN.
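The same chain of reasoning can be checked at the console. This is a small sketch using only base R; the variable names are illustrative and not part of the original answer.

```r
x <- NA
kept <- x[!is.na(x)]        # what na.rm = TRUE leaves behind: an empty vector

length(kept)                # 0
sum(kept)                   # 0, the sum of an empty set
sum(kept) / length(kept)    # 0 / 0 gives NaN
mean(x, na.rm = TRUE)       # NaN as well

# Division by zero is not always NaN:
1 / 0                       # Inf
-1 / 0                      # -Inf
```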

mean(,na.rm = TRUE) returns NA and Warning Message

We could change the ifelse to if/else

for (i in seq_along(df_sw)) {
  # if the column is numeric, print its mean
  if (is.numeric(df_sw[[i]])) {
    cat(colnames(df_sw)[i], ": numeric the mean is: ",
        mean(df_sw[[i]], trim = 0, na.rm = TRUE), "\n")
  } else {
    # otherwise report it as a character column
    cat(colnames(df_sw)[i], ": character: \n")
  }
}

#name : character:
#height : numeric the mean is: 174.358
#mass : numeric the mean is: 97.31186
#hair_color : character:
#skin_color : character:
#eye_color : character:
#birth_year : numeric the mean is: 87.56512
#sex : character:
#gender : character:
#homeworld : character:
#species : character:

Is there a difference between na.rm = FALSE and na.rm = na.rm?

Many base functions (base as in base R or base to any particular package) accept the argument na.rm=, where the default is often FALSE. (Some functions use useNA= or na.action= instead, depending on what they do with missing values, but we'll ignore those.)

Higher-level functions (user-defined and/or other packages) might also define this argument and then pass it on to the other functions. For example:

parent_func <- function(x, ..., na.rm = FALSE) {
  # something important
  mu <- mean(x, na.rm = na.rm)
  sigma <- sd(x, na.rm = na.rm)
  (mu - x) / sigma
}

The premise is that if you intend to remove/ignore NA values in one part of the function, you probably want to do the same in other places (or all of them).

In this case, in the call to mean(x, na.rm = na.rm), the left na.rm is referring to the argument named na.rm in the definition of mean. The right na.rm is referring to the same-named argument of parent_func.
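To see the forwarding in action, here is a short sketch that repeats the parent_func definition so it runs on its own. The input vector is made up for illustration.

```r
parent_func <- function(x, ..., na.rm = FALSE) {
  mu <- mean(x, na.rm = na.rm)     # left: mean()'s argument; right: parent_func's
  sigma <- sd(x, na.rm = na.rm)
  (mu - x) / sigma
}

x <- c(1, 2, 3, NA)                # illustrative data

default_res <- parent_func(x)                # na.rm = FALSE is forwarded
removed_res <- parent_func(x, na.rm = TRUE)  # na.rm = TRUE is forwarded

default_res   # all NA, because mean() and sd() both return NA
removed_res   # 1 0 -1 NA, since mean is 2 and sd is 1 once NA is ignored
```

Note that the NA element itself stays NA in the result either way; na.rm only affects how mu and sigma are computed.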

An alternative way to define this parent function (for the sake of differentiating variables) could be:

parent_func <- function(x, ..., NARM = FALSE) {
  # something important
  mu <- mean(x, na.rm = NARM)
  sigma <- sd(x, na.rm = NARM)
  (mu - x) / sigma
}

The advantage of using na.rm= instead of this NARM= is likely consistency (though that is not always one of R's strengths across all functions). Many users are likely more intuitively familiar with the na.rm= argument name, purpose, and effect than something else.

Edit:

I'm seeing that it is better practice to do function(x, na.rm = FALSE) {} in general to allow the user to change it and to be consistent with default settings for sum and mean. Is this correct?

I believe so. In general I find that removal of missing data should be an explicit act by the user, not a default by the function. That is, if having missing data indicates a larger problem, then defaulting to na.rm=FALSE will quickly indicate to the user that something is wrong; na.rm=TRUE will mask this problem and suggest valid results when perhaps there should be no NAs at all. This holds true for the "smaller" functions (e.g., mean, sum) and so its logic should be carried outwards to the encapsulating functions.
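A tiny sketch of that masking effect, using made-up readings; it assumes nothing beyond base R.

```r
readings <- c(10, 12, NA, 11)              # illustrative data with one missing value

loud  <- mean(readings)                    # NA: the missing value is surfaced
quiet <- mean(readings, na.rm = TRUE)      # 11: the missing value is silently dropped
n_missing <- sum(is.na(readings))          # 1: how many values were ignored

loud
quiet
n_missing
```

With the default, the NA result prompts the user to look at the data; with na.rm = TRUE the result looks valid even though a value was dropped.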

na.rm = TRUE failing to remove NA on unite()

The issue is probably that the columns are factors. Try:

library(dplyr)
library(tidyr)

Need_info %>%
  mutate_if(is.factor, as.character) %>%
  unite("lala", c(5, 6, 13, 14, 15, 16), na.rm = TRUE, remove = TRUE)

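Using a reproducible example:

# stringsAsFactors = TRUE makes the columns factors; since R 4.0.0 the default is FALSE
df <- data.frame(a = c(letters[1:5], NA), b = c(NA, letters[11:15]),
                 stringsAsFactors = TRUE)
df %>% unite("lala", c(1, 2), na.rm = TRUE, remove = TRUE)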

# lala
#1 1_NA
#2 2_1
#3 3_2
#4 4_3
#5 5_4
#6 NA_5

After converting to character:

df %>%
  mutate_all(as.character) %>%
  unite("lala", c(1, 2), na.rm = TRUE, remove = TRUE)

# lala
#1 a
#2 b_k
#3 c_l
#4 d_m
#5 e_n
#6 o
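The digits in the first output match the integer codes that a factor stores internally, which is presumably what ended up being pasted together. A base R sketch of the difference between codes and labels (the vector is made up for illustration):

```r
f <- factor(c("a", "b", NA))   # illustrative factor with a missing value

codes  <- as.integer(f)        # 1 2 NA: the underlying integer codes
labels <- as.character(f)      # "a" "b" NA: the labels you actually want

codes
labels
```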

Why does na.rm=TRUE not work for weighted SD in R?

The problem appears to be that weighted.sd() will not operate as you are expecting across rows of a data frame.

Printing weighted.sd shows its source:

weighted.sd <- function (x, wt, na.rm = TRUE)
{
    if (na.rm) {
        x <- na.omit(x)
        wt <- na.omit(wt)
    }
    wt <- wt/sum(wt)
    wm <- weighted.mean(x, wt)
    sqrt(sum(wt * (x - wm)^2))
}

In your example, you are not passing a vector for x, but a single row of a data frame. For a data frame, na.omit(x) drops every row that contains an NA, so it removes that entire row rather than individual elements.

You can try to convert the row to a vector with as.numeric(), but that will fail for this function as well due to how NA is removed from wt.
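Both failure modes can be reproduced with a few lines of base R. The data below is made up for illustration.

```r
one_row <- data.frame(a = 1, b = NA, c = 3)   # made-up one-row data frame
rows_left <- nrow(na.omit(one_row))           # 0: the whole row is dropped

x  <- c(1, NA, 3)
wt <- c(1, 1, 1)
x_left  <- length(na.omit(x))                 # 2: only the NA element is dropped
wt_left <- length(na.omit(wt))                # 3: wt has no NA, so nothing is dropped

rows_left
x_left
wt_left
```

After na.omit() the values and the weights have different lengths, so the weights no longer line up with the values they belong to.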

Something like this is probably what you want. Of course, you have to be careful to pass valid (numeric) columns for x.

weighted.sd2 <- function(x, wt, na.rm = TRUE) {
  # flatten a one-row data frame (or a vector) to a plain numeric vector
  x <- as.numeric(x)

  if (na.rm) {
    # drop NA values from x and the matching weights together
    is_na <- is.na(x)
    x <- x[!is_na]
    wt <- wt[!is_na]
  }

  wt <- wt / sum(wt)
  wm <- weighted.mean(x, wt)
  sqrt(sum(wt * (x - wm)^2))
}

weighted.sd2(mtcars[18, 1:11], rep(11, 11), na.rm = TRUE)  # works
# [1] 26.76086
weighted.sd2(mtcars[5, 1:11], rep(11, 11), na.rm = TRUE)   # the row that had the issue
# [1] 116.545

To apply this to every row, you can use apply() with MARGIN = 1.

mtcars$weighted.sd <- apply(mtcars[,1:11], 1, weighted.sd2, wt = rep(11, 11))
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb weighted.sd
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46 NA  1    4    4    52.61200
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02 NA  1    4    4    52.58011
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1    37.06108
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1 NA    3    1    78.36300
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02 NA NA    3    2   116.54503
...

Problem using na.rm=TRUE in summarize in R code

To find the mode, define a Mode function:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Now it should work:

Test %>%
  group_by(Week = tools::toTitleCase(Week)) %>%
  summarize(Mode = Mode(time), .groups = 'drop')
# A tibble: 2 × 2
#   Week       Mode
#   <chr>     <dbl>
# 1 Thursday      0
# 2 Wednesday     5

To support na.rm, make it an argument of the function and pass it on to max() as well:

Test1 <- function(t, rm_na) {
  s <- table(as.vector(t))
  names(s)[s %in% max(s, na.rm = rm_na)]
}

and use the function as

Test %>%
  group_by(Week = tools::toTitleCase(Week)) %>%
  summarize(Mode = Test1(time, TRUE), .groups = 'drop')
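As an alternative sketch (not from the original answer), the Mode helper itself could take an na.rm argument and drop missing values before counting. The name Mode_na and the sample vector are hypothetical.

```r
# Hypothetical variant of Mode() with an na.rm argument
Mode_na <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]                    # drop missing values before counting
  }
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]  # most frequent value (first one on ties)
}

time <- c(5, 5, NA, NA, NA, 0)           # illustrative data: NA is the most common entry

with_na    <- Mode_na(time)              # NA, because NA occurs most often
without_na <- Mode_na(time, na.rm = TRUE)  # 5

with_na
without_na
```

Unlike Test1, this version keeps the type of the input (numeric stays numeric) and always returns a single value.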

