Explain Ggplot2 Warning: "Removed K Rows Containing Missing Values"

The behavior you're seeing is due to how ggplot2 deals with data that are outside the axis ranges of the plot. scale_y_continuous (or, equivalently, ylim) excludes values outside the plot area when calculating statistics, summaries, or regression lines. By default, those out-of-range values are converted to NA before the layers are computed, which is why the warning talks about missing or non-finite values. coord_cartesian includes all values in these calculations, regardless of whether they are visible in the plot area. Here are some examples:

library(ggplot2)

# Set one point to a large hp value
d = mtcars
d$hp[d$hp==max(d$hp)] = 1000

All points are visible in this plot:

ggplot(d, aes(mpg, hp)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "All points are visible; no warnings")
#> `geom_smooth()` using formula 'y ~ x'

In the plot below, one point with hp = 1000 is outside the y-axis range of the plot. Because we used scale_y_continuous to set the y-axis range, this point is not included in any other statistics or summary measures calculated by ggplot, such as the linear regression line calculated by geom_smooth. ggplot also provides warnings about the excluded point.

ggplot(d, aes(mpg, hp)) +
  geom_point() +
  scale_y_continuous(limits = c(0, 300)) + # Change this to limits = c(0, 1000) and the warning disappears
  geom_smooth(method = "lm") +
  labs(title = "scale_y_continuous: excluded point is not used for regression line")
#> `geom_smooth()` using formula 'y ~ x'
#> Warning: Removed 1 rows containing non-finite values (stat_smooth).
#> Warning: Removed 1 rows containing missing values (geom_point).
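
The count in the warning can be predicted before plotting. The sketch below mimics in base R what the scale limits do by default (values outside the limits become NA). The variable names are illustrative, and this is not ggplot2's internal code.

```r
# Same data as above: the largest hp value is set to 1000
d <- mtcars
d$hp[d$hp == max(d$hp)] <- 1000

limits <- c(0, 300)

# Mimic the default out-of-bounds handling: values outside the limits become NA
hp_censored <- ifelse(d$hp >= limits[1] & d$hp <= limits[2], d$hp, NA)

# Number of rows that would be dropped with a warning
n_removed <- sum(is.na(hp_censored))
n_removed
#> [1] 1
```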

In the plot below, the point with hp = 1000 is still outside the y-axis range of the plot. However, because we used coord_cartesian, this point is nevertheless included in any statistics or summary measures that ggplot calculates, such as the linear regression line.

If you compare the plot below with the previous one, you can see that the regression line in the plot below has a much steeper slope and wider confidence bands, because the point with hp = 1000 is included when calculating the regression line, even though it's not visible in the plot.

ggplot(d, aes(mpg, hp)) +
  geom_point() +
  coord_cartesian(ylim = c(0, 300)) +
  geom_smooth(method = "lm") +
  labs(title = "coord_cartesian: excluded point is still used for regression line")
#> `geom_smooth()` using formula 'y ~ x'
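
The difference between the two regression lines can also be checked numerically, without plotting. This is a rough sketch that fits lm directly, once on all rows (what coord_cartesian leaves in place) and once on the rows inside the limits (what scale_y_continuous keeps). It assumes geom_smooth(method = "lm") fits the same hp ~ mpg model, and the object names are made up for illustration.

```r
d <- mtcars
d$hp[d$hp == max(d$hp)] <- 1000

# coord_cartesian(ylim = ...) only zooms, so the fit uses every row
fit_zoom <- lm(hp ~ mpg, data = d)

# scale_y_continuous(limits = ...) drops out-of-range rows before fitting
fit_limits <- lm(hp ~ mpg, data = subset(d, hp >= 0 & hp <= 300))

# Compare the slopes of the two fits
slopes <- c(coord_cartesian = unname(coef(fit_zoom)["mpg"]),
            scale_limits = unname(coef(fit_limits)["mpg"]))
slopes
```

The slope from the full data should be the larger one in absolute value, matching the steeper line in the coord_cartesian plot.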

Removed N rows containing missing values BUT there are no missing values nor values out of range

We can reproduce the warning by changing any one value in the column to NA.

library(dplyr)
library(ggplot2)

df$Maritaldummy[195] <- NA

df %>%
  mutate(date = lubridate::mdy(startday)) %>%
  arrange(date) %>%
  mutate(Rs = cumsum(Maritaldummy %in% c("Not married", "Married")),
         Married_Rs = cumsum(Maritaldummy == "Married")) %>%
  group_by(date) %>%
  slice(n()) %>%
  select(date, Rs, Married_Rs) %>%
  mutate(Married_prop = Married_Rs/Rs) %>%
  ggplot(aes(x = date, y = Married_prop)) +
  geom_point() +
  geom_line()

Returns

Warning messages:
1: Removed 38 rows containing missing values (geom_point).
2: Removed 38 row(s) containing missing values (geom_path).

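Because at least one value is NA, the comparison Maritaldummy == "Married" yields NA at that position, and cumsum then returns NA for that element and for every element after it. An easy fix is to use %in% instead of ==, because %in% returns FALSE when the value being tested is NA.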

df %>%
  mutate(date = lubridate::mdy(startday)) %>%
  arrange(date) %>%
  mutate(Rs = cumsum(Maritaldummy %in% c("Not married", "Married")),
         Married_Rs = cumsum(Maritaldummy %in% "Married")) %>%
  group_by(date) %>%
  slice(n()) %>%
  select(date, Rs, Married_Rs) %>%
  mutate(Married_prop = Married_Rs/Rs) %>%
  ggplot(aes(x = date, y = Married_prop)) +
  geom_point() +
  geom_line()
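
The NA propagation can be seen in isolation on a small vector. The values below are made up for illustration and only stand in for the Maritaldummy column.

```r
status <- c("Married", "Not married", NA, "Married", "Married")

# == propagates NA, and cumsum() carries it through every later element
with_eq <- cumsum(status == "Married")
with_eq
#> [1]  1  1 NA NA NA

# %in% returns FALSE for NA, so the running count stays defined
with_in <- cumsum(status %in% "Married")
with_in
#> [1] 1 1 1 2 3
```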

ggplot2 warning removed x rows containing missing values when drop = FALSE

Your code does not work because, even with drop = FALSE, the missing categories are still not present in the computed variables ..count.. and ..x.. that the stat returns. This can be seen by plotting each of the two variables.

library("tidyverse")

df <- data.frame(
  location = c(rep("in", 231), rep("out", 83)),
  status = c(rep("normal", 73), rep("mild", 42), rep("moderate", 20), rep("fever", 4),
             rep("normal", 70), rep("mild", 41), rep("moderate", 62), rep("fever", 2)))

df$status <- factor(df$status, levels = c("normal", "mild", "moderate", "severe", "fever"))

Plot ..count..

df %>%
  ggplot(aes(x = status,
             y = ..count..,
             fill = location)) +
  geom_bar(position = "dodge") +
  scale_x_discrete(drop = FALSE)

The missing categories are not present in ..count.., which we can infer from the fact that only one bar shows up for normal and none for severe, i.e. ..count.. is the vector

..count.. <- c(143, 64, 19, 20, 62, 4, 2)

Plot ..x..

df %>%
  ggplot(aes(x = status,
             y = ..x..,
             fill = location)) +
  geom_bar(position = "dodge") +
  scale_x_discrete(drop = FALSE)

As with ..count.., the missing categories are not present in ..x.., i.e. ..x.. is the vector

..x.. <- c(1, 2, 2, 3, 3, 5, 5)

Why the code does not work

As a first step I compute tapply(..count.., ..x.., sum) which gives us a vector of length 4 (total counts for non-missing status categories):

tapply(..count.., ..x.., sum)
#>   1   2   3   5 
#> 143  83  82   6 

Now, extracting the elements via [..x..] results in

tapply(..count.., ..x.., sum)[..x..]
#>    1    2    2    3    3 <NA> <NA> 
#>  143   83   83   82   82   NA   NA 

or

..count.. / tapply(..count.., ..x.., sum)[..x..]
#>      1      2      2      3      3   <NA>   <NA> 
#> 1.0000 0.7711 0.2289 0.2439 0.7561     NA     NA 

Hence your code results in two missing values for the last two entries (the two fever bars), which explains the warning Removed 2 rows containing missing values (geom_bar). The reason is that with ..x.. <- c(1, 2, 2, 3, 3, 5, 5) we are trying to extract the 5th element twice from the length 4 vector tapply(..count.., ..x.., sum), and indexing past the end of a vector returns NA.

In case of drop=TRUE everything works fine because in that case ..x.. <- c(1, 2, 2, 3, 3, 4, 4) while ..count.. is the same.

Solution

The issue can be solved by converting ..x.. to a character vector. In that case we extract elements by names:

library("tidyverse")

df <- data.frame(
  location = c(rep("in", 231), rep("out", 83)),
  status = c(rep("normal", 73), rep("mild", 42), rep("moderate", 20), rep("fever", 4),
             rep("normal", 70), rep("mild", 41), rep("moderate", 62), rep("fever", 2)))

df$status <- factor(df$status, levels = c("normal", "mild", "moderate", "severe", "fever"))

# Convert ..x.. to character
df %>%
  ggplot(aes(x = status,
             y = ..count.. / tapply(..count.., ..x.., sum)[as.character(..x..)],
             fill = location)) +
  geom_bar(position = "dodge") +
  scale_x_discrete(drop = FALSE)
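
The difference between positional and name-based lookup can be checked outside ggplot2 with the two vectors listed earlier. In this sketch, count and x are plain stand-ins for ..count.. and ..x.., and the other names are made up for illustration.

```r
# Stand-ins for the computed variables ..count.. and ..x..
count <- c(143, 64, 19, 20, 62, 4, 2)
x <- c(1, 2, 2, 3, 3, 5, 5)

# Totals per category: a length 4 vector named "1", "2", "3", "5"
totals <- tapply(count, x, sum)

# Positional lookup: position 5 does not exist, so the last two entries are NA
by_position <- as.vector(totals[x])
by_position
#> [1] 143  83  83  82  82  NA  NA

# Name-based lookup: "5" matches the name of the fourth element
by_name <- as.vector(totals[as.character(x)])
by_name
#> [1] 143  83  83  82  82   6   6

# Within-category shares, with no missing values
share <- count / by_name
```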

Created on 2020-03-23 by the reprex package (v0.3.0)

R: Removed n rows containing missing values (geom_path)

I think this happens because df is not filtered, so the limits in scale_x_datetime drop the rows of df that fall outside the slider range. I added this:

df %>% filter(between(x, in_slider_1, in_slider_2))

which seems to remove the issue for me. Please test. Just to mention that I did have some time zone problems.

Full code below:

library(shiny)
library(ggplot2)
library(scales)
library(dplyr) # needed for %>%, filter() and between() used in renderPlot


ui <- navbarPage("Test",
  tabPanel("Test_2",
    fluidPage(
      fluidRow(
        column(width = 12, plotOutput("plot", width = 1200, height = 600))
      ),
      fluidRow(
        column(width = 12, sliderInput("slider",
          label = "Range [h]",
          min = as.POSIXct("2019-11-01 00:00"),
          max = as.POSIXct("2019-11-01 07:00"),
          value = c(as.POSIXct("2019-11-01 00:00"), as.POSIXct("2019-11-01 07:00"))))
      ))))

server <- function(input, output, session) {

  df <- data.frame("x" = c(as.POSIXct("2019-11-01 00:00"), as.POSIXct("2019-11-01 01:00"),
                           as.POSIXct("2019-11-01 02:00"), as.POSIXct("2019-11-01 03:00"),
                           as.POSIXct("2019-11-01 04:00"), as.POSIXct("2019-11-01 05:00"),
                           as.POSIXct("2019-11-01 06:00"), as.POSIXct("2019-11-01 07:00")),
                   "y" = c(0, 1, 2, 3, 4, 5, 6, 7))

  observe({
    len_date_list <- length(df$x)

    min_merge_datetime <- df$x[1]
    max_merge_datetime <- df$x[len_date_list]

    updateSliderInput(session, "slider",
                      min = as.POSIXct(min_merge_datetime),
                      max = as.POSIXct(max_merge_datetime),
                      timeFormat = "%Y-%m-%d %H:%M")
  })

  output$plot <- renderPlot({

    in_slider_1 <- input$slider[1]
    in_slider_2 <- input$slider[2]

    ggplot(data = df %>% filter(between(x, in_slider_1, in_slider_2)), aes(x, y, group = 1)) +
      theme_bw() +
      geom_line(color = "black", stat = "identity") +
      # geom_point() +
      scale_x_datetime(labels = date_format("%m-%d %H:%M"),
                       limits = c(
                         as.POSIXct(in_slider_1),
                         as.POSIXct(in_slider_2)))
  })
}

shinyApp(server = server, ui = ui)

It looks like you could now actually remove the scale_x_datetime completely and just have:

ggplot(data = df %>% filter(between(x, in_slider_1, in_slider_2)), aes(x, y, group = 1)) +
  theme_bw() +
  geom_line(color = "black", stat = "identity")
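
The filtering step can be tried outside Shiny. The sketch below uses base R comparisons instead of dplyr's between, fixes the time zone to UTC to sidestep the time zone problems mentioned above, and uses two made-up values (lo and hi) in place of the slider inputs.

```r
# Hourly timestamps like the ones in the app, with an explicit time zone
x <- seq(as.POSIXct("2019-11-01 00:00", tz = "UTC"), by = "hour", length.out = 8)
df <- data.frame(x = x, y = 0:7)

# Hypothetical slider values
lo <- as.POSIXct("2019-11-01 02:00", tz = "UTC")
hi <- as.POSIXct("2019-11-01 05:00", tz = "UTC")

# Rows outside the range are the ones the scale limits would remove with a warning
inside <- df$x >= lo & df$x <= hi
n_outside <- sum(!inside)
n_outside
#> [1] 4

# Filtering first leaves nothing for the scale to drop
df_visible <- df[inside, ]
```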

How can I stop geom_point from removing rows in order to create a map

I am not sure why you are running your first part of the code:

locations$Latitude <- as.numeric(levels(locations$Latitude))[locations$Latitude]
locations$Longitude <- as.numeric(levels(locations$Longitude))[locations$Longitude]

If you don't run that part, there won't be any NA anymore. So if you run the following code, it should work:

library(tidyverse)
library(raster)
uganda <- raster::getData('GADM', country='UGA', level=1)

ggplot() +
  geom_polygon(data = uganda,
               aes(x = long, y = lat, group = group),
               colour = "grey10", fill = "#fff7bc") +
  geom_point(data = locations,
             aes(x = Longitude, y = Latitude)) +
  coord_map() +
  theme_bw() +
  xlab("Longitude") + ylab("Latitude")
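
One plausible reason the removed conversion produces NA is sketched below in base R. It assumes that Latitude and Longitude are already numeric columns rather than factors. The question's data is not shown here, so this is an assumption, and the coordinates are made up. For a non-factor, levels() returns NULL, so the lookup runs against an empty vector and every result is NA.

```r
# Made-up coordinates, for illustration only
lat_factor <- factor(c("1.37", "2.77", "1.37"))
lat_numeric <- c(1.37, 2.77, 1.37)

# On a factor, the idiom recovers the numeric values
from_factor <- as.numeric(levels(lat_factor))[lat_factor]
from_factor
#> [1] 1.37 2.77 1.37

# On a numeric vector, levels() is NULL, so every lookup falls outside
# the empty vector and comes back as NA
from_numeric <- as.numeric(levels(lat_numeric))[lat_numeric]
from_numeric
#> [1] NA NA NA
```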
