How to Find the Percentage of NAs in a data.frame

How to find the percentage of NAs in a data.frame?

x = data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5))

For the whole dataframe:

sum(is.na(x)) / prod(dim(x))  # NA cells divided by total cells

Or

mean(is.na(x))  # same result: the mean of the logical matrix

For columns:

apply(x, 2, function(col) sum(is.na(col)) / length(col))

Or

colMeans(is.na(x))
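
For the example data frame defined above (1 NA in x, 2 NAs in y, 8 cells in total), the whole-frame and per-column versions work out to:

sum(is.na(x)) / prod(dim(x))   # 0.375
mean(is.na(x))                 # 0.375
colMeans(is.na(x))
#    x    y
# 0.25 0.50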

Find out the percentage of missing values in each column in the given dataset

How about this? I think I actually found something similar on here once before, but I'm not seeing it now...

import pandas as pd

percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})

And if you want the missing percentages sorted, follow the above with:

missing_value_df.sort_values('percent_missing', inplace=True)

As mentioned in the comments, you may also be able to get by with just the first line in my code above, i.e.:

percent_missing = df.isnull().sum() * 100 / len(df)

Is there a way to calculate the percentage of NA's in each column of a dataframe, but with the df split into separate groups?

Group by 'programme', take the mean of is.na() over the other columns, then gather to 'long' format and spread back to 'wide' format:

library(tidyverse)
df %>%
  group_by(programme) %>%
  summarise_all(funs(mean(is.na(.)))) %>%
  gather(variables, val, -programme) %>%
  spread(programme, val)
# A tibble: 3 x 4
#   variables     A     B     C
#   <chr>     <dbl> <dbl> <dbl>
# 1 v1            0     1     0
# 2 v2            1     0     0
# 3 v3            0     0     1
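
A note in passing: funs() and gather()/spread() have since been superseded in the tidyverse. An equivalent sketch using across() with pivot_longer()/pivot_wider(), assuming the same df with a 'programme' column, would be:

df %>%
  group_by(programme) %>%
  summarise(across(everything(), ~ mean(is.na(.)))) %>%
  pivot_longer(-programme, names_to = "variables", values_to = "val") %>%
  pivot_wider(names_from = programme, values_from = val)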

Calculate, using dplyr, the percentage of NAs in each column

First, I created some test data for you:

a <- c(1, NA, NA, 4)
b <- c(NA, 2, 3, 4)
x <- data.frame(a, b)
x
#    a  b
# 1  1 NA
# 2 NA  2
# 3 NA  3
# 4  4  4

Then you can use colMeans(is.na(x)):

colMeans(is.na(x))
#    a    b
# 0.50 0.25
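
If you specifically want a dplyr version rather than base colMeans(), a sketch using summarise() with across() gives the same per-column result (multiplied by 100 here to get a percentage):

library(dplyr)

x %>%
  summarise(across(everything(), ~ mean(is.na(.)) * 100))
#    a  b
# 1 50 25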

R: Calculate the percentage of missing values (NA) per day for a column in a data frame using panel data, and remove the days with over 25% missing data

If you summarise(), you lose the information on the individual days, so use mutate() after group_by() instead. The percentage of NAs per day is the number of NA values divided by the number of rows for that day. as_tibble() is only used to show the number of rows more clearly; it would work without it too. I added a CountDate column so that you can see how many times the same day appears in your data frame.

Data %>%
  as_tibble() %>%
  group_by(Date) %>%
  mutate(CountDate = n(), PercNA = sum(is.na(Size)) / n() * 100)

# A tibble: 27 x 5
# Groups:   Date [9]
   Product Date        Size CountDate PercNA
   <chr>   <chr>      <int>     <int>  <dbl>
 1 A       01.09.2018    10         3    0
 2 A       02.09.2018     9         3    0
 3 A       03.09.2018    NA         3  100
 4 A       04.09.2018     3         3    0
 5 A       05.09.2018     4         3    0
 6 A       11.11.2020     5         3   33.3
 7 A       12.11.2020     3         3    0
 8 A       13.11.2020    NA         3   33.3
 9 A       14.11.2020     6         3    0
10 B       01.09.2018     7         3    0
# ... with 17 more rows

To remove the dates having >25% NA, just filter():

Data %>%
  as_tibble() %>%
  group_by(Date) %>%
  mutate(CountDate = n(), PercNA = sum(is.na(Size)) / n() * 100) %>%
  filter(PercNA < 25) %>%
  ungroup()

# A tibble: 18 x 5
   Product Date        Size CountDate PercNA
   <chr>   <chr>      <int>     <int>  <dbl>
 1 A       01.09.2018    10         3      0
 2 A       02.09.2018     9         3      0
 3 A       04.09.2018     3         3      0
 4 A       05.09.2018     4         3      0
 5 A       12.11.2020     3         3      0
 6 A       14.11.2020     6         3      0
 7 B       01.09.2018     7         3      0
 8 B       02.09.2018     4         3      0
 9 B       04.09.2018     4         3      0
10 B       05.09.2018     6         3      0
11 B       12.11.2020     4         3      0
12 B       14.11.2020     7         3      0
13 C       01.09.2018     3         3      0
14 C       02.09.2018     4         3      0
15 C       04.09.2018     2         3      0
16 C       05.09.2018     4         3      0
17 C       12.11.2020     7         3      0
18 C       14.11.2020     5         3      0
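
If you don't need the CountDate and PercNA helper columns, the same days can be dropped in a single grouped filter(); a sketch, assuming the same Data, Date and Size columns:

Data %>%
  as_tibble() %>%
  group_by(Date) %>%
  filter(mean(is.na(Size)) * 100 < 25) %>%  # keep only days with < 25% NA
  ungroup()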

In Python, how to view the percentage of missing values per each column?

Give this a try; it returns the fraction of missing values per column, so multiply by 100 if you want a percentage:

my_df.isnull().sum() / len(my_df)

Determine percentage of rows with missing values in a dataframe in R

It's best to avoid loops in R where possible, since they can get messy and tend to be quite slow. For this sort of thing, the dplyr library is perfect and well worth learning; it can save you a lot of time.

You can create a data frame with both variables by first grouping by subject and part, and then performing a summary of the grouped data frame:

df2 = df1 %>%
  dplyr::group_by(subject, part) %>%
  dplyr::summarise(
    sad_mean = mean(na.omit(sad)),
    na_count = sum(is.na(sad)) / n() * 100
  )
df2
# A tibble: 8 x 4
# Groups:   subject [2]
  subject  part sad_mean na_count
    <dbl> <dbl>    <dbl>    <dbl>
1       1     0     4.75        0
2       1     1     2          50
3       1     2     2.5        50
4       1     3     1.67       25
5       2     0     5.5        50
6       2     1     4.5        50
7       2     2     4          50
8       2     3     4          25
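
The same summary can be written a bit more compactly, since mean(sad, na.rm = TRUE) is equivalent to mean(na.omit(sad)) and the NA share is just the mean of is.na(); a sketch under the same column names:

df1 %>%
  dplyr::group_by(subject, part) %>%
  dplyr::summarise(
    sad_mean   = mean(sad, na.rm = TRUE),   # mean of the non-missing values
    na_percent = mean(is.na(sad)) * 100     # share of missing rows, as a percent
  )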

R: Counting Overall Percentage of 0's in Data

You can unlist your data frame into a vector and take the mean of a logical comparison:

vec = unlist(data_frame)

mean(vec %in% "j") * 100 # 6.25
mean(vec %in% "0") * 100 # 6.25
mean(vec %in% NA) * 100 # 43.75
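
One reason %in% is used here rather than ==: comparing against NA with == returns NA, while %in% returns FALSE, so the mean isn't poisoned by missing values. A minimal sketch with made-up data (this data_frame is hypothetical, just to illustrate):

data_frame <- data.frame(a = c("0", "j", NA, "1"),
                         b = c(NA, "0", NA, "2"))
vec <- unlist(data_frame)

mean(vec %in% "0") * 100   # 25: two of eight cells are "0"
mean(vec %in% NA) * 100    # 37.5: three of eight cells are NA
mean(vec == "0")           # NA, because == propagates the missing values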

Calculate the percentage of non-NA values of subgroups

You can do:

df %>%
  group_by(sample, sub_sample) %>%
  summarise(value_non_na = sum(!is.na(value)) / n() * 100)

  sample sub_sample value_non_na
   <int> <fct>             <dbl>
1      1 A                  66.7
2      1 B                  33.3
3      1 C                 100
4      2 A                 100
5      2 B                  66.7
6      2 C                  33.3
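
Since sum(!is.na(value)) / n() is just the mean of a logical vector, this can be shortened; a sketch with the same assumed columns:

df %>%
  group_by(sample, sub_sample) %>%
  summarise(value_non_na = mean(!is.na(value)) * 100)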

