Find names of columns which contain missing values
Like this?
colnames(mymatrix)[colSums(is.na(mymatrix)) > 0]
# [1] "aa" "ee"
Or as suggested by @thelatemail:
names(which(colSums(is.na(mymatrix)) > 0))
# [1] "aa" "ee"
Pandas: print column name with missing values
df.isnull().any()
generates a boolean Series (True if the column has a missing value, False otherwise). You can use it to index into df.columns:
df.columns[df.isnull().any()]
returns an Index of the columns which have missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [1, 2, np.nan],
                   'C': [4, 5, 6],
                   'D': [np.nan, np.nan, np.nan]})
df
Out:
A B C D
0 1 1.0 4 NaN
1 2 2.0 5 NaN
2 3 NaN 6 NaN
df.columns[df.isnull().any()]
Out: Index(['B', 'D'], dtype='object')
df.columns[df.isnull().any()].tolist() # to get a list instead of an Index object
Out: ['B', 'D']
How to find which columns contain any NaN value in Pandas dataframe
UPDATE: using Pandas 0.22.0
Newer Pandas versions have new methods 'DataFrame.isna()' and 'DataFrame.notna()'
In [71]: df
Out[71]:
a b c
0 NaN 7.0 0
1 0.0 NaN 4
2 2.0 NaN 4
3 1.0 7.0 0
4 1.0 3.0 9
5 7.0 4.0 9
6 2.0 6.0 9
7 9.0 6.0 4
8 3.0 0.0 9
9 9.0 0.0 1
In [72]: df.isna().any()
Out[72]:
a True
b True
c False
dtype: bool
as list of columns:
In [74]: df.columns[df.isna().any()].tolist()
Out[74]: ['a', 'b']
To select those columns (containing at least one NaN value):
In [73]: df.loc[:, df.isna().any()]
Out[73]:
a b
0 NaN 7.0
1 0.0 NaN
2 2.0 NaN
3 1.0 7.0
4 1.0 3.0
5 7.0 4.0
6 2.0 6.0
7 9.0 6.0
8 3.0 0.0
9 9.0 0.0
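As a quick sanity check, the sketch below compares the newer isna()/notna() methods with isnull(). The small frame is made-up illustrative data, not the one shown above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data)
df = pd.DataFrame({"a": [np.nan, 0.0, 2.0],
                   "b": [7.0, np.nan, np.nan],
                   "c": [0, 4, 4]})

# isna() and isnull() return identical boolean masks
same_mask = df.isna().equals(df.isnull())

# notna() is the element-wise negation of isna()
complement = df.notna().equals(~df.isna())

# Names of the columns that contain at least one NaN
cols_with_nan = df.columns[df.isna().any()].tolist()
print(same_mask, complement, cols_with_nan)
```

Either spelling can be used in the snippets in this answer; the results should be the same.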
OLD answer:
Try to use isnull():
In [97]: df
Out[97]:
a b c
0 NaN 7.0 0
1 0.0 NaN 4
2 2.0 NaN 4
3 1.0 7.0 0
4 1.0 3.0 9
5 7.0 4.0 9
6 2.0 6.0 9
7 9.0 6.0 4
8 3.0 0.0 9
9 9.0 0.0 1
In [98]: pd.isnull(df).sum() > 0
Out[98]:
a True
b True
c False
dtype: bool
or, as @root proposed, a clearer version:
In [5]: df.isnull().any()
Out[5]:
a True
b True
c False
dtype: bool
In [7]: df.columns[df.isnull().any()].tolist()
Out[7]: ['a', 'b']
To select a subset of all columns containing at least one NaN value:
In [31]: df.loc[:, df.isnull().any()]
Out[31]:
a b
0 NaN 7.0
1 0.0 NaN
2 2.0 NaN
3 1.0 7.0
4 1.0 3.0
5 7.0 4.0
6 2.0 6.0
7 9.0 6.0
8 3.0 0.0
9 9.0 0.0
How to find columns that contain N/A values
If those are the only cases, something like this could work:
na_cols = sapply(df, function(x) sum(ifelse(x == '' | is.na(x) == TRUE | x == 'N/A', 1, 0)))
names(na_cols[na_cols > 0])
If there were more "NA" conditions, you'd need to add to the ifelse statement.
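For comparison, here is a rough pandas analogue of the same idea. The column names, the data, and the placeholders list are assumptions made up for illustration; extending the list plays the role of adding conditions to the ifelse statement.

```python
import numpy as np
import pandas as pd

# Hypothetical data: missing values appear as NaN, '' and 'N/A'
df = pd.DataFrame({"x": ["a", "", "c"],
                   "y": ["N/A", "b", "c"],
                   "z": [1.0, np.nan, 3.0],
                   "w": [1, 2, 3]})

# Placeholder strings to count as missing (an assumption; extend as needed)
placeholders = ["", "N/A"]

# isna() flags real NaN, isin() flags the placeholders; combine them per cell
missing_mask = df.isna() | df.isin(placeholders)

# Keep the names of columns with at least one flagged cell
na_cols = df.columns[missing_mask.any()].tolist()
print(na_cols)
```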
make a list of the variables that contain missing values - pandas
Here the error means there are duplicated column names, so df[var] returns a DataFrame, not a Series.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1], 'b': [1, 1], 'c': [np.nan, np.nan]})
df.columns = ['a','a','s']
print (df['a'])
a a
0 NaN 1
1 1.0 1
vars_with_na = [var for var in df.columns if df[var].isnull().sum() > 0]
print (vars_with_na)
A possible solution is to deduplicate the column names first.
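A minimal sketch of two ways this might be done, reusing the duplicated-name frame from the example. Keeping the first occurrence of each name is an assumption about what "deduplicate" should mean here; the second option sidesteps df[var] entirely.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 1], "b": [1, 1], "c": [np.nan, np.nan]})
df.columns = ["a", "a", "s"]  # duplicated name, as in the example

# Option 1: keep only the first occurrence of each column name
deduped = df.loc[:, ~df.columns.duplicated()]
vars_with_na = [var for var in deduped.columns
                if deduped[var].isnull().sum() > 0]

# Option 2: avoid df[var] altogether; a positional boolean mask
# works even when names repeat
mask = df.isnull().any()
names_with_na = df.columns[mask.to_numpy()].unique().tolist()
print(vars_with_na, names_with_na)
```

Note that option 1 drops the second "a" column before checking, so a NaN that appears only in a later duplicate would go unnoticed; option 2 checks every column.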
Checking all columns in data frame for missing values in R
The anyNA function is built for this. You can apply it to all columns of a data frame with sapply(books, anyNA). To count NA values, akrun's suggestion of colSums(is.na(books)) is good.
Return list of column names with missing (NA) data for each row of a data frame in R
Here is a tidyverse solution.
library(readr) # provides read_table()
df <- read_table("
ID Var1 Var2 Var3 Var4 Var5
1 10 T NA 2 NA
2 15 F 50 2 NA
3 12 NA 41 2 NA
4 NA NA NA 1 NA
5 NA F NA NA NA", col_names = TRUE)
library(dplyr)
library(tidyr)
df %>%
mutate(across(starts_with("var"), is.na)) %>% # replace all NA with TRUE and else FALSE
pivot_longer(-ID, names_to = "var") %>% # pivot longer
filter(value) %>% # remove the FALSE rows
group_by(ID) %>% # group by the ID
summarise(`Missing Variables` = toString(var)) # convert the variable names to a string column
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 5 x 2
ID `Missing Variables`
<dbl> <chr>
1 1 Var3, Var5
2 2 Var5
3 3 Var2, Var5
4 4 Var1, Var2, Var3, Var5
5 5 Var1, Var3, Var4, Var5
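For readers working in pandas rather than R, the same per-row listing could be sketched roughly as follows. The data is a shortened, made-up variant of the table, and the output column name is simply copied from the R result.

```python
import numpy as np
import pandas as pd

# Hypothetical data loosely modelled on the table in this answer
df = pd.DataFrame({"ID": [1, 2, 3],
                   "Var1": [10, 15, np.nan],
                   "Var2": ["T", None, None],
                   "Var3": [np.nan, 50, 41]})

# Boolean mask of missing cells, excluding the ID column
value_cols = df.columns.drop("ID")
is_missing = df[value_cols].isna()

# For each row, join the names of the columns whose mask is True
df["Missing Variables"] = is_missing.apply(
    lambda row: ", ".join(row.index[row.to_numpy()]), axis=1)
print(df[["ID", "Missing Variables"]])
```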
Find columns with all missing values
This is easy enough to do with sapply and a small anonymous function:
sapply(test1, function(x)all(is.na(x)))
X1 X2 X3
FALSE FALSE FALSE
sapply(test2, function(x)all(is.na(x)))
X1 X2 X3
FALSE TRUE FALSE
And inside a function:
na.test <- function (x) {
w <- sapply(x, function(x)all(is.na(x)))
if (any(w)) {
stop(paste("All NA in columns", paste(which(w), collapse=", ")))
}
}
na.test(test1)
na.test(test2)
Error in na.test(test2) : All NA in columns 2
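A comparable check in pandas might look like the sketch below. The helper names all_na_columns and na_test and the test2 frame are hypothetical stand-ins, since the original test data is not shown.

```python
import numpy as np
import pandas as pd

def all_na_columns(frame):
    """Return the names of columns in which every value is missing."""
    return frame.columns[frame.isna().all()].tolist()

def na_test(frame):
    """Raise if any column is entirely missing (mirrors na.test)."""
    bad = all_na_columns(frame)
    if bad:
        raise ValueError("All NA in columns: " + ", ".join(map(str, bad)))

# Hypothetical frame: X2 is entirely missing, X3 only partly
test2 = pd.DataFrame({"X1": [1, 2],
                      "X2": [np.nan, np.nan],
                      "X3": [3, np.nan]})
print(all_na_columns(test2))
```

Unlike the R version, this reports column names rather than positions.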
Find column names of missing values based on list from other dataset
I think I found a way after researching a little bit more. I simply create a list of all the required points from the nominal table, join it with the measurement table on side, and then take the difference between these two array columns. Seems to work fine.
Here is an example given the nominals table and the actuals table.
First create a list of all the nominal points per side:
from pyspark.sql import functions as F

nominal_list = (
nominals
.groupby("side")
.agg(F.collect_list(F.col("point_name")))
.withColumnRenamed("collect_list(point_name)", "nominal_points")
)
Then join with a similar dataset built from the measurements, which has the additional id column (note the order of the arrays when calling F.array_except).
missing = (
actuals
.groupby(["id", "side"])
.agg(F.collect_list(F.col("point_name")))
.withColumnRenamed("collect_list(point_name)", "measured_points")
.join(nominal_list, "side")
.withColumn('missing', F.array_except('nominal_points', 'measured_points'))
)
Note that this is slightly different from what I asked in the beginning, as I have the side as an additional column and not hidden in the column name, i.e. missing_L and missing_R.
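To make the logic of the join and the array difference easier to follow without a Spark session, here is a plain-Python sketch of the same steps. The rows are invented for illustration, and the list comprehension only approximates F.array_except (which also removes duplicates).

```python
from collections import defaultdict

# Hypothetical rows standing in for the nominals and actuals tables
nominals = [("L", "p1"), ("L", "p2"), ("R", "p1"), ("R", "p3")]
actuals = [(1, "L", "p1"), (1, "R", "p1"), (1, "R", "p3"), (2, "L", "p2")]

# Collect the nominal points per side
nominal_points = defaultdict(list)
for side, point in nominals:
    nominal_points[side].append(point)

# Collect the measured points per (id, side)
measured_points = defaultdict(list)
for id_, side, point in actuals:
    measured_points[(id_, side)].append(point)

# Nominal minus measured, in that order; swapping the operands
# would list unexpected extra points instead of missing ones
missing = {
    key: [p for p in nominal_points[key[1]] if p not in measured]
    for key, measured in measured_points.items()
}
print(missing)
```

As with the inner join in the Spark version, an (id, side) pair with no measurements at all does not appear in the result.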
Viewing all column names with any NA in R
Another acrobatic solution (just for fun) :
colnames(df)[!complete.cases(t(df))]
[1] "b" "c"
The idea is: getting the columns of A that have at least one NA is equivalent to getting the rows of t(A) that have at least one NA. complete.cases by definition gives the rows without any missing value, and it is very efficient since it is just a call to a C function.