Remove rows with all or some NAs (missing values) in data.frame
Also check complete.cases:
> final[complete.cases(final), ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
na.omit is nicer for just removing all NAs. complete.cases allows partial selection by including only certain columns of the data frame:
> final[complete.cases(final[ , 5:6]),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
Your solution can't work. If you insist on using is.na, then you have to do something like:
> final[rowSums(is.na(final[ , 5:6])) == 0, ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
but using complete.cases is considerably clearer, and faster.
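For comparison, the same partial-column filtering can be sketched in pandas, which appears later on this page. The frame below is hypothetical (column names are borrowed from the R example); dropna with a subset plays the role of complete.cases on selected columns.

```python
import pandas as pd
import numpy as np

# Hypothetical frame: gene has no NAs, the numeric columns may.
final = pd.DataFrame({
    "gene": ["g1", "g2", "g3"],
    "hsap": [np.nan, 0.0, 0.0],
    "rnor": [1.0, 2.0, np.nan],
    "cfam": [2.0, 2.0, 2.0],
})

# Rows with no NA in any column (like complete.cases(final)):
all_complete = final.dropna()

# Rows complete only in the chosen columns
# (like final[complete.cases(final[, 5:6]), ]):
subset_complete = final.dropna(subset=["rnor", "cfam"])
print(subset_complete)
```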
How to remove a row if it has an NA value in a certain column
The easiest solution is to use is.na():
df[!is.na(df$B), ]
which gives you:
A B C
1 NA 2 NA
2 1 2 3
4 1 2 3
Remove rows with a complete set of NAs
We can use dplyr. With the example by @lovalery:
library(dplyr)
df %>% filter(!if_all(V2:V3, is.na))
#> V1 V2 V3
#> 1 3 3 NA
#> 2 NA 1 NA
#> 3 3 5 NA
We can use many different selection statements inside if_all. Check the documentation for more examples.
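The same "drop only rows where all of a chosen set of columns are NA" idea can be sketched in pandas (the frame below is hypothetical, loosely mirroring the V1:V3 example above); dropna with how="all" on a column subset is the built-in form.

```python
import pandas as pd
import numpy as np

# Hypothetical frame: V3 is entirely NA, V2 is NA only in the last row.
df = pd.DataFrame({
    "V1": [3.0, np.nan, 3.0, np.nan],
    "V2": [3.0, 1.0, 5.0, np.nan],
    "V3": [np.nan, np.nan, np.nan, np.nan],
})

# Keep rows where NOT all of V2:V3 are NA
# (analogue of filter(!if_all(V2:V3, is.na))):
kept = df[~df[["V2", "V3"]].isna().all(axis=1)]

# Equivalent built-in form:
kept2 = df.dropna(subset=["V2", "V3"], how="all")
print(kept)
```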
Deleting rows corresponding to a particular string, but keeping NA values
We can create a condition with is.na, combined via |, to also return those elements that are NA:
subset(df, Col1 != 'string' | is.na(Col1))
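A pandas sketch of the same filter, using a hypothetical one-column frame: the explicit isna() term mirrors the is.na(Col1) part of the R condition.

```python
import pandas as pd
import numpy as np

# Hypothetical frame: drop rows equal to 'string' in Col1 but keep NAs
# (analogue of subset(df, Col1 != 'string' | is.na(Col1))).
df = pd.DataFrame({"Col1": ["string", "other", np.nan]})

kept = df[(df["Col1"] != "string") | df["Col1"].isna()]
print(kept)
```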
Remove rows which have NA values
I have never used datatable, but pandas.DataFrame has isna() to select rows with NA and drop() to remove rows (or it can use del for this), and I found similar functions for datatable.
- Official documentation: Delete Rows/Columns
- Other place (found with Google): Filtering out Empty Rows
datatable can use del to remove selected rows. It can also use .isna() or == None to select rows with NA. The problem is that it can filter on only one column at a time, so it may need a for-loop to check different columns.
columns = dt.f[:]
for n in range(1, 5):
    rows = (dt.f[f'B_lag_{n}'] == None)
    del df[rows, columns]
print(df)
This removes values from the datatable but not rows, and it creates empty rows like this:
| A B B_lag_1 B_lag_2 B_lag_3 B_lag_4
| int64 float64 float64 float64 float64 float64
-- + ----- --------- --------- --------- --------- ---------
0 | NA NA NA NA NA NA
1 | NA NA NA NA NA NA
2 | NA NA NA NA NA NA
3 | NA NA NA NA NA NA
4 | 0 -0.234153 1.52303 0.647689 -0.138264 0.496714
5 | NA NA NA NA NA NA
6 | NA NA NA NA NA NA
7 | NA NA NA NA NA NA
8 | NA NA NA NA NA NA
9 | 1 0.54256 -0.469474 0.767435 1.57921 -0.234137
[10 rows x 6 columns]
It may be better to keep the rows which don't have None:
columns = dt.f[:]
for n in range(1, 5):
    rows = (dt.f[f'B_lag_{n}'] != None)
    df = df[rows, columns]
print(df)
Result:
| A B B_lag_1 B_lag_2 B_lag_3 B_lag_4
| int64 float64 float64 float64 float64 float64
-- + ----- --------- --------- -------- --------- ---------
0 | 0 -0.234153 1.52303 0.647689 -0.138264 0.496714
1 | 1 0.54256 -0.469474 0.767435 1.57921 -0.234137
[2 rows x 6 columns]
But you can use & (as operator AND) and | (as operator OR) to do the same without a for-loop.
columns = dt.f[:]
rows = (dt.f['B_lag_1'] != None) & (dt.f['B_lag_2'] != None) & (dt.f['B_lag_3'] != None) & (dt.f['B_lag_4'] != None)
df = df[rows, columns]
print(df)
Later I found that datatable has dt.rowall() and dt.rowany() to work with many columns, so the code can be simpler.
- Other place (found with Google): Filtering across Multiple Columns
rowall() works like operator AND, rowany() works like operator OR.
columns = dt.f[:]
rows = dt.rowall(dt.f['B_lag_1', 'B_lag_2', 'B_lag_3', 'B_lag_4'] != None)
#rows = dt.rowall(dt.f['B_lag_1':'B_lag_4'] != None) # range of columns
#rows = dt.rowall(dt.f[:] != None) # all columns
df = df[rows, columns]
print(df)
Full working code (the frame-building code comes from my previous answer, Create many lagged variables):
import datatable as dt
import numpy as np

def test1(df):
    print('\n--- test 1 ---\n')
    df = df.copy()
    #columns = dt.f['A', 'B', 'B_lag_1', 'B_lag_2', 'B_lag_3', 'B_lag_4']
    #columns = df.keys()
    columns = dt.f[:]
    for n in range(1, 5):
        rows = (dt.f[f'B_lag_{n}'] == None)
        del df[rows, columns]
    print(df)

def test2(df):
    print('\n--- test 2 ---\n')
    df = df.copy()
    #columns = dt.f['A', 'B', 'B_lag_1', 'B_lag_2', 'B_lag_3', 'B_lag_4']
    #columns = df.keys()
    columns = dt.f[:]
    for n in range(1, 5):
        rows = (dt.f[f'B_lag_{n}'] != None)
        df = df[rows, columns]
    print(df)

def test3(df):
    print('\n--- test 3 ---\n')
    df = df.copy()
    rows = (dt.f['B_lag_1'] != None) & (dt.f['B_lag_2'] != None) & (dt.f['B_lag_3'] != None) & (dt.f['B_lag_4'] != None)
    columns = dt.f[:]
    df = df[rows, columns]
    print(df)

def test4(df):
    print('\n--- test 4 ---\n')
    df = df.copy()
    columns = dt.f[:]
    #rows = dt.rowall(dt.f['B_lag_1', 'B_lag_2', 'B_lag_3', 'B_lag_4'] != None) # use selected columns
    #rows = dt.rowall(dt.f['B_lag_1':'B_lag_4'] != None) # use columns in some range
    #rows = dt.rowall(dt.f[float] != None) # use columns which have float values
    rows = dt.rowall(dt.f[:] != None) # use all columns
    df = df[rows, columns]
    print(df)

# --- main ---

np.random.seed(42)

df = dt.Frame({
    "A": np.repeat(np.arange(0, 2), 5),
    "B": np.random.normal(0, 1, 10)
})

for n in range(1, 5):
    df[f'B_lag_{n}'] = df[:, dt.shift(dt.f.B, n), dt.by('A')]['B']

# --- tests ---

test1(df)
test2(df)
test3(df)
test4(df)
How to remove rows with an NA value?
dat <- data.frame(x1 = c(1,2,3, NA, 5), x2 = c(100, NA, 300, 400, 500))
na.omit(dat)
x1 x2
1 1 100
3 3 300
5 5 500
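The pandas counterpart of na.omit is dropna(), sketched here on the same data as the R example:

```python
import pandas as pd
import numpy as np

# Same data as the R example; dropna() removes any row containing an NA,
# like na.omit(dat).
dat = pd.DataFrame({
    "x1": [1, 2, 3, np.nan, 5],
    "x2": [100, np.nan, 300, 400, 500],
})
clean = dat.dropna()
print(clean)
```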
remove rows from dataframe based on value, ignoring NAs
We need to select the rows where d is not greater than 0 OR where d is NA:
df[with(df, !d > 0 | is.na(d)), ]
# a b c d
#1 1 2 3 0
#2 4 NA 1 NA
#4 NA 5 0 0
Or we can also use subset
subset(df, !d > 0 | is.na(d))
or dplyr filter
library(dplyr)
df %>% filter(!d > 0 | is.na(d))
The !d > 0 part can also be reversed: since d here holds whole numbers, subset(df, d < 1 | is.na(d)) gives the same result.
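A pandas sketch of the same filter, on a hypothetical frame with only the columns that matter here. One nuance: in pandas, NaN > 0 evaluates to False, so negating the comparison already keeps the NA rows; the explicit isna() term just makes the intent match the R condition.

```python
import pandas as pd
import numpy as np

# Hypothetical frame: keep rows where d is not greater than 0,
# treating NA as "keep" (analogue of !d > 0 | is.na(d)).
df = pd.DataFrame({
    "a": [1.0, 4.0, 2.0, np.nan],
    "d": [0.0, np.nan, 3.0, 0.0],
})

kept = df[~(df["d"] > 0) | df["d"].isna()]
print(kept)
```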
Remove rows where all columns except one have NA values?
We may use if_all inside filter: select the columns a to b in if_all and apply is.na (check for NA). The output will be TRUE for a row only if both a and b are NA; negate with ! to convert TRUE to FALSE and FALSE to TRUE.
library(dplyr)
df %>%
  filter(!if_all(a:b, is.na))
Output:
ID a b
1 1 ab <NA>
2 1 <NA> ab
Or instead of negating (!), we may use complete.cases with if_any
df %>%
  filter(if_any(a:b, complete.cases))
ID a b
1 1 ab <NA>
2 1 <NA> ab
Regarding the issue in the OP's code: the logic checks whether there is at least one NA (> 0), which is true for every row here. Instead, it should check that all values are NA and then negate:
na_rows <- df %>%
  select(-"ID") %>%
  is.na() %>%
  {rowSums(.) == ncol(.)}
data
df <- structure(list(ID = c(1L, 1L, 1L), a = c("ab", NA, NA), b = c(NA,
"ab", NA)), class = "data.frame", row.names = c(NA, -3L))
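A pandas sketch of the same rule, built on the same three-row data: drop rows where every column except ID is NA, using dropna restricted to the non-ID columns with how="all".

```python
import pandas as pd
import numpy as np

# Same data as the R example: ID never NA, a and b sometimes NA.
df = pd.DataFrame({
    "ID": [1, 1, 1],
    "a": ["ab", np.nan, np.nan],
    "b": [np.nan, "ab", np.nan],
})

# Drop rows where all of a and b are NA
# (analogue of filter(!if_all(a:b, is.na))):
kept = df.dropna(subset=["a", "b"], how="all")
print(kept)
```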
How to drop rows of Pandas DataFrame whose value in a certain column is NaN
Don't drop, just take the rows where EPS is not NA:
df = df[df['EPS'].notna()]
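The built-in dropna restricted to a column subset is equivalent; the frame below is hypothetical (the original question's data is not shown here).

```python
import pandas as pd
import numpy as np

# Hypothetical frame with an EPS column containing an NA.
df = pd.DataFrame({"STK_ID": ["a", "b", "c"], "EPS": [1.0, np.nan, 2.0]})

# Boolean-mask form from the answer above:
via_notna = df[df["EPS"].notna()]

# Equivalent built-in form:
via_dropna = df.dropna(subset=["EPS"])
print(via_dropna)
```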