How to Remove "Rows" with a Na Value

Remove rows with all or some NAs (missing values) in data.frame

Also check complete.cases:

> final[complete.cases(final), ]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

na.omit is nicer for simply removing all rows containing NAs. complete.cases allows partial selection by including only certain columns of the data frame:

> final[complete.cases(final[ , 5:6]), ]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2
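For comparison, na.omit(final) drops every row containing any NA, which here matches the full complete.cases result above:

> na.omit(final)
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2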

Your solution can't work. If you insist on using is.na, then you have to do something like:

> final[rowSums(is.na(final[ , 5:6])) == 0, ]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

but using complete.cases is quite a lot clearer, and faster.
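If you want to verify the speed claim on your own data, here is a minimal sketch using the microbenchmark package (assuming it is installed):

library(microbenchmark)
microbenchmark(
  complete.cases = final[complete.cases(final), ],
  rowSums        = final[rowSums(is.na(final)) == 0, ]
)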

How to remove a row if it has an NA value in a certain column

The easiest solution is to use is.na():

df[!is.na(df$B), ]

which gives you:

   A B  C
1 NA 2 NA
2  1 2  3
4  1 2  3
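The df itself is not shown in the original; for reference, a data frame consistent with that output would be (row 3's A and C values are arbitrary here, only B = NA matters):

df <- data.frame(A = c(NA, 1, NA, 1),
                 B = c(2, 2, NA, 2),
                 C = c(NA, 3, 1, 3))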

Remove rows with a complete set of NAs

We can use dplyr. With the example by @lovalery:

library(dplyr)

df %>% filter(!if_all(V2:V3, is.na))

#>   V1 V2 V3
#> 1  3  3 NA
#> 2 NA  1 NA
#> 3  3  5 NA
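The df from @lovalery's answer is not reproduced here; a data frame consistent with the output above would be:

df <- data.frame(V1 = c(3, NA, 3, NA),
                 V2 = c(3, 1, 5, NA),
                 V3 = c(NA, NA, NA, NA))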

We can use many different selection statements inside if_all. Check the documentation for more examples.

Deleting rows corresponding to a particular string, but keeping NA values

We can combine the condition with is.na via | so that rows where Col1 is NA are kept as well:

 subset(df, Col1 != 'string' | is.na(Col1))
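A minimal sketch with a hypothetical df (the Col1 values are invented for illustration):

df <- data.frame(Col1 = c("string", "keep", NA, "other"))
subset(df, Col1 != 'string' | is.na(Col1))
#    Col1
# 2  keep
# 3  <NA>
# 4 other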

Remove rows which have NA values

I have never used datatable, but pandas.DataFrame has isna() to select rows with NA and drop() to remove rows (it can also use del for this), and I found similar functions for datatable.

  • Official documentation: Delete Rows/Columns
  • Other place (found with Google): Filtering out Empty Rows

datatable can use del to remove selected rows. It can also use .isna() or == None to select rows with NA. The problem is that it can filter on only one column at a time, so it may need a for-loop to check different columns.

columns = dt.f[:]

for n in range(1, 5):
    rows = (dt.f[f'B_lag_{n}'] == None)
    del df[rows, columns]

print(df)

This removes the values from the datatable Frame but not the rows themselves, so it creates empty rows like this:

   |     A          B    B_lag_1    B_lag_2    B_lag_3    B_lag_4
   | int64    float64    float64    float64    float64    float64
-- + -----  ---------  ---------  ---------  ---------  ---------
 0 |    NA         NA         NA         NA         NA         NA
 1 |    NA         NA         NA         NA         NA         NA
 2 |    NA         NA         NA         NA         NA         NA
 3 |    NA         NA         NA         NA         NA         NA
 4 |     0  -0.234153    1.52303   0.647689  -0.138264   0.496714
 5 |    NA         NA         NA         NA         NA         NA
 6 |    NA         NA         NA         NA         NA         NA
 7 |    NA         NA         NA         NA         NA         NA
 8 |    NA         NA         NA         NA         NA         NA
 9 |     1    0.54256  -0.469474   0.767435    1.57921  -0.234137
[10 rows x 6 columns]

It is better to keep the rows which don't have None:

columns = dt.f[:]

for n in range(1, 5):
    rows = (dt.f[f'B_lag_{n}'] != None)
    df = df[rows, columns]

print(df)

Result:

   |     A          B    B_lag_1    B_lag_2    B_lag_3    B_lag_4
   | int64    float64    float64    float64    float64    float64
-- + -----  ---------  ---------  ---------  ---------  ---------
 0 |     0  -0.234153    1.52303   0.647689  -0.138264   0.496714
 1 |     1    0.54256  -0.469474   0.767435    1.57921  -0.234137
[2 rows x 6 columns]

But you can use & (as the AND operator) and | (as the OR operator) to do the same without a for-loop.

columns = dt.f[:]

rows = (dt.f['B_lag_1'] != None) & (dt.f['B_lag_2'] != None) & (dt.f['B_lag_3'] != None) & (dt.f['B_lag_4'] != None)

df = df[rows, columns]

print(df)

Later I found that datatable has dt.rowall() and dt.rowany() to work with many columns, so the code can be simpler.

  • Other place (found with Google): Filtering across Multiple Columns

rowall() works like the AND operator; rowany() works like the OR operator.

columns = dt.f[:]

rows = dt.rowall(dt.f['B_lag_1', 'B_lag_2', 'B_lag_3', 'B_lag_4'] != None)
#rows = dt.rowall(dt.f['B_lag_1':'B_lag_4'] != None) # range of columns
#rows = dt.rowall(dt.f[:] != None) # all columns

df = df[rows, columns]

print(df)

Full working code:

I took the code from my previous answer, Create many lagged variables.

import datatable as dt
import numpy as np

def test1(df):
    print('\n--- test 1 ---\n')

    df = df.copy()

    #columns = dt.f['A', 'B', 'B_lag_1', 'B_lag_2', 'B_lag_3', 'B_lag_4']
    #columns = df.keys()
    columns = dt.f[:]

    for n in range(1, 5):
        rows = (dt.f[f'B_lag_{n}'] == None)
        del df[rows, columns]

    print(df)

def test2(df):
    print('\n--- test 2 ---\n')

    df = df.copy()

    #columns = dt.f['A', 'B', 'B_lag_1', 'B_lag_2', 'B_lag_3', 'B_lag_4']
    #columns = df.keys()
    columns = dt.f[:]

    for n in range(1, 5):
        rows = (dt.f[f'B_lag_{n}'] != None)
        df = df[rows, columns]

    print(df)

def test3(df):
    print('\n--- test 3 ---\n')

    df = df.copy()

    rows = (dt.f['B_lag_1'] != None) & (dt.f['B_lag_2'] != None) & (dt.f['B_lag_3'] != None) & (dt.f['B_lag_4'] != None)

    columns = dt.f[:]

    df = df[rows, columns]

    print(df)

def test4(df):
    print('\n--- test 4 ---\n')

    df = df.copy()

    columns = dt.f[:]

    #rows = dt.rowall(dt.f['B_lag_1', 'B_lag_2', 'B_lag_3', 'B_lag_4'] != None) # use selected columns
    #rows = dt.rowall(dt.f['B_lag_1':'B_lag_4'] != None) # use columns in some range
    #rows = dt.rowall(dt.f[float] != None) # use columns which have float values
    rows = dt.rowall(dt.f[:] != None) # use all columns

    df = df[rows, columns]

    print(df)


# --- main ---

np.random.seed(42)

df = dt.Frame({
    "A": np.repeat(np.arange(0, 2), 5),
    "B": np.random.normal(0, 1, 10)
})

for n in range(1, 5):
    df[f'B_lag_{n}'] = df[:, dt.shift(dt.f.B, n), dt.by('A')]['B']

# --- tests ---

test1(df)
test2(df)
test3(df)
test4(df)

How to remove rows with an NA value?

dat <- data.frame(x1 = c(1, 2, 3, NA, 5), x2 = c(100, NA, 300, 400, 500))

na.omit(dat)
  x1  x2
1  1 100
3  3 300
5  5 500
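The same result can be had with tidyr's drop_na (a sketch, assuming the tidyr package is installed):

library(tidyr)
drop_na(dat)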

Remove rows from a dataframe based on value, ignoring NAs

We need to select the rows where d is not greater than 0, or where d is NA:

df[with(df, !d > 0 | is.na(d)), ]

#   a  b c  d
#1  1  2 3  0
#2  4 NA 1 NA
#4 NA  5 0  0
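df is not shown in the original; a data frame consistent with this output would be:

df <- data.frame(a = c(1, 4, 2, NA),
                 b = c(2, NA, 3, 5),
                 c = c(3, 1, 5, 0),
                 d = c(0, NA, 2, 0))

Row 3 is dropped because its d value (2) is greater than 0.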

Or we can also use subset:

subset(df, !d > 0 | is.na(d))

or dplyr's filter:

library(dplyr)
df %>% filter(!d > 0 | is.na(d))

Since ! has lower precedence than > in R, !d > 0 is read as !(d > 0). With the whole-number d values here, that part can also be rewritten as

subset(df, d < 1 | is.na(d))

to get the same result.

Remove rows where all columns except one have NA values?

We may use if_all in filter: select the columns a to b in if_all and apply is.na (the check for NA). The result is TRUE for a row only if both a and b are NA; negate (!) to convert TRUE to FALSE and FALSE to TRUE.

library(dplyr)
df %>%
  filter(!if_all(a:b, is.na))

-output

  ID    a    b
1  1   ab <NA>
2  1 <NA>   ab

Or instead of negating (!), we may use complete.cases with if_any:

df %>%
  filter(if_any(a:b, complete.cases))

  ID    a    b
1  1   ab <NA>
2  1 <NA>   ab

Regarding the issue in the OP's code, the logic checks whether there is at least one NA per row (> 0), which is true for all the rows. Instead, it should check whether all values are NA, and then negate:

na_rows <- df %>%
  select(-"ID") %>%
  is.na() %>%
  {rowSums(.) == ncol(.)}
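na_rows can then be used to drop those rows:

df[!na_rows, ]
#   ID    a    b
# 1  1   ab <NA>
# 2  1 <NA>   ab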

data

df <- structure(list(ID = c(1L, 1L, 1L), a = c("ab", NA, NA), b = c(NA, 
"ab", NA)), class = "data.frame", row.names = c(NA, -3L))

How to drop rows of Pandas DataFrame whose value in a certain column is NaN

Don't drop, just take the rows where EPS is not NA:

df = df[df['EPS'].notna()]

