Drop Columns That Take Less Than N Values


You can use select_if() from dplyr, passing a function as a predicate to select columns:

library(dplyr)
df %>% select_if(function(col) n_distinct(col) > 10)

#   v2 V3 v4
# 1  T  a 12
# 2  R  k  7
# 3  L  l  1
# ...

Or, in dplyr version >= 1.0.0, use select() with where():

df %>%
  select(where(~ n_distinct(.) > 10))

Drop columns in Dataframe if more than 90% of the values in the column are 0's

First of all, next time please provide an example dataset rather than an image of one. It's best to give a minimal example that reproduces your problem (constructing one is also a good way to investigate it). This df, for example, will do the trick:

import pandas as pd

df = pd.DataFrame.from_dict({
    'a': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'b': [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]})

Now, the previous answers help, but if you can avoid a loop, it's preferable. You can write something simpler and more concise that will do the trick:

df.drop(columns=df.columns[df.eq(0).mean()>0.9])

Let's go through it step by step:

df.eq(0) returns True/False for each cell.

The .mean() method treats True as 1 and False as 0, so comparing that mean to 0.9 checks whether more than 90% of a column's values are 0.

df.columns[...] with that boolean mask returns only the columns for which the > 0.9 condition holds, and drop drops them.
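To make the intermediate steps visible on the example df above, here is a minimal runnable sketch (the printed fractions assume the 11-row frame):

import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                   'b': [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]})

zero_fraction = df.eq(0).mean()               # fraction of zeros per column
print(zero_fraction)                          # a ~0.909, b ~0.636
print(df.columns[zero_fraction > 0.9])        # Index(['a'], dtype='object')
print(df.drop(columns=df.columns[zero_fraction > 0.9]))  # only 'b' remains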

Quickly drop dataframe columns with only one distinct value

You can use the Series.unique() method to find all the unique elements in a column, and drop any column whose .unique() returns only one element. Example -

for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col, inplace=True, axis=1)

A method that does not do inplace dropping -

res = df
for col in df.columns:
    if len(df[col].unique()) == 1:
        res = res.drop(col, axis=1)

Demo -

In [154]: df = pd.DataFrame([[1,2,3],[1,3,3],[1,2,3]])

In [155]: for col in df.columns:
   .....:     if len(df[col].unique()) == 1:
   .....:         df.drop(col, inplace=True, axis=1)
   .....:

In [156]: df
Out[156]:
   1
0  2
1  3
2  2

Timing results -

In [166]: %paste
def func1(df):
    res = df
    for col in df.columns:
        if len(df[col].unique()) == 1:
            res = res.drop(col, axis=1)
    return res

## -- End pasted text --

In [172]: df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})

In [178]: %timeit func1(df)
1000 loops, best of 3: 1.05 ms per loop

In [180]: %timeit df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
100 loops, best of 3: 8.81 ms per loop

In [181]: %timeit df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
100 loops, best of 3: 5.81 ms per loop

The fastest method still seems to be the one using unique and looping through the columns.
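For comparison, a vectorized variant using nunique (a sketch, not part of the timings above; assumes a plain pandas DataFrame):

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 3, 3], [1, 2, 3]])

# keep only the columns that have more than one distinct value
res = df.loc[:, df.nunique() > 1]
print(res)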

How do I remove columns in Pandas that contain non-zero values in less than 1% of the rows?

You can use loc to select the specified columns or rows for your new df; essentially you can do this:

df.loc[rows, cols]  # accepts boolean lists/arrays

So the df with the columns removed can be achieved like this:

col_condition = df[df > 0].count() / df.shape[0] >= .01
df_ = df.loc[:, col_condition]

If you need to switch between columns and rows you can simply transpose the dataframe with

df.T

The same applies for rows, dropping those where the number of nonzeros is less than 1% of the number of columns:

row_condition = df.T[df.T > 0].count() / df.shape[1] >= .01
df_ = df[row_condition]

And in more concise form:

df_ = df.loc[:, df.gt(0).mean() >= .01]  # keep columns
df_ = df[df.T.gt(0).mean() >= .01]       # keep rows
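A runnable sketch on a toy frame (hypothetical column names; same 1% threshold):

import pandas as pd

df = pd.DataFrame({
    'mostly_zero': [1] + [0] * 199,   # 0.5% nonzero -> dropped
    'dense': range(200),              # almost all nonzero -> kept
})

df_cols = df.loc[:, df.gt(0).mean() >= .01]   # keep columns
df_rows = df[df.T.gt(0).mean() >= .01]        # keep rows
print(df_cols.columns.tolist())               # ['dense']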

Remove columns from df where one of 2 values has very few instances

There are probably more elegant solutions, but this should do the trick:

check_max_pct = lambda x : x.value_counts().max()/x.value_counts().sum() < 0.998
df = df.loc[:, [check_max_pct(df[col]) for col in df.columns]]

The lambda returns True (keep the column) when the most frequent value makes up less than 99.8% of all values (adjust by changing the 0.998 threshold), so it also extends to columns with more than two unique values where a single value accounts for almost all cases.
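A runnable sketch on a toy frame (hypothetical column names) showing which columns survive:

import pandas as pd

df = pd.DataFrame({
    'balanced': [0, 1] * 500,             # 50/50 split -> kept
    'almost_constant': [0] * 999 + [1],   # 99.9% zeros -> dropped
})

check_max_pct = lambda x: x.value_counts().max() / x.value_counts().sum() < 0.998
df = df.loc[:, [check_max_pct(df[col]) for col in df.columns]]
print(df.columns.tolist())                # ['balanced']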

How to remove rows where a column's value count is less than a particular number?

Use:

df = pd.DataFrame({'Brand':[1,2,3,3,3,3,3,3,3,3]})
df[df.apply(lambda x: df.value_counts()[x]>6)['Brand'].values]

Output:

   Brand
2      3
3      3
4      3
5      3
6      3
7      3
8      3
9      3

A more efficient way, if your data size is huge:

temp = df.value_counts()>6
df[df['Brand'].isin(temp[temp].index.get_level_values(0).values)]

Output:

   Brand
2      3
3      3
4      3
5      3
6      3
7      3
8      3
9      3

Another way:

df = pd.DataFrame({'Brand':[1,2,3,3,3,3,3,3,3,3]})
temp = df['Brand'].tolist()
df[df['Brand'].apply(lambda x: temp.count(x)>6)]

with the same output.
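Yet another variant (a sketch, not from the original answers): map each row's value to its frequency and filter on that.

import pandas as pd

df = pd.DataFrame({'Brand': [1, 2, 3, 3, 3, 3, 3, 3, 3, 3]})

counts = df['Brand'].map(df['Brand'].value_counts())  # per-row frequency of that value
print(df[counts > 6])   # rows whose Brand occurs more than 6 times (Brand == 3)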

How to drop columns which have same values in all rows via pandas or spark dataframe?

What we can do is use nunique to calculate the number of unique values in each column of the dataframe, and drop the columns which only have a single unique value:

In [285]:
nunique = df.nunique()
cols_to_drop = nunique[nunique == 1].index
df.drop(cols_to_drop, axis=1)

Out[285]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7

Another way is to just diff the numeric columns, take the absolute values, and sum them:

In [298]:
cols = df.select_dtypes([np.number]).columns
diff = df[cols].diff().abs().sum()
df.drop(diff[diff== 0].index, axis=1)

Out[298]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7

Another approach is to use the fact that the standard deviation of a column whose values are all the same is zero:

In [300]:
cols = df.select_dtypes([np.number]).columns
std = df[cols].std()
cols_to_drop = std[std==0].index
df.drop(cols_to_drop, axis=1)

Out[300]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7

Actually the above can be done in a one-liner:

In [306]:
df.drop(df.std()[(df.std() == 0)].index, axis=1)

Out[306]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7
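The question also mentions Spark. A sketch of the same distinct-count idea for a Spark DataFrame (assumes a local pyspark installation; the data and the extra constant column are made up for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
sdf = spark.createDataFrame(
    [(0, 345, "name1", 3, 1), (1, 12, "name2", 2, 1), (5, 2, "name6", 7, 1)],
    ["index", "id", "name", "data1", "constant"],
)

# count distinct values per column, then drop the single-valued columns
distinct_counts = sdf.agg(
    *[F.countDistinct(F.col(c)).alias(c) for c in sdf.columns]
).collect()[0]
cols_to_drop = [c for c in sdf.columns if distinct_counts[c] <= 1]
sdf.drop(*cols_to_drop).show()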

How to drop an entire record if more than 90% of features have missing values in pandas

You can use df.dropna() and set the thresh parameter to the value that corresponds to 10% of your columns; thresh is the minimum number of non-NA values required to keep a row.

df.dropna(axis=0, thresh=50, inplace=True)
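The hard-coded 50 presumably corresponds to a frame with about 500 feature columns. A sketch that derives thresh from the column count instead (toy frame, made-up shape):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 10))   # 5 rows x 10 feature columns
df.iloc[2, :] = np.nan                     # one row is entirely missing

min_non_na = int(np.ceil(0.10 * df.shape[1]))       # 10% of the columns
df.dropna(axis=0, thresh=min_non_na, inplace=True)  # drop rows with >90% missing
print(df.shape)                                     # (4, 10)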

How to drop columns according to NaN percentage for a dataframe?

You can use isnull with mean to get the per-column NaN fraction and then remove columns by boolean indexing with loc (loc because you are selecting columns). Note that the condition needs to be inverted: keeping columns where the fraction is < .8 removes all columns where it is >= .8:

df = df.loc[:, df.isnull().mean() < .8]

Sample:

np.random.seed(100)
df = pd.DataFrame(np.random.random((100,5)), columns=list('ABCDE'))
df.loc[:80, 'A'] = np.nan
df.loc[:5, 'C'] = np.nan
df.loc[20:, 'D'] = np.nan

print (df.isnull().mean())
A 0.81
B 0.00
C 0.06
D 0.80
E 0.00
dtype: float64

df = df.loc[:, df.isnull().mean() < .8]
print (df.head())
          B   C         E
0  0.278369 NaN  0.004719
1  0.670749 NaN  0.575093
2  0.209202 NaN  0.219697
3  0.811683 NaN  0.274074
4  0.940030 NaN  0.175410

If you want to remove columns by a minimal number of non-NaN values, dropna works nicely with the thresh parameter and axis=1 for removing columns:

np.random.seed(1997)
df = pd.DataFrame(np.random.choice([np.nan,1], p=(0.8,0.2),size=(10,10)))
print (df)
0 1 2 3 4 5 6 7 8 9
0 NaN NaN NaN 1.0 1.0 NaN NaN NaN NaN NaN
1 1.0 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN NaN
3 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN 1.0 NaN NaN NaN 1.0
5 NaN NaN NaN 1.0 1.0 NaN NaN 1.0 NaN 1.0
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN
9 1.0 NaN NaN NaN 1.0 NaN NaN 1.0 NaN NaN

df1 = df.dropna(thresh=2, axis=1)
print (df1)
0 3 4 5 7 9
0 NaN 1.0 1.0 NaN NaN NaN
1 1.0 NaN NaN NaN NaN NaN
2 NaN NaN NaN 1.0 NaN NaN
3 NaN NaN 1.0 NaN NaN NaN
4 NaN NaN NaN 1.0 NaN 1.0
5 NaN 1.0 1.0 NaN 1.0 1.0
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN 1.0 NaN
9 1.0 NaN 1.0 NaN 1.0 NaN

EDIT: For non-Boolean data

Total number of NaN entries in a column must be less than 80% of total entries:

df = df.loc[:, df.isnull().sum() < 0.8 * df.shape[0]]

