How to Detect and Remove a Column That Contains Only Null Values


How to detect whether a given column has only the NULL value:

SELECT 1  -- no GROUP BY therefore use a literal
FROM Locations
HAVING COUNT(a) = 0
AND COUNT(*) > 0;

The resultset will either consist of zero rows (column a has a non-NULL value) or one row (column a has only the NULL value). FWIW this code is Standard SQL-92.
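The trick relies on COUNT(a) counting only non-NULL values while COUNT(*) counts all rows. A quick way to see that from Python's sqlite3, using a made-up Locations table (the column names and data here are assumptions for illustration):

```python
import sqlite3

# Made-up stand-in for the Locations table; column "a" is all NULL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Locations (a INTEGER, b INTEGER)")
con.executemany("INSERT INTO Locations VALUES (?, ?)", [(None, 1), (None, 2)])

# COUNT(a) counts only non-NULL values of a; COUNT(*) counts all rows.
count_a, count_all = con.execute(
    "SELECT COUNT(a), COUNT(*) FROM Locations"
).fetchone()
print(count_a, count_all)  # 0 2 -> column a contains only NULLs
```

When COUNT(a) is 0 while COUNT(*) is positive, the HAVING clause above matches and the query returns exactly one row.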

How do I know whether to remove the column, or rows when dealing with null data?

Yes, you can decide on a threshold for this.
If NaN values appear across all columns, it is best to use:

data.dropna(axis=0, inplace=True)

This drops all rows that contain NaNs; if you pass axis=1 instead, it deletes all columns that contain NaN values.

One thing you need to consider is what percentage of the values in a column is NaN. If more than 70% of the values in a single column are NaN and I have no other way to fill them in, I delete that column.
If the NaN values are spread across many columns, it is better to delete rows.

I hope this helps.
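A hedged pandas sketch of that threshold idea (the 70% cutoff, column names, and data below are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "b" is 75% NaN, so it crosses a 70% cutoff.
data = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 4],
})

threshold = 0.7
# isna().mean() gives the fraction of NaN values per column.
mostly_nan = data.columns[data.isna().mean() > threshold]
data = data.drop(columns=mostly_nan)
print(list(data.columns))  # ['a']
```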

How can I inexpensively determine if a column contains only NULL records?

What about this:

SELECT
SUM(CASE WHEN column_1 IS NOT NULL THEN 1 ELSE 0 END) AS column_1_count,
SUM(CASE WHEN column_2 IS NOT NULL THEN 1 ELSE 0 END) AS column_2_count,
...
FROM table_name

?

You can easily create this query if you use INFORMATION_SCHEMA.COLUMNS table.

EDIT:

Another idea:

SELECT MAX(column_1), MAX(column_2),..... FROM table_name

If the result contains a non-NULL value for a column, that column is populated. It should require only one table scan. (Note that MAX ignores NULLs, which is what makes this work.)
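The MAX idea is easy to verify from Python's sqlite3; the table and data below are made up for illustration:

```python
import sqlite3

# Hypothetical table: col_a holds values, col_b is entirely NULL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col_a INTEGER, col_b INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(1, None), (2, None)])

# MAX ignores NULLs, so an all-NULL column yields NULL (None in Python).
max_a, max_b = con.execute("SELECT MAX(col_a), MAX(col_b) FROM t").fetchone()
print(max_a, max_b)  # 2 None
```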

Select columns with NULL values only

Here is a version for SQL Server 2005 or later; replace ADDR_Address with your table name.

declare @col varchar(255), @cmd varchar(max)

DECLARE getinfo cursor for
SELECT c.name FROM sys.tables t JOIN sys.columns c ON t.Object_ID = c.Object_ID
WHERE t.Name = 'ADDR_Address'

OPEN getinfo

FETCH NEXT FROM getinfo into @col

WHILE @@FETCH_STATUS = 0
BEGIN
SELECT @cmd = 'IF NOT EXISTS (SELECT top 1 * FROM ADDR_Address WHERE [' + @col + '] IS NOT NULL) BEGIN print ''' + @col + ''' end'
EXEC(@cmd)

FETCH NEXT FROM getinfo into @col
END

CLOSE getinfo
DEALLOCATE getinfo

Delete rows if there are null values in a specific column in Pandas dataframe

If the relevant entries in Charge_Per_Line are empty (NaN) when you read into pandas, you can use df.dropna:

df = df.dropna(axis=0, subset=['Charge_Per_Line'])

If the values are genuinely -, then you can replace them with np.nan and then use df.dropna:

import numpy as np

df['Charge_Per_Line'] = df['Charge_Per_Line'].replace('-', np.nan)
df = df.dropna(axis=0, subset=['Charge_Per_Line'])
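A small runnable illustration of those two steps (the frame and its values are made up; only the Charge_Per_Line column name comes from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a '-' placeholder standing in for a missing value.
df = pd.DataFrame({
    "Charge_Per_Line": ["0.5", "-", "1.2"],
    "City": ["A", "B", "C"],
})

# Turn the placeholder into a real NaN, then drop rows missing that column.
df["Charge_Per_Line"] = df["Charge_Per_Line"].replace("-", np.nan)
df = df.dropna(axis=0, subset=["Charge_Per_Line"])
print(len(df))  # 2
```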

Remove columns from dataframe where ALL values are NA, NULL or empty

We can use Filter

Filter(function(x) !(all(x=="")), df)
# Var1 Var3
#1 2R+ 52
#2 2R+ 169
#3 2R+ 83
#4 2R+ 98
#5 2R+ NA
#6 2R+ 111
#7 2R+ 94
#8 2R+ 116
#9 2R+ 86

NOTE: It should also work if all the elements are NA for a particular column

df$Var3 <- NA
Filter(function(x) !(all(x=="")), df)
# Var1
#1 2R+
#2 2R+
#3 2R+
#4 2R+
#5 2R+
#6 2R+
#7 2R+
#8 2R+
#9 2R+

Update

Based on the updated dataset, if we need to remove the columns with only 0 values, then change the code to

Filter(function(x) !(all(x==""|x==0)), df2)
# VAR1 VAR3 VAR4 VAR7
#1 2R+ 52 1.05 30
#2 2R+ 169 1.02 40
#3 2R+ 83 NA 40
#4 2R+ 98 1.16 40
#5 2R+ 154 1.11 40
#6 2R+ 111 NA 15

data

df2 <- structure(list(VAR1 = c("2R+", "2R+", "2R+", "2R+", "2R+", "2R+"
), VAR2 = c("", "", "", "", "", ""), VAR3 = c(52L, 169L, 83L,
98L, 154L, 111L), VAR4 = c(1.05, 1.02, NA, 1.16, 1.11, NA), VAR5 = c(0L,
0L, 0L, 0L, 0L, 0L), VAR6 = c(0L, 0L, 0L, 0L, 0L, 0L), VAR7 = c(30L,
40L, 40L, 40L, 40L, 15L)), .Names = c("VAR1", "VAR2", "VAR3",
"VAR4", "VAR5", "VAR6", "VAR7"), row.names = c("1", "2", "3",
"4", "5", "6"), class = "data.frame")

Remove NaN/NULL columns in a Pandas dataframe?

Yes, dropna. See http://pandas.pydata.org/pandas-docs/stable/missing_data.html and the DataFrame.dropna docstring:

Definition: DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None)
Docstring:
Return object with labels on given axis omitted where alternately any
or all of the data are missing

Parameters
----------
axis : {0, 1}
how : {'any', 'all'}
any : if any NA values are present, drop that label
all : if all values are NA, drop that label
thresh : int, default None
int value : require that many non-NA values
subset : array-like
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include

Returns
-------
dropped : DataFrame

The specific command to run would be:

df = df.dropna(axis=1, how='all')
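A minimal runnable illustration (the frame here is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical frame where column "b" is entirely NaN.
df = pd.DataFrame({"a": [1, 2], "b": [np.nan, np.nan]})

# how='all' drops only columns whose every value is missing.
df = df.dropna(axis=1, how='all')
print(list(df.columns))  # ['a']
```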

Efficient way to find columns that contain ANY null values

Spark's SQL function any can check if any value of a column meets a condition.

from pyspark.sql import functions as F

data = [[1,2,3],[None, 5, 6], [7, None, 9]]
df = spark.createDataFrame(data, schema=["col1", "col2", "col3"])

cols = [f"any({col} is null) as {col}_contains_null" for col in df.columns]
df.selectExpr(cols).show()

Output:

+------------------+------------------+------------------+
|col1_contains_null|col2_contains_null|col3_contains_null|
+------------------+------------------+------------------+
| true| true| false|
+------------------+------------------+------------------+
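For a small-data sanity check of the same idea, here is a pandas equivalent using the same values (pandas rather than Spark, so this is a comparison sketch, not the answer's method):

```python
import pandas as pd

# Hypothetical frame mirroring the Spark example's data.
df = pd.DataFrame({
    "col1": [1, None, 7],
    "col2": [2, 5, None],
    "col3": [3, 6, 9],
})

# isna().any() flags each column that contains at least one null.
contains_null = df.isna().any()
print(contains_null.to_dict())  # {'col1': True, 'col2': True, 'col3': False}
```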

