How to detect and remove a column that contains only null values?
How to detect whether a given column contains only the NULL value:
SELECT 1 -- no GROUP BY therefore use a literal
FROM Locations
HAVING COUNT(a) = 0
AND COUNT(*) > 0;
The result set will either consist of zero rows (column a has a non-NULL value) or one row (column a has only the NULL value). FWIW, this code is Standard SQL-92.
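The trick relies on COUNT(a) skipping NULLs while COUNT(*) does not. A minimal sketch of the same comparison, using Python's sqlite3 and a hypothetical Locations table (assumed columns a and b), for anyone who wants to try it without a server:

```python
import sqlite3

# COUNT(col) ignores NULLs, COUNT(*) does not; comparing the two
# detects an all-NULL column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Locations (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO Locations VALUES (?, ?)",
                 [(None, 1), (None, 2), (None, 3)])

cnt_a, cnt_all = conn.execute(
    "SELECT COUNT(a), COUNT(*) FROM Locations").fetchone()
a_is_all_null = (cnt_a == 0 and cnt_all > 0)
print(a_is_all_null)  # True

# Give column a one non-NULL value and check again.
conn.execute("UPDATE Locations SET a = 9 WHERE b = 1")
cnt_a, cnt_all = conn.execute(
    "SELECT COUNT(a), COUNT(*) FROM Locations").fetchone()
print(cnt_a == 0 and cnt_all > 0)  # False
```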
How do I know whether to remove the column or the rows when dealing with null data?
Yes, you can decide on a threshold for this.
If you have NaN values across all columns, it is usually best to use:
data.dropna(axis=0, inplace=True)
This drops all rows that contain NaNs; if you use axis=1 instead, it deletes all columns that contain NaN values.
One thing you need to consider is what percentage of the values in a column is NaN. If more than 70% of a single column is NaN and I have no other way to fill it in, I delete that column.
If the NaN values are distributed across the columns, it is better to delete rows.
I hope this helps.
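The 70% rule of thumb above can be sketched in pandas like this (toy data; the column names and the 0.7 threshold are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "b" is 75% NaN, column "a" is complete.
df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [np.nan, np.nan, np.nan, 10]})

threshold = 0.7                  # drop a column if > 70% of it is NaN
nan_share = df.isna().mean()     # fraction of NaNs per column
to_drop = nan_share[nan_share > threshold].index
df = df.drop(columns=to_drop)
print(list(df.columns))  # ['a']
```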
How can I inexpensively determine if a column contains only NULL records?
What about this:
SELECT
SUM(CASE WHEN column_1 IS NOT NULL THEN 1 ELSE 0 END) AS column_1_count,
SUM(CASE WHEN column_2 IS NOT NULL THEN 1 ELSE 0 END) AS column_2_count,
...
FROM table_name
?
You can easily generate this query if you use the INFORMATION_SCHEMA.COLUMNS table.
EDIT:
Another idea:
SELECT MAX(column_1), MAX(column_2), ... FROM table_name
If the result contains a value, the column is populated (MAX ignores NULLs, so it returns NULL only when every value is NULL). It should require only one table scan.
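A quick sqlite3 sketch of the MAX idea, with a hypothetical two-column table where c2 is entirely NULL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c1 INTEGER, c2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, None), (2, None)])

# MAX(col) is NULL iff every value in col is NULL, so one scan
# answers the question for all columns at once.
row = conn.execute("SELECT MAX(c1), MAX(c2) FROM t").fetchone()
print(row)  # (2, None) -> c2 is all NULL
```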
Select columns with NULL values only
Here is the SQL Server 2005 (or later) version. Replace ADDR_Address with your table name.
declare @col varchar(255), @cmd varchar(max)
DECLARE getinfo cursor for
SELECT c.name FROM sys.tables t JOIN sys.columns c ON t.Object_ID = c.Object_ID
WHERE t.Name = 'ADDR_Address'
OPEN getinfo
FETCH NEXT FROM getinfo into @col
WHILE @@FETCH_STATUS = 0
BEGIN
SELECT @cmd = 'IF NOT EXISTS (SELECT top 1 * FROM ADDR_Address WHERE [' + @col + '] IS NOT NULL) BEGIN print ''' + @col + ''' end'
EXEC(@cmd)
FETCH NEXT FROM getinfo into @col
END
CLOSE getinfo
DEALLOCATE getinfo
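The same per-column loop can be sketched in Python with sqlite3 (the ADDR_Address table and its columns here are made up for the demo; PRAGMA table_info stands in for sys.columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ADDR_Address (street TEXT, unit TEXT)")
conn.executemany("INSERT INTO ADDR_Address VALUES (?, ?)",
                 [("Main St", None), ("Oak Ave", None)])

all_null_cols = []
# PRAGMA table_info yields (cid, name, type, notnull, dflt, pk) rows.
for _, name, *_ in conn.execute("PRAGMA table_info(ADDR_Address)"):
    # EXISTS(... IS NOT NULL) is 0 iff no non-NULL value was found.
    (has_value,) = conn.execute(
        f'SELECT EXISTS(SELECT 1 FROM ADDR_Address '
        f'WHERE "{name}" IS NOT NULL)').fetchone()
    if not has_value:
        all_null_cols.append(name)
print(all_null_cols)  # ['unit']
```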
Delete rows if there are null values in a specific column in Pandas dataframe
If the relevant entries in Charge_Per_Line are empty (NaN) when you read them into pandas, you can use df.dropna:
df = df.dropna(axis=0, subset=['Charge_Per_Line'])
If the values are genuinely -, then you can replace them with np.nan and then use df.dropna:
import numpy as np
df['Charge_Per_Line'] = df['Charge_Per_Line'].replace('-', np.nan)
df = df.dropna(axis=0, subset=['Charge_Per_Line'])
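Putting the two steps together on a toy frame (the Item column is a made-up second column for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame where missing charges were exported as "-".
df = pd.DataFrame({"Charge_Per_Line": ["1.50", "-", "2.25"],
                   "Item": ["A", "B", "C"]})

# Turn the placeholder into a real NaN, then drop those rows.
df["Charge_Per_Line"] = df["Charge_Per_Line"].replace("-", np.nan)
df = df.dropna(axis=0, subset=["Charge_Per_Line"])
print(df["Item"].tolist())  # ['A', 'C']
```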
Remove columns from dataframe where ALL values are NA, NULL or empty
We can use Filter:
Filter(function(x) !(all(x=="")), df)
# Var1 Var3
#1 2R+ 52
#2 2R+ 169
#3 2R+ 83
#4 2R+ 98
#5 2R+ NA
#6 2R+ 111
#7 2R+ 94
#8 2R+ 116
#9 2R+ 86
NOTE: It should also work if all the elements are NA for a particular column
df$Var3 <- NA
Filter(function(x) !(all(x=="")), df)
# Var1
#1 2R+
#2 2R+
#3 2R+
#4 2R+
#5 2R+
#6 2R+
#7 2R+
#8 2R+
#9 2R+
Update
Based on the updated dataset, if we need to remove the columns with only 0 values, then change the code to
Filter(function(x) !(all(x==""|x==0)), df2)
# VAR1 VAR3 VAR4 VAR7
#1 2R+ 52 1.05 30
#2 2R+ 169 1.02 40
#3 2R+ 83 NA 40
#4 2R+ 98 1.16 40
#5 2R+ 154 1.11 40
#6 2R+ 111 NA 15
data
df2 <- structure(list(VAR1 = c("2R+", "2R+", "2R+", "2R+", "2R+", "2R+"
), VAR2 = c("", "", "", "", "", ""), VAR3 = c(52L, 169L, 83L,
98L, 154L, 111L), VAR4 = c(1.05, 1.02, NA, 1.16, 1.11, NA), VAR5 = c(0L,
0L, 0L, 0L, 0L, 0L), VAR6 = c(0L, 0L, 0L, 0L, 0L, 0L), VAR7 = c(30L,
40L, 40L, 40L, 40L, 15L)), .Names = c("VAR1", "VAR2", "VAR3",
"VAR4", "VAR5", "VAR6", "VAR7"), row.names = c("1", "2", "3",
"4", "5", "6"), class = "data.frame")
Remove NaN/NULL columns in a Pandas dataframe?
Yes, dropna. See http://pandas.pydata.org/pandas-docs/stable/missing_data.html and the DataFrame.dropna docstring:
Definition: DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None)
Docstring:
Return object with labels on given axis omitted where alternately any
or all of the data are missing
Parameters
----------
axis : {0, 1}
how : {'any', 'all'}
any : if any NA values are present, drop that label
all : if all values are NA, drop that label
thresh : int, default None
int value : require that many non-NA values
subset : array-like
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include
Returns
-------
dropped : DataFrame
The specific command to run would be:
df = df.dropna(axis=1, how='all')
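For instance, on a toy frame with one all-NaN column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"keep": [1, np.nan, 3],
                   "all_nan": [np.nan, np.nan, np.nan]})

# how='all' drops a column only when every value in it is NaN.
df = df.dropna(axis=1, how="all")
print(list(df.columns))  # ['keep']
```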
Efficient way to find columns that contain ANY null values
Spark's SQL function any can check if any value of a column meets a condition.
from pyspark.sql import functions as F
data = [[1,2,3],[None, 5, 6], [7, None, 9]]
df = spark.createDataFrame(data, schema=["col1", "col2", "col3"])
cols = [f"any({col} is null) as {col}_contains_null" for col in df.columns]
df.selectExpr(cols).show()
Output:
+------------------+------------------+------------------+
|col1_contains_null|col2_contains_null|col3_contains_null|
+------------------+------------------+------------------+
| true| true| false|
+------------------+------------------+------------------+
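For comparison, the same per-column "contains any null?" check can be done in plain pandas with isna().any(), no Spark session required (same toy data as above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, np.nan, 7],
                   "col2": [2, 5, np.nan],
                   "col3": [3, 6, 9]})

# isna() marks nulls; any() reduces each column to a single boolean.
print(df.isna().any().to_dict())
# {'col1': True, 'col2': True, 'col3': False}
```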