How to check whether a pandas DataFrame is empty?
You can use the attribute df.empty to check whether it's empty or not:
    if df.empty:
        print('DataFrame is empty!')
Source: Pandas Documentation
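The pandas documentation adds a caveat that is easier to see in code: a frame holding only NaN still has rows, so it does not count as empty. A minimal sketch, with made-up column names and variable names chosen purely for illustration:

```python
import numpy as np
import pandas as pd

# No rows at all, even though columns are defined: empty.
no_rows = pd.DataFrame(columns=["a", "b"])

# One row containing only NaN: the row still counts, so not empty.
only_nan = pd.DataFrame({"a": [np.nan], "b": [np.nan]})

# Dropping the all-NaN rows leaves nothing behind.
after_dropna = only_nan.dropna(how="all")

print(no_rows.empty)       # True
print(only_nan.empty)      # False
print(after_dropna.empty)  # True
```

If "empty" should mean "no usable values", dropping the NaN rows first, as in the last step, is one way to get that behaviour.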
How to check whether a DataFrame is empty?
IIUC, there is an .empty attribute:
DataFrame:
In [86]: pd.DataFrame().empty
Out[86]: True
In [87]: pd.DataFrame([1,2,3]).empty
Out[87]: False
Series:
In [88]: pd.Series().empty
Out[88]: True
In [89]: pd.Series([1,2,3]).empty
Out[89]: False
NOTE: checking the length of the DataFrame with len(df) might save you a few microseconds compared to the df.empty attribute ;-)
In [142]: df = pd.DataFrame()
In [143]: %timeit df.empty
8.25 µs ± 22.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [144]: %timeit len(df)
2.35 µs ± 7.56 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [145]: df = pd.DataFrame(np.random.randn(10*5, 3), columns=['a', 'b', 'c'])
In [146]: %timeit df.empty
15.3 µs ± 269 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [147]: %timeit len(df)
3.58 µs ± 12.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
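One caveat before swapping df.empty for len(df): the two do not agree in every case. len(df) counts rows only, while df.empty is true whenever any axis has length zero. A small sketch of the situation where they differ, a frame with an index but no columns (the variable name is illustrative):

```python
import pandas as pd

# Two index labels, zero columns.
rows_no_cols = pd.DataFrame(index=[0, 1])

print(len(rows_no_cols))   # 2
print(rows_no_cols.empty)  # True
print(rows_no_cols.shape)  # (2, 0)
```

For frames that always have at least one column, the two checks should give the same answer.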
Check if Dataframe is empty and print results
You need to change the if to:
if df.empty == True:
or, more idiomatically:
if df.empty:
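The explicit attribute is needed because a DataFrame cannot be used directly as a condition: writing `if df:` raises a ValueError. A short sketch, using a hypothetical helper named describe that is not part of the original question:

```python
import pandas as pd


def describe(df: pd.DataFrame) -> str:
    """Return a short message depending on whether df has any data."""
    if df.empty:
        return "DataFrame is empty!"
    return f"DataFrame has {len(df)} row(s)"


print(describe(pd.DataFrame()))
print(describe(pd.DataFrame({"a": [1, 2, 3]})))

# Using the frame itself as a condition is ambiguous and raises.
try:
    if pd.DataFrame({"a": [1, 2, 3]}):
        pass
except ValueError as exc:
    print(type(exc).__name__)  # ValueError
```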
Find empty or NaN entry in Pandas Dataframe
np.where(pd.isnull(df)) returns the row and column indices where the value is NaN:
In [152]: import numpy as np
In [153]: import pandas as pd
In [154]: np.where(pd.isnull(df))
Out[154]: (array([2, 5, 6, 6, 7, 7]), array([7, 7, 6, 7, 6, 7]))
In [155]: df.iloc[2,7]
Out[155]: nan
In [160]: [df.iloc[i,j] for i,j in zip(*np.where(pd.isnull(df)))]
Out[160]: [nan, nan, nan, nan, nan, nan]
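The session above relies on a df that is not shown. Here is a self-contained sketch of the same technique on a small, made-up frame, with an extra step that maps the positions back to index and column labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, 6.0],
})

# Boolean mask of missing cells, converted to a plain NumPy array.
mask = df.isna().to_numpy()

# Row and column positions of every True cell, in row-major order.
rows, cols = np.where(mask)
positions = [(int(r), int(c)) for r, c in zip(rows, cols)]
print(positions)  # [(0, 1), (1, 0)]

# Translate positions into index and column labels.
labels = [(df.index[r], df.columns[c]) for r, c in positions]
print(labels)
```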
Finding values which are empty strings could be done with applymap (deprecated in favour of DataFrame.map since pandas 2.1):
In [182]: np.where(df.applymap(lambda x: x == ''))
Out[182]: (array([5]), array([7]))
Note that using applymap requires calling a Python function once for each cell of the DataFrame. That could be slow for a large DataFrame, so it would be better to arrange for all the blank cells to contain NaN instead, so that you can use pd.isnull.
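As a sketch of that suggestion, here are two vectorized alternatives on a small, made-up frame: comparing the whole frame against '' at once, and replacing blanks with NaN so that one isna() call finds both kinds of missing cell. The column names and data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["x", "", "z"],
    "value": [1.0, 2.0, np.nan],
})

# Vectorized comparison instead of a per-cell Python function.
blank_rows, blank_cols = np.where((df == "").to_numpy())
blank_positions = [(int(r), int(c)) for r, c in zip(blank_rows, blank_cols)]
print(blank_positions)  # [(1, 0)]

# Turn blanks into NaN so that a single isna() call finds everything.
cleaned = df.replace("", np.nan)
missing_rows, missing_cols = np.where(cleaned.isna().to_numpy())
missing_positions = [(int(r), int(c)) for r, c in zip(missing_rows, missing_cols)]
print(missing_positions)  # [(1, 0), (2, 1)]
```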
Is there a way to check if dataframe is empty and if so to add a NA row?
Simple question, OP, but actually pretty interesting. All the elements of your code should work, but the issue is that when you run it as is, it returns a list, not a data frame. Let me show you with an example:
growing_df <- data.frame(
  A = rep(1, 3),
  B = 1:3,
  c = LETTERS[4:6])
df_empty <- data.frame()
If we evaluate as you have written you get:
df <- ifelse(dim(df_empty)[1]==0, rbind(growing_df, NA))
with df resulting in a list:
> class(df)
[1] "list"
> df
[[1]]
[1] 1 1 1 NA
The code "worked", but the resulting class of df is wrong. It's odd because this works:
> rbind(growing_df, NA)
   A  B    c
1  1  1    D
2  1  2    E
3  1  3    F
4 NA NA <NA>
The answer is to use if and else, rather than ifelse(), just as @akrun noted in their answer. The reason is found if you dig into the documentation of ifelse():
ifelse returns a value with the same shape as test which is filled
with elements selected from either yes or no depending on whether the
element of test is TRUE or FALSE.
The test here, dim(df_empty)[1]==0 (or equivalently nrow(df_empty)==0), is a logical vector of length one, so ifelse() returns a result of length one. A data frame is a list of columns, so only the first element of rbind(growing_df, NA), column A, is kept, and it comes back wrapped in a one-element list. That's why if {} works, but not ifelse() here. rbind() results in a data frame normally, but the shape of the result stored into df when assigning with ifelse() is decided based on the test element, not the resulting element. Compare that to if {} statements, whose result is whatever the expression inside {} evaluates to.
Fastest way to check if dataframe is empty
Suppose we have two types of data.frames:
emptyDF = data.frame(a=1,b="bah")[0,]
fullDF = data.frame(a=1,b="bah")
DFs = list(emptyDF,fullDF)[sample(1:2,1e4,replace=TRUE)]
and your if condition shows up in a loop like
boundDF = data.frame()
for (i in seq_along(DFs)) {
  if (nrow(DFs[[i]]))
    boundDF <- rbind(boundDF, DFs[[i]])
}
In this case, you're approaching the problem in the wrong way. The if statement is not necessary: do.call(rbind,DFs) or library(data.table); rbindlist(DFs) is faster and clearer.
Generally, you are looking for improvement to the performance of your code in the wrong place. No matter what operation you're doing inside your loop, the step of checking for non-emptiness of the data.frame is not going to be the part that is taking the most time. While there may be room for optimization in this step, "Premature optimization is the root of all evil", as Donald Knuth said.
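The same advice carries over to pandas, the subject of the earlier answers: collect the pieces and bind them once, instead of growing a frame inside a loop. This is a pandas counterpart written for illustration, not part of the R answer above. The frames mirror emptyDF and fullDF, and filtering on .empty is a choice of this sketch so that empty pieces play no part in the result.

```python
import pandas as pd

empty_df = pd.DataFrame({"a": [1], "b": ["bah"]}).iloc[:0]
full_df = pd.DataFrame({"a": [1], "b": ["bah"]})
frames = [empty_df, full_df, full_df, empty_df, full_df]

# Collect the non-empty pieces and concatenate once, rather than
# growing the result inside a loop.
non_empty = [frame for frame in frames if not frame.empty]
bound_df = pd.concat(non_empty, ignore_index=True)

print(len(bound_df))  # 3
```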