How to Determine Whether a Column/Variable Is Numeric or Not in Pandas/Numpy

Labelling which columns contain only numerical values and presenting that alongside the original data frame's columns:

num_cols = list(df2.select_dtypes(include=[np.number]).columns.values)
values = ["numerical" if c in num_cols else "" for c in df2.columns]
# values: ['numerical', '']

desired_result = pd.DataFrame(values).T
desired_result.columns = df2.columns
# desired_result:
# column1 column2
# 0 numerical
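
Putting that together, a minimal end-to-end sketch; df2 here is a small made-up frame with one numeric and one string column:

import numpy as np
import pandas as pd

# hypothetical example frame
df2 = pd.DataFrame({'column1': [1.0, 2.5, 3.2], 'column2': ['a', 'b', 'c']})

num_cols = list(df2.select_dtypes(include=[np.number]).columns.values)
values = ["numerical" if c in num_cols else "" for c in df2.columns]

desired_result = pd.DataFrame(values).T
desired_result.columns = df2.columns
print(desired_result)
#      column1 column2
# 0  numerical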

How to check the dtype of a column in python pandas

You can access the data-type of a column with dtype:

for y in agg.columns:
    if agg[y].dtype == np.float64 or agg[y].dtype == np.int64:
        treat_numeric(agg[y])
    else:
        treat_str(agg[y])
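
If you want to cover every numeric width (int32, float32, and so on) rather than just int64/float64, here is a sketch using pandas' own type check; treat_numeric and treat_str are stand-ins for whatever handlers you have:

import pandas as pd
from pandas.api.types import is_numeric_dtype

def treat_numeric(col):   # stand-in handler
    print(col.name, 'is numeric')

def treat_str(col):       # stand-in handler
    print(col.name, 'is not numeric')

# hypothetical frame with mixed column types
agg = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [1.5, 2.5]})

for y in agg.columns:
    if is_numeric_dtype(agg[y]):
        treat_numeric(agg[y])
    else:
        treat_str(agg[y])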

How to check if float pandas column contains only integer numbers?

Comparison with astype(int)

Tentatively convert your column to int and test with np.array_equal:

np.array_equal(df.v, df.v.astype(int))
True
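
One caveat: astype(int) raises if the column contains NaN. If that can happen, a NaN-tolerant sketch is to run the same comparison on the non-missing values only:

import numpy as np
import pandas as pd

v = pd.Series([1.0, 2.0, np.nan, 4.0])   # hypothetical column with a missing value
non_missing = v.dropna()
np.array_equal(non_missing, non_missing.astype(int))
# True -- every non-missing value is a whole number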


float.is_integer

You can use this python function in conjunction with an apply:

df.v.apply(float.is_integer).all()
True

Or, using Python's built-in all with a generator expression, for space efficiency:

all(x.is_integer() for x in df.v)
True

How to check if a variable is either a python list, numpy array or pandas series

You can do it using isinstance:

import pandas as pd
import numpy as np

def f(l):
    if isinstance(l, (list, pd.core.series.Series, np.ndarray)):
        print(5)
    else:
        raise Exception('wrong type')

Then f([1,2,3]) prints 5 while f(3.34) raises an error.
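
A quick check of all three accepted types, reusing f from above (the printed 5 is just the placeholder from that answer):

f([1, 2, 3])              # prints 5
f(np.array([1, 2, 3]))    # prints 5
f(pd.Series([1, 2, 3]))   # prints 5
try:
    f(3.34)
except Exception as e:
    print(e)              # prints: wrong type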

Check if dataframe column is Categorical

Use the name property to do the comparison instead; it should always work because it's just a string:

>>> import numpy as np
>>> arr = np.array([1, 2, 3, 4])
>>> arr.dtype.name
'int64'

>>> import pandas as pd
>>> cat = pd.Categorical(['a', 'b', 'c'])
>>> cat.dtype.name
'category'

So, to sum up, you can end up with a simple, straightforward function:

def is_categorical(array_like):
    return array_like.dtype.name == 'category'
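
For example, with a couple of made-up Series (and is_categorical as defined just above):

import pandas as pd

s_cat = pd.Series(['a', 'b', 'c'], dtype='category')
s_int = pd.Series([1, 2, 3])

print(is_categorical(s_cat))   # True
print(is_categorical(s_int))   # False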

Is there an efficient method of checking whether a column has mixed dtypes?

Here is an approach that uses the fact that in Python 3 values of different types cannot be compared with < or >. The idea is to run max over the array; being a builtin, it should be reasonably fast, and it short-circuits on the first failing comparison.

def ismixed(a):
    try:
        max(a)
        return False
    except TypeError as e:  # we take this to imply mixed type
        msg, fst, and_, snd = str(e).rsplit(' ', 3)
        assert msg == "'>' not supported between instances of"
        assert and_ == "and"
        assert fst != snd
        return True
    except ValueError as e:  # catch empty arrays
        assert str(e) == "max() arg is an empty sequence"
        return False

It doesn't catch mixed numeric types, though. Also, objects that just do not support comparison may trip this up.
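
To see it in action on a few made-up object columns (reusing ismixed from above):

import pandas as pd

df = pd.DataFrame({
    'mixed':   [1, 'a', 3.5],     # ints mixed with strings -> comparison fails
    'strings': ['x', 'y', 'z'],   # homogeneous object column
    'ints':    [1, 2, 3],
})

print(ismixed(df['mixed'].values))    # True
print(ismixed(df['strings'].values))  # False
print(ismixed(df['ints'].values))     # False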

But it's reasonably fast. If we strip away all pandas overhead:

from timeit import timeit

v = df.values

list(map(ismixed, v.T))
# [True, False, False]
timeit(lambda: list(map(ismixed, v.T)), number=1000)
# 0.008936170022934675

For comparison, timing pandas' own infer_dtype (from pandas.api.types) on the same columns:

timeit(lambda: list(map(infer_dtype, v.T)), number=1000)
# 0.02499613002873957
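
For reference, here is a quick sketch of the kind of labels infer_dtype returns, on made-up inputs:

from pandas.api.types import infer_dtype

infer_dtype(['x', 'y', 'z'])   # 'string'
infer_dtype([1, 'a'])          # 'mixed-integer'
infer_dtype([1, 2, 3])         # 'integer'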

How to determine if a pandas column type can be reduced from int64 to int32 or from float64 to float32?

I have a dataframe which is huge (8 GB). I am trying to find out if I will lose any information if I downsize the columns from int64 to int32 ...

The simplest way to cast integers to a smaller type and make sure that you are not losing information is to use

df['col'] = pd.to_numeric(df['col'], downcast='integer')

This will both do the conversion and check that it didn't lose data. You'll need to do that for each integer column in your dataframe, as sketched below.
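
A sketch that applies this to every integer column in one pass; df below is a small made-up frame standing in for the real 8 GB one:

import numpy as np
import pandas as pd

# hypothetical frame: one column that fits a smaller type, one that doesn't
df = pd.DataFrame({'small': np.arange(100, dtype=np.int64),
                   'big': np.arange(100, dtype=np.int64) * 10**10})

for col in df.columns:
    if pd.api.types.is_integer_dtype(df[col]):
        df[col] = pd.to_numeric(df[col], downcast='integer')

print(df.dtypes)
# small     int8   -- values 0..99 fit in 8 bits
# big      int64   -- too large for int32, left unchanged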

... or from float64 to float32.

Casting to a smaller floating-point type generally loses information unless every value is an exact binary fraction representable in the smaller type. In practice, you can use 32-bit floats if you need around 7 significant decimal digits of precision or fewer.
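
If you want to check a particular float column anyway, one way (a sketch on a made-up Series) is a round-trip comparison: cast to float32, cast back, and see whether every value survives exactly:

import numpy as np
import pandas as pd

s = pd.Series([0.5, 1.25, 0.1])   # 0.1 is not exactly representable in 32 bits
col = s.to_numpy()
lossless = np.array_equal(col, col.astype(np.float32).astype(np.float64))
print(lossless)
# False -- 0.1 changes when squeezed into float32
# note: NaNs compare unequal here unless you pass equal_nan=True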


