Determine the Data Types of a Data Frame's Columns

Determine the data types of a data frame's columns

Your best bet to start is to use ?str(). To explore some examples, let's make some data:

set.seed(3221)  # this makes the example exactly reproducible
my.data <- data.frame(y=rnorm(5),
x1=c(1:5),
x2=c(TRUE, TRUE, FALSE, FALSE, FALSE),
X3=letters[1:5])

@Wilmer E Henao H's solution is very streamlined:

sapply(my.data, class)
y x1 x2 X3
"numeric" "integer" "logical" "factor"

Using str() gets you that information plus extra goodies (such as the levels of your factors and the first few values of each variable):

str(my.data)
'data.frame': 5 obs. of 4 variables:
$ y : num 1.03 1.599 -0.818 0.872 -2.682
$ x1: int 1 2 3 4 5
$ x2: logi TRUE TRUE FALSE FALSE FALSE
$ X3: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5

@Gavin Simpson's approach is also streamlined, but provides slightly different information than class():

sapply(my.data, typeof)
y x1 x2 X3
"double" "integer" "logical" "integer"

For more information about class, typeof, and the middle child, mode, see this excellent SO thread: A comprehensive survey of the types of things in R. 'mode' and 'class' and 'typeof' are insufficient.

pandas how to check dtype for all columns in a dataframe?

The singular form dtype is used to check the data type for a single column. And the plural form dtypes is for data frame which returns data types for all columns. Essentially:

For a single column:

dataframe.column.dtype

For all columns:

dataframe.dtypes

Example:

import pandas as pd
df = pd.DataFrame({'A': [1,2,3], 'B': [True, False, False], 'C': ['a', 'b', 'c']})

df.A.dtype
# dtype('int64')
df.B.dtype
# dtype('bool')
df.C.dtype
# dtype('O')

df.dtypes
#A int64
#B bool
#C object
#dtype: object

Pandas: Setting no. of max rows

Set display.max_rows:

pd.set_option('display.max_rows', 500)

For older versions of pandas (<=0.11.0) you need to change both display.height and display.max_rows.

pd.set_option('display.height', 500)
pd.set_option('display.max_rows', 500)

See also pd.describe_option('display').

You can set an option only temporarily for this one time like this:

from IPython.display import display
with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
display(df) #need display to show the dataframe when using with in jupyter
#some pandas stuff

You can also reset an option back to its default value like this:

pd.reset_option('display.max_rows')

And reset all of them back:

pd.reset_option('all')

How do I get the classes of all columns in a data frame?

One option is to use lapply and class. For example:

> foo <- data.frame(c("a", "b"), c(1, 2))
> names(foo) <- c("SomeFactor", "SomeNumeric")
> lapply(foo, class)
$SomeFactor
[1] "factor"

$SomeNumeric
[1] "numeric"

Another option is str:

> str(foo)
'data.frame': 2 obs. of 2 variables:
$ SomeFactor : Factor w/ 2 levels "a","b": 1 2
$ SomeNumeric: num 1 2

Determining Pandas Column DataType

This is only a partial answer, but you can get frequency counts of the data type of the elements in a variable over the entire DataFrame as follows:

dtypeCount =[df.iloc[:,i].apply(type).value_counts() for i in range(df.shape[1])]

This returns

dtypeCount

[<class 'numpy.int32'> 4
Name: a, dtype: int64,
<class 'int'> 2
<class 'str'> 2
Name: b, dtype: int64,
<class 'numpy.int32'> 4
Name: c, dtype: int64]

It doesn't print this nicely, but you can pull out information for any variable by location:

dtypeCount[1]

<class 'int'> 2
<class 'str'> 2
Name: b, dtype: int64

which should get you started in finding what data types are causing the issue and how many of them there are.

You can then inspect the rows that have a str object in the second variable using

df[df.iloc[:,1].map(lambda x: type(x) == str)]

a b c
1 1 n 4
3 3 g 6

data

df = DataFrame({'a': range(4),
'b': [6, 'n', 7, 'g'],
'c': range(3, 7)})

Get list of pandas dataframe columns based on data type

If you want a list of columns of a certain type, you can use groupby:

>>> df = pd.DataFrame([[1, 2.3456, 'c', 'd', 78]], columns=list("ABCDE"))
>>> df
A B C D E
0 1 2.3456 c d 78

[1 rows x 5 columns]
>>> df.dtypes
A int64
B float64
C object
D object
E int64
dtype: object
>>> g = df.columns.to_series().groupby(df.dtypes).groups
>>> g
{dtype('int64'): ['A', 'E'], dtype('float64'): ['B'], dtype('O'): ['C', 'D']}
>>> {k.name: v for k, v in g.items()}
{'object': ['C', 'D'], 'int64': ['A', 'E'], 'float64': ['B']}


Related Topics



Leave a reply



Submit