Strings in a Dataframe, But Dtype Is Object

Strings in a DataFrame, but dtype is object

The object dtype comes from NumPy; it describes the type of the elements in an ndarray. Every element in an ndarray must have the same size in bytes. For int64 and float64, that size is 8 bytes. Strings, however, have no fixed length. So instead of storing the bytes of the strings in the ndarray directly, pandas uses an object ndarray, which stores pointers to Python objects; that is why the dtype of this kind of ndarray is object.

Here is an example:

  • the int64 array contains 4 int64 values.
  • the object array contains 4 pointers to 3 string objects.
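
The two bullets above can be reproduced with a small sketch. The values and the names `ints`, `objs`, and `shared` are illustrative assumptions, not taken from the original figure; one string is reused so that four slots point at three distinct objects.

```python
import numpy as np

# Four int64 values stored inline in the array buffer, 8 bytes each.
ints = np.array([1, 2, 3, 4], dtype=np.int64)

# Four slots holding references; the first and last slot reference the same str.
shared = "spam"
objs = np.array([shared, "eggs", "ham", shared], dtype=object)

print(ints.dtype, ints.itemsize)
print(objs.dtype)

# Count the distinct Python objects behind the four references.
distinct_objects = len({id(item) for item in objs})
print(distinct_objects)
```

The object array only stores references, so strings of any length fit without changing the size of each array element.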


dataframe string type cannot use replace method

Checking df.dtypes makes the difference evident: the Series returned by df.dtypes itself has dtype object, but the column has the string dtype, so you need to apply pandas.Series.str.replace to get your result.

However, when you choose dtype="object", both the reported dtype and the column data are object, so you don't need the .str conversion.

Please check the source code, which explains it well:

For calling .str.{method} on a Series or Index, it is necessary to first initialize the :class:`StringMethods` object, and then call the method.

>>> import pandas as pd
>>> df = pd.DataFrame({'a': ['asdf']}, dtype="string")
>>> df
      a
0  asdf

>>> df.dtypes
a    string
dtype: object

>>> df["a"].str.replace("a", "b", regex=True)
0    bsdf
Name: a, dtype: string

>>> df = pd.DataFrame({'a': ['asdf']}, dtype="object")
>>> df.dtypes
a    object
dtype: object
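
As a hedged sketch (the names `string_col` and `object_col` are illustrative), the .str accessor is available on both a string-dtype and an object-dtype column, so going through .str.replace works in either case:

```python
import pandas as pd

# The same data stored under the two dtypes discussed above.
string_col = pd.Series(["asdf"], dtype="string")
object_col = pd.Series(["asdf"], dtype="object")

# .str.replace goes through the StringMethods accessor for both columns.
replaced_string = string_col.str.replace("a", "b", regex=True)
replaced_object = object_col.str.replace("a", "b", regex=True)

print(replaced_string.tolist())
print(replaced_object.tolist())
```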

The explanation of the object dtype at the top of this page is borrowed from @HYRY: a string column is stored as an ndarray of pointers to Python objects, so its dtype is object.

The pandas documentation also notes that all dtypes can now be converted to StringDtype.

Note:

The object dtype has a much broader scope. An object column can hold not only strings but also any other data that pandas doesn't understand.
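
To illustrate the note, here is a minimal sketch with made-up values: a single object column holding a string, a float, and a dict, each kept as its original Python object.

```python
import pandas as pd

# One object-dtype column holding three unrelated Python types.
mixed = pd.Series(["text", 1.5, {"key": "value"}], dtype=object)

# Each element keeps its own Python type.
element_types = [type(value).__name__ for value in mixed]

print(mixed.dtype)
print(element_types)
```

This is why a dtype of object alone does not guarantee that a column contains only strings.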

pandas distinction between str and object types

NumPy's string dtypes aren't Python strings.

Therefore, pandas deliberately uses native Python strings, which require an object dtype.

First off, let me demonstrate a bit of what I mean by NumPy's strings being different:

In [1]: import numpy as np
In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')
In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)

Now, x is an array with a NumPy string dtype (fixed-width, C-like strings) and y is an array of native Python strings.

If we try to go beyond 7 characters, we'll see an immediate difference. The string dtype versions will be truncated:

In [4]: x[1] = 'a really really really long'
In [5]: x
Out[5]:
array([b'Testing', b'a reall', b'string'], dtype='|S7')

While the object dtype versions can be arbitrary length:

In [6]: y[1] = 'a really really really long'

In [7]: y
Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)

Next, the |S dtype strings can't hold unicode properly, though there is a unicode fixed-length string dtype, as well. I'll skip an example, for the moment.
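
For reference, a minimal sketch of the fixed-length unicode dtype mentioned above (the sample text is an assumption chosen for illustration). It stores non-ASCII characters, but it is still fixed-width and truncates in the same way:

```python
import numpy as np

# 'U7' is a fixed-width unicode dtype: at most 7 characters per element.
u = np.array(["Testing", "a", "string"], dtype="U7")

# Non-ASCII characters are stored, but anything past 7 characters is dropped.
u[1] = "ünïcödé text"

print(u[1])
print(u.itemsize)  # bytes per element: 4 bytes per character
```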

Finally, numpy's strings are actually mutable, while Python strings are not. For example:

In [8]: z = x.view(np.uint8)
In [9]: z += 1
In [10]: x
Out[10]:
array([b'Uftujoh', b'b!sfbmm', b'tusjoh\x01'], dtype='|S7')

For all of these reasons, pandas chose not to ever allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a Python string into a fixed-width NumPy string won't work in pandas. Instead, it always uses native Python strings, which behave in a more intuitive way for most users.
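
A small sketch of that behavior, assuming a fixed-width unicode array as input (the names `fixed` and `s` are illustrative). The exact dtype printed depends on the pandas version (object in older releases, a dedicated string dtype in newer ones), so no output is shown:

```python
import numpy as np
import pandas as pd

# A fixed-width NumPy unicode array, limited to 7 characters per element.
fixed = np.array(["Testing", "a", "string"], dtype="U7")

# Wrapping it in a Series does not keep the fixed-width dtype.
s = pd.Series(fixed)

# A longer value is stored in full instead of being truncated.
s.iloc[1] = "a really really really long"

print(s.dtype)
print(s.iloc[1])
```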

Pandas: Get column dtype as string

Simply convert to string:

df.dtypes.astype(str).iloc[0]

Or if you're really only interested in a single value, use the name attribute.

df.dtypes.iloc[0].name

Output: 'int64'

For the whole Series:

>>> df.dtypes.astype(str)
col1 int64
col2 int64
dtype: object
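
Put together as a runnable sketch (the DataFrame contents are an assumption matching the col1/col2 output above), using positional access through .iloc:

```python
import pandas as pd

# Two integer columns, created explicitly as int64.
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]}, dtype="int64")

# Name of the first column's dtype as a plain string.
first_name = df.dtypes.iloc[0].name

# Mapping of column name to dtype name for the whole frame.
by_column = df.dtypes.astype(str).to_dict()

print(first_name)
print(by_column)
```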

pandas dtype conversion from object to string

pandas represents all strings as variable-length Python objects, which is what the object dtype holds. You can call series.astype('S32') if you want, but the result will be recast if you then store it in a DataFrame or do much with it. This is for simplicity.

Certain serialization formats, e.g. HDFStore, do store strings as fixed-length strings on disk, though.

You can call series.astype('int32') if you would like, and the data will be stored as the new type.
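
A minimal sketch of the numeric case, with illustrative values and names: unlike fixed-width strings, an int32 conversion is kept when the result is placed in a DataFrame.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], dtype="int64")

# Numeric conversions are stored as the requested type.
as_int32 = s.astype(np.int32)

# The dtype survives being placed in a DataFrame.
df = pd.DataFrame({"value": as_int32})

print(df.dtypes)
```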

Convert object data type to string issue in python

object is the default container capable of holding strings, or any combination of dtypes.

If you are using a version of pandas < '1.0.0' this is your only option. If you are using pd.__version__ >= '1.0.0' then you can use the new experimental pd.StringDtype() dtype. Being experimental, the behavior is subject to change in future versions, so use at your own risk.

df.dtypes
#country object

# .astype(str) and .astype('str') keep the column as object.
df['country'] = df['country'].astype(str)
df.dtypes
#country object

df['country'] = df['country'].astype(pd.StringDtype())
df.dtypes
#country string
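
A self-contained version of the conversion above, as a sketch with made-up country values. It also shows that a missing value stays missing after the conversion and that the .str methods keep working:

```python
import pandas as pd

# An object column with one missing value.
df = pd.DataFrame({"country": pd.Series(["US", None, "FR"], dtype=object)})

# "string" is the alias for pd.StringDtype().
df["country"] = df["country"].astype("string")

print(df.dtypes)
print(df["country"].isna().tolist())
print(df["country"].str.lower().tolist())
```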

