Strings in a DataFrame, but dtype is object
The dtype object comes from NumPy; it describes the type of the elements in an ndarray. Every element in an ndarray must have the same size in bytes. For int64 and float64, that is 8 bytes. Strings, however, have no fixed length. So instead of storing the bytes of the strings in the ndarray directly, pandas uses an object ndarray, which stores pointers to Python objects; that is why the dtype of this kind of ndarray is object.
Here is an example:
- the int64 array contains 4 int64 values.
- the object array contains 4 pointers to 3 string objects.
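The layout described above can be checked directly in NumPy. This is a minimal sketch; the variable names and sample strings are made up for illustration, and the pointer size depends on the platform (8 bytes on a typical 64-bit build).

```python
import numpy as np

# Four fixed-size integers stored inline, 8 bytes each.
ints = np.array([10, 20, 30, 40], dtype=np.int64)

# Four slots that reference only three distinct string objects:
# the first and last slot point at the same Python object.
a, b, c = "apple", "banana", "cherry"
objs = np.array([a, b, c, a], dtype=object)

int_item_bytes = ints.itemsize   # size of one int64 element
ptr_item_bytes = objs.itemsize   # size of one stored pointer (platform dependent)
distinct_objects = len({id(x) for x in objs})

print(int_item_bytes, ptr_item_bytes, distinct_objects)
```

The object array's itemsize does not change with the length of the strings, because only the pointers live inside the array.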
dataframe string type cannot use replace method
If you check df.dtypes, you can see the difference: the dtypes Series itself is reported as object, but the column is string, so you need to use pandas.Series.str.replace to get your result.
However, when you choose dtype="object", both the dtype and the column data remain object, so you don't need to go through the .str accessor.
Please check the source code, which explains it well:
For calling .str.{method} on a Series or Index, it is necessary to first initialize the :class:`StringMethods` object, and then call the method.
>>> import pandas as pd
>>> df = pd.DataFrame({'a': ['asdf']}, dtype="string")
>>> df
      a
0  asdf
>>> df.dtypes
a    string
dtype: object
>>> df["a"].str.replace("a", "b", regex=True)
0    bsdf
Name: a, dtype: string
>>> df = pd.DataFrame({'a': ['asdf']}, dtype="object")
>>> df.dtypes
a    object
dtype: object
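As a runnable sketch of the two cases discussed in this answer (the variable names are illustrative, and behaviour may differ between pandas versions), the .str accessor can be used on both a string-dtype and an object-dtype column, while DataFrame.replace with regex=True is shown here only on the object-dtype frame:

```python
import pandas as pd

df_string = pd.DataFrame({"a": ["asdf"]}, dtype="string")
df_object = pd.DataFrame({"a": ["asdf"]}, dtype="object")

# The .str accessor builds a StringMethods object first, then calls replace.
via_str_string = df_string["a"].str.replace("a", "b", regex=True)
via_str_object = df_object["a"].str.replace("a", "b", regex=True)

# On the object-dtype frame, DataFrame.replace with regex=True
# performs the same substring substitution without .str.
via_df_replace = df_object.replace("a", "b", regex=True)

print(via_str_string.tolist(), via_str_object.tolist(), via_df_replace["a"].tolist())
```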
The explanation of the object dtype is borrowed from @HYRY, which was the source of inspiration for this answer.
See also the pandas docs, which note that all dtypes can now be converted to StringDtype.
Note:
The object dtype has a much broader scope. It can hold not only strings, but also any other data that pandas doesn't understand.
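To illustrate the note, here is a small sketch with made-up values showing a single object-dtype Series holding several unrelated Python types at once:

```python
import pandas as pd

# dtype=object is requested explicitly so the result does not depend on
# pandas' type inference.
mixed = pd.Series(["text", 42, 3.14, [1, 2], None], dtype=object)

# Each element keeps its original Python type.
element_types = [type(v).__name__ for v in mixed]

print(mixed.dtype, element_types)
```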
pandas distinction between str and object types
NumPy's string dtypes aren't Python strings.
Therefore, pandas deliberately uses native Python strings, which require an object dtype.
First off, let me demonstrate a bit of what I mean by NumPy's strings being different:
In [1]: import numpy as np
In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')
In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)
Now, x is an array with a NumPy string dtype (fixed-width, C-like strings) and y is an array of native Python strings.
If we try to go beyond 7 characters, we'll see an immediate difference. The string dtype versions will be truncated:
In [4]: x[1] = 'a really really really long'
In [5]: x
Out[5]:
array(['Testing', 'a reall', 'string'],
dtype='|S7')
While the object dtype versions can be arbitrary length:
In [6]: y[1] = 'a really really really long'
In [7]: y
Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)
Next, the |S dtype strings can't hold Unicode properly, though there is a fixed-length Unicode string dtype as well. I'll skip an example for the moment.
Finally, NumPy's strings are actually mutable, while Python strings are not. For example:
In [8]: z = x.view(np.uint8)
In [9]: z += 1
In [10]: x
Out[10]:
array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'],
dtype='|S7')
For all of these reasons, pandas chose not to ever allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a Python string into a fixed-width NumPy string won't work in pandas. Instead, it always uses native Python strings, which behave in a more intuitive way for most users.
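The session above appears to come from Python 2. The following is a sketch of the same truncation and mutability behaviour under Python 3, where the |S dtype holds bytes, so bytes literals and an explicit ASCII encode are used here as an assumption; the variable names are illustrative.

```python
import numpy as np

fixed = np.array([b"Testing", b"a", b"string"], dtype="S7")      # fixed-width, 7 bytes per element
flexible = np.array(["Testing", "a", "string"], dtype=object)    # pointers to Python str objects

long_text = "a really really really long"

# Assigning into the fixed-width array silently truncates to 7 bytes.
fixed[1] = long_text.encode("ascii")
truncated = fixed[1]

# The object array just stores a reference to the full string.
flexible[1] = long_text

# The fixed-width buffer is mutable in place through a uint8 view.
raw = fixed.view(np.uint8)
raw += 1

print(truncated, flexible[1], fixed)
```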
Pandas: Get column dtype as string
Simply convert to string:
df.dtypes.astype(str).iloc[0]
Or, if you're really only interested in a single value, use the name attribute:
df.dtypes.iloc[0].name
Output: 'int64'
For the whole Series:
>>> df.dtypes.astype(str)
col1    int64
col2    int64
dtype: object
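Put together as a self-contained sketch (the DataFrame here is a made-up stand-in for the one in the question, with int64 requested explicitly so the result does not depend on the platform):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]}, dtype="int64")

# Positional access to the first column's dtype, as a plain string.
first_as_str = df.dtypes.astype(str).iloc[0]
first_name = df.dtypes.iloc[0].name

# Access by column label works the same way.
by_label = df["col2"].dtype.name

# All columns at once, as a {column: dtype name} mapping.
all_names = df.dtypes.astype(str).to_dict()

print(first_as_str, first_name, by_label, all_names)
```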
pandas dtype conversion from object to string
All strings are represented as variable-length (which is what the object dtype holds). You can do series.astype('S32') if you want, but it will be recast if you then store it in a DataFrame or do much with it. This is for simplicity.
Certain serialization formats, e.g. HDFStore, do store strings as fixed-length strings on disk, though.
You can call series.astype('int32') if you would like, and it will be stored as the new type.
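A brief sketch of the numeric case mentioned last, using made-up data: unlike fixed-width strings, a numeric cast such as int32 is kept when the Series is placed in a DataFrame.

```python
import numpy as np
import pandas as pd

series = pd.Series([1, 2, 3], dtype="int64")

# Cast to a narrower integer type.
as_int32 = series.astype("int32")

# The new dtype survives being stored in a DataFrame.
frame = pd.DataFrame({"n": as_int32})

print(as_int32.dtype, frame["n"].dtype)
```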
Convert object data type to string issue in python
object is the default container, capable of holding strings or any combination of dtypes.
If you are using a version of pandas < '1.0.0', this is your only option. If you are using pd.__version__ >= '1.0.0', you can use the new experimental pd.StringDtype() dtype. Because it is experimental, its behavior is subject to change in future versions, so use it at your own risk.
df.dtypes
#country object
# .astype(str) and .astype('str') keep the column as object.
df['country'] = df['country'].astype(str)
df.dtypes
#country object
df['country'] = df['country'].astype(pd.StringDtype())
df.dtypes
#country string