pandas: to_numeric for multiple columns
UPDATE: you don't need to convert your values afterwards; you can do it on the fly when reading your CSV:
In [165]: df=pd.read_csv(url, index_col=0, na_values=['(NA)']).fillna(0)
In [166]: df.dtypes
Out[166]:
GeoName object
ComponentName object
IndustryId int64
IndustryClassification object
Description object
2004 int64
2005 int64
2006 int64
2007 int64
2008 int64
2009 int64
2010 int64
2011 int64
2012 int64
2013 int64
2014 float64
dtype: object
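A self-contained sketch of the same idea, assuming a small inline CSV in place of the original url (the column names and values below are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV behind `url`; "(NA)" marks missing values.
csv_text = """id,GeoName,2013,2014
1,Alabama,100,(NA)
2,Alaska,(NA),7.5
"""

# na_values tells the parser to treat "(NA)" as missing, so the columns are
# parsed as numbers straight away; fillna(0) then fills the gaps with 0.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=["(NA)"]).fillna(0)
print(df.dtypes)
```

Columns that contained "(NA)" come out as float64 even after fillna(0), because they were parsed with NaN in them first.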
If you need to convert multiple columns to numeric dtypes - use the following technique:
Sample source DF:
In [271]: df
Out[271]:
id a b c d e f
0 id_3 AAA 6 3 5 8 1
1 id_9 3 7 5 7 3 BBB
2 id_7 4 2 3 5 4 2
3 id_0 7 3 5 7 9 4
4 id_0 2 4 6 4 0 2
In [272]: df.dtypes
Out[272]:
id object
a object
b int64
c int64
d int64
e int64
f object
dtype: object
Converting selected columns to numeric dtypes:
In [273]: cols = df.columns.drop('id')
In [274]: df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
In [275]: df
Out[275]:
id a b c d e f
0 id_3 NaN 6 3 5 8 1.0
1 id_9 3.0 7 5 7 3 NaN
2 id_7 4.0 2 3 5 4 2.0
3 id_0 7.0 3 5 7 9 4.0
4 id_0 2.0 4 6 4 0 2.0
In [276]: df.dtypes
Out[276]:
id object
a float64
b int64
c int64
d int64
e int64
f float64
dtype: object
PS: if you want to select all string (object) columns, use the following simple trick:
cols = df.columns[df.dtypes.eq('object')]
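As a rough end-to-end sketch of that trick (the frame below is hypothetical, and its string columns are created with dtype=object on purpose so the selection does not depend on how a given pandas version infers strings):

```python
import pandas as pd

# Hypothetical data: "a" and "b" hold numbers stored as strings, "c" is numeric.
df = pd.DataFrame({"a": ["1", "x", "3"], "b": ["4", "5", "6"]}, dtype=object)
df["c"] = [7, 8, 9]

# Pick every object column, then convert only those.
cols = df.columns[df.dtypes.eq("object")]
df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")
print(df.dtypes)
```

Column "a" ends up as a float column because the unparseable "x" becomes NaN.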
After applying pd.to_numeric on multiple columns there is no change in columns Dtype
You need to assign the converted columns back:
cols = ['temp', 'feels', 'wind', 'gust', 'rain', 'humidity', 'cloud', 'pressure']
df_weather[cols] = df_weather[cols].apply(lambda x: pd.to_numeric(x, errors='coerce'))
Use to_numeric on certain columns only in PANDAS
You can use:
Tracker_sample[['product1','product2','product3','product4','Total']].apply(pd.to_numeric, errors='coerce').fillna(0)
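Note that the expression above returns a new DataFrame, so the result has to be assigned back to be kept. A minimal sketch, assuming a small made-up frame in place of Tracker_sample:

```python
import pandas as pd

# Hypothetical stand-in for Tracker_sample.
tracker = pd.DataFrame({
    "name": ["x", "y"],
    "product1": ["1", "n/a"],
    "Total": ["10", "20"],
})

num_cols = ["product1", "Total"]
# apply() does not modify the frame in place, so assign the result back.
tracker[num_cols] = tracker[num_cols].apply(pd.to_numeric, errors="coerce").fillna(0)
print(tracker)
```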
Pandas how to replace multiple columns of str% to float number?
Use pd.to_numeric with errors='coerce' and DataFrame.stack + DataFrame.unstack, which avoids having to use apply:
new_df = (pd.to_numeric(df.replace('%','',regex = True).stack(),
errors = 'coerce')
.div(100)
.unstack()
.fillna(df))
print(new_df)
A B C D E F
0 CEKAPE 0.236 1.17374 0.0053 0.0011 <2%
1 HYTFFZ 0.2532 1.1625 0.0188 0.0038 0.05
Or use dropna=False and Series.str.replace:
new_df = (pd.to_numeric(df.stack(dropna = False).str.replace('%',''),
errors = 'coerce')
.div(100)
.unstack()
.fillna(df)
)
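For comparison, here is a per-column variant that does use apply; it is a sketch rather than the approach above, the helper name percent_to_fraction and the sample data are hypothetical, and unparseable values are left as NaN instead of being filled back from the original frame:

```python
import pandas as pd

# Hypothetical data: an id column plus percentages stored as strings.
df = pd.DataFrame({
    "A": ["CEKAPE", "HYTFFZ"],
    "B": ["23.6%", "25.32%"],
    "F": ["<2%", "5%"],
})

def percent_to_fraction(col):
    # Strip a trailing "%" and convert; values that still fail become NaN.
    return pd.to_numeric(col.str.rstrip("%"), errors="coerce").div(100)

pct_cols = ["B", "F"]
new_df = df.copy()
new_df[pct_cols] = df[pct_cols].apply(percent_to_fraction)
print(new_df)
```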
Casting types of all columns starting with Pandas
Get all columns starting with comps and cast them to floats:
# either select the columns by name prefix...
d = dict.fromkeys(df.columns[df.columns.str.startswith('comps')], 'float')
# ...or select the same columns with a regex filter
d = dict.fromkeys(df.filter(regex='^comps').columns, 'float')
df = df.astype(d)
print (df)
Or:
m = df.columns.str.startswith('comps')
df.loc[:, m] = df.loc[:, m].astype(float)
print (df)
Or:
c = df.filter(regex='^comps').columns
df[c] = df[c].astype(float)
print (df)
Or:
df.update(df.filter(regex='^comps').astype(float))
print (df)
If casting to floats fails, you need to use to_numeric:
m = df.columns.str.startswith('comps')
df.loc[:, m] = df.loc[:, m].apply(pd.to_numeric, errors='coerce')
print (df)
Or:
c = df.filter(regex='^comps').columns
df[c] = df[c].apply(pd.to_numeric, errors='coerce')
print (df)
Or:
df.update(df.filter(regex='^comps').apply(pd.to_numeric, errors='coerce'))
print (df)
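A small runnable sketch of the to_numeric fallback, assuming a made-up frame whose comps columns are stored as strings:

```python
import pandas as pd

# Hypothetical frame: two "comps" columns as strings, one with a bad value.
df = pd.DataFrame({
    "name": ["a", "b"],
    "comps_1": ["1.5", "oops"],
    "comps_2": ["3", "4"],
})

c = df.filter(regex="^comps").columns
# astype(float) would raise on "oops"; to_numeric with coerce yields NaN instead.
df[c] = df[c].apply(pd.to_numeric, errors="coerce")
print(df.dtypes)
```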
Better way to convert pandas dataframe columns to numeric
convert_objects is deprecated. Use pd.to_numeric instead.
You can add the parameter errors='coerce' to convert bad non-numeric values to NaN.
conv_cols = obj_cols.apply(pd.to_numeric, errors = 'coerce')
The function is applied to each column of obj_cols. With errors='coerce', columns that can be converted to a numeric type are converted, and any value that cannot be parsed (e.g. a non-digit string) becomes NaN rather than being left alone.
Change column type in pandas
You have four main options for converting types in pandas:
1. to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)
2. astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).
3. infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.
4. convert_dtypes() - convert DataFrame columns to the "best possible" dtype that supports pd.NA (pandas' object to indicate a missing value).
Read on for more detailed explanations and usage of each of these methods.
1. to_numeric()
The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().
This function will try to change non-numeric objects (such as strings) into integers or floating-point numbers as appropriate.
Basic usage
The input to to_numeric() is a Series or a single column of a DataFrame.
>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0 8
1 6
2 7.5
3 3
4 0.9
dtype: object
>>> pd.to_numeric(s) # convert everything to float values
0 8.0
1 6.0
2 7.5
3 3.0
4 0.9
dtype: float64
As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:
# convert Series
my_series = pd.to_numeric(my_series)
# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])
You can also use it to convert multiple columns of a DataFrame via the apply() method:
# convert all columns of DataFrame
df = df.apply(pd.to_numeric)
# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
As long as your values can all be converted, that's probably all you need.
Error handling
But what if some values can't be converted to a numeric type?
to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.
Here's an example using a Series of strings s which has the object dtype:
>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0 1
1 2
2 4.7
3 pandas
4 10
dtype: object
The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string 'pandas':
>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string
Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:
>>> pd.to_numeric(s, errors='coerce')
0 1.0
1 2.0
2 4.7
3 NaN
4 10.0
dtype: float64
The third option for errors is just to ignore the operation if an invalid value is encountered:
>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched
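This last option is particularly useful when you want to convert an entire DataFrame but don't know which of its columns can be converted reliably to a numeric type (note that errors='ignore' is deprecated as of pandas 2.2, which recommends catching the exception explicitly instead). In that case, just write: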
df.apply(pd.to_numeric, errors='ignore')
The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
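On pandas versions where errors='ignore' is deprecated or unavailable, roughly the same "convert what you can" behaviour can be sketched with an explicit try/except. The helper name to_numeric_if_possible and the sample frame are hypothetical:

```python
import pandas as pd

def to_numeric_if_possible(col):
    """Return col converted to a numeric dtype, or col unchanged if that fails."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

# Hypothetical frame: "a" and "c" are fully convertible, "b" is not.
df = pd.DataFrame({"a": ["1", "2"], "b": ["x", "3"], "c": ["4.5", "6"]})
df = df.apply(to_numeric_if_possible)
print(df.dtypes)
```

Here column "b" is left as it was because one of its values cannot be parsed.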
Downcasting
By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).
That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32, or int8?
to_numeric() gives you the option to downcast to either 'integer', 'signed', 'unsigned', or 'float'. Here's an example for a simple series s of integer type:
>>> s = pd.Series([1, 2, -7])
>>> s
0 1
1 2
2 -7
dtype: int64
Downcasting to 'integer' uses the smallest possible integer that can hold the values:
>>> pd.to_numeric(s, downcast='integer')
0 1
1 2
2 -7
dtype: int8
Downcasting to 'float' similarly picks a smaller than normal floating type:
>>> pd.to_numeric(s, downcast='float')
0 1.0
1 2.0
2 -7.0
dtype: float32
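To see the memory effect of downcasting, a quick sketch comparing the size of the underlying data before and after (Series.nbytes reports the bytes used by the values only):

```python
import pandas as pd

s = pd.Series([1, 2, -7], dtype="int64")
small = pd.to_numeric(s, downcast="integer")

print(s.dtype, s.nbytes)          # int64: 8 bytes per value
print(small.dtype, small.nbytes)  # int8: 1 byte per value
```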
2. astype()
The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try and go from one type to any other.
Basic usage
Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).
Call the method on the object you want to convert and astype() will try and convert it for you:
# convert all DataFrame columns to the int64 dtype
df = df.astype(int)
# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})
# convert Series to float16 type
s = s.astype(np.float16)
# convert Series to Python strings
s = s.astype(str)
# convert Series to categorical type - see docs for more details
s = s.astype('category')
Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example, if you have a NaN or inf value you'll get an error trying to convert it to an integer.
As of pandas 0.20.0, this error can be suppressed by passing errors='ignore'. Your original object will be returned untouched.
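A short sketch of the NaN case. The second half uses the nullable "Int64" extension dtype, which is an alternative not mentioned above, for when you want integers alongside missing values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

try:
    s.astype(int)
except ValueError as exc:
    # NaN has no integer representation, so the cast is refused.
    print("astype(int) failed:", exc)

# The nullable "Int64" extension dtype can hold missing values.
nullable = s.astype("Int64")
print(nullable)
```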
Be careful
astype() is powerful, but it will sometimes convert values "incorrectly". For example:
>>> s = pd.Series([1, 2, -7])
>>> s
0 1
1 2
2 -7
dtype: int64
These are small integers, so how about converting to an unsigned 8-bit type to save memory?
>>> s.astype(np.uint8)
0 1
1 2
2 249
dtype: uint8
The conversion worked, but the -7 was wrapped round to become 249 (i.e. 2^8 - 7)!
Trying to downcast using pd.to_numeric(s, downcast='unsigned') instead could help prevent this error.
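A brief sketch of how the downcast behaves on the same values: when a negative number is present, to_numeric simply skips the unsigned downcast instead of wrapping the value.

```python
import pandas as pd

s = pd.Series([1, 2, -7])

# Negative values cannot be represented as unsigned, so no downcast happens
# and the values are left intact.
safe = pd.to_numeric(s, downcast="unsigned")
print(safe.dtype, safe.tolist())

# With only non-negative values the downcast goes ahead.
non_negative = pd.to_numeric(pd.Series([1, 2, 7]), downcast="unsigned")
print(non_negative.dtype)
```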
3. infer_objects()
Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).
For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:
>>> df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3','2','1']}, dtype='object')
>>> df.dtypes
a object
b object
dtype: object
Using infer_objects(), you can change the type of column 'a' to int64:
>>> df = df.infer_objects()
>>> df.dtypes
a int64
b object
dtype: object
Column 'b' has been left alone since its values were strings, not integers. If you wanted to force both columns to an integer type, you could use df.astype(int) instead.
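The difference between the soft and the forced conversion can be sketched side by side (using the explicit "int64" dtype string here, rather than int, so the result does not depend on the platform's native integer width):

```python
import pandas as pd

df = pd.DataFrame({"a": [7, 1, 5], "b": ["3", "2", "1"]}, dtype="object")

soft = df.infer_objects()   # only 'a' holds real integers, so only 'a' changes
hard = df.astype("int64")   # also parses the strings in 'b'

print(soft.dtypes)
print(hard.dtypes)
```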
4. convert_dtypes()
Version 1.0 and above includes a method convert_dtypes() to convert Series and DataFrame columns to the best possible dtype that supports the pd.NA missing value.
Here "best possible" means the type most suited to hold the values. For example, this is a pandas integer type if all of the values are integers (or missing values): an object column of Python integer objects is converted to Int64, and a column of NumPy int32 values will become the pandas dtype Int32.
With our object DataFrame df, we get the following result:
>>> df.convert_dtypes().dtypes
a Int64
b string
dtype: object
Since column 'a' held integer values, it was converted to the Int64 type (which is capable of holding missing values, unlike int64).
Column 'b' contained string objects, so was changed to pandas' string dtype.
By default, this method will infer the type from object values in each column. We can change this by passing infer_objects=False:
>>> df.convert_dtypes(infer_objects=False).dtypes
a object
b string
dtype: object
Now column 'a' remained an object column: pandas knows it can be described as an 'integer' column (internally it ran infer_dtype) but didn't infer exactly what dtype of integer it should have so did not convert it. Column 'b' was again converted to 'string' dtype as it was recognised as holding 'string' values.
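To illustrate why the pd.NA-aware dtypes matter, here is a small sketch with missing values; the column names are made up. A float column whose non-missing values are all whole numbers can become Int64 while keeping its gap as a missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "whole": [10.0, np.nan, 20.0],  # floats that are all whole numbers
    "frac": [1.5, np.nan, 2.5],     # genuinely fractional values
})

converted = df.convert_dtypes()
print(converted.dtypes)
print(converted)
```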