How to convert a data frame column to numeric type?
Since nobody has received the accepted-answer check mark yet, I assume that you have some practical issue in mind, mostly because you haven't specified what type of vector you want to convert to numeric. I suggest that you apply the transform function to complete your task.
Now I'm about to demonstrate a certain "conversion anomaly":
# create dummy data.frame
d <- data.frame(char = letters[1:5],
fake_char = as.character(1:5),
fac = factor(1:5),
char_fac = factor(letters[1:5]),
num = 1:5, stringsAsFactors = FALSE)
Let us have a glance at data.frame
> d
char fake_char fac char_fac num
1 a 1 1 a 1
2 b 2 2 b 2
3 c 3 3 c 3
4 d 4 4 d 4
5 e 5 5 e 5
and let us run:
> sapply(d, mode)
char fake_char fac char_fac num
"character" "character" "numeric" "numeric" "numeric"
> sapply(d, class)
char fake_char fac char_fac num
"character" "character" "factor" "factor" "integer"
Now you probably ask yourself, "Where's the anomaly?" Well, I've bumped into quite peculiar things in R, and this is not the most confounding one, but it can confuse you, especially if you read this before rolling into bed.
Here goes: the first two columns are character. I've deliberately called the 2nd one fake_char. Spot the similarity of this character variable with the one that Dirk created in his reply. It's actually a numerical vector converted to character. The 3rd and 4th columns are factor, and the last one is "purely" numeric.
If you use the transform function, you can convert fake_char into numeric, but not the char variable itself.
> transform(d, char = as.numeric(char))
char fake_char fac char_fac num
1 NA 1 1 a 1
2 NA 2 2 b 2
3 NA 3 3 c 3
4 NA 4 4 d 4
5 NA 5 5 e 5
Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion
but if you do the same thing on fake_char and char_fac, you'll be lucky and get away with no NAs (for the factor char_fac, as.numeric returns the underlying integer level codes, not the letters):
> transform(d, fake_char = as.numeric(fake_char),
char_fac = as.numeric(char_fac))
char fake_char fac char_fac num
1 a 1 1 1 1
2 b 2 2 2 2
3 c 3 3 3 3
4 d 4 4 4 4
5 e 5 5 5 5
If you save the transformed data.frame and check for mode and class, you'll get:
> D <- transform(d, fake_char = as.numeric(fake_char),
char_fac = as.numeric(char_fac))
> sapply(D, mode)
char fake_char fac char_fac num
"character" "numeric" "numeric" "numeric" "numeric"
> sapply(D, class)
char fake_char fac char_fac num
"character" "numeric" "factor" "numeric" "integer"
So, the conclusion is: yes, you can convert a character vector into a numeric one, but only if its elements are "convertible" to numeric. If there's even one non-numeric element in the vector, as.numeric turns that element into NA and issues a warning.
And just to prove my point:
> err <- c(1, "b", 3, 4, "e")
> mode(err)
[1] "character"
> class(err)
[1] "character"
> char <- as.numeric(err)
Warning message:
NAs introduced by coercion
> char
[1] 1 NA 3 4 NA
And now, just for fun (or practice), try to guess the output of these commands:
> fac <- as.factor(err)
> fac
???
> num <- as.numeric(fac)
> num
???
Kind regards to Patrick Burns! =)
Change column type in pandas
You have four main options for converting types in pandas:
to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)
astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).
infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.
convert_dtypes() - convert DataFrame columns to the "best possible" dtype that supports pd.NA (pandas' object to indicate a missing value).
Read on for more detailed explanations and usage of each of these methods.
1. to_numeric()
The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().
This function will try to change non-numeric objects (such as strings) into integers or floating-point numbers as appropriate.
Basic usage
The input to to_numeric() is a Series or a single column of a DataFrame.
>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0 8
1 6
2 7.5
3 3
4 0.9
dtype: object
>>> pd.to_numeric(s) # convert everything to float values
0 8.0
1 6.0
2 7.5
3 3.0
4 0.9
dtype: float64
As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:
# convert Series
my_series = pd.to_numeric(my_series)
# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])
You can also use it to convert multiple columns of a DataFrame via the apply() method:
# convert all columns of DataFrame
df = df.apply(pd.to_numeric)
# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
As long as your values can all be converted, that's probably all you need.
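As a small end-to-end illustration of the column-wise conversion above, here is a minimal sketch. The DataFrame and its column names ("a", "b", "name") are made up for the example; only pd.to_numeric and apply() come from the text.

```python
import pandas as pd

# Hypothetical frame: "a" and "b" hold numeric strings, "name" is free text.
df = pd.DataFrame({
    "a": ["1", "2", "3"],
    "b": ["0.5", "1.5", "2.5"],
    "name": ["x", "y", "z"],
})

# Convert only the columns known to be numeric; "name" is not touched.
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)

# "a" ends up with an integer dtype and "b" with a floating-point dtype.
print(df.dtypes)
```

Selecting the columns explicitly keeps text columns out of the conversion, so no error handling is needed here.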
Error handling
But what if some values can't be converted to a numeric type?
to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.
Here's an example using a Series of strings s which has the object dtype:
>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0 1
1 2
2 4.7
3 pandas
4 10
dtype: object
The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string 'pandas':
>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string
Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:
>>> pd.to_numeric(s, errors='coerce')
0 1.0
1 2.0
2 4.7
3 NaN
4 10.0
dtype: float64
The third option for errors is just to ignore the operation if an invalid value is encountered:
>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched
This last option is particularly useful when you want to convert your entire DataFrame but don't know which of its columns can be converted reliably to a numeric type. In that case, just write:
df.apply(pd.to_numeric, errors='ignore')
The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
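The same "convert what you can, leave the rest alone" behaviour can also be written out explicitly with try/except. This avoids relying on errors='ignore', which, as far as I know, has been deprecated since pandas 2.2. The helper name to_numeric_if_possible and the sample data are hypothetical; treat this as a sketch rather than the canonical approach.

```python
import pandas as pd


def to_numeric_if_possible(column):
    """Return the column converted to numeric, or unchanged if any value fails."""
    try:
        return pd.to_numeric(column)  # errors='raise' is the default
    except (ValueError, TypeError):
        return column


# Hypothetical data: "price" is fully convertible, "label" is not.
df = pd.DataFrame({
    "price": ["1.5", "2", "10"],
    "label": ["a", "b", "pandas"],
})

# apply() calls the helper once per column.
converted = df.apply(to_numeric_if_possible)
print(converted.dtypes)
```

A column is converted only if every value in it parses; a single bad value leaves the whole column as it was, which is the behaviour described above.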
Downcasting
By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).
That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32 or int8?
to_numeric() gives you the option to downcast to either 'integer', 'signed', 'unsigned', or 'float'. Here's an example for a simple series s of integer type:
>>> s = pd.Series([1, 2, -7])
>>> s
0 1
1 2
2 -7
dtype: int64
Downcasting to 'integer' uses the smallest possible integer that can hold the values:
>>> pd.to_numeric(s, downcast='integer')
0 1
1 2
2 -7
dtype: int8
Downcasting to 'float' similarly picks a smaller than normal floating type:
>>> pd.to_numeric(s, downcast='float')
0 1.0
1 2.0
2 -7.0
dtype: float32
2. astype()
The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try and go from one type to any other.
Basic usage
Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).
Call the method on the object you want to convert and astype() will try and convert it for you:
# convert all DataFrame columns to the int64 dtype
df = df.astype(int)
# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})
# convert Series to float16 type
s = s.astype(np.float16)
# convert Series to Python strings
s = s.astype(str)
# convert Series to categorical type - see docs for more details
s = s.astype('category')
Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example, if you have a NaN or inf value you'll get an error trying to convert it to an integer.
As of pandas 0.20.0, this error can be suppressed by passing errors='ignore'. Your original object will be returned untouched.
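To make the NaN case concrete, the sketch below triggers the error and then shows one possible way around it, pandas' nullable "Int64" dtype (capital I), which can hold missing values. The sample Series is invented for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

try:
    s.astype(int)  # NaN has no integer representation
except ValueError as exc:
    print("astype(int) failed:", exc)

# One option: the nullable integer dtype keeps the gap as a missing value.
nullable = s.astype("Int64")
print(nullable)
```

This works here because the non-missing values are whole numbers; whether a nullable dtype is appropriate depends on what the rest of your code expects.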
Be careful
astype() is powerful, but it will sometimes convert values "incorrectly". For example:
>>> s = pd.Series([1, 2, -7])
>>> s
0 1
1 2
2 -7
dtype: int64
These are small integers, so how about converting to an unsigned 8-bit type to save memory?
>>> s.astype(np.uint8)
0 1
1 2
2 249
dtype: uint8
The conversion worked, but the -7 was wrapped round to become 249 (i.e. 2^8 - 7)!
Trying to downcast using pd.to_numeric(s, downcast='unsigned') instead could help prevent this error.
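A short sketch of that safer route: with downcast='unsigned', to_numeric only switches to an unsigned type when every value fits, so the negative value is preserved. The variable names kept and non_negative are just for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, -7])

# -7 cannot be stored in an unsigned type, so the dtype is left as it was.
kept = pd.to_numeric(s, downcast="unsigned")
print(kept.dtype, kept.tolist())

# With only non-negative values, the smallest unsigned type is chosen.
non_negative = pd.to_numeric(pd.Series([1, 2, 7]), downcast="unsigned")
print(non_negative.dtype)
```

The trade-off is that a column you hoped to shrink may silently keep its original width, so it is worth checking the resulting dtype.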
3. infer_objects()
Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).
For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:
>>> df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3','2','1']}, dtype='object')
>>> df.dtypes
a object
b object
dtype: object
Using infer_objects(), you can change the type of column 'a' to int64:
>>> df = df.infer_objects()
>>> df.dtypes
a int64
b object
dtype: object
Column 'b' has been left alone since its values were strings, not integers. If you wanted to force both columns to an integer type, you could use df.astype(int) instead.
4. convert_dtypes()
Version 1.0 and above includes a method convert_dtypes() to convert Series and DataFrame columns to the best possible dtype that supports the pd.NA missing value.
Here "best possible" means the type most suited to hold the values. For example, this is a pandas integer type if all of the values are integers (or missing values): an object column of Python integer objects is converted to Int64, and a column of NumPy int32 values becomes the pandas dtype Int32.
With our object DataFrame df, we get the following result:
>>> df.convert_dtypes().dtypes
a Int64
b string
dtype: object
Since column 'a' held integer values, it was converted to the Int64 type (which is capable of holding missing values, unlike int64).
Column 'b' contained string objects, so was changed to pandas' string dtype.
By default, this method will infer the type from object values in each column. We can change this by passing infer_objects=False:
>>> df.convert_dtypes(infer_objects=False).dtypes
a object
b string
dtype: object
Now column 'a' remained an object column: pandas knows it can be described as an 'integer' column (internally it ran infer_dtype) but didn't infer exactly what dtype of integer it should have so did not convert it. Column 'b' was again converted to 'string' dtype as it was recognised as holding 'string' values.
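The benefit of a pd.NA-capable dtype shows up once a value is actually missing. In the sketch below (sample data invented for the example), an integer column with a gap is stored as float64 by default, and convert_dtypes() turns it into Int64 while keeping the gap as a missing value.

```python
import pandas as pd

# Hypothetical column: whole numbers with one missing entry.
df = pd.DataFrame({"a": [7, None, 5]})
print(df["a"].dtype)  # the gap forces a floating-point column by default

nullable = df.convert_dtypes()
print(nullable["a"].dtype)  # nullable integer dtype
print(nullable["a"])
```

Note that this only applies when the non-missing values are all whole numbers; a column with genuine fractions would stay floating-point.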
Converting data frame column from character to numeric
If we need only one column to be numeric
yyz$b <- as.numeric(as.character(yyz$b))
But if all the columns need to be changed to numeric, use lapply to loop over the columns and convert each to numeric by first converting it to character class, as the columns were factor.
yyz[] <- lapply(yyz, function(x) as.numeric(as.character(x)))
Both columns in the OP's post are factor because of the string "n/a". This could be easily avoided while reading the file by using na.strings = "n/a" in read.table/read.csv, or, if we are using data.frame, we can have character columns with stringsAsFactors=FALSE (the default was stringsAsFactors=TRUE before R 4.0.0).
Regarding the usage of apply, it converts the dataset to a matrix, and a matrix can hold only a single class. To check the class, we need
lapply(yyz, class)
Or
sapply(yyz, class)
Or check
str(yyz)
Convert data.frame column character to numeric
You can try,
mapply(function(x, y)paste(x + as.numeric(y), collapse = ','),df$C1 ,strsplit(df$C3, ','))
[1] "33,333,3933,433,4533,433,4233" "83,132,149,158,241,243,253,266,301" "146,149,159,275,420,424,529,627,628,642"
DATA
df <- data.frame(C1 = c(33, 83, 146),
C2 = c(1, 2, 3),
C3 = c('0,300,3900,400,4500,400,4200', '0,49,66,75,158,160,170,183,218', '0,3,13,129,274,278,383,481,482,496'),
stringsAsFactors = FALSE)
EDIT
To make C3 into numeric you will have to split it into many columns. There are a bunch of ways to do it as shown here. I like the splitstackshape approach, i.e.
library(splitstackshape)
df1 <- cSplit(df, 'C3', sep = ',')
#C1 C2 C3_01 C3_02 C3_03 C3_04 C3_05 C3_06 C3_07 C3_08 C3_09 C3_10
#1: 33 1 33 333 3933 433 4533 433 4233 NA NA NA
#2: 83 2 83 132 149 158 241 243 253 266 301 NA
#3: 146 3 146 149 159 275 420 424 529 627 628 642
str(df1)
Classes ‘data.table’ and 'data.frame': 3 obs. of 12 variables:
$ C1 : num 33 83 146
$ C2 : num 1 2 3
$ C3_01: int 33 83 146
$ C3_02: int 333 132 149
$ C3_03: int 3933 149 159
$ C3_04: int 433 158 275
$ C3_05: int 4533 241 420
$ C3_06: int 433 243 424
$ C3_07: int 4233 253 529
$ C3_08: int NA 266 627
$ C3_09: int NA 301 628
$ C3_10: int NA NA 642
How to convert data.frame column from Factor to numeric
breast$class <- as.numeric(as.character(breast$class))
If you have many columns to convert to numeric
indx <- sapply(breast, is.factor)
breast[indx] <- lapply(breast[indx], function(x) as.numeric(as.character(x)))
Another option is to use stringsAsFactors=FALSE while reading the file using read.table or read.csv.
Just in case, other options to create/change columns:
breast[,'class'] <- as.numeric(as.character(breast[,'class']))
or
breast <- transform(breast, class=as.numeric(as.character(class)))
Converting a column in a data.frame to numeric type to calculate mean values in R
The OP's dataset is a tibble, and tibbles don't default to drop=TRUE, so the dimensions are not dropped when only a single column is selected. So, basically, the result is still a tibble
PlaterTest[PlaterTest$`Amino acid position` == "blank", "Fluorescence"]
# A tibble: 2 x 1
# Fluorescence
# <int>
#1 856
#2 356
and mean won't work with a data.frame or tibble. According to ?mean
mean(x, ...)
x - An R object. Currently there are methods for numeric/logical
vectors and date, date-time and time interval objects. Complex vectors
are allowed for trim = 0, only.
So, it needs a vector. Therefore, the option would be to extract 'Fluorescence' as a vector and subset it based on the 'blank' values in the 'Amino acid position' column
PlaterTest$Fluorescence[PlaterTest$`Amino acid position` == "blank"]
#[1] 856 356
mean(PlaterTest$Fluorescence[PlaterTest$`Amino acid position` == "blank"])
#[1] 606
Also, we can use the tidyverse methods
library(dplyr)
PlaterTest %>%
filter(`Amino acid position` == 'blank') %>%
summarise(Mean = mean(Fluorescence)) %>%
pull(Mean)
data
PlaterTest <- structure(list(Wells = c("A01", "A02", "A03"), `Amino acid position` =
c("D46",
"blank", "blank"), Mutant = c("A", "Y", "R"), Fluorescence = c(456L,
856L, 356L)), .Names = c("Wells", "Amino acid position", "Mutant",
"Fluorescence"), row.names = c("1", "2", "3"), class = c("tbl_df",
"tbl", "data.frame"))
pandas: to_numeric for multiple columns
UPDATE: you don't need to convert your values afterwards, you can do it on-the-fly when reading your CSV:
In [165]: df=pd.read_csv(url, index_col=0, na_values=['(NA)']).fillna(0)
In [166]: df.dtypes
Out[166]:
GeoName object
ComponentName object
IndustryId int64
IndustryClassification object
Description object
2004 int64
2005 int64
2006 int64
2007 int64
2008 int64
2009 int64
2010 int64
2011 int64
2012 int64
2013 int64
2014 float64
dtype: object
If you need to convert multiple columns to numeric dtypes - use the following technique:
Sample source DF:
In [271]: df
Out[271]:
id a b c d e f
0 id_3 AAA 6 3 5 8 1
1 id_9 3 7 5 7 3 BBB
2 id_7 4 2 3 5 4 2
3 id_0 7 3 5 7 9 4
4 id_0 2 4 6 4 0 2
In [272]: df.dtypes
Out[272]:
id object
a object
b int64
c int64
d int64
e int64
f object
dtype: object
Converting selected columns to numeric dtypes:
In [273]: cols = df.columns.drop('id')
In [274]: df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
In [275]: df
Out[275]:
id a b c d e f
0 id_3 NaN 6 3 5 8 1.0
1 id_9 3.0 7 5 7 3 NaN
2 id_7 4.0 2 3 5 4 2.0
3 id_0 7.0 3 5 7 9 4.0
4 id_0 2.0 4 6 4 0 2.0
In [276]: df.dtypes
Out[276]:
id object
a float64
b int64
c int64
d int64
e int64
f float64
dtype: object
PS if you want to select all string (object) columns use the following simple trick:
cols = df.columns[df.dtypes.eq('object')]
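Because errors='coerce' replaces bad values silently, it can be worth counting how many values were lost in each column. The following is a minimal sketch on a cut-down, made-up version of the sample DataFrame above; the names converted and lost are illustrative.

```python
import pandas as pd

# Reduced version of the sample frame: "a" and "f" each contain one bad value.
df = pd.DataFrame({
    "id": ["id_3", "id_9", "id_7"],
    "a": ["AAA", "3", "4"],
    "b": [6, 7, 2],
    "f": ["1", "BBB", "2"],
})

cols = df.columns.drop("id")
converted = df[cols].apply(pd.to_numeric, errors="coerce")

# A value was lost if it was present before but is NaN after conversion.
lost = (converted.isna() & df[cols].notna()).sum()
print(lost)
```

Comparing against the original's notna() mask keeps values that were already missing from being counted as conversion failures.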
converting multiple columns from character to numeric format in r
You could try
DF <- data.frame("a" = as.character(0:5),
"b" = paste(0:5, ".1", sep = ""),
"c" = letters[1:6],
stringsAsFactors = FALSE)
# Check columns classes
sapply(DF, class)
# a b c
# "character" "character" "character"
cols.num <- c("a","b")
DF[cols.num] <- sapply(DF[cols.num],as.numeric)
sapply(DF, class)
# a b c
# "numeric" "numeric" "character"
R: Convert all columns to numeric with mutate while maintaining character columns
A possible solution:
df <- type.convert(df, as.is = TRUE)
str(df)
#> 'data.frame': 4 obs. of 4 variables:
#> $ Col1: int 647 237 863 236
#> $ Col2: int 125 623 854 234
#> $ Col3: chr "ABC" "BCA" "DFL" "KFD"
#> $ Col4: chr "PWD" "CDL" "QOW" "DKC"