R cleaning up a character and converting it into a numeric
You can parse out what you don't want with regular expressions:
test <- "532.dcx3vds98"
destring <- function(x,keep="0-9.") {
return( as.numeric(gsub(paste("[^",keep,"]+",sep=""),"",x)) )
}
destring(test)
Returns 532.398.
Edit: This is now in taRifx:
library(taRifx)
test <- "532.dcx3vds98"
destring(test)
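As a small illustration of the keep argument, here is a minimal sketch that repeats the destring() definition from above (so it runs without taRifx) and widens keep to preserve a minus sign. The second input string is a made-up example, not from the original question.

```r
# Same logic as the definition above, repeated so the snippet is self-contained
destring <- function(x, keep = "0-9.") {
  as.numeric(gsub(paste0("[^", keep, "]+"), "", x))
}

# Default: keep only digits and dots
destring("532.dcx3vds98")

# Hypothetical input: also keep the minus sign (hyphen placed last in the class)
destring("-12abc.5", keep = "0-9.-")
```

The cleaned string still has to be a valid number: an input that leaves two decimal points behind (for example "1.2.3") gives NA with a coercion warning.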
Clean special character, numeric and character
One can use tidyr::extract to first separate emp_length into 2 columns. Then replace any symbol (anything other than 0-9) with "" in the number column and convert it to numeric.
Option #1: Keep the symbol with the number
library(tidyverse)
df <- df %>% extract(emp_length, c("emp_length", "years"),
regex="([[:digit:]+<]+)\\s+(\\w+)")
df
# emp_length years
# 1 10+ years
# 2 <1 year
# 3 8 years
Option #2: Keep just the number, stored as a numeric column
library(tidyverse)
df <- df %>%
  extract(emp_length, c("emp_length", "years"), regex="([[:digit:]+<]+)\\s+(\\w+)") %>%
  # drop every non-digit character, then convert to numeric
  mutate(emp_length = as.numeric(gsub("[^0-9]", "", emp_length)))
df
# emp_length years
# 1 10 years
# 2 1 year
# 3 8 years
Data:
df <- data.frame(emp_length = c("10+ years", "<1 year", "8 years"),
stringsAsFactors = FALSE)
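If only the numeric part is needed, the same cleanup can be sketched in base R without tidyr. This is an alternative under that assumption, not the approach shown above; the column name emp_length_num is made up for illustration.

```r
df <- data.frame(emp_length = c("10+ years", "<1 year", "8 years"),
                 stringsAsFactors = FALSE)

# Strip everything that is not a digit, then convert
# (emp_length_num is a hypothetical column name)
df$emp_length_num <- as.numeric(gsub("[^0-9]", "", df$emp_length))
df
```

Note that this discards the "+" and "<" qualifiers, so "10+ years" and a plain "10 years" would become indistinguishable.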
Transforming complete age from character to numeric in R
Using the lubridate convenience functions period and time_length:
library(dplyr)      # for %>% and mutate()
library(lubridate)
age %>%
  mutate(age_years = time_length(period(complete_age), unit = "years"))
# A tibble: 4 x 2
# complete_age age_years
# <chr> <dbl>
# 1 10 years 8 months 23 days 10.729637
# 2 9 years 11 months 7 days 9.935832
# 3 11 years 3 months 1 day 11.252738
# 4 8 years 6 months 12 days 8.532854
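To make the arithmetic behind those values visible, here is a base R sketch that does the conversion by hand. It assumes a month counts as 1/12 of a year and a year as 365.25 days, which reproduces the figures printed above; the helper get_unit() is hypothetical, and every string is assumed to contain all three units.

```r
complete_age <- c("10 years 8 months 23 days", "9 years 11 months 7 days",
                  "11 years 3 months 1 day", "8 years 6 months 12 days")

# Hypothetical helper: return the number that directly precedes a unit name
get_unit <- function(x, unit) {
  as.numeric(sub(paste0(".*?([0-9]+) ", unit, ".*"), "\\1", x, perl = TRUE))
}

years  <- get_unit(complete_age, "year")
months <- get_unit(complete_age, "month")
days   <- get_unit(complete_age, "day")

# Assumption: 12 months per year, 365.25 days per year
age_years <- years + months / 12 + days / 365.25
round(age_years, 6)
```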
Converting from character to numeric in a matrix
Answered by @Sophia. Her solution:
How about converting your matrix to data.frame before you add the character columns?
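The reason this helps is that a matrix can hold only one type, so adding a character column turns every cell into a string, whereas a data.frame keeps a type per column. A minimal sketch with made-up data (the names m, m_chr and label are illustrative only):

```r
m <- matrix(1:4, ncol = 2)

# Binding a character column onto a matrix coerces the whole matrix
m_chr <- cbind(m, c("a", "b"))
is.character(m_chr)

# Converting to a data.frame first keeps the numeric columns numeric
d <- as.data.frame(m)
d$label <- c("a", "b")
sapply(d, class)
```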
Converting Character to Numeric without NA Coercion in R
As Anando pointed out, the problem is somewhere in your data, and we can't really help you much without a reproducible example. That said, here's a code snippet to help you pin down the records in your data that are causing you problems:
test = as.character(c(1,2,3,4,'M'))
v = as.numeric(test) # NAs introduced by coercion
ix.na = is.na(v)
which(ix.na) # row index of our problem = 5
test[ix.na] # shows the problematic record, "M"
Instead of guessing as to why NAs are being introduced, pull out the records that are causing the problem and address them directly/individually until the NAs go away.
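The same idea can be wrapped in a small helper for larger vectors. This is only a sketch: find_bad() is a hypothetical name, and the sample input (including "1,5") is invented to show two kinds of offending values.

```r
# Hypothetical helper: list the distinct values that fail numeric conversion,
# ignoring entries that were already NA before conversion
find_bad <- function(x) {
  v <- suppressWarnings(as.numeric(x))
  unique(x[is.na(v) & !is.na(x)])
}

test <- c("1", "2", "3", "4", "M", NA, "1,5", "M")
find_bad(test)
```

Looking at the distinct failing values rather than row indices often makes the pattern (a stray letter, a decimal comma) obvious at a glance.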
UPDATE: Looks like the problem is in your call to str_replace_all. I don't know the stringr package, but I think you can accomplish the same thing with gsub like this:
v2 = c("1.00","2.00","3.00")
gsub("\\.00", "", v2)
[1] "1" "2" "3"
I'm not entirely sure what this accomplishes though:
sum(as.numeric(v2)!=as.numeric(gsub("\\.00", "", v2))) # Illustrate that vectors are equivalent.
[1] 0
Unless this achieves some specific purpose for you, I'd suggest dropping this step from your preprocessing entirely, as it doesn't appear necessary and seems to be giving you problems.
Using stringr for conversion of character object to numeric object
library("tidyverse")
Example data (using some of the values from your URL):
vals <- c("$34,543,701", "$69.40 million","$1.519 billion","junk")
dd <- tibble(vals)
Transform:
(dd
  %>% mutate(vals=str_remove_all(vals,"(,|\\$|\\[.*\\]|\\(.*\\))"), ## strip extraneous chars
             multiplier=ifelse(str_detect(vals,"million"),1e6,
                        ifelse(str_detect(vals,"billion"),1e9,1)),
             vals=str_remove(vals,"(m|b)illion"), ## drop words
             vals=as.numeric(vals)*multiplier)
  %>% select(-multiplier) ## drop auxiliary variable
)
I intentionally left a non-numeric value in the example (since such values exist in the example you gave); this will trigger a warning from as.numeric(). You could use suppressWarnings() around that particular element in the pipe ...
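For illustration, here is a base R sketch of the same steps with suppressWarnings() wrapped around only the as.numeric() call. It is a simplified stand-in for the pipeline above (it strips just "$", "," and the unit words, not bracketed text), and the intermediate names are made up.

```r
vals <- c("$34,543,701", "$69.40 million", "$1.519 billion", "junk")

# Work out the scale factor before the unit words are removed
multiplier <- ifelse(grepl("million", vals), 1e6,
              ifelse(grepl("billion", vals), 1e9, 1))

# Strip "$", "," and the unit words, then trim leftover whitespace
cleaned <- trimws(gsub("[$,]|(m|b)illion", "", vals))

# Silence only the expected coercion warning; "junk" still becomes NA
nums <- suppressWarnings(as.numeric(cleaned)) * multiplier
nums
```

Keeping suppressWarnings() this narrow means warnings from any other step would still surface.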
R data.frame strange behavior when converting characters to numeric
This is a difference in the defaults for tibbles and data.frames. When you mix together strings and numbers as in c(1, "01"), R converts everything to a string.
c(1, "01")
[1] "1" "01"
The default behavior for data.frame (in R versions before 4.0.0; from 4.0.0 on the default is FALSE) is to make strings into factors. If you look at the help page for data.frame in those versions you will see the argument:
stringsAsFactors: ... The ‘factory-fresh’ default is TRUE
So data frame makes c(1, "01") into a factor with two levels "1" and "01"
T1 = data.frame(fips = c(1,"01"))
str(T1)
'data.frame': 2 obs. of 1 variable:
$ fips: Factor w/ 2 levels "01","1": 2 1
Now factors are stored as integers for efficiency. That is why you see 2 1 at the end of the above output of str(T1). So if you directly convert that to an integer, you get 2 and 1.
You can get the behavior that you want, either by making the data.frame more carefully with
T1 = data.frame(fips = c(1,"01"), stringsAsFactors=FALSE)
or you can convert the factor to a string before converting to a number
T1$fips = as.numeric(as.character(T1$fips))
Tibbles do not have this problem because they do not convert the strings to factors.
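The pitfall can be reproduced independently of the data.frame defaults by building the factor explicitly. A minimal sketch (the variable name f is illustrative):

```r
# Build the factor directly so the result does not depend on stringsAsFactors
f <- factor(c("1", "01"))
levels(f)                     # levels are sorted: "01" comes before "1"

# Converting the factor directly returns the internal integer codes
as.numeric(f)

# Going through character first returns the values the labels represent
as.numeric(as.character(f))
```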
R - Convert character columns with $ and % signs into numeric
dat <- structure(list(Col1 = c("CST", "FSD", "SDD"), Col2 = c("$ 128,412.00",
"$ 138,232.40", "$ 112,234.45"), Col3 = c("$ 0.034", "$ 0.023",
"$ 0.023"), Col4 = c("+149.628%", "+124.244%", "-123.324%")),
class = "data.frame", row.names = c(NA, -3L))
# Col1 Col2 Col3 Col4
#1 CST $ 128,412.00 $ 0.034 +149.628%
#2 FSD $ 138,232.40 $ 0.023 +124.244%
#3 SDD $ 112,234.45 $ 0.023 -123.324%
To convert all columns but column 1 to numeric, you can do
tonum <- function (x) {
## delete "$", "," and "%" and convert string to numeric
num <- as.numeric(gsub("[$,%]", "", x))
## watch out for "%", that is, 90% should be 90 / 100 = 0.9
if (grepl("%", x[1])) num <- num / 100
## return
num
}
dat[-1] <- lapply(dat[-1], tonum)
dat
# Col1 Col2 Col3 Col4
#1 CST 128412.0 0.034 1.49628
#2 FSD 138232.4 0.023 1.24244
#3 SDD 112234.4 0.023 -1.23324
Remark:
I just learned about readr::parse_number() from PaulS's answer. It is an interesting function. Basically, it removes everything that cannot be a valid part of a number. As practice, I implemented the same logic using a regex. So here is a general-purpose tonum().
tonum <- function (x, regex = TRUE) {
## drop everything that is not "+/-", "0-9" or "."
## then convert string to numeric
if (regex) {
num <- as.numeric(stringr::str_remove_all(x, "[^+\\-0-9\\.]*"))
} else {
num <- readr::parse_number(x)
}
## watch out for "%", that is, 90% should be 90 / 100 = 0.9
ind <- grepl("%", x)
num[ind] <- num[ind] / 100
## return
num
}
Here is a quick test:
x <- unlist(dat[-1], use.names = FALSE)
x <- c(x, "euro 300.95", "RMB 888.66", "£1999.98")
# [1] "$ 128,412.00" "$ 138,232.40" "$ 112,234.45" "$ 0.034" "$ 0.023"
# [6] "$ 0.023" "+149.628%" "+124.244%" "-123.324%" "euro 300.95"
#[11] "RMB 888.66" "£1999.98"
tonum(x, regex = TRUE)
# [1] 128412.00000 138232.40000 112234.45000 0.03400 0.02300
# [6] 0.02300 1.49628 1.24244 -1.23324 300.95000
#[11] 888.66000 1999.98000
tonum(x, regex = FALSE)
# [1] 128412.00000 138232.40000 112234.45000 0.03400 0.02300
# [6] 0.02300 1.49628 1.24244 -1.23324 300.95000
#[11] 888.66000 1999.98000
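If neither stringr nor readr is available, the regex branch can be sketched with base R's gsub() alone. This is a dependency-free variant under that assumption; tonum_base() is a hypothetical name, and the inputs are a subset of the test values above.

```r
# Hypothetical base R variant of the regex branch of tonum()
tonum_base <- function(x) {
  # drop everything that is not "+", "-", a digit or "."
  num <- as.numeric(gsub("[^0-9.+-]", "", x))
  # values written as percentages are divided by 100
  ind <- grepl("%", x)
  num[ind] <- num[ind] / 100
  num
}

x <- c("$ 128,412.00", "$ 0.034", "+149.628%", "-123.324%", "euro 300.95")
tonum_base(x)
```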