R Cleaning Up a Character and Converting It into a Numeric

You can parse out what you don't want with regular expressions:

test <- "532.dcx3vds98"
destring <- function(x, keep = "0-9.") {
  return(as.numeric(gsub(paste("[^", keep, "]+", sep = ""), "", x)))
}
destring(test)

Returns 532.398.
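A brief sketch of how the keep argument changes the result, plus one caveat: the function only deletes unwanted characters and joins what is left, so input with more than one decimal point no longer parses. The example inputs are made up for illustration.

```r
# Same helper as above, repeated so this block runs on its own
destring <- function(x, keep = "0-9.") {
  as.numeric(gsub(paste0("[^", keep, "]+"), "", x))
}

# Works on whole vectors, not just single strings
destring(c("532.dcx3vds98", "abc42"))
# values: 532.398 and 42

# Adding "-" to keep preserves a minus sign (hypothetical input)
destring("-12abc", keep = "0-9.-")
# value: -12

# Caveat: "v1.2.3" is reduced to "1.2.3", which is not a valid number
suppressWarnings(destring("v1.2.3"))
# value: NA
```

Putting the hyphen last in keep makes it a literal character inside the bracket expression rather than the start of a range.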

Edit: this function is now available in the taRifx package:

library(taRifx)
test <- "532.dcx3vds98"
destring(test)

Clean special character, numeric and character

You can use tidyr::extract to split emp_length into two columns first. Then, in the column holding the number, replace every character other than 0-9 with "" and convert the result to numeric.

Option #1: Keep the symbol with the number

library(tidyverse)
df <- df %>% extract(emp_length, c("emp_length", "years"),
                     regex = "([[:digit:]+<]+)\\s+(\\w+)")

df
#   emp_length years
# 1        10+ years
# 2         <1  year
# 3          8 years

Option #2: Keep only the number, in a numeric column

library(tidyverse)

df <- df %>%
  extract(emp_length, c("emp_length", "years"), regex = "([[:digit:]+<]+)\\s+(\\w+)") %>%
  mutate(emp_length = as.numeric(gsub("[^0-9]", "", emp_length))) # drop "+" and "<"

df
#   emp_length years
# 1         10 years
# 2          1  year
# 3          8 years

Data:

df <- data.frame(emp_length = c("10+ years", "<1 year", "8 years"),
stringsAsFactors = FALSE)
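
If only the numeric part is needed, a base R sketch can skip the split into two columns. This is an alternative to the tidyr approach rather than part of the original answer, and it assumes every value contains exactly one run of digits. The column name emp_length_num is made up for this illustration.

```r
df <- data.frame(emp_length = c("10+ years", "<1 year", "8 years"),
                 stringsAsFactors = FALSE)

# Delete every character that is not a digit, then convert
df$emp_length_num <- as.numeric(gsub("[^0-9]", "", df$emp_length))

df$emp_length_num
# values: 10, 1, 8
```

Note that "<1 year" becomes 1, so the "less than" information is lost, just as in Option #2.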

Transforming complete age from character to numeric in R

Using lubridate convenience functions, period and time_length:

library(lubridate)
library(dplyr) # provides %>% and mutate()
age %>%
  mutate(age_years = time_length(period(complete_age), unit = "years"))

# A tibble: 4 x 2
#   complete_age              age_years
#   <chr>                         <dbl>
# 1 10 years 8 months 23 days 10.729637
# 2 9 years 11 months 7 days   9.935832
# 3 11 years 3 months 1 day   11.252738
# 4 8 years 6 months 12 days   8.532854
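
For comparison, the same arithmetic can be sketched in base R. It assumes a year counts as 365.25 days and a month as one twelfth of a year, which reproduces the values shown above. get_unit is a hypothetical helper written for this illustration, not a lubridate function.

```r
complete_age <- c("10 years 8 months 23 days", "9 years 11 months 7 days",
                  "11 years 3 months 1 day", "8 years 6 months 12 days")

# Hypothetical helper: the number in front of a unit word, or 0 if absent
get_unit <- function(x, unit) {
  out <- numeric(length(x))
  hit <- grepl(paste0("[0-9]+ ", unit), x)
  out[hit] <- as.numeric(sub(paste0(".*?([0-9]+) ", unit, ".*"), "\\1",
                             x[hit], perl = TRUE))
  out
}

# Assumption: 1 year = 365.25 days, 1 month = 1/12 year
age_years <- get_unit(complete_age, "year") +
  get_unit(complete_age, "month") / 12 +
  get_unit(complete_age, "day") / 365.25

round(age_years, 6)
# values: 10.729637, 9.935832, 11.252738, 8.532854
```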

Converting from character to numeric in a matrix

Answered by @Sophia.

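Her solution:
How about converting your matrix to a data.frame before you add the character columns? A matrix can hold only one data type, so adding character columns coerces every value to character.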
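
A minimal sketch of why this helps, using made-up data since the original matrix is not shown: binding a character column onto a numeric matrix turns the whole matrix into character, while a data.frame keeps a separate type per column.

```r
m <- matrix(c(1.5, 2.5, 3.5, 4.5), nrow = 2)

# Binding a character column coerces the entire matrix to character
m_chr <- cbind(m, c("a", "b"))
typeof(m_chr)
# "character"

# Converting to a data.frame first keeps the numeric columns numeric
d <- as.data.frame(m)
d$label <- c("a", "b")
sapply(d, class)
# V1 and V2 are "numeric", label is "character"
```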

Converting Character to Numeric without NA Coercion in R

As Anando pointed out, the problem is somewhere in your data, and we can't really help you much without a reproducible example. That said, here's a code snippet to help you pin down the records in your data that are causing you problems:

test = as.character(c(1,2,3,4,'M'))
v = as.numeric(test) # NAs introduced by coercion
ix.na = is.na(v)
which(ix.na) # row index of our problem = 5
test[ix.na] # shows the problematic record, "M"

Instead of guessing as to why NAs are being introduced, pull out the records that are causing the problem and address them directly/individually until the NAs go away.
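
The same check can be wrapped in a small helper. This is only a sketch: find_non_numeric is a hypothetical name, and the input vector is made up. It reports the distinct values that fail to convert while ignoring values that were already NA.

```r
# Hypothetical helper: distinct values that as.numeric() cannot convert
find_non_numeric <- function(x) {
  x <- as.character(x)
  converted <- suppressWarnings(as.numeric(x))
  bad <- is.na(converted) & !is.na(x)  # new NAs only, not pre-existing ones
  unique(x[bad])
}

find_non_numeric(c("1", "2", " 3", "M", NA, "1,000", "M"))
# "M" "1,000"
```

Leading whitespace as in " 3" converts without trouble, whereas a thousands separator as in "1,000" does not.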

UPDATE: Looks like the problem is in your call to str_replace_all. I don't know the stringr library, but I think you can accomplish the same thing with gsub like this:

v2 = c("1.00","2.00","3.00")
gsub("\\.00", "", v2)

[1] "1" "2" "3"

I'm not entirely sure what this accomplishes though:

sum(as.numeric(v2)!=as.numeric(gsub("\\.00", "", v2))) # Illustrate that vectors are equivalent.

[1] 0

Unless this achieves some specific purpose for you, I'd suggest dropping this step from your preprocessing entirely, as it doesn't appear necessary and seems to be giving you problems.

Using stringr for conversion of character object to numeric object

library("tidyverse")

example data

(using some of the values from your URL)

vals <- c("$34,543,701", "$69.40 million","$1.519 billion","junk")
dd <- tibble(vals)

transform

(dd
  %>% mutate(vals = str_remove_all(vals, "(,|\\$|\\[.*\\]|\\(.*\\))"), ## strip extraneous chars
             multiplier = ifelse(str_detect(vals, "million"), 1e6,
                          ifelse(str_detect(vals, "billion"), 1e9, 1)),
             vals = str_remove(vals, "(m|b)illion"), ## drop words
             vals = as.numeric(vals) * multiplier)
  %>% select(-multiplier) ## drop auxiliary variable
)

I intentionally left a non-numeric value in the example (since such values exist in the example you gave); this will trigger a warning from as.numeric(). You could use suppressWarnings() around that particular element in the pipe ...
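
A base R sketch of the same steps, with suppressWarnings() wrapped around only the as.numeric() call. This is an illustration with base functions swapped in for the stringr ones, and it handles just the "$" and "," characters rather than the bracketed text the pipeline above also strips.

```r
vals <- c("$34,543,701", "$69.40 million", "$1.519 billion", "junk")

# Strip "$" and "," (inside a bracket expression "$" is a literal character)
clean <- gsub("[$,]", "", vals)

# Work out the multiplier before the words are removed
multiplier <- ifelse(grepl("million", clean), 1e6,
              ifelse(grepl("billion", clean), 1e9, 1))
clean <- sub("(m|b)illion", "", clean)

# Only the conversion is silenced; "junk" still becomes NA
result <- suppressWarnings(as.numeric(clean)) * multiplier
result
# approximately 34543701, 6.94e7, 1.519e9, NA
```

Keeping suppressWarnings() this narrow means warnings from any other step would still surface.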

R data.frame strange behavior when converting characters to numeric

This is a difference in the defaults for tibbles and data.frames. When you mix together strings and numbers as in c(1, "01"), R converts everything to a string.

c(1, "01")
[1] "1" "01"

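In R versions before 4.0.0, the default behavior for data.frame is to make strings into factors (since R 4.0.0 the default is stringsAsFactors = FALSE). If you look at the help page for data.frame in those older versions you will see the argument: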

stringsAsFactors: ... The ‘factory-fresh’ default is TRUE

So data frame makes c(1, "01") into a factor with two levels "1" and "01"

T1 = data.frame(fips = c(1,"01")) 
str(T1)
'data.frame': 2 obs. of 1 variable:
$ fips: Factor w/ 2 levels "01","1": 2 1

Now factors are stored as integers for efficiency. That is why you see 2 1 at the end of the above output of str(T1). So if you directly convert that to an integer, you get 2 and 1.

You can get the behavior that you want, either by making the data.frame more carefully with

T1 = data.frame(fips = c(1,"01"), stringsAsFactors=FALSE)

or you can convert the factor to a string before converting to a number

T1$fips = as.numeric(as.character(T1$fips))

Tibbles do not have this problem because they do not convert the strings to factors.
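
The difference between the two conversions can be seen on a standalone factor. This sketch builds the factor explicitly, with the level order spelled out, so it behaves the same regardless of the stringsAsFactors default.

```r
# Levels given explicitly so the integer codes are predictable
fips <- factor(c("1", "01"), levels = c("01", "1"))

# Converting directly returns the internal level codes
as.numeric(fips)
# values: 2, 1

# Going through character returns the numbers the labels represent
as.numeric(as.character(fips))
# values: 1, 1

# Equivalent: convert the levels once, then index by the codes
as.numeric(levels(fips))[fips]
# values: 1, 1
```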

R - Convert character columns with $ and % signs into numeric

dat <- structure(list(Col1 = c("CST", "FSD", "SDD"), Col2 = c("$ 128,412.00", 
"$ 138,232.40", "$ 112,234.45"), Col3 = c("$ 0.034", "$ 0.023",
"$ 0.023"), Col4 = c("+149.628%", "+124.244%", "-123.324%")),
class = "data.frame", row.names = c(NA, -3L))
#  Col1         Col2    Col3      Col4
#1  CST $ 128,412.00 $ 0.034 +149.628%
#2  FSD $ 138,232.40 $ 0.023 +124.244%
#3  SDD $ 112,234.45 $ 0.023 -123.324%

To convert all columns but column 1 to numeric, you can do

tonum <- function (x) {
  ## delete "$", "," and "%" and convert string to numeric
  num <- as.numeric(gsub("[$,%]", "", x))
  ## watch out for "%", that is, 90% should be 90 / 100 = 0.9
  if (grepl("%", x[1])) num <- num / 100
  ## return
  num
}

dat[-1] <- lapply(dat[-1], tonum)
dat
#  Col1     Col2  Col3     Col4
#1  CST 128412.0 0.034  1.49628
#2  FSD 138232.4 0.023  1.24244
#3  SDD 112234.4 0.023 -1.23324

Remark:

I just learned about readr::parse_number() from PaulS's answer. It is an interesting function: basically, it removes everything that cannot be a valid part of a number. As an exercise, I implemented the same logic with a regular expression. So here is a general-purpose tonum().

tonum <- function (x, regex = TRUE) {
  ## drop everything that is not "+/-", "0-9" or "."
  ## then convert string to numeric
  if (regex) {
    num <- as.numeric(stringr::str_remove_all(x, "[^+\\-0-9\\.]*"))
  } else {
    num <- readr::parse_number(x)
  }
  ## watch out for "%", that is, 90% should be 90 / 100 = 0.9
  ind <- grepl("%", x)
  num[ind] <- num[ind] / 100
  ## return
  num
}

Here is a quick test:

x <- unlist(dat[-1], use.names = FALSE)
x <- c(x, "euro 300.95", "RMB 888.66", "£1999.98")
# [1] "$ 128,412.00" "$ 138,232.40" "$ 112,234.45" "$ 0.034" "$ 0.023"
# [6] "$ 0.023" "+149.628%" "+124.244%" "-123.324%" "euro 300.95"
#[11] "RMB 888.66" "£1999.98"

tonum(x, regex = TRUE)
# [1] 128412.00000 138232.40000 112234.45000 0.03400 0.02300
# [6] 0.02300 1.49628 1.24244 -1.23324 300.95000
#[11] 888.66000 1999.98000

tonum(x, regex = FALSE)
# [1] 128412.00000 138232.40000 112234.45000 0.03400 0.02300
# [6] 0.02300 1.49628 1.24244 -1.23324 300.95000
#[11] 888.66000 1999.98000
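
One caveat of the regex approach, sketched here in base R with made-up inputs: keeping every sign and dot means strings containing more than one of them no longer form a valid number. strip_non_numeric is a hypothetical helper that mirrors the character class used above, with the hyphen placed last so it is read literally.

```r
# Hypothetical helper: keep only digits, ".", "+" and "-"
strip_non_numeric <- function(x) gsub("[^0-9.+-]", "", x)

x <- c("$ 1,234.50", "3-5 days", "v1.2.3")

strip_non_numeric(x)
# "1234.50" "3-5" "1.2.3"

# Only the first string is a valid number after cleaning
suppressWarnings(as.numeric(strip_non_numeric(x)))
# values: 1234.5, NA, NA
```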

