Factor with Comma and Percentage to Numeric

How to read data when some numbers contain commas as thousands separators?

I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:

x <- read.csv("file.csv", header = TRUE, colClasses = "character")
col2cvt <- 15:41
x[, col2cvt] <- lapply(x[, col2cvt], function(col) as.numeric(gsub(",", "", col)))

Turn multiple char columns with percentage sign and comma decimal mark into numerics

# parse_number() reads the decimal comma as a grouping mark ("50,0%" -> 500), hence the / 10
df %>%
  mutate(across(everything(), ~ ifelse(str_detect(.x, "%"),
                                       parse_number(.x) / 10,
                                       .x)))

# A tibble: 6 x 10
Year gesamt weiblich `weiblich inProzent` Deutsche Deutsch~1 Auslä~2 Auslä~3 davon~4 Polen~5
<chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 1992 472 236 50 325 68.9 169 35.8 167 35.4
2 1993 997 546 54.8 598 60 399 40 384 38.5
3 1994 1443 724 50.2 841 58.3 602 41.7 566 39.2
4 1995 1810 949 52.4 1030 56.9 780 43.1 731 40.4
5 1996 2321 1242 53.5 1348 58.1 973 41.9 883 38
6 1997 2835 1584 55.9 1662 58.6 1173 41.4 1053 37.1
# ... with abbreviated variable names 1: `Deutsche inProzent`, 2: `Ausländer/innengesamt`,
# 3: `Ausländer/innenin Prozent`, 4: davonPolen, 5: `Polenin Prozent`

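Or, if you want to apply only parse_number to every column (the decimal comma is then dropped, so the percentages come out ten times too large):

df %>%
  mutate(across(everything(), parse_number))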

# A tibble: 6 x 10
Year gesamt weiblich `weiblich inProzent` Deutsche Deutsch~1 Auslä~2 Auslä~3 davon~4 Polen~5
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1992 472 236 500 325 689 169 358 167 354
2 1993 997 546 548 598 600 399 400 384 385
3 1994 1443 724 502 841 583 602 417 566 392
4 1995 1810 949 524 1030 569 780 431 731 404
5 1996 2321 1242 535 1348 581 973 419 883 380
6 1997 2835 1584 559 1662 586 1173 414 1053 371
# ... with abbreviated variable names 1: `Deutsche inProzent`, 2: `Ausländer/innengesamt`,
# 3: `Ausländer/innenin Prozent`, 4: davonPolen, 5: `Polenin Prozent`

Add a comma after two digits in R?

We can use comma() from formattable, which changes how the values are displayed while keeping the underlying column numeric:

df1$Percentage <- formattable::comma(df1$Percentage, big.interval = 2, digits = 0)

Checking:

> df1
Percentage
1 34,56
2 44,44
3 3,25
> str(df1)
'data.frame': 3 obs. of 1 variable:
$ Percentage: 'formattable' int 34,56 44,44 3,25
..- attr(*, "formattable")=List of 4
.. ..$ formatter: chr "formatC"
.. ..$ format :List of 4
.. .. ..$ format : chr "f"
.. .. ..$ big.mark : chr ","
.. .. ..$ digits : num 0
.. .. ..$ big.interval: num 2
.. ..$ preproc : NULL
.. ..$ postproc : NULL

Calculations still work, because the column is numeric:

> df1$Percentage * 100
[1] 34,56,00 44,44,00 3,25,00

Data:

df1 <- structure(list(Percentage = c(3456L, 4444L, 325L)),
                 class = "data.frame", row.names = c(NA, -3L))

SAS: Converting character to numeric variable - comma as a decimal separator

In your example new = . because SAS does not recognize the comma as a decimal separator. See the note in the log:

NOTE: Invalid argument to function INPUT at line 4 column 11.
old=1,61 new=. ERROR=1 N=1
NOTE: Mathematical operations could not be performed at the following places. The results of the operations have been set to
missing values.

The documentation contains a list of the various SAS informats. Based on it, the COMMAX informat looks like the right choice. The matching format is described as follows:

COMMAXw.d - Writes numeric values with a period that separates every three digits and a comma that separates the decimal fraction.

The informat of the same name reads values written that way.

The modified code looks like this:

data temp;
  old = '1,61';
  /* the COMMAX informat treats the comma as the decimal separator */
  new = input(old, commax5.);
run;

proc print;
run;

The resulting output is:

Obs    old     new

  1    1,61    1.61

If you want the new variable displayed with a comma as the decimal separator as well, add the statement format new commax5.2; to the data step (the .2 keeps two decimal places).

Thanks to Tom for pointing out that SAS uses informats in the INPUT() function.

What is a clean way to convert a string percent to a float?

Use strip('%'):

In [9]: "99.5%".strip('%')
Out[9]: '99.5' #convert this to float using float() and divide by 100

In [10]: def p2f(x):
   ....:     return float(x.strip('%'))/100
   ....:

In [12]: p2f("99%")
Out[12]: 0.99

In [13]: p2f("99.5%")
Out[13]: 0.995
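
Percent strings in the wild often carry surrounding whitespace, thousands separators, or a comma as the decimal mark, which p2f above does not handle. The following is a minimal sketch of one way to extend it; the function name percent_to_float and its decimal_mark parameter are hypothetical and not part of the original answer.

```python
def percent_to_float(text, decimal_mark="."):
    """Convert a string such as '99.5%' or '1.234,5 %' to a fraction."""
    # Remove surrounding whitespace and a trailing percent sign.
    cleaned = text.strip().rstrip("%").strip()
    if decimal_mark == ",":
        # Assumption: '.' is only a grouping mark here, so drop it,
        # then turn the decimal comma into a decimal point.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # Assumption: ',' is only a grouping mark here, so drop it.
        cleaned = cleaned.replace(",", "")
    return float(cleaned) / 100


print(percent_to_float("99.5%"))
print(percent_to_float("50,0%", decimal_mark=","))
print(percent_to_float("1,250%"))
```

The caller has to say which convention the data uses, because a string like "1,250%" is ambiguous on its own.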

Converting dot to comma in numeric

Your initial idea was almost correct; only the regular expression was wrong, because an unescaped . matches any character. You need something like the following (note that it converts the numeric vector to a character vector):

df$a <- gsub("\\.", ",", df$a)

You can also change the decimal mark R uses when printing, plotting and converting with as.character. Change it from its default with:

options(OutDec = ",")

Another option is the format function:

format(df, decimal.mark=",")

I assume that you care about how numbers are printed (output), because internally a numeric is stored as a double-precision floating-point number (update thanks to a comment by @digemall). Unless a function such as read.table is explicitly told that the decimal separator is , a comma cannot be treated as one, because in R code , separates function arguments.

That is exactly why NAs are introduced (aside from the incorrect regex) by code such as:

df$a <- as.numeric(gsub("\\.", ",", df$a))

By default the parser does not know that , is used as a decimal separator.


