How to read data when some numbers contain commas as thousand separator?
I want to use R rather than pre-processing the data, as that makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:
x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})
Turn multiple char columns with percentage sign and comma decimal mark into numerics
df %>%
mutate(across(everything(), ~ ifelse(str_detect(.x, "%"),
parse_number(.x) / 10,
.x)))
# A tibble: 6 x 10
Year gesamt weiblich `weiblich inProzent` Deutsche Deutsch~1 Auslä~2 Auslä~3 davon~4 Polen~5
<chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 1992 472 236 50 325 68.9 169 35.8 167 35.4
2 1993 997 546 54.8 598 60 399 40 384 38.5
3 1994 1443 724 50.2 841 58.3 602 41.7 566 39.2
4 1995 1810 949 52.4 1030 56.9 780 43.1 731 40.4
5 1996 2321 1242 53.5 1348 58.1 973 41.9 883 38
6 1997 2835 1584 55.9 1662 58.6 1173 41.4 1053 37.1
# ... with abbreviated variable names 1: `Deutsche inProzent`, 2: `Ausländer/innengesamt`,
# 3: `Ausländer/innenin Prozent`, 4: davonPolen, 5: `Polenin Prozent`
Or, if you only want parse_number:
df %>%
mutate(across(everything(), parse_number))
# A tibble: 6 x 10
Year gesamt weiblich `weiblich inProzent` Deutsche Deutsch~1 Auslä~2 Auslä~3 davon~4 Polen~5
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1992 472 236 500 325 689 169 358 167 354
2 1993 997 546 548 598 600 399 400 384 385
3 1994 1443 724 502 841 583 602 417 566 392
4 1995 1810 949 524 1030 569 780 431 731 404
5 1996 2321 1242 535 1348 581 973 419 883 380
6 1997 2835 1584 559 1662 586 1173 414 1053 371
# ... with abbreviated variable names 1: `Deutsche inProzent`, 2: `Ausländer/innengesamt`,
# 3: `Ausländer/innenin Prozent`, 4: davonPolen, 5: `Polenin Prozent`
Add a comma after two digits in R?
We can make use of comma from formattable, which will modify the format while keeping the column numeric:
df1$Percentage <- formattable::comma(df1$Percentage, big.interval = 2, digits = 0)
Checking:
> df1
Percentage
1 34,56
2 44,44
3 3,25
> str(df1)
'data.frame': 3 obs. of 1 variable:
$ Percentage: 'formattable' int 34,56 44,44 3,25
..- attr(*, "formattable")=List of 4
.. ..$ formatter: chr "formatC"
.. ..$ format :List of 4
.. .. ..$ format : chr "f"
.. .. ..$ big.mark : chr ","
.. .. ..$ digits : num 0
.. .. ..$ big.interval: num 2
.. ..$ preproc : NULL
.. ..$ postproc : NULL
It is also possible to do calculations, as it is still a numeric column:
> df1$Percentage * 100
[1] 34,56,00 44,44,00 3,25,00
data
df1 <- structure(list(Percentage = c(3456L, 4444L, 325L)), class = "data.frame", row.names = c(NA,
-3L))
SAS: Converting character to numeric variable - comma as a decimal separator
The reason new = . in your example is that SAS does not recognize the comma as a decimal separator. See the note in the log:
NOTE: Invalid argument to function INPUT at line 4 column 11.
old=1,61 new=. ERROR=1 N=1
NOTE: Mathematical operations could not be performed at the following places. The results of the operations have been set to
missing values.
The documentation contains a list of the various SAS informats. Based on the documentation, it looks like you can use the COMMAX informat.
COMMAXw.d - Writes numeric values with a period that separates every three digits and a comma that separates the decimal fraction.
The modified code looks like this:
data temp;
old = '1,61';
new = input(old,commax5.);
run;
proc print;
run;
The resulting output is:
Obs old new
1 1,61 1.61
If you want to display the new variable in the same format, you can just add the statement format new commax5.2; to the data step.
Thanks to Tom for pointing out that SAS uses informats in the INPUT() function.
What is a clean way to convert a string percent to a float?
Use strip('%'), as:
In [9]: "99.5%".strip('%')
Out[9]: '99.5' #convert this to float using float() and divide by 100
In [10]: def p2f(x):
   ....:     return float(x.strip('%'))/100
   ....:
In [12]: p2f("99%")
Out[12]: 0.99
In [13]: p2f("99.5%")
Out[13]: 0.995
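A slightly more defensive variant of p2f could look like the sketch below. It is not part of the original answer: the name percent_to_float and the decimal_comma parameter are hypothetical, and it assumes at most one trailing percent sign and no thousands separators in the input.

```python
def percent_to_float(text, decimal_comma=False):
    """Convert a string such as '99.5%' to a fraction such as 0.995.

    Hypothetical helper: decimal_comma=True treats ',' as the decimal
    mark (e.g. '68,9%'). Thousands separators are not handled.
    """
    cleaned = text.strip()
    if cleaned.endswith('%'):
        # Drop only a single trailing percent sign, then any space before it.
        cleaned = cleaned[:-1].strip()
    if decimal_comma:
        # Assumption: the only comma present is the decimal mark.
        cleaned = cleaned.replace(',', '.')
    # float() raises ValueError if the remaining text is not a number.
    return float(cleaned) / 100


print(percent_to_float("99.5%"))
print(percent_to_float(" 68,9 % ", decimal_comma=True))
```

Unlike strip('%'), this removes the percent sign only from the end of the string, so malformed input such as "%5" raises a ValueError instead of being silently accepted.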
Converting dot to comma in numeric
Your initial idea was almost correct; only the regular expression was wrong, because an unescaped . matches any character. You need something like the following (note that this converts the numeric vector to a character vector):
df$a <- gsub("\\.", ",", df$a)
You can also change the decimal mark used when R prints and plots numbers, and by the as.character function. Change it from its default with:
options(OutDec= ",")
Another option is the format function:
format(df, decimal.mark=",")
I assume that you care about how numbers are printed (output), because internally a numeric is stored as a double-precision floating-point number (update thanks to a comment by @digemall). Also, unless a function such as read.table is explicitly told that the decimal separator is , it is not possible to do otherwise, because by default , is used for separating function arguments.
The NA values are introduced exactly for that reason (aside from the incorrect regex):
df$a <- as.numeric(gsub("\\.", ",", df$a))
By default, the parser does not know that , is used as a decimal separator.