How to Read CSV File in R Where Some Values Contain the Percent Symbol (%)

There is no "percentage" type in R, so you need to do some post-processing:

DF <- read.table(text="actual,simulated,percent error
2.1496,8.6066,-300%
0.9170,8.0266,-775%
7.9406,0.2152,97%
4.9637,3.5237,29%", sep=",", header=TRUE)

DF[,3] <- as.numeric(gsub("%", "",DF[,3]))/100

# actual simulated percent.error
#1 2.1496 8.6066 -3.00
#2 0.9170 8.0266 -7.75
#3 7.9406 0.2152 0.97
#4 4.9637 3.5237 0.29

Reading csv files with R with percentages as X% and varying NA characters

With NAs you don't necessarily need a solution involving gsub or one of its kin. read.table() has an na.strings argument, and you can specify several NA strings at once. For example, the table you posted could be read into R with the following command:

test<-read.table("clipboard", header=T, sep="\t", na.strings=c("9", "does not apply"))

That takes the table from the clipboard, and converts both "9" and "does not apply" to NAs in the resulting table:

test
x1 x2 x3
1 1 10% 1
2 2 20% 2
3 3 30% NA
4 NA 40% 4

This works fine, unless some of the columns contain, e.g., "9" as data and others have it meaning NA.
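One way around that limitation, as a sketch: read every column as character (so nothing is converted prematurely), then recode NAs column by column. The column names and the choice of which column treats "9" as NA mirror the example above and are otherwise arbitrary.

```r
# "9" is valid data in x1 but means NA in x3, so read everything as
# character first, then clean up column by column.
raw <- read.table(text = "x1\tx2\tx3
1\t10%\t1
2\t20%\t2
3\t30%\t9
9\t40%\t4", header = TRUE, sep = "\t", colClasses = "character")

raw$x3[raw$x3 == "9"] <- NA    # "9" means NA only in x3
raw$x1 <- as.numeric(raw$x1)   # "9" in x1 is real data
raw$x3 <- as.numeric(raw$x3)
```

This is more typing than na.strings, but it lets each column have its own idea of what counts as missing.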

As for the percentage problem, that might be easiest to solve using the gsub method. Another solution is to define a new coercion function and then specify the colClasses argument in read.table(). Something like this should work:

# New coercion function
setAs("character", "num_pct", function(from) as.numeric(gsub("%", "", from))/100)
# Define column classes for the columns in the table
test<-read.table("clipboard", header=T, sep="\t", na.strings=c("9", "does not apply"),
colClasses=c("character", "num_pct", "character"))

This command now reads in the table with the specified classes for the columns, and converts the percentages in the second column of the table to decimal numbers on the fly.
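For reference, here is the same recipe in a reproducible form, with the clipboard swapped for inline text; the data values are invented to mirror the earlier example.

```r
library(methods)  # setAs() lives here (attached by default in interactive R)

# Custom coercion: strip "%" and divide by 100
setAs("character", "num_pct",
      function(from) as.numeric(gsub("%", "", from)) / 100)

test <- read.table(text = "x1\tx2\tx3
1\t10%\t1
2\t20%\t2
3\t30%\tdoes not apply
9\t40%\t4",
  header = TRUE, sep = "\t",
  na.strings = c("9", "does not apply"),
  colClasses = c("character", "num_pct", "character"))

test$x2
# [1] 0.1 0.2 0.3 0.4
```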

read.table: percent sign (%) and forward slash (/) in headers replaced by dot (.)

By default R tries to make sure that the data frame you are importing has syntactically valid names, via the check.names argument, which is TRUE by default. That disallows column names containing symbols such as % or / (or others, as defined in make.names).

We can, however, override this behavior using check.names = FALSE

read.table(text = "Subject,Exp1_BSL_SDNN,Exp1_BSL_LF/HF,Exp1_BSL_%LF
s1,123,123,123
s2,123,123,123", sep=",", header=TRUE, check.names = FALSE)

# Subject Exp1_BSL_SDNN Exp1_BSL_LF/HF Exp1_BSL_%LF
#1 s1 123 123 123
#2 s2 123 123 123

Read csv file in R with currency column as numeric

I'm not sure how to read it in directly, but you can modify it once it's in:

> A <- read.csv("~/Desktop/data.csv")
> A
id desc price
1 0 apple $1.00
2 1 banana $2.25
3 2 grapes $1.97
> A$price <- as.numeric(sub("\\$","", A$price))
> A
id desc price
1 0 apple 1.00
2 1 banana 2.25
3 2 grapes 1.97
> str(A)
'data.frame': 3 obs. of 3 variables:
$ id : int 0 1 2
$ desc : Factor w/ 3 levels "apple","banana",..: 1 2 3
$ price: num 1 2.25 1.97

I think it might just have been a missing escape in your sub() call. In a regular expression, $ matches the end of a line; \$ is a literal dollar sign. But in an R string you then have to escape the escape as well, hence "\\$".
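To illustrate the escaping point:

```r
# "$" alone matches the (empty) end of the string, so nothing visible changes:
sub("$", "", "$1.00")                # still "$1.00"

# "\\$" matches the literal dollar sign:
sub("\\$", "", "$1.00")              # "1.00"

# Or sidestep regular expressions entirely:
sub("$", "", "$1.00", fixed = TRUE)  # "1.00"
```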

Reading X%-formatted percentages into R

Here's a dplyr and readr solution:

library(dplyr) # Version >= 1.0.0
library(readr)
library(stringr)
data %>%
mutate(across(where(~any(str_detect(.,"%"))), parse_number))
# A tibble: 3 x 3
name count percentage
<chr> <dbl> <dbl>
1 Alice 4 40
2 Bob 10 65
3 Carol 15 15

Feel free to replace any with all if you prefer.

A benefit of this approach is that it detects which columns contain the % symbol and parses only those, so there is no need to know in advance which columns need to be converted.

How to read data when some numbers contain commas as thousand separator?

I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:

x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})
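The setAs()/colClasses trick from the percentage answer above can be adapted for thousands separators too. A sketch; the class name num.with.commas and the sample data are my own:

```r
library(methods)  # for setAs()

# Custom coercion: strip thousands-separator commas, then convert
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from)))

x <- read.csv(text = 'id,amount
1,"1,234"
2,"56,789"', colClasses = c("integer", "num.with.commas"))
x$amount
# [1]  1234 56789
```

This avoids the lapply pass entirely: the conversion happens as the file is read.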

Read CSV file up to line with unique marker

Three thoughts:

  1. Use readLines (as @user2554330 suggested), drop the marker row and everything from it onward, then parse the remaining text vector with read.csv; my least favorite of the three.

  2. before[seq_len(min(head(which(!grepl("^[^- ]+$", before$Total)),1)-1L,nrow(before))),]; a bit complicated, granted, but it does what you need (assuming that you've already filtered the first 14 rows with skip=).

  3. Use an external tool such as sed -e '1,14d;/^[ -]\+$/{g;q;}' in a pipe(...)-type call.
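A sketch of option 1, using an invented stand-in file with a 14-line preamble and a dashed marker row (the real file name, preamble length, and marker pattern are assumptions from the question; adjust to your data):

```r
# Build a tiny stand-in file: 14 junk lines, data, a dashed marker, a footer.
tmp <- tempfile(fileext = ".csv")
writeLines(c(rep("preamble", 14),
             "actual,simulated",
             "2.1496,8.6066",
             "----- -----",
             "footer totals"), tmp)

raw  <- readLines(tmp)
body <- raw[-(1:14)]                         # skip the 14-line preamble
marker <- which(grepl("^[ -]+$", body))[1]   # first row of only dashes/spaces
if (!is.na(marker)) body <- body[seq_len(marker - 1)]
before <- read.csv(text = paste(body, collapse = "\n"))
```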


