Finding non-numeric data in a data frame or vector
df <- data.frame(x = c(1,2,3,4,"five",6,7,8,"nine",10))
The trick is knowing that converting to numeric via as.numeric(as.character(.)) will convert non-numbers to NA (with a warning).
which(is.na(as.numeric(as.character(df[[1]]))))
## 5 9
(Just using as.numeric(df[[1]]) doesn't work when the column is a factor: it drops the levels and leaves only the integer codes.)
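To make the factor pitfall concrete, here is a minimal sketch. The factor `f` is a made-up example, not data from the question; it only illustrates why the as.character() step matters.

```r
# Hypothetical factor; levels sort as "10", "20", "30", "five"
f <- factor(c("30", "10", "five", "20"))

# Direct conversion returns the internal level codes, not the values
codes <- as.numeric(f)

# Going through character recovers the values; "five" becomes NA
vals <- suppressWarnings(as.numeric(as.character(f)))

codes  # 3 1 4 2
vals   # 30 10 NA 20
```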
You might choose to suppress the warnings:
which.nonnum <- function(x) {
which(is.na(suppressWarnings(as.numeric(as.character(x)))))
}
which.nonnum(df[[1]])
To be more careful, you should also check that the values weren't NA before conversion:
which.nonnum <- function(x) {
badNum <- is.na(suppressWarnings(as.numeric(as.character(x))))
which(badNum & !is.na(x))
}
lapply(df, which.nonnum)
will report 'bad' values for all columns of the data frame.
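As a quick check of the more careful version, here is a small sketch on a made-up data frame. The columns `a` and `b` are illustrative assumptions, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# Hypothetical data: one pre-existing NA and two non-numeric strings
dat <- data.frame(
  a = c("1", "2", NA, "four"),
  b = c("1.5", "x", "3", "4"),
  stringsAsFactors = FALSE
)

# Indices of entries that fail numeric conversion but were not NA beforehand
which.nonnum <- function(x) {
  badNum <- is.na(suppressWarnings(as.numeric(as.character(x))))
  which(badNum & !is.na(x))
}

res <- lapply(dat, which.nonnum)
res$a  # 4 (the original NA in position 3 is not reported)
res$b  # 2
```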
Find non-numeric entries in a column that is supposed to contain numbers using R
You could try
which(!grepl('^[0-9]',grades))
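to check which entries do not start with a digit. Note that this pattern tests only the first character; use '^[0-9]+$' to require that an entry consists of digits only. It outputs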
2 5 7 9
Hope this helps!
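If decimals and negative numbers should also count as numeric, a fully anchored pattern may be closer to what is wanted. This is only a sketch: the `grades` vector below is invented (the original one is not shown), and the pattern deliberately ignores forms such as "1e3" or ".5".

```r
# Hypothetical input; the original `grades` vector is not shown in the answer
grades <- c("90", "B+", "85.5", "7a", "-3", "")

# Optional minus sign, digits, optional decimal part; anchored at both ends
is_num <- grepl("^-?[0-9]+(\\.[0-9]+)?$", grades)

bad <- which(!is_num)
bad  # 2 4 6
```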
Checking all non-numerical entries in a data.frame column and delete or substitute
I would use a loop and readline to create the new vector like this:
df <- data.frame(list(A=c(1, 2, 3, 4, 5, 6, 7, 8, 9), B=c("40g", "< 2", "thx", "about 1", "1-2", "1/2", 3, 2.3, "two")))
df$B <- as.character(df$B)
myscan <- function(x) {
new <- character(length(x))  # readline() returns strings, so collect them as character
for(i in seq_along(x)) {
new[i] <- readline(sprintf("Non numeric entry '%s' new value to set: ",x[i]))
}
as.numeric(new)
}
# get the entries
notNum <- is.na( as.numeric(df$B) )
# Loop and ask for updates
df$B[notNum] <- myscan(df$B[notNum])
When run it gives:
> df$B[notNum] <- as.numeric( myscan(df$B[notNum]) )
Non numeric entry '40g' new value to set: 0.4
Non numeric entry '< 2' new value to set: na
Non numeric entry 'thx' new value to set: ba
Non numeric entry 'about 1' new value to set: 1
Non numeric entry '1-2' new value to set: 1.5
Non numeric entry '1/2' new value to set: na
Non numeric entry 'two' new value to set: 2
Then we return the column to numeric state:
df$B <- as.numeric(df$B)
And we get the new data frame:
> df
  A   B
1 1 0.4
2 2  NA
3 3  NA
4 4 1.0
5 5 1.5
6 6  NA
7 7 3.0
8 8 2.3
9 9 2.0
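When the same corrections have to be applied repeatedly, a non-interactive variant may be more convenient than readline. The sketch below is an alternative technique, not the approach above: it looks replacements up in a named vector. The `fixes` table is a hypothetical name, and its values simply mirror the answers typed in the session shown.

```r
# Same values as column B above, as character
B <- c("40g", "< 2", "thx", "about 1", "1-2", "1/2", "3", "2.3", "two")

# Hypothetical corrections table: names are raw entries, values are replacements
fixes <- c("40g" = 0.4, "about 1" = 1, "1-2" = 1.5, "two" = 2)

num <- suppressWarnings(as.numeric(B))
bad <- is.na(num)

# Entries with no match in `fixes` ("< 2", "thx", "1/2") stay NA
num[bad] <- fixes[B[bad]]
num
```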
How to use OR between two non-numeric values?
You can use %in% instead:
m.v <- c("A", "AGG", "A" ,"G", "GA")
count <- 0
for(i in seq_along(m.v)){
if(m.v[i] %in% c("A", "G")){
count <- count+1
}
}
count
[1] 3
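Because %in% is vectorized, the loop is not strictly needed. A minimal sketch of the same count without it:

```r
m.v <- c("A", "AGG", "A", "G", "GA")

# %in% returns one TRUE/FALSE per element; summing counts the TRUEs
count <- sum(m.v %in% c("A", "G"))
count  # 3
```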
How to convert all non numeric cells in data frame to NA
Based on your edit, you have vectors which should be numeric, but due to some erroneous data introduced during the reading-in process, the data have been converted to another format (likely character or factor).
Here is an example of that case. mydf1 <- mydf2 <- mydf3 <- data.frame(...) just creates three data.frames with the same data.
# I'm going to show three approaches
mydf1 <- mydf2 <- mydf3 <- data.frame(
A = c(1, 2, "x", 4),
B = c("y", 3, 4, "-")
)
str(mydf1)
# 'data.frame': 4 obs. of 2 variables:
# $ A: Factor w/ 4 levels "1","2","4","x": 1 2 4 3
# $ B: Factor w/ 4 levels "-","3","4","y": 4 2 3 1
One way to do this is to just let R coerce any values that cannot be converted to numeric to NA:
## You WILL get warnings
mydf1[] <- lapply(mydf1, function(x) as.numeric(as.character(x)))
# Warning messages:
# 1: In FUN(X[[i]], ...) : NAs introduced by coercion
# 2: In FUN(X[[i]], ...) : NAs introduced by coercion
str(mydf1)
# 'data.frame': 4 obs. of 2 variables:
# $ A: num 1 2 NA 4
# $ B: num NA 3 4 NA
Another option is to use makemeNA from my SOfun package:
library(SOfun)
makemeNA(mydf2, "[^0-9]", FALSE)
#    A  B
# 1  1 NA
# 2  2  3
# 3 NA  4
# 4  4 NA
str(.Last.value)
# 'data.frame': 4 obs. of 2 variables:
# $ A: int 1 2 NA 4
# $ B: int NA 3 4 NA
This function is a bit different in that it uses type.convert to do the conversion, and can handle more specific rules for conversion to NA (just like you can use a vector for na.strings when reading data into R).
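The na.strings idea can also be tried directly with base R's type.convert. This is a sketch of the general mechanism only, not of how makemeNA is implemented; the vector `x` reuses the values of column B from the example.

```r
x <- c("y", "3", "4", "-")

# Treat the listed strings as missing, then let type.convert choose the type
out <- type.convert(x, na.strings = c("y", "-"), as.is = TRUE)
out  # NA 3 4 NA, stored as integer
```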
About your error: I believe you tried as.numeric on your whole data.frame, which would produce the error you showed.
Example:
# Your error...
as.numeric(mydf3)
# Error: (list) object cannot be coerced to type 'double'
You won't get that error on a matrix, though (but you'll still get the warning):
# You'll get a warning
as.numeric(as.matrix(mydf3))
# [1] 1 2 NA 4 NA 3 4 NA
# Warning message:
# NAs introduced by coercion
Why don't we need to explicitly use as.character? as.matrix does that for you:
str(as.matrix(mydf3))
# chr [1:4, 1:2] "1" "2" "x" "4" "y" "3" "4" "-"
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:2] "A" "B"
How can you use that information?
mydf3[] <- as.numeric(as.matrix(mydf3))
# Warning message:
# NAs introduced by coercion
str(mydf3)
# 'data.frame': 4 obs. of 2 variables:
# $ A: num 1 2 NA 4
# $ B: num NA 3 4 NA
R: How to find the mean of a column in a data frame, that has non-numeric (specifically, dashes '-') as well as numeric numbers
Try this, assuming your data is called dat:
dat[dat == "-"] <- NA
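# The column is still character (or factor) after the replacement, so convert it first
mean(as.numeric(as.character(dat$Population_and_People)), na.rm = TRUE)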
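A small worked example of the same steps, assuming made-up values. Only the column name Population_and_People comes from the question; the numbers are invented for illustration.

```r
# Hypothetical data with dashes standing in for missing values
dat <- data.frame(
  Population_and_People = c("100", "-", "250", "-", "400"),
  stringsAsFactors = FALSE
)

# Replace dashes with NA, then convert the column to numeric
dat[dat == "-"] <- NA
pop <- as.numeric(dat$Population_and_People)

m <- mean(pop, na.rm = TRUE)
m  # 250
```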