Remove the rows that have non-numeric characters in one column in R
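When you import data into a data.frame, a column that is not entirely numeric is read as character or, in R versions before 4.0.0 (where stringsAsFactors defaulted to TRUE), as a factor. For a factor you have to convert to character first and then to numeric, because calling as.numeric() directly on a factor returns the internal level codes.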
dat <- data.frame(A=c(letters[1:5],1:5))
str(dat)
'data.frame': 10 obs. of 1 variable:
$ A: Factor w/ 10 levels "1","2","3","4",..: 6 7 8 9 10 1 2 3 4 5
as.numeric(as.character(dat$A))
[1] NA NA NA NA NA 1 2 3 4 5
Warning message:
NAs introduced by coercion
Notice that it converts non-numeric strings to NA. Combining this:
dat <- dat[!is.na(as.numeric(as.character(dat$A))),]
In words: keep the rows of dat that are not NA after conversion from factor to numeric.
Second Issue:
> dat <- data.frame(A=c(letters[1:5],1:5))
> dat <- dat[!is.na(as.numeric(as.character(dat$A))),]
Warning message:
In `[.data.frame`(dat, !is.na(as.numeric(as.character(dat$A))), :
NAs introduced by coercion
> dat <- dat[!is.na(as.numeric(as.character(dat$A))),]
Error in dat$A : $ operator is invalid for atomic vectors
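The error on the second call is consistent with how `[` treats a one-column data frame: when rows are selected and only one column remains, the result is dropped to a plain vector unless drop = FALSE is given, so dat$A no longer works the second time. A minimal sketch of a fix, assuming the same one-column example data (the name is_num is only illustrative):

```r
# Same one-column example as above; kept as character rather than factor
dat <- data.frame(A = c(letters[1:5], 1:5), stringsAsFactors = FALSE)

# Convert once; suppressWarnings() hides the expected "NAs introduced by coercion"
is_num <- !is.na(suppressWarnings(as.numeric(as.character(dat$A))))

# drop = FALSE keeps the result a data.frame even though only one column remains
dat <- dat[is_num, , drop = FALSE]

# The column is still character after filtering, so convert it explicitly
dat$A <- as.numeric(dat$A)
str(dat)
```

Because dat stays a data frame, running the filter a second time no longer raises the `$` error.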
Is there any way to delete the rows of data which don't have all numeric values?
One base R option could be:
data[!is.na(Reduce(`+`, lapply(data, as.numeric))), ]
a b
2 2 2
3 3 3
And for importing the data, use stringsAsFactors = FALSE.
Or using sapply():
data[!is.na(rowSums(sapply(data, as.numeric))), ]
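The answer does not show the input, so here is a self-contained sketch with a hypothetical data frame chosen so that rows 2 and 3 survive, as in the output above. It relies on the columns being character rather than factor, which is why stringsAsFactors = FALSE matters.

```r
# Hypothetical data: row 1 has a non-numeric value in column a
data <- data.frame(a = c("x", "2", "3"),
                   b = c("1", "2", "3"),
                   stringsAsFactors = FALSE)

# Convert every column to numeric; non-numeric strings become NA
num <- suppressWarnings(sapply(data, as.numeric))

# A row sum is NA if any cell in the row is NA, so only fully numeric rows are kept
clean <- data[!is.na(rowSums(num)), ]

# The surviving columns are still character; convert them if numbers are needed
clean[] <- lapply(clean, as.numeric)
clean
```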
Replacing all non-numeric characters in certain columns in R
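You could use across() (within mutate()) to apply the change to every column except a, and use a regular expression (within str_extract()) to extract only the digits and convert them to numeric. Note that "\\d+" captures only the first run of digits in each value, so decimal points and signs are not preserved.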
library(tidyverse)
d |>
mutate(across(-a, ~ . |> str_extract("\\d+") |> as.numeric()))
Output:
# A tibble: 6 × 3
a b c
<chr> <dbl> <dbl>
1 Tom 8 2
2 Mary 3 12
3 Ben 6 6
4 Jane 7 7
5 Lucas 5 1
6 Mark 1 9
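If the tidyverse is not available, roughly the same idea can be sketched in base R with regexpr() and regmatches(). The data frame d and the helper first_number() below are hypothetical, made up for illustration, and like the pattern above the helper keeps only the first run of digits in each value.

```r
# Hypothetical input: column a holds names, b and c hold numbers mixed with text
d <- data.frame(a = c("Tom", "Mary", "Ben"),
                b = c("8kg", "3 units", "none"),
                c = c("x2", "12", "6"),
                stringsAsFactors = FALSE)

# Illustrative helper: first run of digits in each string, NA when there is none
first_number <- function(x) {
  x <- as.character(x)
  m <- regexpr("[0-9]+", x)
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & as.integer(m) > 0
  out[hit] <- regmatches(x, m)   # regmatches() returns matched substrings only
  as.numeric(out)
}

# Apply to every column except the first one (a)
d[-1] <- lapply(d[-1], first_number)
d
```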
Removing data with a non-numeric column value in R
If you just want to filter out rows with NA values, you can use complete.cases():
> df
id age fev height male smoke
1 1 72 1.284 66.5 1 1
2 2 81 2.553 67.0 0 0
3 3 90 2.383 67.0 1 0
4 4 72 2.699 71.5 1 0
5 5 70 2.031 62.5 0 0
6 6 72 2.410 67.5 1 0
7 7 75 3.586 69.0 1 0
8 8 75 2.958 67.0 1 0
9 9 67 1.916 62.5 0 0
10 10 70 NA 66.0 0 1
> df[complete.cases(df), ]
id age fev height male smoke
1 1 72 1.284 66.5 1 1
2 2 81 2.553 67.0 0 0
3 3 90 2.383 67.0 1 0
4 4 72 2.699 71.5 1 0
5 5 70 2.031 62.5 0 0
6 6 72 2.410 67.5 1 0
7 7 75 3.586 69.0 1 0
8 8 75 2.958 67.0 1 0
9 9 67 1.916 62.5 0 0
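complete.cases() only looks for NA, so text values have to be converted to NA first (for example with as.numeric()) before it can remove them. It can also be limited to selected columns. A small sketch with a hypothetical data frame:

```r
# Hypothetical data with missing values in two different columns
df <- data.frame(id  = 1:4,
                 age = c(72, 81, NA, 70),
                 fev = c(1.284, NA, 2.383, 2.031))

# Drop a row if any column is NA
all_complete <- df[complete.cases(df), ]

# Drop a row only if fev is NA; a missing age is tolerated
fev_complete <- df[complete.cases(df[c("id", "fev")]), ]

all_complete
fev_complete
```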
How to delete all non-numeric rows in R?
Subset to numeric IDs:
subset(df, grepl('^\\d+$', df$ID))
The pattern matches values of ID that consist entirely of digits: ^ and $ anchor the match to the start and end of the string, and \\d+ requires one or more digits in between.
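A runnable sketch of that subset, using a hypothetical data frame whose ID column mixes numeric and non-numeric strings:

```r
# Hypothetical data: only "101" and "7" consist entirely of digits
df <- data.frame(ID = c("101", "A12", "7", "3b", ""),
                 value = 1:5,
                 stringsAsFactors = FALSE)

# Inside subset() the column can be referred to by name
numeric_ids <- subset(df, grepl("^\\d+$", ID))
numeric_ids
```

The empty string is rejected as well, because + requires at least one digit.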
How to delete a row in R that doesn't have a number
Example data.frame:
df <- data.frame(a=1:10, b=1:10, FRQ=c(rnorm(8), '.', 'rabbit'), stringsAsFactors=FALSE)
To check the class of all your columns try: lapply(df, class)
If the FRQ column is character, you can drop the rows whose values contain non-numeric characters and then convert the column to numeric, like this:
library(stringr)
df <- df[!str_detect(df$FRQ, '([A-Za-z])'), ]
df <- df[!str_detect(df$FRQ, '\\.$'), ]
df$FRQ <- as.numeric(df$FRQ)
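One caveat with filtering on letters: a value in scientific notation such as "1e-05" contains the letter e and would be dropped even though it is a valid number. A hedged alternative sketch that lets as.numeric() decide what counts as numeric, using hypothetical values:

```r
# Hypothetical data: two valid numbers, a lone ".", and a word
df <- data.frame(a = 1:4,
                 FRQ = c("0.25", "1e-05", ".", "rabbit"),
                 stringsAsFactors = FALSE)

# Anything as.numeric() cannot parse becomes NA (warning suppressed)
frq_num <- suppressWarnings(as.numeric(df$FRQ))
keep <- !is.na(frq_num)

# Keep the parseable rows and store the converted values
df <- df[keep, ]
df$FRQ <- frq_num[keep]
df
```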
Remove Non Numeric values (*Unknown*) in my data frame
We could avoid this problem by specifying na.strings in read.csv/read.table:
dataL <- read.csv("file.csv", stringsAsFactors = FALSE,
na.strings = c("NA", "N/A", "Unknown*", "NULL", ".P"))
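To try this without a file on disk, the same call can be sketched with read.csv(text = ...). The CSV content below is hypothetical:

```r
# Hypothetical CSV content with two placeholder strings in the score column
csv_text <- "id,score\n1,10\n2,Unknown*\n3,N/A\n4,7\n"

dataL <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  na.strings = c("NA", "N/A", "Unknown*", "NULL", ".P"))

# With the placeholders read as NA, score is parsed as a numeric column
str(dataL)
na.omit(dataL)
```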
The problem with the current approach is that these are factor columns, and replacing those levels with NA still leaves the unused levels in place. So we need droplevels() to remove the unused levels:
dataS <- droplevels(na.omit(dataL))
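A small sketch of the unused-level behaviour, with a hypothetical factor column standing in for the imported data:

```r
# Hypothetical factor columns, as read by an older R or with stringsAsFactors = TRUE
dataL <- data.frame(score = factor(c("10", "Unknown*", "7")),
                    group = factor(c("a", "b", "a")))

# Replacing the value with NA does not remove its level
dataL$score[dataL$score == "Unknown*"] <- NA
levels(dataL$score)

# na.omit() drops the row, droplevels() drops the levels that are no longer used
dataS <- droplevels(na.omit(dataL))
levels(dataS$score)

# Convert via character to get the numbers rather than the level codes
as.numeric(as.character(dataS$score))
```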
Remove non numeric values from vector in r
A simple solution is to use Filter over vec <- list(1, 2, T, 'x', 'abc', '6', 7, F, F, 10), i.e.,
> unlist(Filter(is.numeric,vec))
[1] 1 2 7 10
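Note that Filter(is.numeric, ...) drops '6' because it is stored as a character string. If strings that look like numbers should be kept, one possible variant is sketched below. The predicate convertible() is a hypothetical helper, not part of the original answer.

```r
vec <- list(1, 2, TRUE, "x", "abc", "6", 7, FALSE, FALSE, 10)

# Strict version from the answer: only elements that are already numeric
strict <- unlist(Filter(is.numeric, vec))

# Hypothetical helper: numeric, or a string that as.numeric() can parse
convertible <- function(x) {
  is.numeric(x) ||
    (is.character(x) && !is.na(suppressWarnings(as.numeric(x))))
}

# unlist() yields character here because "6" is a string, so convert afterwards
lenient <- as.numeric(unlist(Filter(convertible, vec)))

strict
lenient
```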
Removing rows from dataframe that contains string in a particular column
There are multiple ways you can do this:
Convert to numeric and remove NA values
subset(df, !is.na(as.numeric(Score)))
# ID Score
#1 1001 4
#2 1002 20
#5 1005 30
Or use grepl to find the values that contain any non-digit character and remove them
subset(df, !grepl('\\D', Score))
This can be done with grep as well.
df[grep('\\D', df$Score, invert = TRUE), ]
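The two approaches agree on whole numbers but can differ on decimals and negative values, since '.' and '-' are non-digit characters. A sketch with hypothetical scores (not the data from this answer) that shows the difference:

```r
# Hypothetical scores including a decimal and a negative number
df <- data.frame(ID = 1001:1005,
                 Score = c("4", "20", "h", "4.5", "-3"),
                 stringsAsFactors = FALSE)

# Conversion-based filter: keeps everything as.numeric() can parse
by_conversion <- subset(df, !is.na(suppressWarnings(as.numeric(Score))))

# Pattern-based filter: keeps only strings made up of digits
by_pattern <- subset(df, !grepl("\\D", Score))

by_conversion$ID
by_pattern$ID
```

Which one is appropriate depends on whether values like 4.5 or -3 should count as numeric in the column.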
data
df <- structure(list(ID = 1001:1005, Score = c("4", "20", "h", "v",
"30")), class = "data.frame", row.names = c(NA, -5L))
Convert non-numeric rows and columns to zero
library(ISLR)
data("Hitters")
d = head(Hitters)
library(dplyr)
d %>%
mutate_if(function(x) !is.numeric(x), function(x) 0) %>% # if column is non numeric add zeros
mutate_all(function(x) ifelse(is.na(x), 0, x)) # if there is an NA element replace it with 0
# AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
# 1 293 66 1 30 29 14 1 293 66 1 30 29 14 0 0 446 33 20 0.0 0
# 2 315 81 7 24 38 39 14 3449 835 69 321 414 375 0 0 632 43 10 475.0 0
# 3 479 130 18 66 72 76 3 1624 457 63 224 266 263 0 0 880 82 14 480.0 0
# 4 496 141 20 65 78 37 11 5628 1575 225 828 838 354 0 0 200 11 3 500.0 0
# 5 321 87 10 39 42 30 2 396 101 12 48 46 33 0 0 805 40 4 91.5 0
# 6 594 169 4 74 51 35 11 4408 1133 19 501 336 194 0 0 282 421 25 750.0 0
If you want to avoid function(x) you can use this
d %>%
mutate_if(Negate(is.numeric), ~0) %>%
mutate_all(~ifelse(is.na(.), 0, .))
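In current dplyr the scoped verbs mutate_if() and mutate_all() are superseded by across(). The same result can also be sketched in base R without any packages. The data frame and the helper to_zero() below are hypothetical stand-ins for the Hitters example.

```r
# Hypothetical stand-in for head(Hitters): numeric columns with NA plus a factor
d <- data.frame(hits   = c(66, NA, 130),
                league = factor(c("A", "N", "A")),
                salary = c(NA, 475, 480))

# Illustrative helper: non-numeric columns become all zeros, NA becomes zero
to_zero <- function(x) {
  if (!is.numeric(x)) return(rep(0, length(x)))
  x[is.na(x)] <- 0
  x
}

# d[] keeps the data.frame structure while replacing every column
d[] <- lapply(d, to_zero)
d
```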