Remove Specific Characters from Column Names in R

r Remove parts of column name after certain characters

We can use sub

sub("_3.*", "", df1[,1])
#[1] "col1" "col2" "col3"
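As a self-contained sketch of the same idea: the original df1 is not shown, so the data frame below is hypothetical and its values are made up for illustration.

```r
# Hypothetical data; the real df1 from the question is not shown
df1 <- data.frame(name = c("col1_3_a", "col2_3_b", "col3_3_c"),
                  stringsAsFactors = FALSE)

# "_3.*" matches "_3" and everything after it; sub() replaces that with ""
cleaned <- sub("_3.*", "", df1[, 1])
cleaned
```

Because the pattern is not anchored, it cuts at the first "_3" found in each string, so this only behaves as intended when "_3" does not occur earlier in the name.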

remove character for all column names in a data frame

If we need to remove only 'v', matching one or more digits (\\d+) at the end ($) is not needed, as the expected output also removes 'v' from the first column, 'q_ve5'

library(dplyr)
library(stringr)
df %>%
rename_with(~ str_remove(., "v"), everything())

-output

# A tibble: 2 × 5
   q_e5 q_f_1 q_f_2  q_e6  q_e8
  <int> <int> <int> <int> <int>
1     1     3     3     5     5
2     2     4     4     6     6

Or without any packages

names(df) <- sub("v", "", names(df))
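One thing to keep in mind is that sub replaces only the first match in each name, while gsub replaces every match. A small sketch, assuming column names reconstructed from the output above plus one hypothetical name ("v_ve9") that contains two 'v's:

```r
# Names reconstructed from the output above; "v_ve9" is a made-up extra
nms <- c("q_ve5", "q_f_1", "q_f_2", "q_ve6", "q_ve8", "v_ve9")

# sub() removes only the first "v" in each name
first_only <- sub("v", "", nms)

# gsub() removes every "v" in each name
all_v <- gsub("v", "", nms)

first_only
all_v
```

In stringr, str_remove behaves like sub here, and str_remove_all is the counterpart of gsub.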

removing numbers and characters from column names r

We may change the code to match one or more spaces (\\s+) followed by an opening parenthesis (\\(), one or more digits (\\d+) and any other characters (.*), and replace the match with blank ("")

colnames(data) <- sub("\\s+\\(\\d+.*", "", colnames(data))
colnames(data)
[1] "Subject" "ASE" "ASD" "AFD"

Or another option is trimws from base R

trimws(colnames(data), whitespace = "\\s+\\(.*")
[1] "Subject" "ASE" "ASD" "AFD"

In the OP's code, the pattern matches an upper case letter followed by a space, but the ( is a metacharacter that is not escaped, so in regex mode it opens a capture group around the digits (([0-9]+)) instead of matching a literal parenthesis. This does not match the column names, because the space there is followed by a literal (, which the pattern never matches; hence the same strings are returned

gsub("[A-Z] ([0-9]+)","",colnames(data))
[1] "Subject" "ASE (232)" "ASD (121)" "AFD (313)"
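The difference between the unescaped and escaped parentheses can be checked with grepl. This is a minimal sketch using the column names from this answer stored in a plain character vector (the name cols is just for illustration):

```r
cols <- c("Subject", "ASE (232)", "ASD (121)", "AFD (313)")

# Unescaped: "(" opens a capture group, so the pattern expects
# letter + space + digits, which never occurs in these names
unescaped <- grepl("[A-Z] ([0-9]+)", cols)

# Escaped: "\\(" and "\\)" match literal parentheses
escaped <- grepl("[A-Z] \\([0-9]+\\)", cols)

# Removing the space and the parenthesised number at the end of each name
fixed_names <- sub(" \\([0-9]+\\)$", "", cols)

unescaped
escaped
fixed_names
```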

data

data <- structure(list(Subject = 1L, `ASE (232)` = "1.1.", `ASD (121)` = 1.2, 
`AFD (313)` = 1.3), class = "data.frame", row.names = c(NA,
-1L))

How can I remove certain characters from column headers in R?

We can use sub to match the . (a metacharacter, so it has to be escaped) followed by one or more digits (\\d+) at the end ($) of the string and replace it with blank ("")

names(df) <- sub("\\.\\d+$", "", names(df))

NOTE: If the data is a data.frame, stripping the suffix can leave duplicate column names, which is not recommended (data.frame() itself makes names unique by default via check.names = TRUE)
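To see how duplicates can arise, here is a small sketch on a hypothetical vector of names (not taken from the question). It uses anyDuplicated to detect the problem and make.unique as one possible way to repair it:

```r
# Hypothetical column names with numeric suffixes
nms <- c("a.1", "a.2", "b.1")

# Stripping the ".digits" suffix leaves two columns called "a"
stripped <- sub("\\.\\d+$", "", nms)

# anyDuplicated() returns the index of the first duplicate, or 0 if none
dup_index <- anyDuplicated(stripped)

# make.unique() appends .1, .2, ... to the repeated names
repaired <- make.unique(stripped)

stripped
dup_index
repaired
```

Whether to repair the names or keep the original suffixes depends on what the columns mean, so this is only one option.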

How to remove part of characters in data frame column

There are multiple ways of doing this:

  1. Use as.numeric on the column of your choice. This converts the column to numeric, which drops the leading zeroes.
raw$Zipcode <- as.numeric(raw$Zipcode)

  2. If you want it to stay a character, you can use the stringr package.
library(stringr)
raw$Zipcode <- str_replace(raw$Zipcode, "^0+", "")

  3. There is another function called str_remove in the stringr package.
raw$Zipcode <- str_remove(raw$Zipcode, "^0+")

  4. You can also use sub from base R.
raw$Zipcode <- sub("^0+", "", raw$Zipcode)

If you want to remove exactly n leading zeroes, replace + with {n}.

For instance, to remove two 0's use sub("^0{2}", "", raw$Zipcode).
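The base R options above can be compared side by side. This is a sketch on a hypothetical character vector (the real raw$Zipcode column is not shown):

```r
# Hypothetical zip codes stored as character
zip <- c("00123", "01234", "12345", "00000")

# Remove all leading zeroes; a string of only zeroes becomes ""
all_zeroes <- sub("^0+", "", zip)

# Remove exactly two leading zeroes, only where at least two are present
two_zeroes <- sub("^0{2}", "", zip)

# Converting to numeric also drops leading zeroes, but changes the type
as_num <- as.numeric(zip)

all_zeroes
two_zeroes
as_num
```

Note the edge case: a value made up entirely of zeroes turns into an empty string with "^0+", while as.numeric turns it into 0.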

How to remove '.' from column names in a dataframe?

1) sqldf can deal with names having dots in them if you quote the names:

library(sqldf)
d0 <- read.csv(text = "A.B,C.D\n1,2")
sqldf('select "A.B", "C.D" from d0')

giving:

  A.B C.D
1   1   2

2) When reading the data using read.table or read.csv use the check.names=FALSE argument.

Compare:

Lines <- "A B,C D
1,2
3,4"
read.csv(text = Lines)
##   A.B C.D
## 1   1   2
## 2   3   4
read.csv(text = Lines, check.names = FALSE)
##   A B C D
## 1   1   2
## 2   3   4

However, in this example it still leaves names that would have to be quoted in sqldf, since the names have embedded spaces.

3) To simply remove the periods, if DF is a data frame:

names(DF) <- gsub(".", "", names(DF), fixed = TRUE)

or it might be nicer to convert the periods to underscores so that it is reversible:

names(DF) <- gsub(".", "_", names(DF), fixed = TRUE)

This last line could be alternatively done like this:

names(DF) <- chartr(".", "_", names(DF))
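The fixed = TRUE argument matters here because an unescaped . is a regex metacharacter that matches any character. A short sketch on a hypothetical vector of names shows the difference:

```r
# Hypothetical column names containing periods
nms <- c("A.B", "C.D")

# Without fixed = TRUE, "." matches every character, so everything is removed
regex_dot <- gsub(".", "", nms)

# With fixed = TRUE, "." is treated as a literal period
literal_dot <- gsub(".", "", nms, fixed = TRUE)

# Escaping the period in regex mode is equivalent to the literal match
underscored <- gsub("\\.", "_", nms)

# chartr() translates characters one for one and never uses regex
translated <- chartr(".", "_", nms)

regex_dot
literal_dot
underscored
translated
```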

Remove specific characters from column names in r

Another option is to use strsplit:

sapply(strsplit(strings, "\\."), function(x)
paste0(x[c(2, 4)], collapse = "."))
[1] "loc1.tret1" "loc2.tret2" "loc100.tret100"

Sample data

(From ManuelBickel's answer)

strings = c("drop.loc1.genom1.tret1.gwas2.a",
"drop.loc2.genom1.tret2.gwas2.a",
"drop.loc100.genom1.tret100.gwas2.a")
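For reference, here is a self-contained variant of the strsplit approach with the sample data defined first. It swaps in vapply for sapply (so the result type is checked) and splits on a literal period with fixed = TRUE; both are stylistic choices for this sketch, not part of the original answer.

```r
strings <- c("drop.loc1.genom1.tret1.gwas2.a",
             "drop.loc2.genom1.tret2.gwas2.a",
             "drop.loc100.genom1.tret100.gwas2.a")

# Split each string on literal periods
parts <- strsplit(strings, ".", fixed = TRUE)

# Keep the 2nd and 4th pieces and join them back with a period
kept <- vapply(parts,
               function(x) paste(x[c(2, 4)], collapse = "."),
               character(1))
kept
```

This relies on every string having the wanted pieces in positions 2 and 4; names with a different number of fields would need a different index.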

Removing characters in column titles after .

You can escape the period like this: \\.

x <- "ENSG00000124564.16"
sub("\\..*", "", x)
#[1] "ENSG00000124564"

update:

## it also works on a vector of strings
x <- c("ENSG00000124564.16", "ENSG00000257509.1")
sub("\\..*", "", x)
# [1] "ENSG00000124564" "ENSG00000257509"

## it also works for changing the column names
df <- data.frame(ENSG00000124564.16 = c(1, 2, 3), ENSG00000257509.1 = c(1, 1, 1))
names(df) <- sub("\\..*", "", names(df))
df
#   ENSG00000124564 ENSG00000257509
# 1               1               1
# 2               2               1
# 3               3               1

