How to Convert Ensembl ID to Gene Symbol in R

How can I convert Ensembl ID to gene symbol in R?

This happens because the values in your gene column are not gene IDs; they are peptide IDs (they start with ENSP). To get the information you need, use ensembl_peptide_id instead of ensembl_gene_id:

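# Note: recent Ensembl releases name the Entrez attribute "entrezgene_id";
# older releases used "entrezgene". Check listAttributes(mart) if unsure.
G_list <- getBM(filters = "ensembl_peptide_id",
                attributes = c("ensembl_peptide_id", "entrezgene_id", "description"),
                values = genes, mart = mart)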

Also, the attribute you are really looking for is hgnc_symbol.

Here is the complete code to get your output:

library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- df$gene   # the column holding the ENSP IDs (same column used in merge below)
df <- df[, -4]     # drop the fourth column
G_list <- getBM(filters = "ensembl_peptide_id",
                attributes = c("ensembl_peptide_id", "hgnc_symbol"),
                values = genes, mart = mart)
merge(df, G_list, by.x = "gene", by.y = "ensembl_peptide_id")
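
Since the fix above depends on recognising the ID type from its prefix, a small helper can pick the matching biomaRt filter automatically. This is only a sketch: the function name ensembl_filter_for is made up for illustration, and it assumes the usual stable-ID layout ("ENS", optional species letters, one feature letter, 11 digits, optional version suffix). Other Ensembl ID families, such as gene-tree IDs starting with ENSGT, are not handled and could be misclassified.

```r
# Map Ensembl stable IDs to the biomaRt filter that matches their type.
# Assumed layout: "ENS" + optional species letters + feature letter
# (G = gene, T = transcript, P = protein) + 11 digits + optional ".version".
ensembl_filter_for <- function(ids) {
  pattern <- "^ENS[A-Z]*([GTP])[0-9]{11}([.][0-9]+)?$"
  # Extract the feature letter; anything that does not fit the layout becomes NA
  type <- ifelse(grepl(pattern, ids), sub(pattern, "\\1", ids), NA_character_)
  filters <- c(G = "ensembl_gene_id",
               T = "ensembl_transcript_id",
               P = "ensembl_peptide_id")
  unname(filters[type])
}

# "ENSP00000000001" is a placeholder in peptide-ID format, not a real lookup target
ensembl_filter_for(c("ENSP00000000001", "ENSG00000000003", "ENSCAFT00000001452"))
```

The result can be passed as the filters argument of getBM(), provided all IDs in one query are of the same type.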

convert Ensembl ID to gene name using biomaRt

The biomaRt part worked. It is your left join that fails, because the two data frames have no column name in common: gene_IDs stores the Ensembl ID under "ensembl_gene_id", while your kidney data frame stores it under "gene_id".

You also need to check whether the IDs are GENCODE or Ensembl style. GENCODE IDs normally carry a version suffix of the form .[number], for example ENSG00000000003.10, whereas the same gene in the Ensembl database is ENSG00000000003.

library("biomaRt")
library("dplyr")

kidney <- data.frame(
  gene_id = c("ENSG00000000003.10", "ENSG00000000005.5",
              "ENSG00000000419.8", "ENSG00000000457.9", "ENSG00000000460.12"),
  vals = runif(5)
)
# make this a character, otherwise it will throw errors with left_join
kidney$gene_id <- as.character(kidney$gene_id)
# in case it's GENCODE, this mostly works;
# if it's Ensembl, the IDs are left alone
kidney$gene_id <- sub("[.][0-9]*", "", kidney$gene_id)

mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- kidney$gene_id
gene_IDs <- getBM(filters = "ensembl_gene_id",
                  attributes = c("ensembl_gene_id", "hgnc_symbol"),
                  values = genes, mart = mart)

# name both key columns explicitly, since they differ between the tables
left_join(kidney, gene_IDs, by = c("gene_id" = "ensembl_gene_id"))

          gene_id      vals hgnc_symbol
1 ENSG00000000003 0.2298255      TSPAN6
2 ENSG00000000005 0.4662570        TNMD
3 ENSG00000000419 0.7279107        DPM1
4 ENSG00000000457 0.3240166       SCYL3
5 ENSG00000000460 0.3038986    C1orf112
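
The join step can also be tried offline with base R. The sketch below uses two small made-up data frames (counts and lookup are hypothetical stand-ins for the expression table and a getBM() result; no query is run) to show that differently named key columns have to be named explicitly, and what happens to IDs without a match.

```r
# Toy stand-ins: 'counts' plays the role of the expression table and
# 'lookup' the role of a getBM() result (hypothetical data, no query is run).
# "ENSG99999999999" is a placeholder ID that has no entry in 'lookup'.
counts <- data.frame(
  gene_id = c("ENSG00000000003", "ENSG00000000005", "ENSG99999999999"),
  vals = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  ensembl_gene_id = c("ENSG00000000005", "ENSG00000000003"),
  hgnc_symbol = c("TNMD", "TSPAN6"),
  stringsAsFactors = FALSE
)

# The key columns have different names, so give both via by.x / by.y;
# all.x = TRUE keeps rows of 'counts' without a match (their symbol is NA).
annotated <- merge(counts, lookup,
                   by.x = "gene_id", by.y = "ensembl_gene_id",
                   all.x = TRUE)
annotated
```

With all.x = TRUE the behaviour mirrors left_join(): every row of the left table survives, and unmatched IDs are easy to spot as NA in hgnc_symbol.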

Trying to convert Ensembl ID to gene name in R (biomaRt)

If you just want to overwrite the Ensembl IDs with HGNC symbols, you can do it in one step:

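library(biomaRt)
names(resdata)[1] <- "genes"
head(resdata)

## Drop rows with missing values
resdata <- resdata[complete.cases(resdata), ]

dim(resdata)

charg <- as.character(resdata$genes)
head(charg)

# Strip the version suffix (e.g. ".10") from each ID
charg2 <- sapply(strsplit(charg, ".", fixed = TRUE), function(x) x[1])

ensembl <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# getBM() does not guarantee the input order and may return fewer or more rows
# than IDs submitted, so fetch the ID with the symbol and align with match()
map <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
             filters = "ensembl_gene_id",
             values = charg2,
             mart = ensembl)
resdata[[1]] <- map$hgnc_symbol[match(charg2, map$ensembl_gene_id)]
resdata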

(This keeps Log2FC as column 3, which looks right based on the next steps in your pipeline, but if you want something different let me know and I'll update my answer to suit)
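
One caveat when overwriting IDs in place: the rows returned by getBM() are not necessarily in the order of the submitted values, some IDs may be missing from the result, and a gene without an HGNC symbol can come back with an empty string. A minimal offline sketch of aligning by ID and falling back to the original ID is shown here; ids and map are made-up stand-ins (the two ENSG99999999... IDs are placeholders), not real query output.

```r
# Hypothetical stand-ins for the query input and a getBM() result.
# The result is deliberately in a different order, lacks one ID,
# and has an empty symbol for another.
ids <- c("ENSG00000000419", "ENSG00000000003", "ENSG99999999999", "ENSG99999999998")
map <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG99999999998", "ENSG00000000419"),
  hgnc_symbol = c("TSPAN6", "", "DPM1"),
  stringsAsFactors = FALSE
)

# Align symbols to the input order; IDs absent from 'map' become NA
symbols <- map$hgnc_symbol[match(ids, map$ensembl_gene_id)]

# Fall back to the original ID where no usable symbol is available
missing <- is.na(symbols) | symbols == ""
symbols[missing] <- ids[missing]
symbols
```

Keeping the original ID as a fallback avoids blank or NA row labels further down the pipeline; whether that is preferable to dropping such rows depends on the analysis.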

Can't convert dog ensembl IDs into gene names

These IDs are from the Boxer dog genome assembly: https://www.ensembl.org/Canis_lupus_familiaris/Info/Strains?db=core

However, BioMart is not available for dog breeds (as well as other species and strains): https://www.ensembl.info/2021/01/20/important-changes-of-data-availability-in-ensembl-gene-trees-and-biomart/

Instead, you can use the POST lookup/id REST API endpoint to retrieve the gene symbol for a list of gene IDs from any species: http://rest.ensembl.org/documentation/info/lookup_post
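
A rough sketch of calling that endpoint from R is given below. It assumes the httr and jsonlite packages are installed, that the server is reachable at https://rest.ensembl.org, and that each returned record carries the gene symbol in a display_name field; the helper names lookup_body and lookup_symbols are invented for this example. The endpoint also limits how many IDs a single request may carry, so long lists may need to be sent in chunks.

```r
library(httr)
library(jsonlite)

# Build the JSON body for POST lookup/id, e.g. {"ids":["ID1","ID2"]}
lookup_body <- function(ids) {
  as.character(toJSON(list(ids = ids)))
}

# Hypothetical helper: returns a character vector named by ID, holding the
# display name (NA where the server returns nothing). Needs network access.
lookup_symbols <- function(ids, server = "https://rest.ensembl.org") {
  r <- POST(paste0(server, "/lookup/id"),
            content_type("application/json"),
            accept("application/json"),
            body = lookup_body(ids))
  stop_for_status(r)
  res <- content(r)  # parsed JSON: a list keyed by the submitted IDs
  vapply(ids, function(id) {
    nm <- res[[id]]$display_name
    if (is.null(nm)) NA_character_ else nm
  }, character(1))
}

# Example call (not run here):
# lookup_symbols(your_gene_ids)
```

The returned vector can be matched back to a data frame by ID, in the same way as a getBM() result.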

converting from Ensembl gene ID's to different identifier

Here is a step-by-step example:

  1. Load the biomaRt library.

    library(biomaRt)
  2. As query input we have Canis lupus familiaris Ensembl transcript IDs (note that they are not Ensembl gene IDs). We also need to strip the dot+digit(s) from the end, which is used to indicate annotation updates.

    tx <- c("ENSCAFT00000001452.3", "ENSCAFT00000001656.3")
    tx <- gsub("\\.\\d+$", "", tx)
  3. We now query the database for the Ensembl transcript IDs in tx:

    ensembl <- useEnsembl(biomart = "ensembl", dataset = "cfamiliaris_gene_ensembl")
    res <- getBM(
    attributes = c("ensembl_gene_id", "ensembl_transcript_id", "external_gene_name", "description"),
    filters = "ensembl_transcript_id",
    values = tx,
    mart = ensembl)
    res
    #ensembl_gene_id ensembl_transcript_id external_gene_name
    #1 ENSCAFG00000000934 ENSCAFT00000001452 COL14A1
    #2 ENSCAFG00000001086 ENSCAFT00000001656 MYC
    # description
    #1 collagen type XIV alpha 1 chain [Source:VGNC Symbol;Acc:VGNC:51768]
    #2 MYC proto-oncogene, bHLH transcription factor [Source:VGNC Symbol;Acc:VGNC:43527]

Note that you can get a data.frame of all attributes for a particular mart with listAttributes(ensembl).

In addition to the link @GordonShumway gives in the comment above, another good (and succinct) introduction to biomaRt can be found on the Ensembl website.


