How can I convert Ensembl ID to gene symbol in R?
This is because the values you have in your gene column are not gene IDs, they are peptide IDs (they start with ENSP). To get the info you need, try replacing ensembl_gene_id by ensembl_peptide_id:
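# on current Ensembl releases the attribute is named entrezgene_id (older releases used entrezgene)
G_list <- getBM(filters = "ensembl_peptide_id",
                attributes = c("ensembl_peptide_id", "entrezgene_id", "description"),
                values = genes, mart = mart)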
Also, what you are really looking for is the hgnc_symbol attribute.
Here is the complete code to get your output:
library('biomaRt')
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
# the IDs are taken from the "gene" column, the same column used in merge() below
genes <- df$gene
df <- df[, -4]
G_list <- getBM(filters = "ensembl_peptide_id",
                attributes = c("ensembl_peptide_id", "hgnc_symbol"),
                values = genes, mart = mart)
merge(df, G_list, by.x = "gene", by.y = "ensembl_peptide_id")
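A quick way to avoid this kind of mix-up is to look at the ID prefix before choosing the filter. The helper below is only a sketch: ensembl_filter_for is a made-up name (it is not part of biomaRt), and it assumes the usual stable-ID layout in which the letter right before the digits is G (gene), T (transcript) or P (peptide). Anything else comes back as NA.

```r
# Hypothetical helper (not a biomaRt function): guess the getBM() filter from the ID prefix.
# Assumes stable IDs of the form ENS[species code]<G|T|P><digits>[.version].
ensembl_filter_for <- function(ids) {
  pattern <- "^ENS[A-Z]*([GTP])[0-9]+(\\.[0-9]+)?$"
  # Extract the feature-type letter, or NA when the ID does not fit the pattern
  type <- ifelse(grepl(pattern, ids), sub(pattern, "\\1", ids), NA_character_)
  lookup <- c(G = "ensembl_gene_id",
              T = "ensembl_transcript_id",
              P = "ensembl_peptide_id")
  unname(lookup[type])
}

# "ENSP00000000001" is a made-up ID, used only to show the prefix
ensembl_filter_for(c("ENSP00000000001", "ENSG00000000003.10", "ENSCAFT00000001452.3"))
```

If the result is not the same filter for every ID, or contains NA, the column is probably not what you think it is.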
convert Ensembl ID to gene name using biomaRt
The biomaRt part worked; it's your left join that fails because there are no common columns: gene_IDs has the Ensembl ID under "ensembl_gene_id", while your kidney data frame has it under "gene_id".
You also need to check whether the IDs are GENCODE or Ensembl IDs. GENCODE IDs normally carry a version suffix, .[number], for example ENSG00000000003.10, whereas in the Ensembl database the same gene is ENSG00000000003.
library("biomaRt")
library("dplyr")
kidney <- data.frame(gene_id =
c("ENSG00000000003.10","ENSG00000000005.5",
"ENSG00000000419.8","ENSG00000000457.9","ENSG00000000460.12"),
vals=runif(5)
)
#make this a character, otherwise it will throw errors with left_join
kidney$gene_id <- as.character(kidney$gene_id)
# in case it's gencode, this mostly works
#if ensembl, will leave it alone
kidney$gene_id <- sub("[.][0-9]*","",kidney$gene_id)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- kidney$gene_id
gene_IDs <- getBM(filters= "ensembl_gene_id", attributes= c("ensembl_gene_id","hgnc_symbol"),
values = genes, mart= mart)
left_join(kidney, gene_IDs, by = c("gene_id"="ensembl_gene_id"))
gene_id vals hgnc_symbol
1 ENSG00000000003 0.2298255 TSPAN6
2 ENSG00000000005 0.4662570 TNMD
3 ENSG00000000419 0.7279107 DPM1
4 ENSG00000000457 0.3240166 SCYL3
5 ENSG00000000460 0.3038986 C1orf112
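To see both effects without querying Ensembl, here is a small offline sketch. It uses base merge() instead of left_join(), and gene_IDs is a hand-written stand-in for the getBM() result (the two symbols are copied from the output above); the vals numbers are arbitrary placeholders.

```r
kidney <- data.frame(gene_id = c("ENSG00000000003.10", "ENSG00000000005.5"),
                     vals = c(0.1, 0.2),  # arbitrary placeholder values
                     stringsAsFactors = FALSE)

# Hand-written stand-in for the getBM() result
gene_IDs <- data.frame(ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005"),
                       hgnc_symbol = c("TSPAN6", "TNMD"),
                       stringsAsFactors = FALSE)

# Joining on the versioned IDs: the keys never match, so every symbol is NA
before <- merge(kidney, gene_IDs,
                by.x = "gene_id", by.y = "ensembl_gene_id", all.x = TRUE)

# Strip a trailing ".<digits>" version suffix, then join again
kidney$gene_id <- sub("\\.[0-9]+$", "", kidney$gene_id)
after <- merge(kidney, gene_IDs,
               by.x = "gene_id", by.y = "ensembl_gene_id", all.x = TRUE)

sum(is.na(before$hgnc_symbol))  # unmatched rows before stripping
sum(is.na(after$hgnc_symbol))   # unmatched rows after stripping
```

Counting the NA symbols after the join is a cheap sanity check that the keys really line up.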
Trying to convert Ensembl ID to gene name in R (biomaRt)
If you just want to overwrite the Ensembl IDs with the HGNC symbols, you can do it in one step:
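library(biomaRt)
names(resdata)[1] <- "genes"
head(resdata)
## Keep only complete rows
resdata <- resdata[complete.cases(resdata), ]
dim(resdata)
charg <- as.character(resdata$genes)
head(charg)
# strip the version suffix (e.g. ".10") from each ID
charg2 <- sapply(strsplit(charg, ".", fixed = TRUE), function(x) x[1])
ensembl <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# getBM() does not keep the input order and drops IDs it cannot find,
# so request the ID as well and align the symbols with match()
map <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
             filters = "ensembl_gene_id",
             values = charg2,
             mart = ensembl)
resdata$genes <- map$hgnc_symbol[match(charg2, map$ensembl_gene_id)]
resdata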
(This keeps Log2FC as column 3, which looks right based on the next steps in your pipeline, but if you want something different let me know and I'll update my answer to suit)
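One caveat worth checking: getBM() does not guarantee that rows come back in the input order, or that every ID gets a row, so symbols should be aligned to the IDs by key rather than by position. A minimal offline sketch of that alignment follows, with a hand-written map standing in for real getBM() output (the two symbols are the ones shown in the earlier answer; the third ID is made up to show the fallback).

```r
# IDs in the order they appear in the results table; the last one is made up
ids <- c("ENSG00000000460", "ENSG00000000003", "ENSG00009999999")

# Stand-in for getBM() output: different order, and one ID missing
map <- data.frame(ensembl_gene_id = c("ENSG00000000003", "ENSG00000000460"),
                  hgnc_symbol = c("TSPAN6", "C1orf112"),
                  stringsAsFactors = FALSE)

# Align symbols to the input order by ID
symbols <- map$hgnc_symbol[match(ids, map$ensembl_gene_id)]

# Fall back to the Ensembl ID where no symbol was returned (NA or empty string)
missing <- is.na(symbols) | symbols == ""
symbols[missing] <- ids[missing]
symbols
```

Keeping the original ID as a fallback is a design choice; leaving NA in place works just as well if you would rather filter those rows out later.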
Can't convert dog ensembl IDs into gene names
These IDs are from the Boxer dog genome assembly: https://www.ensembl.org/Canis_lupus_familiaris/Info/Strains?db=core
BioMart is not available for dog breeds (nor for certain other species and strains): https://www.ensembl.info/2021/01/20/important-changes-of-data-availability-in-ensembl-gene-trees-and-biomart/
However, you can use the POST lookup/id REST API endpoint to retrieve the gene symbol for a list of gene IDs from any species: http://rest.ensembl.org/documentation/info/lookup_post
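A rough sketch of what that call could look like from R is below. The helper names lookup_body and lookup_symbols are made up for illustration, the httr package is an assumption (any HTTP client would do), and I am using the https form of the server URL. The symbol is read from the display_name field of each returned entry. The second function needs network access, so treat it as untested against your IDs.

```r
# Build the JSON request body {"ids": ["...", "..."]} with base R only.
# Assumes the IDs contain no characters that need JSON escaping.
lookup_body <- function(ids) {
  sprintf('{"ids": [%s]}', paste0('"', ids, '"', collapse = ", "))
}

# Hypothetical helper: POST the IDs to lookup/id and return ID -> symbol.
# Requires the httr package and network access.
lookup_symbols <- function(ids, server = "https://rest.ensembl.org") {
  resp <- httr::POST(paste0(server, "/lookup/id"),
                     httr::content_type("application/json"),
                     httr::accept("application/json"),
                     body = lookup_body(ids))
  httr::stop_for_status(resp)
  res <- httr::content(resp)  # parsed JSON: a named list, one entry per ID
  data.frame(
    id = names(res),
    symbol = vapply(res, function(x) {
      # IDs that are unknown, or genes without a name, get NA
      if (is.null(x) || is.null(x$display_name)) NA_character_ else x$display_name
    }, character(1)),
    stringsAsFactors = FALSE,
    row.names = NULL
  )
}

# Usage, with your own character vector of gene IDs:
# lookup_symbols(my_ids)
```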
converting from Ensembl gene ID's to different identifier
Here is a step-by-step example:
Load the biomaRt library.
library(biomaRt)
As query input we have Canis lupus familiaris Ensembl transcript IDs (note that they are not Ensembl gene IDs). We also need to strip the dot+digit(s) from the end, which is used to indicate annotation updates.
tx <- c("ENSCAFT00000001452.3", "ENSCAFT00000001656.3")
tx <- gsub("\\.\\d+$", "", tx)
We now query the database for the Ensembl transcript IDs in tx.
ensembl <- useEnsembl(biomart = "ensembl", dataset = "cfamiliaris_gene_ensembl")
res <- getBM(
attributes = c("ensembl_gene_id", "ensembl_transcript_id", "external_gene_name", "description"),
filters = "ensembl_transcript_id",
values = tx,
mart = ensembl)
res
#ensembl_gene_id ensembl_transcript_id external_gene_name
#1 ENSCAFG00000000934 ENSCAFT00000001452 COL14A1
#2 ENSCAFG00000001086 ENSCAFT00000001656 MYC
# description
#1 collagen type XIV alpha 1 chain [Source:VGNC Symbol;Acc:VGNC:51768]
#2 MYC proto-oncogene, bHLH transcription factor [Source:VGNC Symbol;Acc:VGNC:43527]
Note that you can get a data.frame of all attributes for a particular mart with listAttributes(ensembl).
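That table is long, so it usually helps to filter it by keyword. The sketch below uses a tiny hand-written data frame in place of the real listAttributes(ensembl) result (same name and description columns, only four rows typed in by hand), and find_attributes is a made-up helper name, not a biomaRt function.

```r
# Hand-written stand-in for listAttributes(ensembl); the real table has many more rows
attrs <- data.frame(
  name = c("ensembl_gene_id", "ensembl_transcript_id",
           "external_gene_name", "description"),
  description = c("Gene stable ID", "Transcript stable ID",
                  "Gene name", "Gene description"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose name or description matches a pattern
find_attributes <- function(attrs, pattern) {
  hit <- grepl(pattern, attrs$name, ignore.case = TRUE) |
    grepl(pattern, attrs$description, ignore.case = TRUE)
  attrs[hit, , drop = FALSE]
}

find_attributes(attrs, "name")
```

With a live mart you would pass the real table instead, for example find_attributes(listAttributes(ensembl), "name").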
In addition to the link @GordonShumway gives in the comment above, another good (and succinct) introduction to biomaRt can be found on the Ensembl website.