Read FASTA into a dataframe and extract subsequences of FASTA file
You should have a look at the Biostrings package.
library("Biostrings")
s = readDNAStringSet("nm.fasta")
# one (start, end) pair per sequence; positions are 1-based and inclusive
subseq(s, start=c(1, 2, 3), end=c(3, 6, 5))
How to read FASTA into a dataframe and extract subsequences of a FASTA file in d3.js
1. How to parse it in d3.js?
D3.js is a JavaScript library (note the "js") for manipulating documents based on data. At the end of the day, D3 is JavaScript, and it has no built-in "parsing" function for nucleic acid sequences.
Regarding D3 (actually regarding JavaScript), you can deal with the DNA sequence as a string:
"ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC..."
or as an array:
["A", "C", "A", "T", "A"...]
Or, in a cumbersome way, as an array of objects:
[{position:1, base:"A"}, {position:2, base:"C"}...]
It depends on you. FASTA is text-based, which means we will treat the data as a string (first option).
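Since FASTA is plain text, splitting it into records takes only a few lines of ordinary JavaScript. The following is a minimal sketch, not a D3 feature: `parseFasta` and the `{id, sequence}` record shape are hypothetical names chosen for illustration, and the input is assumed to be the file's full text already loaded as a string (for example via a text request).

```javascript
// Hypothetical helper (not part of D3): parse FASTA text into
// an array of {id, sequence} records.
function parseFasta(text) {
  const records = [];
  let current = null;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue; // skip blank lines
    if (line.startsWith(">")) {
      // Header line: everything after ">" is kept as the record id.
      current = { id: line.slice(1).trim(), sequence: "" };
      records.push(current);
    } else if (current !== null) {
      // Sequence line: append, so wrapped sequences are joined.
      current.sequence += line;
    }
  }
  return records;
}

// Example input, assumed for illustration only.
const fastaText = ">ID1\nGAGA\n>ID2\nTATA\n";
const records = parseFasta(fastaText);
console.log(records);
```

An array of records like this plays the role of a dataframe: one row per sequence, with the sequence itself kept as a string.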
2. How to extract a subsequence at a (start, end) location?
As D3 is a JavaScript library, you'll have to deal with your string using JavaScript methods.
For instance, to find the position of the start triplet in your sequence (TAC on the template strand, corresponding to the AUG start codon), you can use indexOf:
var sequence = "ACATACTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC";
var start = "TAC";
console.log(sequence.indexOf(start)); // prints 3 (indexOf is 0-based; -1 means not found)
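To pull out the bases between a start and an end position, the built-in `String.prototype.slice` is enough. The sketch below assumes 1-based, inclusive coordinates (the convention used by `subseq` in Biostrings); `subsequence` is a hypothetical helper name, not a D3 or JavaScript built-in.

```javascript
// Hypothetical helper: extract bases from `start` to `end`,
// using 1-based, inclusive coordinates (an assumption of this sketch).
function subsequence(sequence, start, end) {
  const valid =
    Number.isInteger(start) && Number.isInteger(end) &&
    start >= 1 && end <= sequence.length && start <= end;
  if (!valid) {
    throw new RangeError("invalid range " + start + ".." + end);
  }
  // slice() is 0-based with an exclusive end, hence start - 1.
  return sequence.slice(start - 1, end);
}

const dna = "ACATACTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC";

// Combine with indexOf: take the triplet at the first "TAC" match.
const hit = dna.indexOf("TAC"); // 0-based index, -1 if absent
const triplet = hit === -1 ? null : subsequence(dna, hit + 1, hit + 3);
console.log(triplet);
```

If 0-based coordinates are preferred, calling `sequence.slice(start, end)` directly does the same job without the helper.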
How to split a FASTA file imported as data.frame through
Let's say your file is like this:
writeLines(">ID1\nGAGA\n>ID2\nTATA","test.fa")
dna.sequences = read.table("test.fa")
dna.sequences
    V1
1 >ID1
2 GAGA
3 >ID2
4 TATA
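Assuming it's read correctly, with headers on the odd rows and sequences on the even rows:
rows = 1:nrow(dna.sequences)
# odd rows hold the ">ID" headers, even rows hold the sequences
data.frame(ID = gsub(">","",as.character(dna.sequences[rows %% 2==1,1])),
           sequences = dna.sequences[rows %% 2==0,1])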
Or much better, read it in directly using a package meant for this:
library(Biostrings)
data = readDNAStringSet("test.fa")
data
  A DNAStringSet instance of length 2
    width seq   names
[1]     4 GAGA  ID1
[2]     4 TATA  ID2
dna.sequences = data.frame(ID=names(data),sequences=as.character(data))
dna.sequences
     ID sequences
ID1 ID1      GAGA
ID2 ID2      TATA
Extracting lines based on comma-separated string in another file and write extracted lines to file
Assuming each sequence occupies a single line (rather than being wrapped over multiple lines), here is an awk solution:
awk -F'\t' '
    NR==FNR {                               # process SAMPLE.fasta file
        if (FNR % 2) {                      # odd line with contigID
            len = split($0, a, "|")         # extract the contigID
            id = a[len]
            seq[id] = $0                    # assign seq[id] to the line
        } else {                            # even line with sequence
            seq[id] = seq[id] RS $0         # append sequence to seq[id]
        }
        next
    }
    {                                       # process contigIDs file
        fname = $1 ".fasta"                 # filename to write
        len = split($2, a, ",")             # split the contigIDs
        for (i = 1; i <= len; i++) {
            split(a[i], b, "|")             # extract the contigID
            if (b[3] in seq) {              # if the sequence is found
                print seq[b[3]] > fname     # then print it to the file
            }
        }
        close(fname)
    }
' SAMPLE.fasta contigIDs
Output:
424182.1.fasta file:
>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT
1217675.1.fasta file:
>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG