Read FASTA into a Dataframe and Extract Subsequences of a FASTA File

You should have a look at the Biostrings package.

library("Biostrings")

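# read every record of the FASTA file into a DNAStringSet
s = readDNAStringSet("nm.fasta")
# start and end are vectorized: one (start, end) pair per sequence
subseq(s, start=c(1, 2, 3), end=c(3, 6, 5))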

How to read FASTA into a dataframe and extract subsequences of a FASTA file in d3.js

1. How to parse it in d3.js?

D3.js is a JavaScript library (note the "js") for manipulating documents based on data. So, at the end of the day, D3 is JavaScript, and it has no built-in "parsing" function for nucleic acid sequences.

In D3 (or rather, in JavaScript), you can treat the DNA sequence as a string:

"ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC..."

or as an array:

["A", "C", "A", "T", "A"...]

Or, in a cumbersome way, as an array of objects:

[{position:1, base:"A"}, {position:2, base:"C"}...]

The choice is up to you. FASTA is a text-based format, so here we will treat the data as a string (the first option).
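Since D3 has no FASTA parser of its own, the splitting into records has to be written by hand. Below is a minimal sketch of one way to do it, not a definitive implementation: parseFasta is a hypothetical helper name, the sample text is made up for illustration, and the FASTA text is assumed to have been loaded into a string already (for example with d3.text or fetch, depending on your setup). It returns an array of objects, which is the closest JavaScript analogue of a dataframe with ID and sequence columns.

```javascript
// Hypothetical helper: split FASTA text into { id, sequence } records.
function parseFasta(text) {
  const records = [];
  let current = null;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue; // skip blank lines
    if (line.startsWith(">")) {
      // Header line: start a new record; the id is the text up to the first whitespace.
      current = { id: line.slice(1).split(/\s+/)[0], sequence: "" };
      records.push(current);
    } else if (current !== null) {
      // Sequence line: append to the current record, so wrapped sequences are joined.
      current.sequence += line;
    }
  }
  return records;
}

// Made-up sample input; the second sequence is wrapped over two lines.
const fastaText = ">ID1\nGAGA\n>ID2\nTA\nTA\n";
const records = parseFasta(fastaText);
console.log(records);
```

An array of objects like this can be passed straight to D3's data joins, which is why it is used here instead of two parallel arrays.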

2. How to extract a subsequence at a (start, end) location?

As D3 is a JavaScript library, you'll have to handle your string with plain JavaScript methods.

For instance, to find the position of the start triplet in your sequence (TAC on the template strand, corresponding to the AUG start codon in mRNA), you can use indexOf:

var sequence = "ACATACTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC";
var start = "TAC";
// indexOf returns the 0-based position of the first match, or -1 if there is none
console.log(sequence.indexOf(start)); // 3
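To actually cut out the piece between a start and an end position, the string method slice is enough. The following is a small sketch under stated assumptions: subsequence is a hypothetical helper name, and it takes 1-based, inclusive coordinates so that the numbers match those passed to subseq() in the R example. Rejecting out-of-range positions with an error is this sketch's own choice.

```javascript
// Hypothetical helper: extract bases start..end (1-based, inclusive),
// mirroring the coordinates used by subseq() in the R example.
function subsequence(sequence, start, end) {
  if (start < 1 || end > sequence.length || start > end) {
    throw new RangeError("invalid range " + start + ".." + end);
  }
  // slice() is 0-based with an exclusive end, so only the start shifts by one.
  return sequence.slice(start - 1, end);
}

const dna = "ACATACTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC";
console.log(subsequence(dna, 1, 3)); // "ACA"
console.log(subsequence(dna, 4, 6)); // "TAC"
```

If you prefer JavaScript's native 0-based, end-exclusive convention, you can call slice directly and skip the helper.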

How to split a FASTA file imported as data.frame through

Let's say your file is like this:

writeLines(">ID1\nGAGA\n>ID2\nTATA","test.fa")
dna.sequences = read.table("test.fa")

dna.sequences
    V1
1 >ID1
2 GAGA
3 >ID2
4 TATA

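Assuming it's read correctly, and that every record takes exactly two lines (a header followed by a single-line sequence):

rows = 1:nrow(dna.sequences)
# odd rows hold the ">ID" headers, even rows hold the sequences
data.frame(ID = gsub(">","",as.character(dna.sequences[rows %% 2==1,1])),
           sequences = dna.sequences[rows %% 2==0,1])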

Or much better, read it in directly using a package meant for this:

library(Biostrings)
data = readDNAStringSet("test.fa")

data
  A DNAStringSet instance of length 2
    width seq                names
[1]     4 GAGA               ID1
[2]     4 TATA               ID2

dna.sequences = data.frame(ID=names(data),sequences=as.character(data))

dna.sequences
     ID sequences
ID1 ID1      GAGA
ID2 ID2      TATA

Extracting lines based on comma-separated string in another file and write extracted lines to file

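Assuming each sequence occupies a single line (rather than wrapping over several lines), and that the contigIDs file is tab-separated with the output file name stem in the first column and the comma-separated contig IDs in the second, how about an awk solution: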

awk -F'\t' '
    NR==FNR {                               # process SAMPLE.fasta file
        if (FNR % 2) {                      # odd line with contigID
            len = split($0, a, "|")         # extract the contigID
            id = a[len]
            seq[id] = $0                    # assign seq[id] to the line
        } else {                            # even line with sequence
            seq[id] = seq[id] RS $0         # append sequence to seq[id]
        }
        next
    }
    {                                       # process contigIDs file
        fname = $1 ".fasta"                 # filename to write
        len = split($2, a, ",")             # split the contigIDs
        for (i = 1; i <= len; i++) {
            split(a[i], b, "|")             # extract the contigID
            if (b[3] in seq) {              # if the sequence is found
                print seq[b[3]] > fname     # then print it to the file
            }
        }
        close(fname)
    }
' SAMPLE.fasta contigIDs

Output:

424182.1.fasta file:
>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT

1217675.1.fasta file:
>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG

