Read FASTA into a Dataframe and Extract Subsequences of a FASTA File

Read FASTA into a dataframe and extract subsequences of FASTA file

You should have a look at the Biostrings package.

library("Biostrings")

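s = readDNAStringSet("nm.fasta")
# Coordinates are 1-based and inclusive; each (start, end) pair
# applies to the corresponding sequence in s
subseq(s, start=c(1, 2, 3), end=c(3, 6, 5))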

How to read FASTA into a dataframe and extract subsequences of a FASTA file in d3.js

1. How to parse it in d3.js?

D3.js is a JavaScript library (note the "js") for manipulating documents based on data. At the end of the day, D3 is JavaScript, and it has no built-in "parsing" function for nucleic acid sequences.

In D3 (that is, in plain JavaScript), you can represent the DNA sequence as a string:

"ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC..."

or as an array:

["A", "C", "A", "T", "A"...]

Or, in a cumbersome way, as an array of objects:

[{position:1, base:"A"}, {position:2, base:"C"}...]

It's up to you. Since FASTA is a text-based format, we will treat the data as a string (the first option).
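
As a minimal sketch of the string-based approach, the function below splits FASTA text into records. The name parseFasta and the record shape {id, sequence} are illustrative assumptions, not part of D3 or any other library. It assumes header lines start with ">" and that a sequence may wrap across several lines.

```javascript
// Hypothetical helper (not part of D3): parse FASTA text into records.
function parseFasta(text) {
  const records = [];
  let current = null;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue; // skip blank lines
    if (line.startsWith(">")) {
      // Header line: start a new record; the ID is the text up to the first whitespace
      current = { id: line.slice(1).split(/\s+/)[0], sequence: "" };
      records.push(current);
    } else if (current !== null) {
      // Sequence line: append, so wrapped sequences are joined into one string
      current.sequence += line;
    }
  }
  return records;
}

// Same toy data as a two-record FASTA file
const fastaText = ">ID1\nGAGA\n>ID2\nTATA\n";
const records = parseFasta(fastaText);
console.log(records);
```

In D3 v5 and later, the file contents could be obtained with d3.text, which returns a promise that resolves to the file's text; the result can then be passed to a function like the one above.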

2. How to extract subsequence at (start, end) location?

As D3 is a JavaScript library, you'll have to deal with your string using JavaScript methods.

For instance, to find the position of the start triplet (TAC on the template strand, corresponding to the AUG start codon in mRNA) in your sequence, you can use indexOf:

var sequence = "ACATACTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC";
var start = "TAC";
// indexOf returns the 0-based index of the first match, or -1 if there is none
console.log(sequence.indexOf(start));
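
To extract a subsequence at a given (start, end) location, the string's slice method is enough. The sketch below assumes 1-based, inclusive coordinates (the convention used by subseq in Biostrings); the function name subsequence is hypothetical.

```javascript
// Hypothetical helper: extract a subsequence using 1-based, inclusive
// coordinates, matching the convention of Biostrings' subseq in R.
function subsequence(sequence, start, end) {
  if (start < 1 || end > sequence.length || start > end) {
    throw new RangeError("invalid range " + start + "-" + end);
  }
  // slice() is 0-based with an exclusive end, so only the start shifts by one
  return sequence.slice(start - 1, end);
}

const dna = "ACATACTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC";
console.log(subsequence(dna, 4, 6)); // "TAC"
```

If your coordinates are already 0-based with an exclusive end, call slice(start, end) directly instead.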

How to split a FASTA file imported as data.frame through

Let's say your file is like this:

writeLines(">ID1\nGAGA\n>ID2\nTATA","test.fa")
dna.sequences = read.table("test.fa")

dna.sequences
    V1
1 >ID1
2 GAGA
3 >ID2
4 TATA

Assuming it's read correctly, and that header and sequence lines strictly alternate (each sequence on a single line):

rows = 1:nrow(dna.sequences)
data.frame(ID = gsub(">", "", as.character(dna.sequences[rows %% 2 == 1, 1])),
           sequences = dna.sequences[rows %% 2 == 0, 1])

Or much better, read it in directly using a package meant for this:

library(Biostrings)
data = readDNAStringSet("test.fa")

data
  A DNAStringSet instance of length 2
    width seq                                               names               
[1]     4 GAGA                                              ID1
[2]     4 TATA                                              ID2

dna.sequences = data.frame(ID=names(data),sequences=as.character(data))

dna.sequences
     ID sequences
ID1 ID1      GAGA
ID2 ID2      TATA

Extracting lines based on comma-separated string in another file and write extracted lines to file

Assuming each sequence occupies a single line (rather than being wrapped over multiple lines), here is an awk solution:

awk -F'\t' '
    NR==FNR {                                   # process SAMPLE.fasta file
        if (FNR % 2) {                          # odd line with contigID
            len = split($0, a, "|")             # extract the contigID
            id = a[len]
            seq[id] = $0                        # assign seq[id] to the line
        } else {                                # even line with sequence
            seq[id] = seq[id] RS $0             # append sequence to seq[id]
        }
        next
    }
    {                                           # process contigIDs file
        fname = $1 ".fasta"                     # filename to write
        len = split($2, a, ",")                 # split the contigIDs
        for (i = 1; i <= len; i++) {
            split(a[i], b, "|")                 # extract the contigID
            if (b[3] in seq) {                  # if the sequence is found
                print seq[b[3]] > fname         # then print it to the file
            }
        }
        close(fname)
    }
' SAMPLE.fasta contigIDs

Output:

424182.1.fasta file:
>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT

1217675.1.fasta file:
>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG
