Split Fasta Files Based on Header

Split fasta files based on header

You will improve performance by using a tool that doesn't require files to be opened and closed all the time. Awk is an excellent choice for this.

It seems to me that similar results to what you have written could be achieved with:

$ awk '/^>/ { file=substr($1,2) ".fasta" } { print > file }' *.faa

Note that unless you close() a file, awk leaves it open until the awk process is done, so the solution above will append to common fragment names, should they appear in multiple input files.

If you have a very large number of these (tens of thousands), then *.faa might expand to too many files for your shell to handle on one command line. If that's the case, you could process things more slowly using find.
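For comparison, the same header-driven routing can be sketched in Python. This is only an illustration, not the method from the answer above: the function name `split_by_header` and the `out_dir` parameter are assumptions. Like the awk one-liner, it opens each output file once and keeps it open, so records whose names appear in several input files end up in the same output file (and, as with awk, a very large number of distinct names could exhaust the open-file limit).

```python
import os


def split_by_header(paths, out_dir="."):
    """Route each FASTA record to <first header word>.fasta in out_dir.

    Hypothetical helper mirroring the awk one-liner: every output file is
    opened once (truncating it) and kept open until all inputs are read.
    """
    handles = {}  # output file name -> open file handle
    current = None
    try:
        for path in paths:
            with open(path) as src:
                for line in src:
                    if line.startswith(">"):
                        # Equivalent of substr($1, 2) ".fasta" in awk.
                        words = line[1:].split()
                        name = (words[0] if words else "") + ".fasta"
                        if name not in handles:
                            handles[name] = open(os.path.join(out_dir, name), "w")
                        current = handles[name]
                    # Lines before the first header are skipped.
                    if current is not None:
                        current.write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

It could be called as `split_by_header(glob.glob("*.faa"))`, which also sidesteps the command-line length limit because the file list never passes through the shell.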

Split multiple-fasta files according to header

Try this:

find . -name '*.fas' -exec \
    awk -F'[>_.]' '
        NF>1 {
            close(out)
            out = $NF
            sub(/[0-9].*/,"",out)
            out = $2 "_" out ".out"
        }
        { print >> out }
    ' {} \;

I suffixed your output files with ".out" so you can tell them apart from the ".fas" input files; change that to suit your needs. If your find supports it, you can use + instead of \; at the end of the find command to run awk on multiple files at a time, which speeds things up a little. As written, the above will work with any POSIX tools.

The above uses the FS (set by -F) to split each line that starts with > into its relevant parts, then recombines them to form the output file name for that line and everything that follows until the next > line. It then prints every line to the current output file.
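To make the file-name construction concrete, here is a small Python sketch of the same field arithmetic. The function name `output_name` and the sample header are assumptions for illustration only; the real headers from the question are not shown here.

```python
import re


def output_name(header):
    """Build the output file name the awk script derives from a header line."""
    # Same separators as -F'[>_.]'; fields[0] is the empty text before ">".
    fields = re.split(r"[>_.]", header.rstrip("\n"))
    # awk's $NF is the last field; sub(/[0-9].*/, "", out) removes everything
    # from the first digit onward.
    tail = re.sub(r"[0-9].*", "", fields[-1], count=1)
    # awk's $2 is the first field after the leading ">".
    return f"{fields[1]}_{tail}.out"


# Hypothetical header, only to show the mechanics:
print(output_name(">sample_gene.abc123"))  # prints sample_abc.out
```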

Extract sequences from a FASTA file to multiple files, file based on header_IDs in a separate file

A couple brief suggestions:

If all your headers follow the same pattern, then you can extract the unique elements:

record.description.split("_")[1] 

(yields "2040" from "CAP357_2040_011wpi_v1v3_1_008_00006_001.1")

If you use a dict you can assemble collections of records:

collected = {}
for record in records:
    # Group records by the second underscore-separated field of the header.
    descr = record.description.split("_")[1]
    try:
        collected[descr].append(record)
    except KeyError:
        collected[descr] = [record]

Then you can write out each collection to a new file:

import os
from Bio import SeqIO

file_path = "."  # output directory (assumed; adjust as needed)
file_name = "outfile%s"
for descr, records in collected.items():  # iteritems in python2
    with open(os.path.join(file_path, file_name % descr), 'w') as f:
        SeqIO.write(records, f, 'fasta')
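The grouping step can also be written with `collections.defaultdict`, which removes the need for the try/except. The sketch below works on plain header strings so that it runs without Biopython; the function name `group_by_field` and the second and third headers are made up for illustration.

```python
from collections import defaultdict


def group_by_field(descriptions, index=1, sep="_"):
    """Group header strings by one separator-delimited field."""
    groups = defaultdict(list)
    for descr in descriptions:
        groups[descr.split(sep)[index]].append(descr)
    return dict(groups)


headers = [
    "CAP357_2040_011wpi_v1v3_1_008_00006_001.1",  # header quoted above
    "CAP357_2040_011wpi_v1v3_1_008_00007_001.1",  # hypothetical
    "CAP357_3010_011wpi_v1v3_1_008_00001_001.1",  # hypothetical
]
for key, members in group_by_field(headers).items():
    print(key, len(members))
```

With real `SeqRecord` objects the same idea applies: key on `record.description.split("_")[1]` and append the record itself instead of the string.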

Split fasta file into specific new fasta files

I initially wrote a method (function) to parse sequences out of a Fasta file, but Fasta files come in so many variations that some setup is needed to tell one type from another. Your Fasta files are no exception: as far as I know, the hyphen or minus character (-) is not allowed within most Fasta sequences, only within the sequence Header, yet I can see that your files contain hyphens within sequences. Of course, I am most likely wrong about that. ;)

With that in mind, I've added some variables that can be set manually to provide more flexibility towards different Fasta file formats (I hope). In reality this should have been a class rather than a method from the very beginning, but I'm going to let you convert it to a workable class yourself.

Now, this is a large method (and I don't feel good about that) with a lot of comments but I wanted to give you something relatively quick. I recommend you read all the provided comments within the code.

After you try this method...do make a class of it:

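import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.ArrayList;
import java.util.List;

/**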
* Returns a {@code List<String>} Interface object of all the Sequence Clusters
* detected within the supplied Fasta Data File.<br>
*
* @param sourceFilePath (String) The full path and file name to the source
* Fasta data file to parse.<br>
*
* @param destinationFileFolderPath (String) The full path to the
* destination directory (folder) where the two split Fasta files are to be
* created. If null or Null String ("") is supplied then the two split Fasta
* files will be created within the source file directory. Destination split
* file names are auto-generated!<br>
*
* @param splitRatio (String) The ratio of sequence clusters to be applied to
* each created split Fasta files. Ratio is considered percentages and the
* highest percentage is first followed by the lower percentage delimited with
* a colon (:), for example: "60:40". The two values provided <b>must</b> sum
* to 100.<br>
*
* @param newSequenceDesignator (String) By default an inner blank line is
* considered the end of a sequence cluster and the possible start of a new
* one. If a designator is supplied here then blank lines are ignored within
* the source file while parsing (if they exist). If there are no blank lines
* within the source file separating the sequence clusters then a designator
* <b>must</b> be supplied, usually the Sequence Cluster Header designator (>)
* is used. The New Sequence Designator can be the same as the Sequence Cluster
* Header Designator.<br>
*
* @return ({@code List<String>}) The List of parsed Sequence Clusters from
* the supplied source Fasta data file.
*/
public static java.util.List<String> splitFastaFile(String sourceFilePath, String destinationFileFolderPath,
String splitRatio, String newSequenceDesignator) {
/* System newline character(s) to use for console
display when required. */
String ls = System.lineSeparator();

// Valid characters allowed to be contained within a Fasta sequence line.
String allowableCharactersInSequences = "ABCDEFGHIJKLMNOPQRSTUVWXYZ-*";

/* By default, any sequence cluster line (other than the Header) which
contains an invalid character (a character not stipulated within the
'allowableCharactersInSequences' variable) is ignored and not added
to the Sequence Cluster. If, however, you provide a string character
or phrase to the 'replaceInvalidSequenceCharactersWith' variable then
all invalid characters will be replaced by what is held within that
variable. Keep in mind that you should use a character or phrase
that is not considered valid (that is, one not also contained within
the 'allowableCharactersInSequences' variable). You want to always
maintain an invalid sequence line as INVALID unless the replacement is
indeed a VALID response (repair) to the sequence which makes up the
sequence line or Sequence Cluster. */
String replaceInvalidSequenceCharactersWith = null; // Default is null.

/* The String or string character that will denote
the start of a new Sequence cluster. */
String sequenceDesignatorString = ">";

/* The String or string character that will denote
the start of a Sequence Header line. Can be the
same as the Sequence Designator String. */
String sequenceHeaderDesignatorString = ">";

// Escape RegEx meta characters in allowableCharactersInSequences (if any).
allowableCharactersInSequences = allowableCharactersInSequences.replaceAll("[\\W]", "\\\\$0");

// Add the sequence Header (if any)
boolean keepSequenceHeader = true;

// Add a blank line between sequence clusters in created Split Fasta files.
boolean blankLineBetweenSequenceClusters = false;

/* If a comment is supplied to this variable then it MUST start with a
semicolon (;) so that it is treated as a comment line when the file is
parsed. This comment will be applied as the first line of any Split File
created. The comment provided can utilize any of the four method tags
available. These tags are:
%H The High percentage Fasta sequences split value.
%L The Low percentage Fasta sequences split value.
%SV The Split Value currently being processed.
%SFN The Source File Name

An example might be as what was provided below: */
String splitFileComment = ";SPLIT FASTA FILE - Source File: %SFN - Percent of source: %SV%";

// See if the supplied source Fasta file exists.
File f = new File(sourceFilePath);
if (!f.exists() || !f.canRead()) {
System.err.println("splitFastaFile() method error! Either the "
+ "specified source file can not be found or permission "
+ "to read the file does not exist!");
return null;
}

/* Get the supplied Fasta file name. The destination
files will derive from this name. */
String sourceFileName = f.getName();

/* If null or null string ("") is passed as the destination
directory then the Source file directory will also become
the destination for the two created Split Files. */
if (destinationFileFolderPath == null || destinationFileFolderPath.isEmpty()) {
String absPath = new File(sourceFilePath).getAbsolutePath();
destinationFileFolderPath = absPath.substring(0, absPath.lastIndexOf(File.separator));
}

/* Make sure the supplied destination file folder path
contains a system file separator character (\ or /). */
if (!destinationFileFolderPath.endsWith(File.separator)) {
destinationFileFolderPath = destinationFileFolderPath + File.separator;
}

/* Make sure a proper Split Ratio is supplied! The Split Ratio is
supplied as a colon (:) delimited string containing the desired
percentage of sequences to be saved within the first text file
and the desired percentage of sequences to be saved to the second
text file. If you want 80% of the sequences within the Fasta source
file to be written to the first text file and you want the remaining
20% to be written to the second text file then you would supply to
the 'splitRatio' parameter: "80:20". Whatever is supplied, the sum
of the two supplied values MUST equal 100 (100%). The higher of the
two values MUST be first (this is enforced). */
if (!splitRatio.matches("\\d{1,3}:\\d{1,3}")) {
System.err.println("splitFastaFile() method Error! An invalid Split "
+ "Ratio string was supplied! (" + splitRatio + ") The format "
+ "must be: \"80:20\".");
return null;
}
/* Split the ratio string provided within the 'splitRatio' parameter.
Convert the string numerical values to Integer and check validity. */
String[] ratioParts = splitRatio.split("\\s*:\\s*");
int high = Integer.valueOf(ratioParts[0]);
int low = Integer.valueOf(ratioParts[1]);
if (high + low != 100) {
System.err.println("splitFastaFile() method Error! An invalid Split "
+ "Ratio string was supplied! (" + splitRatio + ") The percentage "
+ "values supplied must sum to 100. What was supplied sums to: "
+ (high + low) + ".");
return null;
}
else if (high < low) {
System.err.println("splitFastaFile() method Error! An invalid Split "
+ "Ratio string was supplied! (" + splitRatio + ") The higher "
+ "percentage value must be on the left and the lower percentage "
+ "value on the right! Swapping values to make it valid ("
+ low + ":" + high + ")!");
int tmp = high;
high = low;
low = tmp;
}

// Load source file Sequences into a List Interface object.
List<String> sequenceList = new ArrayList<>();
int fileLineCount = 0;
int validDataLineCount = 0;
int clusterLineCount = 0;
try (BufferedReader reader = new BufferedReader(new FileReader(sourceFilePath))) {
StringBuilder sb = new StringBuilder("");
String line;
while((line = reader.readLine()) != null) {
/* Increment the File Lines Count (tracks the
total number of lines in Fasta source file). */
fileLineCount++;
/* Trim off leading and trailing whitespaces, tabs,
etc from the read in data file line. */
line = line.trim();
// Ignore lines that START with a semicolon (;). They are considered comment lines.
if (line.startsWith(";")) { continue; }
// Ignore first line of file if it's blank.
if (line.isEmpty() && validDataLineCount == 0) { continue; }
// See if this is a Sequence Header Line and whether or not we are to keep it.
if (line.startsWith(sequenceHeaderDesignatorString) && !keepSequenceHeader) {
// Increment the Sequence Cluster Line Count.
clusterLineCount++;
continue; // Ignore this Header...loop again.
}
if (validDataLineCount > 0 && ((sequenceDesignatorString != null && !sequenceDesignatorString.isEmpty()) ? line.startsWith(sequenceDesignatorString) : line.isEmpty())) {
String tmpLine = "";
if (line.startsWith(sequenceDesignatorString)) {
tmpLine = line;
}
sequenceList.add(sb.toString());
sb.delete(0, sb.length());
clusterLineCount = 0;
if (!tmpLine.isEmpty()) {
sb.append(tmpLine).append(ls);
clusterLineCount++;
validDataLineCount++;
}
}
else {
/* Skip blank lines if it is set to not be the new Sequence designator.
By default, blank inner file lines are considered the designator for
the end of a Sequence Cluster and the beginning of a new one. if
however the sequenceDesignatorString variable contains a designator
then blank lines will be ignored. */
if ((sequenceDesignatorString != null && !sequenceDesignatorString.isEmpty()) && line.isEmpty()) {
continue;
}
// Check sequence line validity
else if (!line.startsWith(sequenceHeaderDesignatorString) && !line.matches("[" + allowableCharactersInSequences + "]+")) {
// NOT VALID!
// Do invalid Characters get replaced...
if (replaceInvalidSequenceCharactersWith != null &&
!replaceInvalidSequenceCharactersWith.isEmpty()) {
StringBuilder tmpSB = new StringBuilder("");
for (int i = 0; i < line.length(); i++) {
String s = line.substring(i, i + 1);
if (s.matches("[" + allowableCharactersInSequences + "]+")) {
tmpSB.append(s);
}
else {
tmpSB.append(replaceInvalidSequenceCharactersWith);
}
}
clusterLineCount++;
sb.append(tmpSB.toString()).append(ls);
validDataLineCount++;
continue;
}
// Display the Invalid sequence line detected in console and...
System.err.println("Invalid sequence line character(s) detected in line #" + clusterLineCount + " of sequence cluster #" + (sequenceList.size() + 1) + ", on file "
+ "line #" + fileLineCount + "!" + ls + line);
// point out the characters that are invalid with Caret (^)
// characters under the invalid file line.
for (int i = 0; i < line.length(); i++) {
System.out.print(!line.substring(i, i + 1)
.matches("[" + allowableCharactersInSequences + "]+") ? "^" : " ");
}
System.out.println();
}
else {
clusterLineCount++;
sb.append(line).append(ls);
validDataLineCount++;
}
}
}
if (!sb.toString().isEmpty()) {
sequenceList.add(sb.toString());
}
} catch (FileNotFoundException ex) {
System.err.println(ex);
return null;
} catch (IOException ex) {
System.err.println(ex);
return null;
}

// Below is used for testing the method code...
//System.out.println("There are " + sequenceList.size() + " sequence(s) within the source file.");
//for (String str : sequenceList) {
// System.out.println(str);
//}

/* Auto-Generate destination file names. Destination file names will be in
the format of:
"Split_{source File Name}_{High Percentage}.txt"
and "Split_{source File Name}_{Low Percentage}.txt"

If the source file name is: "FastaFile010.txt" and the Split Ratio
supplied is: "60:40" then the destination file names will be:
"Split_FastaFile010_60.txt"
and "Split_FastaFile010_40.txt"
*/
String name = sourceFileName.substring(0, sourceFileName.lastIndexOf("."));
String extension = sourceFileName.substring(sourceFileName.lastIndexOf("."));
String highFileName = "Split_" + name + "_" + String.valueOf(high) + extension;
String lowFileName = "Split_" + name + "_" + String.valueOf(low) + extension;

//System.out.println(highFileName); // For Testing...
//System.out.println(lowFileName); // For Testing...

/* Determine the number of sequences for each file
based on the number sequences contained within
the Sequence List and the supplied desired Ratio. */
// Integer arithmetic avoids floating-point rounding surprises; the result is floored.
int highSeqs = (sequenceList.size() * high) / 100;
int lowSeqs = sequenceList.size() - highSeqs;

// System.out.println(highSeqs + " | " + lowSeqs); // For Testing...

// Create the two Split Files based on the desired ratio...
int c = 0;
String destPath;
int alreadyWritten = 0;
// Loop to create (or overwrite) the two separate split files.
while (c < 2) {
destPath = destinationFileFolderPath + (c == 0 ? highFileName : lowFileName);
String uFormat = "UTF-8"; // Save text files in UTF-8 format.
try {
FileOutputStream outputStream = new FileOutputStream(destPath);
OutputStreamWriter outputStreamWriter = new OutputStreamWriter(outputStream, uFormat);
try (BufferedWriter bufferedWriter = new BufferedWriter(outputStreamWriter)) {
// Apply a comment (if there is one) to the start of the current Split File.
if (splitFileComment != null && !splitFileComment.isEmpty()) {
// Replace method tags (if any - %H, %L, %SV, %SFN):
String actFileComment = splitFileComment.trim()
.replace("%H", String.valueOf(high))
.replace("%L", String.valueOf(low))
.replace("%SV", (c == 0 ? String.valueOf(high) : String.valueOf(low)))
.replace("%SFN", sourceFileName);
bufferedWriter.write(actFileComment);
bufferedWriter.write(System.lineSeparator());
}
int i;
for (i = alreadyWritten; i < sequenceList.size(); i++) {
bufferedWriter.write(sequenceList.get(i));
/* Add a blank line after a sequence cluster is written to
file in preparation for the next sequence but don't do
this if it's the last sequence in the required set to be
written. If you don't want a blank line between all your
sequence clusters in the saved files then supply false to
the 'blankLineBetweenSequenceClusters' variable. */
if (blankLineBetweenSequenceClusters &&
(i + 1 - alreadyWritten) != (c == 0 ? highSeqs : lowSeqs)) {
bufferedWriter.write(System.lineSeparator());
}
// Write everything in buffer to file right away.
bufferedWriter.flush();
/* Stop writing to the first file if we've reached our
sequence limit for this file. Break out of the loop
so that file #2 can be written. */
if (c == 0 && (i + 1) == highSeqs) { break; }
}
// Update how many sequences we've already written to files.
alreadyWritten = i+1;
// Close the current writer.
bufferedWriter.close();
}
}
catch (IOException e) {
System.err.println(e);
break;
}
/* Increment for which file we are currently writing to:
0 = file #1 and 1 = file #2. */
c++;
}
return sequenceList;
}

Usage example:

List<String> seqs = splitFastaFile("MyFastaFile.txt", null, "60:40", null);
// The method returns null on error, so check for that before using the list.
if (seqs != null && !seqs.isEmpty()) {
    System.out.println("There were " + seqs.size() + " Sequence Clusters within the Fasta source file!");
}
else {
    System.err.println("There was an error while processing the supplied Fasta file!");
}
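The core of the ratio logic is small once the parsing and validation are set aside. The following Python sketch shows just that part under stated assumptions: the function names `parse_ratio` and `split_by_ratio` are invented here, the records are any list, and integer arithmetic is used for the cut-off. It is not a port of the full Java method.

```python
def parse_ratio(ratio):
    """Parse "HIGH:LOW" percentages; the two values must sum to 100."""
    high_text, low_text = ratio.split(":")
    high, low = int(high_text), int(low_text)
    if high + low != 100:
        raise ValueError(f"ratio parts must sum to 100, got {high + low}")
    # Put the larger share first, as the Java method does.
    return (high, low) if high >= low else (low, high)


def split_by_ratio(records, ratio):
    """Return (first_part, second_part) of records according to ratio."""
    high, _ = parse_ratio(ratio)
    # Floor of len(records) * high / 100, computed without floats.
    cut = len(records) * high // 100
    return records[:cut], records[cut:]


first, second = split_by_ratio(list(range(10)), "60:40")
print(len(first), len(second))  # prints 6 4
```

Because the count is floored, very small inputs can leave the first part empty (one record at "60:40" gives a cut of 0), which is worth handling explicitly in real code.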

Split one file into multiple files based on a pattern

A minor adjustment of the same solution should work.

Have you read https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html ?

for f in *.faa; do
    awk '/^>/ { file=substr($1,2,2) "." FILENAME }
         { print >> file; close(file) }' "$f"
done

Results:

$: grep . [0-9][0-9].*.faa
15.plate9.H6.phages.faa:>15_fragment_1_38 (32335..32991) 1 K01356 lexA; repressor LexA [EC:3.4.21.88]
15.plate9.H6.phages.faa:MIRRMNKWYEVARQVMDTQQISQEEMAERMGVTPGAVGHWLNGKREPKIEVINRLLGELGLPILTTSIPWNEPGQQNVAPTEQPSRFYRYPVISWVEAGGWNEAVEPYPVGYSDTFELSDYKAKGRAFWLVVRGDSMTAPAGQSIPEGMLILVDTGIEPTPGKLVIAKLPESNEATFKKLVEDAGRYFLKPLNPAYPTIAISEECKLIGVIRQMTMRL*
15.plate9.H6.phages.faa:>15_fragment_1_39 (33140..33397) 1 VOG04172 REFSEQ hypothetical protein
15.plate9.H6.phages.faa:MQNLRPEASQHDAYLALAQRIQDLITSPKAQIEHQVLLVREPGESPVHWEQIVEQISEAEGINVTRNFENGSVNVSWYVESADAY*
39.plate9.H6.phages.faa:>39_fragment_4_246 (275156..276328) -1 K14059 int; integrase
39.plate9.H6.phages.faa:MGRDGRGVRAVSDTSIEITFMYRGVRCRERITLKPSPTNLKKAEQHKAAIEHAISIGAFDYSVTFPGSPRAAKFAPEANRETVAGFLTRWLDGKKRHVSSSTFVGYRKLVELRLVPALGERMVVDLKRKDVRDWLSTLEVSNKTLSNIQSCLRSALNDAAEEELIEVNPLAGWTYSRKEAPAKDDDVDPFSPEEQQAVLAALNGQARNMMQFALWTGLRTSELVALDWGDIDWLREEVMVSRAMTQAAKGQAEVPKTAAGRRSVKLLRPAMEALKAQKAHTFLADAEVFQNPRTLQRWAGDEPIRKTMWVPAIKKAGVNYRRPYQTRHTYASMMLSAGEHPMWVAKQMGHSDWTMIARVYGRWMPYWDDIAGTKAVSQWAENAHESSDSK*

Convert multi-fasta file into separate individual fasta files in R

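You can read the file into a data.frame and then, using a for loop, save each record to its own file in a folder named fastafolder. The loop pairs consecutive lines, so it assumes every record is exactly two lines (a header line followed by a single sequence line):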

fasta <- read.table("~/fasta", quote="\"", comment.char="")

dir.create("~/fastafolder")

for (line in 1:nrow(fasta)) {
  rows <- 2 * line - 1
  if (rows < nrow(fasta)) {
    write(fasta[rows:(rows + 1), 1], paste0("~/fastafolder/fasta", line))
  }
}

Created on 2022-05-28 by the reprex package (v2.0.1)
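If records can span more than two lines, a line-pairing loop no longer applies. As a hedged alternative, the Python sketch below starts a new output file at every ">" line instead. The function name `write_individual_records` is an assumption; the output naming (fasta1, fasta2, ...) simply mirrors the R code above.

```python
import os


def write_individual_records(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to out_dir/fasta<N>."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    out = None
    try:
        with open(fasta_path) as src:
            for line in src:
                if line.startswith(">"):
                    # A header closes the previous record and opens the next.
                    if out is not None:
                        out.close()
                    count += 1
                    out = open(os.path.join(out_dir, f"fasta{count}"), "w")
                # Lines before the first header are skipped.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return count
```

The function returns the number of records written, which makes it easy to check against a count of header lines in the input.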


