Importing Large Tab-Delimited .Txt File into Python

How to read a tab delimited file into Python with rows of unequal length?

It also works easily with pandas, keeping the header as columns:

import pandas as pd
data = pd.read_csv(file_to_read, sep='\t')
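If the rows really do have different numbers of fields, pandas may raise a parsing error when a later row is wider than the first one. A minimal sketch of one workaround, using only the standard library: read every row with the csv module and pad the short rows to the width of the longest. The function name read_ragged_tsv, the fill value, and the sample data are assumptions made for illustration.

```python
import csv
import io


def read_ragged_tsv(handle, fill=""):
    """Read tab-delimited rows and pad short rows to the widest row."""
    rows = list(csv.reader(handle, delimiter="\t"))
    # Width of the longest row; 0 if the input is empty
    width = max((len(row) for row in rows), default=0)
    return [row + [fill] * (width - len(row)) for row in rows]


# Hypothetical sample standing in for a real file opened with open(path, newline="")
sample = io.StringIO("a\tb\tc\n1\t2\n3\t4\t5\t6\n")
padded = read_ragged_tsv(sample)
print(padded)
```

This keeps the whole file in memory, so for a very large file it may be better to scan once for the width and pad the rows while streaming.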

Reading a text file (tab/space delimited) having named columns into lists with the lists having the same name as the column name

You could do this:

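import re
with open('file.txt') as f:
    # split each non-empty line on runs of spaces or on a tab
    data = [re.split('[ ]+|\t', x) for x in f.read().split('\n') if x]
# one empty list per column name, taken from the header row
res = dict((x, []) for x in data[0])
for i in data[1:]:
    for j in range(len(i)):
        res[data[0][j]].append(i[j])
print(res)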
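As an alternative sketch, the same mapping from column name to list of values can be built by transposing the rows with zip. It assumes that every row has as many fields as the header and that no field is empty, because str.split() collapses runs of whitespace. The function name columns_by_name and the sample data are made up for illustration.

```python
import io


def columns_by_name(handle):
    """Return {column name: list of values} from space/tab-delimited text."""
    # str.split() with no argument splits on any run of spaces or tabs
    rows = [line.split() for line in handle if line.strip()]
    header, body = rows[0], rows[1:]
    # zip(*body) turns the list of rows into a sequence of columns
    return {name: list(col) for name, col in zip(header, zip(*body))}


# Hypothetical sample mixing spaces and tabs as delimiters
sample = io.StringIO("x y\tz\n1 2\t3\n4 5\t6\n")
res = columns_by_name(sample)
print(res)
```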

Read n qty rows of delimited txt file in Python

A pandas DataFrame can do that for you:

import pandas as pd
df = pd.read_csv(myfile,sep='\t')
df.head(n=5) # for the 5 first lines of your file

For more info, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv
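Note that the snippet above reads the whole file before head() trims the result. For a large file it may be cheaper to stop reading early; read_csv has an nrows parameter for that. A minimal standard-library sketch of the same idea follows, with the function name read_first_rows and the sample data as assumptions.

```python
import csv
import io
from itertools import islice


def read_first_rows(handle, n):
    """Return the header and at most n data rows of a tab-delimited stream."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # islice stops after n rows, so the rest of the stream is never parsed
    return header, list(islice(reader, n))


# Hypothetical sample standing in for a real file
sample = io.StringIO("id\tname\n1\ta\n2\tb\n3\tc\n")
header, rows = read_first_rows(sample, 2)
print(header, rows)
```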

Convert tab-delimited txt file into a csv file using Python

The csv module supports tab-delimited files. Supply the delimiter argument to the reader:

import csv

txt_file = r"mytxt.txt"
csv_file = r"mycsv.csv"

# In Python 3, open both files in text mode with newline=''
# so that the csv module handles the line terminators itself
# (Python 2 needed the 'rb' / 'wb' modes here instead)
# 'with' closes both files even if an error occurs
with open(txt_file, newline='') as f_in, open(csv_file, 'w', newline='') as f_out:
    in_txt = csv.reader(f_in, delimiter='\t')
    out_csv = csv.writer(f_out)
    out_csv.writerows(in_txt)

Parsing through .txt file to create tab delimited output file

Have a look at this example:

import re

InFileName = 'YOUR_INPUT_FILE_NAME'
OutFileName = 'YOUR_OUTPUT_FILE_NAME'

# Write the header
Headstring= "GenBank_Assembly_ID\tRefSeq_Assembly_ID\tAssembly_level\tChromosome\tPlasmid\tRefseq_chromosome\tRefseq_plasmid1\tRefseq_plasmid2\tRefseq_plasmid3\tRefseq_plasmid4\tRefseq_plasmid5"

# Look for corresponding data from each file
with open(InFileName, 'r') as InFile, open(OutFileName, 'w') as OutFile:
    chromosomes = []
    plasmids = []
    for line in InFile:
        if line.lstrip()[0] == '#':
            # Process header part of the file differently from the data part
            if re.search(r': (GCA_[\d\.]+)', line, re.M|re.I):
                GCA = re.search(r': (GCA_[\d\.]+)', line, re.M|re.I)
                print(GCA.group(1))
                GCA = GCA.group(1)
            if re.search(r': (GCF_[\d\.]+)', line, re.M|re.I):
                GCF = re.search(r': (GCF_[\d\.]+)', line, re.M|re.I)
                print(GCF.group(1))
                GCF = GCF.group(1)
            if re.search(r'level: (.+$)', line, re.M|re.I):
                assembly = re.search(r'level: (.+$)', line, re.M|re.I)
                print(assembly.group(1))
                assembly = assembly.group(1)
        elif assembly in ['Chromosome', 'Complete Genome']:
            # Process each data line separately
            split_line = line.split()
            Type = split_line[3]
            RefSeq_Accn = split_line[6]
            if Type == "Chromosome":
                chromosomes.append(RefSeq_Accn)
            if Type == "Plasmid":
                plasmids.append(RefSeq_Accn)

    # Merge names of up to N chromosomes
    N = 1
    cstr = ''
    for i in range(N):
        if i < len(chromosomes):
            nextChromosome = chromosomes[i]
        else:
            nextChromosome = ''
        cstr += '\t' + nextChromosome

    # Merge names of up to M plasmids
    M = 5
    pstr = ''
    for i in range(M):
        if i < len(plasmids):
            nextPlasmid = plasmids[i]
        else:
            nextPlasmid = ''
        pstr += '\t' + nextPlasmid

    OutputString = "%s\t%s\t%s\t%s\t%s" % (GCA, GCF, assembly, len(chromosomes), len(plasmids))
    OutputString += cstr
    OutputString += pstr

    OutFile.write(Headstring+'\n'+OutputString)

Input:

# Assembly name:  ASM1844v1
# Organism name: Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name: strain=ACICU
# Taxid: 405416
# BioSample: SAMN02603140
# BioProject: PRJNA17827
# Submitter: CNR - National Research Council
# Date: 2008-4-15
# Assembly type: n/a
# Release type: major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession RefSeq Unit Accession Assembly-Unit name
## GCA_000018455.1 GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role Assigned-Molecule Assigned-Molecule-Location/Type GenBank-Accn Relationship RefSeq-Accn Assembly-Unit Sequence-Length UCSC-style-name
ANONYMOUS assembled-molecule na Chromosome CP000863.1 = NC_010611.1 Primary Assembly 3904116 na
pACICU1 assembled-molecule pACICU1 Plasmid CP000864.1 = NC_010605.1 Primary Assembly 28279 na
pACICU2 assembled-molecule pACICU2 Plasmid CP000865.1 = NC_010606.1 Primary Assembly 64366 na

Output:

GenBank_Assembly_ID  RefSeq_Assembly_ID      Assembly_level  Chromosome  Plasmid Refseq_chromosome  Refseq_plasmid1 Refseq_plasmid2  Refseq_plasmid3 Refseq_plasmid4  Refseq_plasmid5
GCA_000018445.1 GCF_000018445.1 Complete Genome 1 2 NC_010611.1 NC_010605.1 NC_010606.1

The main differences from your script:

  • I use the condition if line.lstrip()[0] == '#' to process the "header" lines (the lines starting with a hash character) differently from the "table rows" at the bottom (the lines that actually contain the data for each sequence).
  • I use the condition if assembly in ['Chromosome', 'Complete Genome'] - this is the condition you specified in your question
  • I split each table row into values like this split_line = line.split(). After that I acquire the type by Type = split_line[3] (this is the fourth column in the table data) and RefSeq_Accn = split_line[6] gives me the seventh column in the table.
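The row-splitting step described in the last point can be tried on its own. The sketch below applies it to one of the data rows from the input above. The helper name parse_row is made up for illustration, and the approach assumes that none of the first seven columns contains a space.

```python
def parse_row(line):
    """Return (type, RefSeq accession) from one data row of the report."""
    fields = line.split()
    # Fourth column is the molecule type, seventh is the RefSeq accession
    return fields[3], fields[6]


# Data row copied from the sample input
row = ("pACICU1 assembled-molecule pACICU1 Plasmid CP000864.1 = "
       "NC_010605.1 Primary Assembly 28279 na")
seq_type, accession = parse_row(row)
print(seq_type, accession)
```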

How to convert a tab delimited .txt file to an xml or csv using Python

Normally, the csv module should be able to do it. If not (for example, because the number of spaces separating the values is not consistent), you can split the lines manually:

content = "INPUTGOESHERE".split("\n")

for i in range(len(content)):
    # split the line at spaces and filter out empty strings
    content[i] = list(filter(bool, content[i].split(" ")))

outstr = ""

for line in content:
    line = ",".join(line) # convert values list to a comma separated string for each line
    outstr += line + "\n"

print(outstr)

See the edit of this answer for how to convert CSV to XML.
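The edit mentioned above is not reproduced here. As a rough sketch of one possible approach, the standard library's csv.DictReader and xml.etree.ElementTree can turn a tab-delimited file with a header line into XML. The function name tsv_to_xml, the tag names "rows" and "row", and the sample data are all assumptions, and the code assumes that the header names are valid XML tag names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def tsv_to_xml(handle, root_tag="rows", row_tag="row"):
    """Convert tab-delimited text with a header line into an XML string."""
    reader = csv.DictReader(handle, delimiter="\t")
    root = ET.Element(root_tag)
    for record in reader:
        row = ET.SubElement(root, row_tag)
        for name, value in record.items():
            # Assumes each header name is a valid XML tag name
            ET.SubElement(row, name).text = value
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample standing in for a real file
sample = io.StringIO("id\tname\n1\talpha\n2\tbeta\n")
xml_text = tsv_to_xml(sample)
print(xml_text)
```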

Parsing CSV / tab-delimited txt file with Python

Start by turning the text into a list of lists. That will take care of the parsing part:

import csv
lol = list(csv.reader(open('text.txt', newline=''), delimiter='\t'))

The rest can be done with indexed lookups:

d = dict()
key = lol[6][0] # cell A7
value = lol[6][3] # cell D7
d[key] = value # add the entry to the dictionary
...

