How to read a tab delimited file into Python with rows of unequal length?
It also works easily with pandas, keeping the header as columns:
import pandas as pd
data = pd.read_csv(file_to_read, sep='\t')
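If some rows hold more fields than the header row, read_csv can raise a parser error. A minimal standard-library sketch for that case follows; the helper name read_ragged_tsv and the fill value are assumptions for illustration, not part of the original answer. It pads every short row to the width of the longest one.

```python
import csv
import io


def read_ragged_tsv(lines, fill=""):
    """Parse tab-delimited lines and pad short rows to the longest row's width."""
    rows = list(csv.reader(lines, delimiter="\t"))
    width = max((len(row) for row in rows), default=0)
    return [row + [fill] * (width - len(row)) for row in rows]


# Hypothetical sample data with rows of 3, 2 and 4 fields
sample = io.StringIO("a\tb\tc\n1\t2\n3\t4\t5\t6\n")
for row in read_ragged_tsv(sample):
    print(row)
```

For a file on disk, pass the object returned by open(path, newline='') instead of the StringIO sample.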
Reading a text file (tab/space delimited) having named columns into lists with the lists having the same name as the column name
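You could do this:
import re
with open('file.txt') as f:
    # split each non-empty line on runs of spaces or on a tab
    data = [re.split('[ ]+|\t', x) for x in f.read().split('\n') if x]
# one empty list per column name, taken from the header row
res = dict((x, []) for x in data[0])
for i in data[1:]:
    for j in range(len(i)):
        res[data[0][j]].append(i[j])
print(res)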
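When the file is strictly tab-delimited, csv.DictReader can do the header bookkeeping. This is a sketch under that assumption (it does not split on runs of spaces), and the helper name columns_by_name is made up for illustration.

```python
import csv
import io
from collections import defaultdict


def columns_by_name(lines, delimiter="\t"):
    """Return {column name: list of values} using the first line as the header."""
    columns = defaultdict(list)
    for record in csv.DictReader(lines, delimiter=delimiter):
        for name, value in record.items():
            columns[name].append(value)
    return dict(columns)


# Hypothetical sample data
sample = io.StringIO("x\ty\n1\t2\n3\t4\n")
print(columns_by_name(sample))
```

The values stay strings; convert them afterwards if numbers are needed.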
Read n qty rows of delimited txt file in Python
pandas can do this: load the file into a DataFrame and take the first n rows with head.
import pandas as pd
df = pd.read_csv(myfile, sep='\t')
df.head(n=5)  # the first 5 rows of your file
For more info, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv
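Note that head only trims a DataFrame that has already been loaded in full; read_csv also accepts an nrows argument that limits how many rows are parsed. Without pandas, the same idea can be sketched with itertools.islice. The helper name first_rows is an assumption for illustration.

```python
import csv
import io
from itertools import islice


def first_rows(lines, n, delimiter="\t"):
    """Return at most the first n parsed rows without reading the rest."""
    return list(islice(csv.reader(lines, delimiter=delimiter), n))


# Hypothetical sample data: a header plus three data rows
sample = io.StringIO("id\tname\n1\ta\n2\tb\n3\tc\n")
print(first_rows(sample, 2))
```

The header counts as a row here, so ask for n + 1 rows if the first line holds column names.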
Convert tab-delimited txt file into a csv file using Python
The csv module supports tab-delimited files. Supply the delimiter argument to csv.reader:
import csv
txt_file = r"mytxt.txt"
csv_file = r"mycsv.csv"
# use 'with' so both files are closed even if an error occurs
# newline='' is what the csv module expects in Python 3:
# it stops Python translating line terminators on read and write
# (on Python 2, open the files in 'rb' / 'wb' mode instead)
with open(txt_file, newline='') as f_in, open(csv_file, 'w', newline='') as f_out:
    in_txt = csv.reader(f_in, delimiter='\t')
    out_csv = csv.writer(f_out)
    out_csv.writerows(in_txt)
Parsing through .txt file to create tab delimited output file
Have a look at this example:
import re
InFileName = 'YOUR_INPUT_FILE_NAME'
OutFileName = 'YOUR_OUTPUT_FILE_NAME'
# Write the header
Headstring = "GenBank_Assembly_ID\tRefSeq_Assembly_ID\tAssembly_level\tChromosome\tPlasmid\tRefseq_chromosome\tRefseq_plasmid1\tRefseq_plasmid2\tRefseq_plasmid3\tRefseq_plasmid4\tRefseq_plasmid5"
# Look for corresponding data from each file
with open(InFileName, 'r') as InFile, open(OutFileName, 'w') as OutFile:
    chromosomes = []
    plasmids = []
    for line in InFile:
        if line.lstrip()[0] == '#':
            # Process header part of the file differently from the data part
            if re.search(r': (GCA_[\d\.]+)', line, re.M|re.I):
                GCA = re.search(r': (GCA_[\d\.]+)', line, re.M|re.I)
                print(GCA.group(1))
                GCA = GCA.group(1)
            if re.search(r': (GCF_[\d\.]+)', line, re.M|re.I):
                GCF = re.search(r': (GCF_[\d\.]+)', line, re.M|re.I)
                print(GCF.group(1))
                GCF = GCF.group(1)
            if re.search(r'level: (.+$)', line, re.M|re.I):
                assembly = re.search(r'level: (.+$)', line, re.M|re.I)
                print(assembly.group(1))
                assembly = assembly.group(1)
        elif assembly in ['Chromosome', 'Complete Genome']:
            # Process each data line separately
            split_line = line.split()
            Type = split_line[3]
            RefSeq_Accn = split_line[6]
            if Type == "Chromosome":
                chromosomes.append(RefSeq_Accn)
            if Type == "Plasmid":
                plasmids.append(RefSeq_Accn)
    # Merge names of up to N chromosomes
    N = 1
    cstr = ''
    for i in range(N):
        if i < len(chromosomes):
            nextChromosome = chromosomes[i]
        else:
            nextChromosome = ''
        cstr += '\t' + nextChromosome
    # Merge names of up to M plasmids
    M = 5
    pstr = ''
    for i in range(M):
        if i < len(plasmids):
            nextPlasmid = plasmids[i]
        else:
            nextPlasmid = ''
        pstr += '\t' + nextPlasmid
    OutputString = "%s\t%s\t%s\t%s\t%s" % (GCA, GCF, assembly, len(chromosomes), len(plasmids))
    OutputString += cstr
    OutputString += pstr
    OutFile.write(Headstring+'\n'+OutputString)
Input:
# Assembly name: ASM1844v1
# Organism name: Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name: strain=ACICU
# Taxid: 405416
# BioSample: SAMN02603140
# BioProject: PRJNA17827
# Submitter: CNR - National Research Council
# Date: 2008-4-15
# Assembly type: n/a
# Release type: major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession RefSeq Unit Accession Assembly-Unit name
## GCA_000018455.1 GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role Assigned-Molecule Assigned-Molecule-Location/Type GenBank-Accn Relationship RefSeq-Accn Assembly-Unit Sequence-Length UCSC-style-name
ANONYMOUS assembled-molecule na Chromosome CP000863.1 = NC_010611.1 Primary Assembly 3904116 na
pACICU1 assembled-molecule pACICU1 Plasmid CP000864.1 = NC_010605.1 Primary Assembly 28279 na
pACICU2 assembled-molecule pACICU2 Plasmid CP000865.1 = NC_010606.1 Primary Assembly 64366 na
Output:
GenBank_Assembly_ID RefSeq_Assembly_ID Assembly_level Chromosome Plasmid Refseq_chromosome Refseq_plasmid1 Refseq_plasmid2 Refseq_plasmid3 Refseq_plasmid4 Refseq_plasmid5
GCA_000018445.1 GCF_000018445.1 Complete Genome 1 2 NC_010611.1 NC_010605.1 NC_010606.1
The main differences from your script:
- I use the condition if line.lstrip()[0] == '#' to process the "header" lines (the lines starting with a hash character) differently from the "table rows" at the bottom (the lines actually containing data for each sequence).
- I use the condition if assembly in ['Chromosome', 'Complete Genome']; this is the condition you specified in your question.
- I split each table row into values with split_line = line.split(). After that I get the type with Type = split_line[3] (the fourth column in the table data), and RefSeq_Accn = split_line[6] gives me the seventh column in the table.
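The column positions can be checked on one row of the sample table. This is a minimal sketch, assuming the real report separates columns with tabs (the sample input shows them flattened to spaces); the names by_whitespace and by_tab are illustrative only.

```python
# One data row from the sample input, with tabs assumed between columns
row = ("pACICU1\tassembled-molecule\tpACICU1\tPlasmid\tCP000864.1\t=\t"
       "NC_010605.1\tPrimary Assembly\t28279\tna\n")

# line.split() splits on any whitespace, so "Primary Assembly" becomes two items
by_whitespace = row.split()
# splitting on tabs keeps multi-word values such as "Primary Assembly" intact
by_tab = row.rstrip("\n").split("\t")

seq_type = by_whitespace[3]      # fourth column: Chromosome or Plasmid
refseq_accn = by_whitespace[6]   # seventh column: RefSeq accession

print(seq_type, refseq_accn)
print(len(by_whitespace), len(by_tab))
```

Indexes 3 and 6 are the same under both splits because every column before them is a single word; columns after "Primary Assembly" shift by one when splitting on whitespace.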
How to convert a tab delimited .txt file to an xml or csv using Python
Normally, the csv module should be able to do it. If not (for example, the values are not separated by a consistent number of spaces), you can split the lines manually:
content = "INPUTGOESHERE".split("\n")
for i in range(len(content)):
    # split the line at spaces and filter out empty strings
    content[i] = list(filter(bool, content[i].split(" ")))
outstr = ""
for line in content:
    line = ",".join(line)  # convert values list to a comma separated string for each line
    outstr += line + "\n"
print(outstr)
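Joining with "," by hand produces a broken CSV as soon as a value itself contains a comma. A sketch of the same conversion that lets csv.writer handle the quoting follows; the function name whitespace_to_csv is hypothetical, and it assumes values never contain spaces.

```python
import csv
import io


def whitespace_to_csv(text):
    """Convert whitespace-separated text to CSV, quoting values where needed."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        fields = line.split()  # splits on any run of spaces or tabs
        if fields:             # skip blank lines
            writer.writerow(fields)
    return out.getvalue()


# Hypothetical sample data; the value "2,5" contains a comma
print(whitespace_to_csv("a  b   c\n1 2,5 3\n"))
```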
See the edit of this answer for how to convert CSV to XML.
Parsing CSV / tab-delimited txt file with Python
Start by turning the text into a list of lists. That will take care of the parsing part:
import csv
# in Python 3, open the file in text mode with newline='' (not 'rb')
with open('text.txt', newline='') as f:
    lol = list(csv.reader(f, delimiter='\t'))
The rest can be done with indexed lookups:
d = dict()
key = lol[6][0] # cell A7
value = lol[6][3] # cell D7
d[key] = value # add the entry to the dictionary
...
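The spreadsheet-style comments above (A7, D7) map to zero-based list indexes. A small hypothetical helper, not part of the original answer, can do that translation for a list of lists like lol:

```python
def cell(rows, ref):
    """Look up a spreadsheet-style reference such as 'D7' in a list of rows."""
    letters = "".join(ch for ch in ref if ch.isalpha()).upper()
    digits = "".join(ch for ch in ref if ch.isdigit())
    col = 0
    for ch in letters:  # 'A' -> 1, 'Z' -> 26, 'AA' -> 27
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return rows[int(digits) - 1][col - 1]


# Hypothetical 7x4 grid whose values spell out their own cell reference
grid = [[f"{c}{r}" for c in "ABCD"] for r in range(1, 8)]
d = {cell(grid, "A7"): cell(grid, "D7")}
print(d)
```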