Split a large text file into multiple files using delimiters
A solution that reads and writes at the same time, to avoid keeping anything in memory, could be:
with open('input.txt') as f:
    f_out = None
    for line in f:
        if line.startswith('[TEST]'):  # we need a new output file
            # strip the trailing newline so the title is a valid file name
            title = line.split(' ', 1)[1].strip()
            if f_out:
                f_out.close()
            f_out = open(f'{title}.txt', 'w')
        if f_out:
            f_out.write(line)
    if f_out:
        f_out.close()
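As a quick check of the approach, the snippet below runs the same splitter on a small sample file. The '[TEST] <title>' header format and the file names are assumptions made up for the demo:

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())  # work in a scratch directory for the demo

# a small sample input; the '[TEST] <title>' header format is assumed
with open('input.txt', 'w') as f:
    f.write('[TEST] alpha\nline one\nline two\n[TEST] beta\nline three\n')

with open('input.txt') as f:
    f_out = None
    for line in f:
        if line.startswith('[TEST]'):  # start a new output file
            title = line.split(' ', 1)[1].strip()
            if f_out:
                f_out.close()
            f_out = open(f'{title}.txt', 'w')
        if f_out:
            f_out.write(line)
    if f_out:
        f_out.close()

print(sorted(os.listdir('.')))  # ['alpha.txt', 'beta.txt', 'input.txt']
```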
Splitting a large file into chunks
An easy way to chunk the file is to use f.read(size) until there is no content left. However, this method works with a character count instead of lines.
test_file = 'random_test.txt'

def chunks(file_name, size=10000):
    with open(file_name) as f:
        while content := f.read(size):
            yield content

if __name__ == '__main__':
    split_files = chunks(test_file)
    for chunk in split_files:
        print(len(chunk))
For the last chunk, it will take whatever is left; here, 143 characters.
Same Function with lines
test_file = "random_test.txt"

def chunks(file_name, size=10000):
    with open(file_name) as f:
        while content := f.readline():
            for _ in range(size - 1):
                content += f.readline()
            yield content.splitlines()

if __name__ == '__main__':
    split_files = chunks(test_file)
    for chunk in split_files:
        print(len(chunk))
For the last chunk, it will take whatever is left; here, 6479 lines.
I need to split a very large text file
Here's what you can do:
SIZE = 1024

with open('file.txt') as f:
    old, data = '', f.read(SIZE)
    while data:
        lines = data.splitlines()
        new_data = f.read(SIZE)
        if new_data and not data.endswith('\n'):
            # the chunk ends mid-line: carry the partial line to the next round
            old = lines.pop()
        else:
            old = ''
        # process the complete lines in `lines` here
        data = old + new_data
- If you use data.splitlines(True), the newline characters are kept in the resulting list.
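A quick illustration of the keepends behaviour:

```python
data = 'first\nsecond\nthird'

# by default, splitlines() drops the line-break characters
print(data.splitlines())      # ['first', 'second', 'third']

# with keepends=True, each element keeps its trailing newline
print(data.splitlines(True))  # ['first\n', 'second\n', 'third']
```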
Split a large text file to small ones based on location
This will run with very low memory overhead as it writes each line as it reads it.
Algorithm:
- open input file
- read a line from input file
- get country from line
- if new country then open file for country
- write the line to country's file
- loop if more lines
- close files
Code:
with open('file.txt', 'r') as infile:
    try:
        outfiles = {}
        for line in infile:
            country = line.split(';')[2].strip('*')
            if country not in outfiles:
                outfiles[country] = open(country + '.txt', 'w')
            outfiles[country].write(line)
    finally:
        for outfile in outfiles.values():
            outfile.close()
Efficient way to split a large text file in Python
Sorting is hardly possible for a 32 GB file, whether you use Python or a command-line tool (sort). Databases seem too powerful for the task, but they may be used. However, if you are unwilling to use a database, I would suggest simply splitting the source file into smaller files keyed by tile id.
You read a line, build a file name from the tile id, and append the line to that file, continuing until the source file is exhausted. It is not going to be very fast, but at least it has O(N) complexity, unlike sorting.
And, of course, individual sorting of files and concatenating them is possible. The main bottleneck in sorting a 32GB file should be memory, not CPU.
Here it is, I think:
def temp_file_name(l):
    id0, id1 = l.split()[:2]
    return "tile_%s_%s.tmp" % (id0, id1)

def split_file(name):
    ofiles = {}
    try:
        with open(name) as f:
            for l in f:
                if l:
                    fn = temp_file_name(l)
                    if fn not in ofiles:
                        ofiles[fn] = open(fn, 'w')
                    ofiles[fn].write(l)
    finally:
        for of in ofiles.values():  # itervalues() in Python 2
            of.close()

split_file('srcdata1.txt')
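The follow-up step mentioned above, sorting each tile file individually and concatenating the results, could be sketched like this. The tile_*.tmp pattern and the sorted_all.txt output name are assumptions, and the small tile files are faked for the demo:

```python
import glob
import os
import tempfile

os.chdir(tempfile.mkdtemp())  # scratch directory for the demo

# pretend these small files were produced by split_file()
with open('tile_0_0.tmp', 'w') as f:
    f.write('b line\na line\n')
with open('tile_0_1.tmp', 'w') as f:
    f.write('d line\nc line\n')

# each tile file is small enough to sort in memory, so sort them one
# by one and append the sorted lines to a single output file
with open('sorted_all.txt', 'w') as out:
    for fn in sorted(glob.glob('tile_*.tmp')):
        with open(fn) as f:
            out.writelines(sorted(f))
```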
But if there are many tiles, more than the number of files you can have open at once, you may do this:
def split_file(name):
    with open(name) as f:
        for l in f:
            if l:
                fn = temp_file_name(l)
                with open(fn, 'a') as of:  # reopen in append mode each time
                    of.write(l)
And the most perfectionist way is to close some files and remove them from the dictionary after reaching a limit on the number of open files.
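That idea could be sketched with an OrderedDict used as an LRU cache of open handles. The MAX_OPEN limit of 255 is an assumption, and files are reopened in append mode so earlier writes survive eviction; the sample input at the bottom is only for the demo:

```python
import os
import tempfile
from collections import OrderedDict

MAX_OPEN = 255  # assumed limit on simultaneously open handles

def temp_file_name(l):
    id0, id1 = l.split()[:2]
    return "tile_%s_%s.tmp" % (id0, id1)

def split_file(name):
    ofiles = OrderedDict()  # file name -> open handle, in LRU order
    try:
        with open(name) as f:
            for l in f:
                if not l.strip():
                    continue
                fn = temp_file_name(l)
                if fn in ofiles:
                    ofiles.move_to_end(fn)       # mark as recently used
                else:
                    if len(ofiles) >= MAX_OPEN:  # evict the oldest handle
                        _, oldest = ofiles.popitem(last=False)
                        oldest.close()
                    # append mode, so a reopened file keeps earlier lines
                    ofiles[fn] = open(fn, 'a')
                ofiles[fn].write(l)
    finally:
        for of in ofiles.values():
            of.close()

os.chdir(tempfile.mkdtemp())  # scratch directory for the demo
with open('srcdata1.txt', 'w') as f:
    f.write('1 2 a\n3 4 b\n1 2 c\n')
split_file('srcdata1.txt')
```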