How to Split a Huge Text File in Python

Split a large text file into multiple files using delimiters

A solution that reads and writes at the same time, so that nothing needs to be kept in memory, could be:

with open('input.txt') as f:
    f_out = None
    for line in f:
        if line.startswith('[TEST]'):  # we need a new output file
            # use the rest of the line as the file name (strip the trailing newline)
            title = line.split(' ', 1)[1].strip()
            if f_out:
                f_out.close()
            f_out = open(f'{title}.txt', 'w')
        if f_out:
            f_out.write(line)
    if f_out:
        f_out.close()
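
For illustration, the snippet assumes an input.txt in which every line starting with '[TEST] <name>' opens a new section; the names and content below are made up:

[TEST] alpha
first line belonging to alpha
second line belonging to alpha
[TEST] beta
first line belonging to beta

With that input, the loop produces alpha.txt and beta.txt, each beginning with its own '[TEST]' line.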

Splitting a large file into chunks

An easy way to chunk the file is to call f.read(size) repeatedly until no content is left. Note, however, that this method works with a number of characters rather than a number of lines.

test_file = 'random_test.txt'


def chunks(file_name, size=10000):
    with open(file_name) as f:
        while content := f.read(size):
            yield content


if __name__ == '__main__':
    split_files = chunks(test_file)
    for chunk in split_files:
        print(len(chunk))

The last chunk takes whatever is left, here 143 characters.


Same function, but working with lines

test_file = "random_test.txt"


def chunks(file_name, size=10000):
with open(file_name) as f:
while content := f.readline():
for _ in range(size - 1):
content += f.readline()

yield content.splitlines()


if __name__ == '__main__':
split_files = chunks(test_file)

for chunk in split_files:
print(len(chunk))


The last chunk takes whatever is left, here 6479 lines.
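
To actually write each chunk to its own file rather than just printing its length, the same generator can be reused along these lines (the part_%d.txt naming is made up for the example):

if __name__ == '__main__':
    for i, chunk in enumerate(chunks(test_file)):
        # chunk is a list of lines without newlines, so re-join them on write
        with open('part_%d.txt' % i, 'w') as out:
            out.write('\n'.join(chunk) + '\n')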

I need to split a very large text file

Here's what you can do:

SIZE = 1024

with open('file.txt') as f:
    old, data = '', f.read(SIZE)

    while data:
        # (1)
        lines = data.splitlines()
        if not data.endswith('\n'):
            # the chunk ends mid-line: keep the partial last line for the next round
            old = lines.pop()
        else:
            old = ''

        # process the complete lines in `lines` here

        data = old + f.read(SIZE)
  1. If you do data.splitlines(True), then the newline characters will be kept in the resulting list.
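
A quick illustration of the difference, separate from the snippet above:

data = 'one\ntwo\nthr'
print(data.splitlines())       # ['one', 'two', 'thr']
print(data.splitlines(True))   # ['one\n', 'two\n', 'thr']

With keepends set to True, the partial-line check above could equally test lines[-1].endswith('\n') instead of looking at data itself.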

Split a large text file into small ones based on location

This will run with very low memory overhead as it writes each line as it reads it.

Algorithm:

  • open input file
  • read a line from input file
  • get country from line
  • if new country then open file for country
  • write the line to country's file
  • loop if more lines
  • close files

Code:

with open('file.txt', 'r') as infile:
    try:
        outfiles = {}
        for line in infile:
            # the country is the third ';'-separated field, with any '*' stripped
            country = line.split(';')[2].strip('*')
            if country not in outfiles:
                outfiles[country] = open(country + '.txt', 'w')
            outfiles[country].write(line)
    finally:
        for outfile in outfiles.values():
            outfile.close()

Efficient way to split a large text file in Python

Sorting is hardly feasible for a 32GB file, whether you use Python or a command-line tool (sort). A database may seem like overkill, but it could be used. However, if you would rather not use a database, I would suggest simply splitting the source file into files keyed by the tile id.

Read a line, build a file name from the tile id, and append the line to that file; continue until the source file is exhausted. It will not be particularly fast, but at least it has O(N) complexity, unlike sorting.

And, of course, sorting the individual files and then concatenating them is possible. The main bottleneck in sorting a 32GB file should be memory, not CPU.
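
A rough sketch of that follow-up step, assuming each tile file is small enough to sort in memory, that plain lexicographic line order is what is wanted, and that the tile files follow the tile_*.tmp naming used by the splitting code below (sorted_output.txt is a made-up name):

import glob


def sort_and_concatenate(pattern='tile_*.tmp', out_name='sorted_output.txt'):
    # sort each tile file in memory, then append its lines to the combined output
    with open(out_name, 'w') as out:
        for tile_name in sorted(glob.glob(pattern)):
            with open(tile_name) as tile:
                out.writelines(sorted(tile))


sort_and_concatenate()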

Here is the splitting code, I think:

def temp_file_name(l):
    # the tile id is made of the first two fields of the line
    id0, id1 = l.split()[:2]
    return "tile_%s_%s.tmp" % (id0, id1)


def split_file(name):
    ofiles = {}
    try:
        with open(name) as f:
            for l in f:
                if l:
                    fn = temp_file_name(l)
                    if fn not in ofiles:
                        ofiles[fn] = open(fn, 'w')
                    ofiles[fn].write(l)
    finally:
        for of in ofiles.values():  # was itervalues(), which is Python 2 only
            of.close()


split_file('srcdata1.txt')

But if there are a lot of tiles, more than the number of files you can have open at once, you can do this instead:

def split_file(name):
    with open(name) as f:
        for l in f:
            if l:
                fn = temp_file_name(l)
                # open in append mode, write the line, close right away
                with open(fn, 'a') as of:
                    of.write(l)

And the most thorough approach is to keep a limited number of files open, closing some and removing them from the dictionary once a limit on the number of open files is reached.
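
A minimal sketch of that idea, assuming a simple "close the least recently used file when the limit is hit" policy; the limit of 100 and the function name are made up, and temp_file_name is reused from the snippet above:

from collections import OrderedDict

MAX_OPEN = 100  # arbitrary limit, keep it below the OS limit on open files


def split_file_limited(name):
    ofiles = OrderedDict()  # file name -> open handle, least recently used first
    try:
        with open(name) as f:
            for l in f:
                if not l.strip():
                    continue
                fn = temp_file_name(l)
                if fn in ofiles:
                    ofiles.move_to_end(fn)              # mark as recently used
                else:
                    if len(ofiles) >= MAX_OPEN:
                        _, oldest = ofiles.popitem(last=False)
                        oldest.close()                  # evict the least recently used handle
                    ofiles[fn] = open(fn, 'a')          # append mode, the file may have been closed before
                ofiles[fn].write(l)
    finally:
        for of in ofiles.values():
            of.close()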


