How to Concatenate Files with the Same Prefix (And Many Prefixes)

Python - merge multiple files based on file prefix

You're overwriting the output file every time you read an input file; you need to append instead. You also have an unnecessary nested for loop to read the files, when you could read each one in the outer loop. This should work:

import os

folder = "C:\\MTA\\mta"

# get all files in folder
files = os.listdir(folder)

for filename in files:
    # skip output from an earlier run so it is not merged again
    if filename.startswith("merged_"):
        continue

    # get prefix (first two characters of the name)
    prefix = filename[0:2]

    # open destination file in append mode to merge individual files into
    with open(os.path.join(folder, "merged_" + prefix + ".txt"), 'a') as outfile:
        # read the current file and append it to outfile
        with open(os.path.join(folder, filename)) as infile:
            outfile.write(infile.read())
        outfile.write("--------------\n")
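
Because the destination is opened in append mode, running the script a second time appends everything again. One way around that, sketched here, is to group the names by prefix first and then write each merged file once in 'w' mode. The function name merge_by_prefix, its parameters, and the choice to write the output into a separate directory are assumptions made for illustration; they are not part of the original answer.

```python
import os
from collections import defaultdict


def merge_by_prefix(src_dir, dst_dir, prefix_len=2, separator="--------------\n"):
    """Merge files in src_dir that share their first prefix_len characters.

    Hypothetical helper: one output file per prefix is written to dst_dir.
    Returns a dict mapping each prefix to the path of its merged file.
    """
    groups = defaultdict(list)
    for filename in sorted(os.listdir(src_dir)):
        # Skip subdirectories; only regular files are merged.
        if os.path.isfile(os.path.join(src_dir, filename)):
            groups[filename[:prefix_len]].append(filename)

    os.makedirs(dst_dir, exist_ok=True)
    merged = {}
    for prefix, filenames in groups.items():
        out_path = os.path.join(dst_dir, "merged_" + prefix + ".txt")
        # 'w' truncates, so rerunning the script does not duplicate content.
        with open(out_path, "w") as outfile:
            for filename in filenames:
                with open(os.path.join(src_dir, filename)) as infile:
                    outfile.write(infile.read())
                outfile.write(separator)
        merged[prefix] = out_path
    return merged
```

Writing to a different directory than the one being read also keeps the merged files out of the next run's input.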

Merging CSVs with similar name python

The solution to your question is the find_filesets() method below. I've included a CSV merge method as well based on MaxNoe's answer.

#!/usr/bin/env python3

import glob
import os
import random
import shutil

import pandas


def rm_minus_rf(dirname):
    # remove the directory and everything in it; ignore a missing directory
    shutil.rmtree(dirname, ignore_errors=True)


def create_testfiles(path):
    rm_minus_rf(path)
    os.mkdir(path)

    random.seed()
    for i in range(10):
        n = random.randint(10000, 99999)
        for j in range(random.randint(0, 20)):
            # year may repeat, doesn't matter
            year = 2015 - random.randint(0, 20)
            with open("{}/{}-{}.csv".format(path, n, year), "w") as f:
                # write one data row, because pandas.read_csv() rejects an empty file
                f.write("{},{}\n".format(n, year))


def find_filesets(path="."):
    csv_files = {}
    for name in glob.glob("{}/*-*.csv".format(path)):
        # there's almost certainly a better way to do this
        key = os.path.splitext(os.path.basename(name))[0].split('-')[0]
        csv_files.setdefault(key, []).append(name)

    for key, filelist in csv_files.items():
        print(key, filelist)
        # do something with filelist
        create_merged_csv(key, filelist)


def create_merged_csv(key, filelist):
    # newline='' lets the CSV writer control line endings itself
    with open('{}-aggregate.csv'.format(key), 'w', newline='') as outfile:
        for filename in filelist:
            # header=None: the input files have no header row
            df = pandas.read_csv(filename, header=None)
            df.to_csv(outfile, index=False, header=False)


TEST_DIR_NAME = "testfiles"
create_testfiles(TEST_DIR_NAME)
find_filesets(TEST_DIR_NAME)

Bash: list different prefixes of files

Here's a pure BASH way of getting the prefixes:

for file in *.txt
do
    echo "${file%_*.txt}"
done | sort -u

This will give you a list of all the file prefixes. From there, you could use this to do your cat.

The for loop goes through all of your files. You could say for file in T*_*.txt to limit what files you're picking up.

The ${file%_*.txt} is a suffix-removal parameter expansion: it strips the shortest match of the pattern _*.txt from the end of $file, which leaves everything before the last underscore. The sort -u sorts these prefixes and collapses duplicates.
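
For readers more at home in Python, the same suffix removal can be sketched as follows. The function name strip_last_field and the sample filenames are made up for illustration; the behavior is meant to mirror ${file%_*.txt}, including leaving a name without an underscore unchanged.

```python
def strip_last_field(filename, suffix=".txt"):
    """Mimic ${file%_*.txt}: drop the shortest trailing match of '_*' + suffix."""
    if not filename.endswith(suffix):
        return filename
    stem = filename[: -len(suffix)]
    head, sep, _tail = stem.rpartition("_")
    # No underscore means the pattern does not match, so the name is unchanged.
    return head if sep else filename


# Hypothetical filenames; the set plus sorted() plays the role of sort -u.
names = ["T1_a.txt", "T1_b.txt", "T2_a.txt"]
prefixes = sorted({strip_last_field(name) for name in names})
print(prefixes)  # ['T1', 'T2']
```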

The best way is to use this as a function:

function prefix
{
    for file in *.txt
    do
        echo "${file%_*.txt}"
    done | sort -u
}

prefix | while IFS= read -r prefix
do
    cat "${prefix}"_*.txt > "$prefix.txt"
done

Note the ${...} around the name. That's because $prefix_ is also a valid shell script variable. I need the ${prefix} to let the shell know that I'm talking about $prefix and not $prefix_.
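
Python's string.Template uses the same $name and ${name} syntax, so it makes a convenient way to see the ambiguity outside the shell. This is only an analogy, and the value "T1" is a made-up prefix. One difference worth noting: the shell silently expands an unset $prefix_ to an empty string, whereas safe_substitute() leaves the unknown placeholder in place.

```python
from string import Template

# Without braces the placeholder name is read as "prefix_", which is not supplied,
# so safe_substitute() leaves it untouched.
ambiguous = Template("$prefix_*.txt").safe_substitute(prefix="T1")

# Braces end the name after "prefix", just as ${prefix} does in the shell.
explicit = Template("${prefix}_*.txt").substitute(prefix="T1")

print(ambiguous)  # $prefix_*.txt
print(explicit)   # T1_*.txt
```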

Python - Merge PDF files with same prefix using PyPDF2

os.listdir() only lists filenames; it won't include the directory name.

To get the full path to actually add into the merger, you'll have to os.path.join() the root path back in.

However, os.listdir() returns names in arbitrary order, so files that share a prefix are not guaranteed to be next to each other. It is better to refactor so that you first group the files by prefix and then process each group:

from collections import defaultdict
import os

# newer PyPDF2 releases (3.x) expose this class as PdfMerger
from PyPDF2 import PdfFileMerger

root_path = "C:\\test\\raw"
result_path = "C:\\test\\result"

# group the bare filenames by prefix first
files_by_prefix = defaultdict(list)
for filename in os.listdir(root_path):
    # index 2 selects the third underscore-separated field;
    # adjust it to match your naming scheme
    prefix = filename.split("_")[2]
    files_by_prefix[prefix].append(filename)

# then merge each group in sorted order
for prefix, filenames in files_by_prefix.items():
    result_name = os.path.join(result_path, prefix + "_merged.pdf")
    print(f"Merging {filenames} to {result_name} (prefix {prefix})")
    merger = PdfFileMerger()
    for filename in sorted(filenames):
        # os.listdir() returns bare names, so join the directory back on
        merger.append(os.path.join(root_path, filename))
    merger.write(result_name)
    merger.close()
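
As an alternative to joining the directory back on by hand, pathlib yields paths that already include the directory. The sketch below only does the grouping step and leaves the PDF merging out. The helper name group_paths_by_field, the default field index of 0, and the use of the name without its extension as the thing being split are all assumptions for illustration.

```python
from collections import defaultdict
from pathlib import Path


def group_paths_by_field(root, field=0, pattern="*.pdf"):
    """Group files under root by one underscore-separated field of their name.

    Hypothetical helper; `field` picks which part of the name is the key.
    """
    groups = defaultdict(list)
    # sorted() makes the order within each group deterministic.
    for path in sorted(Path(root).glob(pattern)):
        parts = path.stem.split("_")
        if field < len(parts):  # ignore names with too few fields
            groups[parts[field]].append(path)
    return dict(groups)
```

Each value in the returned dict is a list of full paths, so they can be passed straight to a merger's append() call.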

How to concatenate files that have the same beginning of a name?

I will assume that the naming convention is that the species is identified by the first three underscore-separated words of the filename. I will also assume that there are no blank spaces in the filenames.

A possible strategy is to get a list of all the species, and then concatenate all the files that share a species prefix into a single file:

for specie in $(ls *.fasta | cut -f1-3 -d_ | sort -u)
do
    cat "$specie"*.fasta > "$specie.fasta"
done

In this code, you list all the fasta files, cut out the species ID, and generate a unique list of species. Then you traverse this list and, for every species, concatenate all the files that start with that species ID into a single file named after the species.

More robust solutions can be written using find and avoiding ls, but they are more verbose and potentially less clear:

while IFS= read -r -d '' specie
do
    cat "$specie"*.fasta > "$specie.fasta"
done < <(find -maxdepth 1 -name "*.fasta" -print0 | cut -z -f2 -d/ | cut -z -f1-3 -d_ | sort -zu)
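
The same strategy can also be sketched in Python, which avoids parsing ls output and copes with spaces in filenames. The function name concat_by_species and its parameters are hypothetical. Unlike the shell loops, this version skips any file whose name has three fields or fewer, on the assumption that such a file is the output of an earlier run.

```python
import shutil
from collections import defaultdict
from pathlib import Path


def concat_by_species(directory, fields=3, ext=".fasta"):
    """Concatenate files whose first `fields` underscore-separated words match.

    Hypothetical helper. Returns a dict mapping each species key to the
    path of the combined file written for it.
    """
    directory = Path(directory)
    groups = defaultdict(list)
    # sorted() builds the full list before any output file is created.
    for path in sorted(directory.glob("*" + ext)):
        parts = path.stem.split("_")
        # A name with no extra fields would collide with its own output file.
        if len(parts) <= fields:
            continue
        groups["_".join(parts[:fields])].append(path)

    out_paths = {}
    for key, paths in groups.items():
        out_path = directory / (key + ext)
        with open(out_path, "wb") as out:
            for path in paths:
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
        out_paths[key] = out_path
    return out_paths
```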

