Python - merge multiple files based on file prefix
You're writing a new file every time you read a file; you need to append instead. You also have an unnecessary nested for loop to read the files, when you could read them in the outer loop. This should work:
import os

# get all files in the folder
files = os.listdir("C:\\MTA\\mta")

for filename in files:
    # get the prefix
    prefix = filename[0:2]
    # open the destination file to merge individual files into
    with open(os.path.join("C:\\MTA\\mta", "merged" + "_" + prefix + ".txt"), 'a') as outfile:
        # merge the current file into outfile
        with open(os.path.join("C:\\MTA\\mta", filename)) as infile:
            outfile.write(infile.read())
            outfile.write("--------------\n")
Merging CSVs with similar names in Python
The solution to your question is the find_filesets() method below. I've included a CSV merge method as well, based on MaxNoe's answer.
#!/usr/bin/env python
import glob
import random
import os
import pandas

def rm_minus_rf(dirname):
    # delete every file under dirname, then remove the directories themselves
    for r, d, f in os.walk(dirname):
        for name in f:
            os.remove(os.path.join(r, name))
        os.removedirs(r)

def create_testfiles(path):
    rm_minus_rf(path)
    os.mkdir(path)
    random.seed()
    for i in range(10):
        n = random.randint(10000, 99999)
        for j in range(random.randint(0, 20)):
            # year may repeat, doesn't matter
            year = 2015 - random.randint(0, 20)
            with open("{}/{}-{}.csv".format(path, n, year), "w") as f:
                # write one dummy row so pandas has data to read
                f.write("{},{}\n".format(n, year))

def find_filesets(path="."):
    csv_files = {}
    for name in glob.glob("{}/*-*.csv".format(path)):
        # there's almost certainly a better way to do this:
        # the key is the basename up to the first '-'
        key = os.path.splitext(os.path.basename(name))[0].split('-')[0]
        csv_files.setdefault(key, []).append(name)
    for key, filelist in csv_files.items():
        print(key, filelist)
        # do something with filelist
        create_merged_csv(key, filelist)

def create_merged_csv(key, filelist):
    with open('{}-aggregate.csv'.format(key), 'w') as outfile:
        for filename in filelist:
            df = pandas.read_csv(filename, header=None)
            df.to_csv(outfile, index=False, header=False)

TEST_DIR_NAME = "testfiles"
create_testfiles(TEST_DIR_NAME)
find_filesets(TEST_DIR_NAME)
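If you'd rather get the merged data back as a single DataFrame instead of streaming rows to a file, here's a minimal sketch using pandas.concat (merged_frame is a hypothetical helper name, not part of the answer above):

import glob
import pandas

def merged_frame(key, path="testfiles"):
    # read every CSV that shares the key prefix and stack them into one DataFrame
    parts = [pandas.read_csv(name, header=None)
             for name in glob.glob("{}/{}-*.csv".format(path, key))]
    return pandas.concat(parts, ignore_index=True)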
Bash: list different prefixes of files
Here's a pure Bash way of getting the prefixes:
for file in *.txt
do
    echo "${file%_*.txt}"
done | sort -u
This will give you a list of all the file prefixes. From there, you could use this to do your cat.
The for loop goes through all of your files. You could say for file in T*_*.txt to limit what files you're picking up. The ${file%_*.txt} is a suffix-removing parameter expansion: it strips the shortest match of the pattern _*.txt from the end of the variable $file. The sort -u sorts all of these prefixes and combines duplicates.
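For example, with a hypothetical filename the expansion behaves like this:

file="T01_data.txt"
echo "${file%_*.txt}"    # prints T01: the shortest suffix matching _*.txt is removed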
The best way is to use this as a function:
function prefix
{
    for file in *.txt
    do
        echo "${file%_*.txt}"
    done | sort -u
}

prefix | while read prefix
do
    cat "${prefix}"_*.txt > "$prefix.txt"
done
Note the ${...} around the name. That's because $prefix_ is also a valid shell variable name. I need the ${prefix} to let the shell know that I'm talking about $prefix and not $prefix_.
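A quick demonstration of the difference (with a hypothetical value):

prefix="T01"
echo "$prefix_file.txt"     # prints just ".txt": the shell looks up a variable named prefix_file
echo "${prefix}_file.txt"   # prints "T01_file.txt"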
Python - Merge PDF files with same prefix using PyPDF2
os.listdir() only lists filenames; it won't include the directory name. To get the full path to actually add into the merger, you'll have to os.path.join() the root path back in.
However, you'll also need to note that the files you get from os.listdir() may not necessarily be in the order you want for your prefixes, so it'd be better to refactor things so you first group the files by prefix, then process each prefix group:
from collections import defaultdict
from PyPDF2 import PdfFileMerger
import os

root_path = "C:\\test\\raw"
result_path = "C:\\test\\result"

# group the filenames by prefix (here the third underscore-separated field)
files_by_prefix = defaultdict(list)
for filename in os.listdir(root_path):
    prefix = filename.split("_")[2]
    files_by_prefix[prefix].append(filename)

# merge each prefix group in sorted filename order
for prefix, filenames in files_by_prefix.items():
    result_name = os.path.join(result_path, prefix + "_merged.pdf")
    print(f"Merging {filenames} to {result_name} (prefix {prefix})")
    merger = PdfFileMerger()
    for filename in sorted(filenames):
        merger.append(os.path.join(root_path, filename))
    merger.write(result_name)
    merger.close()
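One version note: depending on your PyPDF2 release, the merger class may be exposed as PdfMerger rather than PdfFileMerger (newer versions renamed it). A hedged compatibility import:

try:
    # newer PyPDF2 releases expose the class as PdfMerger
    from PyPDF2 import PdfMerger as PdfFileMerger
except ImportError:
    # older releases only provide PdfFileMerger
    from PyPDF2 import PdfFileMerger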
How to concatenate files that have the same beginning of a name?
I will assume that the logic behind the naming is that the species name is made up of the first three words separated by underscores. I will also assume that there are no blank spaces in the filenames.
A possible strategy could be to get a list of all the species, and then concatenate all the files with that species prefix into a single one:
for specie in $(ls *.fasta | cut -f1-3 -d_ | sort -u)
do
    cat "$specie"*.fasta > "$specie.fasta"
done
In this code, you list all the fasta files, cut out the species ID, and generate a unique list of species. Then you traverse this list and, for every species, concatenate all the files that start with that species ID into a single file with the species name.
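To sanity-check the ID extraction before concatenating anything, you can run the pipeline on its own (hypothetical filenames):

# with files like Genus_species_strain_sample1.fasta, this prints one
# ID per line, e.g. Genus_species_strain
ls *.fasta | cut -f1-3 -d_ | sort -u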
More robust solutions can be written using find and avoiding ls, but they are more verbose and potentially less clear:
while IFS= read -r -d '' specie
do
    cat "$specie"*.fasta > "$specie.fasta"
done < <(find . -maxdepth 1 -name "*.fasta" -print0 | cut -z -f2 -d/ | cut -z -f1-3 -d_ | sort -zu)