Merge PDF Files with Numerical Sort

Merge pdf files with numerical sort

You can embed the output of a command using $(),
so you can do the following:

$ pdfunite $(ls -v *.pdf) output.pdf

or

$ pdfunite $(ls *.pdf | sort -n) output.pdf

However, note that this does not work when a filename contains special characters such as whitespace, because the shell splits the command output into words.

In that case you can do the following:

$ ls -v *.pdf | bash -c 'IFS=$'"'"'\n'"'"' read -d "" -ra x;pdfunite "${x[@]}" output.pdf'

Although it looks a little complicated, it's just a combination of

  • Bash: Read tab-separated file line into array
  • build argument lists containing whitespace
  • How to escape single-quotes within single-quoted strings?

Note that you cannot use xargs directly, since it appends its arguments at the end of the command, while pdfunite expects the input PDFs before the output file name.
I avoided readarray since it is not supported in older bash versions, but you can use it instead of IFS=.. read -ra .. if you have a newer bash.
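
If Python is available, the word-splitting problem can also be sidestepped by passing the file names to pdfunite as a list. The following is only a minimal sketch and not part of the original answer: natural_key, build_pdfunite_command and merge_pdfs are hypothetical helper names, and the sort merely approximates ls -v by comparing runs of digits numerically.

```python
import re
import subprocess


def natural_key(name):
    """Sort key that compares runs of ASCII digits as integers.

    re.split with a capturing group always alternates text, digits, text, ...
    so odd positions hold the digit runs.
    """
    parts = re.split(r"([0-9]+)", name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def build_pdfunite_command(pdf_names, output_name):
    """Return the argv list: pdfunite, inputs in natural order, then the output."""
    ordered = sorted(pdf_names, key=natural_key)
    return ["pdfunite", *ordered, output_name]


def merge_pdfs(pdf_names, output_name):
    """Run pdfunite (assumed to be on PATH); no shell is involved,
    so names containing whitespace are passed through unchanged."""
    subprocess.run(build_pdfunite_command(pdf_names, output_name), check=True)
```

For example, merge_pdfs(glob.glob("*.pdf"), "output.pdf") would merge every PDF in the current directory, provided output.pdf is not itself among the matches.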

merge multiple pdfs in order

This happens because of how the files are named. Your code
new FileOutputStream(outputfolder + "\\" + "tempcontrat" + debut + "-" + i + "_.pdf")
will produce:

  • tempcontrat0-0_.pdf
  • tempcontrat0-1_.pdf
  • ...
  • tempcontrat0-10_.pdf
  • tempcontrat0-11_.pdf
  • ...
  • tempcontrat0-1000_.pdf

Here tempcontrat0-1000_.pdf will be placed before tempcontrat0-11_.pdf, because you are sorting the names alphabetically before the merge.

It would be better to left-pad the file number with 0 characters, using the leftPad() method of org.apache.commons.lang.StringUtils or java.text.DecimalFormat, so that the names look like tempcontrat0-000000.pdf, tempcontrat0-000001.pdf, ... tempcontrat0-999999.pdf.
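
The effect of the padding can be checked quickly outside the Java code. The snippet below is a small illustration in Python, not the asker's code: padded_name is a hypothetical helper, the width of 6 digits is an assumption, and the name pattern is copied from the FileOutputStream call above.

```python
def padded_name(debut, i, width=6):
    """Build a file name whose counter is left-padded with zeros."""
    return f"tempcontrat{debut}-{i:0{width}d}_.pdf"


counters = (0, 1, 10, 11, 1000)
unpadded = [f"tempcontrat0-{i}_.pdf" for i in counters]
padded = [padded_name(0, i) for i in counters]

# Alphabetical sorting puts 1000 before 11 when the counter is not padded.
print(sorted(unpadded))
# With padding, alphabetical order and numeric order agree.
print(sorted(padded))
```

The padding width has to be large enough for the highest counter that will ever be written, otherwise the ordering breaks again at the first longer number.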


You can also make this much simpler: skip writing each document to a file and reading it back, and merge the documents right after the form fill. This will be faster, but whether it is feasible depends on how many documents you are merging, how big they are, and how much memory you have.

To do that, save the filled document into a ByteArrayOutputStream and, after stamper.close(), create a new PdfReader for the bytes from that stream and call pdfSmartCopy.getImportedPage() for that reader. In short, it can look like this:

// initialize

PdfSmartCopy pdfSmartCopy = new PdfSmartCopy(document, memoryStream);
for (int i = debut; i < fin; i++) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();

    // fill in the form here (the stamper writes into out)

    stamper.close();
    PdfReader reader = new PdfReader(out.toByteArray());
    reader.consolidateNamedDestinations();
    PdfImportedPage pdfImportedPage = pdfSmartCopy.getImportedPage(reader, 1);
    pdfSmartCopy.addPage(pdfImportedPage);

    // other actions ...
}

Merge / convert multiple PDF files into one PDF

I'm sorry, I managed to find the answer myself using Google and a bit of luck : )

For those interested:

I installed pdftk (PDF toolkit) on our Debian server, and the following command gave me the desired output:

pdftk file1.pdf file2.pdf cat output output.pdf

OR

gs -q -sPAPERSIZE=letter -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=output.pdf file1.pdf file2.pdf file3.pdf ...

This in turn can be piped directly into pdf2ps.

Merge PDF files

Use pyPdf or its successor PyPDF2:

A Pure-Python library built as a PDF toolkit. It is capable of:

  • splitting documents page by page,
  • merging documents page by page,

(and much more)

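Here's a sample program that works with both versions. (It relies on the PdfFileReader and PdfFileWriter names, which were removed in PyPDF2 3.0, so it needs an older release.)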

#!/usr/bin/env python
import sys
try:
    from PyPDF2 import PdfFileReader, PdfFileWriter
except ImportError:
    from pyPdf import PdfFileReader, PdfFileWriter

def pdf_cat(input_files, output_stream):
    input_streams = []
    try:
        # First open all the files, then produce the output file, and
        # finally close the input files. This is necessary because
        # the data isn't read from the input files until the write
        # operation. Thanks to
        # https://stackoverflow.com/questions/6773631/problem-with-closing-python-pypdf-writing-getting-a-valueerror-i-o-operation/6773733#6773733
        for input_file in input_files:
            input_streams.append(open(input_file, 'rb'))
        writer = PdfFileWriter()
        for reader in map(PdfFileReader, input_streams):
            for n in range(reader.getNumPages()):
                writer.addPage(reader.getPage(n))
        writer.write(output_stream)
    finally:
        for f in input_streams:
            f.close()
        output_stream.close()

if __name__ == '__main__':
    if sys.platform == "win32":
        import os, msvcrt
        msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    # On Python 3 the binary PDF data has to go to sys.stdout.buffer.
    pdf_cat(sys.argv[1:], getattr(sys.stdout, 'buffer', sys.stdout))

Python - Merge PDF files with same prefix using PyPDF2

os.listdir() only lists filenames; it won't include the directory name.

To get the full path to actually add into the merger, you'll have to os.path.join() the root path back in.

However, you'll also need to note that the files you get from os.listdir() may not necessarily be in the order you want for your prefixes, so it'd be better to refactor things so you first group things by prefix, then process each prefix group:

from collections import defaultdict
import os

from PyPDF2 import PdfFileMerger

root_path = "C:\\test\\raw"
result_path = "C:\\test\\result"

# Group the file names by prefix first
files_by_prefix = defaultdict(list)
for filename in os.listdir(root_path):
    prefix = filename.split("_")[2]
    files_by_prefix[prefix].append(filename)

# Then merge each group into its own output file
for prefix, filenames in files_by_prefix.items():
    result_name = os.path.join(result_path, prefix + "_merged.pdf")
    print(f"Merging {filenames} to {result_name} (prefix {prefix})")
    merger = PdfFileMerger()
    for filename in sorted(filenames):
        merger.append(os.path.join(root_path, filename))
    merger.write(result_name)
    merger.close()
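
One caveat with filename.split("_")[2] is that it raises IndexError for any directory entry with fewer than three underscore-separated parts, and os.listdir() also returns non-PDF files. As a hedged sketch of how the grouping step could be checked on its own before any merging is done, the hypothetical helper below (group_by_prefix, with the field index and separator as assumed parameters) skips such entries instead of failing:

```python
from collections import defaultdict


def group_by_prefix(filenames, field=2, sep="_"):
    """Group PDF file names by one separator-delimited field.

    Names that are not PDFs, or that have too few fields, are skipped
    instead of raising IndexError.
    """
    groups = defaultdict(list)
    for name in filenames:
        parts = name.split(sep)
        if not name.lower().endswith(".pdf") or len(parts) <= field:
            continue
        groups[parts[field]].append(name)
    # Sort each group so the merge order is deterministic.
    return {prefix: sorted(names) for prefix, names in groups.items()}
```

Calling group_by_prefix(os.listdir(root_path)) and printing the result gives a dry run of which files would end up in which merged document.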

