Merge PDF Files

Use pyPdf or its successor PyPDF2:

A pure-Python library built as a PDF toolkit. It is capable of:

  • splitting documents page by page,
  • merging documents page by page,

(and much more)

Here's a sample program that works with both versions.

#!/usr/bin/env python
import sys
try:
    from PyPDF2 import PdfFileReader, PdfFileWriter
except ImportError:
    from pyPdf import PdfFileReader, PdfFileWriter

def pdf_cat(input_files, output_stream):
    input_streams = []
    try:
        # First open all the files, then produce the output file, and
        # finally close the input files. This is necessary because
        # the data isn't read from the input files until the write
        # operation. Thanks to
        # https://stackoverflow.com/questions/6773631/problem-with-closing-python-pypdf-writing-getting-a-valueerror-i-o-operation/6773733#6773733
        for input_file in input_files:
            input_streams.append(open(input_file, 'rb'))
        writer = PdfFileWriter()
        for reader in map(PdfFileReader, input_streams):
            for n in range(reader.getNumPages()):
                writer.addPage(reader.getPage(n))
        writer.write(output_stream)
    finally:
        for f in input_streams:
            f.close()
        output_stream.close()

if __name__ == '__main__':
    if sys.platform == "win32":
        import os, msvcrt
        msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    # On Python 3, binary data must go to sys.stdout.buffer;
    # Python 2's sys.stdout has no .buffer and accepts bytes directly.
    pdf_cat(sys.argv[1:], getattr(sys.stdout, 'buffer', sys.stdout))
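
PyPDF2 3.0 removed the PdfFileReader and PdfFileWriter names used above, and the project has since continued under the name pypdf. As a minimal sketch, assuming the pypdf package is installed, the same page-by-page merge might look like this (merge_pdfs is a hypothetical helper name, not part of the library):

```python
def merge_pdfs(input_paths, output_path):
    """Concatenate the pages of input_paths, in order, into output_path."""
    # Imported lazily so this module still loads where pypdf is missing.
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for path in input_paths:
        reader = PdfReader(path)
        # Copy every page of this input into the output document.
        for page in reader.pages:
            writer.add_page(page)
    with open(output_path, "wb") as output_stream:
        writer.write(output_stream)
```

A call such as merge_pdfs(["file1.pdf", "file2.pdf"], "output.pdf") would then write the combined document; the file names here are placeholders.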

Merge / convert multiple PDF files into one PDF

I'm sorry, I managed to find the answer myself using Google and a bit of luck :)

For those interested:

I installed pdftk (the PDF toolkit) on our Debian server, and the following command produced the desired output:

pdftk file1.pdf file2.pdf cat output output.pdf

OR

gs -q -sPAPERSIZE=letter -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=output.pdf file1.pdf file2.pdf file3.pdf ...

This in turn can be piped directly into pdf2ps.
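
If the merge has to be triggered from a script rather than typed at a prompt, the two commands above can be assembled as argument lists. This is only a sketch: the helper names are made up for illustration, the flags are exactly the ones quoted above, and pdftk or gs is assumed to be on the PATH when run() is called.

```python
import subprocess


def pdftk_cat_command(input_files, output_file):
    """Argument list for: pdftk in1.pdf in2.pdf ... cat output out.pdf"""
    return ["pdftk", *input_files, "cat", "output", output_file]


def gs_merge_command(input_files, output_file):
    """Argument list for the Ghostscript pdfwrite command quoted above."""
    return [
        "gs", "-q", "-sPAPERSIZE=letter", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite", "-sOutputFile=" + output_file, *input_files,
    ]


def run(command):
    """Run the command without a shell; raises CalledProcessError on failure."""
    subprocess.run(command, check=True)
```

Passing a list instead of a shell string means file names containing spaces need no extra quoting.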

Python - Merge PDF files with same prefix using PyPDF2

os.listdir() only lists filenames; it won't include the directory name.

To get the full path to actually add into the merger, you'll have to os.path.join() the root path back in.

Also note that os.listdir() returns the files in arbitrary order, which is not necessarily the order you want within each prefix, so it's better to refactor things so that you first group the files by prefix, then process each prefix group in sorted order:

import os
from collections import defaultdict

from PyPDF2 import PdfFileMerger

root_path = "C:\\test\\raw"
result_path = "C:\\test\\result"

# Group the file names by the prefix taken from the third "_"-separated part.
files_by_prefix = defaultdict(list)
for filename in os.listdir(root_path):
    prefix = filename.split("_")[2]
    files_by_prefix[prefix].append(filename)

# Merge each group into its own output file.
for prefix, filenames in files_by_prefix.items():
    result_name = os.path.join(result_path, prefix + "_merged.pdf")
    print(f"Merging {filenames} to {result_name} (prefix {prefix})")
    merger = PdfFileMerger()
    for filename in sorted(filenames):
        merger.append(os.path.join(root_path, filename))
    merger.write(result_name)
    merger.close()
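
To see what the grouping step produces without touching any PDF, here is a small standard-library sketch. The function name, its parameters, and the sample file names are hypothetical; unlike the snippet above, it skips names with too few "_"-separated parts instead of raising an IndexError.

```python
from collections import defaultdict


def group_by_prefix(filenames, index=2, sep="_"):
    """Map prefix -> sorted file names; names with too few parts are skipped."""
    groups = defaultdict(list)
    for name in filenames:
        parts = name.split(sep)
        if len(parts) <= index:
            continue  # assumption: silently ignore names that do not match
        groups[parts[index]].append(name)
    return {prefix: sorted(names) for prefix, names in groups.items()}


# Hypothetical file names, for illustration only.
names = ["x_y_A_2.pdf", "x_y_B_1.pdf", "x_y_A_1.pdf", "notes.txt"]
print(group_by_prefix(names))
```

Each value in the returned dictionary is already sorted, so it can be fed to the merger in order.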

Merge PDF pages to 1 file without generating single page files

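You need to use BytesIO to wrap the PDF bytes returned by pytesseract in a file-like object, so the merger can read them without each page being written to disk first: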

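import os
from io import BytesIO

import cv2
import pytesseract
from PyPDF2 import PdfFileMerger

# filesets, download_location, tessdata_dir_config and FILE are defined
# elsewhere in the question's code.
for fileset in filesets:
    merger = PdfFileMerger()
    page_path = fr".\output\pages"
    for file in fileset:
        # Load image, read with pytesseract
        path = os.path.join(download_location, file)
        img = cv2.imread(path, 1)
        result = pytesseract.image_to_pdf_or_hocr(img, lang="eng", config=tessdata_dir_config)
        # result is the PDF as bytes; wrap it so the merger can read it.
        merger.append(BytesIO(result))

    merger.write(fr".\output\{FILE}.pdf")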
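
The same in-memory idea, stripped of the OCR step, can be sketched as follows. This assumes the newer pypdf package rather than the PyPDF2 API used above, and merge_pdf_bytes is a hypothetical helper name; each element of pdf_chunks is expected to be one complete PDF held as bytes.

```python
from io import BytesIO


def merge_pdf_bytes(pdf_chunks):
    """Merge PDFs given as bytes objects and return the result as bytes."""
    # Imported lazily so this module still loads where pypdf is missing.
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for chunk in pdf_chunks:
        # BytesIO turns the raw bytes into the file-like object pypdf expects.
        reader = PdfReader(BytesIO(chunk))
        for page in reader.pages:
            writer.add_page(page)
    output = BytesIO()
    writer.write(output)
    return output.getvalue()
```

Nothing is written to disk until the caller decides what to do with the returned bytes.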

How can I merge pdf files together and take only the first page from each file?

As explained in this qpdf issue, the shell expands *.pdf in the command qpdf --empty --pages *.pdf 1 -- "output.pdf"; that is, it replaces *.pdf with the list of PDF files in the current directory. Assuming you have the following PDF files in the current directory:

  • file1.pdf
  • file2.pdf
  • file3.pdf

the command becomes:

qpdf --empty --pages file1.pdf file2.pdf file3.pdf 1 -- "output.pdf"

so the page selector is only applied to the last PDF. On a Mac or Linux you can script the command to add a 1 after each PDF file name, which takes the first page of each PDF file and puts them all together, like so (this assumes the file names contain no spaces):

qpdf --empty --pages $(for i in *.pdf; do echo $i 1; done) -- output.pdf
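
Where a shell loop is not available, or file names may contain spaces, the same argument list can be built in Python. This is a sketch under the assumption that qpdf is on the PATH; the function names are invented for illustration, and the command shape is the one shown above.

```python
import glob
import subprocess


def qpdf_first_pages_command(pdf_files, output_file, page="1"):
    """Build: qpdf --empty --pages f1.pdf 1 f2.pdf 1 ... -- output.pdf"""
    command = ["qpdf", "--empty", "--pages"]
    for name in pdf_files:
        command += [name, page]  # the page selector follows every file
    command += ["--", output_file]
    return command


def merge_first_pages(output_file="output.pdf"):
    """Run qpdf on the PDF files of the current directory, in sorted order."""
    pdf_files = sorted(f for f in glob.glob("*.pdf") if f != output_file)
    subprocess.run(qpdf_first_pages_command(pdf_files, output_file), check=True)
```

Because the arguments are passed as a list, no shell expansion or word splitting takes place.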

Merge PDF files with TOC element

There is no such element in PDF files, so we need to create this content ourselves.

Now one way would be to create text elements, outlines, and link annotations, position them appropriately, and set the link destinations to outlines.

However, this could be quite some work so perhaps it would be easier to just create the desired TOC element with GemBox.Document, save it as a PDF file, and then import it into the resulting PDF.

// Source data for creating TOC entries with specified text and associated PDF files.
var pdfEntries = new[]
{
    new { Title = "First Document Title", Pdf = PdfDocument.Load("input1.pdf") },
    new { Title = "Second Document Title", Pdf = PdfDocument.Load("input2.pdf") },
    new { Title = "Third Document Title", Pdf = PdfDocument.Load("input3.pdf") },
};

/***************************************************************/
/* Create new document with TOC element using GemBox.Document. */
/***************************************************************/

// Create new document.
var tocDocument = new DocumentModel();
var section = new Section(tocDocument);
tocDocument.Sections.Add(section);

// Create and add TOC element.
var toc = new TableOfEntries(tocDocument, FieldType.TOC);
section.Blocks.Add(toc);
section.Blocks.Add(new Paragraph(tocDocument, new SpecialCharacter(tocDocument, SpecialCharacterType.PageBreak)));

// Create heading style.
// By default, when updating TOC element a TOC entry is created for each paragraph that has heading style.
var heading1Style = (ParagraphStyle)tocDocument.Styles.GetOrAdd(StyleTemplateType.Heading1);

// Add heading and empty (placeholder) pages.
// The number of added placeholder pages depends on the number of pages in the actual PDF file, so that TOC entries have correct page numbers.
int totalPageCount = 0;
foreach (var pdfEntry in pdfEntries)
{
    section.Blocks.Add(new Paragraph(tocDocument, pdfEntry.Title) { ParagraphFormat = { Style = heading1Style } });
    section.Blocks.Add(new Paragraph(tocDocument, new SpecialCharacter(tocDocument, SpecialCharacterType.PageBreak)));

    int currentPageCount = pdfEntry.Pdf.Pages.Count;
    totalPageCount += currentPageCount;

    while (--currentPageCount > 0)
        section.Blocks.Add(new Paragraph(tocDocument, new SpecialCharacter(tocDocument, SpecialCharacterType.PageBreak)));
}
}

// Remove last extra-added empty page.
section.Blocks.RemoveAt(section.Blocks.Count - 1);

// Update TOC element and save the document as PDF stream.
toc.Update();
var pdfStream = new MemoryStream();
tocDocument.Save(pdfStream, new GemBox.Document.PdfSaveOptions());

/***************************************************************/
/* Merge PDF files into PDF with TOC element using GemBox.Pdf. */
/***************************************************************/

// Load a PDF stream using GemBox.Pdf.
var pdfDocument = PdfDocument.Load(pdfStream);
var rootDictionary = (PdfDictionary)((PdfIndirectObject)pdfDocument.GetDictionary()[PdfName.Create("Root")]).Value;
var pagesDictionary = (PdfDictionary)((PdfIndirectObject)rootDictionary[PdfName.Create("Pages")]).Value;
var kidsArray = (PdfArray)pagesDictionary[PdfName.Create("Kids")];
var pageIds = kidsArray.Cast<PdfIndirectObject>().Select(obj => obj.Id).ToArray();

// Remove empty (placeholder) pages.
while (totalPageCount-- > 0)
    pdfDocument.Pages.RemoveAt(pdfDocument.Pages.Count - 1);

// Add pages from PDF files.
foreach (var pdfEntry in pdfEntries)
    foreach (var page in pdfEntry.Pdf.Pages)
        pdfDocument.Pages.AddClone(page);

/*****************************************************************************/
/* Update TOC links from placeholder pages to actual pages using GemBox.Pdf. */
/*****************************************************************************/

// Create a mapping from the ID of an empty (placeholder) page indirect object to an actual page indirect object.
var pageCloneMap = new Dictionary<PdfIndirectObjectIdentifier, PdfIndirectObject>();
for (int i = 0; i < kidsArray.Count; ++i)
    pageCloneMap.Add(pageIds[i], (PdfIndirectObject)kidsArray[i]);

foreach (var entry in pageCloneMap)
{
    // If page was updated, it means that we passed TOC pages, so break from the loop.
    if (entry.Key != entry.Value.Id)
        break;

    // For each TOC page, get its 'Annots' entry.
    // For each link annotation from the 'Annots' get the 'Dest' entry.
    // Update the first item in the 'Dest' array so that it no longer points to a removed page.
    if (((PdfDictionary)entry.Value.Value).TryGetValue(PdfName.Create("Annots"), out PdfBasicObject annotsObj))
        foreach (PdfIndirectObject annotObj in (PdfArray)annotsObj)
            if (((PdfDictionary)annotObj.Value).TryGetValue(PdfName.Create("Dest"), out PdfBasicObject destObj))
            {
                var destArray = (PdfArray)destObj;
                destArray[0] = pageCloneMap[((PdfIndirectObject)destArray[0]).Id];
            }
}

// Save resulting PDF file.
pdfDocument.Save("Result.pdf");
pdfDocument.Close();

This way you can easily customize the TOC element by using the TOC switches and styles. For more info, see the Table Of Content example from GemBox.Document.
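
The placeholder pages exist only so that each heading lands on the page where its document will start in the merged file. The arithmetic behind that can be sketched in Python; toc_start_pages and its parameters are illustrative names, and the sketch assumes the TOC itself occupies toc_page_count pages at the front.

```python
def toc_start_pages(page_counts, toc_page_count=1):
    """Return the 1-based page on which each merged document starts."""
    starts = []
    next_page = toc_page_count + 1  # first page after the TOC
    for count in page_counts:
        starts.append(next_page)
        next_page += count  # the next document begins after this one ends
    return starts
```

For example, with a one-page TOC and documents of 3, 1 and 5 pages, the entries would point at pages 2, 5 and 6.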

Merge PDF files with reversible process (extract original files)

Once you have merged the PDF files, you cannot split the result and obtain the exact same original files at the binary level. The source PDF files are not included as opaque binary blocks in the merged file.

One possible solution, as @mkl said, is to use a PDF portfolio to embed the source files as they are. When viewing the portfolio you will see each file as it is, not as one long merged PDF file.
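
As a rough sketch of the embedding idea, the following stores the source files as plain attachments inside a container PDF and reads them back unchanged. It assumes a reasonably recent pypdf (one that provides PdfReader.attachments); the helper names are hypothetical, and a complete portfolio also needs collection metadata, which is not shown here.

```python
from io import BytesIO


def embed_files(files):
    """Return PDF bytes with each (name, data) pair embedded unchanged."""
    from pypdf import PdfWriter  # lazy import: pypdf is an assumption

    writer = PdfWriter()
    writer.add_blank_page(width=612, height=792)  # placeholder cover page
    for name, data in files:
        writer.add_attachment(filename=name, data=data)
    output = BytesIO()
    writer.write(output)
    return output.getvalue()


def extract_files(container_bytes):
    """Return {name: data} for every file embedded in the container PDF."""
    from pypdf import PdfReader

    reader = PdfReader(BytesIO(container_bytes))
    # reader.attachments maps each name to a list of byte strings.
    return {name: reader.attachments[name][0] for name in reader.attachments}
```

Because the embedded data is stored as is, comparing the extracted bytes with the originals is a simple way to confirm the round trip.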


