Need to Merge Multiple PDF's into a Single PDF with Table of Contents Sections

Need to merge multiple pdf's into a single PDF with Table Of Contents sections

And what's the best tool for us to merge the pdf's?

On Linux (as well as on Windows), you can install an useful little program, pdftk. It works well to bind PDF's together. For example:

$ pdftk in1.pdf in2.pdf in3.pdf in4.pdf in5.pdf in6.pdf cat output out.pdf

where in*.pdf are the input files and out.pdf is the result. In between, @jerik already gave an answer how to deal with the TOC.

Merge / convert multiple PDF files into one PDF

I'm sorry, I managed to find the answer myself using google and a bit of luck : )

For those interested;

I installed the pdftk (pdf toolkit) on our debian server, and using the following command I achieved desired output:

pdftk file1.pdf file2.pdf cat output output.pdf

OR

gs -q -sPAPERSIZE=letter -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=output.pdf file1.pdf file2.pdf file3.pdf ...

This in turn can be piped directly into pdf2ps.

Merge PDF files

Use Pypdf or its successor PyPDF2:

A Pure-Python library built as a PDF toolkit. It is capable of:

  • splitting documents page by page,
  • merging documents page by page,

(and much more)

Here's a sample program that works with both versions.

#!/usr/bin/env python
import sys
try:
from PyPDF2 import PdfFileReader, PdfFileWriter
except ImportError:
from pyPdf import PdfFileReader, PdfFileWriter

def pdf_cat(input_files, output_stream):
input_streams = []
try:
# First open all the files, then produce the output file, and
# finally close the input files. This is necessary because
# the data isn't read from the input files until the write
# operation. Thanks to
# https://stackoverflow.com/questions/6773631/problem-with-closing-python-pypdf-writing-getting-a-valueerror-i-o-operation/6773733#6773733
for input_file in input_files:
input_streams.append(open(input_file, 'rb'))
writer = PdfFileWriter()
for reader in map(PdfFileReader, input_streams):
for n in range(reader.getNumPages()):
writer.addPage(reader.getPage(n))
writer.write(output_stream)
finally:
for f in input_streams:
f.close()
output_stream.close()

if __name__ == '__main__':
if sys.platform == "win32":
import os, msvcrt
msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
pdf_cat(sys.argv[1:], sys.stdout)

Merge PDF files with TOC element

There is no such element in PDF files, so we need to create this content ourselves.

Now one way would be to create text elements, outlines, and link annotations, position them appropriately, and set the link destinations to outlines.

However, this could be quite some work so perhaps it would be easier to just create the desired TOC element with GemBox.Document, save it as a PDF file, and then import it into the resulting PDF.

// Source data for creating TOC entries with specified text and associated PDF files.
var pdfEntries = new[]
{
new { Title = "First Document Title", Pdf = PdfDocument.Load("input1.pdf") },
new { Title = "Second Document Title", Pdf = PdfDocument.Load("input2.pdf") },
new { Title = "Third Document Title", Pdf = PdfDocument.Load("input3.pdf") },
};

/***************************************************************/
/* Create new document with TOC element using GemBox.Document. */
/***************************************************************/

// Create new document.
var tocDocument = new DocumentModel();
var section = new Section(tocDocument);
tocDocument.Sections.Add(section);

// Create and add TOC element.
var toc = new TableOfEntries(tocDocument, FieldType.TOC);
section.Blocks.Add(toc);
section.Blocks.Add(new Paragraph(tocDocument, new SpecialCharacter(tocDocument, SpecialCharacterType.PageBreak)));

// Create heading style.
// By default, when updating TOC element a TOC entry is created for each paragraph that has heading style.
var heading1Style = (ParagraphStyle)tocDocument.Styles.GetOrAdd(StyleTemplateType.Heading1);

// Add heading and empty (placeholder) pages.
// The number of added placeholder pages depend on the number of pages that actual PDF file has so that TOC entries have correct page numbers.
int totalPageCount = 0;
foreach (var pdfEntry in pdfEntries)
{
section.Blocks.Add(new Paragraph(tocDocument, pdfEntry.Title) { ParagraphFormat = { Style = heading1Style } });
section.Blocks.Add(new Paragraph(tocDocument, new SpecialCharacter(tocDocument, SpecialCharacterType.PageBreak)));

int currentPageCount = pdfEntry.Pdf.Pages.Count;
totalPageCount += currentPageCount;

while (--currentPageCount > 0)
section.Blocks.Add(new Paragraph(tocDocument, new SpecialCharacter(tocDocument, SpecialCharacterType.PageBreak)));
}

// Remove last extra-added empty page.
section.Blocks.RemoveAt(section.Blocks.Count - 1);

// Update TOC element and save the document as PDF stream.
toc.Update();
var pdfStream = new MemoryStream();
tocDocument.Save(pdfStream, new GemBox.Document.PdfSaveOptions());

/***************************************************************/
/* Merge PDF files into PDF with TOC element using GemBox.Pdf. */
/***************************************************************/

// Load a PDF stream using GemBox.Pdf.
var pdfDocument = PdfDocument.Load(pdfStream);
var rootDictionary = (PdfDictionary)((PdfIndirectObject)pdfDocument.GetDictionary()[PdfName.Create("Root")]).Value;
var pagesDictionary = (PdfDictionary)((PdfIndirectObject)rootDictionary[PdfName.Create("Pages")]).Value;
var kidsArray = (PdfArray)pagesDictionary[PdfName.Create("Kids")];
var pageIds = kidsArray.Cast<PdfIndirectObject>().Select(obj => obj.Id).ToArray();

// Remove empty (placeholder) pages.
while (totalPageCount-- > 0)
pdfDocument.Pages.RemoveAt(pdfDocument.Pages.Count - 1);

// Add pages from PDF files.
foreach (var pdfEntry in pdfEntries)
foreach (var page in pdfEntry.Pdf.Pages)
pdfDocument.Pages.AddClone(page);

/*****************************************************************************/
/* Update TOC links from placeholder pages to actual pages using GemBox.Pdf. */
/*****************************************************************************/

// Create a mapping from an ID of a empty (placeholder) page indirect object to an actual page indirect object.
var pageCloneMap = new Dictionary<PdfIndirectObjectIdentifier, PdfIndirectObject>();
for (int i = 0; i < kidsArray.Count; ++i)
pageCloneMap.Add(pageIds[i], (PdfIndirectObject)kidsArray[i]);

foreach (var entry in pageCloneMap)
{
// If page was updated, it means that we passed TOC pages, so break from the loop.
if (entry.Key != entry.Value.Id)
break;

// For each TOC page, get its 'Annots' entry.
// For each link annotation from the 'Annots' get the 'Dest' entry.
// Update the first item in the 'Dest' array so that it no longer points to a removed page.
if (((PdfDictionary)entry.Value.Value).TryGetValue(PdfName.Create("Annots"), out PdfBasicObject annotsObj))
foreach (PdfIndirectObject annotObj in (PdfArray)annotsObj)
if (((PdfDictionary)annotObj.Value).TryGetValue(PdfName.Create("Dest"), out PdfBasicObject destObj))
{
var destArray = (PdfArray)destObj;
destArray[0] = pageCloneMap[((PdfIndirectObject)destArray[0]).Id];
}
}

// Save resulting PDF file.
pdfDocument.Save("Result.pdf");
pdfDocument.Close();

This way you can easily customize the TOC element by using the TOC switches and styles. For more info, see the Table Of Content example from GemBox.Document.

Ghostscript to merge PDFs compresses the result

Here's some additional options that you can pass when using pdfwrite as your device. According to that page if you don't pass anything then -dPDFSETTINGS it gets set to something close to /screen, although it doesn't get more specific. You could try setting it to -dPDFSETTINGS=/prepress which should only compress things above 300 dpi.

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -sOutputFile=out.pdf in1.pdf in2.pdf

Another alternative is pdftk:

pdftk in1.pdf in2.pdf cat output out.pdf


Related Topics



Leave a reply



Submit