Merge PDFs with PDFtk with Bookmarks

Merging PDF files with bookmarks

You can do this with iTextSharp. I use the following method:

// Requires: System.Collections.Generic, System.IO, iTextSharp.text, iTextSharp.text.pdf
public static void MergePdfFiles(string outputPdf, string[] sourcePdfs) {
    Document document = null;
    PdfCopy pdfCpy = null;
    int pageOffset = 0;
    List<Dictionary<string, object>> bookmarks = new List<Dictionary<string, object>>();
    foreach (string sourcePdf in sourcePdfs) {
        PdfReader reader = new PdfReader(sourcePdf);
        reader.ConsolidateNamedDestinations();
        int n = reader.NumberOfPages;
        if (document == null) {
            // Create the output on the first file, using the size of its first page
            document = new Document(reader.GetPageSizeWithRotation(1));
            pdfCpy = new PdfCopy(document, new FileStream(outputPdf, FileMode.Create));
            document.Open();
        }
        // Shift this file's bookmarks by the number of pages already copied
        IList<Dictionary<string, object>> tempBookmarks = SimpleBookmark.GetBookmark(reader);
        if (tempBookmarks != null) {
            SimpleBookmark.ShiftPageNumbers(tempBookmarks, pageOffset, null);
            bookmarks.AddRange(tempBookmarks);
        }
        pageOffset += n;
        // Copy every page of the current file into the output
        for (int j = 1; j <= n; j++) {
            PdfImportedPage page = pdfCpy.GetImportedPage(reader, j);
            pdfCpy.AddPage(page);
        }
        reader.Close();
    }
    if (pdfCpy != null) {
        pdfCpy.Outlines = bookmarks;
        document.Close();
    }
}
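
The key step in the method above is shifting each file's bookmarks by the number of pages already copied. Here is a minimal Python sketch of just that arithmetic. It assumes a simplified bookmark model (dicts with hypothetical "Title", "Page" and "Kids" keys), not iTextSharp's actual bookmark format, and the function names are mine:

```python
def shift_bookmarks(bookmarks, page_offset):
    """Return a copy of `bookmarks` with every page number raised by `page_offset`.

    Simplified, hypothetical model: each bookmark is a dict with a "Title",
    a 1-based integer "Page", and optionally a list of children under "Kids".
    """
    shifted = []
    for entry in bookmarks:
        new_entry = dict(entry)
        new_entry["Page"] = entry["Page"] + page_offset
        if "Kids" in entry:
            # Nested bookmarks point into the same file, so they get the same shift
            new_entry["Kids"] = shift_bookmarks(entry["Kids"], page_offset)
        shifted.append(new_entry)
    return shifted


def merge_outlines(documents):
    """Merge outlines from `documents`, a list of (page_count, bookmarks) pairs."""
    merged = []
    page_offset = 0
    for page_count, bookmarks in documents:
        merged.extend(shift_bookmarks(bookmarks, page_offset))
        # The next file starts after all pages copied so far
        page_offset += page_count
    return merged
```

With a 3-page file followed by a 2-page file, a bookmark on page 1 of the second file ends up on page 4 of the merged output, while the first file's bookmarks stay where they are.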

Merging .pdf files with Pdftk

EDIT: New approach.

  • Drag and drop one or more files or a folder onto the batch file, or pass at least one file or folder as an argument.
  • The batch changes to the referenced folder.
  • It combines all PDF files in that folder into binder.pdf.
  • Any existing binder.pdf is renamed to binder.bak.pdf first.

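:: Q:\Test\2018\06\06\SO_50728273.cmd
@echo off
setlocal enabledelayedexpansion
if "%~1" neq "" (
    rem Directory argument: change into it. File argument: change into its folder.
    Echo %~a1|findstr "d" >Nul 2>&1 && Pushd "%~f1" || Pushd "%~dp1"
) else (
    Echo No arguments, need a path& pause & goto :Eof
)
rem Delete any old backup, then keep the previous binder as the new backup.
Del /f binder.bak.pdf >Nul 2>&1
if exist binder.pdf Ren binder.pdf binder.bak.pdf
rem Note: the *.pdf wildcard also matches binder.bak.pdf.
pdftk.exe *.pdf cat output binder.pdf
PopD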
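
For readers who don't use batch files, here is a rough Python sketch of the same folder handling. It only covers the file-system steps (picking the working folder, rotating the backup, listing the inputs) and doesn't call pdftk. The function names are mine. Unlike the batch, it leaves binder.bak.pdf out of the input list, since a `*.pdf` wildcard would also match it:

```python
import os


def resolve_work_dir(first_arg):
    """Return the folder to work in, given the first argument."""
    full = os.path.abspath(first_arg)
    # A directory argument is used directly; a file argument selects its parent
    return full if os.path.isdir(full) else os.path.dirname(full)


def rotate_binder(work_dir):
    """Delete an old binder.bak.pdf, then rename binder.pdf to binder.bak.pdf."""
    binder = os.path.join(work_dir, "binder.pdf")
    backup = os.path.join(work_dir, "binder.bak.pdf")
    if os.path.exists(backup):
        os.remove(backup)
    if os.path.exists(binder):
        os.rename(binder, backup)


def list_inputs(work_dir):
    """List the PDFs to merge, sorted by name, skipping the binder files."""
    skip = {"binder.pdf", "binder.bak.pdf"}
    return sorted(
        name for name in os.listdir(work_dir)
        if name.lower().endswith(".pdf") and name.lower() not in skip
    )
```

The list returned by `list_inputs` could then be passed to pdftk in place of the wildcard.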

Without knowing which arguments you pass to the batch, diagnosing the problem is impossible. %* is replaced with all the arguments you pass, and the location of the output is determined by the path of the first argument, %~dp1.

I ran your batch on my RAM disk, A:

Dir before:

> dir A:\
Directory of A:\

2018-06-06 21:57 65.381 SO_5072812.pdf
2018-06-06 21:56 163 SO_50728273.cmd
2018-06-06 21:55 60.649 SO_50728273.pdf
3 File(s), 126.193 bytes
0 Dir(s), 1.049.452.544 bytes free

And after (I named the batch SO_50728273.cmd):

> SO_50728273.cmd a:\*.pdf

> dir
Directory of A:\

2018-06-06 21:58 125.756 binder.pdf
2018-06-06 21:57 65.381 SO_5072812.pdf
2018-06-06 21:56 163 SO_50728273.cmd
2018-06-06 21:55 60.649 SO_50728273.pdf
4 File(s), 251.949 bytes
0 Dir(s), 1.049.260.032 bytes free

Merging PDFs while retaining custom page numbers (aka pagelabels) and bookmarks

You need to iterate through the existing PageLabels and add them to the merged output, taking care to add an offset to the page index entry, based on the number of pages already added.

This solution also requires PyPDF4, since PyPDF2 fails with an error (see the traceback at the end of this answer).

from PyPDF4 import PdfFileMerger, PdfFileReader

# To manipulate the PDF dictionary
import PyPDF4.pdf as PDF

import logging


def add_nums(num_entry, page_offset, nums_array):
    # /Nums alternates between a page index and a label dictionary
    for num in num_entry['/Nums']:
        if isinstance(num, int):
            logging.debug("Found page number %s, offset %s", num, page_offset)

            # Add the physical page information
            nums_array.append(PDF.NumberObject(num + page_offset))
        else:
            # {'/S': '/r'}, or {'/S': '/D', '/St': 489}
            keys = num.keys()
            logging.debug("Found page label, keys: %s", keys)
            number_type = PDF.DictionaryObject()
            # Always copy the /S entry
            s_entry = num['/S']
            number_type.update({PDF.NameObject("/S"): PDF.NameObject(s_entry)})
            logging.debug("Adding /S entry: %s", s_entry)

            if '/St' in keys:
                # If there is an /St entry, fetch it
                pdf_label_offset = num['/St']
                # and copy it unchanged: it is the label's start value,
                # not a page index, so no offset is added
                logging.debug("Found /St %s", pdf_label_offset)
                number_type.update({PDF.NameObject("/St"): PDF.NumberObject(pdf_label_offset)})

            # Add the label information
            nums_array.append(number_type)

    return nums_array


def write_merged(pdf_readers):
    # Output
    merger = PdfFileMerger()

    # For PageLabels information
    page_offset = 0
    nums_array = PDF.ArrayObject()

    # Iterate through all the inputs
    for pdf_reader in pdf_readers:
        try:
            # Merge the content
            merger.append(pdf_reader)
            page_count = pdf_reader.getNumPages()
        except Exception as err:
            print("ERROR: %s" % err)
            continue

        # Handle the PageLabels (a file may have none)
        root = pdf_reader.trailer['/Root']
        if '/PageLabels' in root:
            add_nums(root['/PageLabels'], page_offset, nums_array)

        # Advance the offset even when the file had no page labels
        page_offset = page_offset + page_count

    # Add PageLabels
    page_numbers = PDF.DictionaryObject()
    page_numbers.update({PDF.NameObject("/Nums"): nums_array})

    page_labels = PDF.DictionaryObject()
    page_labels.update({PDF.NameObject("/PageLabels"): page_numbers})

    root_obj = merger.output._root_object
    root_obj.update(page_labels)

    # Write output
    merger.write('merged.pdf')


pdf_readers = []
tmp1 = PdfFileReader(open('file1.pdf', 'rb'))
tmp2 = PdfFileReader(open('file2.pdf', 'rb'))
pdf_readers.append(tmp1)
pdf_readers.append(tmp2)

write_merged(pdf_readers)
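
To make the offset arithmetic concrete, here is a small sketch that uses plain Python lists and dicts in place of PyPDF4 objects. The function names are mine, and the /Nums layout is simplified to a flat list alternating between a 0-based page index and a label dictionary:

```python
def offset_nums(nums, page_offset):
    """Return a copy of a /Nums-style list with its page indices shifted.

    `nums` alternates between a 0-based page index (int) and a label
    dictionary, e.g. [0, {'/S': '/r'}].
    """
    shifted = []
    for item in nums:
        if isinstance(item, int):
            # Page indices move by the number of pages already merged
            shifted.append(item + page_offset)
        else:
            # Label dictionaries, including any /St start value, are copied as-is
            shifted.append(dict(item))
    return shifted


def merge_nums(documents):
    """Merge the /Nums lists of `documents`, a list of (page_count, nums) pairs."""
    merged = []
    page_offset = 0
    for page_count, nums in documents:
        merged.extend(offset_nums(nums, page_offset))
        page_offset += page_count
    return merged


front_matter = (4, [0, {'/S': '/r'}])          # 4 pages labeled i, ii, iii, iv
body = (10, [0, {'/S': '/D', '/St': 1}])       # 10 pages labeled 1, 2, 3, ...
print(merge_nums([front_matter, body]))
# [0, {'/S': '/r'}, 4, {'/S': '/D', '/St': 1}]
```

The second file's label range now starts at page index 4, right after the four front-matter pages, while its /St value of 1 is unchanged, so its first page is still labeled "1".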

Note: PyPDF2 fails with this error:

...
  File "/usr/lib/python3/dist-packages/PyPDF2/pdf.py", line 552, in _sweepIndirectReferences
    data[key] = value
  File "/usr/lib/python3/dist-packages/PyPDF2/generic.py", line 507, in __setitem__
    raise ValueError("key must be PdfObject")
ValueError: key must be PdfObject

Merge / convert multiple PDF files into one PDF

I'm sorry, I managed to find the answer myself using Google and a bit of luck : )

For those interested:

I installed pdftk (the PDF Toolkit) on our Debian server, and the following command produced the desired output:

pdftk file1.pdf file2.pdf cat output output.pdf

OR

gs -q -sPAPERSIZE=letter -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=output.pdf file1.pdf file2.pdf file3.pdf ...

This in turn can be piped directly into pdf2ps.
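
If you want to run either command from a script, here is a minimal Python sketch that builds the same argument lists. The function names are mine, and the flags are exactly the ones shown above. It assumes pdftk or gs is on the PATH:

```python
import subprocess


def pdftk_cat_command(inputs, output):
    """Build the argument list for: pdftk <inputs...> cat output <output>."""
    return ["pdftk", *inputs, "cat", "output", output]


def gs_merge_command(inputs, output, paper_size="letter"):
    """Build the argument list for the Ghostscript pdfwrite command."""
    return [
        "gs", "-q", f"-sPAPERSIZE={paper_size}", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite", f"-sOutputFile={output}", *inputs,
    ]


def merge(inputs, output, use_gs=False):
    """Run the chosen tool; raises CalledProcessError on a non-zero exit code."""
    build = gs_merge_command if use_gs else pdftk_cat_command
    # Passing a list avoids shell quoting problems with spaces in file names
    subprocess.run(build(inputs, output), check=True)
```

For example, `merge(["file1.pdf", "file2.pdf"], "output.pdf")` runs the pdftk variant.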


