How to Merge Documents Correctly

PDF fill in not merging correctly

PdfWriter.GetImportedPage only returns a copy of the page contents. That copy does not include any annotations, in particular not the widget annotations of the form fields on that page.

To also copy the annotations of the source pages, use the iText PdfCopy class instead. This class is designed to copy pages including all annotations. Furthermore, it includes methods to copy all pages of a source document in one step.

You have to tell the PdfCopy object to merge fields; otherwise, the document-wide form structure won't be built.

As an aside, your code creates many PdfReader objects but does not close them. That may increase your memory requirements substantially.

Thus:

Imports System.Collections.Generic
Imports System.IO
Imports System.Web
Imports iTextSharp.text
Imports iTextSharp.text.pdf

Public Shared Sub MergePDFsImproved(ByVal files As List(Of String), ByVal filename As String)
    Using mem As New MemoryStream()
        Dim readers As New List(Of PdfReader)
        Using doc As New Document
            Dim copy As New PdfCopy(doc, mem)
            ' Build the document-wide form structure while copying
            copy.SetMergeFields()
            doc.Open()
            For Each strfile As String In files
                Dim reader As New PdfReader(strfile)
                ' Copy all pages of the source document in one step
                copy.AddDocument(reader)
                readers.Add(reader)
            Next
        End Using
        ' Close the readers only after the document has been closed
        For Each reader As PdfReader In readers
            reader.Close()
        Next

        ' Send the merged PDF to the browser
        HttpContext.Current.Response.Clear()
        HttpContext.Current.Response.ContentType = "application/pdf"
        HttpContext.Current.Response.AppendHeader("Content-Disposition", "inline; filename=" + filename)
        HttpContext.Current.Response.BinaryWrite(mem.ToArray)
        HttpContext.Current.Response.OutputStream.Flush()
        HttpContext.Current.Response.OutputStream.Close()
        HttpContext.Current.Response.OutputStream.Dispose()
    End Using
End Sub

Actually, I'm not sure it is a good idea to Close and Dispose the response output stream here; that shouldn't be the responsibility of a PDF merging method.
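The separation hinted at above can be sketched in isolation. The following is only an illustration, written in Java rather than VB and using nothing but the JDK: the class name PdfDeliverySketch, the method writeTo, and the placeholder bytes are all made up for this example, and a plain OutputStream stands in for the response stream. The point it shows is that the helper flushes the stream it was handed but leaves closing it to the caller that owns it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: merging produces bytes, delivery is a separate step.
final class PdfDeliverySketch {

    // Writes the finished PDF bytes to a stream owned by the caller.
    // The stream is flushed but deliberately not closed here.
    static void writeTo(byte[] pdfBytes, OutputStream out) throws IOException {
        out.write(pdfBytes);
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes standing in for a merged document (not a real PDF).
        byte[] merged = {0x25, 0x50, 0x44, 0x46};
        // Stand-in for the response stream; the caller decides when to close it.
        ByteArrayOutputStream response = new ByteArrayOutputStream();
        writeTo(merged, response);
        System.out.println(response.size() + " bytes written");
    }
}
```

With this split, the merging method can simply return the byte array, and whatever code owns the HTTP response decides when the stream is closed.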

This is a related answer for the Java version of iText; you may want to read it for additional information. Unfortunately, many links in that answer are dead by now.

IText merge documents with acrofields

Depending on what you want exactly, different scenarios are possible, but in any case: you are doing it wrong. You should use either PdfCopy or PdfSmartCopy to merge documents.

The different scenarios are explained in the following video tutorial.

You can find most of the examples in the iText sandbox.

Merging different forms (having different fields)

If you want to merge different forms without flattening them, you should use PdfCopy as is done in the MergeForms example:

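import java.io.FileOutputStream;
import java.io.IOException;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;

public void createPdf(String filename, PdfReader[] readers) throws IOException, DocumentException {
    Document document = new Document();
    PdfCopy copy = new PdfCopy(document, new FileOutputStream(filename));
    // without this call, the form fields are not copied
    copy.setMergeFields();
    document.open();
    for (PdfReader reader : readers) {
        copy.addDocument(reader);
    }
    document.close();
    // close the readers only after the document has been closed
    for (PdfReader reader : readers) {
        reader.close();
    }
}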

In this case, readers is an array of PdfReader instances containing different forms (with different field names). Hence we use PdfCopy, and we must not forget to call the setMergeFields() method, or the fields won't be copied.

Merging identical forms (having identical fields)

In this case, we need to rename the fields, because we probably want different values on different pages. In PDF a field can only have a single value. If you merge identical forms, you have multiple visualizations of the same field, but each visualization will show the same value (because in reality, there is only one field).

Let's take a look at the MergeForms2 example:

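import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfSmartCopy;
import com.itextpdf.text.pdf.PdfStamper;

public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    Document document = new Document();
    PdfCopy copy = new PdfSmartCopy(document, new FileOutputStream(dest));
    copy.setMergeFields();
    document.open();
    List<PdfReader> readers = new ArrayList<PdfReader>();
    // add three copies of the same form, each with uniquely renamed fields
    for (int i = 0; i < 3; ) {
        PdfReader reader = new PdfReader(renameFields(src, ++i));
        readers.add(reader);
        copy.addDocument(reader);
    }
    document.close();
    for (PdfReader reader : readers) {
        reader.close();
    }
}

public byte[] renameFields(String src, int i) throws IOException, DocumentException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    PdfReader reader = new PdfReader(src);
    PdfStamper stamper = new PdfStamper(reader, baos);
    AcroFields form = stamper.getAcroFields();
    // iterate over a copy of the key set, because renaming changes the field map
    Set<String> keys = new HashSet<String>(form.getFields().keySet());
    for (String key : keys) {
        form.renameField(key, String.format("%s_%d", key, i));
    }
    stamper.close();
    reader.close();
    return baos.toByteArray();
}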

As you can see, the renameFields() method creates a new document in memory. That document is merged with other documents using PdfSmartCopy. If you used PdfCopy here, your document would be bloated (as we'll soon find out).
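Why the renaming is needed can be shown with a toy model that does not use iText at all. The sketch below makes a strong simplifying assumption: a form is treated as a plain map from field name to value, which is not how iText stores fields. The class and method names (FieldRenamingSketch, suffixKeys, mergeForms) are hypothetical. Only the "%s_%d" naming pattern is taken from renameFields() above. In this model the later value for a shared name replaces the earlier one; which value a real viewer would show is not something the sketch claims.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a "form" is modelled as a map from field name to value.
// This is NOT iText's data model; all names here are hypothetical.
final class FieldRenamingSketch {

    // Returns a copy of the form whose field names carry the suffix "_<i>",
    // following the same "%s_%d" pattern as renameFields().
    static Map<String, String> suffixKeys(Map<String, String> form, int i) {
        Map<String, String> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : form.entrySet()) {
            renamed.put(String.format("%s_%d", entry.getKey(), i), entry.getValue());
        }
        return renamed;
    }

    // Merges several forms into one field table. A field name can hold only
    // one value, so a later form overwrites an earlier one with the same name.
    static Map<String, String> mergeForms(List<Map<String, String>> forms) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> form : forms) {
            merged.putAll(form);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> first = Collections.singletonMap("name", "Alice");
        Map<String, String> second = Collections.singletonMap("name", "Bob");

        // Without renaming, both forms collapse into a single "name" field.
        System.out.println(mergeForms(Arrays.asList(first, second)));

        // With renaming, each copy keeps its own field and its own value.
        System.out.println(mergeForms(Arrays.asList(suffixKeys(first, 1), suffixKeys(second, 2))));
    }
}
```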

Merging flattened forms

In the FillFlattenMerge1 example, we fill out the forms using PdfStamper. The result is a PDF file that is kept in memory and then merged using PdfCopy. While this approach is fine for merging different forms, it is actually an example of how not to do it when the forms are identical (as explained in the video tutorial).

The FillFlattenMerge2 example shows how to correctly merge identical forms that are filled out and flattened:

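import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.util.StringTokenizer;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfSmartCopy;
import com.itextpdf.text.pdf.PdfStamper;

// DATA and SRC are constants defined elsewhere in the example class
public void manipulatePdf(String src, String dest) throws DocumentException, IOException {
    Document document = new Document();
    PdfCopy copy = new PdfSmartCopy(document, new FileOutputStream(dest));
    document.open();
    ByteArrayOutputStream baos;
    PdfReader reader;
    PdfStamper stamper;
    AcroFields fields;
    StringTokenizer tokenizer;
    BufferedReader br = new BufferedReader(new FileReader(DATA));
    // the first line is read and discarded (presumably a header row)
    String line = br.readLine();
    while ((line = br.readLine()) != null) {
        // create a PDF in memory
        baos = new ByteArrayOutputStream();
        reader = new PdfReader(SRC);
        stamper = new PdfStamper(reader, baos);
        fields = stamper.getAcroFields();
        tokenizer = new StringTokenizer(line, ";");
        fields.setField("name", tokenizer.nextToken());
        fields.setField("abbr", tokenizer.nextToken());
        fields.setField("capital", tokenizer.nextToken());
        fields.setField("city", tokenizer.nextToken());
        fields.setField("population", tokenizer.nextToken());
        fields.setField("surface", tokenizer.nextToken());
        fields.setField("timezone1", tokenizer.nextToken());
        fields.setField("timezone2", tokenizer.nextToken());
        fields.setField("dst", tokenizer.nextToken());
        // flatten the form so that the values become part of the page content
        stamper.setFormFlattening(true);
        stamper.close();
        reader.close();
        // add the PDF to PdfCopy
        reader = new PdfReader(baos.toByteArray());
        copy.addDocument(reader);
        reader.close();
    }
    br.close();
    document.close();
}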
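The tokenizing step in the loop above can be tried on its own, without iText. The following is a minimal sketch using only the JDK; the class name DataLineSketch, the method parseLine, and the sample line are made up for illustration, while the nine field names are the ones used in the example. One detail worth knowing is that StringTokenizer skips empty tokens, so a line with an empty column would hand the next non-empty value to the wrong field and eventually make nextToken() throw NoSuchElementException. The sketch therefore uses split(";", -1), which keeps empty columns; this is a swapped-in alternative, not what the example itself does.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical helper: maps one semicolon-separated data line to field values.
final class DataLineSketch {

    // Field names in column order, as used in the example above.
    static final String[] FIELD_NAMES = {
        "name", "abbr", "capital", "city", "population",
        "surface", "timezone1", "timezone2", "dst"
    };

    // Splits the line on ';' and pairs each column with its field name.
    // The limit -1 keeps empty columns, including trailing ones.
    static Map<String, String> parseLine(String line) {
        String[] columns = line.split(";", -1);
        if (columns.length < FIELD_NAMES.length) {
            throw new IllegalArgumentException(
                "expected " + FIELD_NAMES.length + " columns but found " + columns.length);
        }
        Map<String, String> values = new LinkedHashMap<>();
        for (int i = 0; i < FIELD_NAMES.length; i++) {
            values.put(FIELD_NAMES[i], columns[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up sample line with an empty "timezone2" column.
        String line = "Examplia;EX;Capital City;Big City;1000;2000;UTC;;NO";

        // StringTokenizer drops the empty column and reports one token fewer.
        System.out.println(new StringTokenizer(line, ";").countTokens());

        // split keeps all nine columns, so every value lands in its own field.
        System.out.println(parseLine(line));
    }
}
```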

These are three scenarios. Your question is too unclear for anyone but you to decide which scenario is the best fit for your needs. I suggest that you take the time to learn before you code. Watch the video, try the examples, and if you still have doubts, you'll be able to post a smarter question.

How to merge multiple pdf files (generated in run time)?

If you want to merge source documents using iText(Sharp), there are two basic situations:

  1. You really want to merge the documents, acquiring the pages in their original format and transferring as much of their content and their interactive annotations as possible. In this case you should use a solution based on a member of the PdfCopy family of classes (PdfCopy, PdfSmartCopy, PdfCopyFields).

  2. You actually want to integrate pages from the source documents into a new document, but you want the new document to govern the general format and don't care about the interactive features (annotations...) in the original documents (or even want to get rid of them). In this case you should use a solution based on the PdfWriter class.

You can find details in chapter 6 (especially section 6.4) of iText in Action — 2nd Edition. The Java sample code can be accessed here and the C#'ified versions here.

A simple sample using PdfCopy is Concatenate.java / Concatenate.cs. The central piece of code is:

// Requires: using System.IO; using iTextSharp.text; using iTextSharp.text.pdf;
byte[] mergedPdf = null;
using (MemoryStream ms = new MemoryStream())
{
    using (Document document = new Document())
    {
        using (PdfCopy copy = new PdfCopy(document, ms))
        {
            document.Open();

            for (int i = 0; i < pdf.Count; ++i)
            {
                PdfReader reader = new PdfReader(pdf[i]);
                // loop over the pages in that document
                int n = reader.NumberOfPages;
                for (int page = 0; page < n; )
                {
                    copy.AddPage(copy.GetImportedPage(reader, ++page));
                }
            }
        }
    }
    mergedPdf = ms.ToArray();
}

Here pdf can be defined either as a List<byte[]> directly containing the source documents (appropriate for your use case of merging intermediate in-memory documents) or as a List<String> containing the names of source document files (appropriate if you merge documents from disk).

An overview at the end of the referenced chapter summarizes the usage of the classes mentioned:

  • PdfCopy: Copies pages from one or more existing PDF documents. Major downsides: PdfCopy doesn’t detect redundant content, and it fails when concatenating forms.

  • PdfCopyFields: Puts the fields of the different forms into one form. Can be used to avoid the problems encountered with form fields when concatenating forms using PdfCopy. Memory use can be an issue.

  • PdfSmartCopy: Copies pages from one or more existing PDF documents. PdfSmartCopy is able to detect redundant content, but it needs more memory and CPU than PdfCopy.

  • PdfWriter: Generates PDF documents from scratch. Can import pages from other PDF documents. The major downside is that all interactive features of the imported page (annotations, bookmarks, fields, and so forth) are lost in the process.
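The two situations and the overview above can be condensed into a small decision helper. This is only one reading of the text in this answer, not an official rule set: the class name MergeClassChooser, the method choose, and its two boolean parameters are hypothetical, and the helper returns class names as plain strings so that it runs without iText on the classpath.

```java
// Hypothetical helper that encodes the overview above as a decision.
final class MergeClassChooser {

    // keepInteractiveFeatures: annotations, fields and so forth must survive.
    // expectRedundantContent: the sources share content worth de-duplicating,
    //   at the price of more memory and CPU.
    static String choose(boolean keepInteractiveFeatures, boolean expectRedundantContent) {
        if (!keepInteractiveFeatures) {
            // The new document governs the format; imported pages lose
            // their interactive features.
            return "PdfWriter";
        }
        return expectRedundantContent ? "PdfSmartCopy" : "PdfCopy";
    }

    public static void main(String[] args) {
        System.out.println(choose(true, false));
        System.out.println(choose(true, true));
        System.out.println(choose(false, false));
    }
}
```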

ITextSharp Merge Pdf Exception

"I have looked around and some comments have mentioned that my file is corrupt."

The information you found is most likely correct: the file you are trying to read is probably corrupt.

"However, I am able to open and read the source pdfs without any issue."

PDF viewers often try to repair a certain amount of corruption under the hood. As the person viewing the PDF can usually quickly recognize whether the repair succeeded or only left some pages full of garbage, this is ok-ish, i.e. less a bug and more a feature.

Libraries that automatically process PDFs, on the other hand, should not try this (at least not as much as viewers do) as their outputs might directly go into some archive never to be checked until an audit some years later. A document full of garbage then will cause lots of trouble.

"I am unsure of how to continue now."

Try to repair the PDF in question.

If you open it in a current Adobe Acrobat Reader, the program will usually ask, when you close the document, whether you want to save it. Doing so actually saves a repaired version, which iText is very likely to accept without further ado.

If that does not work, i.e. if either Adobe Acrobat Reader does not offer to save a repaired version or iText does not accept even the repaired version, please share the PDF in question here for further analysis.
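When many input files have to be screened before merging, a cheap structural check can flag obviously damaged candidates early. The sketch below is a heuristic under stated assumptions, not a validator: it only tests that the data starts with the "%PDF-" header and that the "%%EOF" marker appears in the last 1024 bytes. The class name PdfSanityCheck, the method looksLikeCompletePdf, and the window size are choices made for this example. A file that passes may still be corrupt, and a file that fails is merely a good candidate for the repair step described above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical pre-check; passing it does not prove that a PDF is valid.
final class PdfSanityCheck {

    // How many bytes at the end of the file are searched for "%%EOF".
    private static final int TAIL_WINDOW = 1024;

    static boolean looksLikeCompletePdf(byte[] data) {
        if (data == null || data.length < 8) {
            return false;
        }
        // A PDF file starts with a header such as "%PDF-1.4".
        String head = new String(data, 0, 5, StandardCharsets.ISO_8859_1);
        if (!head.equals("%PDF-")) {
            return false;
        }
        // A complete file ends with the "%%EOF" marker; a missing marker
        // often means the file was truncated.
        int tailStart = Math.max(0, data.length - TAIL_WINDOW);
        String tail = new String(data, tailStart, data.length - tailStart,
                StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    public static void main(String[] args) {
        byte[] complete = "%PDF-1.4\n(body omitted)\n%%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] truncated = "%PDF-1.4\n(body cut off".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeCompletePdf(complete));
        System.out.println(looksLikeCompletePdf(truncated));
    }
}
```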

Using iText 2.1.7 to merge large PDFs

For anyone curious, the issue had nothing to do with iText; it was caused by the code responsible for returning the response from iText.


