Apache PDFBox: Problems with Encoding

Encoding Problems after using PDFBox

In short

The relevant difference is that PDFBox serializes names differently. According to the PDF specification, though, the two ways of writing them are equivalent, so you apparently have uncovered a WPViewPDF bug.

The difference in writing names

In the original PDF (raw.pdf) you find the names NOWFJV+Arial,Bold and NOWFJV+Arial,Bold-WinCharSetFFFF. In all files manipulated by PDFBox, all occurrences of those names outside of content streams are replaced by NOWFJV+Arial#2CBold and NOWFJV+Arial#2CBold-WinCharSetFFFF.

WPViewPDF cannot properly display the text written in the fonts with these changed names. After patching the PDFs back to contain a comma in place of the '#2C' in those names, WPViewPDF again properly displays such text.

I would assume WPViewPDF finds NOWFJV+Arial,Bold in the content stream and expects to find the matching font definition in the page resources using the identically written name, so it doesn't recognize it with the name NOWFJV+Arial#2CBold.

Is that a PDFBox bug?

According to the PDF specification,

Any character in a name that is a regular character (other than NUMBER SIGN) shall be written as itself or by using its 2-digit hexadecimal code, preceded by the NUMBER SIGN.

(ISO 32000-2, section 7.3.5 "Name objects")

Thus, this replacement of commas in names by '#2C' sequences is a completely valid alternative way to write those names.
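To illustrate the equivalence, here is a minimal sketch of how a reader could resolve the '#' escapes before comparing names. The class and method names are made up for this illustration and are neither PDFBox nor WPViewPDF code; the sketch also treats name bytes as Latin-1 characters, which is enough for a comparison.

```java
/** Sketch only: resolves #xx escapes in a PDF name as written in the file (leading slash omitted). */
final class PdfNameSketch {

    static String decode(String written) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < written.length(); i++) {
            char c = written.charAt(i);
            if (c == '#' && i + 2 < written.length()
                    && Character.digit(written.charAt(i + 1), 16) >= 0
                    && Character.digit(written.charAt(i + 2), 16) >= 0) {
                // '#' followed by two hex digits stands for the byte with that value
                out.append((char) Integer.parseInt(written.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Both spellings resolve to the same name
        System.out.println(decode("NOWFJV+Arial#2CBold"));
        System.out.println(decode("NOWFJV+Arial,Bold"));
    }
}
```

A viewer that compares the decoded forms instead of the raw spellings would match the content stream name and the resource name regardless of how either is written.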

Thus, no, it's not a PDFBox bug but apparently a WPViewPDF bug.

Apache PDFBox: problems with encoding

This answer is actually an explanation why a generic solution for your task is at least very complicated if not impossible. Under benign circumstances, i.e. for PDFs subject to specific restrictions, code like yours can be successfully used, but your example PDF shows that the PDFs you apparently want to manipulate are not restricted like that.

Why automatic replacement of text is difficult/impossible

There are a number of factors that impede automatic replacement of text in PDFs. Some already make it difficult to find the instructions that draw the text in question, and some complicate replacing the characters in the arguments of those instructions.

The list of problems illustrated here is not exhaustive!

Finding instructions drawing a specific text

PDFs contain content streams which contain sequences of instructions telling a PDF processor where to draw what. Regular text in PDFs is drawn by instructions setting the current font (and font size), setting the position to draw the text at, and actually drawing text. This can be as easy to understand and search for as this:

/TT0 1 Tf
9 0 0 9 5 5 Tm
(file:///C/Users/Mi/Downloads/converted.txt[10.03.2020 18:43:57]) Tj

(Here the font TT0 with size 1 is selected, then an affine transformation is applied to scale text by a factor of 9 and move to the position (5, 5), and finally the text "file:///C/Users/Mi/Downloads/converted.txt[10.03.2020 18:43:57]" is drawn.)

In such a case searching the instructions responsible for drawing a given piece of text is easy. But the instructions in question may also look different.

Split lines

For example, the string may be drawn in pieces. Instead of the Tj instruction above, we may have

[(file:///C/Users/Mi/Downloads/converted.txt)2 ([10.03.2020 18:43:57])] TJ

(Here first "file:///C/Users/Mi/Downloads/converted.txt" is drawn, then the text drawing position is slightly moved, then "[10.03.2020 18:43:57]" is drawn, both in the same TJ instruction.)

Or you may see

(file:///C/Users/Mi/Downloads/converted.txt) Tj
([10.03.2020 18:43:57]) Tj

(The text parts drawn in different instructions.)

Also the order of text pieces may be unexpected:

([10.03.2020 18:43:57]) Tj 
-40 0 Td
(file:///C/Users/Mi/Downloads/converted.txt) Tj

(First the date string is drawn, then the text position is moved left to quite a bit before the drawn date, then the URL is drawn.)

Some PDF producers draw each character separately, setting the whole text transformation in between:

9 0 0 9 5 5 Tm
(f) Tj
9 0 0 9 14 5 Tm
(i) Tj
9 0 0 9 23 5 Tm
(l) Tj
...

And these different instructions need not be arranged in sequence as here. They can be spread over the whole stream, or even over multiple streams: a page can have an array of content streams instead of a single one, and part of the string may be drawn in the content stream of a sub-object (e.g. a form XObject) referenced from the page content stream.

Thus, for finding the instructions responsible for a specific, multi-character text, you may have to inspect multiple streams and glue the strings you found together according to the position they have been drawn at.
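The gluing step can be sketched roughly as follows. This is a simplified illustration, not PDFBox code: the Chunk type and the glue method are hypothetical, the positions are assumed to be already computed from the text matrices, and chunks only count as being on the same line if their y coordinates are exactly equal (a real implementation needs a tolerance and has to detect word gaps).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ChunkGlueSketch {

    /** A string argument of a text drawing instruction plus the position it starts at. */
    static final class Chunk {
        final double x;
        final double y;
        final String text;

        Chunk(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Orders chunks top-down, then left-to-right, and concatenates their text. */
    static String glue(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y coordinates usually grow upwards, so larger y means further up the page
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        StringBuilder line = new StringBuilder();
        for (Chunk c : sorted) {
            line.append(c.text);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // The per-character example from above, deliberately given out of drawing order
        List<Chunk> drawn = Arrays.asList(
                new Chunk(23, 5, "l"), new Chunk(5, 5, "f"), new Chunk(14, 5, "i"));
        System.out.println(glue(drawn));
    }
}
```

Only after such a reordering can you search the glued text for your string and map a hit back to the chunks (and thereby the instructions) it came from.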

Ligatures

Not every character code necessarily corresponds to a single character of your search string. There are special glyphs for combinations of characters, e.g. a single glyph for "fl". So for searching, one has to expand such ligatures.
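If the extracted text contains the Unicode ligature code points (which is an assumption; a PDF may just as well map the ligature glyph to the two letters or to nothing at all), one way to expand them is Unicode compatibility normalization. A small sketch with a made-up class name:

```java
import java.text.Normalizer;

final class LigatureSketch {

    /** Expands compatibility characters such as U+FB01 (fi) and U+FB02 (fl) into plain letters. */
    static String expand(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(expand("\uFB02ow"));   // ligature fl followed by "ow"
        System.out.println(expand("of\uFB01ce")); // ligature fi inside a word
    }
}
```

Be aware that NFKC also rewrites other compatibility characters (superscripts, full-width forms, ...), so the normalized string is suited for searching, not for writing back into the PDF.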

Encodings

In the examples above, the characters of the text were easy to recognize even if the text was not drawn in a single run. But in PDFs the encoding of the characters need not be so obvious. Actually, each font may come with its own encoding, e.g.

<004B0048004F004F0052000400040004>Tj 

can draw "hello!!!".

(Here the string argument is written as hex string, in the debugger you saw "KHOOR...".)

Thus, for searching text, one needs to first map the string arguments of text drawing instructions to Unicode depending on the specific encoding of the current font.
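As a rough sketch of that mapping step for the example above: the code below assumes two-byte character codes and uses a hand-written code-to-Unicode table that merely reproduces what the example string is said to draw. In a real PDF this table would have to come from the font's ToUnicode map or encoding; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class CodeMapSketch {

    /** Hand-written table for the example only; a real one is read from the font. */
    static Map<Integer, String> exampleMap() {
        Map<Integer, String> map = new HashMap<>();
        map.put(0x004B, "h");
        map.put(0x0048, "e");
        map.put(0x004F, "l");
        map.put(0x0052, "o");
        map.put(0x0004, "!");
        return map;
    }

    /** Decodes a hex string argument, assuming every character code is two bytes long. */
    static String decode(String hex, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            int code = Integer.parseInt(hex.substring(i, i + 4), 16);
            // U+FFFD marks codes for which no Unicode value is known
            text.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("004B0048004F004F0052000400040004", exampleMap()));
    }
}
```

Read as plain 16-bit characters, the same bytes give "KHOOR" followed by control characters, which is what the debugger showed.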

But the PDF does not need to contain a mapping from the individual codes to Unicode characters; there may only be a mapping to the glyph id in the font file. In case of embedded font files, these font files then don't need to contain any mapping to Unicode characters either.

Often PDF files do have information on the Unicode characters matching the codes to allow text extraction, e.g. for copy/paste. Strictly speaking, though, such information is optional. Even worse, that information may contain errors without creating issues when displaying the PDF. In all such situations one has to use OCR-like mechanisms to recognize the Unicode characters associated with each glyph.

Replacing text in instructions

Once you have found the instructions responsible for drawing the text you searched for, you have to replace the text. This comes with problems of its own.

Subset fonts

If font files are embedded in a PDF, they often merely are embedded as subsets of the original fonts to save space. E.g. in your example PDF the font Tahoma used to display "hello!!!" only is embedded with the following glyphs:

[Image: glyphs contained in the embedded Tahoma subset]

Even Times New Roman (the font used for the text you could recognize) is only subset embedded with the following glyphs:

[Image: glyphs contained in the embedded Times New Roman subset]

Thus, even if you found the "hello!!!" in Tahoma, simply replacing the character codes to mean "byebye??" would only display " e e " as the only character for which a glyph is present in the embedded font is the 'e'.

Thus, to replace you may either have to edit the embedded font file and the representing PDF font object to contain and encode all required glyphs, or to add another font and instructions to switch to that font for the manipulated text drawing instructions and back again thereafter.
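The effect of a missing glyph can be simulated with a few lines. Note the assumption: the glyph set below (just the characters of "hello!!!") is my guess at the Tahoma subset for the sake of the example. The only thing stated above is that 'e' is the sole character of "byebye??" with a glyph in it. The names are hypothetical, too.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SubsetCheckSketch {

    /** Assumed content of the Tahoma subset: only the characters needed for "hello!!!". */
    static Set<Character> assumedTahomaSubset() {
        return new HashSet<>(Arrays.asList('h', 'e', 'l', 'o', '!'));
    }

    /** Simulates the visible result: characters without a glyph show up as blanks. */
    static String visibleResult(String replacement, Set<Character> glyphs) {
        StringBuilder shown = new StringBuilder();
        for (char c : replacement.toCharArray()) {
            shown.append(glyphs.contains(c) ? c : ' ');
        }
        return shown.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + visibleResult("byebye??", assumedTahomaSubset()) + "]");
    }
}
```

A check like this, run against the real glyph set of the font in question, tells you up front whether a replacement can be drawn with the existing font at all.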

Font encodings

Even if your font is not embedded at all (so your complete local copy of the font will be used) or embedded with all the glyphs you need, the encoding used for your font may be limited. In Western European language based PDFs you will often find WinAnsiEncoding, an encoding similar to Windows code page 1252. If you want to replace with Cyrillic text, there are no character codes for those characters.

Thus in this case you might have to change the encoding to include all the characters you need (by finding unused characters in the present encoding by scanning all uses of the font in question) or add another font with a more apropos encoding.
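For a quick feasibility check you can approximate WinAnsiEncoding by the JDK's windows-1252 charset. This is only a stand-in (the two are similar, not identical), and the class name is made up:

```java
import java.nio.charset.Charset;

final class WinAnsiCheckSketch {

    /** Approximation: true if every character has a code in windows-1252. */
    static boolean fitsCp1252(String text) {
        return Charset.forName("windows-1252").newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsCp1252("hello!!!"));
        // "Privet" written in Cyrillic letters
        System.out.println(fitsCp1252("\u041F\u0440\u0438\u0432\u0435\u0442"));
    }
}
```

If the check fails, you are in the situation described above and need to extend the encoding or switch to another font.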

Layout considerations

If your replacement text is longer or shorter than the replaced text and there is other text following on the same line in the PDF, you have to decide whether that text should be moved, too, or not. It may belong together and has to be shifted accordingly, but it may alternatively be from a separate text block or column in which case it should not be moved.

Text justification may also be damaged.

Also consider marked text (underline / strike through / background color / ...). These markings in PDF (usually) are not font properties but separate vector graphics. To get these right, you have to parse the vector graphics and annotations from the page, heuristically identify text markings, and update them.

Tagged PDFs

If you deal with tagged PDFs (e.g. for accessibility), this may make finding text easier (as accessibility should allow for easy text extraction) but replacing text harder because you may also have to update some tags or structure tree data.

How to implement a generic text replacement nonetheless

As shown above there are a lot of hindrances to text replacement in PDFs. Thus, a complete solution (where possible at all) is far beyond the scope of a Stack Overflow answer. Some pointers, though:

To find the text to replace you should make use of the PDFTextStripper (a PDFBox utility class for text extraction) and extend it to have all the text with pointers to the text drawing instruction that draws each character respectively. This way you don't have to implement all the decoding and sorting of the text.

To replace the text you can ask the PDFBox font classes (provided by the PDFTextStripper if extended accordingly) whether they can encode your replacement text.

And always have a copy of the PDF specification (ISO 32000-1 or ISO 32000-2) at hand...

But do be aware that it will take you a while, a number of weeks or months, to get a somewhat decent generic solution.

PDFBox with special characters working fine on Windows but characters getting replaced with other characters in Linux

Change this

PDType0Font font = PDType0Font.load(pdDoc, PDFMailMergeUtil.class.getResourceAsStream("/Arial_Narrow.ttf"));

to this

PDType0Font font = PDType0Font.load(pdDoc, PDFMailMergeUtil.class.getResourceAsStream("/Arial_Narrow.ttf"), false);

to avoid subsetting. IIRC it's because the font file in the subset font doesn't really exist at the time you're using it because the object you're using is a different PDFont object.

PDFBox Symbolic fonts must have a built-in encoding error when using PDFTextStripper.getText()

As already indicated by @Tilman opening a bug issue in the PDFBox Jira, this behavior is a bug:

The DictionaryEncoding constructor retrieves an Encoding instance for the base encoding of a font using Encoding.getInstance and is well aware that this method may return null:

base = Encoding.getInstance(name); // may be null

If it is null, though, and PDFBox has not been able to determine a built-in encoding of the font, the observed exception is thrown:

throw new IllegalArgumentException("Symbolic fonts must have a built-in " + 
"encoding");

In the case at hand, the base encoding is MacExpertEncoding which is one of the possible base encodings explicitly named by the PDF specification. Unfortunately Encoding.getInstance does not know this encoding and, therefore, returns null which in turn triggers the exception as PDFBox also could not identify a built-in encoding.


Thus, a fix should include the addition of an Encoding class for MacExpertEncoding and extending Encoding.getInstance accordingly.

Furthermore, one should consider not throwing the exception at all: There are numerous situations where there is no need for an implicit or explicit base encoding, e.g. if the Differences explicitly provide a mapping for each character code or (in case of pure text extraction) if the font has a good ToUnicode table.
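The more lenient behavior could look roughly like this. To be clear, this is not the PDFBox DictionaryEncoding code, just a sketch of the idea with hypothetical names: consult the Differences first and treat a missing base encoding as "no further information" instead of an error.

```java
import java.util.Collections;
import java.util.Map;

final class LenientEncodingSketch {

    /**
     * Looks up the glyph name for a character code.
     * Returns null if neither the Differences nor the (optional) base encoding know the code.
     */
    static String glyphName(int code, Map<Integer, String> differences,
                            Map<Integer, String> baseOrNull) {
        String name = differences.get(code);
        if (name != null) {
            return name; // the Differences entry wins, no base encoding needed
        }
        // An unknown base encoding only matters for codes the Differences do not cover
        return baseOrNull == null ? null : baseOrNull.get(code);
    }

    public static void main(String[] args) {
        Map<Integer, String> differences = Collections.singletonMap(65, "A");
        System.out.println(glyphName(65, differences, null));
        System.out.println(glyphName(66, differences, null));
    }
}
```

With such a lookup a problem only surfaces for character codes that are actually used and not covered, instead of failing for the whole font up front.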


