Encoding Problems after using PDFBox
In short
The relevant difference is that PDFBox serializes names differently than the original file does. According to the PDF specification both ways of writing them are equivalent, though, so you apparently have uncovered a WPViewPDF bug.
The difference in writing names
In the original PDF (raw.pdf) you find the names NOWFJV+Arial,Bold and NOWFJV+Arial,Bold-WinCharSetFFFF. In all files manipulated by PDFBox, all occurrences of those names outside of content streams are replaced by NOWFJV+Arial#2CBold and NOWFJV+Arial#2CBold-WinCharSetFFFF.
WPViewPDF cannot properly display the text written in the fonts with these changed names. After patching the PDFs back to contain a comma in place of the '#2C' in those names, WPViewPDF again properly displays such text.
I would assume WPViewPDF finds NOWFJV+Arial,Bold in the content stream and expects to find the matching font definition in the page resources using the identically written name, so it doesn't recognize it with the name NOWFJV+Arial#2CBold.
Is that a PDFBox bug?
According to the PDF specification,
Any character in a name that is a regular character (other than NUMBER SIGN) shall be written as itself or by using its 2-digit hexadecimal code, preceded by the NUMBER SIGN.
(ISO 32000-2, section 7.3.5 "Name objects")
Thus, this replacement of commas in names by '#2C' sequences is a completely valid alternative way to write those names.
Thus, no, it's not a PDFBox bug but apparently a WPViewPDF bug.
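To see why both spellings denote the same name, here is a minimal sketch of the '#xx' decoding a PDF processor has to apply to a name token before comparing it. The class and method names are made up for this illustration, and the code assumes an ASCII token in which every NUMBER SIGN is followed by two hex digits:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class PdfNameDecoder {

    // Decodes "#xx" escapes in a PDF name token (given without the leading '/').
    // Assumption: the token is plain ASCII and each '#' is followed by two hex digits.
    static String decode(String rawName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < rawName.length(); i++) {
            char c = rawName.charAt(i);
            if (c == '#' && i + 2 < rawName.length()) {
                // Two hex digits after '#' give the byte value of the character.
                bytes.write(Integer.parseInt(rawName.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                bytes.write(c);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = decode("NOWFJV+Arial,Bold");
        String rewritten = decode("NOWFJV+Arial#2CBold");
        System.out.println(rewritten);
        System.out.println(original.equals(rewritten));
    }
}
```

A viewer that compares the decoded names instead of the raw bytes would match the font resource in both variants of the file.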
Apache PDFBox: problems with encoding
This answer is actually an explanation why a generic solution for your task is at least very complicated if not impossible. Under benign circumstances, i.e. for PDFs subject to specific restrictions, code like yours can be successfully used, but your example PDF shows that the PDFs you apparently want to manipulate are not restricted like that.
Why automatic replacement of text is difficult/impossible
There are a number of factors that impede automatic replacement of text in PDFs. Some already make it difficult to find the instructions for drawing the text in question, and some complicate replacing the characters in the arguments of those instructions.
The list of problems illustrated here is not exhaustive!
Finding instructions drawing a specific text
PDFs contain content streams which contain sequences of instructions telling a PDF processor where to draw what. Regular text in PDFs is drawn by instructions setting the current font (and font size), setting the position to draw the text at, and actually drawing text. This can be as easy to understand and search for as this:
/TT0 1 Tf
9 0 0 9 5 5 Tm
(file:///C/Users/Mi/Downloads/converted.txt[10.03.2020 18:43:57]) Tj
(Here the font TT0 with size 1 is selected, then an affine transformation is applied to scale text by a factor of 9 and move to the position (5, 5), and finally the text "file:///C/Users/Mi/Downloads/converted.txt[10.03.2020 18:43:57]" is drawn.)
In such a case searching the instructions responsible for drawing a given piece of text is easy. But the instructions in question may also look different.
Split lines
For example, the string may be drawn in pieces. Instead of the Tj instruction above, we may have
[(file:///C/Users/Mi/Downloads/converted.txt)2 ([10.03.2020 18:43:57])] TJ
(Here first "file:///C/Users/Mi/Downloads/converted.txt" is drawn, then the text drawing position is slightly moved, then "[10.03.2020 18:43:57]" is drawn, both in the same TJ instruction.)
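As a rough sketch of what "collecting the pieces" of such a TJ operand means, the following code pulls the literal strings out of the array and concatenates them. It is deliberately simplistic: the class name is made up, and the regular expression assumes literal strings without escape sequences or nested parentheses and ignores hex strings, all of which a real content stream parser has to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TjArrayStrings {

    // Matches a literal string "( ... )" that contains no escapes or nested parentheses.
    private static final Pattern LITERAL = Pattern.compile("\\(([^()\\\\]*)\\)");

    // Returns the string pieces of a TJ array operand in drawing order.
    static List<String> pieces(String tjOperand) {
        List<String> result = new ArrayList<>();
        Matcher m = LITERAL.matcher(tjOperand);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    // Concatenates the pieces, dropping the numeric position adjustments between them.
    static String joined(String tjOperand) {
        return String.join("", pieces(tjOperand));
    }

    public static void main(String[] args) {
        String operand = "[(file:///C/Users/Mi/Downloads/converted.txt)2 ([10.03.2020 18:43:57])]";
        System.out.println(pieces(operand));
        System.out.println(joined(operand));
    }
}
```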
Or you may see
(file:///C/Users/Mi/Downloads/converted.txt) Tj
([10.03.2020 18:43:57]) Tj
(The text parts drawn in different instructions.)
Also the order of text pieces may be unexpected:
([10.03.2020 18:43:57]) Tj
-40 0 Td
(file:///C/Users/Mi/Downloads/converted.txt) Tj
(First the date string is drawn, then the text position is moved left to quite a bit before the drawn date, then the URL is drawn.)
Some PDF producers draw each character separately, setting the whole text transformation in between:
9 0 0 9 5 5 Tm
(f) Tj
9 0 0 9 14 5 Tm
(i) Tj
9 0 0 9 23 5 Tm
(l) Tj
...
And these different instructions need not be arranged in sequence as here. They can be spread over the whole stream or even over multiple streams, as a page can have an array of content streams instead of a single one, and part of the string may be drawn in the content stream of a sub-object (e.g. a form XObject) referenced from the page content stream.
Thus, for finding the instructions responsible for a specific, multi-character text, you may have to inspect multiple streams and glue the strings you found together according to the position they have been drawn at.
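The gluing step can be sketched like this. The Fragment class and the coordinates are hypothetical, and the sketch assumes that strings on the same line share exactly the same y coordinate; real code needs a tolerance and has to take the complete text and transformation matrices into account.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class FragmentGluer {

    // Hypothetical holder for one drawn string and the start of its baseline.
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Orders fragments top to bottom (y grows upwards in PDF), then left to right,
    // and concatenates their texts regardless of the order they were drawn in.
    static String glue(List<Fragment> drawn) {
        List<Fragment> sorted = new ArrayList<>(drawn);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y).reversed()
                .thenComparingDouble(f -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : sorted) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The date is drawn first but positioned to the right of the URL.
        List<Fragment> drawn = Arrays.asList(
                new Fragment(45, 5, "[10.03.2020 18:43:57]"),
                new Fragment(5, 5, "file:///C/Users/Mi/Downloads/converted.txt"));
        System.out.println(glue(drawn));
    }
}
```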
Ligatures
Not every single character code might correspond to a single character as in your search string. There are a number of special glyphs for combinations of characters, like the single glyph ﬂ (U+FB02) for the two characters fl, etc. So for searching one has to expand such ligatures.
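If the extracted text contains the Unicode ligature code points, one way to expand them is Unicode compatibility normalization, which the JDK offers in java.text.Normalizer. This is only a sketch with a made-up class name. Be aware that NFKC also rewrites other compatibility characters (superscripts, full-width forms, ...), which may or may not be what you want for your search.

```java
import java.text.Normalizer;

class LigatureExpander {

    // Expands compatibility characters such as the ligature U+FB02 into their
    // plain equivalents ("fl") using NFKC normalization.
    static String expand(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String extracted = "\uFB02oor"; // the ligature glyph followed by "oor"
        System.out.println(extracted.contains("fl"));          // search fails on the raw text
        System.out.println(expand(extracted).contains("fl"));  // search succeeds after expansion
    }
}
```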
Encodings
In the examples above, the characters of the text were easy to recognize even if the text was not drawn in a single run. But in PDFs the encoding of the characters need not be so obvious; actually each font may come with its own encoding, e.g.
<004B0048004F004F0052000400040004>Tj
can draw "hello!!!".
(Here the string argument is written as hex string, in the debugger you saw "KHOOR...".)
Thus, for searching text, one needs to first map the string arguments of text drawing instructions to Unicode depending on the specific encoding of the current font.
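To illustrate that mapping step, here is a sketch that decodes the hex string from above. The code-to-Unicode table is reconstructed by hand from this one example for illustration (in a real PDF it would come from the font's ToUnicode map or encoding, if present), and the sketch assumes fixed two-byte character codes. The class and method names are made up.

```java
import java.util.HashMap;
import java.util.Map;

class HexStringDecoder {

    // Maps each two-byte character code of the hex string to Unicode using the
    // given table; unknown codes become U+FFFD (replacement character).
    static String decode(String hex, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            int code = Integer.parseInt(hex.substring(i, i + 4), 16);
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    // What you get if you read each code directly as a character,
    // i.e. without the font specific mapping.
    static String naive(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative table, derived from the example string only.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x004B, "h");
        toUnicode.put(0x0048, "e");
        toUnicode.put(0x004F, "l");
        toUnicode.put(0x0052, "o");
        toUnicode.put(0x0004, "!");

        String hex = "004B0048004F004F0052000400040004";
        System.out.println(naive(hex).substring(0, 5));
        System.out.println(decode(hex, toUnicode));
    }
}
```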
But the PDF does not need to contain a mapping from the individual codes to Unicode characters; there may only be a mapping to the glyph id in the font file. In the case of embedded font files, these font files don't need to contain any mapping to Unicode characters either.
Often PDF files do have information on the Unicode characters matching the codes to allow text extraction, e.g. for copy/paste; strictly speaking, though, such information is optional. Even worse, that information may contain errors without causing any issues when the PDF is displayed. In all such situations one has to use OCR-like mechanisms to recognize the Unicode characters associated with each glyph.
Replacing text in instructions
Once you have found the instructions responsible for drawing the text you searched for, you have to replace the text. This step comes with problems of its own.
Subset fonts
If font files are embedded in a PDF, they often are embedded merely as subsets of the original fonts to save space. E.g. in your example PDF the font Tahoma used to display "hello!!!" is embedded with only the following glyphs:
[Image: glyphs in the embedded Tahoma subset]
Even Times New Roman (the font used for the text you could recognize) is embedded only as a subset with the following glyphs:
[Image: glyphs in the embedded Times New Roman subset]
Thus, even if you found the "hello!!!" in Tahoma, simply replacing the character codes to mean "byebye??" would only display " e e ", as 'e' is the only one of the required characters for which a glyph is present in the embedded font.
Thus, to replace the text you may either have to edit the embedded font file and the PDF font object representing it so that they contain and encode all required glyphs, or add another font plus instructions to switch to that font for the manipulated text drawing instructions and back again thereafter.
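A check for this situation boils down to a set difference between the characters you need and the characters the subset can draw. In the following sketch the available set is assumed to be exactly the characters of "hello!!!" (an assumption for illustration; in practice you would have to derive it from the embedded font program), and the class name is made up.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class SubsetCheck {

    // Collects the distinct characters of a string in order of first appearance.
    static Set<Character> charsOf(String s) {
        Set<Character> set = new LinkedHashSet<>();
        for (char c : s.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    // Returns the characters of the replacement text that have no glyph in the subset.
    static Set<Character> missing(String replacement, Set<Character> available) {
        Set<Character> result = charsOf(replacement);
        result.removeAll(available);
        return result;
    }

    public static void main(String[] args) {
        // Assumption: the subset contains just the glyphs needed for "hello!!!".
        Set<Character> available = charsOf("hello!!!");
        System.out.println(missing("byebye??", available));
    }
}
```

If the result is not empty, a plain replacement of character codes cannot work and you need one of the two strategies described above.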
Font encodings
Even if your font is not embedded at all (so your complete local copy of the font will be used) or embedded with all the glyphs you need, the encoding used for your font may be limited. In Western European language based PDFs you will often find WinAnsiEncoding, an encoding similar to Windows code page 1252. If you want to replace with Cyrillic text, there are no character codes for those characters.
Thus, in this case you might have to change the encoding to include all the characters you need (by finding unused character codes in the present encoding by scanning all uses of the font in question) or add another font with a more appropriate encoding.
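As a quick plausibility check outside of any PDF library, you can ask the JDK whether a text fits into Windows code page 1252. Mind that this is only an approximation: WinAnsiEncoding is similar to, but not the same thing as, cp1252, and the class name here is made up. The sketch merely shows the kind of question you have to ask the font's encoding.

```java
import java.nio.charset.Charset;

class WinAnsiCheck {

    // Approximation: uses windows-1252 as a stand-in for WinAnsiEncoding.
    static boolean fitsCp1252(String text) {
        return Charset.forName("windows-1252").newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsCp1252("byebye??"));
        // "Privet" written in Cyrillic letters, given as Unicode escapes.
        System.out.println(fitsCp1252("\u041F\u0440\u0438\u0432\u0435\u0442"));
    }
}
```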
Layout considerations
If your replacement text is longer or shorter than the replaced text and there is other text following on the same line in the PDF, you have to decide whether that text should be moved, too, or not. It may belong to the same text and have to be shifted accordingly, but it may alternatively be from a separate text block or column, in which case it should not be moved.
Text justification may also be damaged.
Also consider marked text (underline / strike through / background color / ...). These markings in PDF (usually) are not font properties but separate vector graphics. To get these right, you have to parse the vector graphics and annotations from the page, heuristically identify text markings, and update them.
Tagged PDFs
If you deal with tagged PDFs (e.g. for accessibility), this may make finding text easier (as accessibility should allow for easy text extraction) but replacing text harder because you may also have to update some tags or structure tree data.
How to implement a generic text replacement nonetheless
As shown above, there are a lot of hindrances to text replacement in PDFs. Thus, a complete solution (where possible at all) is far beyond the scope of a Stack Overflow answer. Some pointers, though:
To find the text to replace you should make use of the PDFTextStripper (a PDFBox utility class for text extraction) and extend it to have all the text with pointers to the text drawing instruction that draws each character respectively. This way you don't have to implement all the decoding and sorting of the text.
To replace the text you can ask the PDFBox font classes (provided by the PDFTextStripper if extended accordingly) whether they can encode your replacement text.
And always have a copy of the PDF specification (ISO 32000-1 or ISO 32000-2) at hand...
But do be aware that it will take you a while, a number of weeks or months, to get a somewhat decent generic solution.
PDFBox with special characters working fine on Windows but characters getting replaced with other characters in Linux
Change this
PDType0Font font = PDType0Font.load(pdDoc, PDFMailMergeUtil.class.getResourceAsStream("/Arial_Narrow.ttf"));
to this
PDType0Font font = PDType0Font.load(pdDoc, PDFMailMergeUtil.class.getResourceAsStream("/Arial_Narrow.ttf"), false);
to avoid subsetting (the third argument is the embedSubset flag). IIRC the problem is that the font file of the subset font doesn't really exist yet at the time you're using it, because the object you're using is a different PDFont object.
PDFBox Symbolic fonts must have a built-in encoding error when using PDFTextStripper.getText()
As already indicated by @Tilman opening a bug issue in the PDFBox Jira, this behavior is a bug:
The DictionaryEncoding constructor retrieves an Encoding instance for the base encoding of a font using Encoding.getInstance and is well aware that this method may return null:
base = Encoding.getInstance(name); // may be null
If it is null, though, and PDFBox has not been able to determine a built-in encoding of the font, the observed exception is thrown:
throw new IllegalArgumentException("Symbolic fonts must have a built-in " +
"encoding");
In the case at hand, the base encoding is MacExpertEncoding, which is one of the possible base encodings explicitly named by the PDF specification. Unfortunately Encoding.getInstance does not know this encoding and, therefore, returns null, which in turn triggers the exception as PDFBox also could not identify a built-in encoding.
Thus, a fix should include the addition of an Encoding class for MacExpertEncoding and extending Encoding.getInstance accordingly.
Furthermore, one should consider not throwing the exception at all: There are numerous situations where there is no need for an implicit or explicit base encoding, e.g. if the Differences explicitly provide a mapping for each character code or (in case of pure text extraction) if the font has a good ToUnicode table.