Using PDFbox to Write Utf-8 Encoded Strings to a PDF

Java: Write national characters to PDF using PDFBox

I think its PDType1Font.TIMES_ROMAN font which is not supporting your Czech national characters. If you can manage to get the .ttf files for the Czech national characters, then use below to get PDFont as below and use the same:

      PDFont font = PDTrueTypeFont.loadTTF( doc, new File( "CheckRepFont.ttf" ) );

Here CheckRepFont.ttf is your font file name as an example. Update it with actual one.

EDIT:

  PDStream pdStream  = new PDStream(doc);
PDSimpleFont font = PDType1Font.TIMES_ROMAN;
font.setToUnicode(pdStream);

Using PDFBox to write unicode strings to a PDF

Essentially all the answers you linked to are correct. You have to keep in mind which PDFBox version they respectively refer to.

concerning this answer:

In the pre-2.0.0 versions (up to the current 1.8.8) the text drawing operations were very limited and didn't support even the full WinAnsi encoding which font objects generated by these versions used as encoding.

concerning this answer:

The current 2.0.0-SNAPSHOT development state has much improved. This means that the limitations of the text drawing operations have been removed, they properly encode the text and the used fonts are properly encoded and embedded. Bugs in the early implementations of these improvements meanwhile have mostly been fixed.

concerning this answer:

This answer points to something one needs to keep in mind, no matter which PDFBox version one uses: specific fonts do not necessarily support the whole Unicode range of code points. If the font you use does not contain a glyph definition for a character, you can encode as much as you want, your character won't be drawn properly. This especially concerns the standard 14 fonts which every PDF viewer has to support: they need only support characters from a few Latin-style encodings, by far not the the full Unicode set.

replace string with unicode text in pdf file using PDFbox?

Your code utterly breaks the PDF, cf. the Adobe Preflight output:

Preflight output

The cause is obvious, your code

PDResources resources = new PDResources();
PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
resources.add(font);
page.setResources(resources);

drops the pre-existing page Resources and your replacement contains only a single font the name of which you allow PDFBox to choose arbitrarily.

You must not drop existing resources as they are used in your document.


Inspecting the content of your PDF page it becomes obvious that the encoding of the originally used fonts T1_0 and T1_1 either is a single byte encoding or a mixed single/multi-byte encoding; the lower single byte values appear to be encoded ASCII-like.

I would assume that the encoding is WinAnsiEncoding or a subset thereof. As a corollary your task

to read the strings from PDF file and replace it with the Unicode text

cannot be implemented as a simple replacement, at least not with arbitrary Unicode code points in mind.


What you can implement instead is:

  • First run your source PDF through a customized text stripper which instead of extracting the plain text searches for your strings to replace and returns their positions. There are numerous questions and answers here that show you how to determine coordinates of strings in text stripper sub classes, a recent one being this one.
  • Next remove those original strings from your PDF. In your case an approach similar to your original code above (without dropping the resource, obviously), replacing the strings by equally long strings of spaces might work even it is a dirty hack.
  • Finally add your replacements at the determined positions using a PDFContentStream in append mode; for this add your new font to the existing resources.

Please be aware, though, that PDF is not designed to be used like this. Template PDFs can be used as background for new content, but attempting to replace content therein usually is a bad design leading to trouble. If you need to mark positions in the template, use annotations which can easily be dropped during fill-in. Or use AcroForm forms, the native PDF form technology, to start with.

Java - PDFBox 1.8.9 unicode textfile to pdf

Just found this -> Using PDFBox to write unicode strings to a PDF

Seems it's not possbile, need to update to version 2.0.0 and give it a try.

EDITED #2: In new version of PDFBox 2.0.0 (atleast now)
has been removed the class TextToPDF() (In comment, has been said that its avaiable now)
which let me pass in textFile. So now it means, that either i manually read the text and then write it to PDF, or need to find some other solutions

How to sanitise a string before printing it to PDF with PDFBox

I ended doing a character by character sanitization.

Here what my sanitization function looks like.

To avoid reprocessing characters, I am caching the availability of each character for each given font.

When a code point is not available in a font I am trying the "standard" replacement character and if it is not available I am replacing with a question mark.

It is indeed inefficient, but I have not found another more efficient way to do this bearing in mind that I have no control and no advance knowledge of what is being printed.

There might be a lot of things to improve but this works for my use case.

private String getPrintableString(String string, PDFont font) {

StringBuilder sb = new StringBuilder();

for (int i = 0; i < string.length(); i++) {

int codePoint = string.codePointAt(i);

if (codePoint == 0x000A) {
sb.appendCodePoint(codePoint);
continue;
}

String fontName = font.getName();
int cpKey = fontName.hashCode();
cpKey = 31 * cpKey + codePoint;

if (codePointAvailCache.get(cpKey) == null) {

try {
font.encode(string.substring(i, i + 1));
codePointAvailCache.put(cpKey, true);
} catch (Exception e) {
codePointAvailCache.put(cpKey, false);
}
}

if (!codePointAvailCache.get(cpKey)) {

// Need to make sure our font has a replacement character
try {
codePoint = 0xFFFD;
font.encode(new String(new int[] { codePoint }, 0, 1));
} catch (Exception e) {
codePoint = 0x003F;
}
}

sb.appendCodePoint(codePoint);
}

return sb.toString();
}


Related Topics



Leave a reply



Submit