How to Add Unicode in Truetype0Font on PDFBox 2.0.0

How to add Unicode in Truetype0Font on PDFBox 2.0.0?

Here's some code that adds a ToUnicode CMap stream to a font. Obviously I can't do it with your file, so I used one of my own test files. I had to work out each entry separately and didn't do all of them. However, the result is good enough to extract the first word in the green print ("Bedingungen").

The code is somewhat tailored to your scenario. It only touches fonts that have:

  • an Identity-H Encoding entry
  • no ToUnicode entry
  • a specific font name

    try (PDDocument doc = PDDocument.load(f))
    {
        for (int p = 0; p < doc.getNumberOfPages(); ++p)
        {
            PDPage page = doc.getPage(p);
            PDResources res = page.getResources();
            for (COSName fontName : res.getFontNames())
            {
                PDFont font = res.getFont(fontName);
                COSBase encoding = font.getCOSObject().getDictionaryObject(COSName.ENCODING);
                if (!COSName.IDENTITY_H.equals(encoding))
                {
                    continue;
                }
                // get real name (strip the subset prefix, e.g. "ABCDEF+")
                String fname = font.getName();
                int plus = fname.indexOf('+');
                if (plus != -1)
                {
                    fname = fname.substring(plus + 1);
                }
                if (font.getCOSObject().containsKey(COSName.TO_UNICODE))
                {
                    continue;
                }
                System.out.println("File '" + f.getName() + "', page " + (p + 1) + ", " + fontName.getName() + ", " + font.getName());
                if (!fname.startsWith("Calibri-Bold"))
                {
                    continue;
                }
                COSStream toUnicodeStream = new COSStream();
                try (PrintWriter pw = new PrintWriter(toUnicodeStream.createOutputStream(COSName.FLATE_DECODE)))
                {
                    // "9.10 Extraction of Text Content" in the PDF 32000 specification
                    pw.println("/CIDInit /ProcSet findresource begin\n" +
                            "12 dict begin\n" +
                            "begincmap\n" +
                            "/CIDSystemInfo\n" +
                            "<< /Registry (Adobe)\n" +
                            "/Ordering (UCS) /Supplement 0 >> def\n" +
                            "/CMapName /Adobe-Identity-UCS def\n" +
                            "/CMapType 2 def\n" +
                            "1 begincodespacerange\n" +
                            "<0000> <FFFF>\n" +
                            "endcodespacerange\n" +
                            "10 beginbfchar\n" + // number is count of entries
                            "<0001><0020>\n" + // space
                            "<0002><0041>\n" + // A
                            "<0003><0042>\n" + // B
                            "<0004><0044>\n" + // D
                            "<0013><0065>\n" + // e
                            "<0012><0064>\n" + // d
                            "<0017><0069>\n" + // i
                            "<001B><006E>\n" + // n
                            "<0015><0067>\n" + // g
                            "<0020><0075>\n" + // u
                            "endbfchar\n" +
                            "endcmap CMapName currentdict /CMap defineresource pop end end");
                }
                font.getCOSObject().setItem(COSName.TO_UNICODE, toUnicodeStream);
            }
        }
        doc.save("huhu.pdf");
    }
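
Typing the CMap by hand gets error-prone once there are more than a handful of entries, because the number in front of beginbfchar has to match the entry count. Here is a minimal sketch of a generator for the same text. ToUnicodeCMapBuilder and build are hypothetical names of my own, not PDFBox API, and the code only uses the JDK. It assumes every code fits into two bytes (matching the <0000> <FFFF> codespace range above) and it splits the entries into blocks of at most 100, which to my knowledge is the per-block limit in Adobe's CMap specification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of PDFBox): renders a ToUnicode CMap as text. */
final class ToUnicodeCMapBuilder
{
    // Assumption: at most 100 entries are allowed per beginbfchar block.
    private static final int MAX_ENTRIES_PER_BLOCK = 100;

    /** Builds the CMap text from a map of 2-byte character codes to Unicode strings. */
    static String build(Map<Integer, String> codeToUnicode)
    {
        // Sort by code so the output is stable and easy to compare.
        List<Map.Entry<Integer, String>> entries =
                new ArrayList<>(new TreeMap<>(codeToUnicode).entrySet());
        StringBuilder sb = new StringBuilder();
        sb.append("/CIDInit /ProcSet findresource begin\n")
          .append("12 dict begin\n")
          .append("begincmap\n")
          .append("/CIDSystemInfo\n")
          .append("<< /Registry (Adobe)\n")
          .append("/Ordering (UCS) /Supplement 0 >> def\n")
          .append("/CMapName /Adobe-Identity-UCS def\n")
          .append("/CMapType 2 def\n")
          .append("1 begincodespacerange\n")
          .append("<0000> <FFFF>\n")
          .append("endcodespacerange\n");
        for (int start = 0; start < entries.size(); start += MAX_ENTRIES_PER_BLOCK)
        {
            int end = Math.min(start + MAX_ENTRIES_PER_BLOCK, entries.size());
            // The leading number must equal the number of entries in this block.
            sb.append(end - start).append(" beginbfchar\n");
            for (Map.Entry<Integer, String> e : entries.subList(start, end))
            {
                int code = e.getKey();
                if (code < 0 || code > 0xFFFF)
                {
                    throw new IllegalArgumentException("code out of 2-byte range: " + code);
                }
                sb.append(String.format("<%04X><%s>\n", code, utf16Hex(e.getValue())));
            }
            sb.append("endbfchar\n");
        }
        sb.append("endcmap\n")
          .append("CMapName currentdict /CMap defineresource pop\n")
          .append("end\n")
          .append("end\n");
        return sb.toString();
    }

    /** Hex-encodes the UTF-16 code units of the string, four hex digits each. */
    private static String utf16Hex(String s)
    {
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < s.length(); i++)
        {
            hex.append(String.format("%04X", (int) s.charAt(i)));
        }
        return hex.toString();
    }
}
```

The returned string could then be written with pw.print(...) in place of the long literal above.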

Btw, the unreleased 2.1 version of PDFDebugger has some improved features for showing fonts. You can use it to verify that your ToUnicode CMap makes sense.

Detect missing / corrupt Unicode mapping in PDF

A fourth possibility (next to the three given in Aaron Digulla's answer) is to override showGlyph() when extending the PDFTextStripper class:

    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) throws IOException
    {
        super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
        if (unicode == null || unicode.isEmpty())
        {
            // do stuff
        }
    }
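
As one idea for the "do stuff" branch, here is a small JDK-only sketch that collects the codes without a mapping, grouped by font name, so you can report them after the extraction has finished. MissingUnicodeLog and its methods are hypothetical names of my own, not PDFBox classes. I assume you would call record(font.getName(), code, unicode) from the override above.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical collector (not part of PDFBox) for codes without a Unicode mapping. */
final class MissingUnicodeLog
{
    private final Map<String, Set<Integer>> missing = new TreeMap<>();

    /** Remembers the code if the Unicode value is null or empty; otherwise does nothing. */
    void record(String fontName, int code, String unicode)
    {
        if (unicode == null || unicode.isEmpty())
        {
            // A font name may be absent, so fall back to a placeholder key.
            String key = fontName == null ? "(unnamed font)" : fontName;
            missing.computeIfAbsent(key, k -> new TreeSet<>()).add(code);
        }
    }

    /** Returns the collected codes, sorted by font name and by code. */
    Map<String, Set<Integer>> missingCodesByFont()
    {
        return missing;
    }
}
```

A font that shows up here with many codes is a good candidate for a hand-made ToUnicode CMap.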

pdf reading via pdfbox in java

The first file "PnL_500010_0314.pdf"

Indeed, the whole line "Statement of Profit and Loss for the year ended March 31, 2014" and much more cannot be extracted. Inspecting the contents makes the reason obvious: this text is written using a composite font which has neither an Encoding nor a ToUnicode entry that would allow identifying the characters in question.

The showGlyph method of org.apache.pdfbox.text.PDFTextStreamEngine (from which PDFTextStripper is derived) contains the following code shortly before it calls processTextPosition (which PDFTextStripper implements and from which it retrieves its text information):

    // use our additional glyph list for Unicode mapping
    unicode = font.toUnicode(code, glyphList);

    // when there is no Unicode mapping available, Acrobat simply coerces the character code
    // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
    // this, which is why we leave it until this point in PDFTextStreamEngine.
    if (unicode == null)
    {
        if (font instanceof PDSimpleFont)
        {
            char c = (char) code;
            unicode = new String(new char[] { c });
        }
        else
        {
            // Acrobat doesn't seem to coerce composite font's character codes, instead it
            // skips them. See the "allah2.pdf" TestTextStripper file.
            return;
        }
    }

The font in question does not offer any clues for text extraction. Thus, unicode here is null.

Furthermore, the font is composite, not simple. Thus, the else clause is executed and processTextPosition is not even called.

PDFTextStripper, therefore, is not informed at all that the line "Statement of Profit and Loss for the year ended March 31, 2014" even exists!
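
To make the three outcomes explicit, here is a tiny JDK-only model of that decision. It is only a sketch of the logic quoted above, not PDFBox code: GlyphFallbackModel and resolve are hypothetical names, and the boolean simpleFont stands in for the "font instanceof PDSimpleFont" check.

```java
import java.util.Optional;

/** Hypothetical model (not PDFBox code) of the fallback logic quoted above. */
final class GlyphFallbackModel
{
    /**
     * @param mapped     result of the font's Unicode lookup, or null if there is none
     * @param simpleFont true for a simple font, false for a composite font
     * @param code       character code taken from the content stream
     * @return the text that would be reported, or empty if the glyph is skipped
     */
    static Optional<String> resolve(String mapped, boolean simpleFont, int code)
    {
        if (mapped != null)
        {
            return Optional.of(mapped); // a real mapping exists, use it
        }
        if (simpleFont)
        {
            return Optional.of(String.valueOf((char) code)); // coerce the code into a char
        }
        return Optional.empty(); // composite font: the glyph is never reported
    }
}
```

With a composite font and no mapping, resolve(null, false, code) is empty for every code, which corresponds to the whole line going missing.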

If you replace that

    else
    {
        // Acrobat doesn't seem to coerce composite font's character codes, instead it
        // skips them. See the "allah2.pdf" TestTextStripper file.
        return;
    }

in PDFTextStreamEngine.showGlyph with code that sets unicode, e.g. to the Unicode replacement character

    else
    {
        // Use the Unicode replacement character to indicate an unknown character
        unicode = "\uFFFD";
    }

you'll get

57
THIRTY SEVENTH ANNUAL REPORT 2013-14
STANDALONE FINANCIAL STATEMENTS
�������������������������������������������������������������
As per our report attached. Directors
For Deloitte Haskins & Sells LLP Deepak S. Parekh Nasser Munjee R. S. Tarneja
Chartered Accountants �������� B. S. Mehta J. J. Irani
D. N. Ghosh Bimal Jalan
Keki M. Mistry S. A. Dave D. M. Sukthankar
Sanjiv V. Pilgaonkar ���������������
Partner �����������������������
Renu Sud Karnad V. Srinivasa Rangan Girish V. Koliyote
������, May 6, 2014 Managing Director ������������������ �����������������
Notes Previous Year
� in Crore � in Crore
INCOME
����������������������� 23 23,894.03 20,796.95
���������������������������� 24 248.98 315.55
������������ 25 54.66 35.12
Total Revenue 24,197.67 21,147.62
EXPENSES
Finance Cost 26 16,029.37 13,890.89
�������������� 27 279.18 246.19
���������������������� 28 86.98 75.68
�������������� 29 230.03 193.43
������������������������������ 11 & 12 31.87 23.59
Provision for Contingencies 100.00 145.00
Total Expenses 16,757.43 14,574.78

PROFIT BEFORE TAX 7,440.24 6,572.84
�����������
������������� 1,973.00 1,727.68
�������������� 14 27.00 (3.18)
PROFIT FOR THE YEAR 3 5,440.24 4,848.34
EARNINGS PER SHARE��������������� 2) 31
- Basic 34.89 31.84
- Diluted 34.62 31.45
�������������������������������������������������������������

Unfortunately that PDFTextStreamEngine.showGlyph method uses some private class members. Thus, one cannot simply override it in one's own PDFTextStripper class using the original method code with the change indicated above. One either has to replicate nearly all functionality of PDFTextStreamEngine in one's own class, or one has to resort to Java reflection, or one has to patch PDFBox classes themselves.
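
For the reflection option, the general pattern looks like the sketch below. BaseEngine is a stand-in for a library class with private state, and glyphTable is a made-up field name; neither is an actual PDFBox member, so you would have to look up the real field names in the PDFBox source of the version you use. I also assume the library sits on the classpath; with a modular setup on newer Java versions, setAccessible may be refused.

```java
import java.lang.reflect.Field;

/** Stand-in for a library class whose state is private (hypothetical, not PDFBox). */
class BaseEngine
{
    private final String glyphTable = "built-in glyph list";
}

/** Your own subclass, which cannot see glyphTable through normal Java access rules. */
class PatchedEngine extends BaseEngine
{
}

final class ReflectionAccess
{
    /** Reads a private field declared in declaringClass from the given target object. */
    static Object readPrivateField(Object target, Class<?> declaringClass, String fieldName)
    {
        try
        {
            // getDeclaredField looks only at declaringClass, so pass the superclass
            // that actually declares the field, not the subclass.
            Field field = declaringClass.getDeclaredField(fieldName);
            field.setAccessible(true);
            return field.get(target);
        }
        catch (ReflectiveOperationException e)
        {
            throw new IllegalStateException("cannot read field " + fieldName, e);
        }
    }
}
```

A call would look like ReflectionAccess.readPrivateField(new PatchedEngine(), BaseEngine.class, "glyphTable"). Keep in mind that such code breaks silently as soon as the library renames the field.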

This architecture is not exactly perfect.

The second file "Bal_532935_0314.pdf"

The problem with the second file is caused by the same piece of PDFBox code quoted above. This time, though, the font is simple, so the other code block is executed:

    if (font instanceof PDSimpleFont)
    {
        char c = (char) code;
        unicode = new String(new char[] { c });
    }

What happens here is pure guesswork: If there is no information for mapping glyph code to Unicode, let's assume the mapping is Latin-1 which embeds trivially into char. As becomes visible in the OP's second file, this assumption does not always hold.
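
A short JDK-only illustration of why the guess can go wrong. Latin1CoercionDemo is a hypothetical name, and the codes 1, 2, 3 are an invented example of a font with its own private encoding, not values taken from the OP's file.

```java
/** Hypothetical demo (not PDFBox code) of coercing character codes into chars. */
final class Latin1CoercionDemo
{
    /** Treats every code as if it were already a Unicode code point below U+10000. */
    static String coerce(int... codes)
    {
        StringBuilder sb = new StringBuilder();
        for (int code : codes)
        {
            sb.append((char) code);
        }
        return sb.toString();
    }
}
```

If the font happens to use Latin-1 compatible codes, coerce(0x54, 0x61, 0x78) yields "Tax". If the font numbers its glyphs 1, 2, 3 instead, coerce(1, 2, 3) yields three control characters, although the page shows readable text.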

If you don't want PDFBox to make assumptions like this, also replace the if block above with

    if (font instanceof PDSimpleFont)
    {
        // Use the Unicode replacement character to indicate an unknown character
        unicode = "\uFFFD";
    }

This results in

Aries Agro Care Private Limited
1118th Annual Report 2013-14
Balance Sheet as at 31st March, 2014
Particulars Note
No.
As at
31 March, 2014
Rupees
As at
31 March, 2013
Rupees
I. EQUITY AND LIABILITIES
(1) Shareholder's Funds
(a) ������������� 3 100,000 100,000
(b) Reserves and Surplus 4 (2,673,971) ������������
(2,573,971) ������������
(2) Current Liabilities
(a) Short Term Borrowings 5 5,805,535 �����������
(b) Trade Payables 6 159,400 ���������
(c) ������������������������� 7 2,500 22,743
5,967,435 5,934,756
TOTAL 3,393,464 �����������
II. ASSETS
(1) Non-Current Assets
(a) �������������������� � - -
- -
(2) Current Assets
(a) ����������������������� 9 39,605 �������
(b) ����������������������������� 10 3,353,859 ����������
3,393,464 ����������
TOTAL 3,393,464 ����������
��������������������������������
The Notes to Accounts 1 to 23 form part of these Financial Statements
As per our report of even date For and on behalf of the Board
For Kirti D. Shah & Associates
���������������������
�����������������������������
Dr. Jimmy Mirchandani
Director
Kirti D. Shah
Proprietor
Membership No 32371
Dr. Rahul Mirchandani
Director
Place : Mumbai.
Date :- 26th May, 2014.

