How to Extract Subscript/Superscript Properly from a PDF Using iTextSharp

iTextSharp HTML to PDF conversion: how to convert text with superscript and subscript as it is

I came up with a solution: in the HTML I manually created the superscript as follows, and it works (at least for my scenario):

// "html" is only an illustrative variable name
string html =
    "<span style=\"width:500px;\">" +
    "<span>500 m</span>" +
    "<span style=\"vertical-align:super;font-size:8px;\">2</span>" +
    "</span>";

Unable to read text in a specific location in a pdf file using iTextSharp

The reason text extraction does not pick up those texts is pretty simple: they are not part of the static page content but form fields! "Text extraction" in iText (and in the other PDF libraries I know, too) means "extraction of the text of the static page content". Thus, the texts you are missing simply are not subject to text extraction.

If you want to make form field values subject to your text extraction code, too, you first have to flatten the form field visualizations. "Flattening" here means making them part of the static page content and dropping all their form field dynamics.

You can do that by adding code that flattens the PDF and loads the flattened result back into pdfReader, right after the line that reads the PDF:

PdfReader pdfReader = new PdfReader(filePath);

For example:

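// Flatten the form fields into an in-memory copy of the PDF
MemoryStream memoryStream = new MemoryStream();
PdfStamper pdfStamper = new PdfStamper(pdfReader, memoryStream);
pdfStamper.FormFlattening = true;
// Keep memoryStream open when the stamper is closed
pdfStamper.Writer.CloseStream = false;
pdfStamper.Close();

// Rewind the stream and re-read the flattened PDF
memoryStream.Position = 0;
pdfReader = new PdfReader(memoryStream);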

Extracting the text from this re-initialized pdfReader will give you the text from the form fields, too.

Unfortunately, the flattened form text is added at the end of the content stream. As your chosen text extraction strategy, SimpleTextExtractionStrategy, simply returns the text in the order it is drawn, the contents of the former form fields are all extracted at the end.

You can change this by using a different text extraction strategy, i.e. by replacing this line:

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

with one of the following:

  • Using the LocationTextExtractionStrategy (which is part of the iText distribution) already returns a better result; unfortunately the form field values are not exactly on the same base line as the static contents we perceive to be on the same line, so there are some unexpected line breaks.

    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();

  • Using the HorizontalTextExtractionStrategy (from this answer, which contains both a Java and a C# version thereof) the result is even better. Beware, though: this strategy is not universally better, so read the warnings in the answer text.

    ITextExtractionStrategy strategy = new HorizontalTextExtractionStrategy();
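
To see why the choice of strategy changes where the former form field values end up, here is a minimal sketch in plain Java (no iText classes involved; Java is used only because the referenced strategies also exist in a Java version). The class ExtractionOrderDemo, its Chunk type, and all coordinates are made up for illustration. It contrasts concatenating chunks in draw order with a simplified top-to-bottom, left-to-right ordering; the real LocationTextExtractionStrategy is more elaborate than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustration only: a text chunk reduced to its text and start position.
final class ExtractionOrderDemo {
    static final class Chunk {
        final String text;
        final float x;
        final float y;
        Chunk(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
    }

    // Made-up page: two static labels drawn first, two flattened field values drawn last.
    // The values sit slightly below the base line of their labels.
    static List<Chunk> sample() {
        return Arrays.asList(
                new Chunk("Name:", 50f, 700.0f),
                new Chunk("City:", 50f, 680.0f),
                new Chunk("Smith", 120f, 699.5f),
                new Chunk("Oslo", 120f, 679.5f));
    }

    private static String join(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(c.text);
        }
        return sb.toString();
    }

    // Draw order: keep the chunks exactly as they appear in the content stream.
    static String inDrawOrder(List<Chunk> chunks) {
        return join(chunks);
    }

    // Simplified location order: larger y first (top of page), then smaller x first.
    static String inLocationOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        return join(sorted);
    }

    public static void main(String[] args) {
        System.out.println(inDrawOrder(sample()));
        System.out.println(inLocationOrder(sample()));
    }
}
```

With this sample data the draw order gives "Name: City: Smith Oslo", while the location order gives "Name: Smith City: Oslo". The half-unit offset between label and value is the kind of base line mismatch that can still cause unexpected line breaks once a strategy also has to decide where one line ends and the next begins.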

How to extract text 'marked for redaction' from a PDF using iTextSharp?

For the sake of completeness, this question was answered on the iText mailing list:
http://thread.gmane.org/gmane.comp.java.lib.itext.general/62918

iText: Extracted text from pdf file using LocationTextExtractionStrategy is in wrong order

The cause is simply that "Total For Line Extended Price" is at a y coordinate of 507.37 while "Part Description Quantity Unit Price" is at a y coordinate of 506.42.

The LocationTextExtractionStrategy allows for small variations by only considering the integer part of the y coordinates but even the integer parts differ here. Thus, it assumes the former headings to be on a line above the latter ones and outputs its results accordingly.
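
The integer-part comparison can be illustrated with a few lines of plain Java. This is a simplified model of the behaviour described above, not the library's code; the class and method names are hypothetical.

```java
// Simplified model: two chunks count as being on the same line only if the
// integer parts of their y coordinates are equal.
final class IntegerLineTest {
    static boolean sameLine(float y1, float y2) {
        return (int) y1 == (int) y2;
    }

    public static void main(String[] args) {
        // The two heading groups from the question: 507 vs 506, so different lines.
        System.out.println(sameLine(507.37f, 506.42f)); // prints false
        // A variation that stays within one integer step: both 506, so same line.
        System.out.println(sameLine(506.90f, 506.42f)); // prints true
    }
}
```

Note that the tolerance depends on where the integer boundary falls: 506.90 and 506.42 differ by 0.48 and are accepted, while 507.37 and 506.42 differ by 0.95 and are split.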

In case of such variations usually a first attempt might be to try the SimpleTextExtractionStrategy. Unfortunately this does not help here as the former text actually is drawn before the latter text. Thus, this strategy also returns the headings in the wrong order.

In such a situation you need a strategy that works differently, e.g. HorizontalTextExtractionStrategy or HorizontalTextExtractionStrategy2 from this answer (which one depends on your iText version: the former for versions up to iText 5.5.8, the latter for the current development code, 5.5.9-SNAPSHOT). Using it you'll get:

Part Description Quantity Unit Price Total For Line Extended Price
Landing Fee 1.00 407.84 $ USD 407.84 407.84 $
Parking 1.00 101.96$ USD 101.96 101.96$
??? 1.00 51.65$ USD 51.65 51.65$
Pax Baggage Handling Fee 5.00 8.49$ USD 42.45 42.45 $
Pax Airport Tax 5.00 26.36 $ USD 131.80 131.80$
GA terminal for crew on Arr ferry fit 1.00 125.00$ USD 125.00 125.00$
VIP lounge for Pax on Dep. 5.00 124.00$ USD 620.00 620.00 $
GA terminal for crew on dep. 1.00 125.00$ USD 125.00 125.00$
VIP lounge for Guest on Dep. 1.00 38.00$ USD 38.00 38.00 $
Crew transfer on arr 1.00 70.00 $ USD 70.00 70.00 $
Crew transfer on dep 1.00 70.00 $ USD 70.00 70.00 $
Lavatory Service 1.00 75.00 $ USD 75.00 75.00 $
Catering-ISS 1.00 1,324.28 $ USD 1,324.28 1,324.28 $
Ground Handling 1.00 190.00$ USD 190.00 190.00$
Pax Handling 1.00 190.00$ USD 190.00 190.00$
Push Back 1.00 83.00 $ USD 83.00 83.00 $
Towing 1.00 110.00$ USD 110.00 110.00$

(result of using TextExtraction test method testLocation_text_extraction_test)

Unfortunately, though, these strategies fail if there are overlapping lines in different side-by-side columns, e.g. in your document the invoice recipient address and the information to its right.

You might either try to tweak the horizontal strategies (e.g. by also analyzing horizontal gaps separating columns) or try a combined approach, using the output of multiple strategies for the same document.
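
The failure mode with side-by-side columns can be made concrete with a small sketch in plain Java. It assumes, as a simplification and not as a description of the code in the referenced answer, that chunks are put on the same line whenever their vertical extents overlap, directly or through a chain of other chunks. The class VerticalBands, the chunk helper, the glyph box of 2 units below and 8 units above the base line, and all x coordinates are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class VerticalBands {
    static final class Chunk {
        final String text;
        final float x;
        final float bottom;
        final float top;
        Chunk(String text, float x, float bottom, float top) {
            this.text = text; this.x = x; this.bottom = bottom; this.top = top;
        }
    }

    // Assumed glyph box: 2 units below and 8 units above the base line.
    static Chunk chunk(String text, float x, float baseline) {
        return new Chunk(text, x, baseline - 2f, baseline + 8f);
    }

    // Merges vertically overlapping chunks into bands, then reads each band left to right.
    static List<String> groupIntoLines(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.top)); // topmost first
        List<List<Chunk>> bands = new ArrayList<>();
        float bandBottom = 0f;
        for (Chunk c : sorted) {
            if (bands.isEmpty() || c.top < bandBottom) {
                bands.add(new ArrayList<>());   // vertical gap found: start a new line
                bandBottom = c.bottom;
            } else {
                bandBottom = Math.min(bandBottom, c.bottom); // extend the current line
            }
            bands.get(bands.size() - 1).add(c);
        }
        List<String> lines = new ArrayList<>();
        for (List<Chunk> band : bands) {
            band.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            StringBuilder sb = new StringBuilder();
            for (Chunk c : band) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(c.text);
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    // Headings at y 507.37 and 506.42 plus one table row further down (row position made up).
    static List<Chunk> headings() {
        return Arrays.asList(
                chunk("Total For Line Extended Price", 400f, 507.37f),
                chunk("Part Description Quantity Unit Price", 50f, 506.42f),
                chunk("Landing Fee", 50f, 490f));
    }

    // Two columns whose lines are offset by half a line height against each other.
    static List<Chunk> columns() {
        return Arrays.asList(
                chunk("L1", 50f, 700f), chunk("L2", 50f, 688f),
                chunk("R1", 300f, 694f), chunk("R2", 300f, 682f));
    }
}
```

Under these assumptions the two heading groups end up on one line in the intended order, followed by the table row on its own line. The four column chunks, however, collapse into a single line because each one overlaps its neighbour in the other column, which is why analyzing horizontal gaps between columns is one possible tweak.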

ItextSharp anagram output when extract text from rectangle

Have you tried customizing the working SimpleTextExtractionStrategy so that it processes only the rectangle instead of the full page?

You can find the full code in the GitHub project here: https://github.com/itext/itextsharp/blob/75f05dd7d87797b86c44649f5f96df2d90d730e8/src/extras/itextsharp.tests/iTextSharp/text/pdf/parser/SimpleTextExtractionStrategyTest.cs

iText Java not parsing text properly from PDF

The cause

iText with its standard text extraction strategy extracts the line shown in the screenshot as

SUBMITTALS
1.2

because the "1.2" actually is located (minutely) below the "SUBMITTALS":

q .75000 0 0 .75000 0 792 cm 
1 1 1 rg 0 0 816 -1056 re f
q .32000 0 0 .32000 0 0 cm
q
...
q .20823 0 0 .20807 0 0 cm
BT /F2 220 Tf 0 g 2340 -6628 Td(SUBMITTALS) Tj ET Q
q .20823 0 0 .20807 0 0 cm
BT /F2 220 Tf 0 g 1440 -6634 Td(1.2) Tj ET Q

As you can see in this excerpt of the content drawing instructions from the PDF, the "1.2" is drawn at the scaled y coordinate -6634 while "SUBMITTALS" is drawn at -6628, i.e. "1.2" is drawn 6 scaled units below "SUBMITTALS".

This makes iText put it onto a separate following line.
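
To get a feeling for how small that offset is, the scaled coordinates can be converted to page coordinates. The following plain Java sketch does the arithmetic under two assumptions that the excerpt does not confirm: the elided instructions ("...") leave the graphics state unchanged, so only the three cm scalings shown are in effect, and the text position is set by the Td operands alone. The class and method names are hypothetical.

```java
// Worked arithmetic for the content stream excerpt above.
// Assumption: only the three "cm" instructions shown affect the y coordinate.
final class ScaledCoordinates {
    // Vertical translation from "q .75000 0 0 .75000 0 792 cm"
    static final double OFFSET_Y = 792;
    // Combined vertical scale: .75000 * .32000 * .20807
    static final double SCALE_Y = 0.75 * 0.32 * 0.20807;

    static double toPageY(double scaledY) {
        return OFFSET_Y + SCALE_Y * scaledY;
    }

    public static void main(String[] args) {
        double submittals = toPageY(-6628); // about 461.02
        double number = toPageY(-6634);     // about 460.72
        System.out.println("SUBMITTALS y = " + submittals);
        System.out.println("1.2        y = " + number);
        System.out.println("difference   = " + (submittals - number)); // about 0.30
        // Integer parts 461 and 460 differ even though the offset is tiny.
        System.out.println((int) submittals == (int) number); // prints false
    }
}
```

Under these assumptions the 6 scaled units amount to only about 0.3 page units, but the two y values fall on either side of 461. If the integer-part comparison described earlier on this page applies, that is enough to split the text into two lines.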

A solution

You can use the HorizontalTextExtractionStrategy2 from this answer instead of the default extraction strategy, cf. TextExtraction.java test testTestPDF, and get this output:

1.2 SUBMITTALS 

(For details on the use of that strategy, see the answer mentioned above. HorizontalTextExtractionStrategy2 is the updated strategy from the section "UPDATE: Changes in LocationTextExtractionStrategy" of that answer.)


