Getting Coordinates of String Using ITextExtractionStrategy and LocationTextExtractionStrategy in iTextSharp

Here is a very, very simple version of an implementation.

Before implementing anything, it is very important to know that PDFs have zero concept of "words", "paragraphs", "sentences", etc. Also, text within a PDF is not necessarily laid out left to right and top to bottom, and this has nothing to do with non-LTR languages. The phrase "Hello World" could be written into the PDF as:

Draw H at (10, 10)
Draw ell at (20, 10)
Draw rld at (90, 10)
Draw o Wo at (50, 20)

It could also be written as:

Draw Hello World at (10,10)

The ITextExtractionStrategy interface that you need to implement has a method called RenderText that gets called once for every chunk of text within a PDF. Notice I said "chunk" and not "word". In the first example above the method would be called four times for those two words. In the second example it would be called once for those two words. This is the very important part to understand. PDFs don't have words and because of this, iTextSharp doesn't have words either. The "word" part is 100% up to you to solve.

Also along these lines, as I said above, PDFs don't have paragraphs. The reason to be aware of this is that PDFs cannot wrap text to a new line. Any time you see something that looks like a paragraph return, you are actually seeing a brand new text drawing command that has a different y coordinate than the previous line. See this for further discussion.

The code below is a very simple implementation. For it I'm subclassing LocationTextExtractionStrategy, which already implements ITextExtractionStrategy. On each call to RenderText() I find the rectangle of the current chunk (using Mark's code here) and store it for later. I'm using this simple helper class for storing these chunks and rectangles:

//Helper class that stores our rectangle and text
public class RectAndText {
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public RectAndText(iTextSharp.text.Rectangle rect, String text) {
        this.Rect = rect;
        this.Text = text;
    }
}

And here's the subclass:

using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
            bottomLeft[Vector.I1],
            bottomLeft[Vector.I2],
            topRight[Vector.I1],
            topRight[Vector.I2]
        );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}

And finally an implementation of the above:

//Our test file
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");

//Create our test file, nothing special
using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
    using (var doc = new Document()) {
        using (var writer = PdfWriter.GetInstance(doc, fs)) {
            doc.Open();

            doc.Add(new Paragraph("This is my sample file"));

            doc.Close();
        }
    }
}

//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();

//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
    var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}

//Loop through each chunk found
foreach (var p in t.myPoints) {
    Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}

I can't stress enough that the above does not take "words" into account; that'll be up to you. The TextRenderInfo object that gets passed into RenderText has a method called GetCharacterRenderInfos() that you might be able to use to get more information. You might also want to use GetBaseline() instead of GetDescentLine() if you don't care about descenders in the font.

EDIT

(I had a great lunch so I'm feeling a little more helpful.)

Here's an updated version of MyLocationTextExtractionStrategy that does what my comments below say: it takes a string to search for and searches each chunk for that string. For all the reasons listed above, this will not work in some/many/most/all cases. If the substring exists multiple times in a single chunk, it will only return the first instance. Ligatures and diacritics could also mess with this.

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //The string that we're searching for
    public String TextToSearchFor { get; set; }

    //How to compare strings
    public System.Globalization.CompareOptions CompareOptions { get; set; }

    public MyLocationTextExtractionStrategy(String textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None) {
        this.TextToSearchFor = textToSearchFor;
        this.CompareOptions = compareOptions;
    }

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //See if the current chunk contains the text
        var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);

        //If not found bail
        if (startPosition < 0) {
            return;
        }

        //Grab the individual characters (Skip/Take/ToList need "using System.Linq;")
        var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();

        //Bail if no characters came back, otherwise First()/Last() would throw
        if (chars.Count == 0) {
            return;
        }

        //Grab the first and last character
        var firstChar = chars.First();
        var lastChar = chars.Last();

        //Get the bounding box for the matched text
        var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
        var topRight = lastChar.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
            bottomLeft[Vector.I1],
            bottomLeft[Vector.I2],
            topRight[Vector.I1],
            topRight[Vector.I2]
        );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
    }
}

You would use this the same as before but now the constructor has a single required parameter:

var t = new MyLocationTextExtractionStrategy("sample");
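As noted above, the search only reports the first hit in each chunk. A minimal sketch of how every hit in one chunk's text could be collected is shown here. ChunkSearch and FindAll are hypothetical names, not part of iTextSharp, and the loop assumes each match is exactly as long as the search string (the same assumption the Skip/Take code above makes).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

//Hypothetical helper, not part of iTextSharp
public static class ChunkSearch
{
    //Returns the start index of every non-overlapping occurrence of needle in chunkText
    public static List<int> FindAll(string chunkText, string needle,
                                    CompareOptions options = CompareOptions.None)
    {
        var hits = new List<int>();
        if (string.IsNullOrEmpty(chunkText) || string.IsNullOrEmpty(needle))
        {
            return hits;
        }

        CompareInfo compareInfo = CultureInfo.CurrentCulture.CompareInfo;
        int start = 0;
        while (start < chunkText.Length)
        {
            //Same culture-aware search as above, but resumed after the previous hit
            int index = compareInfo.IndexOf(chunkText, needle, start, options);
            if (index < 0)
            {
                break;
            }
            hits.Add(index);

            //Assumption: the match is needle.Length characters long
            start = index + needle.Length;
        }
        return hits;
    }
}
```

Inside RenderText you would then loop over the returned indexes and run the Skip(index).Take(TextToSearchFor.Length) rectangle code once per hit instead of once per chunk.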

c# itextsharp, locate words not chunks in page with their location for adding sticky notes

It looks like chunk.m_text only contains one letter at a time, which is why this will never be true:

if (chunk.m_text.Trim() == "MCU_MOSI")

What you could do instead is have each chunk text added to a string and see if it contains your text.

PdfReader reader = new PdfReader(path);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);

LocationTextExtractionStrategyEx strategy;
string str = string.Empty;

for (int i = 5; i <= 5; i++) // reader.NumberOfPages
{
    strategy = parser.ProcessContent(i, new LocationTextExtractionStrategyEx("MCU_MOSI", 0));
    var x = strategy.m_SearchResultsList;
    foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk chunk in strategy.m_DocChunks)
    {
        //Accumulate the chunk texts until the search term shows up
        str += chunk.m_text;
        if (str.Contains("MCU_MOSI"))
        {
            //Reset for the next match; this chunk holds the last character of the term
            str = string.Empty;
            Vector location = chunk.m_endLocation;
            Console.WriteLine("Bingo");
        }
    }
}

Note that for this example I made m_endLocation public so the location can be read.
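If you also need to know which chunk a match starts in (not only where it ends), one option is to remember which chunk supplied each character of the joined string. The following is only a sketch of that bookkeeping on plain strings. CrossChunkSearch is a hypothetical name, and it assumes the chunk texts are passed in reading order with nothing inserted between them.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

//Hypothetical helper that works on plain chunk texts, not on iTextSharp types
public static class CrossChunkSearch
{
    //Returns one { firstChunkIndex, lastChunkIndex } pair per occurrence of needle
    public static List<int[]> Find(IList<string> chunkTexts, string needle)
    {
        var results = new List<int[]>();
        if (string.IsNullOrEmpty(needle))
        {
            return results;
        }

        var all = new StringBuilder();
        var owner = new List<int>(); //owner[i] = index of the chunk that supplied character i
        for (int c = 0; c < chunkTexts.Count; c++)
        {
            string text = chunkTexts[c] ?? string.Empty;
            all.Append(text);
            for (int k = 0; k < text.Length; k++)
            {
                owner.Add(c);
            }
        }

        string joined = all.ToString();
        int pos = joined.IndexOf(needle, StringComparison.Ordinal);
        while (pos >= 0)
        {
            //Map the first and last matched character back to their chunks
            results.Add(new[] { owner[pos], owner[pos + needle.Length - 1] });
            pos = joined.IndexOf(needle, pos + needle.Length, StringComparison.Ordinal);
        }
        return results;
    }
}
```

With the two chunk indexes you can take the start location from the first chunk and the end location from the last chunk to build a rectangle for the sticky note.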

iTextSharp extract each character and getRectangle

The text extraction strategies bundled with iTextSharp (in particular the LocationTextExtractionStrategy used by default by the PdfTextExtractor.GetTextFromPage overload without a strategy argument) only allow direct access to the collected plain text, not to positions.

Chris Haas' MyLocationTextExtractionStrategy

@Chris Haas in his old answer here presents the following extension of the LocationTextExtractionStrategy:

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
            bottomLeft[Vector.I1],
            bottomLeft[Vector.I2],
            topRight[Vector.I1],
            topRight[Vector.I2]
        );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}

which makes use of this helper class:

//Helper class that stores our rectangle and text
public class RectAndText {
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public RectAndText(iTextSharp.text.Rectangle rect, String text) {
        this.Rect = rect;
        this.Text = text;
    }
}

This strategy makes the text chunks and their enclosing rectangles available in the public member List<RectAndText> myPoints, which you can access like this:

//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();

//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
    var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}

//Loop through each chunk found
foreach (var p in t.myPoints) {
    Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}

For your task, to parse an entire PDF character by character and get the ASCII value, font, and rectangle of each character, only two details are wrong here:

  • the text chunks returned like that may contain multiple characters
  • the font information is not provided.

Thus, we have to tweak that a bit:

A new CharLocationTextExtractionStrategy

Beyond what the MyLocationTextExtractionStrategy class does, the CharLocationTextExtractionStrategy splits the input by glyph and also provides the font name:

public class CharLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
    //Hold each coordinate
    public List<RectAndTextAndFont> myPoints = new List<RectAndTextAndFont>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo wholeRenderInfo)
    {
        base.RenderText(wholeRenderInfo);

        //Walk the chunk one character at a time
        foreach (TextRenderInfo renderInfo in wholeRenderInfo.GetCharacterRenderInfos())
        {
            //Get the bounding box for this character
            var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
            var topRight = renderInfo.GetAscentLine().GetEndPoint();

            //Create a rectangle from it
            var rect = new iTextSharp.text.Rectangle(
                bottomLeft[Vector.I1],
                bottomLeft[Vector.I2],
                topRight[Vector.I1],
                topRight[Vector.I2]
            );

            //Add this to our main collection
            this.myPoints.Add(new RectAndTextAndFont(rect, renderInfo.GetText(), renderInfo.GetFont().PostscriptFontName));
        }
    }
}

//Helper class that stores our rectangle, text, and font
public class RectAndTextAndFont
{
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public String Font;
    public RectAndTextAndFont(iTextSharp.text.Rectangle rect, String text, String font)
    {
        this.Rect = rect;
        this.Text = text;
        this.Font = font;
    }
}

Using this strategy like this

CharLocationTextExtractionStrategy strategy = new CharLocationTextExtractionStrategy();

using (var pdfReader = new PdfReader(testFile))
{
    PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
}

foreach (var p in strategy.myPoints)
{
    Console.WriteLine(string.Format("<{0}> in {3} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom, p.Font));
}

you get the information by character and including the font.

Retrieve the respective coordinates of all words on the page with itextsharp

(I'm mostly working with the Java library iText, not with the .Net library iTextSharp; thus, please ignore some Java-isms here, everything should be easy to translate.)

For extracting contents of a page using iText(Sharp), you employ the classes in the parser package to feed it after some preprocessing to a RenderListener of your choice.

In a context in which you are only interested in the text, you most commonly use a TextExtractionStrategy which is derived from RenderListener and adds a single method getResultantText to retrieve the aggregated text from the page.

As the initial intent of text parsing in iText was to implement this use case, most existing RenderListener samples are TextExtractionStrategy implementations and only make the text available.

Therefore, you will have to implement your own RenderListener, which you already seem to have christened TextWithPositionExtractionStategy.

Just like there is both a SimpleTextExtractionStrategy (which is implemented with some assumptions about the structure of the page content operators) and a LocationTextExtractionStrategy (which does not have the same assumptions but is somewhat more complicated), you might want to start with an implementation that makes some assumptions.

Thus, just like in the case of the SimpleTextExtractionStrategy, in your first, simple implementation you expect the text rendering events forwarded to your listener to arrive line by line, and on the same line from left to right. This way, as soon as you find a horizontal gap or a punctuation mark, you know your current word is finished and you can process it.

In contrast to the text extraction strategies you don't need a StringBuffer member to collect your result but instead a list of some "word with position" structure. Furthermore you need some member variable to hold the TextRenderInfo events you already collected for this page but could not finally process (you may retrieve a word in several separate events).

As soon as you (i.e. your renderText method) are called for a new TextRenderInfo object, you should operate like this (pseudo-code):

if (unprocessedTextRenderInfos not empty)
{
    if (isNewLine // Check this like the simple text extraction strategy checks for hardReturn
        || isGapFromPrevious) // Check this like the simple text extraction strategy checks whether to insert a space
    {
        process(unprocessedTextRenderInfos);
        unprocessedTextRenderInfos.clear();
    }
}

split new TextRenderInfo using its getCharacterRenderInfos() method;
while (characterRenderInfos contain word end)
{
    add characterRenderInfos up to excluding the white space/punctuation to unprocessedTextRenderInfos;
    process(unprocessedTextRenderInfos);
    unprocessedTextRenderInfos.clear();
    remove used render infos from characterRenderInfos;
}
add remaining characterRenderInfos to unprocessedTextRenderInfos;

In process(unprocessedTextRenderInfos) you extract the information you need from the unprocessedTextRenderInfos; you concatenate the individual text contents to a word and take the coordinates you want; if you merely want starting coordinates, you take those from the first of those unprocessed TextRenderInfos. If you need more data, you also use the data from the other TextRenderInfos. With these data you fill a "word with position" structure and add it to your result list.

When page processing is finished, you have to once more call process(unprocessedTextRenderInfos) and unprocessedTextRenderInfos.clear(); alternatively you may do that in the endTextBlock method.
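To make the pseudo-code above a bit more concrete, here is a rough C# sketch of the same bookkeeping. It is deliberately independent of iText: Glyph is a hypothetical stand-in for one character-level TextRenderInfo, WordCollector, Word and GapTolerance are names made up for this illustration, and the tolerance value is an arbitrary assumption you would have to tune.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

//Hypothetical stand-in for one character-level render info
public class Glyph
{
    public char Char;
    public float Left, Right, Baseline;
    public Glyph(char c, float left, float right, float baseline)
    {
        Char = c; Left = left; Right = right; Baseline = baseline;
    }
}

//The "word with position" structure
public class Word
{
    public string Text;
    public float Left, Right, Baseline;
}

public class WordCollector
{
    public readonly List<Word> Words = new List<Word>();
    private readonly List<Glyph> pending = new List<Glyph>();
    private const float GapTolerance = 1f; //assumed threshold, in PDF units

    //Call once per chunk, with the chunk already split into characters
    public void RenderChunk(IList<Glyph> glyphs)
    {
        if (glyphs.Count == 0) return;

        if (pending.Count > 0)
        {
            Glyph last = pending[pending.Count - 1];
            bool isNewLine = Math.Abs(glyphs[0].Baseline - last.Baseline) > GapTolerance;
            bool isGap = glyphs[0].Left - last.Right > GapTolerance;
            if (isNewLine || isGap) Flush();
        }

        foreach (Glyph g in glyphs)
        {
            //White space and punctuation end the current word and are dropped
            if (char.IsWhiteSpace(g.Char) || char.IsPunctuation(g.Char)) Flush();
            else pending.Add(g);
        }
    }

    //Call once more when the page is finished
    public void Flush()
    {
        if (pending.Count == 0) return;
        var text = new StringBuilder();
        foreach (Glyph g in pending) text.Append(g.Char);
        Words.Add(new Word
        {
            Text = text.ToString(),
            Left = pending[0].Left,
            Right = pending[pending.Count - 1].Right,
            Baseline = pending[0].Baseline
        });
        pending.Clear();
    }
}
```

Be aware that char.IsPunctuation also returns true for '_' and '-', so identifiers such as MCU_MOSI would be split by this sketch; adjust the word-end test to whatever counts as a word in your documents.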

Having done this, you might feel ready to implement the slightly more complex variant which does not have the same assumptions concerning the page content structure. ;)

How to get the text position from the pdf page in iText 7

First, SimpleTextExtractionStrategy is not exactly the 'smartest' strategy (as the name would suggest).

Second, if you want the position you're going to have to do a lot more work. TextExtractionStrategy assumes you are only interested in the text.

Possible implementation:

  • implement IEventListener
  • get notified for all events that render text, and store the corresponding TextRenderInfo object
  • once you're finished with the document, sort these objects based on their position in the page
  • loop over this list of TextRenderInfo objects, they offer both the text being rendered and the coordinates

how to:

  1. implement ITextExtractionStrategy (or extend an existing implementation)
  2. use PdfTextExtractor.getTextFromPage(doc.getPage(pageNr), strategy), where strategy denotes the strategy you created in step 1
  3. your strategy should be set up to keep track of locations for the text it processed

ITextExtractionStrategy has the following method in its interface:

@Override
public void eventOccurred(IEventData data, EventType type) {

    // you can first check the type of the event
    if (!type.equals(EventType.RENDER_TEXT))
        return;

    // now it is safe to cast
    TextRenderInfo renderInfo = (TextRenderInfo) data;
}

Keep in mind that rendering instructions in a PDF do not need to appear in reading order. The text "Lorem Ipsum Dolor Sit Amet" could be rendered with instructions similar to:

render "Ipsum Do"
render "Lorem "
render "lor Sit Amet"

You will have to do some clever merging (depending on how far apart two TextRenderInfo objects are) and sorting (to get all the TextRenderInfo objects in the proper reading order).

Once that's done, it should be easy.
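The sorting part could look roughly like the sketch below. It works on a hypothetical PositionedText class instead of real TextRenderInfo objects, and ReadingOrder, Join and lineTolerance are names and values assumed for this illustration. It also assumes the usual PDF coordinate system where y grows upwards, so a larger y means higher on the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

//Hypothetical stand-in for a stored render event: text plus its start point
public class PositionedText
{
    public string Text;
    public float X, Y;
    public PositionedText(string text, float x, float y)
    {
        Text = text; X = x; Y = y;
    }
}

public static class ReadingOrder
{
    //Joins chunks top to bottom, and left to right within a line
    public static string Join(IEnumerable<PositionedText> chunks, float lineTolerance = 2f)
    {
        var result = new StringBuilder();
        var line = new List<PositionedText>();

        //Larger y first, because y grows upwards in PDF user space
        foreach (PositionedText chunk in chunks.OrderByDescending(item => item.Y))
        {
            //A chunk too far below the first chunk of the line starts a new line
            if (line.Count > 0 && Math.Abs(line[0].Y - chunk.Y) > lineTolerance)
            {
                AppendLine(result, line);
                line.Clear();
            }
            line.Add(chunk);
        }
        AppendLine(result, line);
        return result.ToString();
    }

    private static void AppendLine(StringBuilder result, List<PositionedText> line)
    {
        if (line.Count == 0) return;
        if (result.Length > 0) result.Append('\n');
        foreach (PositionedText chunk in line.OrderBy(item => item.X))
        {
            result.Append(chunk.Text);
        }
    }
}
```

This only covers the sorting. Deciding whether a space has to be inserted between two neighbouring chunks (the merging part) still depends on the distance between them and is left out here.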

Unable to read text in a specific location in a pdf file using iTextSharp

The reason for text extraction not extracting those texts is pretty simple: Those texts are not part of the static page content but form fields! But "Text extraction" in iText (and other PDF libraries I know, too) is considered to mean "extraction of the text of the static page content". Thus, those texts you miss simply are not subject to text extraction.

If you want to make form field values subject to your text extraction code, too, you first have to flatten the form field visualizations. "Flattening" here means making them part of the static page content and dropping all their form field dynamics.

You can do that by adding code right after the line that reads the PDF:

PdfReader pdfReader = new PdfReader(filePath);

The added code flattens this PDF and loads the flattened PDF into pdfReader, e.g. like this:

MemoryStream memoryStream = new MemoryStream();
PdfStamper pdfStamper = new PdfStamper(pdfReader, memoryStream);
pdfStamper.FormFlattening = true;
pdfStamper.Writer.CloseStream = false;
pdfStamper.Close();

memoryStream.Position = 0;
pdfReader = new PdfReader(memoryStream);

Extracting the text from this re-initialized pdfReader will give you the text from the form fields, too.

Unfortunately, the flattened form text is added at the end of the content stream. As your chosen text extraction strategy SimpleTextExtractionStrategy simply returns the text in the order it is drawn, the former form fields contents all are extracted at the end.

You can change this by using a different text extraction strategy, i.e. by replacing this line:

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

with one of the following:

  • Using the LocationTextExtractionStrategy (which is part of the iText distribution) already returns a better result; unfortunately the form field values are not exactly on the same base line as the static contents we perceive to be on the same line, so there are some unexpected line breaks.

    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();

  • Using the HorizontalTextExtractionStrategy (from this answer, which contains both a Java and a C# version thereof) the result is even better. Beware, though, this strategy is not universally better; read the warnings in the answer text.

    ITextExtractionStrategy strategy = new HorizontalTextExtractionStrategy();