iTextSharp - How to Get the Position of a Word on a Page

Retrieve the respective coordinates of all words on the page with iTextSharp

(I'm mostly working with the Java library iText, not with the .Net library iTextSharp; thus, please ignore some Java-isms here, everything should be easy to translate.)

To extract the contents of a page using iText(Sharp), you use the classes in the parser package; after some preprocessing, they feed the page content to a RenderListener of your choice.

In a context in which you are only interested in the text, you most commonly use a TextExtractionStrategy which is derived from RenderListener and adds a single method getResultantText to retrieve the aggregated text from the page.

As the initial intent of text parsing in iText was to implement this use case, most existing RenderListener samples are TextExtractionStrategy implementations and only make the text available.

Therefore, you will have to implement your own RenderListener, which you already seem to have christened TextWithPositionExtractionStategy.

Just like there is both a SimpleTextExtractionStrategy (which is implemented with some assumptions about the structure of the page content operators) and a LocationTextExtractionStrategy (which does not have the same assumptions but is somewhat more complicated), you might want to start with an implementation that makes some assumptions.

Thus, just like SimpleTextExtractionStrategy, your first, simple implementation expects the text rendering events forwarded to your listener to arrive line by line, and within a line from left to right. This way, as soon as you find a horizontal gap or punctuation, you know your current word is finished and you can process it.

In contrast to the text extraction strategies you don't need a StringBuffer member to collect your result but instead a list of some "word with position" structure. Furthermore you need some member variable to hold the TextRenderInfo events you already collected for this page but could not finally process (you may retrieve a word in several separate events).

As soon as you (i.e. your renderText method) are called for a new TextRenderInfo object, you should operate like this (pseudo-code):

if (unprocessedTextRenderInfos not empty)
{
    if (isNewLine // Check this like the simple text extraction strategy checks for hardReturn
        || isGapFromPrevious) // Check this like the simple text extraction strategy checks whether to insert a space
    {
        process(unprocessedTextRenderInfos);
        unprocessedTextRenderInfos.clear();
    }
}

split new TextRenderInfo using its getCharacterRenderInfos() method;
while (characterRenderInfos contain word end)
{
    add characterRenderInfos up to, but excluding, the white space/punctuation to unprocessedTextRenderInfos;
    process(unprocessedTextRenderInfos);
    unprocessedTextRenderInfos.clear();
    remove used render infos from characterRenderInfos;
}
add remaining characterRenderInfos to unprocessedTextRenderInfos;

In process(unprocessedTextRenderInfos) you extract the information you need from the unprocessedTextRenderInfos; you concatenate the individual text contents to a word and take the coordinates you want; if you merely want starting coordinates, you take those from the first of those unprocessed TextRenderInfos. If you need more data, you also use the data from the other TextRenderInfos. With these data you fill a "word with position" structure and add it to your result list.

When page processing is finished, you have to once more call process(unprocessedTextRenderInfos) and unprocessedTextRenderInfos.clear(); alternatively you may do that in the endTextBlock method.
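
The steps above can be sketched in plain Java without any iText dependency. Everything here is an assumption made for illustration: Glyph is a hypothetical stand-in for one character-level TextRenderInfo (text, start and end x, baseline y), PositionedWord is the "word with position" structure, and the two thresholds are arbitrary values; a real listener would derive them from the font's space width, as SimpleTextExtractionStrategy does. As a simplification, the new-line and gap checks run for every character rather than once per event.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Collects words with positions from character-level text events.
final class WordCollector {
    private static final float GAP = 2f;            // assumed gap threshold, in points
    private static final float LINE_TOLERANCE = 1f; // assumed baseline tolerance, in points

    private final List<Glyph> unprocessed = new ArrayList<>();
    private final List<PositionedWord> words = new ArrayList<>();

    // Called once per text event with that event's characters in drawing order.
    void renderText(List<Glyph> glyphs) {
        for (Glyph g : glyphs) {
            if (!unprocessed.isEmpty()) {
                Glyph last = unprocessed.get(unprocessed.size() - 1);
                boolean newLine = Math.abs(g.baselineY - last.baselineY) > LINE_TOLERANCE;
                boolean gap = g.startX - last.endX > GAP;
                if (newLine || gap) {
                    flush();
                }
            }
            if (isWordEnd(g.text)) {
                flush();            // the separator itself is not part of any word
            } else {
                unprocessed.add(g);
            }
        }
    }

    // Call once when the page is finished so the last pending word is not lost.
    List<PositionedWord> finish() {
        flush();
        return words;
    }

    private static boolean isWordEnd(String s) {
        return s.trim().isEmpty() || ".,;:!?".contains(s);
    }

    // Turns the pending glyphs into one word positioned at its first glyph.
    private void flush() {
        if (unprocessed.isEmpty()) return;
        StringBuilder text = new StringBuilder();
        for (Glyph g : unprocessed) text.append(g.text);
        Glyph first = unprocessed.get(0);
        words.add(new PositionedWord(text.toString(), first.startX, first.baselineY));
        unprocessed.clear();
    }

    public static void main(String[] args) {
        WordCollector collector = new WordCollector();
        collector.renderText(Arrays.asList(
                new Glyph("H", 10f, 15f, 100f), new Glyph("i", 15f, 18f, 100f),
                new Glyph(" ", 18f, 21f, 100f), new Glyph("y", 21f, 26f, 100f)));
        collector.renderText(Arrays.asList(new Glyph("o", 26f, 31f, 100f)));
        for (PositionedWord w : collector.finish()) {
            System.out.println(w.text + " at " + w.x + ", " + w.y);
        }
    }
}

// Hypothetical stand-in for one character-level TextRenderInfo.
final class Glyph {
    final String text;
    final float startX, endX, baselineY;
    Glyph(String text, float startX, float endX, float baselineY) {
        this.text = text; this.startX = startX; this.endX = endX; this.baselineY = baselineY;
    }
}

// The "word with position" result structure (start coordinates only).
final class PositionedWord {
    final String text;
    final float x, y;
    PositionedWord(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
}
```

Note that the word "yo" in the usage example arrives in two separate events and is still reported as one word, which is the reason for keeping the list of unprocessed render infos between calls.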

Having done this, you might feel ready to implement the slightly more complex variant which does not have the same assumptions concerning the page content structure. ;)

c# itextsharp, locate words not chunks in page with their location for adding sticky notes

It looks like chunk.m_text only contains one letter at a time, which is why this will never be true:

if (chunk.m_text.Trim() == "MCU_MOSI")

What you could do instead is have each chunk text added to a string and see if it contains your text.

PdfReader reader = new PdfReader(path);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);

LocationTextExtractionStrategyEx strategy;
string str = string.Empty;

for (int i = 5; i <= 5; i++) // reader.NumberOfPages
{
    strategy = parser.ProcessContent(i, new LocationTextExtractionStrategyEx("MCU_MOSI", 0));
    var x = strategy.m_SearchResultsList;
    foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk chunk in strategy.m_DocChunks)
    {
        str += chunk.m_text;
        if (str.Contains("MCU_MOSI"))
        {
            str = string.Empty;
            Vector location = chunk.m_endLocation;
            Console.WriteLine("Bingo");
        }
    }
}

Note: to read the location in this example, I made m_endLocation public.

Getting Coordinates of string using ITextExtractionStrategy and LocationTextExtractionStrategy in Itextsharp

Here is a very, very simple version of an implementation.

Before implementing it is very important to know that PDFs have zero concept of "words", "paragraphs", "sentences", etc. Also, text within a PDF is not necessarily laid out left to right and top to bottom and this has nothing to do with non-LTR languages. The phrase "Hello World" could be written into the PDF as:

Draw H at (10, 10)
Draw ell at (20, 10)
Draw rld at (90, 10)
Draw o Wo at (50, 20)

It could also be written as

Draw Hello World at (10,10)

The ITextExtractionStrategy interface that you need to implement has a method called RenderText that gets called once for every chunk of text within a PDF. Notice I said "chunk" and not "word". In the first example above the method would be called four times for those two words. In the second example it would be called once for those two words. This is the very important part to understand. PDFs don't have words and because of this, iTextSharp doesn't have words either. The "word" part is 100% up to you to solve.

Also along these lines, as I said above, PDFs don't have paragraphs. The reason to be aware of this is that PDFs cannot wrap text to a new line. Any time you see something that looks like a paragraph return, you are actually seeing a brand new text drawing command that has a different y coordinate than the previous line. See this for further discussion.

The code below is a very simple implementation. For it I'm subclassing LocationTextExtractionStrategy, which already implements ITextExtractionStrategy. On each call to RenderText() I find the rectangle of the current chunk (using Mark's code here) and store it for later. I'm using this simple helper class for storing these chunks and rectangles:

//Helper class that stores our rectangle and text
public class RectAndText {
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public RectAndText(iTextSharp.text.Rectangle rect, String text) {
        this.Rect = rect;
        this.Text = text;
    }
}

And here's the subclass:

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
            bottomLeft[Vector.I1],
            bottomLeft[Vector.I2],
            topRight[Vector.I1],
            topRight[Vector.I2]
        );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}

And finally an implementation of the above:

//Our test file
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");

//Create our test file, nothing special
using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
    using (var doc = new Document()) {
        using (var writer = PdfWriter.GetInstance(doc, fs)) {
            doc.Open();

            doc.Add(new Paragraph("This is my sample file"));

            doc.Close();
        }
    }
}

//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();

//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
    var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}

//Loop through each chunk found
foreach (var p in t.myPoints) {
    Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}

I can't stress enough that the above does not take "words" into account; that'll be up to you. The TextRenderInfo object that gets passed into RenderText has a method called GetCharacterRenderInfos() that you might be able to use to get more information. You might also want to use GetBaseline() instead of GetDescentLine() if you don't care about descenders in the font.

EDIT

(I had a great lunch so I'm feeling a little more helpful.)

Here's an updated version of MyLocationTextExtractionStrategy that does what my comments below say, namely it takes a string to search for and searches each chunk for that string. For all the reasons listed this will not work in some/many/most/all cases. If the substring exists multiple times in a single chunk it will also only return the first instance. Ligatures and diacritics could also mess with this.

//Note: Skip(), Take(), First() and Last() below require "using System.Linq;"
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //The string that we're searching for
    public String TextToSearchFor { get; set; }

    //How to compare strings
    public System.Globalization.CompareOptions CompareOptions { get; set; }

    public MyLocationTextExtractionStrategy(String textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None) {
        this.TextToSearchFor = textToSearchFor;
        this.CompareOptions = compareOptions;
    }

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //See if the current chunk contains the text
        var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);

        //If not found bail
        if (startPosition < 0) {
            return;
        }

        //Grab the individual characters
        var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();

        //Grab the first and last character
        var firstChar = chars.First();
        var lastChar = chars.Last();

        //Get the bounding box for the chunk of text
        var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
        var topRight = lastChar.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
            bottomLeft[Vector.I1],
            bottomLeft[Vector.I2],
            topRight[Vector.I1],
            topRight[Vector.I2]
        );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
    }
}

You would use this the same as before but now the constructor has a single required parameter:

var t = new MyLocationTextExtractionStrategy("sample");
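
One of the limitations mentioned above is that only the first occurrence per chunk is returned. As a rough sketch of how that could be lifted, the plain-Java helper below collects the start index of every non-overlapping occurrence in a chunk's text; each index could then take the place of startPosition when picking the first and last character render infos. ChunkSearch and findAll are names made up for this illustration and are not part of iText(Sharp). The sketch uses a plain ordinal comparison rather than CompareOptions, and it assumes one character render info per string index, which may not hold for ligatures.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkSearch {
    // Returns the start index of every non-overlapping occurrence of needle in chunkText.
    static List<Integer> findAll(String chunkText, String needle) {
        List<Integer> starts = new ArrayList<>();
        if (needle.isEmpty()) {
            return starts;                      // nothing sensible to search for
        }
        int from = 0;
        while (true) {
            int idx = chunkText.indexOf(needle, from);
            if (idx < 0) {
                break;                          // no further occurrence in this chunk
            }
            starts.add(idx);
            from = idx + needle.length();       // continue after the match just found
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findAll("a sample, another sample", "sample"));
    }
}
```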

Get exact coordinates of the page to add a watermark with different page rotation using iTextSharp

This question is answered in an article written in French that was based on several StackOverflow questions I answered in English: Comment créer un filigrane transparent en PDF?

The questions this blog post was based on are:

  • How to watermark PDFs using text or images? (This is an important one for you, because it deals with page rotations!)
  • How to add a watermark to a PDF file?
  • How to extend the page size of a PDF to add a watermark?

These questions and their answers can be found in The Best iText Questions on StackOverflow, a free ebook that can be downloaded from the iText site. It also contains a couple of answers that were never published on StackOverflow.

You shouldn't import the page to find out the rotation. There are other ways to get that information. You'll notice that you can use the GetPageSize() and the GetPageSizeWithRotation() methods depending on whether or not you want to get the page size along with the rotation (there's also a GetRotation() method).

Furthermore, you should experiment with the RotateContents property:

stamper.RotateContents = false;

It is not exactly clear to me whether you want the watermark to follow or ignore the rotation, but with the GetPageSize() and GetPageSizeWithRotation() methods you'll be able to avoid hard-coded values such as x = 20; y = 20 (as done in your code snippet). If you want the middle coordinate of page i, you can use this code:

Rectangle pagesize = reader.GetPageSizeWithRotation(i);
x = (pagesize.Left + pagesize.Right) / 2;
y = (pagesize.Top + pagesize.Bottom) / 2;
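
To illustrate why the rotated page size matters for the middle coordinate, here is a small self-contained Java model. It is not the iText API: PageBox, rotate90 and withRotation are names invented for this sketch. It assumes that taking the rotation into account amounts to swapping the x and y extents once per 90 degrees, and that the rotation is a multiple of 90. The center is then computed exactly as in the snippet above.

```java
final class PageBox {
    final float left, bottom, right, top;

    PageBox(float left, float bottom, float right, float top) {
        this.left = left; this.bottom = bottom; this.right = right; this.top = top;
    }

    // Models one 90-degree step: the x and y extents swap.
    PageBox rotate90() {
        return new PageBox(bottom, left, top, right);
    }

    // Applies the page rotation (assumed to be a multiple of 90) to the unrotated box.
    static PageBox withRotation(PageBox unrotated, int rotationDegrees) {
        PageBox box = unrotated;
        int normalized = ((rotationDegrees % 360) + 360) % 360;
        for (int r = normalized; r > 0; r -= 90) {
            box = box.rotate90();
        }
        return box;
    }

    float centerX() { return (left + right) / 2; }
    float centerY() { return (top + bottom) / 2; }

    public static void main(String[] args) {
        PageBox a4 = new PageBox(0, 0, 595, 842);       // A4 in points
        PageBox rotated = withRotation(a4, 90);
        System.out.println(a4.centerX() + ", " + a4.centerY());
        System.out.println(rotated.centerX() + ", " + rotated.centerY());
    }
}
```

For a portrait A4 page with a rotation of 90 degrees, the center moves from (297.5, 421) to (421, 297.5), which is why a watermark positioned with the unrotated size ends up off-center.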

How to position iTextSharp paragraph to bottom of the page?

You can create a Footer class and use it.

Create a class that inherits from PdfPageEventHelper.

Create a table in this class and write the footer content into it.

public partial class Footer : PdfPageEventHelper
{
    public override void OnEndPage(PdfWriter writer, Document doc)
    {
        Paragraph footer = new Paragraph("THANK YOU", FontFactory.GetFont(FontFactory.TIMES, 10, iTextSharp.text.Font.NORMAL));
        footer.Alignment = Element.ALIGN_RIGHT;
        PdfPTable footerTbl = new PdfPTable(1);
        footerTbl.TotalWidth = 300;
        footerTbl.HorizontalAlignment = Element.ALIGN_CENTER;
        PdfPCell cell = new PdfPCell(footer);
        cell.Border = 0;
        cell.PaddingLeft = 10;
        footerTbl.AddCell(cell);
        footerTbl.WriteSelectedRows(0, -1, 415, 30, writer.DirectContent);
    }
}

After this, register the footer with the writer:

Document document = new Document(PageSize.A4, 50, 50, 25, 25);
var output = new FileStream(Server.MapPath("Demo.pdf"), FileMode.Create);
PdfWriter writer = PdfWriter.GetInstance(document, output);
// Open the Document for writing
document.Open();
//using footer class
writer.PageEvent = new Footer();
Paragraph welcomeParagraph = new Paragraph("Hello, World!");
document.Add(welcomeParagraph);
document.Close();


Another way: you can simply add the code below.

Paragraph copyright = new Paragraph("© 2020 AO XXX. All rights reserved.", calibri8Black);
PdfPTable footerTbl = new PdfPTable(1);
footerTbl.TotalWidth = 300;
PdfPCell cell = new PdfPCell(copyright);
cell.Border = 0;
footerTbl.AddCell(cell);
footerTbl.WriteSelectedRows(0, -1, 30, 30, writer.DirectContent);
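
The x and y values passed to WriteSelectedRows are hard-coded in both variants. As a hedged sketch, the arithmetic for deriving them from the page size and margins could look like the plain-Java helper below. FooterPosition and its methods are hypothetical names, and the sketch assumes that the two coordinates denote the top-left corner of the table, in points.

```java
final class FooterPosition {
    // x for a table whose right edge sits on the right margin.
    static float rightAlignedX(float pageWidth, float rightMargin, float tableWidth) {
        return pageWidth - rightMargin - tableWidth;
    }

    // x for a table centered horizontally on the page.
    static float centeredX(float pageWidth, float tableWidth) {
        return (pageWidth - tableWidth) / 2;
    }

    // y of the table's top edge when its bottom edge sits on the bottom margin.
    static float topY(float bottomMargin, float tableHeight) {
        return bottomMargin + tableHeight;
    }

    public static void main(String[] args) {
        float a4Width = 595f;                   // A4 width in points
        System.out.println(rightAlignedX(a4Width, 50f, 300f));
        System.out.println(centeredX(a4Width, 300f));
        System.out.println(topY(25f, 20f));
    }
}
```

On an A4 page (595 points wide) with 50-point margins, a 300-point table would start at x = 245 to end at the right margin; starting it at 415, as in the first variant, lets the table box extend past the right page edge.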

How to get the text position from the pdf page in iText 7

First, SimpleTextExtractionStrategy is not exactly the 'smartest' strategy (as the name would suggest).

Second, if you want the position you're going to have to do a lot more work. TextExtractionStrategy assumes you are only interested in the text.

Possible implementation:

  • implement IEventListener
  • get notified for all events that render text, and store the corresponding TextRenderInfo object
  • once you're finished with the document, sort these objects based on their position in the page
  • loop over this list of TextRenderInfo objects, they offer both the text being rendered and the coordinates

How to:

  1. implement ITextExtractionStrategy (or extend an existing implementation)
  2. use PdfTextExtractor.getTextFromPage(doc.getPage(pageNr), strategy), where strategy denotes the strategy you created in step 1
  3. your strategy should be set up to keep track of locations for the text it processed

ITextExtractionStrategy has the following method in its interface:

@Override
public void eventOccurred(IEventData data, EventType type) {

    // you can first check the type of the event
    if (!type.equals(EventType.RENDER_TEXT))
        return;

    // now it is safe to cast
    TextRenderInfo renderInfo = (TextRenderInfo) data;
}

It is important to keep in mind that rendering instructions in a PDF do not need to appear in reading order. The text "Lorem Ipsum Dolor Sit Amet" could be rendered with instructions similar to:

render "Ipsum Do"
render "Lorem "
render "lor Sit Amet"

You will have to do some clever merging (depending on how far apart two TextRenderInfo objects are) and sorting (to get all the TextRenderInfo objects in the proper reading order).

Once that's done, it should be easy.
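
The sorting step can be sketched in plain Java, independent of iText. TextPiece is a hypothetical stand-in for the text and start coordinates taken from a stored TextRenderInfo, and the line tolerance is an assumed value. The sketch groups pieces into lines by their y coordinate (larger y first, since the PDF y axis points up), sorts each line by x, and concatenates the text. It deliberately leaves out the gap-based merging, i.e. deciding where a space has to be inserted between two pieces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ReadingOrder {
    // Returns one string per line, top line first, pieces joined left to right.
    static List<String> toLines(List<TextPiece> pieces, float lineTolerance) {
        List<TextPiece> byY = new ArrayList<>(pieces);
        byY.sort((a, b) -> Float.compare(b.y, a.y));    // top of the page first
        List<String> lines = new ArrayList<>();
        List<TextPiece> line = new ArrayList<>();
        for (TextPiece p : byY) {
            // A piece clearly below the first piece of the current line starts a new line.
            if (!line.isEmpty() && line.get(0).y - p.y > lineTolerance) {
                lines.add(join(line));
                line.clear();
            }
            line.add(p);
        }
        if (!line.isEmpty()) {
            lines.add(join(line));
        }
        return lines;
    }

    // Sorts the pieces of one line left to right and concatenates their text.
    private static String join(List<TextPiece> line) {
        line.sort((a, b) -> Float.compare(a.x, b.x));
        StringBuilder sb = new StringBuilder();
        for (TextPiece p : line) {
            sb.append(p.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<TextPiece> pieces = Arrays.asList(
                new TextPiece("Ipsum Do", 46f, 700f),
                new TextPiece("Lorem ", 10f, 700f),
                new TextPiece("lor Sit Amet", 94f, 700f));
        System.out.println(toLines(pieces, 2f));
    }
}

// Hypothetical stand-in for the text and start point of a stored TextRenderInfo.
final class TextPiece {
    final String text;
    final float x, y;
    TextPiece(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
}
```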

Extract coordinates of each separate word into a TextChunk in a pdf file

You can use the method TextRenderInfo.GetCharacterRenderInfos() to get a collection of TextRenderInfo objects, one for each character in your chunk. You can then regroup the individual characters into words and calculate the rectangle that contains each word using the coordinates of the first and last TextRenderInfo in that word.

In your custom text extraction strategy:

// Fields cannot be declared with var, so the separator list is typed explicitly.
// _chunks and CustomTextChunk are defined elsewhere in the strategy.
private static readonly string[] _separators = { "-", "(", ")", "/", " ", ":", ";", ",", "." };

protected virtual void ParseRenderInfo(TextRenderInfo currentInfo)
{
    var resultInfo = new List<TextRenderInfo>();
    var chars = currentInfo.GetCharacterRenderInfos();

    foreach (var charRenderInfo in chars)
    {
        // The separator is added before the check, so it stays part of the word it ends.
        resultInfo.Add(charRenderInfo);
        var currentChar = charRenderInfo.GetText();
        if (_separators.Contains(currentChar))
        {
            ProcessWord(currentInfo, resultInfo);
            resultInfo.Clear();
        }
    }
    // Process whatever is left after the last separator.
    ProcessWord(currentInfo, resultInfo);
}

private void ProcessWord(TextRenderInfo charChunk, List<TextRenderInfo> wordChunks)
{
    var firstRender = wordChunks.FirstOrDefault();
    var lastRender = wordChunks.LastOrDefault();
    if (firstRender == null || lastRender == null)
    {
        return;
    }
    var startCoords = firstRender.GetDescentLine().GetStartPoint();
    var endCoords = lastRender.GetAscentLine().GetEndPoint();
    var wordText = string.Join("", wordChunks.Select(x => x.GetText()));
    var wordLocation = new LocationTextExtractionStrategy.TextChunkLocationDefaultImp(startCoords, endCoords, charChunk.GetSingleSpaceWidth());
    _chunks.Add(new CustomTextChunk(wordText, wordLocation));
}

