How to Get Text Formatting with Itextsharp

how can i get text formatting with iTextSharp

Let me try pointing you in a different direction. iTextSharp has a really beautiful and simple text extraction system that handle some of the basic tokens. Unfortunately it doesn't handle color information but according to @Mark Storer it might not be too hard to implement yourself.

BEGIN EDIT

I started work on implementing color information. See my blog post here for more details. (Sorry for the bad formatting, heading off to dinner now.)

END EDIT

The code below combines several questions and answers here including this one to get the font height (although its not exact) as well as another one (that for the life of me I can't seem to find anymore) that shows how to detect for faux bold.

The PostscriptFontName returns some additional characters in front of the font name, I think it has to do with when you embed font subsets.

Below is a complete WinForms application that targets iTextSharp 5.1.1.0 and extracts text as HTML.

Screenshot of sample PDF

Screenshot of sample PDF

Sample text extracted as HTML

<span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407">Hello </span>
<span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:11.61407">w</span>
<span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:37.87201">o</span>
<span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:11.61407">rl</span>
<span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407">d </span>
<br />
<span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407">Test </span>

Code

using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;

namespace WindowsFormsApplication2
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}

private void Form1_Load(object sender, EventArgs e)
{
PdfReader reader = new PdfReader(System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Document.pdf"));
TextWithFontExtractionStategy S = new TextWithFontExtractionStategy();
string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
Console.WriteLine(F);

this.Close();
}

public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
{
//HTML buffer
private StringBuilder result = new StringBuilder();

//Store last used properties
private Vector lastBaseLine;
private string lastFont;
private float lastFontSize;

//http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
private enum TextRenderMode
{
FillText = 0,
StrokeText = 1,
FillThenStrokeText = 2,
Invisible = 3,
FillTextAndAddToPathForClipping = 4,
StrokeTextAndAddToPathForClipping = 5,
FillThenStrokeTextAndAddToPathForClipping = 6,
AddTextToPaddForClipping = 7
}



public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
{
string curFont = renderInfo.GetFont().PostscriptFontName;
//Check if faux bold is used
if ((renderInfo.GetTextRenderMode() == (int)TextRenderMode.FillThenStrokeText))
{
curFont += "-Bold";
}

//This code assumes that if the baseline changes then we're on a newline
Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Single curFontSize = rect.Height;

//See if something has changed, either the baseline, the font or the font size
if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
{
//if we've put down at least one span tag close it
if ((this.lastBaseLine != null))
{
this.result.AppendLine("</span>");
}
//If the baseline has changed then insert a line break
if ((this.lastBaseLine != null) && curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
{
this.result.AppendLine("<br />");
}
//Create an HTML tag with appropriate styles
this.result.AppendFormat("<span style=\"font-family:{0};font-size:{1}\">", curFont, curFontSize);
}

//Append the current text
this.result.Append(renderInfo.GetText());

//Set currently used properties
this.lastBaseLine = curBaseline;
this.lastFontSize = curFontSize;
this.lastFont = curFont;
}

public string GetResultantText()
{
//If we wrote anything then we'll always have a missing closing tag so close it here
if (result.Length > 0)
{
result.Append("</span>");
}
return result.ToString();
}

//Not needed
public void BeginTextBlock() { }
public void EndTextBlock() { }
public void RenderImage(ImageRenderInfo renderInfo) { }
}
}
}

Format text controls on pdf using iTextSharp and C#

Finally managed to draw borders around controls with below code.

XmlDocument newXMLDoc = new XmlDocument();
newXMLDoc.LoadXml(@"<border><edge thickness=""1.3mm""><color value=""0, 0, 255""/></edge></border>");

if (Rs.Rows.Count > 0)
{
foreach (DataRow query in Rs.Rows)
{
if(isRET)
{
if (oXFA.DomDocument.SelectSingleNode("//t:*[@name='" + Rs[0] + "']", oNameSpace) != null)
{
XmlNode newNode =
oXFA.DomDocument.ImportNode(newXMLDoc.SelectSingleNode("border"), true);
oXFA.DomDocument.SelectSingleNode("//t:*[@name='" + Rs[0] + "']", oNameSpace).AppendChild(newNode);
}
}
}
}

Change Text style in itextsharp

Change iTextSharp.text.pdf.BaseFont.TIMES_ROMAN to iTextSharp.text.pdf.BaseFont.TIMES_BOLD or use any of the other documented ways to use a bold font.

Extracting text and retaining formatting

It turns out the formatting "\r\n" is indeed retained verified by fetching the value from SQL Server table programatically and invoking Console.writeline(). Initially I was copying the value directly from SQL Server Management studio and pasting into text file - which surely isn't the right way to verify.



Related Topics



Leave a reply



Submit