How can I get text formatting with iTextSharp
Let me try pointing you in a different direction. iTextSharp has a really beautiful and simple text extraction system that handles some of the basic tokens. Unfortunately it doesn't handle color information, but according to @Mark Storer it might not be too hard to implement yourself.
BEGIN EDIT
I started work on implementing color information. See my blog post here for more details. (Sorry for the bad formatting, heading off to dinner now.)
END EDIT
The code below combines several questions and answers here, including this one to get the font height (although it's not exact), as well as another one (that for the life of me I can't seem to find anymore) that shows how to detect faux bold.
The PostscriptFontName property returns some additional characters in front of the font name (for example the "NJNSWD+" in the sample output below). I think this has to do with embedded font subsets: PDF marks a subset font with a tag of six uppercase letters followed by a plus sign.
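If the extra characters get in the way, they can be stripped before the name is written into the HTML. This is only a minimal sketch: it assumes the prefix is always a subset tag of six uppercase letters plus "+", and the FontNameHelper class and StripSubsetTag method are hypothetical names, not part of iTextSharp.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FontNameHelper
{
    // Matches a subset tag at the start of the name:
    // exactly six uppercase letters followed by '+'.
    private static readonly Regex SubsetTag = new Regex("^[A-Z]{6}\\+");

    // Returns the font name without a leading subset tag.
    // Names without a tag are returned unchanged.
    public static string StripSubsetTag(string postscriptFontName)
    {
        if (string.IsNullOrEmpty(postscriptFontName))
        {
            return postscriptFontName;
        }
        return SubsetTag.Replace(postscriptFontName, string.Empty);
    }
}
```

For example, passing "NJNSWD+Papyrus-Regular" gives back "Papyrus-Regular", which is more likely to match a font installed on the machine that renders the HTML.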
Below is a complete WinForms application that targets iTextSharp 5.1.1.0 and extracts text as HTML.
Screenshot of sample PDF
Sample text extracted as HTML
<span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407">Hello </span>
<span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:11.61407">w</span>
<span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:37.87201">o</span>
<span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:11.61407">rl</span>
<span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407">d </span>
<br />
<span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407">Test </span>
Code
using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;

namespace WindowsFormsApplication2
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            //Read Document.pdf from the desktop and extract page 1 as HTML
            PdfReader reader = new PdfReader(System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Document.pdf"));
            TextWithFontExtractionStrategy S = new TextWithFontExtractionStrategy();
            string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
            reader.Close();
            Console.WriteLine(F);
            this.Close();
        }

        public class TextWithFontExtractionStrategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
        {
            //HTML buffer
            private StringBuilder result = new StringBuilder();

            //Store last used properties
            private Vector lastBaseLine;
            private string lastFont;
            private float lastFontSize;

            //http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
            private enum TextRenderMode
            {
                FillText = 0,
                StrokeText = 1,
                FillThenStrokeText = 2,
                Invisible = 3,
                FillTextAndAddToPathForClipping = 4,
                StrokeTextAndAddToPathForClipping = 5,
                FillThenStrokeTextAndAddToPathForClipping = 6,
                AddTextToPathForClipping = 7
            }

            public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
            {
                string curFont = renderInfo.GetFont().PostscriptFontName;

                //Check if faux bold is used (text that is filled and then stroked)
                if ((renderInfo.GetTextRenderMode() == (int)TextRenderMode.FillThenStrokeText))
                {
                    curFont += "-Bold";
                }

                //This code assumes that if the baseline changes then we're on a new line
                Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
                Vector topRight = renderInfo.GetAscentLine().GetEndPoint();

                //Approximate the font size as the distance from the baseline to the ascent line
                iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
                Single curFontSize = rect.Height;

                //See if something has changed, either the baseline, the font or the font size
                if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
                {
                    //If we've put down at least one span tag, close it
                    if ((this.lastBaseLine != null))
                    {
                        this.result.AppendLine("</span>");
                    }

                    //If the baseline has changed then insert a line break
                    if ((this.lastBaseLine != null) && curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
                    {
                        this.result.AppendLine("<br />");
                    }

                    //Create an HTML tag with appropriate styles
                    this.result.AppendFormat("<span style=\"font-family:{0};font-size:{1}\">", curFont, curFontSize);
                }

                //Append the current text
                this.result.Append(renderInfo.GetText());

                //Set currently used properties
                this.lastBaseLine = curBaseline;
                this.lastFontSize = curFontSize;
                this.lastFont = curFont;
            }

            public string GetResultantText()
            {
                //If we wrote anything then we'll always have a missing closing tag, so close it here
                if (result.Length > 0)
                {
                    result.Append("</span>");
                }
                return result.ToString();
            }

            //Not needed
            public void BeginTextBlock() { }
            public void EndTextBlock() { }
            public void RenderImage(ImageRenderInfo renderInfo) { }
        }
    }
}
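Two things in the strategy above may be worth hardening. It compares baselines and font sizes with exact float inequality, and it appends the extracted text to the HTML without encoding it. The helpers below are a rough sketch of one way to handle both; ExtractionHelpers, NearlyEqual, ToHtmlText and the tolerance value are hypothetical and not part of iTextSharp, and WebUtility.HtmlEncode assumes .NET 4.0 or later.

```csharp
using System;
using System.Net;

public static class ExtractionHelpers
{
    // Hypothetical tolerance in PDF user-space units; tune it for your documents.
    public const float DefaultTolerance = 0.01f;

    // True when two coordinates or sizes differ by no more than the tolerance.
    public static bool NearlyEqual(float a, float b, float tolerance)
    {
        return Math.Abs(a - b) <= tolerance;
    }

    // Encodes characters such as <, > and & so extracted text is safe inside a span.
    public static string ToHtmlText(string pdfText)
    {
        return WebUtility.HtmlEncode(pdfText);
    }
}
```

In RenderText, the checks like curFontSize != lastFontSize could then become !ExtractionHelpers.NearlyEqual(curFontSize, lastFontSize, ExtractionHelpers.DefaultTolerance), and the appended text could be passed through ToHtmlText first.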
Format text controls on pdf using iTextSharp and C#
I finally managed to draw borders around controls with the code below.
//Border definition to import into the XFA DOM
XmlDocument newXMLDoc = new XmlDocument();
newXMLDoc.LoadXml(@"<border><edge thickness=""1.3mm""><color value=""0, 0, 255""/></edge></border>");
if (isRET)
{
    foreach (DataRow query in Rs.Rows)
    {
        //Assumes the first column of each row holds the field name
        //(the original used Rs[0] here, which ignores the loop variable)
        XmlNode target = oXFA.DomDocument.SelectSingleNode("//t:*[@name='" + query[0] + "']", oNameSpace);
        if (target != null)
        {
            //Import the border into the XFA document before appending it to the field
            XmlNode newNode = oXFA.DomDocument.ImportNode(newXMLDoc.SelectSingleNode("border"), true);
            target.AppendChild(newNode);
        }
    }
}
Change text style in iTextSharp
Change iTextSharp.text.pdf.BaseFont.TIMES_ROMAN to iTextSharp.text.pdf.BaseFont.TIMES_BOLD, or use any of the other documented ways to use a bold font.
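As a rough illustration, the sketch below writes two bold paragraphs into an in-memory PDF: one using the TIMES_BOLD base font and one using the bold style flag on the Times Roman family. It assumes iTextSharp 5.x; the BoldFontSketch class, the BuildPdf method and the sample strings are made up for the example.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class BoldFontSketch
{
    // Builds a small PDF in memory and returns its bytes.
    public static byte[] BuildPdf()
    {
        using (MemoryStream ms = new MemoryStream())
        {
            Document doc = new Document();
            PdfWriter.GetInstance(doc, ms);
            doc.Open();

            // Option 1: pick the bold base font directly.
            BaseFont boldBase = BaseFont.CreateFont(BaseFont.TIMES_BOLD, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
            doc.Add(new Paragraph("Bold via BaseFont.TIMES_BOLD", new Font(boldBase, 12f)));

            // Option 2: ask for the bold style of the Times Roman family.
            Font boldStyle = new Font(Font.FontFamily.TIMES_ROMAN, 12f, Font.BOLD);
            doc.Add(new Paragraph("Bold via Font.BOLD", boldStyle));

            doc.Close();
            // ToArray still works after the document has closed the stream.
            return ms.ToArray();
        }
    }
}
```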
Extracting text and retaining formatting
It turns out the "\r\n" formatting is indeed retained. I verified this by fetching the value from the SQL Server table programmatically and calling Console.WriteLine(). Initially I was copying the value directly from SQL Server Management Studio and pasting it into a text file, which isn't a reliable way to verify.
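When checking this kind of thing, it can help to print the line breaks as visible escape sequences rather than relying on how the console or an editor renders them. A small sketch follows; LineBreakInspector and MakeVisible are hypothetical names, and the input string stands in for the value read from the database.

```csharp
using System;

public static class LineBreakInspector
{
    // Replaces carriage returns and line feeds with the literal text \r and \n
    // so they show up in console output.
    public static string MakeVisible(string value)
    {
        if (value == null)
        {
            return null;
        }
        return value.Replace("\r", "\\r").Replace("\n", "\\n");
    }
}
```

Calling Console.WriteLine(LineBreakInspector.MakeVisible("first line\r\nsecond line")) prints the text on a single line with the characters \r\n spelled out between the two parts.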