Convert Arabic"Unicode" Content HTML or Xml to PDF Using Itextsharp

Convert arabicunicode content html or xml to pdf using itextsharp

You need to use container elements which support RunDirection, such as ColumnText or PdfPCell and then set their element.RunDirection = PdfWriter.RUN_DIRECTION_RTL

List<IElement> list = HTMLWorker.ParseToList(new StringReader(resultCache), ST);
doc.Open();

//Use a table so that we can set the text direction
PdfPTable table = new PdfPTable(1);
//Ensure that wrapping is on, otherwise Right to Left text will not display
table.DefaultCell.NoWrap = false;
table.RunDirection = PdfWriter.RUN_DIRECTION_RTL;

//Loop through each element, don't bother wrapping in P tags
foreach (var element in list)
{
//Create a cell and add text to it
PdfPCell text = new PdfPCell(new Phrase(element, font));

//Ensure that wrapping is on, otherwise Right to Left text will not display
text.NoWrap = false;

//Add the cell to the table
table.AddCell(text);
}
//Add the table to the document
document.Add(table);
doc.Close();
pdfWriter.Close();

For addition reference, have a look at this sample.

ItextSharp Html to pdf arabic characters are not combining in pdf

Not a good solution but able to done via following

foreach (var htmlElement in parsedHtmlElements)
{
if (htmlElement is PdfPTable)
{
SetDirection(htmlElement as PdfPTable);
}
pdfDoc.Add(htmlElement);
}

private static void SetDirection(PdfPTable tbl)
{
tbl.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
tbl.HorizontalAlignment = Element.ALIGN_LEFT;
foreach (PdfPRow pr in tbl.Rows)
{
foreach (PdfPCell pc in pr.GetCells())
{
if (pc != null)
{
pc.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
pc.HorizontalAlignment = Element.ALIGN_LEFT;
if (pc.CompositeElements != null)
{
foreach (var element in pc.CompositeElements)
{
if (element is PdfPTable)
{
SetDirection((PdfPTable)element);
}
}
}
}
}
}
}

Also in my html change some cells order

Arabic characters from html content to pdf using iText

Please take a look at the ParseHtml7 and ParseHtml8 examples. They take HTML input with Arabic characters and they create a PDF with the same Arabic text:

A PDF table with HTML content
An HTML table in PDF

Before we look at the code, allow me to explain that it's not a good idea to use non-ASCII characters in source code. For instance: this is not done:

 htmlContentAr = “<p><span> رقم التعميم رقم التعميم</p><p>رقم التعميم ….</span></p>”;

You never know how a Java file containing these glyphs will be stored. If it's not stored as UTF-8, the characters may end up looking like something completely different. Versioning systems are known to have problems with non-ASCII characters and even compilers can get the encoding wrong. If you really want to stored hard-coded String values in your code, use the UNICODE notation. Part of your problem is an encoding problem, and you can read more about this here: Can't get Czech characters while generating a PDF

For the examples shown in the screen shots, I saved the following files using UTF-8 encoding:

This is what you'll find in the file arabic.html:

<html>
<body style="font-family: Noto Naskh Arabic">
<p>رقم التعميم رقم التعميم</p>
<p>رقم التعميم</p>
</body>
</html>

This is what you'll find in the file arabic2.html:

<html>
<body style="font-family: Noto Naskh Arabic">
<table>
<tr>
<td dir="rtl">رقم التعميم رقم التعميم</td>
<td dir="rtl">رقم التعميم</td>
</tr>
</table>
</body>
</html>

The second part of your problem concerns the font. It is important that you use a font that knows how to draw Arabic glyphs. It is hard to believe that you have arial.ttf right at the root of your C: drive. That's not a good idea. I would expect you to use C:/windows/fonts/arialuni.ttf which certainly knows Arabic glyphs.

Selecting the font isn't sufficient. Your HTML needs to know which font family to use. Because most of the examples in the documentation use Arial, I decided to use a NOTO font. I discovered these fonts by reading this question: iText pdf not displaying Chinese characters when using NOTO fonts or Source Hans. I really like these fonts because they are nice and (almost) every language is supported. For instance, I used NotoNaskhArabic-Regular.ttf which means that I need to define the font familie like this:

style="font-family: Noto Naskh Arabic"

I defined the style in the body tag of my XML, it's obvious that you can choose where to define it: in an external CSS file, in the styles section of the <head>, at the level of a <td> tag,... That choice is entirely yours, but you have to define somewhere which font to use.

Of course: when XML Worker encounters font-family: Noto Naskh Arabic, iText doesn't know where to find the corresponding NotoNaskhArabic-Regular.ttf unless we register that font. We can do this, by creating an instance of the FontProvider interface. I chose to use the XMLWorkerFontProvider, but you're free to write your own FontProvider implementation:

XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
fontProvider.register("resources/fonts/NotoNaskhArabic-Regular.ttf");

There is one more hurdle to take: Arabic is written from right to left. I see that you want to define the run direction at the level of the PdfPCell and that you add the HTML content to this cell using an ElementList. That's why I first wrote a similar example, named ParseHtml7:

public void createPdf(String file) throws IOException, DocumentException {
// step 1
Document document = new Document();
// step 2
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
// step 3
document.open();
// step 4
// Styles
CSSResolver cssResolver = new StyleAttrCSSResolver();
XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
fontProvider.register("resources/fonts/NotoNaskhArabic-Regular.ttf");
CssAppliers cssAppliers = new CssAppliersImpl(fontProvider);
// HTML
HtmlPipelineContext htmlContext = new HtmlPipelineContext(cssAppliers);
htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
// Pipelines
ElementList elements = new ElementList();
ElementHandlerPipeline pdf = new ElementHandlerPipeline(elements, null);
HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);

// XML Worker
XMLWorker worker = new XMLWorker(css, true);
XMLParser p = new XMLParser(worker);
p.parse(new FileInputStream(HTML), Charset.forName("UTF-8"));

PdfPTable table = new PdfPTable(1);
PdfPCell cell = new PdfPCell();
cell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
for (Element e : elements) {
cell.addElement(e);
}
table.addCell(cell);
document.add(table);
// step 5
document.close();
}

There is no table in the HTML, but we create our own PdfPTable, we add the content from the HTML to a PdfPCell with run direction LTR, and we add this cell to the table, and the table to the document.

Maybe that's your actual requirement, but why would you do this in such a convoluted way? If you need a table, why don't you create that table in HTML and define some cells are RTL like this:

<td dir="rtl">...</td>

That way, you don't have to create an ElementList, you can just parse the HTML to PDF as is done in the ParseHtml8 example:

public void createPdf(String file) throws IOException, DocumentException {
// step 1
Document document = new Document();
// step 2
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
// step 3
document.open();
// step 4
// Styles
CSSResolver cssResolver = new StyleAttrCSSResolver();
XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
fontProvider.register("resources/fonts/NotoNaskhArabic-Regular.ttf");
CssAppliers cssAppliers = new CssAppliersImpl(fontProvider);
HtmlPipelineContext htmlContext = new HtmlPipelineContext(cssAppliers);
htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());

// Pipelines
PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);

// XML Worker
XMLWorker worker = new XMLWorker(css, true);
XMLParser p = new XMLParser(worker);
p.parse(new FileInputStream(HTML), Charset.forName("UTF-8"));;
// step 5
document.close();
}

There is less code needed in this example, and when you want to change the layout, it's sufficient to change the HTML. You don't need to change your Java code.

One more example: in ParseHtml9, I create a table with an English name in one column ("Lawrence of Arabia") and the Arabic translation in the other column ("لورانس العرب"). Because I need different fonts for English and Arabic, I define the font at the <td> level:

<table>
<tr>
<td>Lawrence of Arabia</td>
<td dir="rtl" style="font-family: Noto Naskh Arabic">لورانس العرب</td>
</tr>
</table>

For the first column, the default font is used and no special settings are needed to write from left to right. For the second column, I define an Arabic font and I set the run direction to "rtl".

The result looks like this:

English next to Arabic

That's much easier than what you're trying to do in your code.

Display Unicode characters in converting Html to Pdf

When dealing with Unicode characters and iTextSharp there's a couple of things you need to take care of. The first one you did already and that's getting a font that supports your characters. The second thing is that you want to actually register the font with iTextSharp so that its aware of it.

//Path to our font
string arialuniTff = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "ARIALUNI.TTF");
//Register the font with iTextSharp
iTextSharp.text.FontFactory.Register(arialuniTff);

Now that we have a font we need to create a StyleSheet object that tells iTextSharp when and how to use it.

//Create a new stylesheet
iTextSharp.text.html.simpleparser.StyleSheet ST = new iTextSharp.text.html.simpleparser.StyleSheet();
//Set the default body font to our registered font's internal name
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.FACE, "Arial Unicode MS");

The one non-HTML part that you also need to do is set a special encoding parameter. This encoding is specific to iTextSharp and in your case you want it to be Identity-H. If you don't set this then it default to Cp1252 (WINANSI).

//Set the default encoding to support Unicode characters
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, BaseFont.IDENTITY_H);

Lastly, we need to pass our stylesheet to the ParseToList method:

//Parse our HTML using the stylesheet created above
List<IElement> list = HTMLWorker.ParseToList(new StringReader(stringBuilder.ToString()), ST);

Putting that all together, from open to close you'd have:

doc.Open();

//Sample HTML
StringBuilder stringBuilder = new StringBuilder();
stringBuilder.Append(@"<p>This is a test: <strong>α,β</strong></p>");

//Path to our font
string arialuniTff = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "ARIALUNI.TTF");
//Register the font with iTextSharp
iTextSharp.text.FontFactory.Register(arialuniTff);

//Create a new stylesheet
iTextSharp.text.html.simpleparser.StyleSheet ST = new iTextSharp.text.html.simpleparser.StyleSheet();
//Set the default body font to our registered font's internal name
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.FACE, "Arial Unicode MS");
//Set the default encoding to support Unicode characters
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, BaseFont.IDENTITY_H);

//Parse our HTML using the stylesheet created above
List<IElement> list = HTMLWorker.ParseToList(new StringReader(stringBuilder.ToString()), ST);

//Loop through each element, don't bother wrapping in P tags
foreach (var element in list) {
doc.Add(element);
}

doc.Close();

EDIT

In your comment you show HTML that specifies an override font. iTextSharp does not spider the system for fonts and its HTML parser doesn't use font fallback techniques. Any fonts specified in HTML/CSS must be manually registered.

string lucidaTff = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "l_10646.ttf");
iTextSharp.text.FontFactory.Register(lucidaTff);

Arabic Text is not being shown while converting from html to pdf using iText#

Question marks instead of characters mean wkhtmltopdf cannot find the font with Arabic characters. The most fool-proof solution I've found for this is to Base64-encode your font, and include it directly in the CSS/style declaration:

@font-face {
font-family: 'Amiri';
src: url(data:font/truetype;charset=utf-8;base64,<BASE64-ENCODED-DATA>
}

EDIT: Step by step instructions:

  1. Visit this site.
  2. Upload your font to Encode binary file, then press Encode. That will encode the file and generate the encoded font. The output will look like a bunch of random characters.
  3. Copy the CSS snippet above, and replace <BASE64-ENCODED-DATA> with the Base64 output that you got from encoding.
  4. Add this CSS snippet to your stylesheet, somewhere near the top. It's important to add this before you refer to ARIALUNI font in your CSS code.
  5. Now you can declare HTML elements to use this font, as you normally would:
@font-face {
font-family: 'ARIALUNI';
src: url(data:font/truetype;charset=utf-8;base64,AAEAAAATAQA...
}
body, h1 {
font-family: 'ARIALUNI', sans-serif;
}

Is there a free component to convert arabic html to pdf?

do you have the c# code in hand?
there is this online resource discussing how to use itextsharp to create PDF.
is it a problem of characterset, font or right to left problem?

http://www.devshed.com/c/a/Java/Creating-Simple-PDF-Files-With-iTextSharp/

http://www.codeproject.com/KB/graphics/ITextSharpHelperClass.aspx

I hope it can help



Related Topics



Leave a reply



Submit