Reading PDF Content with Itextsharp Dll in Vb.Net or C#

Reading pdf content using iTextSharp in C#

In .Net, once you have a string, you have a string, and it is Unicode, always. The actual in-memory implementation is UTF-16 but that doesn't matter. Never, ever, ever decompose the string into bytes and try to reinterpret it as a different encoding and slap it back as a string because that doesn't make sense and will almost always fail.

Your problem is this line:

currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));

I'm going to pull it apart into a couple of lines to illustrate:

byte[] bytes = Encoding.UTF8.GetBytes("ی"); //bytes now holds 0xDB8C
byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, bytes);//converted now holds 0xC39BC592
string final = Encoding.UTF8.GetString(converted);//final now holds ی

The code will mix up anything above the 127 ASCII barrier. Drop the re-encoding line and you should be good.

Side-note, it is totally possible that whatever creates a string does it incorrectly, that's not too uncommon actually. But you need to fix that problem before it becomes a string, at the byte level.

EDIT

The code should be the exact same as yours above except that one line should be removed. Also, whatever you're using to display the text in, make sure that it supports Unicode. Also, as @kuujinbo said, make sure that you're using a recent version of iTextSharp. I tested this with 5.2.0.0.

    public string ReadPdfFile(string fileName) {
StringBuilder text = new StringBuilder();

if (File.Exists(fileName)) {
PdfReader pdfReader = new PdfReader(fileName);

for (int page = 1; page <= pdfReader.NumberOfPages; page++) {
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}

EDIT 2

The above code fixes the encoding issue but doesn't fix the order of the strings themselves. Unfortunately this problem appears to be at the PDF level itself.

Consequently, showing text in such right-to-left writing systems
requires either positioning each glyph individually (which is tedious
and costly) or representing text with show strings (see 9.2,
“Organization and Use of Fonts”) whose character codes are given in
reverse order.

PDF 2008 Spec - 14.8.2.3.3 - Reverse-Order Show Strings

When re-ordering strings such as above, content is (if I understand the spec correctly) supposed to use a "marked content" section, BMC. However, the few sample PDFs that I've looked at and generated don't appear to actually do this. I absolutely could be wrong on this part because this is very much not my specialty so you'll have to poke around so more.

Following along a post about reading a PDF with iTextsharp, but ran into errors

You need to add using System.Text; statement to import the required namespace to use StringBuilder and Encoding classes.

PDF Found but failed to open for iTextSharp

Dim myBytes = System.IO.File.ReadAllBytes(sourceFile) reader = New iTextSharp.text.pdf.PdfReader(myBytes) from @Chris-Haas was the answer without changing any settings.

Reading PDF documents in .Net

Since this question was last answered in 2008, iTextSharp has improved their api dramatically. If you download the latest version of their api from http://sourceforge.net/projects/itextsharp/, you can use the following snippet of code to extract all text from a pdf into a string.

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PdfParser
{
public static class PdfTextExtractor
{
public static string pdfText(string path)
{
PdfReader reader = new PdfReader(path);
string text = string.Empty;
for(int page = 1; page <= reader.NumberOfPages; page++)
{
text += PdfTextExtractor.GetTextFromPage(reader,page);
}
reader.Close();
return text;
}
}
}

Reading PDF document with iTextSharp creates string with repeating first page

Thanks to Chris Haas I found out was going wrong. The samples found online on how to use iTextSharp.Pdf are incorrect or incorrect for my implementation.

The SimpleTextExtractionStrategy needs to be instantiated for every page you try to read. Not doing this will multiply each previous page in the resulting string.

Also the line where the StringBuilder is being appended can be changed from:

sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));

to

sb.Append(text);

Thus the following code gives the correct result:

using (PdfReader reader = new PdfReader(fileStream))
{
StringBuilder sb = new StringBuilder();

for (int page = 0; page < reader.NumberOfPages; page++)
{
string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, new SimpleTextExtractionStrategy());
if (!string.IsNullOrWhiteSpace(text))
sb.Append(text);
}
Debug.WriteLine(sb.ToString());
}

Strange characters reading PDF with iTextSharp

The file does not contain information required for text extraction. Furthermore, the file is invalid as a PDF/A file.

Information for text extraction

The sample file contains a background (located in a form XObject resource) showing the empty form and a foreground (immediately in the page content stream) of filled-in values.

The text in the form XObject is drawn using a Type 3 font without a standard encoding or standard names in its encoding. There also is no ToUnicode map in it.

This means that text drawing instructions in that form XObject have arguments which are sequences of bytes, and for each byte value the Type 3 font object provides a stream containing simple drawing instructions (path definitions using lines and curves; path filling instructions), but there is no information which Unicode value corresponds to that byte value or set of drawing instructions.

Thus, PDF viewers can draw the page but they cannot correctly put a Unicode string of characters into the clipboard which we as humans would read from that drawing, and neither can iTextSharp.

Short of OCR there is no reasonable way to extract text from the form.


The text immediately in the foreground, on the other hand, is drawn using a font with a standard encoding (WinAnsiEncoding) and, therefore, can be extracted. Thus, at the end of the output of the OP's code you'll find

\u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000

...

\u0000 \u0000 \u0000 x s \u0000 l t n q o x m l \u0000 z \u0000 ~ { \u0000 } } \u0000 l w x
2016
14874587948 DITTA PROVA SRL
CREMA CR 26013 VIA DANTE 17
011110
LPRGCM82T26D150H LEOPARDI GIACOMO
M 26 12 1982 CREMONA CR
MILANO MI F205
28 02 2017
DITTAP0101 / LEOGIA01001

i.e. the filled-in values of the form.

PDF/A conformance

The file indeed claims to be PDF/A-1a but inspecting it one quickly sees that this is a blatant lie. E.g. Adobe Acrobat Preflight says:

Preflight report screenshot

These entries indicate that the document actually does not even try to actually be PDF/A-a1 conform, it merely claims so.



Related Topics



Leave a reply



Submit