Reading PDF Content Using iTextSharp in C#

Reading pdf content using iTextSharp in C#

In .NET, once you have a string, you have a string, and it is Unicode, always. The actual in-memory representation is UTF-16, but that doesn't matter. Never, ever decompose the string into bytes and try to reinterpret them as a different encoding and slap the result back into a string: that doesn't make sense and will almost always fail.

Your problem is this line:

currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));

I'm going to pull it apart into a couple of lines to illustrate:

byte[] bytes = Encoding.UTF8.GetBytes("ی"); //bytes now holds 0xDB8C
byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, bytes);//converted now holds 0xC39BC592
string final = Encoding.UTF8.GetString(converted);//final now holds ÛŒ

This round trip will mangle anything above the 127 ASCII boundary. Drop the re-encoding line and you should be good.

Side note: it is entirely possible that whatever creates the string does it incorrectly; that's actually not uncommon. But you need to fix that problem at the byte level, before it ever becomes a string.
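For illustration, here is a minimal sketch of that idea (not iTextSharp-specific; the file name and code page are assumptions for the example): decode the raw bytes exactly once, with the encoding they were actually written in, and never re-encode the resulting string.

using System.IO;
using System.Text;

// Hypothetical input: a file known to be encoded as Windows-1256 (Arabic/Farsi).
byte[] raw = File.ReadAllBytes("legacy-text.txt");

// Decode once with the correct encoding; the result is a normal .NET string (UTF-16).
string text = Encoding.GetEncoding(1256).GetString(raw);

// From here on, use 'text' as-is. No further Encoding.Convert calls are needed.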

EDIT

The code should be exactly the same as yours above, except that the one line should be removed. Also, make sure that whatever you're using to display the text supports Unicode. And, as @kuujinbo said, make sure that you're using a recent version of iTextSharp. I tested this with 5.2.0.0.

public string ReadPdfFile(string fileName) {
    StringBuilder text = new StringBuilder();

    if (File.Exists(fileName)) {
        PdfReader pdfReader = new PdfReader(fileName);

        for (int page = 1; page <= pdfReader.NumberOfPages; page++) {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

            text.Append(currentText);
        }
        pdfReader.Close();
    }
    return text.ToString();
}
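For reference, the method assumes the usual namespaces and can be called directly; the file path below is just a placeholder:

using System;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// ...

string content = ReadPdfFile(@"C:\temp\sample.pdf"); // hypothetical path
Console.WriteLine(content);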

EDIT 2

The above code fixes the encoding issue but doesn't fix the order of the strings themselves. Unfortunately this problem appears to be at the PDF level itself.

Consequently, showing text in such right-to-left writing systems requires either positioning each glyph individually (which is tedious and costly) or representing text with show strings (see 9.2, “Organization and Use of Fonts”) whose character codes are given in reverse order.

PDF 2008 Spec - 14.8.2.3.3 - Reverse-Order Show Strings

When re-ordering strings such as these, the content is (if I understand the spec correctly) supposed to use a "marked content" section, BMC. However, the few sample PDFs that I've looked at and generated don't appear to actually do this. I could absolutely be wrong about this part, because it is very much not my specialty, so you'll have to poke around some more.
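If you do end up with right-to-left text whose character codes were extracted in reverse order, one crude post-processing workaround is to reverse each run of RTL characters in the extracted string. The helper below is only a rough, hypothetical sketch: it ignores digits, combining marks, and proper Unicode bidi handling, so treat it as a starting point rather than a solution.

using System;
using System.Text;

static class RtlTextHelper {
    // Naive sketch: reverses each contiguous run of Hebrew/Arabic characters.
    // Real bidirectional reordering (the Unicode Bidi Algorithm) is far more involved.
    public static string ReverseRtlRuns(string line) {
        StringBuilder result = new StringBuilder();
        StringBuilder run = new StringBuilder();

        foreach (char c in line) {
            bool isRtl = (c >= '\u0590' && c <= '\u08FF') ||  // Hebrew, Arabic and extensions
                         (c >= '\uFB1D' && c <= '\uFEFC');    // presentation forms
            if (isRtl) {
                run.Append(c);
            } else {
                FlushReversed(run, result);
                result.Append(c);
            }
        }
        FlushReversed(run, result);
        return result.ToString();
    }

    // Appends the buffered RTL run in reverse order and clears the buffer.
    private static void FlushReversed(StringBuilder run, StringBuilder result) {
        if (run.Length == 0) return;
        char[] chars = run.ToString().ToCharArray();
        Array.Reverse(chars);
        result.Append(chars);
        run.Clear();
    }
}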

Cannot read text from PDF by iTextSharp in C#

I work as a Social Media Developer at Aspose. I would suggest you download and try Aspose.Pdf for .NET to convert a PDF to a text file. If your file contains images and you need to extract the text from those images, you can use Aspose.Pdf to convert the PDF pages to images and then perform OCR using Aspose.OCR for .NET.

Following is sample code to convert a PDF to text using Aspose.Pdf for .NET:

//open document
Document pdfDocument = new Document("input.pdf");
//create TextAbsorber object to extract text
TextAbsorber textAbsorber = new TextAbsorber();
//accept the absorber for all the pages
pdfDocument.Pages.Accept(textAbsorber);
//get the extracted text
string extractedText = textAbsorber.Text;
// create a writer and open the file
TextWriter tw = new StreamWriter("extracted-text.txt");
// write a line of text to the file
tw.WriteLine(extractedText);
// close the stream
tw.Close();

Please download a free trial and try it.

Unable to read text in a specific location in a pdf file using iTextSharp

The reason text extraction does not extract those texts is pretty simple: they are not part of the static page content but form fields! "Text extraction" in iText (and in the other PDF libraries I know, too) is understood to mean "extraction of the text of the static page content". Thus, the texts you miss are simply not subject to text extraction.

If you want to make form field values subject to your text extraction code, too, you first have to flatten the form field visualizations. "Flattening" here means making them part of the static page content and dropping all their form field dynamics.

You can do that by adding, after the PDF has been read in this line,

PdfReader pdfReader = new PdfReader(filePath);

code that flattens the PDF and loads the flattened result back into the pdfReader, e.g. like this:

// Flatten the form into a new in-memory PDF.
MemoryStream memoryStream = new MemoryStream();
PdfStamper pdfStamper = new PdfStamper(pdfReader, memoryStream);
pdfStamper.FormFlattening = true;
pdfStamper.Writer.CloseStream = false; // keep the MemoryStream open after the stamper closes
pdfStamper.Close();

// Re-open the flattened PDF for text extraction.
memoryStream.Position = 0;
pdfReader = new PdfReader(memoryStream);

Extracting the text from this re-initialized pdfReader will give you the text from the form fields, too.

Unfortunately, the flattened form text is added at the end of the content stream. As your chosen text extraction strategy, SimpleTextExtractionStrategy, simply returns the text in the order it is drawn, the former form fields' contents are all extracted at the end.

You can change this by using a different text extraction strategy, i.e. by replacing this line:

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
  • Using the LocationTextExtractionStrategy (which is part of the iText distribution) already returns a better result; unfortunately the form field values are not on exactly the same baseline as the static content we perceive to be on the same line, so there are some unexpected line breaks.

    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
  • Using the HorizontalTextExtractionStrategy (from this answer, which contains both a Java and a C# version of it), the result is even better. Beware, though: this strategy is not universally better; read the warnings in the answer text.

    ITextExtractionStrategy strategy = new HorizontalTextExtractionStrategy();
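Putting the two pieces together, here is a rough sketch (assuming iTextSharp 5.x; the method and file names are my own) that flattens the form first and then extracts with the LocationTextExtractionStrategy:

using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static string ReadFlattenedPdf(string filePath) {
    PdfReader pdfReader = new PdfReader(filePath);

    using (MemoryStream memoryStream = new MemoryStream()) {
        // Flatten form fields into the static page content (as shown above).
        PdfStamper pdfStamper = new PdfStamper(pdfReader, memoryStream);
        pdfStamper.FormFlattening = true;
        pdfStamper.Writer.CloseStream = false;
        pdfStamper.Close();

        // Re-open the flattened PDF and extract the text page by page.
        memoryStream.Position = 0;
        pdfReader = new PdfReader(memoryStream);

        StringBuilder text = new StringBuilder();
        for (int page = 1; page <= pdfReader.NumberOfPages; page++) {
            ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            text.Append(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
        }
        pdfReader.Close();
        return text.ToString();
    }
}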

iTextSharp How to read Table in PDF file

To make my comment an actual answer...

You use the LocationTextExtractionStrategy for text extraction:

ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);

This strategy arranges all text it finds in left-to-right lines from top to bottom (actually also taking the text line angle into account). Thus, it clearly is not what you need to extract text from tables whose cells have multi-line content.

Depending on the document in question there are different approaches one can take:

  • Use the iText SimpleTextExtractionStrategy if the text drawing operations in the document in question already are in the order one wants for text extraction.
  • Use a custom text extraction strategy which makes use of tagging information if the document tables are properly tagged.
  • Use a complex custom text extraction strategy which tries to get hints from text arrangements, line paths, or background colors to guess the table cell structure and extract text cell by cell.

In this case, the OP commented that he replaced the LocationTextExtractionStrategy with the SimpleTextExtractionStrategy, and then it worked.
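If you do know (or can estimate) the cell rectangles, one way to pursue the third approach above is iTextSharp's region filtering: wrap a strategy in a FilteredTextRenderListener with a RegionTextRenderFilter so that only text drawn inside a given rectangle is collected. The coordinates and file name below are placeholders; you would have to determine the real cell boundaries for your document.

using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical cell rectangle in PDF user-space coordinates (origin at bottom-left).
Rectangle cellRect = new Rectangle(36, 700, 200, 720);

PdfReader reader = new PdfReader("table.pdf"); // placeholder file name

// Only text rendered inside cellRect reaches the wrapped strategy.
RenderFilter filter = new RegionTextRenderFilter(cellRect);
ITextExtractionStrategy strategy =
    new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);

string cellText = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
reader.Close();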


