Reading PDF Documents in .NET

Since this question was last answered in 2008, iTextSharp has improved its API dramatically. If you download the latest version from http://sourceforge.net/projects/itextsharp/, you can use the following snippet to extract all the text from a PDF into a string.

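using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PdfParser
{
    public static class PdfTextExtractor
    {
        public static string pdfText(string path)
        {
            PdfReader reader = new PdfReader(path);
            try
            {
                StringBuilder text = new StringBuilder();
                // iTextSharp page numbers start at 1.
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    // Fully qualified: this class shares its name with iTextSharp's
                    // PdfTextExtractor, so the short name would refer to this class.
                    text.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, page));
                }
                return text.ToString();
            }
            finally
            {
                reader.Close();
            }
        }
    }
}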

Reading a part of PDF file in c#

.NET has no built-in way to read PDF files. You first need to convert the PDF to text (or XML, or HTML).

There are a lot of PDF libraries capable of converting PDF to text, such as iTextSharp (the most popular, and open source), as well as many other tools.

To control the size of the output text files, you should:

  • get the number of pages from the PDF
  • run the PDF-to-text conversion page by page, checking the size of the output text file as you go
  • once the file size is over 15 MB, stop the conversion and move on to the next file
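
The steps above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: SizeLimitedExport, WritePages and the extractPage callback are hypothetical names introduced here, and the callback stands in for whatever per-page call your PDF library offers (with iTextSharp, something like page => PdfTextExtractor.GetTextFromPage(reader, page)).

```csharp
using System;
using System.IO;
using System.Text;

public static class SizeLimitedExport
{
    // Writes the text of pages 1..pageCount to outputPath and stops as soon as
    // the output file grows past maxBytes. Returns the number of pages written.
    // extractPage is a hypothetical callback that returns the text of one page.
    public static int WritePages(int pageCount, Func<int, string> extractPage,
                                 string outputPath, long maxBytes)
    {
        int written = 0;
        // UTF-8 without a byte order mark, so the file size reflects the text only.
        using (var writer = new StreamWriter(outputPath, false, new UTF8Encoding(false)))
        {
            for (int page = 1; page <= pageCount; page++)
            {
                writer.Write(extractPage(page));
                writer.Flush(); // push buffered text to the stream before measuring
                written++;

                if (writer.BaseStream.Length > maxBytes)
                {
                    break; // over the limit: stop and let the caller move to the next file
                }
            }
        }
        return written;
    }
}
```

For the 15 MB limit mentioned above, you would pass 15L * 1024 * 1024 as maxBytes. Because the size is checked after each page is written, the file can exceed the limit by up to one page of text; check before writing instead if you need a hard cap.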

Read text from PDF on .NET Core using Any open source / non-licensed packages

Approach: PdfPig (Apache 2.0 License)

Install the NuGet package PdfPig.

Tested on .NET Core 3.1

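using (var stream = File.OpenRead(pdfPath1))
using (UglyToad.PdfPig.PdfDocument document = UglyToad.PdfPig.PdfDocument.Open(stream))
{
    // PdfPig page numbers start at 1, so this reads the second page.
    var page = document.GetPage(2);
    // Join the words of the page with single spaces.
    return string.Join(" ", page.GetWords());
}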

Approach: iTextSharp.LGPLv2.Core (GNU Lesser General Public License)

Install the NuGet package iTextSharp.LGPLv2.Core.

It is an unofficial port of the last LGPL version of iTextSharp (v4.1.6) to .NET Core.

Tested on .NET Core 3.1

using System;
using System.Text;
using iTextSharp.text.pdf;

var reader = new PdfReader(pdfPath1);
// Raw content stream of page 1 (page numbers start at 1).
var streamBytes = reader.GetPageContent(1);
var tokenizer = new PrTokeniser(new RandomAccessFileOrArray(streamBytes));
var sb = new StringBuilder();
while (tokenizer.NextToken())
{
    // Only string tokens carry text; operators and numbers are skipped.
    if (tokenizer.TokenType == PrTokeniser.TK_STRING)
    {
        var currentText = tokenizer.StringValue;
        currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
        sb.Append(currentText); // append the converted text, not the raw token
    }
}
reader.Close();

Console.WriteLine("Extracted text " + sb);

Approach: GrapeCity.Documents.Pdf (Licensed)

Install the NuGet package GrapeCity.Documents.Pdf.
It is a cross-platform library that allows creation, modification, and analysis of PDF documents.

Tested on .NET Core 3.1

var doc = new GcPdfDocument();
using (FileStream fs = new FileStream(pdfPath1, FileMode.Open, FileAccess.ReadWrite))
{
    doc.Load(fs);
    // Extract page 1 (the Pages collection is zero-based)
    var tmapPage1 = doc.Pages[0].GetTextMap();
    tmapPage1.GetFragment(out TextMapFragment newFragment, out string extractedText);

    Console.WriteLine("Extracted Text: \n\n" + extractedText);
}

How to read Text from pdf file in c#.net web application

Have a look at the following links:

How to read pdf files using C# .NET

and

Reading PDF in C#

Hopefully they can point you in the right direction.

Reading a PDF File using iText5 for .NET

Try using LocationTextExtractionStrategy instead of SimpleTextExtractionStrategy; it adds newline characters to the returned text. You can then use strText.Split('\n') to split the text into a string[] and consume it line by line.

How to read a PDF file line by line in c#?

I had this problem too, and the following code worked for me.

You will need a reference to the iTextSharp library.

using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

PdfReader reader = new PdfReader(@"D:\test pdf\Blood Journal.pdf");
int intPageNum = reader.NumberOfPages;
string text;
string[] words;
string line;

for (int i = 1; i <= intPageNum; i++)
{
    // LocationTextExtractionStrategy inserts '\n' between lines of text.
    text = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());

    words = text.Split('\n');
    for (int j = 0, len = words.Length; j < len; j++)
    {
        // line holds one line of page i; process it here.
        line = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(words[j]));
    }
}
reader.Close();

On each pass through the outer loop, the words array contains the lines of the current page.

How to read a PDF file from a url of external site in C#

Try making a GET request to your company's website using the HttpClient class in C#. You can do something along these lines:

using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public async Task<HttpResponseMessage> GetPDFFromCompanyWebsite()
{
    string currentDirectory = System.Web.Hosting.HostingEnvironment.MapPath("~");
    string filePath = Path.Combine(currentDirectory, "App_Data", "someDocument.pdf");

    using (HttpClient client = new HttpClient())
    {
        HttpResponseMessage msg = await client.GetAsync("http://example/sites/index.php/2011-10-30-12-29-04/finish/11/1234");

        if (msg.IsSuccessStatusCode)
        {
            // create a new file to write to
            using (var file = File.Create(filePath))
            {
                var contentStream = await msg.Content.ReadAsStreamAsync(); // get the actual content stream
                await contentStream.CopyToAsync(file); // copy that stream to the file stream
                await file.FlushAsync(); // flush back to disk before disposing
            }
        }
        return msg;
    }
}
