Reading PDF documents in .Net
Since this question was last answered in 2008, iTextSharp has improved its API dramatically. If you download the latest version from http://sourceforge.net/projects/itextsharp/, you can use the following snippet to extract all text from a PDF into a string.
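using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PdfParser
{
    // Named PdfTextReader so it does not shadow iTextSharp's own
    // PdfTextExtractor class, which is called below.
    public static class PdfTextReader
    {
        public static string pdfText(string path)
        {
            PdfReader reader = new PdfReader(path);
            try
            {
                var text = new StringBuilder();
                // iTextSharp page numbers start at 1.
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    text.Append(PdfTextExtractor.GetTextFromPage(reader, page));
                }
                return text.ToString();
            }
            finally
            {
                // Release the file even if extraction throws.
                reader.Close();
            }
        }
    }
}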
Reading a part of PDF file in c#
.NET has no built-in API for reading PDF content, so you should first convert the PDF to text (or XML, or HTML).
Many PDF libraries can convert PDF to text, such as iTextSharp (the most popular, and open source), along with many other tools.
To control the size of the output text files you should:
- get the number of pages from the PDF
- run the PDF-to-text conversion page by page while checking the output text file size
- once the file size is over 15 MB, stop the conversion and move on to another file
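The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the extractPage callback is a hypothetical stand-in for whatever library call returns the text of one page, maxBytes is an assumed parameter (15L * 1024 * 1024 for the 15 MB limit), and the size is measured as UTF-8 bytes on the assumption that the output file is written as UTF-8.

```csharp
using System;
using System.Text;

public static class SizeCappedExtractor
{
    // extractPage is a hypothetical callback: given a 1-based page number,
    // it returns that page's text (for example by calling a PDF library).
    public static string ExtractUpTo(int pageCount, Func<int, string> extractPage, long maxBytes)
    {
        var output = new StringBuilder();
        long bytesSoFar = 0;

        for (int page = 1; page <= pageCount; page++)
        {
            string pageText = extractPage(page);
            output.Append(pageText);

            // Assumption: the output file is written as UTF-8.
            bytesSoFar += Encoding.UTF8.GetByteCount(pageText);

            // Stop once the limit is exceeded; the page that crossed it is kept.
            if (bytesSoFar > maxBytes)
            {
                break;
            }
        }

        return output.ToString();
    }
}
```

With iTextSharp, for instance, the callback could be p => PdfTextExtractor.GetTextFromPage(reader, p) and pageCount could be reader.NumberOfPages. The returned string can then be written to the output file before moving on to the next PDF.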
Read text from PDF on .NET Core using Any open source / non-licensed packages
Approach: PdfPig (Apache 2.0 License)
Install the NuGet package PdfPig
Tested on .NET Core 3.1
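using (var stream = File.OpenRead(pdfPath1))
using (UglyToad.PdfPig.PdfDocument document = UglyToad.PdfPig.PdfDocument.Open(stream))
{
    // Page numbers are 1-based, so this reads the second page.
    var page = document.GetPage(2);
    // Join the text of each word (requires using System.Linq).
    return string.Join(" ", page.GetWords().Select(word => word.Text));
}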
Approach: iTextSharp.LGPLv2.Core (GNU Lesser General Public License)
Install the NuGet package iTextSharp.LGPLv2.Core
It is an unofficial port of the last LGPL version of iTextSharp (V4.1.6) to .NET Core.
Tested on .NET Core 3.1
var reader = new PdfReader(pdfPath1);
// Raw content stream of page 1 (page numbers are 1-based).
var streamBytes = reader.GetPageContent(1);
var tokenizer = new PrTokeniser(new RandomAccessFileOrArray(streamBytes));
var sb = new StringBuilder();
while (tokenizer.NextToken())
{
    // Collect only string tokens, skipping operators and numbers.
    if (tokenizer.TokenType == PrTokeniser.TK_STRING)
    {
        var currentText = tokenizer.StringValue;
        currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
        // Append the converted text, not the raw token value.
        sb.Append(currentText);
    }
}
reader.Close();
Console.WriteLine("Extracted text " + sb);
Approach: GrapeCity.Documents.Pdf (Licensed)
Install the NuGet package GrapeCity.Documents.Pdf
It is a cross-platform library that allows creation, modification, and analysis of PDF documents.
Tested on .NET Core 3.1
var doc = new GcPdfDocument();
// Dispose the file stream once extraction is done.
using (FileStream fs = new FileStream(pdfPath1, FileMode.Open, FileAccess.ReadWrite))
{
    doc.Load(fs);
    // Extract the text of page 1 (the Pages collection is zero-based).
    var tmapPage1 = doc.Pages[0].GetTextMap();
    tmapPage1.GetFragment(out TextMapFragment newFragment, out string Extractedtext);
    Console.WriteLine("Extracted Text: \n\n" + Extractedtext);
}
How to read Text from pdf file in c#.net web application
Have a look at the following links:
How to read pdf files using C# .NET
and
Reading PDF in C#
Hopefully they can point you in the right direction.
Reading a PDF File using iText5 for .NET
Try using the LocationTextExtractionStrategy instead of the SimpleTextExtractionStrategy; it adds newline characters to the returned text. You can then use strText.Split('\n') to split the text into a string[] and consume it line by line.
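As a small illustration of the splitting step, here is a sketch that uses only the standard library. The helper name SplitLines is hypothetical, and strText stands in for the string returned by the extractor. Trimming a trailing '\r' and skipping empty lines are assumptions about what you want; drop them if blank lines matter.

```csharp
using System;
using System.Collections.Generic;

public static class LineSplitter
{
    // Splits extracted text on '\n' and returns the non-empty lines.
    public static string[] SplitLines(string strText)
    {
        var lines = new List<string>();
        foreach (string raw in strText.Split('\n'))
        {
            // Remove a trailing '\r' in case the text uses Windows line endings.
            string line = raw.TrimEnd('\r');
            if (line.Length > 0)
            {
                lines.Add(line);
            }
        }
        return lines.ToArray();
    }
}
```

You could then loop over the result with foreach (string line in LineSplitter.SplitLines(strText)) and handle each line on its own.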
How to read a PDF file line by line in c#?
I had this problem too; the following code worked for me.
You will need a reference to the iTextSharp library.
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

PdfReader reader = new PdfReader(@"D:\test pdf\Blood Journal.pdf");
int intPageNum = reader.NumberOfPages;
string text;
string[] words;
string line;
for (int i = 1; i <= intPageNum; i++)
{
    // LocationTextExtractionStrategy puts '\n' between lines of text.
    text = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
    words = text.Split('\n');
    for (int j = 0, len = words.Length; j < len; j++)
    {
        // Process each line of the current page here.
        line = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(words[j]));
    }
}
reader.Close();
On each pass of the outer loop, the words array contains the lines of the current page of the PDF file.
How to read a PDF file from a url of external site in C#
Try making a GET request to your company's website using the HttpClient class in C#. You can do something along these lines:
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public async Task<HttpResponseMessage> GetPDFFromCompanyWebsite()
{
    // Resolve the site root and build the path of the file to save.
    string currentDirectory = System.Web.Hosting.HostingEnvironment.MapPath("~");
    string filePath = Path.Combine(currentDirectory, "App_Data", "someDocument.pdf");
    using (HttpClient client = new HttpClient())
    {
        HttpResponseMessage msg = await client.GetAsync("http://example/sites/index.php/2011-10-30-12-29-04/finish/11/1234");
        if (msg.IsSuccessStatusCode)
        {
            // create a new file to write to
            using (var file = File.Create(filePath))
            {
                var contentStream = await msg.Content.ReadAsStreamAsync(); // get the actual content stream
                await contentStream.CopyToAsync(file); // copy that stream to the file stream
                await file.FlushAsync(); // flush back to disk before disposing
            }
        }
        return msg;
    }
}