Looking for Recommendation on How to Convert PDF into Structured Format

PDF to Structured Format

I had a similar problem a while back and ended up writing my own solution. It's called PDFX and it's free to use. It converts a PDF to structured XML and also renders any bitmap images (not vector graphics) found in the PDF separately.

Example input/output can be found here. You might want to give it a try.

How to extract data from a PDF file while keeping track of its structure?

There is essentially no easy cut-and-paste solution, because the PDF format is concerned with presentation rather than structure. Many other answers on this site go into much more detail, but this one should give you the main points:

If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?

If you want to do this in PDF itself (where you would have the most control over the process), you'll have to loop over all the text on each page and identify headers by looking at their text properties (fonts used, size relative to the other text on the page, etc.).
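To make the header idea concrete, here is a minimal sketch in plain Python. It assumes you have already pulled text spans with their font name and size out of the PDF using whatever library you prefer; the `Span` class, `find_headers` and the `ratio` threshold are hypothetical names invented for this illustration, not part of any PDF library.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """One run of text with the font properties it was drawn with (hypothetical type)."""
    text: str
    font: str
    size: float


def find_headers(spans, ratio=1.2):
    """Return spans whose font size is noticeably larger than the body text size."""
    if not spans:
        return []
    # Weight each font size by character count so the body text dominates.
    weights = Counter()
    for span in spans:
        weights[span.size] += len(span.text)
    body_size = weights.most_common(1)[0][0]
    # The ratio is an assumed threshold; tune it for your documents.
    return [span for span in spans if span.size >= body_size * ratio]


spans = [
    Span("1. Introduction", "Helvetica-Bold", 16.0),
    Span("PDF stores glyphs and positions, not paragraphs or headings.", "Times-Roman", 10.0),
    Span("Footnote text", "Times-Roman", 8.0),
]
headers = find_headers(spans)
print([span.text for span in headers])
```

Size alone will miss headers that are only bold or only set in a different font, so in practice you would probably combine several of these properties.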

On top of that, you'll also have to identify paragraphs by looking at the positioning of text fragments, the white space on the page, and the closeness of certain letters, words and lines. PDF by itself doesn't even have a concept of a "word", let alone "lines" or "paragraphs".
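As a rough illustration of rebuilding words and lines from positions, the sketch below clusters fragments by baseline and inserts a space wherever the horizontal gap is large enough. The `Fragment` type, the function name and both tolerances are assumptions made up for this example, and it uses the usual PDF convention that y grows upwards from the bottom of the page.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text as drawn on the page (hypothetical type)."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline; larger values are higher on the page


def group_into_lines(fragments, y_tol=2.0, word_gap=1.5):
    """Cluster fragments into lines by baseline, then join them left to right."""
    rows = []
    for frag in sorted(fragments, key=lambda f: -f.y):
        # Fragments whose baseline is close to the row's first fragment share a line.
        if rows and abs(rows[-1][0].y - frag.y) <= y_tol:
            rows[-1].append(frag)
        else:
            rows.append([frag])

    lines = []
    for row in rows:
        row.sort(key=lambda f: f.x0)
        text = row[0].text
        for prev, cur in zip(row, row[1:]):
            # A horizontal gap wider than word_gap is treated as a word break.
            text += (" " if cur.x0 - prev.x1 > word_gap else "") + cur.text
        lines.append(text)
    return lines


fragments = [
    Fragment("wor", 40.0, 55.0, 700.0),
    Fragment("Hello", 10.0, 35.0, 700.5),
    Fragment("ld", 55.2, 65.0, 700.0),
    Fragment("Second line", 10.0, 60.0, 688.0),
]
lines = group_into_lines(fragments)
print(lines)
```

Fixed tolerances like these break down quickly with mixed font sizes; scaling them by the font size of each fragment is one possible refinement.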

To complicate things even more, the way text is drawn on the page (and thus the order in which it appears in the PDF file itself) doesn't even have to be the proper reading order (or what we humans would consider to be the proper reading order).
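A small sketch of what restoring reading order could look like: blocks arrive in drawing order and are re-sorted by column, then top to bottom. It assumes you already know the page width and the number of equal-width columns, which is a big simplification; the dictionary keys and the `reading_order` function are hypothetical.

```python
def reading_order(blocks, page_width, columns=1):
    """Sort text blocks by column, then top to bottom, then left to right.

    Each block is a dict with "text", "x" (left edge) and "y" (top edge,
    measured from the bottom of the page as PDF coordinates usually are).
    """
    col_width = page_width / columns

    def sort_key(block):
        # Assign the block to a column based on its left edge.
        col = min(int(block["x"] // col_width), columns - 1)
        return (col, -block["y"], block["x"])

    return sorted(blocks, key=sort_key)


# Blocks listed in the (arbitrary) order they were drawn.
blocks = [
    {"text": "Right column, first paragraph", "x": 320, "y": 700},
    {"text": "Left column, second paragraph", "x": 40, "y": 500},
    {"text": "Left column, first paragraph", "x": 40, "y": 700},
    {"text": "Right column, second paragraph", "x": 320, "y": 500},
]
ordered = [b["text"] for b in reading_order(blocks, page_width=600, columns=2)]
print(ordered)
```

Real pages mix full-width headings, sidebars and captions with columns, so detecting the layout is usually the hard part, not the sort itself.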

PDF Data Extraction - Need Suggestions

PDF has only weak structures, nothing like divs or containers. There are layer groups and the like, but coordinates are the only thing you can count on.

Try to describe the text you want by its type and by its margins from the left and right edges, so that your capture does not depend on a particular page.
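One way to read that advice is to express the capture area as margins relative to the page size instead of absolute coordinates. The sketch below does that; both helper names and the fractional-margin convention are assumptions for illustration only.

```python
def region_from_margins(page_width, page_height, left, right, top, bottom):
    """Turn fractional margins (0..1) into an absolute (x0, y0, x1, y1) box.

    Coordinates follow the usual PDF convention: origin at the bottom left.
    """
    return (
        page_width * left,
        page_height * bottom,
        page_width * (1 - right),
        page_height * (1 - top),
    )


def in_region(x, y, box):
    """Check whether a point lies inside the box."""
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


# The same margins give a matching region on pages of different sizes.
box = region_from_margins(800, 1000, left=0.125, right=0.125, top=0.25, bottom=0.25)
small_box = region_from_margins(400, 500, left=0.125, right=0.125, top=0.25, bottom=0.25)
print(box, small_box)
print(in_region(400, 500, box), in_region(50, 500, box))
```

You would then keep only the text fragments whose position falls inside the box, optionally filtered further by font or size to match the "type of text".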

trying to find php library to convert pdf to chm

I'd say that's hardly possible. PDF is a document format that can contain all sorts of things. CHM is a structured help documentation format. Correct me if I'm mistaken, but I don't think these two mix at all, at least not in a way where you run a converter on a file and get a finished result.

What exactly are you trying to do?

Structure of a PDF file?

Here is a link to Adobe's reference material

http://www.adobe.com/devnet/pdf/pdf_reference.html

You should know though that PDF is only about presentation, not structure. Parsing will not come easy.

How to conserve the pdf layout after converting content from English to French using Python

There are no easy ways to open, edit and rewrite PDFs in Python. However, depending on the complexity of the PDF and its structure, you might have success converting the PDF to HTML, translating it, and then generating a PDF from the HTML.

For converting PDF to HTML, there is pdf2html which has a basic Python wrapper.

Once the translation is done, you can reverse this process with varying degrees of success using e.g. weasyprint, html2pdf (Mac only) or wkhtmltopdf (requires Qt).
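For the middle step, the idea is to translate only the text nodes of the intermediate HTML and leave every tag and style attribute untouched, since that is where the layout lives. Below is a minimal sketch using only the standard library. The `TextTranslator` class and `translate_html` function are hypothetical, the glossary lookup is just a stand-in for a real translation service, and the sketch ignores comments, doctypes and script/style contents.

```python
from html import escape
from html.parser import HTMLParser


class TextTranslator(HTMLParser):
    """Re-emit HTML unchanged except for text nodes, which are translated."""

    def __init__(self, translate):
        super().__init__(convert_charrefs=True)
        self.translate = translate
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Keep the original tag text so positioning attributes survive as-is.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Leave whitespace-only nodes alone; translate and re-escape the rest.
        if data.strip():
            data = escape(self.translate(data), quote=False)
        self.out.append(data)


def translate_html(html_in, translate):
    parser = TextTranslator(translate)
    parser.feed(html_in)
    parser.close()
    return "".join(parser.out)


# Stand-in for a real translation call (assumption for this example).
glossary = {"Hello": "Bonjour", "world": "monde"}
html_in = (
    '<div class="t" style="left:72px;top:90px">Hello</div>'
    '<div class="t" style="left:72px;top:110px">world</div>'
)
html_out = translate_html(html_in, lambda text: glossary.get(text, text))
print(html_out)
```

Keep in mind that translated text is often longer than the original, so absolutely positioned boxes may overflow even when the markup is preserved perfectly.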


