How to Get All Text from a PDF in Swift

How can I get all text from a PDF in Swift?

That is unfortunately not possible.

At least not without some major work on your part. And it certainly is not possible in a general manner for all PDFs.

PDFs are (generally) a one-way street.

They were created to display text identically on every system, and to let printers print a document without having to know about every font it uses.

Extracting text is non-trivial and only possible for PDFs where the basic image-PDF is accompanied by text (which it does not have to be). All text information present in the PDF is coupled with location information that determines where it is shown.

If you have a table shown in the PDF where the left column contains the names of the entries and the right column contains their contents, both of those columns can be represented as completely different blocks of text which only appear to be linked because they are placed next to each other.

What the framework / your code would have to do is determine which pieces of text that are visually linked are also logically linked and belong together. That is not (yet) possible in general. The reason you and I can read, understand, and group the contents of a PDF is that in some fields our brains are still far better than computers.

Final note, because it might cause confusion: it is certainly possible that Adobe and Apple already do some of this grouping and achieve a good result, but it is still not perfect. The PDF I just tested was pretty mangled after extracting the text via Preview on the Mac.
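
For PDFs that do carry a text layer, PDFKit will hand back whatever text it finds, and the caveats above still apply: ordering and grouping may be off, and scanned pages yield nothing. A minimal sketch, assuming the file can be opened as a `PDFDocument`; the function name `extractText(from:)` and the `url` variable in the usage note are made up for illustration:

```swift
import PDFKit

/// Collects the text PDFKit reports for each page, in page order.
/// Pages without a text layer (for example scans) contribute nothing.
func extractText(from document: PDFDocument) -> String {
    var pages: [String] = []
    for index in 0..<document.pageCount {
        // `string` is optional and may be nil or empty for image-only pages
        if let pageText = document.page(at: index)?.string, !pageText.isEmpty {
            pages.append(pageText)
        }
    }
    return pages.joined(separator: "\n")
}
```

Usage might look like `if let document = PDFDocument(url: url) { print(extractText(from: document)) }`. `PDFDocument` also exposes a `string` property covering the whole document; going page by page just makes it easier to spot which pages come back empty.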

How can I get a selected word in PDF so that I can have the word pronounced? [Swift, PDFKit]

  1. Add a UITapGestureRecognizer to pdfView:

    let tapgesture = UITapGestureRecognizer(target: self, action: #selector(tapGesture(_:)))
    pdfView.addGestureRecognizer(tapgesture)

  2. Handle the tap gesture (needs import PDFKit and import AVFoundation):

    // Keep the synthesizer as a property: a local instance can be
    // deallocated before it has finished (or even started) speaking.
    let synthesizer = AVSpeechSynthesizer()

    @objc func tapGesture(_ gestureRecognizer: UITapGestureRecognizer) {
        let point = gestureRecognizer.location(in: pdfView)

        // Find the page under the tap
        guard let page = pdfView.page(for: point, nearest: false) else { return }

        // Convert the point from the pdfView coordinate system to the page coordinate system
        let convertedPoint = pdfView.convert(point, to: page)

        // Ensure that there is no link/URL at this point
        guard page.annotation(at: convertedPoint) == nil else { return }

        // Get the word at this point
        guard let selection = page.selectionForWord(at: convertedPoint),
              let wordTouched = selection.string else { return }

        // Pronounce the word
        let utterance = AVSpeechUtterance(string: wordTouched)
        utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
        synthesizer.speak(utterance)

        // If you also want to show the selection of this word for one second
        pdfView.currentSelection = selection
        DispatchQueue.main.asyncAfter(deadline: .now() + 1) {
            self.pdfView.clearSelection()
        }
    }

Draw text on all pages of PDF using PDFKit

The main issue here is that the context is recreated for every page. For multiple pages we should write into the same context (it manages pages through beginPDFPage/endPDFPage pairs).

Here is the fixed code, tested with Xcode 13.4 / macOS 12.4.

// `input`, `text`, `drawrect` and `textFontAttributes` come from the question
let pdffile = PDFDocument(url: input)!
let data = NSMutableData()
let consumer = CGDataConsumer(data: data as CFMutableData)!

// Create a common context with no mediaBox; we will add it later
// per page (because the pages might actually differ in size)
let context = CGContext(consumer: consumer, mediaBox: nil, nil)!

for index in 0..<pdffile.pageCount {
    let page: PDFPage = pdffile.page(at: index)!

    // Re-use the media box of the original document as-is, without changes
    var mediaBox = page.bounds(for: .mediaBox)
    NSGraphicsContext.current = NSGraphicsContext(cgContext: context, flipped: false)

    // Prepare the mediaBox data for page setup
    let rectData = NSData(bytes: &mediaBox, length: MemoryLayout.size(ofValue: mediaBox))

    context.beginPDFPage([kCGPDFContextMediaBox as String: rectData] as CFDictionary) // << here !!

    page.draw(with: .mediaBox, to: context) // << original !!
    text.draw(in: drawrect, withAttributes: textFontAttributes) // << over !!

    context.endPDFPage()
}
context.closePDF() // close the entire document

let anotherDocument = PDFDocument(data: data as Data)
// ... as used before
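
The snippet above uses `input`, `text`, `drawrect` and `textFontAttributes` from the question without defining them. For anyone reading this answer on its own, here is one hypothetical set of definitions that would make it compile on macOS; the path, string, rectangle, and styling are all placeholder assumptions, not values from the original question:

```swift
import AppKit
import PDFKit

// Hypothetical input location; replace with the real file URL.
let input = URL(fileURLWithPath: "/path/to/input.pdf")

// NSString so that draw(in:withAttributes:) from AppKit is available.
let text = "CONFIDENTIAL" as NSString

// Rectangle in page coordinates (origin at the bottom left for an unflipped context).
let drawrect = CGRect(x: 40, y: 40, width: 300, height: 40)

// Placeholder styling for the stamped text.
let textFontAttributes: [NSAttributedString.Key: Any] = [
    .font: NSFont.systemFont(ofSize: 24),
    .foregroundColor: NSColor.red
]
```

Because the media box can differ from page to page, a fixed `drawrect` like this one may need to be recomputed inside the loop if the text should sit at the same relative position on every page.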

Read contents of pdf as string

If you want to avoid a lot of programming, you probably need to use some library which will help you extract text from PDFs.

You have two options:

1) Use an OCR library. Since a PDF can contain images besides text, performing OCR to get the text is the most generic solution. To perform OCR on a PDF document, you need to render it to a UIImage object. Another approach is to convert the contents of a WebView to a UIImage, but this might result in a lower-resolution image, which can hurt OCR accuracy.

The downside of using an OCR library is that you will not get 100% accurate text, since the OCR engine introduces recognition errors.

The best options for OCR are Tesseract for iOS (free, but with a higher error rate and a bit more complex to tune for good results) and BlinkOCR, a more robust option which is free to try and paid for commercial use, but comes with a ton of help from their engineers.

2) You can also use a PDF library. PDF libraries can reliably extract text written in the document, with the exception of text that is part of images inside the PDF. So depending on the documents you want to read, this might be a better option (or not).

Some options for PDF libraries can be found here, and in our experience, PDFlib gives very good results and is the most customizable.
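
For option 1, the step of turning pages into UIImage objects can be sketched with PDFKit alone, before handing the images to whichever OCR engine you pick. This is only an illustration under assumptions: the function names and the `scale` parameter are invented here, and the right resolution depends on the OCR engine and the document.

```swift
import PDFKit
import UIKit

/// Renders a single page to an image that an OCR engine could consume.
/// `scale` is a hypothetical knob: larger values request more pixels.
func renderPageImage(_ page: PDFPage, scale: CGFloat = 2.0) -> UIImage {
    let bounds = page.bounds(for: .mediaBox)
    let size = CGSize(width: bounds.width * scale, height: bounds.height * scale)
    // thumbnail(of:for:) asks PDFKit to draw the page into an image of the requested size
    return page.thumbnail(of: size, for: .mediaBox)
}

/// Renders every page of the document, in page order.
func renderAllPages(of document: PDFDocument, scale: CGFloat = 2.0) -> [UIImage] {
    (0..<document.pageCount).compactMap { index in
        document.page(at: index).map { renderPageImage($0, scale: scale) }
    }
}
```

Rendering page by page keeps memory use predictable for long documents, since each image can be passed to the OCR engine and released before the next page is drawn.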

Extract data from fillable pdf swift

I know it has been a while but I found the answer in case anyone else finds this useful. My annotations are of the widget variety.

for annotation in page.annotations {
    // widgetStringValue is optional; unwrap it so the output is not wrapped in Optional(...)
    print(annotation.widgetStringValue ?? "")
}
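
To read a whole form rather than one page, the same loop can be run over every page and the values keyed by field name. A rough sketch, assuming the fields are widget annotations as in my case; the function name `formValues(in:)` is made up, and fields that share a name will overwrite each other in the dictionary:

```swift
import PDFKit

/// Collects field name -> current value for all widget annotations in the document.
func formValues(in document: PDFDocument) -> [String: String] {
    var values: [String: String] = [:]
    for index in 0..<document.pageCount {
        guard let page = document.page(at: index) else { continue }
        for annotation in page.annotations {
            // Only form fields are of interest; skip links, notes, and so on.
            // Assumption: PDFKit reports form fields with the type "Widget".
            guard annotation.type == "Widget",
                  let name = annotation.fieldName else { continue }
            values[name] = annotation.widgetStringValue ?? ""
        }
    }
    return values
}
```

Calling `formValues(in: document)` on a filled-in form should give a dictionary that can be printed or encoded as needed.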

