Swift CGPDFDocument Parsing

swift CGPDFDocument parsing

Your code for retrieving the top-level catalog and info dictionaries is correct, but you need to expand the decoding in CGPDFDictionaryApplyFunction to display the PDF values according to their types (integer, string, array, dictionary, and so on). The signature of the CGPDFDictionaryApplierFunction you are calling is:

typealias CGPDFDictionaryApplierFunction = (UnsafePointer<Int8>, COpaquePointer, UnsafeMutablePointer<()>) -> Void

Your program is displaying the pointers to the data. You can access the data values according to their types as shown below (Swift 2):

    let filepath = "/Users/ben/Desktop/Test.pdf"
    let urlDocument = NSURL(fileURLWithPath: filepath)
    let myDocument = CGPDFDocumentCreateWithURL(urlDocument)
    if myDocument != nil {
        let numPages = CGPDFDocumentGetNumberOfPages(myDocument)
        print("Number of pages: \(numPages)")
        // Get complete catalog
        let myCatalog = CGPDFDocumentGetCatalog(myDocument)
        CGPDFDictionaryApplyFunction(myCatalog, printPDFKeys, nil)
        let myInfo = CGPDFDocumentGetInfo(myDocument)
        CGPDFDictionaryApplyFunction(myInfo, printPDFKeys, nil)
    } else {
        print("Cannot open PDF document")
    }

To be passed to CGPDFDictionaryApplyFunction, printPDFKeys must be declared as a global function (outside your main class); alternatively, you could put the code in a closure passed to CGPDFDictionaryApplyFunction, as in your example above. The code below is shortened and does not include complete protection against errors and null values.

func printPDFKeys(key: UnsafePointer<Int8>, object: COpaquePointer, info: UnsafeMutablePointer<()>) {
    let keyString = String(CString: UnsafePointer<CChar>(key), encoding: NSISOLatin1StringEncoding)
    let objectType = CGPDFObjectGetType(object)
    if keyString == nil {
        return
    }
    print("key \(keyString!) is present in dictionary, type \(objectType.rawValue)")
    var ptrObjectValue = UnsafePointer<Int8>()
    switch objectType {
    // ObjectType is enum of:
    //   Null
    //   Boolean
    //   Integer
    //   Real
    //   Name
    //   String
    //   Array
    //   Dictionary
    //   Stream
    case .Boolean:
        // Boolean
        var objectBoolean = CGPDFBoolean()
        if CGPDFObjectGetValue(object, objectType, &objectBoolean) {
            let testbool = NSNumber(unsignedChar: objectBoolean)
            print("Boolean value \(testbool)")
        }
    case .Integer:
        // Integer
        var objectInteger = CGPDFInteger()
        if CGPDFObjectGetValue(object, objectType, &objectInteger) {
            print("Integer value \(objectInteger)")
        }
    case .Real:
        // Real
        var objectReal = CGPDFReal()
        if CGPDFObjectGetValue(object, objectType, &objectReal) {
            print("Real value \(objectReal)")
        }
    case .Name:
        // Name
        if (CGPDFObjectGetValue(object, objectType, &ptrObjectValue)) {
            let stringName = String(CString: UnsafePointer<CChar>(ptrObjectValue), encoding: NSISOLatin1StringEncoding)
            print("Name value: \(stringName!)")
        }
    case .String:
        // String: only decode the value if it was actually retrieved
        if CGPDFObjectGetValue(object, objectType, &ptrObjectValue) {
            if let stringValue = CGPDFStringCopyTextString(COpaquePointer(ptrObjectValue)) {
                print("String value: \(stringValue)")
            }
        }
    case .Array:
        // Array
        print("Array")
        var objectArray = CGPDFArrayRef()
        if (CGPDFObjectGetValue(object, objectType, &objectArray))
        {
            print("array: \(arrayFromPDFArray(objectArray))")
        }
    case .Dictionary:
        // Dictionary
        var objectDictionary = CGPDFDictionaryRef()
        if (CGPDFObjectGetValue(object, objectType, &objectDictionary)) {
            let count = CGPDFDictionaryGetCount(objectDictionary)
            print("Found dictionary with \(count) entries")
            if !(keyString == "Parent") && !(keyString == "P") {
                //catalogLevel = catalogLevel + 1
                CGPDFDictionaryApplyFunction(objectDictionary, printPDFKeys, nil)
                //catalogLevel = catalogLevel - 1
            }
        }
    case .Stream:
        // Stream
        print("Stream")
        var objectStream = CGPDFStreamRef()
        if (CGPDFObjectGetValue(object, objectType, &objectStream)) {
            var fmt: CGPDFDataFormat = .Raw
            if let streamData = CGPDFStreamCopyData(objectStream, &fmt) {
                let data = NSData(data: streamData)
                // Binary stream data is often not valid UTF-8, in which case dataString is nil
                let dataString = NSString(data: data, encoding: NSUTF8StringEncoding)
                let dataLength: Int = CFDataGetLength(streamData)
                print("data stream (length=\(dataLength)):")
                if dataLength < 400 {
                    print(dataString)
                }
            }
        }
    default:
        print("Null")
    }
}

// Convert a PDF array into an NSMutableArray
func arrayFromPDFArray(pdfArray: CGPDFArrayRef) -> NSMutableArray {
    let tmpArray: NSMutableArray = NSMutableArray()

    let count = CGPDFArrayGetCount(pdfArray)
    for i in 0..<count {
        var value = CGPDFObjectRef()
        if (CGPDFArrayGetObject(pdfArray, i, &value)) {
            if let object = objectForPDFObject(value) {
                tmpArray.addObject(object)
            }
        }
    }

    return tmpArray
}

func objectForPDFObject(object: CGPDFObjectRef) -> AnyObject? {
    let objectType: CGPDFObjectType = CGPDFObjectGetType(object)
    var ptrObjectValue = UnsafePointer<Int8>()
    switch (objectType) {
    case .Boolean:
        // Boolean
        var objectBoolean = CGPDFBoolean()
        if CGPDFObjectGetValue(object, objectType, &objectBoolean) {
            let testbool = NSNumber(unsignedChar: objectBoolean)
            return testbool
        }
    case .Integer:
        // Integer
        var objectInteger = CGPDFInteger()
        if CGPDFObjectGetValue(object, objectType, &objectInteger) {
            return objectInteger
        }
    case .Real:
        // Real
        var objectReal = CGPDFReal()
        if CGPDFObjectGetValue(object, objectType, &objectReal) {
            return objectReal
        }
    case .String:
        // String: only decode the value if it was actually retrieved
        if CGPDFObjectGetValue(object, objectType, &ptrObjectValue) {
            return CGPDFStringCopyTextString(COpaquePointer(ptrObjectValue))
        }
    case .Dictionary:
        // Dictionary
        var objectDictionary = CGPDFDictionaryRef()
        if (CGPDFObjectGetValue(object, objectType, &objectDictionary)) {
            let count = CGPDFDictionaryGetCount(objectDictionary)
            print("In array, found dictionary with \(count) entries")
            CGPDFDictionaryApplyFunction(objectDictionary, printPDFKeys, nil)
        }
    case .Stream:
        // Stream
        var objectStream = CGPDFStreamRef()
        if (CGPDFObjectGetValue(object, objectType, &objectStream)) {
            var fmt: CGPDFDataFormat = .Raw
            if let streamData = CGPDFStreamCopyData(objectStream, &fmt) {
                let data = NSData(data: streamData)
                let dataString = NSString(data: data, encoding: NSUTF8StringEncoding)
                print("data stream (length=\(CFDataGetLength(streamData))):")
                return dataString
            }
        }
    default:
        return nil
    }
    return nil
}
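
For readers on a newer Swift toolchain, the same walk over the catalog and info dictionaries might look roughly like the sketch below. It is not part of the original answer: the function names `describeScalar` and `dumpTopLevelDictionaries` are made up for illustration, only scalar types are decoded, and nested arrays, dictionaries and streams are deliberately left out to keep it short.

```swift
import Foundation
#if canImport(CoreGraphics)   // Core Graphics is only available on Apple platforms
import CoreGraphics

/// Hypothetical helper: describes a scalar PDF object, or returns nil
/// for containers (array, dictionary, stream) and null objects.
func describeScalar(_ object: CGPDFObjectRef) -> String? {
    switch CGPDFObjectGetType(object) {
    case .boolean:
        var value: CGPDFBoolean = 0
        return CGPDFObjectGetValue(object, .boolean, &value) ? "Boolean \(value != 0)" : nil
    case .integer:
        var value: CGPDFInteger = 0
        return CGPDFObjectGetValue(object, .integer, &value) ? "Integer \(value)" : nil
    case .real:
        var value: CGPDFReal = 0
        return CGPDFObjectGetValue(object, .real, &value) ? "Real \(value)" : nil
    case .name:
        var name: UnsafePointer<CChar>?
        guard CGPDFObjectGetValue(object, .name, &name), let name = name else { return nil }
        return "Name \(String(cString: name))"
    case .string:
        var string: CGPDFStringRef?
        guard CGPDFObjectGetValue(object, .string, &string), let string = string,
              let text = CGPDFStringCopyTextString(string) else { return nil }
        return "String \(text as String)"
    default:
        return nil
    }
}

/// Hypothetical helper: prints the top-level keys of the catalog and info
/// dictionaries. Returns false if the document cannot be opened.
func dumpTopLevelDictionaries(ofPDFAt path: String) -> Bool {
    guard let document = CGPDFDocument(URL(fileURLWithPath: path) as CFURL) else {
        return false
    }
    print("Number of pages: \(document.numberOfPages)")
    for dictionary in [document.catalog, document.info] {
        guard let dictionary = dictionary else { continue }
        // The closure captures nothing, so it can be used as a C function pointer.
        CGPDFDictionaryApplyFunction(dictionary, { key, object, _ in
            let name = String(cString: key)
            print("\(name): \(describeScalar(object) ?? "(container or null)")")
        }, nil)
    }
    return true
}
#endif
```

Calling `dumpTopLevelDictionaries(ofPDFAt: "/Users/ben/Desktop/Test.pdf")` would then print one line per key; recursing into dictionaries would need the same guard against "Parent" and "P" keys as the Swift 2 code above to avoid infinite loops.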

How can I read-modify-write a PDF (CGPDFDocument) on iOS?

I'm not aware of a way to do this with CoreGraphics other than iterating over each page and drawing it into a new document. At least that's what I'm doing in PSPDFKit. Re-rendering a whole document is quite slow for larger documents.
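
As a rough illustration of the page-by-page approach, here is a minimal sketch in current Swift. The function name `rewritePDF` is hypothetical, and the "modify" step is only marked by a comment; this is not how PSPDFKit does it, just the plain CoreGraphics pattern.

```swift
import Foundation
#if canImport(CoreGraphics)   // Core Graphics is only available on Apple platforms
import CoreGraphics

/// Hypothetical helper: redraws every page of the PDF at `sourceURL`
/// into a new PDF at `destinationURL`. Returns false if the source
/// cannot be opened or the destination context cannot be created.
func rewritePDF(from sourceURL: URL, to destinationURL: URL) -> Bool {
    guard let source = CGPDFDocument(sourceURL as CFURL),
          let context = CGContext(destinationURL as CFURL, mediaBox: nil, nil) else {
        return false
    }
    // PDF page numbers are 1-based.
    for pageNumber in stride(from: 1, through: source.numberOfPages, by: 1) {
        guard let page = source.page(at: pageNumber) else { continue }
        var mediaBox = page.getBoxRect(.mediaBox)
        context.beginPage(mediaBox: &mediaBox)
        context.drawPDFPage(page)
        // Any extra drawing on top of the page (the "modify" step) would go here.
        context.endPage()
    }
    context.closePDF()
    return true
}
#endif
```

Every page is re-rendered through the drawing context, which is why this gets slow for large documents.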

You also lose some metadata when you go that way. Manipulating the PDF more directly may be a better approach, but note that you need to update the xref table and trailer in the PDF if you directly edit strings, since they record byte offsets that change once you add or delete characters.
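
To see why the offsets matter, here is a toy sketch that is not a real PDF parser. The helper `objectOffsets` is hypothetical and assumes ASCII content, generation number 0, and object headers at the start of a line. It recomputes where each object starts before and after an in-place string edit.

```swift
import Foundation

/// Hypothetical helper: maps object number to the byte offset of its
/// "N 0 obj" header in a simplified, PDF-like text buffer.
func objectOffsets(in pdf: String) -> [Int: Int] {
    var offsets: [Int: Int] = [:]
    var position = 0
    for line in pdf.components(separatedBy: "\n") {
        let parts = line.split(separator: " ")
        if parts.count == 3, parts[1] == "0", parts[2] == "obj", let number = Int(parts[0]) {
            offsets[number] = position
        }
        position += line.utf8.count + 1   // +1 for the newline separator
    }
    return offsets
}

let original = "%PDF-1.4\n1 0 obj\n(Hello)\nendobj\n2 0 obj\n(World)\nendobj\n"
let edited = original.replacingOccurrences(of: "(Hello)", with: "(Hello there)")

print(objectOffsets(in: original))   // object 2 starts at byte 32
print(objectOffsets(in: edited))     // object 2 now starts at byte 38
```

Object 1 keeps its offset, but every object after the edited string moves by six bytes, so any stored offset for object 2 would now point into the middle of other data.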

How can I get all text from a PDF in Swift?

That is unfortunately not possible.

At least not without some major work on your part. And it certainly is not possible in a general manner for all PDFs.

PDFs are (generally) a one-way street.

They were created to display text in the same way on every system without any difference and for printers to print a document without the printer having to know all fonts and stuff.

Extracting text is non-trivial and only possible for PDFs whose visual content is accompanied by text data (which it does not have to be). All text information present in the PDF is coupled with location information that determines where it is to be shown.

If a table is shown in the PDF where the left column contains the names of the entries and the right column contains their contents, both of those columns can be represented as completely different blocks of text which only appear to be linked because they are placed next to each other.

What the framework / your code would have to do is determine what parts of text that are visually linked are also logically linked and belong together. That is not (yet) possible. The reason you and I can read and understand and group the PDF is that in some fields our brain is still far better than computers.
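
To make the problem concrete, here is a deliberately naive sketch of the kind of heuristic such code would need. The `TextFragment` type, the `groupIntoRows` function and the tolerance value are all invented for illustration; real PDFs (rotated text, multi-column layouts, varying font sizes) break this quickly.

```swift
/// Hypothetical model of one piece of text and where it is drawn.
struct TextFragment {
    let text: String
    let x: Double
    let y: Double
}

/// Groups fragments whose baselines are within `tolerance` points of each
/// other into one row, then orders each row from left to right.
func groupIntoRows(_ fragments: [TextFragment], tolerance: Double = 2.0) -> [String] {
    // PDF y coordinates grow upward, so sort from top to bottom first.
    let sorted = fragments.sorted { $0.y != $1.y ? $0.y > $1.y : $0.x < $1.x }
    var rows: [[TextFragment]] = []
    for fragment in sorted {
        if let anchor = rows.last?.first, abs(anchor.y - fragment.y) <= tolerance {
            rows[rows.count - 1].append(fragment)
        } else {
            rows.append([fragment])
        }
    }
    return rows.map { row in
        row.sorted { $0.x < $1.x }.map { $0.text }.joined(separator: " ")
    }
}

let tableCells = [
    TextFragment(text: "Alice", x: 200, y: 700.5),
    TextFragment(text: "Age", x: 50, y: 680),
    TextFragment(text: "Name", x: 50, y: 700),
    TextFragment(text: "30", x: 200, y: 680),
]
print(groupIntoRows(tableCells))   // ["Name Alice", "Age 30"]
```

This only recovers the visual rows; deciding that "Name" is a label and "Alice" its value is exactly the logical grouping that remains hard.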

Final note because it might cause confusion: It is certainly possible that Adobe and Apple already do some of this grouping and achieve good results, but it is still not perfect. The PDF I just tested was pretty mangled after extracting the text via Preview on the Mac.

How to use documentAttributes of CGPDFDocument

Had a second look at the link you provided. It's not CGPDFDocument but Quartz.PDFDocument. Here's one way to access it:

let pdfDoc = PDFDocument(url: URL(fileURLWithPath: "/path/to/file.pdf"))!

if let attributes = pdfDoc.documentAttributes {
    let keys = attributes.keys              // the set of keys differs from file to file
    let firstKey = keys[keys.startIndex]    // get the first key, whatever that turns out to be,
                                            // since dictionaries are not ordered

    print("\(firstKey): \(attributes[firstKey]!)")
    print("Title: \(attributes["Title"])")
}

The list of keys differs from file to file, so you need to check each one and deal with nil when the key is not available.
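
One way to handle the missing keys without force-unwrapping is sketched below. It uses a plain `[AnyHashable: Any]` dictionary as a stand-in for `documentAttributes`, so it runs without PDFKit; the helper name `describeAttributes` and the sample values are made up for illustration.

```swift
/// Hypothetical helper: formats the requested keys, marking the ones
/// that are absent instead of crashing on a forced unwrap.
func describeAttributes(_ attributes: [AnyHashable: Any]?, keys: [String]) -> [String] {
    return keys.map { key in
        if let value = attributes?[AnyHashable(key)] {
            return "\(key): \(value)"
        }
        return "\(key): <not set>"
    }
}

// Stand-in for pdfDoc.documentAttributes; a real file may have other keys.
let sampleAttributes: [AnyHashable: Any] = ["Title": "Cheese"]
for line in describeAttributes(sampleAttributes, keys: ["Title", "Author"]) {
    print(line)
}
```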

To change the attributes:

pdfDoc.documentAttributes?["Title"] = "Cheese"
pdfDoc.write(to: URL(fileURLWithPath: "/path/to/file.pdf")) // save the PDF file

Highlight in CGPDFDocument

You need to roll your own solution. Apple added text selection as private API in iOS, but they haven't exposed anything. You can simply use UIWebView to get that feature, or, if you need more control or also want features like highlighting, you need to write your own solution.

For general PDF readers, there's quite a bit of open source code out there (see vfr/Reader as an example).

To also get text selection/annotations, it's a whole different level of pain. I worked full-time the better part of the year on those features, and just released v2 of my commercial framework PSPDFKit.

There are two problems you would need to solve when rolling your own. First, getting the glyph rects and finding words from the single glyphs. I used CoreGraphics as a parser; calculating the correct frame and managing the current drawing stack and different font styles is the hard part. There's some open source code out there (PDFKitten), but I finally rolled my own because that one really isn't great and has many problems with various PDF font formats.
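
The word-finding half of that first problem can be sketched in a few lines once glyph rects are available. This is a simplified illustration, not the PSPDFKit or PDFKitten approach: the `Glyph` type, the `words` function and the gap threshold are assumptions, and it only handles glyphs on a single horizontal line.

```swift
/// Hypothetical model of one glyph and its horizontal extent on a line.
struct Glyph {
    let character: Character
    let minX: Double
    let maxX: Double
}

/// Splits a line of glyphs into words wherever the horizontal gap between
/// two neighbouring glyphs exceeds `maxGap` points.
func words(from glyphs: [Glyph], maxGap: Double = 1.5) -> [String] {
    var result: [String] = []
    var current = ""
    var previousMaxX: Double? = nil
    for glyph in glyphs.sorted(by: { $0.minX < $1.minX }) {
        if let previous = previousMaxX, glyph.minX - previous > maxGap, !current.isEmpty {
            result.append(current)
            current = ""
        }
        current.append(glyph.character)
        previousMaxX = glyph.maxX
    }
    if !current.isEmpty {
        result.append(current)
    }
    return result
}

let lineGlyphs = [
    Glyph(character: "H", minX: 0, maxX: 5),
    Glyph(character: "i", minX: 5.2, maxX: 7),
    Glyph(character: "t", minX: 12, maxX: 15),
    Glyph(character: "o", minX: 15.1, maxX: 19),
]
print(words(from: lineGlyphs))   // ["Hi", "to"]
```

A fixed gap threshold is the weak point here; in practice it would have to scale with the font size and account for kerning, which is part of why the real thing is so much work.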

Second, writing actual annotation objects. Here you need to parse the PDF yourself to make a tree of all the objects, then write a new trailer that replaces the /Pages object (PDF is mostly just text, so it's well-parsable; but still quite hard to get right). I'm not aware of any open source code out there that could help you with that. I spent a lot of long nights with the official PDF reference.
