Reading Text and Images from a PDF Document in iOS

Reading text and images from a PDF document in iOS

There's no simple answer to this. PDFs are nested dictionaries composed of more dictionaries and arrays, so you'll have to dig into CGPDFDocument. Voyeur is an excellent tool to use while digging around in PDFs, and Reader is a good starting point for rendering them.
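To give a feel for what "nested dictionaries and arrays" means in practice, here is a minimal sketch in plain C. The PdfObject type and the pdf_dict_get and pdf_depth functions are hypothetical names made up for this illustration; they are a toy model, not the CGPDFDocument API, which exposes the same idea through calls such as CGPDFDictionaryGetDictionary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of a PDF object tree. Hypothetical types, NOT the CGPDF API. */
typedef enum { PDF_NUMBER, PDF_NAME, PDF_ARRAY, PDF_DICT } PdfKind;

typedef struct PdfObject {
    PdfKind kind;
    const char *key;                  /* entry name when held by a dictionary */
    const struct PdfObject *children; /* contents of an array or dictionary */
    size_t childCount;
} PdfObject;

/* Returns the entry stored under `key` in a dictionary, or NULL if absent. */
const PdfObject *pdf_dict_get(const PdfObject *dict, const char *key) {
    assert(key != NULL);
    if (dict == NULL || dict->kind != PDF_DICT) return NULL;
    for (size_t i = 0; i < dict->childCount; i++) {
        const PdfObject *child = &dict->children[i];
        if (child->key != NULL && strcmp(child->key, key) == 0) return child;
    }
    return NULL;
}

/* Returns how many container levels are nested at and below `obj`. */
size_t pdf_depth(const PdfObject *obj) {
    assert(obj != NULL);
    if (obj->kind != PDF_ARRAY && obj->kind != PDF_DICT) return 0;
    size_t deepest = 0;
    for (size_t i = 0; i < obj->childCount; i++) {
        size_t d = pdf_depth(&obj->children[i]);
        if (d > deepest) deepest = d;
    }
    return deepest + 1;
}
```

Because pdf_dict_get accepts NULL as its first argument, lookups can be chained, for example pdf_dict_get(pdf_dict_get(page, "Resources"), "XObject"), and the chain simply yields NULL as soon as one level is missing.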

PDF file text reading and searching

Look at PDFKitten. It's a good start: it does all the glyph width analysis for you, but it's not perfect either.

Read contents of PDF as string

If you want to avoid a lot of programming, you probably need to use some library which will help you extract text from PDFs.

You have two options:

1) Use an OCR library. Since a PDF can contain images besides text, performing OCR to get the text is the most generic solution. To perform OCR on a PDF document, you need to convert it to a UIImage object. Another approach is to convert the contents of a WebView to a UIImage, but this might produce a lower-resolution image, which can hurt OCR accuracy.

The downside of using an OCR library is that you will not get 100% accurate text, since the OCR engine always introduces errors.

The best options for OCR are Tesseract for iOS (free, but with a higher error rate and a bit more complex to tune) and BlinkOCR, a more robust option that is free to try and paid for commercial use, and whose engineers can give you a lot of help.

2) You can also use a PDF library. PDF libraries can reliably extract text written in the document, with the exception of text that is part of images inside the PDF. So, depending on the documents you want to read, this might be the better option (or not).

Several PDF libraries are available; in our experience, PDFlib gives very good results and is the most customizable.

Reading PDF files as string through iPhone application

I have a library that does exactly this: https://bitbucket.org/zachron/pdfiphone/overview

Extracting images from a PDF

Yes! I found it. But it looks very scary: it's a huge amount of code.

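#import <UIKit/UIKit.h>
#import <ImageIO/ImageIO.h> // needed for JPEG 2000 decoding

// File-scope pointer to the array that collects the extracted images.
NSMutableArray *aRefImgs;

void setRefImgs(NSMutableArray *ref) {
    aRefImgs = ref;
}

NSMutableArray *ImgArrRef(void) {
    return aRefImgs;
}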

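// Opens the PDF at the given file-system path.
// Returns NULL if the file cannot be opened or has no pages.
// The caller owns the result and must call CGPDFDocumentRelease() on it.
CGPDFDocumentRef MyGetPDFDocumentRef(const char *filename) {
    if (filename == NULL)
        return NULL;
    CFStringRef path = CFStringCreateWithCString(NULL, filename, kCFStringEncodingUTF8);
    if (path == NULL)
        return NULL;
    CFURLRef url = CFURLCreateWithFileSystemPath(NULL, path, kCFURLPOSIXPathStyle, false);
    CFRelease(path);
    if (url == NULL)
        return NULL;
    CGPDFDocumentRef document = CGPDFDocumentCreateWithURL(url);
    CFRelease(url);
    if (document == NULL) {
        printf("Could not open `%s'.\n", filename);
        return NULL;
    }
    if (CGPDFDocumentGetNumberOfPages(document) == 0) {
        printf("`%s' needs at least one page!\n", filename);
        CGPDFDocumentRelease(document); // don't leak the empty document
        return NULL;
    }
    return document;
}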

// Builds the decode array for an image dictionary.
// Returns a malloc'd array that the caller must free(), or NULL when no
// default is known for the color space (for example Lab).
CGFloat *decodeValuesFromImageDictionary(CGPDFDictionaryRef dict, CGColorSpaceRef cgColorSpace, NSInteger bitsPerComponent) {
    CGFloat *decodeValues = NULL;
    CGPDFArrayRef decodeArray = NULL;

    if (CGPDFDictionaryGetArray(dict, "Decode", &decodeArray)) {
        // Use the explicit Decode array from the image dictionary.
        size_t count = CGPDFArrayGetCount(decodeArray);
        decodeValues = malloc(sizeof(CGFloat) * count);
        if (decodeValues == NULL)
            return NULL;
        for (size_t i = 0; i < count; i++) {
            CGPDFReal realValue = 0.0;
            CGPDFArrayGetNumber(decodeArray, i, &realValue);
            decodeValues[i] = realValue;
        }
        return decodeValues;
    }

    // No Decode entry: fall back to the default range for the color space.
    size_t n = 0; // number of color components
    switch (CGColorSpaceGetModel(cgColorSpace)) {
        case kCGColorSpaceModelMonochrome:
            n = 1;
            break;
        case kCGColorSpaceModelRGB:
            n = 3;
            break;
        case kCGColorSpaceModelCMYK:
            n = 4;
            break;
        case kCGColorSpaceModelDeviceN:
            n = CGColorSpaceGetNumberOfComponents(cgColorSpace);
            break;
        case kCGColorSpaceModelIndexed:
            // Indexed images map samples to palette indices 0 .. 2^bits - 1.
            decodeValues = malloc(sizeof(CGFloat) * 2);
            if (decodeValues == NULL)
                return NULL;
            decodeValues[0] = 0.0;
            decodeValues[1] = pow(2.0, (double)bitsPerComponent) - 1;
            return decodeValues;
        case kCGColorSpaceModelLab: // not handled in this answer
        default:
            return NULL;
    }

    // Default decode array is [0 1] repeated once per component.
    decodeValues = malloc(sizeof(CGFloat) * n * 2);
    if (decodeValues == NULL)
        return NULL;
    for (size_t i = 0; i < n * 2; i++) {
        decodeValues[i] = (i % 2 == 0) ? 0.0 : 1.0;
    }
    return decodeValues;
}

// Converts a PDF image XObject stream into a UIImage.
// Returns nil if the stream is not an image or uses an unsupported format.
// Soft masks (SMask) are not applied.
UIImage *getImageRef(CGPDFStreamRef myStream) {
    if (myStream == NULL)
        return nil;

    CGPDFDictionaryRef dict = CGPDFStreamGetDictionary(myStream);

    // Obtain the basic image information.
    const char *name = NULL;
    if (!CGPDFDictionaryGetName(dict, "Subtype", &name) || strcmp(name, "Image") != 0)
        return nil;

    CGPDFInteger width = 0, height = 0, bps = 0;
    if (!CGPDFDictionaryGetInteger(dict, "Width", &width) ||
        !CGPDFDictionaryGetInteger(dict, "Height", &height) ||
        !CGPDFDictionaryGetInteger(dict, "BitsPerComponent", &bps))
        return nil;

    CGPDFBoolean interpolation = 0;
    if (!CGPDFDictionaryGetBoolean(dict, "Interpolate", &interpolation))
        interpolation = 0;

    // The "Intent" entry is not parsed; the default rendering intent is always used.
    CGColorRenderingIntent renderingIntent = kCGRenderingIntentDefault;

    // Pick a color space. Array-valued entries (ICCBased, Indexed, ...) are
    // not parsed and are simply treated as DeviceRGB.
    CGColorSpaceRef cgColorSpace = NULL;
    size_t spp = 0; // samples per pixel
    CGPDFArrayRef colorSpaceArray = NULL;
    const char *colorSpaceName = NULL;
    if (CGPDFDictionaryGetArray(dict, "ColorSpace", &colorSpaceArray)) {
        cgColorSpace = CGColorSpaceCreateDeviceRGB();
        spp = CGColorSpaceGetNumberOfComponents(cgColorSpace);
    } else if (CGPDFDictionaryGetName(dict, "ColorSpace", &colorSpaceName)) {
        if (strcmp(colorSpaceName, "DeviceRGB") == 0) {
            cgColorSpace = CGColorSpaceCreateDeviceRGB();
            spp = 3;
        } else if (strcmp(colorSpaceName, "DeviceCMYK") == 0) {
            cgColorSpace = CGColorSpaceCreateDeviceCMYK();
            spp = 4;
        } else if (strcmp(colorSpaceName, "DeviceGray") == 0) {
            cgColorSpace = CGColorSpaceCreateDeviceGray();
            spp = 1;
        }
    }
    if (cgColorSpace == NULL && bps == 1) {
        // No usable ColorSpace entry, but a 1-bit image can be treated as gray.
        cgColorSpace = CGColorSpaceCreateDeviceGray();
        spp = 1;
    }

    CGPDFDataFormat format;
    CFDataRef imageData = CGPDFStreamCopyData(myStream, &format);
    if (imageData == NULL) {
        CGColorSpaceRelease(cgColorSpace);
        return nil;
    }
    CGDataProviderRef dataProvider = CGDataProviderCreateWithCFData(imageData);
    CFRelease(imageData);

    CGFloat *decodeValues = decodeValuesFromImageDictionary(dict, cgColorSpace, bps);

    // PDF image rows are padded to a whole number of bytes.
    size_t rowBits = (size_t)bps * spp * (size_t)width;
    size_t rowBytes = (rowBits + 7) / 8;

    CGImageRef cgImage = NULL;
    if (format == CGPDFDataFormatRaw) {
        // Raw samples cannot be interpreted without a color space.
        if (cgColorSpace != NULL)
            cgImage = CGImageCreate(width, height, bps, bps * spp, rowBytes, cgColorSpace, 0, dataProvider, decodeValues, interpolation, renderingIntent);
    } else if (format == CGPDFDataFormatJPEGEncoded) {
        // Note that JPEG could be handled with ImageIO as well.
        cgImage = CGImageCreateWithJPEGDataProvider(dataProvider, decodeValues, interpolation, renderingIntent);
    } else if (format == CGPDFDataFormatJPEG2000) {
        // JPEG 2000 requires ImageIO; the JPEG data provider call cannot decode it.
        CGImageSourceRef source = CGImageSourceCreateWithDataProvider(dataProvider, NULL);
        if (source != NULL) {
            cgImage = CGImageSourceCreateImageAtIndex(source, 0, NULL);
            CFRelease(source);
        }
    }
    // Any other value is a format we don't know about or an error in the PDF;
    // cgImage stays NULL in that case.

    CGDataProviderRelease(dataProvider);
    CGColorSpaceRelease(cgColorSpace);
    free(decodeValues);

    if (cgImage == NULL)
        return nil;
    UIImage *image = [UIImage imageWithCGImage:cgImage];
    CGImageRelease(cgImage); // UIImage keeps its own reference
    return image;
}

@implementation DashBoard

// Implement viewDidLoad to do additional setup after loading the view, typically from a nib.
- (void)viewDidLoad {
    [super viewDidLoad];
    // copy returns nil instead of throwing if the resource is missing.
    filePath = [[[NSBundle mainBundle] pathForResource:@"per" ofType:@"pdf"] copy];
}

// Collects every image XObject in the PDF and shows them in a new view controller.
// This code uses manual reference counting (no ARC).
- (IBAction)btnTappedText:(id)sender {
    [arrImgs release]; // messaging nil is a no-op, so no retainCount check is needed
    arrImgs = [[NSMutableArray alloc] init];
    setRefImgs(arrImgs);

    // 1. Open the document (NULL on failure; the page count is then 0).
    CGPDFDocumentRef document = MyGetPDFDocumentRef([filePath UTF8String]);
    size_t pageCount = CGPDFDocumentGetNumberOfPages(document);

    for (size_t pageNumber = 1; pageNumber <= pageCount; ++pageNumber) { // pages are 1-based
        CGPDFPageRef page = CGPDFDocumentGetPage(document, pageNumber);
        if (!page) {
            NSLog(@"Couldn't open page.");
            continue;
        }

        // 2. Get the page dictionary.
        CGPDFDictionaryRef dict = CGPDFPageGetDictionary(page);
        if (!dict) {
            NSLog(@"Couldn't open page dictionary.");
            continue;
        }

        // The original answer also required a "Contents" stream and a
        // "MediaBox" array here without using them. Neither is needed to
        // enumerate images, and "Contents" may be an array while "MediaBox"
        // may be inherited from a parent node, so those checks skipped valid pages.

        // 3. Get the page resources.
        CGPDFDictionaryRef res;
        if (!CGPDFDictionaryGetDictionary(dict, "Resources", &res)) {
            NSLog(@"Couldn't open page resources.");
            continue;
        }

        // 4. Get the XObject dictionary from the resources.
        CGPDFDictionaryRef xobj;
        if (!CGPDFDictionaryGetDictionary(res, "XObject", &xobj)) {
            NSLog(@"Couldn't load page XObjects.");
            continue;
        }

        // 5. Visit every XObject. pdfDictionaryFunction is not included in
        // this answer. It must be a CGPDFDictionaryApplierFunction, that is
        // void (*)(const char *key, CGPDFObjectRef value, void *info), declared
        // before this point; presumably it passes each image stream to
        // getImageRef() and adds the result to ImgArrRef().
        CGPDFDictionaryApplyFunction(xobj, pdfDictionaryFunction, NULL);
    }

    CGPDFDocumentRelease(document); // safe to call with NULL

    NSLog(@"Total images are - %lu", (unsigned long)[arrImgs count]);

    [nxtImgVCtr release];
    nxtImgVCtr = [[ImgVCtr alloc] initWithNibName:@"ImgVCtr" bundle:nil];
    nxtImgVCtr.arrImg = arrImgs;
    [self.navigationController pushViewController:nxtImgVCtr animated:YES];
}

@end
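Two small calculations are buried in the listing above: the byte-aligned row length of a raw PDF image, and the default decode array of alternating 0 and 1 values used when the image dictionary has no Decode entry. The sketch below isolates them in plain C so they can be checked without CoreGraphics. The names pdf_row_bytes and pdf_default_decode are made up for this illustration, and double stands in for CGFloat.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Bytes per image row. PDF pads each row up to a whole number of bytes. */
size_t pdf_row_bytes(size_t bitsPerComponent, size_t samplesPerPixel, size_t width) {
    assert(bitsPerComponent > 0);
    size_t rowBits = bitsPerComponent * samplesPerPixel * width;
    return (rowBits + 7) / 8; /* round up to the next byte */
}

/* Default decode array [0 1 0 1 ...] for `components` color components.
 * Returns a malloc'd array of 2 * components doubles that the caller must
 * free(), or NULL if components is 0 or allocation fails. */
double *pdf_default_decode(size_t components) {
    if (components == 0) return NULL;
    double *decode = malloc(sizeof(double) * components * 2);
    if (decode == NULL) return NULL;
    for (size_t i = 0; i < components * 2; i++) {
        decode[i] = (i % 2 == 0) ? 0.0 : 1.0;
    }
    return decode;
}
```

For example, a 10-pixel-wide 1-bit gray image needs pdf_row_bytes(1, 1, 10) bytes per row, which is 2 rather than 1.25, and an RGB image gets a six-entry decode array from pdf_default_decode(3).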

How can I get all text from a PDF in Swift?

That is unfortunately not possible.

At least not without some major work on your part. And it certainly is not possible in a general manner for all PDFs.

PDFs are (generally) a one-way street.

They were created to display text identically on every system, and to let printers print a document without having to know all the fonts and so on.

Extracting text is non-trivial and only possible for some PDFs, namely those where the basic image PDF is accompanied by text (which it does not have to be). All text information present in the PDF is coupled with location information that determines where it is to be shown.

If you have a table shown in the PDF where the left column contains the names of the entries and the right column contains their contents, both of those columns can be represented as completely different blocks of text that only appear to be linked because they are placed next to each other.

What the framework / your code would have to do is determine what parts of text that are visually linked are also logically linked and belong together. That is not (yet) possible. The reason you and I can read and understand and group the PDF is that in some fields our brain is still far better than computers.
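To make the problem concrete, here is a naive sketch in plain C of the kind of heuristic such code might start with: group positioned text fragments into lines by their vertical position, then read each line left to right. The TextFragment type, the order_fragments function and the tolerance parameter are all hypothetical; real extractors need far more than this, and this heuristic only recovers visual order, not logical grouping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A piece of text with the position where the PDF draws it.
 * PDF user space has its origin at the bottom left, so larger y is higher. */
typedef struct {
    double x, y;
    const char *text;
    int line; /* filled in by order_fragments */
} TextFragment;

static int by_y_descending(const void *a, const void *b) {
    const TextFragment *fa = a, *fb = b;
    return (fa->y < fb->y) - (fa->y > fb->y);
}

static int by_line_then_x(const void *a, const void *b) {
    const TextFragment *fa = a, *fb = b;
    if (fa->line != fb->line) return (fa->line > fb->line) - (fa->line < fb->line);
    return (fa->x > fb->x) - (fa->x < fb->x);
}

/* Reorders fragments top to bottom, then left to right. Fragments whose y
 * lies within lineTolerance of the first fragment of a line share that line. */
void order_fragments(TextFragment *frags, size_t count, double lineTolerance) {
    assert(frags != NULL || count == 0);
    if (count == 0) return;
    qsort(frags, count, sizeof *frags, by_y_descending);
    int line = 0;
    double lineY = frags[0].y;
    for (size_t i = 0; i < count; i++) {
        if (lineY - frags[i].y > lineTolerance) { /* start a new line */
            line++;
            lineY = frags[i].y;
        }
        frags[i].line = line;
    }
    qsort(frags, count, sizeof *frags, by_line_then_x);
}
```

Applied to the two-column table described above, this interleaves the columns row by row, which happens to be right for a simple table but wrong for, say, a two-column article, and the code has no way to tell the two cases apart.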

Final note, because it might cause confusion: it is certainly possible that Adobe, and Apple as well, already do some of this grouping and achieve good results, but it is still not perfect. The PDF I just tested came out pretty mangled after extracting the text via Preview on the Mac.


