What Are the Best Parameters to Run ImageMagick to Convert Low-Quality PDFs to Images (for OCR)

You can learn about the detailed settings of ImageMagick's "delegates" (external programs IM uses, such as Ghostscript) by typing

convert -list delegate

(On my system that's a list of 32 different commands.) Now to see which commands are used to convert to PNG, use this:

convert -list delegate | findstr /i png

Ok, this was for Windows. You didn't say which OS you use. [*] If you are on Linux, try this:

convert -list delegate | grep -i png

You'll discover that IM produces PNG only from PS or EPS input. So how does IM get (E)PS from your PDF? Easy:

convert -list delegate | findstr /i PDF
convert -list delegate | grep -i PDF

Ah! It uses Ghostscript to make a PDF => PS conversion, then uses Ghostscript again to make a PS => PNG conversion. Works, but isn't the most efficient way if you know that Ghostscript can do PDF => PNG in one go. And faster. And in much better quality.

About IM's handling of PDF conversion to images via the Ghostscript delegate you should know two things first and foremost:

  1. By default, if you don't give an extra parameter, Ghostscript will output images at a resolution of 72 dpi. That's why Karl's answer suggested adding -density 600, which tells Ghostscript to use a 600 dpi resolution for its image output.
  2. IM's detour of calling Ghostscript twice, first for PDF => PS and then for PS => PNG, is a real blunder: in the first step you never gain quality, you hardly ever keep it, and very often you lose some. Reasons:

    • PDF can handle transparencies, which PostScript cannot.
    • PDF can embed TrueType fonts, which PostScript cannot, and so on.
      (Conversion in the opposite direction, PS => PDF, is not that critical.)

That's why I'd suggest converting your PDFs to PNG (or JPEG) in one go, using Ghostscript directly. And use the most recent version of Ghostscript (8.71 at the time of writing; 9.01 is soon to be released)! Here are example commands:

gswin32c.exe ^
-sDEVICE=pngalpha ^
-o output/page_%03d.png ^
-r600 ^
d:/path/to/your/input.pdf

(This is the command line for Windows. On Linux, use gs instead of gswin32c.exe, and \ instead of ^.) This command expects to find an output subdirectory, where it will store a separate file for each PDF page. To produce JPEGs of good quality, try

gs \
-sDEVICE=jpeg \
-o output/page_%03d.jpeg \
-r600 \
-dJPEGQ=95 \
/path/to/your/input.pdf

(Linux command version). This direct conversion avoids the intermediate PostScript format, in which the TrueType font and transparency information of the original PDF file may already have been lost.
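
For a batch of scans, the direct Ghostscript call can be wrapped in a small shell function. The following is only a sketch under a few assumptions that are not part of the answer above: the helper name gs_pages, the RUN switch, and the <name>_pages output directory are made up for illustration, and the pnggray device is chosen on the assumption that grayscale output is enough for OCR (otherwise use pngalpha or jpeg as shown above). By default it only prints the command it would run.

```shell
#!/bin/sh
# gs_pages PDF [DPI]: render every page of PDF to <name>_pages/page_NNN.png.
# Hypothetical helper: prints the Ghostscript command unless RUN=1 is set.
gs_pages() {
    pdf=$1
    dpi=${2:-600}                 # 600 dpi, as in the examples above
    outdir="${pdf%.pdf}_pages"    # e.g. scan.pdf -> scan_pages
    # -o also enables batch mode; Ghostscript expands %03d to the page number
    set -- gs -sDEVICE=pnggray -r"$dpi" -o "$outdir/page_%03d.png" "$pdf"
    if [ "${RUN:-0}" = 1 ]; then
        mkdir -p "$outdir" && "$@"
    else
        printf '%s\n' "$*"
    fi
}

gs_pages scan.pdf 300
```

Running it with RUN=1 set in the environment creates the output directory first, since Ghostscript expects it to exist.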


[*] D'oh! I overlooked your "linux" tag at first...

How to convert a scanned PDF image to a high-resolution TIFF best suited for OCR?

This has happened because ImageMagick is a raster image processor and it has rasterised your PDF using its default 72dpi grid - which is too coarse for your needs. You need to set a higher density before rasterising:

convert -density 288 input.pdf -compress lzw result.tiff
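
To see what a given -density means in pixels: PDF page sizes are expressed in points of 1/72 inch, so the rasterised size is the page size multiplied by density/72. A small sketch, assuming a US Letter page of 612 x 792 points (the page size of the PDF in question is not stated) and a made-up helper name page_pixels:

```shell
#!/bin/sh
# page_pixels WIDTH_PT HEIGHT_PT DPI: pixel size of a page rasterised at DPI.
# Hypothetical helper; 1 point = 1/72 inch, so pixels = points * dpi / 72.
page_pixels() {
    echo "$(( $1 * $3 / 72 ))x$(( $2 * $3 / 72 ))"
}

page_pixels 612 792 72     # prints 612x792   (the default density)
page_pixels 612 792 288    # prints 2448x3168 (the density used above)
page_pixels 612 792 600    # prints 5100x6600
```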

You may be better off installing the Poppler tools and using their pdfimages tool to extract the embedded images directly, without rasterising the PDF again.

Convert PDF to image with high resolution

It appears that the following works:

convert \
-verbose \
-density 150 \
-trim \
test.pdf \
-quality 100 \
-flatten \
-sharpen 0x1.0 \
24-18.jpg

It produces the first image (left). Compare it with the result of my original command (right):

[Images not reproduced here: left, output of the command above; right, output of the original command.]

Also keep the following facts in mind:

  • The worse, blurry image on the right has a file size of 1,941,702 bytes (1.85 MB).
    Its resolution is 3060x3960 pixels, using a 16-bit RGB color space.
  • The better, sharp image on the left has a file size of 337,879 bytes (330 kB).
    Its resolution is 758x996 pixels, using an 8-bit Gray color space.

So there is no need to resize; just add the -density flag. The density value 150 is odd, though: trying a range of other values gives a worse-looking image in both directions!

ImageMagick Convert PDF to low resolution JPG file

You need two things that CodeIgniter probably doesn't support, so you have to use ImageMagick directly.

First, you have to set the resolution of the PDF for a high-quality result. On the ImageMagick command line, this can be done with the -density option. With PHP imagick, use setResolution.

To get rid of the black background, you have to flatten the PDF on a white background first. On the command line, use the options -background white -flatten. With PHP imagick, setImageBackgroundColor and flattenImages should work.
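
On the command line, the two steps combine into a single call. The sketch below rests on some assumptions: the helper names im_cmd and pdf_to_jpg_white are made up, the density of 150 is only an example value, and [0] restricts the conversion to the first page, because -flatten would otherwise merge all pages of a multi-page PDF into one image. Since ImageMagick 7 installs its command as magick while ImageMagick 6 uses convert, the helper picks whichever is present.

```shell
#!/bin/sh
# Print the name of the installed ImageMagick command, preferring IM 7's "magick".
im_cmd() {
    if command -v magick >/dev/null 2>&1; then
        echo magick
    elif command -v convert >/dev/null 2>&1; then
        echo convert
    else
        return 1
    fi
}

# pdf_to_jpg_white INPUT.pdf OUTPUT.jpg [DENSITY]
# Rasterise page 1 at the given density and flatten it onto a white background.
pdf_to_jpg_white() {
    im=$(im_cmd) || { echo "ImageMagick not found" >&2; return 1; }
    "$im" -density "${3:-150}" "$1[0]" -background white -flatten "$2"
}
```

A call would look like pdf_to_jpg_white input.pdf preview.jpg 150; note that -density comes before the input file so that it applies while the PDF is rasterised.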

ImageMagick command line: converting PDF to high definition images

You probably want something like this - note that you put the -density before the PDF filename:

for f in *.pdf; do convert -density 144 "$f" "${f%pdf}jpg"; done

The tricky part is removing the pdf extension and replacing it with jpg; I used bash parameter substitution for that.
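
As a quick illustration of that substitution (the file names are made up): ${f%pdf} removes the shortest match of the pattern pdf from the end of $f, leaving the trailing dot in place, and jpg is then appended.

```shell
#!/bin/sh
f="scan.2019.pdf"
echo "${f%pdf}jpg"      # prints scan.2019.jpg
echo "${f%.pdf}.jpg"    # same result, with the dot made explicit

g="notes.txt"
echo "${g%pdf}jpg"      # prints notes.txtjpg (no match, so nothing is removed)
```

The last case shows why the loops here only iterate over names matching *.pdf: the substitution itself does not check the extension.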


In long-hand, that is

for f in *.pdf; do
    convert -density 144 "$f" "${f%pdf}jpg"
done

Another option is with mogrify:

mogrify -density 144 -format jpg *pdf

If you have GNU Parallel installed, you can do it more readably and faster like this:

parallel convert -density 144 {} {.}.jpg ::: *pdf

Speed up (yet keep file size low) conversion of multiple PNGs to PDF?

ImageMagick is not a good processor for vector formats such as PDF. It will rasterize your PDF and save each pixel as an element of the output PDF, which may be why it takes so long. The result is a raster image (much larger than the original vector image) in a vector shell.

If your input PDF is already black and white, then you only need -compress Group4.

Starting with a 25 KB PDF, here is a plain conversion, followed by the same conversion with Group4 compression:

time magick ImageOnly.pdf result1.pdf

real 0m0.276s
user 0m0.563s
sys 0m0.038s

time magick ImageOnly.pdf -compress Group4 result2.pdf

real 0m0.275s
user 0m0.562s
sys 0m0.036s

So it is not the Group4 compression that is slowing it down.

However, the quality will not be terrific. So one should add -density 300 before reading the PDF. But that will slow it down.

time magick -density 300 ImageOnly.pdf -compress Group4 result3.pdf

real 0m2.026s
user 0m2.863s
sys 0m0.182s
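
The slowdown is plausible from the pixel count alone, because the number of pixels grows with the square of the density. A rough back-of-the-envelope check, assuming the first two runs used the default density of 72 dpi (pixel_ratio is a made-up helper name):

```shell
#!/bin/sh
# pixel_ratio NEW_DPI OLD_DPI: factor by which the pixel count grows.
pixel_ratio() {
    awk -v new="$1" -v old="$2" 'BEGIN { printf "%.1f\n", (new / old) ^ 2 }'
}

pixel_ratio 300 72    # prints 17.4
```

The measured wall-clock time grew by a smaller factor, roughly sevenfold (from about 0.28 s to about 2.0 s), which suggests that part of each run is fixed overhead rather than per-pixel work.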

