What are best parameters to run ImageMagick to convert low quality pdf to images (for OCR)
You can learn about the detailed settings of ImageMagick's "delegates" (external programs IM uses, such as Ghostscript) by typing
convert -list delegate
(On my system that's a list of 32 different commands.) Now to see which commands are used to convert to PNG, use this:
convert -list delegate | findstr /i png
Ok, this was for Windows. You didn't say which OS you use. [*] If you are on Linux, try this:
convert -list delegate | grep -i png
You'll discover that IM produces PNG only from PS or EPS input. So how does IM get (E)PS from your PDF? Easy:
convert -list delegate | findstr /i PDF
convert -list delegate | grep -i PDF
Ah! It uses Ghostscript to make a PDF => PS conversion, then uses Ghostscript again to make a PS => PNG conversion. Works, but isn't the most efficient way if you know that Ghostscript can do PDF => PNG in one go. And faster. And in much better quality.
About IM's handling of PDF conversion to images via the Ghostscript delegate you should know two things first and foremost:
- By default, if you don't give an extra parameter, Ghostscript will output images with a 72 dpi resolution. That's why Karl's answer suggested adding -density 600, which tells Ghostscript to use a 600 dpi resolution for its image output.
- The detour of IM to call Ghostscript twice, converting first PDF => PS and then PS => PNG, is a real blunder, because you never win and hardly keep quality in the first step, but very often lose some. Reasons:
  - PDF can handle transparencies, which PostScript cannot.
  - PDF can embed TrueType fonts, which PostScript cannot.
  - etc.
(Conversion in the direction PS => PDF is not that critical.)
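To get a feeling for what a given resolution means in pixels, here is a minimal sketch. It relies on the fact that a PDF point is 1/72 inch; the helper name pt_to_px and the US Letter page size (612 x 792 pt) are only assumptions for illustration.

```shell
#!/bin/sh
# Pixel length of a rasterised page edge: 72 points make one inch,
# so pixels = points * dpi / 72 (integer arithmetic, rounded down).
pt_to_px() {
    # $1 = length in points, $2 = resolution in dpi
    echo $(( $1 * $2 / 72 ))
}

# US Letter is 612 x 792 pt (assumed page size, adjust for your PDF).
for dpi in 72 150 300 600; do
    echo "${dpi} dpi: $(pt_to_px 612 "$dpi") x $(pt_to_px 792 "$dpi") px"
done
```

At 600 dpi a Letter page comes out at 5100 x 6600 pixels, so expect large files and longer OCR runs at that setting.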
That's why I'd suggest you convert your PDFs in one go to PNG (or JPEG) using Ghostscript directly. And use the most recent version 8.71 (soon to be released: 9.01) of Ghostscript! Here are example commands:
gswin32c.exe ^
-sDEVICE=pngalpha ^
-o output/page_%03d.png ^
-r600 ^
d:/path/to/your/input.pdf
(This is the command line for Windows. On Linux, use gs instead of gswin32c.exe, and \ instead of ^.) This command expects to find an output subdirectory where it will store a separate file for each PDF page.
To produce JPEGs of good quality, try
gs \
-sDEVICE=jpeg \
-o output/page_%03d.jpeg \
-r600 \
-dJPEGQ=95 \
/path/to/your/input.pdf
(Linux command version). This direct conversion avoids the intermediate PostScript format, which may have lost the TrueType font and transparency information that was in the original PDF file.
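Since the command expects the output directory to exist already, a small wrapper can create it first. This is only a sketch: the function names and the output/<name>/ layout are my own assumptions, while the Ghostscript options are the ones used in the commands here.

```shell
#!/bin/sh
# Derive a per-document output pattern, e.g.
# scans/report.pdf -> output/report/page_%03d.png
# (the "output/<name>/" layout is an assumption for illustration).
page_pattern() {
    name=$(basename "$1" .pdf)
    printf '%s\n' "output/${name}/page_%03d.png"
}

# Rasterise one PDF with Ghostscript, creating the target directory first.
pdf_to_png() {
    pattern=$(page_pattern "$1")
    mkdir -p "$(dirname "$pattern")" &&
    gs -sDEVICE=pngalpha -o "$pattern" -r600 "$1"
}

# Usage: pdf_to_png /path/to/your/input.pdf
```

Calling it in a loop (for f in *.pdf; do pdf_to_png "$f"; done) keeps the pages of each PDF in their own subdirectory.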
[*] D'oh! I missed your "linux" tag at first...
how to convert pdf scanned image to high resolution tiff with best for ocr?
This has happened because ImageMagick is a raster image processor and it has rasterised your PDF using its default 72dpi grid - which is too coarse for your needs. You need to set a higher density before rasterising:
convert -density 288 input.pdf -compress lzw result.tiff
You may be better off installing Poppler tools and using its pdfimages tool to extract the images.
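A rough sketch of the pdfimages route, assuming Poppler is installed and the PDF really contains scanned page images. The function name extract_scans and the output prefix are hypothetical.

```shell
#!/bin/sh
# Extract the embedded images of a scanned PDF with Poppler's pdfimages.
extract_scans() {
    # $1 = input PDF, $2 = output prefix (e.g. scans/page)
    [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
    command -v pdfimages >/dev/null 2>&1 || { echo "pdfimages not found" >&2; return 1; }
    mkdir -p "$(dirname "$2")" || return 1
    pdfimages -list "$1" &&     # list the embedded images and their properties
    pdfimages -tiff "$1" "$2"   # write numbered TIFF files starting with the prefix
}

# Usage: extract_scans input.pdf scans/page
```

The advantage over rasterising is that the images come out at the resolution they were scanned at, with no resampling.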
Convert PDF to image with high resolution
It appears that the following works:
convert \
-verbose \
-density 150 \
-trim \
test.pdf \
-quality 100 \
-flatten \
-sharpen 0x1.0 \
24-18.jpg
It results in the left image. Compare this to the result of my original command (the image on the right):
Also keep the following facts in mind:
- The worse, blurry image on the right has a file size of 1,941,702 bytes (1.85 MB). Its resolution is 3060x3960 pixels, using 16-bit RGB color space.
- The better, sharp image on the left has a file size of 337,879 bytes (330 kB). Its resolution is 758x996 pixels, using 8-bit Gray color space.
So, no need to resize; add the -density flag. The density value 150 is weird -- trying a range of values results in a worse looking image in both directions!
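If you want to repeat the "range of values" experiment on your own file, something like the following sketch renders one output per density so you can compare them side by side. The function names and the -d<density> file naming are assumptions; the convert options are the ones from the command above.

```shell
#!/bin/sh
# Name the output for a given density, e.g. test.pdf at 150 -> test-d150.jpg
sweep_name() {
    printf '%s-d%s.jpg\n' "${1%.pdf}" "$2"
}

# Render the same PDF at several densities for comparison.
density_sweep() {
    pdf=$1; shift
    for d in "$@"; do
        convert -density "$d" -trim "$pdf" -quality 100 -flatten -sharpen 0x1.0 \
            "$(sweep_name "$pdf" "$d")" || return 1
    done
}

# Usage: density_sweep test.pdf 100 125 150 175 200
```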
ImageMagick Convert PDF to low resolution JPG file
You need two things that CodeIgniter probably doesn't support, so you have to use ImageMagick directly.
First, you have to set the resolution of the PDF for a high-quality result. On the ImageMagick command line, this can be done with the -density option. With PHP imagick, use setResolution.
To get rid of the black background, you have to flatten the PDF on a white background first. On the command line, use the options -background white -flatten. With PHP imagick, setImageBackgroundColor and flattenImages should work.
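Put together on the command line, the two steps could look like this sketch. The function name and the default of 150 dpi are my own assumptions; pick whatever density suits your output size.

```shell
#!/bin/sh
# Convert a PDF to JPG on a white background.
# -density comes before the input so it applies while the PDF is rasterised.
pdf_to_jpg_white() {
    # $1 = input PDF, $2 = output JPG, $3 = density in dpi (default 150, arbitrary)
    convert -density "${3:-150}" "$1" -background white -flatten "$2"
}

# Usage: pdf_to_jpg_white input.pdf output.jpg 150
```

Keep in mind that -flatten composites all pages into a single image, so for a multi-page PDF you would select one page at a time, for example input.pdf[0] for the first page.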
ImageMagick command line: converting PDF to high definition images
You probably want something like this - note that you put the -density before the PDF filename:
for f in *.pdf; do convert -density 144 "$f" "${f%pdf}jpg"; done
The tricky part is removing the pdf extension and replacing it with jpg; for that I used bash "Parameter Substitution".
In long-hand, that is
for f in *.pdf; do
  convert -density 144 "$f" "${f%pdf}jpg"
done
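To see what the parameter substitution does on its own, here is a tiny sketch you can paste into a shell. The helper swap_ext is hypothetical and only there to show a more general form.

```shell
#!/bin/sh
f="scan.pdf"
echo "${f%pdf}jpg"      # removes the trailing "pdf", then appends "jpg"
echo "${f%.pdf}.jpg"    # same result, with the dot made explicit

# Replace whatever extension a name has with a new one.
swap_ext() {
    # $1 = file name, $2 = new extension without the dot
    printf '%s\n' "${1%.*}.$2"
}
swap_ext "report.v2.pdf" jpg
```

The % operator strips the shortest matching suffix, so only the last extension is replaced and report.v2.pdf becomes report.v2.jpg.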
Another option is with mogrify:
mogrify -density 144 -format jpg *pdf
If you have GNU Parallel installed, you can do it more readably and faster like this:
parallel convert -density 144 {} {.}.jpg ::: *pdf
Speed up (yet keep file size low) conversion of multiple PNGs to PDF?
ImageMagick is not a good processor for vector images such as PDF. It will rasterize your PDF and save each dot as an element of the PDF. That may be why it takes so long. The PDF is now a raster image (much larger than the original vector image) in a vector shell.
If your input PDF is already black/white, then you only need -compress Group4.
Starting with a 25 KB PDF, if I just convert it:
time magick ImageOnly.pdf result1.pdf
real 0m0.276s
user 0m0.563s
sys 0m0.038s
time magick ImageOnly.pdf -compress Group4 result2.pdf
real 0m0.275s
user 0m0.562s
sys 0m0.036s
So it is not the Group4 compression that is slowing it down.
However, the quality will not be terrific. So one should add -density 300 before reading the PDF. But that will slow it down.
time magick -density 300 ImageOnly.pdf -compress Group4 result3.pdf
real 0m2.026s
user 0m2.863s
sys 0m0.182s
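A back-of-the-envelope check of why the higher density costs time, assuming the default density is 72 dpi: the number of pixels to rasterise grows with the square of the density. The helper names are hypothetical; the timings are the ones measured above.

```shell
#!/bin/sh
# Ratio of two numbers to one decimal place (awk does the floating point).
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Growth in pixel count when going from density $2 to density $1.
pixel_factor() {
    awk -v hi="$1" -v lo="$2" 'BEGIN { printf "%.1f\n", (hi / lo) * (hi / lo) }'
}

ratio 2.026 0.275      # measured slowdown in wall-clock time
pixel_factor 300 72    # increase in pixels to rasterise
```

This prints 7.4 and 17.4: about 17 times as many pixels cost roughly 7 times the wall-clock time on this small file.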