How to Downsample Images Within PDF File

ghostscript downsampling of pdf images, downsample factor error

No, there is no way to specify the factor (this is the Adobe specification for distiller params, we are currently limited to those). You cannot specify an approximation for rounding either, without modifying the source code.

You can use a different downsampling algorithm.

[much later]

In fact I just checked the current code, and you must be using an old version of Ghostscript.

The current default downsampling filter is the Bicubic filter, and if you do force the Subsample filter, then the code checks to see if the downsample factor requested is an integer.

If the factor is not an integer but is within 0.1 of an integer then it forces factor to the nearest integer.

If its outside 0.1 of an integer factor then it aborts the subsample filter and switches to Bicubic.

I'd recommend upgrading.

[later edit]

So avoiding the bogus ColorDownsampleOption, the problem is actually not colour images at all, its monochrome images, or more precisely in your case, imagemasks.

I set up this command line:

gs 
-sDEVICE=pdfwrite \
-sOutputFile=pdfwrite.pdf \
-dDownsampleColorImages=true \
-dDownsampleGrayImages=true \
-dDownsampleMonoImages=true \
-dColorImageDownsampleThreshold=1 \
-dGrayImageDownsampleThreshold=1 \
-dMonoImageDownsampleThreshold=1 \
-dColorImageDownsampleType=/Bicubic \
-dGrayImageDownsampleType=/Bicubic \
-dMonoImageDownsampleType=/Bicubic \
-dColorImageResolution=72 \
-dGrayImageResolution=72 \
-dMonoImageResolution=100 "gs sample.pdf"

And that produces an error message that the only filter available for monochrome images is Subsample, followed by the error messages you quote about the imprecise factor.

I guess basically this makes my point that an example file is pretty much vital in order to investigate problems.

So there is a problem there, and I will look into it, obviously for monochrome images it should be clamped to the nearest integer resolution, since no other filter is possible. However, Gray and Colour images do work as expected.

Reporting a bug, as I suggested in an early comment would probably have got to this point much sooner. I'd still suggest you do that, so that this is not overlooked.

You may be interested to note that, for me, the resulting file when I don't downsample monochrome images, but do downsample the others, as per the command line above, is 785KB the original being 2.5MB.

Apache-FOP - reduce image sizes that are embedded into PDF

You have two options:

  1. postprocess the PDF. You can open it in Acrobat and save it using
    the 'reduce file size' option.

  2. use another FO formatter. e.g. the Antennahouse formatter can
    downsample images.

How can I remove all images from a PDF?

I'm putting up the answer myself, but the actual code is by courtesy of Chris Liddell, Ghostscript developer.

I used his original PostScript code and stripped off its other functions.
Only the function which removes raster images remains.
Other graphical page objects -- text sections, patterns and vector objects -- should remain untouched.

Copy the following code and save it as remove-images.ps:

%!PS

% Run as:
%
% gs ..... -dFILTERIMAGE -dDELAYBIND -dWRITESYSTEMDICT \
% ..... remove-images.ps <your-input-file>
%
% derived from Chris Liddell's original 'filter-obs.ps' script
% Adapted by @pdfkungfoo (on Twitter)

currentglobal true setglobal

32 dict begin

/debugprint { systemdict /DUMPDEBUG .knownget { {print flush} if}
{pop} ifelse } bind def

/pushnulldevice {
systemdict exch .knownget not
{
//false
} if

{
gsave
matrix currentmatrix
nulldevice
setmatrix
} if
} bind def

/popnulldevice {
systemdict exch .knownget not
{
//false
} if
{
% this is hacky - some operators clear the current point
% i.e.
{ currentpoint } stopped
{ grestore }
{ grestore moveto} ifelse
} if
} bind def

/sgd {systemdict exch get def} bind def

systemdict begin

/_image /image sgd
/_imagemask /imagemask sgd
/_colorimage /colorimage sgd

/image {
(\nIMAGE\n) //debugprint exec /FILTERIMAGE //pushnulldevice exec
_image
/FILTERIMAGE //popnulldevice exec
} bind def

/imagemask
{
(\nIMAGEMASK\n) //debugprint exec
/FILTERIMAGE //pushnulldevice exec
_imagemask
/FILTERIMAGE //popnulldevice exec
} bind def

/colorimage
{
(\nCOLORIMAGE\n) //debugprint exec
/FILTERIMAGE //pushnulldevice exec
_colorimage
/FILTERIMAGE //popnulldevice exec
} bind def

end
end

.bindnow

setglobal

Now run this command:

gs -o no-more-images-in-sample.pdf \
-sDEVICE=pdfwrite \
-dFILTERIMAGE \
-dDELAYBIND \
-dWRITESYSTEMDICT \
remove-images.ps \
sample.pdf

I tested the code with the official PDF specification, and it worked.
The following two screenshots show page 750 of input and output PDFs:


If you wonder why something that looks like an image is still on the output page:
it is not really a raster image, but a 'pattern' in the original file, and therefor it is not removed.

Compress PDF after manipulation

Its not really possible to comment very much without seeing an example file, to determine exactly what has happened at each stage.

However, I very strongly suspect that you have 'lost quality', its just that, at screen resolutions, you can't tell. Your original PDF file was created using ImageMagick at a resolution of 150 dpi. Most probably the image is stored uncompressed in the PDF file, which is why its large.

When you run that PDF file back through Ghostscript there are two effects. Firstly you've used the PDFSETTINGS canned set of job configuration. That (amongst many other things) downsamples grey images to a resolution of 150 dpi (so fortunately for you, no effect). It also compresses the image data using JPEG compression.

Now I've no idea what's in the original PDF file, but if the data there was compressed using JPEG, as seems likely, then you are double applying JPEG quantisation. That's a lossy process and will result in a loss of quality.

Since you are altering the original image data (to change the colour) you have no choice about decompressing the image data. However, to preserve quality you should then not use JPEG compression again, instead you should use Flate compression. The compression ration won't be as good, but it will keep the quality unchanged. To do that you would need to specify the GrayImageFilter using distillerparams, you can't use a PDFSETTINGS for that.

I can't imagine what Acrobat has done to decrease the file size still further (and you haven't said how you 'recreate the PDF file'), but I would imagine it involves reducing the quality of the image still further. Its hard to see how it could save 50% of the file size without doing so. Its also possible it is (like Ghostscript) JPEG compressing the grayscale data but using a more aggressive set of JPEG parameters (resulting in still more loss of quality, of course).

If you posted examples of the original, Ghostscript output, and Acrobat output I might be able to tell you more, but not from this.

For what its worth, there's a new feature in Ghostscript (requires version 9.23 or better) which allows you to create a PDF file which consists only of an image, and choose the colour model. You could run the original PDF file through Ghostscript using something like:

gs -sDEVICE=pdfimage8 -r150 -sOutputFile=gs.pdf

which would produce a pretty minimal PDF file where the original input has been rendered to a gray scale image (at 150 dpi), and that image wrapped up as a PDF file. I've no idea if that might work better for you.

Later EDIT

Yep, its pretty much what I expected.

The original file has what appears to be marked JPEG compression artefacts (all the rectangular 'speckles' round the text). Obviously without seeing the original document I can't tell whether this is because the original document was a JPEG printed to paper, or whether the artefacts were introduced by the scanner, or (more likely) whatever application converted the scanned image into a PDF. Checking the image stored in the PDF file I see that it is indeed a JPEG image.

Nevertheless, the original image is (in my opinion) really very noisy.

Now the output from 'convert' is arguably slightly better (in terms of legibility) than the original. I presume this is 'something' to do with your convert command line, can't be sure. The image in this case is not a JPEG, its compressed with RunLength encoding which is of course lossless. Its also less efficient as a compression method, so the image is bigger. For reasons best known to ImageMagick it also applies a soft mask to the image data. So that's two images per page now instead of just 1. Not too surprising that its larger than the original!

I suspect that the soft mask is due to your command line including RGBA. I assume that produces an alpha channel, and PDF doens't support simple alpha channel blending, its own transparency model is much more sophisticated. So I sort of suspect you are actually making the output file here larger than it needs to be. I'm afraid I can't help you with ImageMagick, I don't know anything about it, but getting rid of that second image would help a great deal.

Note that both your original file and the output from ImageMagick are essentially uncompressed (in terms of the PDF file 'structure').

Then we come to the Ghostscript produced PDF. The 'structure' of the PDF file is itself compressed, giving small size benefits. The images are all JPEG compressed, giving additional compression, but at the cost of quality. Applying JPEG quantisation multiple times always costs quality. By simply comparing the output from 'convert' with the output from Ghostscript I can easily see the degradation in quality.

Now we come to the Acrobat output. Ccomparing it with the other files it shows the worst quality. The JPEG artefacts are very clearly visible in the displayed image. In this case both the image and the soft mask have been compressed with the JPEG2000 compression scheme, which is a 'better' compression than JPEG. However, it looks like applying it to data which has already been quantised for JPEG yields pretty poor quality results. Or at least, applying it to a soft-masked JPEG image does :-)

The main problem with JPEG2000 is that it is patent encumbered. While decoders can be written royalty-free, to write an encoder you must licence the patented technology from the (many) patent holders, an expensive process.

So the AGPL version of Ghostscript does not include a JPEG2000 decoder, and as such cannot write JPEG2000 images.

Obviously you could use a copy of Acrobat to rewrite your PDF file with JPEG2000 compression as you have done here.

Assuming you want to avoid doing that, then my suggestion would be to investigate why convert is producing an image with a soft mask applied. I strongly suspect this is due to the use of rgba instead of rgb.

Avoiding the creation of the second (soft mask) image would (I believe) significantly decrease the size of the PDF file produced by 'convert'. You could gain at least some additional benefit, without any loss of quality, by running it through Ghostscript's pdfwrite device and specifying /FlateEncode for the GrayImageFilter. That would produce a PDF file where the PDF furniture is compressed, and where a better compression scheme is applied to the image data.

You could also just leave the Ghostscript line as it is, the quality degradation may be enough for you to live with.



Related Topics



Leave a reply



Submit