Optimize PDF Files (With Ghostscript or Other)

If you are looking for Free (as in 'libre') Software, Ghostscript is probably your best choice. However, it is not always easy to use: some of its (very powerful) processing options are poorly documented and hard to find.

Have a look at this answer, which explains how to exercise finer control over image downsampling than the generic -dPDFSETTINGS=/screen switch allows (that switch only sets a few overall defaults, which you may want to override):

How to downsample images within pdf file?

Basically, it tells you how to make Ghostscript downsample all images to a resolution of 72dpi (this value is what -dPDFSETTINGS=/screen uses -- you may want to go even lower):

-dDownsampleColorImages=true \
-dDownsampleGrayImages=true \
-dDownsampleMonoImages=true \
-dColorImageResolution=72 \
-dGrayImageResolution=72 \
-dMonoImageResolution=72 \
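
If you drive Ghostscript from a script, the six switches above can be generated rather than typed out. The following is only a minimal sketch in Python; the function name `downsample_options` is made up for illustration, and the switches it emits are exactly the ones listed above.

```python
def downsample_options(dpi: int = 72) -> list[str]:
    """Return the Ghostscript switches that downsample all image types to `dpi`."""
    if dpi <= 0:
        raise ValueError("dpi must be a positive integer")
    kinds = ("Color", "Gray", "Mono")
    # First switch downsampling on for each image type ...
    options = [f"-dDownsample{kind}Images=true" for kind in kinds]
    # ... then set the target resolution for each image type.
    options += [f"-d{kind}ImageResolution={dpi}" for kind in kinds]
    return options
```

The returned list can be spliced into an argument list for `subprocess.run`, for example `["gs", "-sDEVICE=pdfwrite", "-o", "output.pdf", *downsample_options(72), "input.pdf"]`.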

To check whether Ghostscript can also 'un-embed' the fonts used (sometimes it works, sometimes not, depending on the complexity of the embedded font and on the font type), add the following to your gs command:

gs \
  -o output.pdf \
  -sDEVICE=pdfwrite \
  [...other options...] \
  -dEmbedAllFonts=false \
  -dSubsetFonts=true \
  -dConvertCMYKImagesToRGB=true \
  -dCompressFonts=true \
  -c ".setpdfwrite <</AlwaysEmbed [ ]>> setdistillerparams" \
  -c ".setpdfwrite <</NeverEmbed [/Courier /Courier-Bold /Courier-Oblique /Courier-BoldOblique /Helvetica /Helvetica-Bold /Helvetica-Oblique /Helvetica-BoldOblique /Times-Roman /Times-Bold /Times-Italic /Times-BoldItalic /Symbol /ZapfDingbats /Arial]>> setdistillerparams" \
  -f input.pdf

Note: Downsampling images reduces their quality irreversibly, and un-embedding fonts makes it difficult or impossible to display and print the PDF unless the same fonts are installed on the machine.

Update

One option which I had overlooked in my original answer is to add

-dDetectDuplicateImages=true

to the command line. This parameter makes Ghostscript try to detect images that are embedded in the PDF multiple times. That can happen if you use an image as a logo or page background and the PDF-generating software is not optimized for this situation. It used to be the case with older versions of OpenOffice/LibreOffice (I tested the latest release of LibreOffice, v4.3.5.2, and it no longer does this).

It also happens if you concatenate PDF files with the help of pdftk. To show you the effect, and how you can discover it, let's look at a sample PDF file:

pdfinfo p1.pdf

 Producer:       libtiff / tiff2pdf - 20120922
 CreationDate:   Tue Jan  6 19:36:34 2015
 ModDate:        Tue Jan  6 19:36:34 2015
 Tagged:         no
 UserProperties: no
 Suspects:       no
 Form:           none
 JavaScript:     no
 Pages:          1
 Encrypted:      no
 Page size:      595 x 842 pts (A4)
 Page rot:       0
 File size:      20983 bytes
 Optimized:      no
 PDF version:    1.1

Recent versions of Poppler's pdfimages utility have added support for a -list parameter, which can list all images included in a PDF file:

pdfimages -list p1.pdf

 page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
 --------------------------------------------------------------------------------------
    1   0 image    423   600   rgb    3   8 jpeg     no     7  0    52    52 19.2K 2.6%

This sample PDF is a 1-page document, containing an image, which is compressed with JPEG-compression, has a width of 423 pixels and a height of 600 pixels and renders at a resolution of 52 PPI on the page.

If we concatenate 3 copies of this file with the help of pdftk like so:

pdftk p1.pdf p1.pdf p1.pdf cat output p3.pdf

then the result shows these image properties via pdfimages -list:

pdfimages -list p3.pdf

 page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
 --------------------------------------------------------------------------------------
    1   0 image   423    600   rgb    3   8 jpeg     no     4  0    52    52 19.2K 2.6%
    2   1 image   423    600   rgb    3   8 jpeg     no     8  0    52    52 19.2K 2.6%
    3   2 image   423    600   rgb    3   8 jpeg     no    12  0    52    52 19.2K 2.6%

This shows that there are 3 identical PDF objects (with the IDs 4, 8 and 12) which are embedded in p3.pdf now. p3.pdf consists of 3 pages:

pdfinfo p3.pdf | grep Pages:

 Pages:          3
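
Spotting this by eye gets tedious for longer listings. Here is a rough Python sketch that parses the text printed by pdfimages -list and reports images that look identical but are stored under more than one object ID. The function name `find_duplicate_images` is made up, and the 16-column row layout is assumed from the listings shown here. Matching width, height, color, encoding and size is only a hint, not proof, that the image data is the same.

```python
from collections import defaultdict


def find_duplicate_images(listing: str) -> dict[tuple, list[int]]:
    """Group image rows by visible properties; keep groups stored as several objects."""
    groups = defaultdict(set)
    for line in listing.splitlines():
        fields = line.split()
        # Data rows have 16 whitespace-separated fields and start with a page number;
        # this skips the header line and the dashed separator.
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        width, height, color, enc = fields[3], fields[4], fields[5], fields[8]
        object_id, size = int(fields[10]), fields[14]
        groups[(width, height, color, enc, size)].add(object_id)
    # Only groups with more than one distinct object ID are suspicious.
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}
```

Feeding it the p3.pdf listing above yields one group with the object IDs 4, 8 and 12. A listing in which every row points at the same object ID yields an empty result.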

Optimize PDF by replacing duplicate images with references

Now we can apply the optimization mentioned above with the help of Ghostscript:

 gs -o p3-optim.pdf -sDEVICE=pdfwrite -dDetectDuplicateImages=true p3.pdf

Checking:

 pdfimages -list p3-optim.pdf

 page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
 --------------------------------------------------------------------------------------
    1   0 image   423    600   rgb    3   8 jpeg     no    10  0    52    52 19.2K 2.6%
    2   1 image   423    600   rgb    3   8 jpeg     no    10  0    52    52 19.2K 2.6%
    3   2 image   423    600   rgb    3   8 jpeg     no    10  0    52    52 19.2K 2.6%

There is still one image listed per page -- but the PDF object ID is always the same now: 10.

 ls -ltrh p1.pdf p3.pdf p3-optim.pdf

   -rw-r--r--@ 1 kp  staff    20K Jan  6 19:36 p1.pdf
   -rw-r--r--  1 kp  staff    60K Jan  6 19:37 p3.pdf
   -rw-r--r--  1 kp  staff    16K Jan  6 19:40 p3-optim.pdf

As you can see, the "dumb" concatenation made with pdftk tripled the file size (from 20K to 60K). The optimization by Ghostscript brought it down to 16K, which is even smaller than the single-page original.

The most recent versions of Ghostscript may even apply the -dDetectDuplicateImages by default. (AFAIR, v9.02, which introduced it for the first time, didn't use it by default.)

Ghostscript PDF batch compression

The following script lets you list all directories to process in the array variable filesDir.

It loops over these directories and searches each one, including its subdirectories, for PDF files.

For every PDF file found it runs this ghostscript command (GitHub) and writes the result under a new name, e.g. fileabc.pdf becomes compr_fileabc.pdf.

Edit #1:

I changed the script, as requested in the comments, to either write new pdf files or overwrite the input pdf file. To select between these two options, set the createNewPDFs variable to 1 (new files) or 0 (overwrite).

Because ghostscript can't write to its own input file, the output file is written to the user's temporary path (%TEMP%) and then moved over the original input file. The input pdf file is only overwritten if the new file is smaller.

Furthermore, the ghostscript command is held in a variable of the same name, because under Windows it can be either gswin64c (64 bit) or gswin32c (32 bit).

If the resulting files are not small enough, experiment with this ghostscript switch: -dPDFSETTINGS=/printer. It is explained below.

Batch script:

@echo off
setlocal EnableDelayedExpansion

rem ghostscript executable name
set "ghostscript=gswin64c"

rem directories to scan for files
set "filesDir[0]=FOLDER1"
set "filesDir[1]=FOLDER2"
set "filesDir[2]=FOLDER3"

rem extension of files to be scanned
set "ext=pdf"

rem create new files (1) or overwrite the input file (0)
set "createNewPDFs=0"
rem file prefix for new files (if they should be created)
set "filepre=compr_"

rem loop over all directories defined in filesDir array
for /f "tokens=2 delims==" %%d in ('set filesDir[') do (
   if exist "%%~d" (
      pushd "%%~d"
      rem loop over all files in all (sub)directories with given extension
      for /f "delims=*" %%f in ('dir "*.%ext%" /b /s /a:-d') do (
         if [%createNewPDFs%] EQU [1] (
            %ghostscript% -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/printer -dNOPAUSE -dQUIET -dBATCH -sOutputFile="%%~dpf%filepre%%%~nxf" "%%~f"
         ) else (
            %ghostscript% -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/printer -dNOPAUSE -dQUIET -dBATCH -sOutputFile="%TEMP%\%%~nxf" "%%~f"
            for %%t in ("%TEMP%\%%~nxf") do ( set "newSize=%%~zt" )
            for %%t in ("%%~f") do ( set "oldSize=%%~zt" )
            rem compare without brackets: [123] LSS [45] would be a string comparison
            if !newSize! LSS !oldSize! (
               rem new file is smaller --> overwrite
               move /y "%TEMP%\%%~nxf" "%%~f"
            ) else (
               rem new file is not smaller --> delete it from the temp dir
               del "%TEMP%\%%~nxf"
            )
         )
      )
      popd
   )
)
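
For anyone not on Windows, the overwrite branch of the script (write to a temporary file, keep the result only if it is smaller) can be sketched in Python roughly as follows. This is an illustration and not part of the batch script: the function names `replace_if_smaller` and `compress_in_place` are made up, and the executable name `gs` is an assumption (under Windows it would be gswin64c or gswin32c, as above).

```python
import os
import shutil
import subprocess
import tempfile
from pathlib import Path


def replace_if_smaller(candidate: Path, original: Path) -> bool:
    """Move `candidate` over `original` if it is non-empty and strictly smaller; else delete it."""
    if 0 < candidate.stat().st_size < original.stat().st_size:
        shutil.move(str(candidate), str(original))
        return True
    candidate.unlink()
    return False


def compress_in_place(pdf: Path, gs: str = "gs", preset: str = "/printer") -> bool:
    """Run Ghostscript into a temporary file, then keep whichever file is smaller."""
    fd, tmp_name = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)  # Ghostscript opens and writes the file itself
    tmp = Path(tmp_name)
    try:
        # Same switches as in the batch script above.
        subprocess.run(
            [gs, "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
             f"-dPDFSETTINGS={preset}", "-dNOPAUSE", "-dQUIET", "-dBATCH",
             f"-sOutputFile={tmp}", str(pdf)],
            check=True,
        )
    except Exception:
        tmp.unlink(missing_ok=True)  # do not leave a stale temp file behind
        raise
    return replace_if_smaller(tmp, pdf)
```

Comparing the sizes as integers avoids the string-versus-number pitfall of the batch `if` command, and `check=True` stops the input file from being replaced when Ghostscript reports an error.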

The ghostscript command to reduce pdf size, as found on GitHub:

This can reduce files to ~15% of their size (2.3M to 345K, in one case) with no obvious degradation of quality.

ghostscript -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/printer -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

Other options for PDFSETTINGS:

/screen selects low-resolution output similar to the Acrobat Distiller "Screen Optimized" setting.

/ebook selects medium-resolution output similar to the Acrobat Distiller "eBook" setting.

/printer selects output similar to the Acrobat Distiller "Print Optimized" setting.

/prepress selects output similar to Acrobat Distiller "Prepress Optimized" setting.

/default selects output intended to be useful across a wide variety of uses, possibly at the expense of a larger output file.
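
A mistyped preset name is an easy error to make when scripting this, so it may help to validate it before building the command. The sketch below is a hypothetical helper (`gs_compress_command` and `PDF_PRESETS` are made-up names), and it assumes the executable is called `gs`. The switches are the ones from the command above, and the preset names are the five listed here.

```python
PDF_PRESETS = ("/screen", "/ebook", "/printer", "/prepress", "/default")


def gs_compress_command(src: str, dst: str, preset: str = "/printer",
                        gs: str = "gs") -> list[str]:
    """Build the pdfwrite argument list for one of the canned PDFSETTINGS presets."""
    if preset not in PDF_PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {PDF_PRESETS}")
    if src == dst:
        # Ghostscript cannot write to the file it is reading from.
        raise ValueError("output file must differ from input file")
    return [
        gs, "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS={preset}", "-dNOPAUSE", "-dQUIET", "-dBATCH",
        f"-sOutputFile={dst}", src,
    ]
```

The list can be handed to `subprocess.run(..., check=True)`. Trying `/screen` or `/ebook` first and comparing the output sizes is a quick way to see how much each preset actually buys you for a given file.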

Source: http://ghostscript.com/doc/current/Ps2pdf.htm

Command reference links from ss64.com:

set
DelayedExpansion
for /f
dir
if
pushd
popd
rem

best pdf compression technique?

You probably won't be able to compress any PDF to half of its size.

BUT

There are some operations that can reduce size of your PDF. Sometimes these operations yield very good results.

So here they are:

Remove unused objects from PDF
Replace indirect objects with direct ones (where applicable)
Use object streams
Use cross-reference streams
Compress streams
Use absolute minimum of whitespace chars
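
To illustrate why "Compress streams" is safe: PDF's Flate filter uses the same zlib/deflate format as Python's standard `zlib` module, and the round trip is lossless. The snippet below is a toy sketch that does not touch a real PDF and has nothing to do with the Docotic.Pdf API; the sample content stream and the name `flate_compress` are invented for the example.

```python
import zlib


def flate_compress(stream: bytes, level: int = 9) -> bytes:
    """Compress raw stream bytes the way a FlateDecode filter expects them."""
    return zlib.compress(stream, level)


# An invented, highly repetitive page content stream (text-drawing operators).
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n" * 200
packed = flate_compress(content)

# Decompressing gives back exactly the original bytes: nothing is lost.
assert zlib.decompress(packed) == content
print(len(content), "->", len(packed), "bytes")
```

How much this saves depends entirely on the data: repetitive content streams shrink a lot, while streams that are already compressed (JPEG images, for instance) barely change.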

There are even more operations, but they are not as safe as the ones above. They are lossy, so please think twice before applying them.

Remove metadata or some of metadata.
Remove structure information
Un-embed fonts (remove font bytes from PDF)

All of the above can be done with help of Docotic.Pdf library (disclaimer: I work for the company).

You can probably use another library for this, as long as it provides the corresponding APIs or gives access to the inner structure of the PDF.

Ghostscript is increasing file size after compressing

Ghostscript (more accurately its pdfwrite device) doesn't 'compress' files.

It is possible, by judicious use of settings which will do things like downsample images to trade quality for file size, to get a smaller file produced but there is absolutely no guarantee that this is the case.

Without seeing the input file, there is no possible way to comment on why your file increases in size, but (for example) a PDF 1.5 file can use compressed streams and xref, and the pdfwrite device never uses those, so that could be one reason.
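
To get a first idea of whether this applies to a given input file, you can scan its raw bytes for the markers of object streams and cross-reference streams. This is only a heuristic sketch in Python with a made-up function name (`pdf_compression_hints`); it looks at the header version and at the `/Type /ObjStm` and `/Type /XRef` dictionary entries and does not parse the file properly.

```python
import re


def pdf_compression_hints(data: bytes) -> dict:
    """Heuristically report the header version and use of object/xref streams."""
    # The %PDF-x.y header is expected near the start of the file.
    header = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "object_streams": re.search(rb"/Type\s*/ObjStm", data) is not None,
        "xref_streams": re.search(rb"/Type\s*/XRef", data) is not None,
    }
```

Call it as `pdf_compression_hints(open("input.pdf", "rb").read())`. If both flags come back true, the input already benefits from a more compact file structure, which is one plausible reason for the output ending up larger.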

The canned 'PDFSETTINGS' cover a multitude of different controls; you should read up on those and understand what is actually going on. If your original file happens to have already traded quality for size, then it's entirely likely that the printer settings (which are reasonably conservative) will not actually do anything at all.
