Watermark in Existing PDF in Ruby

This will do it, using two gems:

PDF::Reader counts the number of pages in the file.

Prawn creates a new PDF document, using each page of the input PDF as a template.

require 'prawn'
require 'pdf-reader'

input_filename = 'input.pdf'
output_filename = 'output.pdf'

# Count the pages so that each one can be used as a template.
page_count = PDF::Reader.new(input_filename).page_count

Prawn::Document.generate(output_filename, :skip_page_creation => true) do |pdf|
  page_count.times do |num|
    # num starts at 0, while :template_page starts at 1.
    pdf.start_new_page(:template => input_filename, :template_page => num + 1)
    pdf.text('WATERMARK')
  end
end

However, in my testing the output file was huge with the latest released gem version of Prawn (0.12). After pointing my Gemfile at the master branch on GitHub, everything worked fine.

How to edit or write on existing PDF with Ruby?

You should definitely check out the Prawn gem, which lets you generate custom PDF files. You can also use Prawn to write text into an existing PDF by treating that PDF as a template for your new Prawn document.

For example:

filename = "#{Prawn::DATADIR}/pdfs/multipage_template.pdf"
Prawn::Document.generate("full_template.pdf", :template => filename) do
  text "This content is written on the first page of the template", :align => :center
end

This will write text onto the first page of the old pdf.

See more here:
http://prawn.majesticseacreature.com/manual.pdf

Unable to extract text and images from specific PDF

1.

Extracting text:

pdftotext -layout the.pdf -

Extract all pages' text to <stdout>.

pdftotext -layout -nopgbrk the.pdf the.txt

Extract all pages' text to the file the.txt, without inserting the pesky ^L (form feed) characters that mark new pages.

pdftotext -f 3 -l 5 -layout the.pdf the-3-5.txt

Extract the text of pages 3 to 5 to the-3-5.txt.
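Since the rest of this page is about Ruby, here is a minimal sketch of how the same pdftotext options could be driven from a script. The helper names pdftotext_args and extract_text are made up for illustration; only the -layout, -nopgbrk, -f and -l options shown above are used, and pdftotext is assumed to be on the PATH.

```ruby
require 'open3'

# Build the argument list for pdftotext.
# output "-" means stdout; first_page/last_page map to -f/-l.
# (Hypothetical helper, not part of any gem.)
def pdftotext_args(pdf_path, output: '-', first_page: nil, last_page: nil, page_breaks: true)
  args = ['pdftotext', '-layout']
  args << '-nopgbrk' unless page_breaks
  args += ['-f', first_page.to_s] if first_page
  args += ['-l', last_page.to_s] if last_page
  args + [pdf_path, output]
end

# Run pdftotext and return whatever it printed to stdout.
# The return value is only useful when output is "-".
def extract_text(pdf_path, **options)
  stdout, stderr, status = Open3.capture3(*pdftotext_args(pdf_path, **options))
  raise "pdftotext failed: #{stderr}" unless status.success?
  stdout
end
```

Calling extract_text('the.pdf', first_page: 3, last_page: 5) would then return the text of pages 3 to 5 as a string. Passing the arguments as an array rather than one shell string avoids quoting problems with file names that contain spaces.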

2.

Extracting images

pdfimages -f 4 -l 7 -j the.pdf myprefix--

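Extract all images from pages 4 through 7 as JPEGs (if possible!) using the image root myprefix--. pdfimages appends a hyphen and a running number to the root, so the resulting file names start with myprefix---.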

If extracting as JPEGs is not possible, the images will be extracted as pure raster PPM or PGM.

The latest versions of pdfimages (Poppler fork) let you specify -png (and more) to get all images as PNGs.

Using the latest version of pdfimages gives you these options:

$ pdfimages -h

pdfimages version 0.33.0
Copyright 2005-2015 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdfimages [options] <PDF-file> <image-root>
-f <int> : first page to convert
-l <int> : last page to convert
-png : change the default output format to PNG
-tiff : change the default output format to TIFF
-j : write JPEG images as JPEG files
-jp2 : write JPEG2000 images as JP2 files
-jbig2 : write JBIG2 images as JBIG2 files
-ccitt : write CCITT images as CCITT files
-all : equivalent to -png -tiff -j -jp2 -jbig2 -ccitt
-list : print list of images instead of saving
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
-p : include page numbers in output file names
-q : don't print any messages or errors
[....]

What more image formats do you want? If you need other formats use ImageMagick's convert command.

Also, there are no other "formats" embedded in PDFs.

Basically, the only compression methods for images embedded in PDFs are:

  • JPEG (the /DCTDecode filter is then named as the decompression hint for the PDF viewer),
  • JBIG2 (/JBIG2Decode),
  • Fax compression (/CCITTFaxDecode) and
  • JPEG2000 (/JPXDecode).

All other images embedded in PDFs basically are pure raster data anyway (PPM or PGM), and their PDF-internal compression is one of the other standard compression methods available for general stream compression:

  • /FlateDecode (ZIP/Deflate algorithm),
  • /LZWDecode (Lempel-Ziv-Welch algorithm) and
  • /RunLengthDecode.
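To make the two groups of filters concrete, here is a small Ruby lookup sketch. The constant and method names are invented for illustration. The flags are the ones listed in the pdfimages -h output quoted above, and the filter names are the standard PDF ones.

```ruby
# Filters whose image data pdfimages can write out in its native format,
# mapped to the matching pdfimages flag (from the -h output above).
NATIVE_IMAGE_FLAGS = {
  'DCTDecode'      => '-j',
  'JPXDecode'      => '-jp2',
  'JBIG2Decode'    => '-jbig2',
  'CCITTFaxDecode' => '-ccitt'
}.freeze

# General-purpose stream filters: the image behind them is plain raster data.
GENERAL_STREAM_FILTERS = %w[FlateDecode LZWDecode RunLengthDecode].freeze

# Accept names with or without the leading slash used in PDF syntax.
def normalize_filter(name)
  name.to_s.sub(%r{\A/}, '')
end

# Returns the pdfimages flag for a native image filter, or nil otherwise.
def pdfimages_flag_for(filter_name)
  NATIVE_IMAGE_FLAGS[normalize_filter(filter_name)]
end

# True when the filter is only generic stream compression over raster data.
def raster_only?(filter_name)
  GENERAL_STREAM_FILTERS.include?(normalize_filter(filter_name))
end
```

For example, pdfimages_flag_for('/DCTDecode') gives '-j', while raster_only?('/FlateDecode') is true, which is the case where the image comes out as PPM/PGM unless -png or -tiff is given.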

Update

I only now had time to look at your linked sample PDF, sorry.

As @mkl wrote in his comment, what looks like an image isn't always an image in PDF technical parlance. For example, on your PDF's page 7 there is the (famous) tiger head. This is completely composed from vector elements, which are placed inline into the page's /Contents stream.
The same is true for the depicted chess board.

I believe the tiger image was designed with the help of some vector graphics program a few decades ago (Adobe Illustrator?), when it had freshly been released, and exported to EPS. A PDF viewer in many cases has no way to distinguish inline vector elements (which could be simple horizontal lines) from other contents, unless these vector elements are "grouped" into an XObject (which pdfimages would not be able to extract either, but which would help with manual isolation and extraction).

These vector elements cannot be automatically extracted by any (Free and Open Source Software, or gratis closed source software) tool I know.

A "real" image in PDF parlance is a rectangle of pixel data. These are the only type of images which can be extracted by a tool like pdfimages.

Background images using prawn on all pages

Unfortunately, Prawn does not seem to offer an option that stretches a background image to fit the page. However, you can work around this yourself. The page the image needs to fit is static: it never changes size. Moreover, we know its exact dimensions:

p Prawn::Document::PageGeometry::SIZES['A4']
# => [595.28, 841.89]

Thus, the best solution is to use an image editor of your choice and create an image of that size. That is, create a canvas of that size and fit your original image whichever way you want. You can choose to stretch it, although I would probably just proportionally shrink it to fit the width and center it vertically. But that's up to you. Just make sure to create a 595px x 842px image and it should fit nicely.
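If you want to work out the numbers before opening the image editor, the "shrink to the width and center vertically" variant can be sketched in a few lines of Ruby. The method name fit_to_width is made up for illustration; the page size is the A4 value printed above, and the image size is whatever your original image measures.

```ruby
A4_WIDTH  = 595.28
A4_HEIGHT = 841.89

# Scale an image proportionally so that it spans the page width,
# then work out the vertical offset needed to center it on the page.
# A negative top_offset means the scaled image is taller than the page.
def fit_to_width(image_width, image_height, page_width = A4_WIDTH, page_height = A4_HEIGHT)
  scale = page_width / image_width.to_f
  scaled_height = image_height * scale
  {
    scale: scale,
    width: page_width,
    height: scaled_height,
    top_offset: (page_height - scaled_height) / 2.0
  }
end
```

For a square image this gives a height equal to the page width and a top_offset of roughly 123, which is the empty band to leave above and below the image on the canvas.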


