A light solution to convert text to PDF in Linux
One way would be to use enscript followed by ps2pdf:
enscript -p file.ps file.txt
ps2pdf file.ps file.pdf
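If you don't need the intermediate PostScript file, the two steps can probably be joined with a pipe. This is only a sketch: the function name txt2pdf is made up for illustration, and it assumes that your enscript accepts `-p -` (write to stdout) and that your ps2pdf accepts `-` as its input (read from stdin).

```shell
# txt2pdf (hypothetical helper): convert a text file to PDF in one pipeline.
# Usage: txt2pdf notes.txt            -> writes notes.pdf
#        txt2pdf notes.txt out.pdf    -> writes out.pdf
txt2pdf() {
    local src=$1
    local dst=${2:-${src%.txt}.pdf}   # default: same name with a .pdf suffix
    # -q silences enscript's status messages; -p - sends PostScript to stdout,
    # and "ps2pdf -" reads that PostScript from stdin.
    enscript -q -p - "$src" | ps2pdf - "$dst"
}
```

The two-command version is still useful when you want to inspect or keep the .ps file.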
Extracting Text from a PDF file with embedded font
In this case I recommend NOT using ImageMagick for the PDF -> TIFF conversion. Instead, use Ghostscript. Two reasons:
Using Ghostscript directly will give you more control over individual parameters of the conversion.
ImageMagick cannot do that particular conversion itself -- it will call Ghostscript as its 'delegate' anyway, but it will not give you the same fine-grained control that your own Ghostscript command does.
Most of the text in the table of your sample PDF is extremely small (I'd guess only 4 or 5 pt high). This makes it rather difficult to run a successful OCR unless you increase the resolution considerably.
Ghostscript uses -r72 by default for image format output (such as TIFF). Tesseract works best at a resolution of 300 or 400 DPI -- but only for font sizes of 10-12 pt or higher. Therefore, to compensate for the small text size, you should make Ghostscript use a resolution of at least 1200 DPI when it renders the PDF to the image.
Also, you'll have to rotate the image so the text displays in the normal reading direction (not bottom -> top).
This is the command which I would try first:
gs \
-o sample.tif \
-sDEVICE=tiffg4 \
-r1200 \
-dAutoRotatePages=/PageByPage \
sample_rotate-0.pdf
You may need to play with variations of the -r1200 parameter (higher or lower) for best results.
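To compare resolutions without retyping the command, something like the following might help. It is a rough sketch, not a tuned pipeline: the function name render_and_ocr and the output naming scheme are my own invention, the gs call is a stripped-down version of the command above (add the rotation option back if you need it), and it assumes tesseract is installed and called as `tesseract IMAGE OUTPUTBASE`, which writes OUTPUTBASE.txt.

```shell
# render_and_ocr (hypothetical helper): render one PDF at several resolutions
# and run Tesseract on each result so the text output can be compared.
# Usage: render_and_ocr sample.pdf 600 900 1200
render_and_ocr() {
    local pdf=$1
    shift
    local base=${pdf%.pdf}
    local dpi tif
    for dpi in "$@"; do
        tif="${base}-r${dpi}.tif"
        # -o sets the output file (and implies batch mode); -r sets the DPI.
        gs -q -o "$tif" -sDEVICE=tiffg4 -r"$dpi" "$pdf" || return 1
        # Writes the recognized text to ${base}-r${dpi}.txt
        tesseract "$tif" "${base}-r${dpi}" || return 1
    done
}
```

Afterwards you can diff the resulting .txt files, or simply read them, to see which resolution gives the cleanest text.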
Bash To Convert PDF Files In Multiple Subdirectories
You can use the find command with the -exec option to trigger the conversion:
find /path/to/your/root/pdf/folder -type f -name "*.pdf" -exec bash -c 'pdftohtml -c -i -s "$1"' _ {} \;
pdftohtml is executed once for every PDF file found. Note that {} represents the path of the current PDF file, which the inline bash script receives as "$1".
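With many files, starting one bash process per PDF can get slow. A possible variation, shown here only as a sketch, ends -exec with `+` instead of `\;` so that find passes a batch of paths to each bash invocation, and the inline script loops over them. The function name convert_pdfs_under is hypothetical; the pdftohtml flags are the same as in the command above.

```shell
# convert_pdfs_under (hypothetical helper): run pdftohtml on every PDF below
# a root directory, batching many files into each bash invocation.
# Usage: convert_pdfs_under /path/to/your/root/pdf/folder
convert_pdfs_under() {
    local root=$1
    # "_" fills $0 of the inline script; the found paths arrive as "$@".
    find "$root" -type f -name '*.pdf' -exec bash -c '
        for pdf in "$@"; do
            pdftohtml -c -i -s "$pdf"
        done
    ' _ {} +
}
```

Quoting "$pdf" inside the loop keeps paths with spaces intact, just as "$1" does in the single-file version.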
Merge / convert multiple PDF files into one PDF
I'm sorry, I managed to find the answer myself using Google and a bit of luck : )
For those interested:
I installed pdftk (the PDF toolkit) on our Debian server, and using the following command I achieved the desired output:
pdftk file1.pdf file2.pdf cat output output.pdf
Or, using Ghostscript:
gs -q -sPAPERSIZE=letter -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=output.pdf file1.pdf file2.pdf file3.pdf ...
This in turn can be piped directly into pdf2ps.
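If a script has to run on machines where pdftk may or may not be installed, the two commands can be wrapped so that Ghostscript acts as a fallback. This is just a sketch under that assumption; the function name merge_pdfs is made up, and the gs branch drops the -sPAPERSIZE=letter setting from the command above, so add it back if you need a fixed paper size.

```shell
# merge_pdfs (hypothetical helper): concatenate PDFs in the order given,
# using pdftk when it is on PATH and Ghostscript otherwise.
# Usage: merge_pdfs output.pdf file1.pdf file2.pdf
merge_pdfs() {
    if [ "$#" -lt 3 ]; then
        echo "usage: merge_pdfs OUTPUT.pdf INPUT1.pdf INPUT2.pdf [MORE.pdf]" >&2
        return 2
    fi
    local out=$1
    shift
    if command -v pdftk >/dev/null 2>&1; then
        pdftk "$@" cat output "$out"
    else
        gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$out" "$@"
    fi
}
```

The output file comes first so that a shell glob for the inputs can go at the end of the command line.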
How to convert a PDF into JPG with command line in Linux?
You can try ImageMagick's convert utility.
On Ubuntu, you can install it with this command:
$ sudo apt-get install imagemagick
Use convert like this:
$ convert input.pdf output.jpg
# For good quality use these parameters
$ convert -density 300 -quality 100 in.pdf out.jpg
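For multi-page PDFs you may want predictable file names, or only a single page. The sketch below assumes ImageMagick's printf-style `%03d` placeholder in the output name and its `file.pdf[N]` page selector (pages counted from 0). The function names are hypothetical, and on ImageMagick 7 the command is usually `magick` rather than `convert`.

```shell
# pdf_to_jpgs (hypothetical helper): one JPG per page, numbered 000, 001, ...
# Usage: pdf_to_jpgs in.pdf           -> in-000.jpg, in-001.jpg, ...
#        pdf_to_jpgs in.pdf page      -> page-000.jpg, page-001.jpg, ...
pdf_to_jpgs() {
    local pdf=$1
    local prefix=${2:-${pdf%.pdf}}
    # -density must come before the input file so it applies while rendering.
    convert -density 300 -quality 100 "$pdf" "${prefix}-%03d.jpg"
}

# pdf_page_to_jpg (hypothetical helper): convert a single page (0 = first).
# Usage: pdf_page_to_jpg in.pdf 0 cover.jpg
pdf_page_to_jpg() {
    convert -density 300 -quality 100 "${1}[${2}]" "$3"
}
```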
Is there any wkhtmltopdf option to convert html text rather than file?
You can pipe content into wkhtmltopdf using the command line. For Windows, try this:
echo "<h3>blep</h3>" | wkhtmltopdf.exe - test.pdf
This reads as "echo <h3>blep</h3> and send its stdout (standard output stream) to wkhtmltopdf's stdin (standard input stream)".
The dash - in the wkhtmltopdf command means that it takes its input from stdin and not from a file.
You could also echo HTML into a file, feed that file to wkhtmltopdf and delete that file inside a script.
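The temporary-file approach could look roughly like this in a Linux shell script. Treat it as a sketch: the function name html_string_to_pdf is invented for illustration, and it assumes wkhtmltopdf is on your PATH and called as `wkhtmltopdf INPUT.html OUTPUT.pdf`.

```shell
# html_string_to_pdf (hypothetical helper): write an HTML string to a
# temporary file, convert it, then clean up.
# Usage: html_string_to_pdf '<h3>blep</h3>' test.pdf
html_string_to_pdf() {
    local html=$1
    local out=$2
    local dir status
    dir=$(mktemp -d) || return 1
    # printf avoids the escape-sequence surprises some echo versions have.
    printf '%s\n' "$html" > "$dir/page.html"
    wkhtmltopdf "$dir/page.html" "$out"
    status=$?
    rm -rf "$dir"            # remove the temporary file even if conversion failed
    return "$status"
}
```

The stdin variant is shorter, but a real file can be easier to debug because you can open it in a browser before converting.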