What OCR Options Exist Beyond Tesseract

What OCR options exist beyond Tesseract?

I have successfully used GOCR in the past for OCR on small images. On fairly regular fonts, and once the grayscale options were set properly, I would say accuracy was around 85%. It fails badly when the fonts get complicated, and it has trouble with multiline layouts.

Also have a look at OCRopus, a Google-sponsored project. It's related to Tesseract, but from what I understand, its OCR engine is different. With just the default models included, it achieves close to 99% accuracy on high-quality images, handles layout fairly well, and produces HTML output with formatting and line information. In my experience, however, its accuracy drops sharply when the image quality is poor. That said, training is relatively simple, so it may be worth a try.

Both are easy to call from the command line. GOCR usage is straightforward: run gocr -h and you should have all the information you need. OCRopus is a bit trickier; here's a usage example in Ruby:

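require 'tmpdir'

file = 'file.png' # path to the input image

# Work in a fresh temporary directory; it is deleted when the block exits.
text = Dir.mktmpdir do |tmp|
  out = File.join(tmp, 'out')
  html = File.join(tmp, 'output.html')

  # Each step operates on the working directory written by the previous one.
  # exception: true (Ruby 2.6+) raises if a step fails instead of continuing.
  system('ocropus', 'book2pages', out, file, exception: true)
  system('ocropus', 'pages2lines', out, exception: true)
  system('ocropus', 'lines2fsts', out, exception: true)
  # buildhtml prints the result to stdout, so redirect it into a file.
  system('ocropus', 'buildhtml', out, out: html, exception: true)

  File.read(html)
end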
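
For comparison, a GOCR call needs only a single step. Here is a minimal sketch, assuming gocr is on the PATH and the image is already in a format GOCR reads directly (such as PNM). The helper names gocr_command and run_gocr are made up for illustration, and the threshold of 160 is only an example value. As far as I recall, -i names the input file and -l sets the grey-level threshold, but check gocr -h for your version.

```ruby
require 'open3'

# Build the argument list for a gocr call (hypothetical helper).
# gray_level: optional threshold passed via -l; nil keeps gocr's default.
def gocr_command(image_path, gray_level: nil)
  cmd = ['gocr']
  cmd += ['-l', gray_level.to_s] if gray_level
  cmd + ['-i', image_path]
end

# Run gocr and return the recognized text (hypothetical helper).
# Passing the arguments as a list avoids shell quoting problems.
def run_gocr(image_path, gray_level: nil)
  text, status = Open3.capture2(*gocr_command(image_path, gray_level: gray_level))
  raise "gocr failed with status #{status.exitstatus}" unless status.success?
  text
end

# Example call (needs gocr installed and a real image file):
#   puts run_gocr('file.pnm', gray_level: 160)
```

Keeping the command construction separate from the call that runs it makes it easy to try different threshold values without touching the rest of the code.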

Best Python/Ruby lib for reading text inside images

You can use OpenCV, an open-source computer vision library with a Python API. It is widely considered an industry standard nowadays.

OpenCV official site: http://opencv.org/

If you need some tutorials on OpenCV-Python, visit: opencvpython.blogspot.com

You can also check this Stack Overflow question: Simple Digit Recognition OCR in OpenCV-Python

In addition, the OpenCV samples include some OCR implementations.

But I would recommend using Tesseract for OCR. In my view it is the best open-source OCR engine; it was originally developed by HP, and its development was later sponsored by Google.

Tesseract site: https://github.com/tesseract-ocr/tesseract

Pytesser, a Python wrapper for Tesseract: https://github.com/RobinDavid/Pytesser

Also check this Stack Overflow question: How do I choose between Tesseract and OpenCV?

So you can use OpenCV to preprocess the image and use Tesseract for OCR.
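
If you are working in Ruby rather than Python, the OCR half of that pipeline can be driven by shelling out to the tesseract binary. This is a minimal sketch, assuming tesseract 4 or newer is on the PATH (older versions do not accept --psm in this form) and that preprocessed.png is a hypothetical file already cleaned up with OpenCV. The helper names tesseract_command and run_tesseract are made up for illustration.

```ruby
require 'open3'

# Build the tesseract argument list (hypothetical helper).
# Using 'stdout' as the output base makes tesseract print the text
# instead of writing a .txt file. psm 3 is automatic page segmentation.
def tesseract_command(image_path, lang: 'eng', psm: 3)
  ['tesseract', image_path, 'stdout', '-l', lang, '--psm', psm.to_s]
end

# Run tesseract on an image and return the recognized text (hypothetical helper).
def run_tesseract(image_path, **opts)
  text, status = Open3.capture2(*tesseract_command(image_path, **opts))
  raise "tesseract failed with status #{status.exitstatus}" unless status.success?
  text.strip
end

# Example call (needs tesseract installed and a real image file):
#   puts run_tesseract('preprocessed.png', psm: 6)
```

A page segmentation mode of 6 treats the image as a single uniform block of text, which may suit cropped regions better than the automatic default.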

How does PaddleOCR performance compare to Tesseract?

I found a comparison between PaddleOCR 2 and Tesseract 4, but only for English texts. Briefly summarized:

  1. PaddleOCR is slightly slower than Tesseract on CPUs, but with GPU support it outperforms Tesseract in speed by 46% on a standard GPU.
  2. Without post-processing, PaddleOCR's mistakes are mainly missing whitespace between words and punctuation symbols. These errors are easy to correct, and after post-processing its accuracy is comparable to Tesseract's (1% lower).
  3. The pre-trained model for English is only about 10% of the file size of Tesseract's English trained data (2 MB vs. 23 MB).

For Chinese texts, which seem to be the main priority of PaddleOCR at the moment, the situation could be different.
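
The whitespace errors mentioned in point 2 could be patched with a couple of regular expressions. The comparison does not describe its own post-processing here, so the following is only a rough sketch of one possible approach for English text; the helper name restore_spaces and the two rules are assumptions, not the method the comparison used.

```ruby
# Re-insert spaces that OCR output dropped after punctuation (hypothetical helper).
# Only handles ASCII English text; digits are left alone so "3.14" stays intact.
def restore_spaces(text)
  text
    .gsub(/([,;:!?])(?=[A-Za-z])/, '\1 ')  # "Hello,world" -> "Hello, world"
    .gsub(/(?<=[a-z])\.(?=[A-Z])/, '. ')   # "end.Next"    -> "end. Next"
end

puts restore_spaces('Hello,world.This is fine')
```

The period rule only fires between a lowercase and an uppercase letter, which keeps it from splitting decimal numbers and most file names, at the cost of missing some sentence boundaries.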

Using Java to capture an area of the screen and identify text found there

Implementing OCR yourself is complicated, but using an SDK such as http://asprise.com/product/ocr/index.php?lang=java is simple.


