What OCR options exist beyond Tesseract?
I have successfully used GOCR in the past for small image OCR. I would say accuracy was around 85%, after getting the grayscale options set properly, on fairly regular fonts. It fails miserably when the fonts get complicated and has trouble with multiline layouts.
Also have a look at Ocropus, a project sponsored by Google. It's related to Tesseract, but from what I understand, its OCR engine is different. With just the default models included, it achieves nearly 99% accuracy on high-quality images, handles layout pretty well, and provides HTML output with formatting and line information. However, in my experience, its accuracy is very low when the image quality is poor. That said, training is relatively simple, so you might want to give it a try.
Both of them are easily callable from the command line. GOCR usage is very straightforward; just type gocr -h and you should have all the information you need. Ocropus is a bit trickier; here's a usage example, in Ruby:
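require 'fileutils'
# Working directory for intermediate files, and the image to recognize.
tmp = 'directory'
file = 'file.png'
# Make sure the working directory exists before Ocropus writes into it.
FileUtils.mkdir_p(tmp)
# Run the Ocropus pipeline step by step; the last step writes the HTML output.
`ocropus book2pages #{tmp}/out #{file}`
`ocropus pages2lines #{tmp}/out`
`ocropus lines2fsts #{tmp}/out`
`ocropus buildhtml #{tmp}/out > #{tmp}/output.html`
# Read the result, then remove the working directory.
text = File.read("#{tmp}/output.html")
FileUtils.rm_rf(tmp)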
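GOCR can be driven from Ruby in much the same way. The following is only a minimal sketch: the helper names gocr_command and run_gocr and the file name scan.pnm are made up for illustration, and it assumes the gocr binary is on your PATH. It uses the -i (input file) and -l (grey level) options listed by gocr -h; the right grey level depends on your images.

```ruby
require 'open3'

# Build the argument list for a GOCR call (hypothetical helper).
# grey_level is optional; when given, it is passed through -l.
def gocr_command(image_path, grey_level: nil)
  cmd = ['gocr']
  cmd += ['-l', grey_level.to_s] if grey_level
  cmd += ['-i', image_path]
  cmd
end

# Run GOCR and return whatever it prints on standard output.
# Passing the arguments as an array avoids shell quoting problems.
def run_gocr(image_path, grey_level: nil)
  output, status = Open3.capture2(*gocr_command(image_path, grey_level: grey_level))
  raise "gocr failed with status #{status.exitstatus}" unless status.success?
  output
end
```

Something like text = run_gocr('scan.pnm', grey_level: 140) would then return the recognized text as a string.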
Best Python/Ruby lib for reading text inside images
You can use OpenCV, an open-source computer vision library that has a Python API. It is considered an industry-standard library nowadays.
OpenCV official site : http://opencv.org/
If you need some tutorials on OpenCV-Python, visit : opencvpython.blogspot.com
You can also check this Stack Overflow question : Simple Digit Recognition OCR in OpenCV-Python
In addition, the OpenCV samples include some OCR implementations.
But I would recommend using Tesseract for OCR. It is the best open-source OCR engine, originally developed by HP and now handled by Google.
Tesseract site : https://github.com/tesseract-ocr/tesseract
Python API for Tesseract, Pytesser : https://github.com/RobinDavid/Pytesser
Also check this Stack Overflow question : How do I choose between Tesseract and OpenCV?
So you can use OpenCV to preprocess the image and use Tesseract for OCR.
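As a rough sketch of the second half of that pipeline, here is how the Tesseract command-line tool could be called from Ruby once the image has been cleaned up. It assumes the preprocessed image was already saved to disk by your OpenCV step (cleaned.png is a made-up name), that the tesseract binary is on your PATH, and that you run Tesseract 4 or later, where the page segmentation option is spelled --psm. The helper names are hypothetical.

```ruby
require 'open3'

# Build the Tesseract command line (hypothetical helper).
# Using "stdout" as the output base makes Tesseract print the text
# instead of writing it to a file.
def tesseract_command(image_path, lang: 'eng', psm: nil)
  cmd = ['tesseract', image_path, 'stdout', '-l', lang]
  cmd += ['--psm', psm.to_s] if psm
  cmd
end

# Run Tesseract on an already preprocessed image and return the text.
def run_tesseract(image_path, lang: 'eng', psm: nil)
  output, status = Open3.capture2(*tesseract_command(image_path, lang: lang, psm: psm))
  raise "tesseract failed with status #{status.exitstatus}" unless status.success?
  output.strip
end
```

For example, run_tesseract('cleaned.png', psm: 6) treats the image as a single block of text; which mode works best depends on your layout.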
How does PaddleOCR performance compare to Tesseract?
I found a comparison between PaddleOCR 2 and Tesseract 4, but only for English texts. Briefly summarized:
- PaddleOCR is slightly slower than Tesseract on CPUs, but with GPU support it is 46% faster than Tesseract on a standard GPU.
- Without post-processing, PaddleOCR mainly makes mistakes with missing white spaces between words and punctuation symbols. However, these errors can be easily corrected. After post-processing, the accuracy is comparable to Tesseract's (1% lower).
- The pre-trained model for English is only about 10% of the file size of Tesseract's English trained data (2 MB vs. 23 MB).
For Chinese texts, which seem to be the main priority of PaddleOCR at the moment, the situation could be different.
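The comparison does not say how the post-processing was done, so the following is only an illustrative guess at what correcting missing white space around punctuation could look like. The function name fix_missing_spaces and both rules are my own assumptions, not PaddleOCR functionality.

```ruby
# Insert a space after punctuation that is glued to the following word.
# This is a heuristic sketch; abbreviations and similar cases are not handled.
def fix_missing_spaces(text)
  # Comma, semicolon, colon, "!" or "?" directly followed by a letter.
  # Digits are left alone so that "3,000" and "10:30" stay intact.
  fixed = text.gsub(/([,;:!?])(?=[A-Za-z])/, '\1 ')
  # A period between a lowercase and an uppercase letter is treated
  # as a sentence boundary that lost its space.
  fixed.gsub(/(?<=[a-z])\.(?=[A-Z])/, '. ')
end
```

With these rules, fix_missing_spaces('Hello,world.This is fine') returns "Hello, world. This is fine", while numbers such as 3,000 are left untouched.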
Using Java to capture an area of the screen and identify text found there
Implementing OCR yourself is complicated, but using an SDK such as Asprise OCR (http://asprise.com/product/ocr/index.php?lang=java) is simple.