How do I train tesseract 4 with image data instead of a font file?
Clone the tesstrain repo at https://github.com/tesseract-ocr/tesstrain.
You’ll also need to clone the tessdata_best repo, https://github.com/tesseract-ocr/tessdata_best. This acts as the starting point for your training. Training from scratch takes hundreds of thousands of samples to reach good accuracy, so starting from an existing model lets you fine-tune with far less data (tens to hundreds of samples can be enough).
Add your training samples to the directory in the tesstrain repo named ./tesstrain/data/my-custom-model-ground-truth
Your training samples should be image/text file pairs that share the same name but have different extensions. For example, you should have an image file named 001.png that is a picture of the text foobar, and a text file named 001.gt.txt that contains the text foobar.
Each image/text pair must contain a single line of text.
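Before training, it can help to check that every image has a matching transcription and that each transcription is a single line. The following is a minimal sketch, not part of tesstrain: the function name check_ground_truth and the list of image extensions are my own assumptions, and it only handles plain names such as 001.png or 001.tif.

```python
from pathlib import Path

# Assumption: only these image extensions are used for training samples.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff"}


def check_ground_truth(directory):
    """Return a list of human-readable problems found in a ground-truth directory."""
    problems = []
    for image in sorted(Path(directory).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip the .gt.txt files and anything else
        # 001.png is expected to pair with 001.gt.txt
        transcript = image.with_name(image.stem + ".gt.txt")
        if not transcript.exists():
            problems.append(f"{image.name}: missing {transcript.name}")
            continue
        text = transcript.read_text(encoding="utf-8").strip()
        if not text:
            problems.append(f"{transcript.name}: empty transcription")
        elif len(text.splitlines()) > 1:
            problems.append(f"{transcript.name}: more than one line of text")
    return problems
```

Calling check_ground_truth("data/my-custom-model-ground-truth") from the tesstrain repo should return an empty list when every pair is in order.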
In the tesstrain repo, run this command:
make training MODEL_NAME=my-custom-model START_MODEL=eng TESSDATA=~/src/tessdata_best
Once the training is complete, there will be a new file tesstrain/data/.traineddata. Copy that file to the directory Tesseract searches for models. On my machine, it was /usr/local/share/tessdata/.
Then, you can run tesseract and use that model as a language.
tesseract -l my-custom-model foo.png -
Does Tesseract's hOCR output really contain bounding boxes and confidence levels for each character?
You've seen it: it isn't there.
So you can either modify the Tesseract source code to output hOCR that includes the x_confs property you want, or use its ResultIterator API class to get confidence at the character (symbol) level (be sure to call SetVariable("save_blob_choices", "T") after the Init method).
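What the stock hOCR output does carry, as far as I have seen, is word-level information: each ocrx_word span has a title attribute with a bbox and an x_wconf value. Below is a minimal sketch that pulls those out using only the Python standard library. The names parse_title and WordCollector are hypothetical helpers, and the sample markup in the usage note is an assumed example of Tesseract's output, not captured from a real run.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split a title such as 'bbox 36 92 96 116; x_wconf 95' into a dict of lists."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


class WordCollector(HTMLParser):
    """Collect text, bounding box and confidence for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # the word currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            props = parse_title(attrs.get("title") or "")
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in props.get("bbox", [])),
                "conf": float(props["x_wconf"][0]) if "x_wconf" in props else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Nested <strong>/<em> end tags are ignored; only </span> closes a word.
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None
```

To use it, create a WordCollector, call feed() with the hOCR string, and read the words list. Per-character confidences are not available this way; for those you still need the ResultIterator route described above.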
Tesseract hOCR iOS
You get an NSString if you proceed as follows.
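- (NSString *)getHOCRText {
    // GetHOCRText(0) returns a heap-allocated UTF-8 string for page 0.
    char *boxtext = _tesseract->GetHOCRText(0);
    NSString *hocr = [NSString stringWithUTF8String:boxtext];
    delete[] boxtext; // the caller owns the buffer and must free it
    return hocr;
}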
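Later you can convert this NSString (called xmlString here) to NSData. Use UTF-8 rather than ASCII, because hOCR output is UTF-8 and an ASCII conversion fails on any non-ASCII character.
NSData *xmlData = [xmlString dataUsingEncoding:NSUTF8StringEncoding];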
You can then parse this data using NSXMLParser:
NSXMLParser *xmlParser = [[NSXMLParser alloc] initWithData:xmlData];
I hope you are familiar with the remaining parsing steps.
How to preserve document structure in tesseract
Newer versions of Tesseract (3.04 and later) have an option called preserve_interword_spaces which should do what you want.
Note that the number of spaces Tesseract detects between words may not always be the same between similar lines. So words that are left-aligned with a run of spaces preceding them (as in your example) may not be output this way. The preserve_interword_spaces option does not attempt to do anything fancy; it merely preserves the spaces that text extraction found. By default Tesseract collapses runs of spaces into one.
Details on this option are here.
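As a rough illustration, the option can be passed on the command line with -c. The sketch below builds and runs that command from Python; build_command and run_ocr are hypothetical helper names, and page.png in the usage note is a placeholder file name.

```python
import shutil
import subprocess


def build_command(image_path, preserve_spaces=True):
    """Build a tesseract command line that writes recognized text to stdout."""
    cmd = ["tesseract", str(image_path), "stdout"]
    if preserve_spaces:
        # -c sets a config variable; 1 keeps the runs of spaces Tesseract detected
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd


def run_ocr(image_path, preserve_spaces=True):
    """Run tesseract and return its text output (requires tesseract on PATH)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_command(image_path, preserve_spaces),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Comparing run_ocr("page.png") with run_ocr("page.png", preserve_spaces=False) on the same image is a quick way to see how much of the original spacing survives.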