How to Enable Hocr Font Info in Tesseract 4

How do I train tesseract 4 with image data instead of a font file?

Clone the tesstrain repo at https://github.com/tesseract-ocr/tesstrain.

You’ll also need to clone the tessdata_best repo, https://github.com/tesseract-ocr/tessdata_best. This acts as the starting point for your training. Training from scratch takes hundreds of thousands of samples to reach good accuracy, so starting from an existing model lets you fine-tune with far less data (tens to hundreds of samples can be enough).

Add your training samples to the directory in the tesstrain repo named ./tesstrain/data/my-custom-model-ground-truth

Your training samples should be image/text file pairs that share the same name but different extensions. For example, you should have an image file named 001.png that is a picture of the text foobar and you should have a text file named 001.gt.txt that has the text foobar.

Each sample must contain a single line of text, both in the image and in the text file.
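
Before starting a training run, it can help to check that every image has a matching single-line transcription. The following is a minimal Python sketch and is not part of tesstrain; the function name check_ground_truth and the set of image extensions are assumptions made for illustration.

```python
from pathlib import Path

# Image extensions to look for; an assumption for this sketch, adjust as needed.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff"}


def check_ground_truth(directory):
    """Return a list of problems found in a ground-truth directory.

    Each image NAME.png is expected to have a NAME.gt.txt next to it
    containing exactly one line of text.
    """
    problems = []
    for image in sorted(Path(directory).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        transcript = image.with_name(image.stem + ".gt.txt")
        if not transcript.exists():
            problems.append(f"{image.name}: missing {transcript.name}")
            continue
        lines = transcript.read_text(encoding="utf-8").strip().splitlines()
        if len(lines) != 1:
            problems.append(
                f"{transcript.name}: expected 1 line, found {len(lines)}"
            )
    return problems
```

Calling check_ground_truth("data/my-custom-model-ground-truth") from the tesstrain repo returns an empty list when every pair is complete.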

In the tesstrain repo, run this command:

make training MODEL_NAME=my-custom-model START_MODEL=eng TESSDATA=~/src/tessdata_best

Once the training is complete, there will be a new file tesstrain/data/.traineddata. Copy that file to the directory Tesseract searches for models. On my machine, it was /usr/local/share/tessdata/.

Then, you can run tesseract and use that model as a language.

tesseract -l my-custom-model foo.png -

Does Tesseract's hOCR output really contain bounding boxes and confidence levels for each character?

You've seen it: it isn't there.

So you can either modify the Tesseract source code to emit hOCR with the x_confs property you want, or use its ResultIterator API class to get the confidence at the character (symbol) level. If you use the API, be sure to call SetVariable("save_blob_choices", "T") after the Init method.

Tesseract hOCR iOS

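If you proceed as follows, you get the hOCR output as an NSString.

- (NSString *)getHOCRText {
    // GetHOCRText returns a heap-allocated C string that the caller must free.
    char *boxtext = _tesseract->GetHOCRText(0);
    NSString *hocr = [NSString stringWithUTF8String:boxtext];
    delete[] boxtext;
    return hocr;
}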

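Later you can convert this NSString to NSData. Use UTF-8 rather than ASCII, because the ASCII conversion fails when the recognized text contains non-ASCII characters.

    NSData *xmlData = [xmlString dataUsingEncoding:NSUTF8StringEncoding];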

You can then parse this data with NSXMLParser.

    NSXMLParser *xmlParser = [[NSXMLParser alloc] initWithData:xmlData];

From there, the rest is ordinary NSXMLParser delegate handling.
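
Whatever parser you use, the useful data sits in each element's class and title attributes. As a language-neutral illustration, here is a small Python sketch that pulls the word text, bounding box, and word confidence out of an hOCR fragment. The sample fragment and the function name extract_words are made up for this example, and it assumes the hOCR is well-formed XML; the same attribute handling would go in your NSXMLParser delegate.

```python
import re
import xml.etree.ElementTree as ET

# Tiny hand-written hOCR fragment used only for illustration.
SAMPLE_HOCR = """<div class='ocr_page' title='bbox 0 0 200 50'>
  <span class='ocr_line' title='bbox 10 10 190 40'>
    <span class='ocrx_word' title='bbox 10 10 90 40; x_wconf 95'>foo</span>
    <span class='ocrx_word' title='bbox 100 10 190 40; x_wconf 87'>bar</span>
  </span>
</div>"""


def extract_words(hocr):
    """Return (text, bbox, confidence) for every ocrx_word element."""
    words = []
    for element in ET.fromstring(hocr).iter():
        # Attributes are not namespaced, so this also works on XHTML input.
        if element.get("class") != "ocrx_word":
            continue
        title = element.get("title", "")
        bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
        conf = re.search(r"x_wconf (\d+)", title)
        words.append((
            "".join(element.itertext()).strip(),
            tuple(int(v) for v in bbox.groups()) if bbox else None,
            int(conf.group(1)) if conf else None,
        ))
    return words
```

Here extract_words(SAMPLE_HOCR) yields one tuple per word, with the bounding box as (left, top, right, bottom) in pixels.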

How to preserve document structure in tesseract

Newer versions of Tesseract (3.04 and later) have an option called preserve_interword_spaces, which should do what you want.

Note that the number of spaces Tesseract detects between words may not always be the same between similar lines. So words that are left-aligned with a run of spaces preceding them (as in your example) may not be output that way. The preserve_interword_spaces option does not attempt anything fancy; it merely preserves the spaces that were found during extraction. By default, Tesseract collapses runs of spaces into one.

Details on this option are here.
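
As a rough sketch of how the option can be switched on from a script, the Python below builds a tesseract command line that sets preserve_interword_spaces through the -c NAME=VALUE mechanism and writes the text to stdout. The helper names build_command and recognize are hypothetical, and the sketch assumes the tesseract binary is on your PATH.

```python
import subprocess


def build_command(image_path, preserve_spaces=True):
    """Build a tesseract command line that writes recognized text to stdout."""
    command = ["tesseract", image_path, "-"]
    if preserve_spaces:
        # -c NAME=VALUE sets a Tesseract config variable for this run.
        command += ["-c", "preserve_interword_spaces=1"]
    return command


def recognize(image_path, preserve_spaces=True):
    """Run tesseract (assumed to be on PATH) and return its text output."""
    result = subprocess.run(
        build_command(image_path, preserve_spaces),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Comparing recognize("foo.png") with recognize("foo.png", preserve_spaces=False) on the same image is a quick way to see how much spacing the option actually retains.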
