Events, Programs


Over the past few days, I've become interested in character recognition. After some searching, the best open-source program I found is Tesseract, which scored in the top 3 OCR engines in 1995.

The Tesseract homepage has downloads for the program itself, along with pre-made data files for many different languages.

When the text was clear, Tesseract recognized almost 100% of it. However, when the picture was taken with an iPad/iTouch, the recognized text was unreadable.

Training Tesseract is much harder than just downloading the pre-made language files, so here’s what I learned.

To begin with, decide which font you want to train, then create a picture (tiff) of text in that font to start the font creation process.

Rename your picture to <Language>.<FontName>.exp<PictureNumber>.tiff

For example: redd.NewLanguage.exp0.tiff
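The naming pattern is easy to get wrong by hand, so here is a minimal Python sketch that builds and checks such names. The function names are my own for illustration and are not part of Tesseract.

```python
import re

# Pattern for <Language>.<FontName>.exp<PictureNumber>.tiff (also accepts .tif).
_NAME_RE = re.compile(r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.tiff?$")


def training_image_name(lang: str, font: str, number: int = 0) -> str:
    """Build a training image file name from its three parts."""
    return f"{lang}.{font}.exp{number}.tiff"


def parse_training_image_name(name: str) -> tuple:
    """Split a training image file name back into (lang, font, number)."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid training image name: {name!r}")
    return match["lang"], match["font"], int(match["num"])


print(training_image_name("redd", "NewLanguage", 0))
```

Running the name through the parser before starting the training steps catches a missing dot or a misplaced "exp" early.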

Create the .box file (goes with the .tiff file)

"C:\Program Files (x86)\Tesseract-OCR\tesseract" redd.NewLanguage.exp0.tiff redd.NewLanguage.exp0 batch.nochop makebox
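The generated .box file usually needs checking by hand before you go on. As far as I know, in Tesseract 3 each line of it has the form <symbol> <left> <bottom> <right> <top> <page>, with coordinates measured from the bottom-left corner of the image. Assuming that layout, a small Python sketch for reading the lines could look like this (the Box type and parse_box_line are hypothetical helper names):

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one line of a box file, assuming the six-field Tesseract 3 layout."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    left, bottom, right, top, page = (int(part) for part in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def read_box_file(path: str) -> list:
    """Read every non-empty line of a box file into Box records."""
    with open(path, encoding="utf-8") as handle:
        return [parse_box_line(line) for line in handle if line.strip()]
```

With the boxes loaded, you can for example list every symbol Tesseract guessed and compare it against the text you actually put in the picture.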

Create the .tr and .txt files

"C:\Program Files (x86)\Tesseract-OCR\tesseract" redd.NewLanguage.exp0.tiff redd.NewLanguage.exp0 nobatch box.train

Bootstrapping a new character set (I have no idea whether this step is needed)

"C:\Program Files (x86)\Tesseract-OCR\tesseract" redd.NewLanguage.exp0.tiff redd.NewLanguage.exp0 -l NewLanguage batch.nochop makebox

Create the unicharset file (adding more box files gives better accuracy)

"C:\Program Files (x86)\Tesseract-OCR\unicharset_extractor" …

Create the font_properties file (needed for Tesseract 3)

NewLanguage 0 0 0 1 0

Each line has the format <fontname> <italic> <bold> <fixed> <serif> <fraktur>, where every flag is 0 or 1.
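To avoid mixing up the order of the flags, the line can be generated instead of typed. This is only a sketch in Python, and font_properties_line is a name I made up for it:

```python
def font_properties_line(fontname, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False):
    """Format one font_properties entry: <fontname> followed by five 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname] + [str(int(flag)) for flag in flags])


def write_font_properties(path, lines):
    """Write the entries to a font_properties file, one per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for line in lines:
            handle.write(line + "\n")


# The entry from this post: a serif font that is not italic, bold, fixed or fraktur.
print(font_properties_line("NewLanguage", serif=True))
```

The font name in the first column has to match the <FontName> part of the training image name.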

Create the shapetable file (pass in the .tr file from the box.train step)

"C:\Program Files (x86)\Tesseract-OCR\shapeclustering" -F font_properties -U unicharset redd.NewLanguage.exp0.tr

Create the inttemp and pffmtable files (again from the .tr file)

"C:\Program Files (x86)\Tesseract-OCR\mftraining" -F font_properties -U unicharset -O redd.NewLanguage redd.NewLanguage.exp0.tr

Create the normproto file (again from the .tr file)

"C:\Program Files (x86)\Tesseract-OCR\cntraining" redd.NewLanguage.exp0.tr

Add dictionaries if wanted.

Rename all the generated files so they begin with your language prefix followed by a dot (e.g. normproto => <language>.normproto)
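The renaming can be scripted. The sketch below is Python and assumes the generated files are the ones named in the steps above (unicharset, shapetable, inttemp, pffmtable, normproto) and that they sit in one folder; adjust the list if your run produced different files. The function name is hypothetical.

```python
from pathlib import Path

# Assumption: these are the files produced by the training steps above.
TRAINING_OUTPUTS = ("unicharset", "shapetable", "inttemp", "pffmtable", "normproto")


def add_language_prefix(directory, lang, names=TRAINING_OUTPUTS):
    """Rename each existing file <name> in directory to <lang>.<name>."""
    renamed = []
    for name in names:
        source = Path(directory) / name
        if source.exists():
            target = source.with_name(f"{lang}.{name}")
            source.rename(target)
            renamed.append(target.name)
    return renamed
```

Calling add_language_prefix(".", "lang") in the training folder returns the list of new names, so it is easy to see if a file you expected was missing.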

Combine them into the trained data file (here the <language> is lang; remember the trailing dot)

"C:\Program Files (x86)\Tesseract-OCR\combine_tessdata" lang.

Copy the resulting lang.traineddata file to the tesseract/tessdata folder
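Typing the commands one by one gets tedious when you retrain often, so here is a rough Python sketch that only builds the command lines for the steps above. It assumes the install path used in this post and assumes the training tools take the .tr file from the box.train step as their last argument. The unicharset_extractor step is left out because its arguments are not given above, and the font_properties and renaming steps still have to be done in between, so treat this as a checklist rather than a finished script.

```python
from pathlib import PureWindowsPath

# Assumed install location, taken from the commands in this post.
TESSERACT_DIR = PureWindowsPath(r"C:\Program Files (x86)\Tesseract-OCR")


def tool(name):
    """Return the full path of one of the Tesseract programs."""
    return str(TESSERACT_DIR / name)


def training_commands(lang, font, number=0):
    """Build the argument lists for the training steps, in order."""
    base = f"{lang}.{font}.exp{number}"
    image = base + ".tiff"
    tr_file = base + ".tr"  # written by the box.train step
    return [
        [tool("tesseract"), image, base, "batch.nochop", "makebox"],
        [tool("tesseract"), image, base, "nobatch", "box.train"],
        # unicharset_extractor and font_properties go here (not included).
        [tool("shapeclustering"), "-F", "font_properties", "-U", "unicharset", tr_file],
        [tool("mftraining"), "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.{font}", tr_file],
        [tool("cntraining"), tr_file],
        # Rename the generated files before this last step.
        [tool("combine_tessdata"), f"{lang}."],
    ]


for command in training_commands("redd", "NewLanguage"):
    print(" ".join(command))
```

Each list can be passed to subprocess.run(command, check=True) once you are happy with it. Keeping the arguments as lists avoids quoting problems with the space in "Program Files".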

To recognize the text in a picture (like Education.png), which writes the result to output.txt:

"C:\Program Files (x86)\Tesseract-OCR\tesseract" Education.png output -l lang
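If you want to call this from a program, a minimal Python sketch could look like the following. It assumes the same install path as above and relies on Tesseract writing its result to <output>.txt; the function names are my own.

```python
import subprocess
from pathlib import Path

# Assumed install location, taken from the command above.
TESSERACT = r"C:\Program Files (x86)\Tesseract-OCR\tesseract"


def recognize_command(image, lang, output_base="output"):
    """Build the argument list for recognizing one picture."""
    return [TESSERACT, image, output_base, "-l", lang]


def recognize(image, lang, output_base="output"):
    """Run Tesseract on the picture and return the recognized text."""
    subprocess.run(recognize_command(image, lang, output_base), check=True)
    # Tesseract appends .txt to the output base name.
    return Path(output_base + ".txt").read_text(encoding="utf-8")
```

For the example in this post the call would be recognize("Education.png", "lang").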
