
Tesseract

February 3, 2013 (updated July 12, 2021) by Kevin

Over the past few days, I’ve become interested in character recognition. After some searching, the best open-source program I found is Tesseract, which placed among the top 3 OCR engines in the 1995 UNLV accuracy test.

The Tesseract homepage has downloads for the program, along with ready-made data files for many different languages.

When the text was clear, Tesseract recognized almost 100% of it. However, when the picture was taken with an iPad/iTouch, it could not read the text.

Training Tesseract is much harder than just downloading the pre-made language files, so here’s what I learned.

To begin with, pick the font you want to train, then create a picture (tiff) of text in that font to start the font creation process.

Rename your picture to <Language>.<FontName>.exp<PictureNumber>.tiff

For example: redd.NewLanguage.exp0.tiff
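The naming rule is easy to get wrong by hand, so here is a minimal Python sketch that builds the file name from its three parts. The function name `training_image_name` is made up for illustration and is not part of Tesseract.

```python
def training_image_name(language: str, font_name: str, picture_number: int) -> str:
    """Build <Language>.<FontName>.exp<PictureNumber>.tiff (hypothetical helper)."""
    if picture_number < 0:
        raise ValueError("picture number must be 0 or greater")
    for part in (language, font_name):
        # A dot or a space inside a part would make the name ambiguous.
        if not part or "." in part or " " in part:
            raise ValueError(f"invalid name part: {part!r}")
    return f"{language}.{font_name}.exp{picture_number}.tiff"


print(training_image_name("redd", "NewLanguage", 0))  # redd.NewLanguage.exp0.tiff
```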

Create the .box file (goes with the .tiff file)

"C:\Program Files (x86)\Tesseract-OCR\tesseract" redd.NewLanguage.exp0.tiff redd.NewLanguage.exp0 batch.nochop makebox
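The .box file is plain text, and it usually needs checking by hand before training. In Tesseract 3 each line holds one character followed by its left, bottom, right and top pixel coordinates and a page number, measured from the bottom-left corner of the picture. Assuming that layout, this is a rough Python sketch for reading the boxes; `BoxEntry` and `parse_box_lines` are illustrative names, not Tesseract tools.

```python
from typing import List, NamedTuple


class BoxEntry(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines: List[str]) -> List[BoxEntry]:
    """Turn lines like 'T 10 20 30 60 0' into BoxEntry records."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, *numbers = parts
        left, bottom, right, top, page = (int(n) for n in numbers)
        entries.append(BoxEntry(char, left, bottom, right, top, page))
    return entries


# Made-up sample data, only to show the shape of the file.
sample = ["T 10 20 30 60 0", "e 32 20 48 45 0", ""]
for box in parse_box_lines(sample):
    print(box.char, "width:", box.right - box.left, "height:", box.top - box.bottom)
```

For a real file, pass in `open("redd.NewLanguage.exp0.box", encoding="utf-8").read().splitlines()`.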

Create the .tr and .txt files

"C:\Program Files (x86)\Tesseract-OCR\tesseract" redd.NewLanguage.exp0.tiff redd.NewLanguage.exp0 nobatch box.train

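Bootstrapping a new character set (optional, and I’m not sure about this step): once a first version of NewLanguage has been trained and installed, it can be passed with -l to make the box files for further pictures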

"C:\Program Files (x86)\Tesseract-OCR\tesseract" redd.NewLanguage.exp0.tiff redd.NewLanguage.exp0 -l NewLanguage batch.nochop makebox

Create the unicharset file (add more .box files for better accuracy)

"C:\Program Files (x86)\Tesseract-OCR\unicharset_extractor" redd.NewLanguage.exp0.box redd.NewLanguage.exp1.box …

Create the font_properties file (needed for Tesseract 3)

NewLanguage 0 0 0 1 0
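Each line of font_properties describes one font. In the Tesseract 3 training docs the layout is `<fontname> <italic> <bold> <fixed> <serif> <fraktur>`, with each flag being 0 or 1, so the line above marks NewLanguage as a serif font and nothing else. A small Python sketch that builds such a line (the helper name `font_properties_line` is hypothetical):

```python
def font_properties_line(font_name: str, italic: bool = False, bold: bool = False,
                         fixed: bool = False, serif: bool = False,
                         fraktur: bool = False) -> str:
    """Format one font_properties entry: name followed by five 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font_name] + [str(int(flag)) for flag in flags])


print(font_properties_line("NewLanguage", serif=True))  # NewLanguage 0 0 0 1 0
```

The font name has to match the <FontName> part of the .tiff file name.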

Create the shapetable file

"C:\Program Files (x86)\Tesseract-OCR\shapeclustering" -F font_properties -U unicharset redd.NewLanguage.exp0.tr

Create the inttemp and pffmtable files

"C:\Program Files (x86)\Tesseract-OCR\mftraining" -F font_properties -U unicharset -O redd.NewLanguage redd.NewLanguage.exp0.tr

Create the normproto file

"C:\Program Files (x86)\Tesseract-OCR\cntraining" redd.NewLanguage.exp0.tr

Add dictionaries if wanted.

Rename all the generated files so they begin with your language as a prefix (e.g. normproto => <language>.normproto)
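Renaming by hand gets tedious after a few training runs. Here is a minimal Python sketch of the renaming step, assuming the files to rename are the ones created in the steps above (unicharset, shapetable, inttemp, pffmtable, normproto). The helper `add_language_prefix` is made up for illustration.

```python
from pathlib import Path
from typing import List

# File names produced by the earlier training steps in this post.
TRAINING_FILES = ["unicharset", "shapetable", "inttemp", "pffmtable", "normproto"]


def add_language_prefix(folder: str, language: str) -> List[Path]:
    """Rename e.g. normproto to <language>.normproto inside folder."""
    renamed = []
    for name in TRAINING_FILES:
        source = Path(folder) / name
        if not source.exists():
            continue  # that step was skipped or already renamed
        target = source.with_name(f"{language}.{name}")
        # Note: on Windows, rename() fails if the target already exists.
        source.rename(target)
        renamed.append(target)
    return renamed
```

Calling `add_language_prefix(".", "lang")` in the training folder leaves files such as lang.normproto and lang.inttemp behind.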

Combine them to create the trained data file (here the <language> is lang; remember the dot at the end)

"C:\Program Files (x86)\Tesseract-OCR\combine_tessdata" lang.

Copy the resulting lang.traineddata file to the tesseract/tessdata folder
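To keep the order of the training commands straight, they can be generated from the picture name instead of typed one by one. This is only a sketch under a few assumptions: the install folder matches the one used in this post, there is a single picture per font, and the function name `training_commands` is my own. It prints the commands and does not run them, because the font_properties file has to exist before shapeclustering, and the renaming and combine_tessdata steps still come afterwards.

```python
import subprocess
from pathlib import PureWindowsPath
from typing import List


def training_commands(install_dir: str, language_font: str,
                      picture_number: int = 0) -> List[List[str]]:
    """List the training commands from this post as argument lists, in order."""
    tools = PureWindowsPath(install_dir)
    base = f"{language_font}.exp{picture_number}"
    tesseract = str(tools / "tesseract")
    tr_file = f"{base}.tr"
    return [
        [tesseract, f"{base}.tiff", base, "batch.nochop", "makebox"],
        [tesseract, f"{base}.tiff", base, "nobatch", "box.train"],
        [str(tools / "unicharset_extractor"), f"{base}.box"],
        [str(tools / "shapeclustering"), "-F", "font_properties",
         "-U", "unicharset", tr_file],
        [str(tools / "mftraining"), "-F", "font_properties",
         "-U", "unicharset", "-O", language_font, tr_file],
        [str(tools / "cntraining"), tr_file],
    ]


commands = training_commands(r"C:\Program Files (x86)\Tesseract-OCR", "redd.NewLanguage")
for command in commands:
    # list2cmdline adds the quotes Windows needs around paths with spaces.
    print(subprocess.list2cmdline(command))
```

Passing each list to `subprocess.run(command, check=True)` would run the step and stop on the first failure, which avoids the quoting problems of pasting paths with spaces.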

To decode a picture (like Education.png):

"C:\Program Files (x86)\Tesseract-OCR\tesseract" Education.png output -l lang
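Tesseract writes the recognized text to output.txt, meaning the output name given on the command line plus .txt. The same call can be made from a script. This is a minimal Python sketch, assuming the install path used above; `build_ocr_command` and `read_text` are hypothetical helper names.

```python
import subprocess
from pathlib import Path
from typing import List

# Assumed install location, taken from the commands in this post.
TESSERACT = r"C:\Program Files (x86)\Tesseract-OCR\tesseract"


def build_ocr_command(image: str, output_base: str, language: str) -> List[str]:
    """Argument list for: tesseract <image> <output_base> -l <language>."""
    return [TESSERACT, image, output_base, "-l", language]


def read_text(image: str, output_base: str = "output", language: str = "lang") -> str:
    """Run Tesseract, then return the contents of <output_base>.txt."""
    subprocess.run(build_ocr_command(image, output_base, language), check=True)
    return Path(output_base + ".txt").read_text(encoding="utf-8")
```

With Tesseract installed and the trained data copied into tessdata, `read_text("Education.png")` should return the decoded text as a string.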
