Over the past few days, I’ve become interested in character recognition. After some searching, the best open-source program I found is Tesseract, which placed among the top 3 OCR engines in 1995.
The Tesseract homepage has downloads for the program, and it comes with data files for many different languages.
When the text was clear, Tesseract recognized almost 100% of it. However, when the picture was taken with an iPad/iTouch, Tesseract could not read the text.
Training Tesseract is much harder than just downloading the pre-made language files, so here’s what I learned.
To begin with, pick the font you want to train, then create a picture (tiff) of text in that font to begin the training process.
Rename your picture to <Language>.<FontName>.exp<PictureNumber>.tiff
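As a small illustration of the naming pattern above, here is a minimal Python sketch that builds the file name from its three parts. The function name training_image_name and the validation rules are my own assumptions, not part of Tesseract.

```python
def training_image_name(language: str, font_name: str, picture_number: int) -> str:
    """Build a name of the form <Language>.<FontName>.exp<PictureNumber>.tiff."""
    for part in (language, font_name):
        # Assumption: a dot or space inside a part would make the name ambiguous.
        if not part or "." in part or " " in part:
            raise ValueError(f"invalid name part: {part!r}")
    if picture_number < 0:
        raise ValueError("picture number must not be negative")
    return f"{language}.{font_name}.exp{picture_number}.tiff"


print(training_image_name("redd", "NewLanguage", 0))
```

With the values used in the rest of this post, this gives redd.NewLanguage.exp0.tiff.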
Create the .box file (goes with the .tiff file)
"C:\Program Files (x86)\Tesseract-OCR\tesseract" redd.NewLanguage.exp0.tiff redd.NewLanguage.exp0 batch.nochop makebox
Create the .tr and .txt files
"C:\Program Files (x86)\Tesseract-OCR\tesseract" redd.NewLanguage.exp0.tiff redd.NewLanguage.exp0 nobatch box.train
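If you have more than one picture, typing the two commands above for each one gets tedious. Here is a rough Python sketch that builds both commands for a given picture. The helper name box_and_train_commands is made up for this post, and the install path is the one from my machine, so adjust it to yours.

```python
from pathlib import Path
from typing import List

# Install path as used in this post; change it if Tesseract lives elsewhere.
TESSERACT = r"C:\Program Files (x86)\Tesseract-OCR\tesseract"


def box_and_train_commands(image: str) -> List[List[str]]:
    """Return the makebox command and the box.train command for one picture."""
    # Drop only the last extension: redd.NewLanguage.exp0.tiff -> redd.NewLanguage.exp0
    base = str(Path(image).with_suffix(""))
    return [
        [TESSERACT, image, base, "batch.nochop", "makebox"],
        [TESSERACT, image, base, "nobatch", "box.train"],
    ]


for command in box_and_train_commands("redd.NewLanguage.exp0.tiff"):
    print(command)
```

Each list can be handed to subprocess.run(command, check=True). Passing the arguments as a list means the space in "Program Files" needs no extra quoting.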
**NO IDEA** Bootstrapping new character set
"C:\Program Files (x86)\Tesseract-OCR\tesseract" redd.NewLanguage.exp0.tiff redd.NewLanguage.exp0 -l NewLanguage batch.nochop makebox
Create the unicharset file (pass more .box files for better accuracy)
"C:\Program Files (x86)\Tesseract-OCR\unicharset_extractor" redd.NewLanguage.exp0.box redd.NewLanguage.exp1.box …
Create font_properties file (Needed for Tesseract 3)
Each line has the form <fontname> <italic> <bold> <fixed> <serif> <fraktur>, where each flag is 0 or 1. For example:
NewLanguage 0 0 0 1 0
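To avoid mixing up the order of the flags, here is a minimal Python sketch that formats one font_properties line. The function name font_properties_line and the whitespace check are my own assumptions; only the field order comes from the format above.

```python
def font_properties_line(font_name: str, italic: bool = False, bold: bool = False,
                         fixed: bool = False, serif: bool = False,
                         fraktur: bool = False) -> str:
    """Format one line as: <fontname> <italic> <bold> <fixed> <serif> <fraktur>."""
    # Fields are separated by spaces, so a name containing whitespace would be ambiguous.
    if not font_name or any(ch.isspace() for ch in font_name):
        raise ValueError("font name must be non-empty and contain no whitespace")
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font_name] + [str(int(bool(flag))) for flag in flags])


print(font_properties_line("NewLanguage", serif=True))
```

For a serif font called NewLanguage this produces the same line as the example, NewLanguage 0 0 0 1 0, which you can then write to the font_properties file.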
Create shapetable file
"C:\Program Files (x86)\Tesseract-OCR\shapeclustering" -F font_properties -U unicharset redd.NewLanguage.exp0.tr
Create inttemp and pffmtable files
"C:\Program Files (x86)\Tesseract-OCR\mftraining" -F font_properties -U unicharset -O redd.NewLanguage redd.NewLanguage.exp0.tr
Create normproto file
"C:\Program Files (x86)\Tesseract-OCR\cntraining" redd.NewLanguage.exp0.tr
Add dictionaries if wanted.
Rename all the files to begin with the prefix of your language (e.g. normproto => <language>.normproto)
Combine them to create the trained data file. Pass the language prefix followed by a dot (don't forget the dot); if the <language> is lang:
"C:\Program Files (x86)\Tesseract-OCR\combine_tessdata" lang.
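The renaming can be scripted too. Below is a rough Python sketch, assuming the files to rename are the five created in the steps above (unicharset, shapetable, inttemp, pffmtable, normproto) and that they sit in one folder. The names prefixed_names and rename_with_prefix are hypothetical helpers written for this post.

```python
from pathlib import Path
from typing import Dict, Iterable

# Output files named in the steps above; extend the tuple if you added dictionaries.
TRAINING_OUTPUTS = ("unicharset", "shapetable", "inttemp", "pffmtable", "normproto")


def prefixed_names(language: str, names: Iterable[str] = TRAINING_OUTPUTS) -> Dict[str, str]:
    """Map each file name to its prefixed form, e.g. normproto -> <language>.normproto."""
    return {name: f"{language}.{name}" for name in names}


def rename_with_prefix(folder: str, language: str) -> None:
    """Rename the training outputs found in folder, skipping any that are missing."""
    for old, new in prefixed_names(language).items():
        source = Path(folder) / old
        if source.exists():
            source.rename(Path(folder) / new)


print(prefixed_names("lang"))
```

Calling rename_with_prefix(".", "lang") in the training folder would do the actual renaming. Printing the mapping first is a cheap way to check it before touching any files.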
Copy the resulting <language>.traineddata file to the tesseract/tessdata folder
To decode a picture (like Education.png):
"C:\Program Files (x86)\Tesseract-OCR\tesseract" Education.png output -l lang
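The same decoding step can be run from a script. This is only a sketch, assuming the install path used throughout this post and that Tesseract writes its result to <output>.txt. The helper names ocr_command and recognize are my own.

```python
import subprocess
from pathlib import Path
from typing import List

# Install path as used in this post; change it if Tesseract lives elsewhere.
TESSERACT = r"C:\Program Files (x86)\Tesseract-OCR\tesseract"


def ocr_command(image: str, output_base: str, language: str) -> List[str]:
    """Build the decoding command: tesseract <image> <output> -l <language>."""
    return [TESSERACT, image, output_base, "-l", language]


def recognize(image: str, output_base: str = "output", language: str = "lang") -> str:
    """Run Tesseract on the picture and return the recognized text."""
    subprocess.run(ocr_command(image, output_base, language), check=True)
    # Tesseract appends .txt to the output name it is given.
    return Path(output_base + ".txt").read_text(encoding="utf-8")


print(ocr_command("Education.png", "output", "lang"))
```

Calling recognize("Education.png") would then return the text from output.txt, provided Tesseract and the lang trained data are installed.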