Parsing Shorthand Dictionary with RMagick and RTesseract [Ruby]


Ruby wrappers used: RMagick for image manipulation and RTesseract for image recognition (OCR).

You can Google how to install those gems.

Here is an example page of the dictionary. The shorthand translation sits right next to each English word, just to its left.

Example Gregg Dictionary Page

    require 'RMagick'
    require 'RTesseract'

    # file_name (path to a scanned page) and words (a Hash of known
    # English words) are assumed to be defined elsewhere.
    im = Magick::Image.read(file_name) {
      self.quality = 100
      self.density = 300
    } #1
    img = im.first

    allWords = RTesseract::Box.new(img).words #2

    for word in allWords
      if word[:word] == "GREGG" or word[:word] == "SHORTHAND" or word[:word] == "DICTIONARY"
        next
      end #3
      if word[:word].length < 2
        next
      end
      if words[word[:word]] #4
        begin #5
          img.crop(
            Magick::NorthWestGravity,
            word[:x_start] - 130,
            word[:y_start] - 30,
            120,
            85
          ).write("image/" + word[:word] + ".jpg")
        rescue
          puts "Error"
        end
      end
    end

Here is what each numbered comment (#1 to #5) does.

  1. Loads the image.
  2. Recognizes all the words on the page.
  3. Skips unimportant words (the page header) that aren't actual dictionary entries.
  4. Checks each word against a dictionary. This makes sure RTesseract didn't mistake a shorthand word for an English word.
  5. Grabs the pixels to the left of the detected English word and saves them as a JPEG image.
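The filtering in steps 3 to 5 can also be sketched without any image library, which makes it easier to try out. This is only an illustration: `dictionary_entries`, `crop_rect`, and `known_words` are hypothetical names, the box hashes mimic the `:word`, `:x_start`, and `:y_start` keys used above, and clamping the crop origin to the page is an added assumption that the original script does not make.

```ruby
# Header words printed on every page; they are not dictionary entries.
HEADER_WORDS = %w[GREGG SHORTHAND DICTIONARY].freeze

# Keep only boxes whose text looks like a real dictionary entry.
# `boxes` is an array of hashes shaped like { word:, x_start:, y_start: }.
# `known_words` is a hypothetical Hash (or Set) of English words.
def dictionary_entries(boxes, known_words)
  boxes.reject do |box|
    text = box[:word]
    HEADER_WORDS.include?(text) || text.length < 2 || !known_words.include?(text)
  end
end

# Crop rectangle to the left of a word, using the offsets from the script
# above (130 px left, 30 px up, 120x85 px). Clamping to 0 is an assumption
# added here so the rectangle never starts off the page.
def crop_rect(box)
  { x: [box[:x_start] - 130, 0].max,
    y: [box[:y_start] - 30, 0].max,
    width: 120,
    height: 85 }
end
```

Each rectangle could then be passed to `img.crop(Magick::NorthWestGravity, rect[:x], rect[:y], rect[:width], rect[:height])`, the same call the script above uses.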

The problem I encountered then was that some images had overlapping shorthand words.

abandon (gregg

Like the word "abandon."

One naive solution I used was essentially to trim the tops and bottoms of each image. If there was a completely white row of pixels between the middle of the image and its top or bottom edge, I'd fill everything beyond that row with white. Some images (like the word "abandon") worked perfectly, but other images still had problems.
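A minimal sketch of that trimming idea, as I read the description, is below. It works on a plain array of rows where 0 stands for a white pixel and 1 for ink, so it runs without RMagick; `trim_neighbours` and `blank_row?` are hypothetical names, and reading real pixel data into this shape is left out.

```ruby
WHITE = 0

# True when every pixel in the row is white.
def blank_row?(row)
  row.all? { |px| px == WHITE }
end

# Returns a copy of `pixels` in which everything above the nearest blank row
# over the middle, and everything below the nearest blank row under it, is
# painted white. Rows between those two blank rows are left unchanged.
def trim_neighbours(pixels)
  mid = pixels.length / 2
  top = (mid - 1).downto(0).find { |y| blank_row?(pixels[y]) }
  bottom = ((mid + 1)...pixels.length).find { |y| blank_row?(pixels[y]) }

  pixels.each_with_index.map do |row, y|
    outside = (top && y < top) || (bottom && y > bottom)
    outside ? Array.new(row.length, WHITE) : row.dup
  end
end
```

When a neighbouring outline actually touches or overlaps the target, there is no blank row to find and the image comes back unchanged, which matches the cases that still had problems.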

Try the site here.