Improving OCR Results: Adding Spellcheck

March 18, 2014 Kevin Hulse

With the new Tesseract 3.2 engine available as an add-on for Atalasoft DotImage, I have become more interested in the quality of OCR results. When I scour the internet for OCRed documents, I find that many of them contain words that are misspelled because of a misinterpreted character or an omitted letter. I thought spellcheck might solve this issue, but after some experimentation I believe it can make only minor improvements to the overall OCR results unless it is integrated in a very sophisticated way.

In DotImage, the OcrEngine object is set up to be very extensible, giving hooks into many major steps of the OCR process. Using DotImage and an open source .NET spell checking engine, I came up with two simple algorithms: “Missing Letter” and “Single Incorrect Letter.”

Missing Letter

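In several of the raw OCR results from my sample set, I noticed words that were completely missing a letter, usually with a visible gap between the neighboring glyphs. The spell check engine provided good guesses when a letter was missing, so I used the following function to decide whether a guess was a good one. A guess is accepted only if it matches the OCR text with a single letter inserted at the position of the gap: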

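        // Returns true when the guess equals the OCR text with exactly one
        // letter inserted at the gap found between glyphs.
        private bool CheckMissingLetter(OcrWord w, string guess)
        {
            // The guess must be exactly one character longer than the OCR text;
            // without this check the indexing below can run out of range.
            if (guess.Length != w.Text.Length + 1)
                return false;

            int gap = FindGap(w);
            for (int i = 0; i < guess.Length; i++)
            {
                // Skip the inserted letter; after it, the OCR text lags by one index.
                if (i != gap && w.Text[i + (i > gap ? -1 : 0)] != guess[i])
                    return false;
            }
            return true;
        }

        // Minimum horizontal distance between adjacent glyphs that counts as a
        // gap, in the units of OcrGlyph.Bounds.
        int GapBoundsThreshold = 10;

        // Returns the index at which a letter appears to be missing. When no gap
        // is found this returns 0, so the first letter is treated as the missing one.
        private int FindGap(OcrWord w)
        {
            if (w.Glyphs.Count == 0)
                return 0;

            OcrGlyph prev = w.Glyphs[0];
            for (int i = 1; i < w.Glyphs.Count; i++)
            {
                OcrGlyph g = w.Glyphs[i];
                if (g.Bounds.Left > prev.Bounds.Right + GapBoundsThreshold)
                    return i;
                prev = g;
            }
            return 0;
        }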
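
The comparison at the heart of this check can be tried without any OCR types. The following is a minimal sketch on plain strings, assuming the gap index is already known; the class and method names are hypothetical and are not part of DotImage or of the sample solution.

```csharp
using System;

static class MissingLetterSketch
{
    // Hypothetical helper: returns true when 'guess' is 'text' with exactly
    // one extra character inserted at index 'gap'.
    public static bool MatchesWithInsertedLetter(string text, string guess, int gap)
    {
        // An insertion makes the guess exactly one character longer.
        if (guess.Length != text.Length + 1) return false;

        // The gap may sit before any character or after the last one.
        if (gap < 0 || gap > text.Length) return false;

        // Removing the inserted character must give back the OCR text.
        return guess.Remove(gap, 1) == text;
    }
}
```

For example, `MissingLetterSketch.MatchesWithInsertedLetter("docment", "document", 3)` accepts the guess, while the same call with a gap index of 2 rejects it because the inserted letter would be in the wrong place.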

Single Incorrect Letter

In my mind, this was the most basic use of a spell check engine to automatically improve OCR results. The plan was to take misspelled words and see whether any guess changed only one letter, and only a letter that the OCR engine was not confident about. This can be done with a simple function:

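        // Glyphs recognized with a confidence above this value are never changed.
        double ConfidenceThreshold = .75;

        // Returns true when the guess differs from the OCR word in exactly one
        // low-confidence letter. Only letter glyphs are compared; a mismatch on
        // a digit or punctuation glyph is ignored.
        private bool CheckIncorrectLetter(OcrWord w, string guess)
        {
            // A substitution never changes the length of the word.
            if (guess.Length != w.Glyphs.Count)
                return false;

            bool foundMismatch = false;
            for (int i = 0; i < w.Glyphs.Count; i++)
            {
                OcrGlyph g = w.Glyphs[i];
                if (g.Char != guess[i] && Char.IsLetter(g.Char))
                {
                    if (g.Confidence > ConfidenceThreshold) return false; // escape if the engine is really sure
                    if (foundMismatch) return false; // a second incorrect letter, so we discard this guess
                    foundMismatch = true;
                }
            }
            return foundMismatch;
        }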
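
The same idea can be sketched on plain strings with a parallel array of per-character confidences. This is only an illustration: the names are hypothetical, the confidences are assumed to lie between 0 and 1, and unlike the function above it counts any differing character as a mismatch, not just letters. It returns the index of the substituted character so a caller could see which glyph would change.

```csharp
using System;

static class SingleLetterSketch
{
    // Hypothetical helper: returns the index of the single differing
    // character, or -1 when the guess is not an acceptable one-character
    // substitution.
    public static int FindSingleSubstitution(
        string text, double[] confidence, string guess, double threshold)
    {
        // A substitution keeps the length unchanged.
        if (text.Length != guess.Length || confidence.Length != text.Length)
            return -1;

        int found = -1;
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == guess[i]) continue;
            if (confidence[i] > threshold) return -1; // recognizer was confident here
            if (found >= 0) return -1;                // second difference, reject
            found = i;
        }
        return found; // stays -1 when the strings are identical
    }
}
```

With the text "hcllo", a low confidence on the second character, and the guess "hello", the sketch returns 1. Raising that character's confidence above the threshold makes it return -1 instead.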

Results

“Missing Letter” was very successful. I suspect this is due to a combination of the OCR engine behavior that produced these errors and the spellchecker being good at guessing a single missing letter. Almost all of the missing-letter errors were found and successfully fixed by this function. “Single Incorrect Letter,” however, was much trickier. Sometimes a single incorrect letter produced another correctly spelled word, so the spell check engine never flagged it. Other times several guesses satisfied the conditions, and the automatic check had no way to choose between them without human intervention or more complicated algorithms.

Improving Results Further

The easiest way to improve results further would be to hand the final decision to a human or to another OCR engine whenever the spell check engine produces several good guesses. Other algorithms could also be created, including one that handles multiple incorrect letters. In conclusion, a spellchecker is best used to detect errors in OCR results; correcting them successfully is difficult without human intervention.
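
One way to act on this is to apply a correction automatically only when exactly one suggestion passes the check, and to flag the word for review otherwise. The sketch below assumes the suggestions arrive as a sequence of strings and that the acceptance test is passed in as a delegate; the names are hypothetical and nothing here comes from DotImage or from a particular spell checking library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CandidateSelectionSketch
{
    // Hypothetical helper: returns the only suggestion that passes 'accept',
    // or null when none or more than one passes (a case for review).
    public static string PickUnambiguous(
        IEnumerable<string> suggestions, Func<string, bool> accept)
    {
        // Two distinct passing candidates are enough to know it is ambiguous.
        List<string> passing = suggestions.Where(accept).Distinct().Take(2).ToList();
        return passing.Count == 1 ? passing[0] : null;
    }
}
```

A caller would pass something like `guess => CheckIncorrectLetter(word, guess)` as the delegate and send the word to a reviewer, or to a second engine, whenever the result is null.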

You can download my sample solution here. If you do not own a DotImage license, download a free 30-day trial to get a license.

About the Author

Kevin Hulse

Kevin is the Associate Solutions Enablement Specialist (a Technical Marketing position) at Atalasoft. He previously worked in both engineering and support at Atalasoft. He also runs the company-sponsored softball team and is an avid game player.
