Document Recognition: OCR


Overview

Although OCR (Optical Character Recognition) is widely regarded as a solved problem, this holds only for clean documents scanned at very high resolution. OCR performance degrades significantly when even small amounts of noise are present in the document image. We aim to overcome this limitation by incorporating modern statistical language modelling techniques into OCR, producing a more robust system that is resistant to high levels of noise in the document. In addition, we aim to do this without a large number of stored font models, relying instead on statistical properties of the English language.
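As a rough illustration of how a language model can compensate for noisy character recognition, the sketch below combines per-character recognizer scores with a word prior and picks the most probable word overall. It is a minimal toy example and not the system described in the papers listed under References: the lexicon `WORD_PROBS`, the function `decode_word`, and the recognizer scores in `noisy` are all hypothetical names and values chosen for illustration.

```python
import math

# Hypothetical toy unigram language model: word -> relative frequency.
WORD_PROBS = {"the": 0.6, "she": 0.2, "tho": 0.01, "toe": 0.19}


def decode_word(char_scores, word_probs=WORD_PROBS, floor=1e-6):
    """Return the lexicon word maximizing log P(word) + sum_i log P_ocr(char_i).

    char_scores holds one dict per character position, mapping a candidate
    character to the (hypothetical) probability the recognizer assigned to it.
    Characters the recognizer did not propose get a small floor probability.
    """
    best_word, best_score = None, -math.inf
    for word, prior in word_probs.items():
        if len(word) != len(char_scores):
            continue  # this toy model only compares words of matching length
        score = math.log(prior)
        for ch, scores in zip(word, char_scores):
            score += math.log(scores.get(ch, floor))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Assumed recognizer output for a noisy image of the word "the":
# the last character is slightly more likely to be read as "c" than "e".
noisy = [
    {"t": 0.9, "l": 0.1},
    {"h": 0.6, "b": 0.4},
    {"c": 0.55, "e": 0.45},
]

# Taking the top character at each position independently ignores context.
greedy = "".join(max(scores, key=scores.get) for scores in noisy)

# Adding the word prior lets the language model override the noisy last character.
corrected = decode_word(noisy)
```

With these assumed scores, the position-by-position reading is "thc", while the combined score selects "the". The same idea extends to richer language models; a fixed lexicon is used here only to keep the example short.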

Faculty


Graduate Students


References

  • Michael Wick, Michael G. Ross, and Erik Learned-Miller.
    Context-Sensitive Error Correction: Using Topic Models to Improve OCR.
    Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), 2007.
  • Gary C. Huang, Erik Learned-Miller, and Andrew McCallum.
    Cryptogram Decoding for OCR using Numerization Strings.
    Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), 2007.