Document Recognition: OCR
Overview
Although OCR (Optical Character Recognition) is widely regarded as a solved problem, this holds only for clean documents
scanned at very high resolution. OCR performance degrades significantly when even a small amount of noise is present in the document image.
We aim to overcome this limitation by incorporating modern statistical language modelling techniques into OCR, producing a system that is robust
to high levels of noise in the document. In addition, we aim to do this without a large number of stored font models, relying instead
on the statistical properties of the English language.
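As a rough illustration of how language statistics can help with noisy recognition, the sketch below rescores candidate readings of a word using a character-bigram model. It is not the method used in this project or in the papers listed below. The function names (train_bigram_model, rescore), the toy corpus, and the recognizer scores are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def train_bigram_model(corpus, alpha=1.0):
    """Build an add-alpha smoothed character-bigram model from a text corpus.

    Returns a function that maps a string to its log-probability.
    """
    text = " " + corpus.lower() + " "
    vocab_size = len(set(text)) + 1  # +1 reserves mass for unseen characters
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])

    def log_prob(s):
        s = " " + s.lower() + " "  # spaces act as word-boundary markers
        total = 0.0
        for a, b in zip(s, s[1:]):
            numerator = pair_counts[(a, b)] + alpha
            denominator = first_counts[a] + alpha * vocab_size
            total += math.log(numerator / denominator)
        return total

    return log_prob


def rescore(candidates, log_prob, lm_weight=1.0):
    """Pick the candidate with the best combined score.

    candidates: list of (string, recognizer_log_score) pairs, where the
    score is an assumed log-likelihood from some image-based recognizer.
    """
    best = max(candidates, key=lambda c: c[1] + lm_weight * log_prob(c[0]))
    return best[0]


if __name__ == "__main__":
    # Toy corpus standing in for a large sample of English text.
    corpus = "the quick brown fox jumps over the lazy dog the cat sat on the mat"
    log_prob = train_bigram_model(corpus)
    # Noise makes the recognizer slightly prefer the wrong reading "tbe".
    candidates = [("tbe", math.log(0.6)), ("the", math.log(0.4))]
    print(rescore(candidates, log_prob))
```

The design point is that the image evidence and the language evidence are combined additively in log space, so a reading that is slightly less likely visually can still win when it is far more plausible as English. A realistic system would train on a much larger corpus and would typically use richer models than character bigrams.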
Faculty
Graduate Students
References
- Michael Wick, Michael G. Ross, and Erik Learned-Miller.
Context-Sensitive Error Correction: Using Topic Models to Improve OCR.
Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), 2007.
- Gary C. Huang, Erik Learned-Miller, and Andrew McCallum.
Cryptogram Decoding for OCR using Numerization Strings.
Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), 2007.