GitHub - AI-Passionner/word-recognition-ocr

Deep-Learning-Based Word Recognition OCR

Background

This repo is directly inspired by the CRNN (CNN+LSTM) paper by Baoguang Shi et al., published in 2015. The neural network architecture introduced in that paper is arguably a foundation of modern OCR technology. Many text recognition repos have appeared on GitHub since then, and most of them, including this one, are adaptations of the CNN+LSTM architecture.

I have been using both conventional OCR (OpenText) and deep-learning-based OCR (Tesseract and AWS Textract) for quite a long time. I call the former conventional OCR simply to distinguish it from modern deep-learning-based OCR. In my experience, CNN-based OCR outperforms the conventional kind, with higher accuracy and less image pre-processing.

Think about the famous MNIST handwritten digit recognition problem. If you build a logistic regression (softmax) model, you will probably get an accuracy of around 93%. A feed-forward neural network boosts the accuracy to about 98%, and a convolutional neural network can easily push it above 99%.

Conventional OCR extracts characteristics from each isolated shape and then assigns a symbol. During feature extraction, the bitmap of each symbol is broken up into a set of characteristics, such as lines, strokes, curves, and loops. Rules are then applied to find the closest symbol. The attached image is an example of the detailed terminology available to describe the "geography" of a letterform.

One big benefit of a convolutional neural network is automated feature extraction, which works very well in image recognition and classification.

Before the actual character recognition, however, there is a very challenging step called character segmentation: separating the individual letters of a word. The next snapshot shows what I mean. Some letters are touching and some are degraded. It is not easy to segment the individual letters. It is also mission impossible to recognize those degraded letters!

However, character segmentation can be avoided if the OCR engine performs word recognition with an artificial neural network. After all, separating a word from a text line is much easier than separating the individual letters of a word. But why word recognition rather than character recognition? Because of the particular advantages of the CRNN architecture described in the paper. The CNN+LSTM architecture is specifically designed for recognizing sequence-like objects in images. It can learn words directly, without detailed character annotation or segmentation.
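One reason a CNN+LSTM model can skip character segmentation is the CTC transcription step used in the CRNN paper: the network emits one score vector per image column (frame), and decoding collapses repeated labels and drops a special blank symbol. The sketch below illustrates greedy CTC decoding in plain Python. The alphabet, the blank index, and the helper names are assumptions made for illustration; they are not taken from this repo's code.

```python
from itertools import groupby

# Assumed conventions for this sketch (not taken from the repo):
# index 0 is the CTC blank, indices 1..26 map to 'a'..'z'.
BLANK = 0
ALPHABET = "-abcdefghijklmnopqrstuvwxyz"


def ctc_greedy_decode(frame_scores, alphabet=ALPHABET, blank=BLANK):
    """Decode per-frame scores into a word without segmenting characters.

    frame_scores: one list of class scores per frame (image column).
    """
    # 1. Pick the best-scoring class in every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # 2. Collapse runs of the same label (one letter usually spans several frames).
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Drop blanks; a blank between two equal labels keeps double letters apart.
    return "".join(alphabet[label] for label in collapsed if label != blank)


def one_hot(index, size=len(ALPHABET)):
    """Hypothetical helper: fake a confident network output for one frame."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Frames read "h h - e l l - l o"; the blank separates the two l's.
frames = [one_hot(i) for i in [8, 8, 0, 5, 12, 12, 0, 12, 15]]
print(ctc_greedy_decode(frames))  # prints: hello
```

No frame-to-letter alignment is ever supplied here, which is why training data only needs the word label for each word image.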

My philosophy on Machine Learning and Artificial Intelligence is that if you want the machine to predict data more accurately, you had better let it “see” that kind of data. This sounds a little like “cheating”, but it is the truth. In machine learning, it is very common for a new model to work well right after deployment and then get worse and worse as time goes on. There is nothing wrong with the model. The problem is the data: the new data are no longer similar to the training data pool. Back to text recognition: I first developed a word recognition model trained on millions of synthetic word images. It achieves >99% accuracy and works well on regular text images (book pages, newspapers, etc.). When I applied the model to business documents, its performance dropped. Why? Because the synthetic training images were derived from regular, clean text images.

Text recognition is relatively static. You won’t see big changes in text styles. Training data are cheap and accessible, whether synthetic or real text images. Developing a new OCR model won’t take a long time. This is why I am conducting this research: developing a customized OCR for certain business documents. My goal is to achieve a recognition rate on those documents that is comparable to, or even higher than, AWS Textract. Word recognition is the first step in this research. The next step is document layout analysis, covering font style, font size, line, cell, box, table, and block. All of these could be very indicative and discriminative features for building robust downstream models.

Mind you, machine learning is not about the machine’s “intelligence”; it is all about automation.

Reference

Neural Network Structure

Download the pre-trained CNN+LSTM model

A pre-trained model (two files) is saved on Google Drive. Please put both files in './best_model'.

Training a new model

A sample of synthetic word images (50K) is included for experimentation only. It is not good enough to achieve a high recognition rate in real applications. The pre-trained model was trained on millions of synthetic and real word images. If you would like more samples for your research, please contact me (dlaohu.github@gmail.com).

Compare to Textract

Given a text image and its Textract OCR response, you can quickly check the model's performance. However, some improvements are still needed, such as image pre-processing (removing lines, increasing the contrast, de-noising, etc.) and spell check.

$ python ./test/compare_textract.py --image ./test/test_1/test_1.png --response ./test/test_1/apiResponse.json --output ./test/test_1
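To give a rough idea of what such a comparison involves, here is a minimal sketch that pulls the WORD blocks out of a Textract response and scores a list of predicted words against them. It relies on the documented Textract response layout (`Blocks`, `BlockType`, `Text`, `Geometry.BoundingBox`). The function names, the sample response, and the assumption that predictions are already aligned one-to-one with the Textract words are mine; this is not necessarily how `compare_textract.py` works.

```python
from difflib import SequenceMatcher


def textract_words(response):
    """Return (text, bounding_box) for every WORD block in a Textract response dict."""
    return [
        (block["Text"], block["Geometry"]["BoundingBox"])
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "WORD"
    ]


def word_accuracy(predicted, reference):
    """Fraction of exact matches, assuming both lists are aligned word by word."""
    pairs = list(zip(predicted, reference))
    if not pairs:
        return 0.0
    return sum(p == r for p, r in pairs) / len(pairs)


def similarity(a, b):
    """Character-level similarity in [0, 1]; useful for spotting near misses."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical, heavily trimmed response; a real one would be loaded with
# json.load() from a file such as ./test/test_1/apiResponse.json.
SAMPLE_RESPONSE = {
    "Blocks": [
        {"BlockType": "LINE", "Text": "Invoice Total"},
        {"BlockType": "WORD", "Text": "Invoice",
         "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.10, "Width": 0.20, "Height": 0.05}}},
        {"BlockType": "WORD", "Text": "Total",
         "Geometry": {"BoundingBox": {"Left": 0.32, "Top": 0.10, "Width": 0.15, "Height": 0.05}}},
    ]
}

reference = [text for text, _ in textract_words(SAMPLE_RESPONSE)]
predicted = ["Invoice", "Tota1"]  # made-up model output for the same word boxes

print(word_accuracy(predicted, reference))
for p, r in zip(predicted, reference):
    if p != r:
        print(f"mismatch: model={p!r} textract={r!r} similarity={similarity(p, r):.2f}")
```

The bounding boxes are relative to the page size (values between 0 and 1), so they need to be scaled by the image width and height before cropping word images.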

Document Layout Analysis

A simple version of document layout analysis has been added. The current version is used to detect the word images in a document. However, further investigation and research are needed. That work will go into a new repository, where a better document layout extraction will be developed, extracting more useful information than just text and coordinates. If you are interested, please join.

$ python serve.py --image ./test/test_1/test_1.png --output ./test/test_1

Some thoughts about Spell Check

Spell check is usually used for OCR post-processing. However, it is easily overused. It works well for text-only recognition tasks. In other tasks, spell check should be adapted to the specific use case. For example, we can restrict spell check to key words for ML/AI applications. Therefore, efforts should be focused on improving the CRNN accuracy.
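As an illustration of spell check restricted to key words, the sketch below corrects an OCR token only when it is very close to an entry in a small domain vocabulary, and leaves everything else (IDs, amounts, names) untouched. It uses `difflib.get_close_matches` from the Python standard library. The keyword list, the cutoff value, and the function name are hypothetical choices for this example; nothing here is part of this repo.

```python
from difflib import get_close_matches

# Hypothetical key words for a business-document use case.
KEYWORDS = ["invoice", "total", "amount", "date", "balance"]


def correct_keyword(word, keywords=KEYWORDS, cutoff=0.8):
    """Return the closest key word if the token is a near match, else the token as-is.

    A high cutoff keeps the correction conservative, so tokens such as
    invoice numbers are not "fixed" into dictionary words.
    """
    matches = get_close_matches(word.lower(), keywords, n=1, cutoff=cutoff)
    return matches[0] if matches else word


ocr_tokens = ["lnvoice", "A1234", "Ammount"]
print([correct_keyword(token) for token in ocr_tokens])
```

Note that a corrected token comes back in the lower-case form stored in the keyword list, so the original casing would have to be restored separately if it matters.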
