Technology has become an integral part of our lives and influences everyday decisions, not only for individuals but for society as a whole. Pick any area of society or work, such as education, transportation, medicine, or entertainment, and technology has a vital role to play in it. Have you ever wondered what technology is at work when you scan the vouchers and promotional codes on a grocery packet or another product with your mobile phone to claim an offer? Or how you received a challan (traffic ticket) for a traffic rule you broke when there was no police officer around to see it? The technology working behind the scenes in these scenarios is OCR (Optical Character Recognition).
OCR is a computer technology that helps machines read the text in a given sample. The sample could be an image or a handwritten or printed document. OCR is mostly used to recognize printed or handwritten characters inside a digital image of physical text, such as a scanned document or a vehicle registration plate. It is one of the earliest computer vision tasks to be addressed, and when the data is clean and simple, classical algorithms are often sufficient. For large and complex datasets, however, deep learning usually gives better results. OCR plays a vital role in many areas: the data available today runs into terabytes and petabytes, and analyzing it at that scale for valuable output generally calls for deep learning. At its core, OCR follows a basic process: it examines the text in a document and translates the characters into character codes that can be used for data processing.
In this blog, I will talk about some strategies, methods, and logic for addressing different OCR tasks. To make things simpler, I will also point to available datasets that you can play with to understand the fundamentals of OCR.
1. Factors influencing OCR
In layman’s terms, OCR extracts all the textual information it can from an image, for example reading a license plate number or a road sign. OCR technology is used to automate data entry, to assist blind and visually impaired people, and to index documents for search engines. What makes OCR a gem in everyday life is its ability to strengthen systems and services. However, there are some factors to keep an eye on before you start using OCR. Here are some of them:
- Text density: Text density varies from one sample to another. On a written or printed page the text is dense, whereas in an image of a street with a single street sign the text is sparse.
- Structure of text: Printed text on a page is well structured, mostly in strict rows. Handwritten text may be scattered and appear at different rotations.
- Fonts: Printed fonts are easier to read and extract because they are well structured, whereas handwriting can be irregular and inconsistent, in other words noisy.
- Character type: Text can be written in many languages whose scripts may differ greatly from one another. Digits also differ from letters, so numbers are often handled differently from general text.
- Artifacts: Low-resolution images tend to be noisier than high-resolution ones, which makes characters harder to separate from the background.
- Text location: Text may appear anywhere in an image: in the middle, in a corner, or at some other arbitrary location.
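To make the text density and text location factors concrete, here is a minimal sketch that measures both on a toy binarized image. It assumes the image has already been thresholded into rows of 0 (background) and 1 (ink); the function names `text_density` and `text_bounding_box` are hypothetical helpers written for this illustration, not part of any OCR library.

```python
def text_density(binary_image):
    """Fraction of pixels marked as ink (1) in a binarized image.

    `binary_image` is a list of rows, each row a list of 0/1 values.
    """
    total = sum(len(row) for row in binary_image)
    if total == 0:
        return 0.0
    ink = sum(sum(row) for row in binary_image)
    return ink / total


def text_bounding_box(binary_image):
    """Return (top, left, bottom, right) of the ink pixels, or None if blank."""
    rows = [r for r, row in enumerate(binary_image) if any(row)]
    if not rows:
        return None
    cols = [c for row in binary_image for c, value in enumerate(row) if value]
    return rows[0], min(cols), rows[-1], max(cols)


# Toy "street scene": a small sign in the lower-right corner.
sparse = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]

# Toy "printed page": ink spread over the whole image.
dense = [
    [1, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
]

print(text_density(sparse), text_bounding_box(sparse))
print(text_density(dense), text_bounding_box(dense))
```

A low density with a small bounding box suggests a scene-text style image where the text must be located first, while a high density spread over the whole image looks more like a document page.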
Several datasets are available, such as the NIST Database, the MNIST Database, Devanagari Characters, Mathematics Expressions, Chinese Characters, Arabic Printed Text, Document database, Street View House Numbers, and Natural Environment OCR, and each has its own challenges. I will elaborate on two of the most commonly used.
MNIST, which stands for the Modified National Institute of Standards and Technology database, is the dataset most commonly used by beginners to build confidence with OCR. It is derived from the larger NIST database: a low-complexity collection of handwritten digits used to train and test supervised machine learning algorithms. MNIST is a well-known computer vision benchmark. It contains 70,000 small 28×28 pixel grayscale images of single handwritten digits from 0 to 9, split into 60,000 training images and 10,000 test images. MNIST covers only digit recognition and is usually not considered a full-fledged OCR task.
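If you download MNIST in its original form, the images and labels come as binary IDX files rather than pictures. The sketch below parses that layout using only the standard library: a big-endian header (magic number 2051 for image files, 2049 for label files) followed by raw unsigned bytes. The function names are hypothetical, and the demo runs on a tiny synthetic two-image file so that it works without downloading anything.

```python
import struct


def parse_idx_images(data):
    """Parse IDX image bytes into a list of images (each a list of pixel rows)."""
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError(f"not an IDX image file (magic={magic})")
    size = rows * cols
    pixels = data[16:]
    if len(pixels) != count * size:
        raise ValueError("pixel data does not match the header")
    return [
        [list(pixels[i * size + r * cols : i * size + (r + 1) * cols])
         for r in range(rows)]
        for i in range(count)
    ]


def parse_idx_labels(data):
    """Parse IDX label bytes into a list of integer labels."""
    magic, count = struct.unpack(">II", data[:8])
    if magic != 2049:
        raise ValueError(f"not an IDX label file (magic={magic})")
    labels = list(data[8:])
    if len(labels) != count:
        raise ValueError("label data does not match the header")
    return labels


# Synthetic stand-in for a real file: two 2x2 images and their labels.
fake_images = struct.pack(">IIII", 2051, 2, 2, 2) + bytes([0, 255, 128, 0, 1, 2, 3, 4])
fake_labels = struct.pack(">II", 2049, 2) + bytes([7, 3])

images = parse_idx_images(fake_images)
labels = parse_idx_labels(fake_labels)
print(images, labels)
```

The real files are usually distributed gzip-compressed, so you would read them with something like `gzip.open(path, "rb").read()` before passing the bytes to these functions. Most deep learning frameworks also ship their own MNIST loaders, which are the more convenient choice in practice.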
SVHN stands for the Street View House Numbers dataset. It is similar in spirit to MNIST but, as the name suggests, its digits are house numbers cropped from Google Street View images; the training set alone contains 73,257 digits. It is a good dataset for intermediate learners: a solid benchmark for training models that read street numbers accurately, and one that can easily be incorporated into various projects.
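One practical detail trips up many first-time SVHN users: in the cropped-digit files as originally distributed, the digit 0 is stored with the label 10, while digits 1 to 9 keep their own values. The sketch below remaps the labels to the usual 0 to 9 range and counts the classes. The helper name `normalize_svhn_labels` is hypothetical, and the sample labels are made up for illustration.

```python
from collections import Counter


def normalize_svhn_labels(raw_labels):
    """Map SVHN's label 10 to digit 0 and keep labels 1-9 unchanged."""
    normalized = []
    for label in raw_labels:
        if not 1 <= label <= 10:
            raise ValueError(f"unexpected SVHN label: {label}")
        normalized.append(label % 10)
    return normalized


# Made-up labels in the raw SVHN convention (10 means the digit 0).
raw = [1, 10, 9, 10, 3]
digits = normalize_svhn_labels(raw)
class_counts = Counter(digits)
print(digits, class_counts)
```

Doing this remapping once, right after loading, keeps the rest of a pipeline compatible with code written for MNIST-style labels.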
3. Practical use cases
As mentioned earlier, OCR is widely used in businesses and day-to-day life. Some of the practical uses include business card readers, parking and traffic monitoring, document moderation and processing, mobile payments, auto-translations, cheque deposit machines, etc. Let’s discuss some of the important ones in detail.
Vehicle Number Plate Recognition
Perhaps the most significant application area of OCR right now is vehicle recognition. Once an image of the vehicle is captured, the number plate is detected first and then its characters are recognized. Since the plate shape is relatively standard, some approaches apply a simple reshaping step before actually recognizing the characters. The following steps are involved in applying OCR to a number plate:
- The image is divided into smaller regions.
- The number plate area is located.
- The number plate is analyzed to isolate individual characters.
- OCR is applied to the isolated characters.
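The character extraction step can be illustrated with a classical technique called a vertical projection profile: count the ink in every column of the binarized plate and cut wherever a column is empty. This is only a sketch of one possible approach, not necessarily what production tools do; it assumes the plate has already been located, cropped, and thresholded into rows of 0/1, and the names `segment_characters` and `crop` are hypothetical.

```python
def segment_characters(plate):
    """Split a binarized plate into column ranges, one per character.

    Returns a list of (start, end) column indexes, with `end` exclusive.
    """
    if not plate:
        return []
    width = len(plate[0])
    # Vertical projection: amount of ink in each column.
    column_ink = [sum(row[c] for row in plate) for c in range(width)]

    segments = []
    start = None
    for c, ink in enumerate(column_ink):
        if ink and start is None:
            start = c                      # a character begins
        elif not ink and start is not None:
            segments.append((start, c))    # an empty column ends it
            start = None
    if start is not None:
        segments.append((start, width))    # character touching the right edge
    return segments


def crop(plate, segment):
    """Cut one character out of the plate using a (start, end) column range."""
    start, end = segment
    return [row[start:end] for row in plate]


# Toy plate with three "characters" separated by empty columns.
plate = [
    [1, 1, 0, 0, 1, 0, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 1, 1],
]

segments = segment_characters(plate)
characters = [crop(plate, s) for s in segments]
print(segments)
```

Each cropped character would then be passed to a recognizer. The approach breaks down when characters touch each other or the plate is tilted, which is one reason learned models are often preferred for real-world plates.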
[Figure: example of a processed number plate output]
Two commonly cited options for recognizing license plates are OpenALPR and CRNN. OpenALPR is a very powerful tool that recognizes license plates from different countries. CRNN (Convolutional Recurrent Neural Network) is a text-recognition architecture that has been used, for example, to recognize Korean license plates.
Printed or PDF OCR can be cited as the most common use of OCR. Printed documents are structured, which makes them easier to parse. Tesseract is one of the most commonly used OCR tools for structured documents and gives good results on them. I will talk about this in detail, with test cases and code snippets, in my next blog on ‘real-life examples of OCR’.
Implementing OCR in the real world is a little more complex than it looks. Computer vision faces challenges when samples are affected by lighting conditions, noise, and artifacts, especially in the case of street view text (SVT), machine-printed images, and handwritten images. So, is there an easy way to sail your boat through these troubled waters? The answer is ‘Yes’. My next blog will focus entirely on the different approaches to OCR and an in-depth analysis of practical OCR scenarios. Stay tuned!