Deep Learning methods for checkboxes detection and classification
by Elliot Hofman, Data Scientist at Zelros
In Machine Learning (and that is probably what makes it so interesting) there are often several ways to tackle the same problem. How we choose to model the task impacts every subsequent step. How are we going to obtain the data, and what degree of labeling do we need? Which kind of ML algorithm is better suited, and which loss and metrics should we compute to train the model efficiently? Not all approaches lead to the same results, and we usually look for the best trade-off between:
- The performance (how good is the model?)
- The inference time (how fast is the model?)
- The scalability and simplicity of the approach (how light and maintainable is the approach when we need to adapt it, retrain it, or explain it to our customers?)
For instance, stacking 50 models to gain a small accuracy boost rarely makes sense in a production context, as maintainability, infrastructure costs and inference time would be severely impacted. In a Kaggle competition, on the other hand, the opposite holds: a difference of a few thousandths or even ten-thousandths in the metric can sometimes mean a large gap on the leaderboard.
While designing our CheckBoxesAnalyzer for insurance documents, we first went for an approach that turned out to be inconvenient in the long run. We will present this solution (here in the case of accident reports), explain why it was ultimately a bad idea, and show how we built a much more efficient architecture afterwards. Because hey, part of a Data Scientist’s job consists in making mistakes, being wrong and learning from it, right?
The “heading straight to disaster” approach
In our problem, a checkbox has two possible states: ticked or unticked. It felt natural to isolate each checkbox and train a binary classification model on the crops. Such an approach assumes that we can first locate and crop each checkbox. As we already had an OCR module to parse the textual zones of the document, we decided to use these zones to find the checkboxes.
More precisely, on the accident report each box is tied to a fixed textual label: for instance, boxes number 1 are the ones corresponding to the “* en stationnement/à l’arrêt” (parked/stationary) label, boxes number 2 to the “* quittait un stationnement/ouvrait une portière” (leaving a parking space/opening a door) label, and so on. By using the textual zones returned by the OCR, we can compute Levenshtein ratios against the expected labels, identify each label and its number, and then apply relative offsets to the text coordinates to find the checkboxes.
To use this solution, one first needs to manually compute the offsets for each of the 17 labels and 2 drivers, which means 34 offset functions to define. Of course this is an expensive process, but it only needs to be done once. Although the approach might seem a bit heavy, it was quite effective at correctly cropping all the checkboxes on our accident reports.
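The matching-then-offset idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the production code: the `LABELS` and `OFFSETS` tables, the pixel values, the driver names "A" and "B", the box size and the 0.8 acceptance score are all hypothetical, and the similarity is a simple normalized Levenshtein distance, which may differ from the exact ratio computed by a dedicated library.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical reference labels (box number -> printed text).
LABELS = {
    1: "en stationnement/à l'arrêt",
    2: "quittait un stationnement/ouvrait une portière",
}

# Hypothetical offsets (dx, dy) in pixels from the text zone's top-left corner
# to the checkbox's top-left corner, one entry per (box number, driver).
OFFSETS = {
    (1, "A"): (-40, 0), (1, "B"): (420, 0),
    (2, "A"): (-40, 0), (2, "B"): (420, 0),
}
BOX_SIZE = 30  # assumed checkbox side length in pixels


def match_label(ocr_text: str, min_score: float = 0.8):
    """Return the best-matching box number, or None if nothing is close enough."""
    best_number, best_score = None, 0.0
    for number, label in LABELS.items():
        score = similarity(ocr_text.lower().strip(), label)
        if score > best_score:
            best_number, best_score = number, score
    return best_number if best_score >= min_score else None


def locate_checkboxes(ocr_text: str, x: int, y: int):
    """Map one OCR text zone at (x, y) to crop rectangles for both drivers."""
    number = match_label(ocr_text)
    if number is None:
        return {}
    boxes = {}
    for driver in ("A", "B"):
        dx, dy = OFFSETS[(number, driver)]
        left, top = x + dx, y + dy
        boxes[(number, driver)] = (left, top, left + BOX_SIZE, top + BOX_SIZE)
    return boxes
```

Fuzzy matching tolerates small OCR mistakes: a zone read as "en stationement/à l'arret" would still be matched to label 1. A zone that the OCR misses entirely, however, yields no crop at all.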
We are now left with the task of classifying them as ticked or not. How are we going to do this?
It is always a good idea to avoid injecting Machine Learning models where they are not needed. Here, we first thought about counting pixels, as a ticked box should contain more black pixels than an empty one. However, to count pixels effectively, images must first be binarized to obtain pure black and white inputs. Depending on the acquisition process of the accident report (scan, photograph, photograph of a scanned document, …), the lighting, contrast and noise of the image, the kind of mark (tick, cross, circle, …) and the thickness of the pen, the optimal pixel threshold can vary considerably. This makes it hard to find binarization parameters that work on all checkbox crops, even when using more advanced methods such as Gaussian blurring, Otsu's method or adaptive thresholding.
So we tried to flatten the image and train a shallow ML model, such as an SVM or XGBoost, on the vectorized inputs. By doing so, we are essentially making the model learn an “intelligent pixel counting”, and we face the same issues as above, even when vectorizing images with methods that take visual properties into account (Histogram of Oriented Gradients vectors, SIFT descriptors, …).
As is often the case with Computer Vision tasks, Deep Learning models turned out to be the most effective. Indeed, convolutional layers allow the model to build its own features and to learn the spatial structure of the images by itself (corners, lines, curves, shapes, …). In our case, there was no need to invoke a full ResNet50: a few convolution and dropout layers quickly reached a validation accuracy close to 1, with errors usually made on ambiguous inputs.
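To make the thresholding problem concrete, here is a small sketch in plain Python, with no imaging library, so that it stays self-contained. The synthetic crops, the fixed threshold of 128 and the 25% dark-pixel cutoff are illustrative assumptions, not values from our pipeline.

```python
def make_box(size=20, background=230, ink=20, ticked=False):
    """Synthetic grayscale checkbox crop: 1-pixel border, optional cross inside."""
    rows = []
    for i in range(size):
        row = []
        for j in range(size):
            on_border = i in (0, size - 1) or j in (0, size - 1)
            on_cross = ticked and (i == j or i + j == size - 1)
            row.append(ink if on_border or on_cross else background)
        rows.append(row)
    return rows


def dark_pixel_ratio(gray, threshold=128):
    """Fraction of pixels darker than `threshold` (0 = black, 255 = white)."""
    total = sum(len(row) for row in gray)
    dark = sum(1 for row in gray for value in row if value < threshold)
    return dark / total


def is_ticked(gray, threshold=128, min_ratio=0.25):
    """Naive rule: the box is ticked if enough pixels fall below the threshold."""
    return dark_pixel_ratio(gray, threshold) >= min_ratio


# Well-lit scan: the rule separates empty and ticked boxes.
print(is_ticked(make_box(ticked=False)))  # False
print(is_ticked(make_box(ticked=True)))   # True

# Dark photograph of an EMPTY box: the paper itself falls below the threshold,
# so every pixel counts as "black" and the box is wrongly reported as ticked.
print(is_ticked(make_box(ticked=False, background=110, ink=15)))  # True
```

A per-image threshold helps with this particular failure, but the cutoff on the dark-pixel ratio then depends on pen thickness and mark type, which is the same tuning problem one level up.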
So we implemented a solution to detect and classify each checkbox and reached an almost perfect accuracy. Great, right? It turns out that this approach has some strong weaknesses, some of which you might have identified already:
- It depends directly on a first OCR step to detect textual zones on the image. This means that each time the OCR module fails to identify some text, we cannot provide a prediction for the corresponding checkboxes
- It is not robust to template changes, even when the new template shows only minor differences. That is especially the case when the language of the accident report changes: all the words are translated and, as such, we lose all the markers used to locate the checkboxes. All the offsets must be manually computed again!
- The approach can easily be fooled when the input document is rotated or deformed, which is very often the case when the accident report has been scanned on an irregular surface or photographed at an angle
We had lost enough time on unstable methods; we had to find another solution!
The “we should have done it from the beginning” approach
After we realized that the first approach was too fragile, we decided to remove all the pieces that made it hard to maintain and scale … and ended up with a much simpler approach. In fact, when we as humans look at the accident report, we don’t mentally isolate each checkbox and decide whether it is ticked or not. We take in the whole set of checkboxes, spot the ticked ones at a glance and check which numbers they correspond to.
We decided that we would no longer think about the problem at the scale of the single checkbox, but rather give the whole checkbox section to the model and let it work its magic and figure out by itself which boxes are ticked. This means that the problem is now a multi-label classification task with 17*2=34 classes. This time, targets are multi-hot binary vectors of size 34, with 1’s for the checkboxes that are ticked and 0’s elsewhere.
The input image now being more complex, we chose a deeper architecture, namely an EfficientNet B3, with a sigmoid activation on each coordinate of the output vector, trained with a binary cross-entropy loss. Although this approach takes a bit longer to converge than the first one, after a sufficient number of epochs we reached the same near-perfect performance.
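As an illustration, the size-34 target vector can be built as below. The class ordering (17 boxes of the first driver, then 17 boxes of the second) and the driver names "A" and "B" are assumptions made for this sketch; any fixed ordering works as long as it is the same for training and inference.

```python
N_BOXES = 17
DRIVERS = ("A", "B")  # assumed names for the two drivers


def class_index(driver, box_number):
    """Position of a (driver, box number) pair in the 34-dimensional vector."""
    return DRIVERS.index(driver) * N_BOXES + (box_number - 1)


def encode_target(ticked):
    """Turn a list of ticked (driver, box number) pairs into a 0/1 vector."""
    target = [0] * (N_BOXES * len(DRIVERS))
    for driver, box_number in ticked:
        target[class_index(driver, box_number)] = 1
    return target


def decode_prediction(probabilities, threshold=0.5):
    """Inverse mapping: outputs above `threshold` become (driver, box number)."""
    ticked = []
    for index, p in enumerate(probabilities):
        if p >= threshold:
            ticked.append((DRIVERS[index // N_BOXES], index % N_BOXES + 1))
    return ticked


target = encode_target([("A", 1), ("B", 17)])
print(len(target), sum(target))   # 34 2
print(decode_prediction(target))  # [('A', 1), ('B', 17)]
```

Unlike a single-class target, several positions can be set to 1 at once, and a report with no ticked box is simply the all-zero vector.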
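The output layer and loss can be written out in a few lines of plain Python. This is only a sketch of the arithmetic, assuming the loss is averaged over the 34 outputs; a real training loop would rely on the deep learning framework's own implementation. Each output gets its own sigmoid, so the 34 probabilities are independent and do not need to sum to 1, as they would with a softmax.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(logits, targets, eps=1e-7):
    """Mean binary cross-entropy over all outputs of one sample.

    `logits` are the raw network outputs, `targets` the 0/1 labels.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for z, t in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(logits)


targets = [1, 0, 0, 1]
good = binary_cross_entropy([4.0, -4.0, -4.0, 4.0], targets)  # confident and right
bad = binary_cross_entropy([-4.0, 4.0, 4.0, -4.0], targets)   # confident and wrong
print(good < bad)  # True
```

An undecided network that outputs a logit of 0 everywhere (probability 0.5) gets a loss of log(2), about 0.693, whatever the targets are.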
But this new architecture was far better than the first one!
- It does not depend on another OCR module, so risks of failure are reduced by a large margin, and it also makes it agnostic to the language of the accident report
- It adapts much better to template variations, and to possible rotations and shearing. In fact, these variations can be handled directly during training using random data augmentation
- By looking at the big picture, it learns to model the associations between the different checkboxes. Indeed, many of them are mutually exclusive (for instance, the driver cannot be both turning left and turning right), and some combinations of checkboxes are impossible in practice. There was no way the first approach could figure this out, because it only looked at single checkboxes
- When we need to retrain the model, we will have close to zero additional work to do. In the first approach we would have needed to first extract the new checkboxes from the OCR output, and retrain the classification weights afterwards
- At inference time, the model needs to perform a single forward pass instead of 34 distinct predictions
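The random rotation and shearing mentioned in the list above boil down to sampling a small affine transform for each training image. The sketch below only shows the geometry, in plain Python; the angle ranges are arbitrary assumptions, and in practice the image library or training framework would apply the transform to the pixels.

```python
import math
import random


def random_affine(max_rotation_deg=10.0, max_shear_deg=8.0, rng=random):
    """Sample a 2x2 matrix combining a random rotation and a horizontal shear."""
    angle = math.radians(rng.uniform(-max_rotation_deg, max_rotation_deg))
    shear = math.radians(rng.uniform(-max_shear_deg, max_shear_deg))
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    k = math.tan(shear)
    # Rotation [[cos, -sin], [sin, cos]] times shear [[1, k], [0, 1]].
    return ((cos_a, cos_a * k - sin_a),
            (sin_a, sin_a * k + cos_a))


def apply_affine(matrix, point, center=(0.0, 0.0)):
    """Transform one (x, y) point around `center`."""
    (a, b), (c, d) = matrix
    x, y = point[0] - center[0], point[1] - center[1]
    return (a * x + b * y + center[0], c * x + d * y + center[1])


rng = random.Random(0)  # seeded only to make the example reproducible
matrix = random_affine(rng=rng)
corner = apply_affine(matrix, (0.0, 0.0), center=(50.0, 40.0))
print(corner)  # where the top-left corner of a 100x80 crop ends up
```

The target vector is left untouched: as long as the checkboxes stay inside the frame, a slightly rotated or sheared section still has the same ticked boxes, which is what makes this augmentation essentially free for a whole-section classifier.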
As seen through this example, there are often several ways to tackle a Machine Learning problem, and unfortunately your first shot will not always turn out to be the best option!
However, choosing the model architecture is not the only issue that you will face in your Data Science projects. In fact, problems might arise as soon as you receive the data! Your data might be incomplete, skewed, highly biased, insufficient … See in the next chapter how we fought the biases of accident reports with our synthetic data generator!