How to fight dataset biases with synthetic data generation?
by Elliot Hofman, Data Scientist at Zelros
(This article is the third part of a blog post series about accident report automation. Check the introduction if you need a refresher on the previous chapters!)
Data generation is a trending topic nowadays. Some astonishing applications have recently emerged in various machine learning domains. In NLP for instance, the OpenAI GPT-3 language model set new standards for text generation, pushing the limits for chatbots and other entertaining applications such as generating frontend code from a description of the widget. In Computer Vision, Generative Adversarial Networks (GANs) and their variants are a major breakthrough that made new levels of image generation possible, as in deepfakes for instance. NLP and CV have also been combined in models able to generate captions from images, or conversely generate images from captions.
While generative systems are increasingly criticized because of their potentially dangerous applications, synthetic data is also a major challenge for data scientists themselves. In fact, being able to produce credible fake training data can help build stronger Machine Learning models:
- When acquiring or labeling the data is costly, generative systems make it possible to produce many labelled samples programmatically.
- The data protection and privacy standards imposed by the GDPR control the way algorithms can store and use sensitive personal data. Building synthetic datasets that follow the same distribution as the original data and using them as a proxy can help data scientists create models that better protect the confidentiality of customers' personal data.
In our case (automating the analysis of checkboxes), we implemented a synthetic sample generator to fight another well-known dataset issue: data biases. Typically, the structure of the checkboxes on an accident report is sparse by construction. Most of the checkboxes are mutually exclusive, and many combinations are impossible in practice: for instance, the driver cannot be both overtaking another vehicle (checkbox 11) and driving backward (checkbox 14). Unless …
Also, some kinds of accidents are much more common than others. This means that in our dataset, most of the checkboxes are left unticked, and some of them are almost always empty. However, for the model described in the previous section to operate correctly, the neural network must see a large number of both ticked and unticked checkboxes during training, for each of the 34 locations. By training only on real accident reports, we expose our model to a real risk of overfitting these biases.
Checkbox generation with GANs?
Although GANs seemed a bit oversized for our simple problem of checkbox generation, we decided to have fun and train a StyleGAN-2 architecture to generate some fake checkboxes. This is the architecture behind the websites thispersondoesntexist.com or thiscatdoesntexist.com.
We fed images of ticked and unticked checkboxes to the StyleGAN-2, and the model started to generate fake checkboxes whose credibility increased over the iterations. The model actually learned how to draw a square and put a cross in it! 🔥
We had a good time playing with GANs. However, we noticed that we were unable to control the images generated by the neural network: we had no way to enforce a certain kind of pattern or impose the size of the check marks, for instance. That’s why we quickly switched to …
The good old handcrafted generation method
The structure we are trying to generate here is fairly simple (compared to a human face or a cat, for instance), so we decided to implement our own generator with stochastic handcrafted rules. By carefully inspecting the different accident reports, we noticed that humans tend to tick the checkboxes following recurrent patterns:
- The type of the mark: Is it a cross ✖️ ? A circle ⭕ ? A single stroke 〰️ ? A tick ✔️ ? Another weird mark ➰ ?
- Each type of mark is itself described by various parameters. For instance, a cross might have one or two strokes, and the strokes can have varying sizes and varying angles between them. The same goes for circles, which can be either filled or empty, with different ellipse axes, …
- Finally, the mark appears with a particular contrast, thickness of the pen and offset distance from the center of the printed checkbox
Using standard Python image processing libraries such as Pillow and OpenCV, our generator randomly chooses a mark pattern and its sub-parameters to fill the pixels of a blank RGBA background. To model the fact that a human will never produce the exact same mark twice, we also perturb the image with random scaling, erosion/dilation and channel permutations, and finish with an elastic distortion whose magnitude α is set proportionally to the size of the mark.
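To make the idea more concrete, here is a minimal sketch of what such a generator could look like for the cross pattern only, using Pillow. The function name `generate_cross`, the parameter ranges and the ink colors are our own assumptions for illustration, not the values used in production, and the channel permutation and elastic distortion steps are left out to keep it short.

```python
import math
import random

from PIL import Image, ImageDraw, ImageFilter


def generate_cross(size=64, seed=None):
    """Draw a random one- or two-stroke cross on a transparent RGBA canvas.

    All ranges below are illustrative assumptions, not tuned values.
    """
    rng = random.Random(seed)

    # The mark is first drawn as a grayscale mask, later used as alpha channel.
    mask = Image.new("L", (size, size), 0)
    draw = ImageDraw.Draw(mask)

    opacity = rng.randint(150, 255)   # contrast of the ink
    thickness = rng.randint(3, 6)     # thickness of the pen, in pixels
    # Offset of the mark from the center of the checkbox.
    cx = size / 2 + rng.uniform(-0.1, 0.1) * size
    cy = size / 2 + rng.uniform(-0.1, 0.1) * size

    first_angle = rng.uniform(30, 60)
    for i in range(rng.choice([1, 2])):   # one or two strokes
        angle = math.radians(first_angle + i * rng.uniform(70, 110))
        half = rng.uniform(0.2, 0.35) * size   # half-length of the stroke
        dx, dy = half * math.cos(angle), half * math.sin(angle)
        draw.line(
            [(round(cx - dx), round(cy - dy)), (round(cx + dx), round(cy + dy))],
            fill=opacity,
            width=thickness,
        )

    # Random erosion (MinFilter) or dilation (MaxFilter) of the strokes.
    # Erosion is only allowed on thick strokes so the mark cannot vanish.
    choices = [None, ImageFilter.MaxFilter(3)]
    if thickness >= 5:
        choices.append(ImageFilter.MinFilter(3))
    morph = rng.choice(choices)
    if morph is not None:
        mask = mask.filter(morph)

    # Dark, slightly bluish ink, made visible only where the mask is set.
    ink_color = (rng.randint(0, 60), rng.randint(0, 60), rng.randint(0, 120), 255)
    mark = Image.new("RGBA", (size, size), ink_color)
    mark.putalpha(mask)

    # Random scaling of the whole mark.
    new_size = round(size * rng.uniform(0.8, 1.2))
    return mark.resize((new_size, new_size))
```

Calling `generate_cross(seed=42)` twice returns the same image, which is handy for debugging, while leaving `seed=None` gives a different mark on every call. Circles, ticks and single strokes would follow the same recipe with their own parameters.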
All we need now is to insert these marks on an empty template. For this, we gathered around 50 different blank accident reports that did not have any checked boxes and annotated the coordinates of the 34 checkboxes for each of them. This is a manual process, but quite fast using any image annotation tool, and that only needs to be done once for each background template.
During the training loop, the model is fitted on both real and synthetic images. Images are generated on-the-fly with the following steps:
- Randomly choose one of the 50 background templates
- Select the checkboxes that will be marked. For this, we used our database of real accident reports to compute a correlation matrix between the 34 checkboxes. We then sample from the resulting distribution to decide which checkboxes to artificially set as ticked. This lets us generate coherent and credible configurations, in line with the constraints discussed above
- Starting from the annotated coordinates, compute random offsets from the centers and paste a generated mark on each one of the selected checkboxes.
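The last two steps can be sketched as follows. The article does not detail how configurations are sampled from the correlation matrix, so the Gaussian copula used here is our own assumption: we draw correlated Gaussian variables and tick a checkbox when its variable falls below a threshold chosen to reproduce the observed tick frequency. Using the correlation of binary data directly as the Gaussian correlation is only an approximation, and it makes rare combinations unlikely rather than strictly impossible. The names `fit_checkbox_sampler`, `sample_ticks` and `paste_marks` are hypothetical.

```python
from statistics import NormalDist

import numpy as np
from PIL import Image


def fit_checkbox_sampler(real_ticks):
    """Estimate tick frequencies and correlations from real reports.

    real_ticks: array of shape (n_reports, n_checkboxes) with 0/1 values.
    """
    real_ticks = np.asarray(real_ticks, dtype=float)
    # Clip so that never-ticked boxes still get a finite threshold.
    marginals = np.clip(real_ticks.mean(axis=0), 1e-3, 1 - 1e-3)
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.corrcoef(real_ticks, rowvar=False)
    corr = np.nan_to_num(corr, nan=0.0)   # constant columns give NaN
    np.fill_diagonal(corr, 1.0)
    thresholds = np.array([NormalDist().inv_cdf(p) for p in marginals])
    return corr, thresholds


def sample_ticks(corr, thresholds, n_samples, rng):
    """Sample boolean tick configurations (assumed Gaussian copula)."""
    latent = rng.multivariate_normal(np.zeros(len(thresholds)), corr, size=n_samples)
    return latent < thresholds


def paste_marks(template, centers, ticks, make_mark, rng, max_offset=3):
    """Paste one generated mark on each ticked checkbox of a blank template.

    centers: annotated (x, y) centers of the checkboxes on this template.
    make_mark: callable returning an RGBA Pillow image of a mark.
    """
    page = template.copy()
    for (cx, cy), ticked in zip(centers, ticks):
        if not ticked:
            continue
        mark = make_mark()
        # Random offset of the mark from the center of the checkbox.
        ox, oy = (int(v) for v in rng.integers(-max_offset, max_offset + 1, size=2))
        corner = (cx + ox - mark.width // 2, cy + oy - mark.height // 2)
        page.paste(mark, corner, mark)   # the mark's alpha acts as paste mask
    return page
```

In a training loop, `sample_ticks` would give both the configuration to draw and the labels for the synthetic image, since the sampled booleans are exactly the ground truth of the pasted page.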
Using Albumentations and PyTorch’s wrappers, we send these images through data augmentation pipelines (shearing, rotation, noise, scaling, …) and train the weights so that, after a large number of epochs, the model has seen many possible combinations of checkboxes. See the previous part for more details on this!
Going further: Data Generation for OCR tasks
Being able to detect and classify the checkboxes on the accident report is a good thing, but it would be quite useless if the engine could not also read the information about the drivers, in order to know which customers are involved. Extracting fields such as the date, time and location of the accident, or the name, phone number, license plate, insurance company, contract number, … of the drivers is mandatory to automate accident report parsing efficiently. It is a challenging task, considering that:
- The accident report is a fully handwritten document
- After a car accident, people are usually nervous at best, if not in a state of shock. They often fill in the accident report on their car hood, which tends to make their writing unreadable, doctor style.
- They sometimes send their insurer the carbon copy of the report, which makes it even harder to read once digitized
For these reasons, we absolutely needed to train our models on huge datasets. Indeed, models need to see a great number of distinct words and writing variations in order to get a reasonable understanding of how and what people write on the accident reports.
Starting from a set of around 10 000 scraped public fonts labelled as “cursive” or “handwritten”, we generate random names, phone numbers, dates, … and use similar techniques to generate fake images on which we can train our OCR models. (See next figure for some visual examples)
The catch here is that a non-negligible number of generated images did not look very realistic compared to real examples. We used a two-step approach to filter them out:
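As a rough illustration, generating one labelled OCR sample could look like the sketch below. The name lists, the phone and date formats and the helper names `random_field` and `render_field` are hypothetical, and Pillow's built-in default font stands in for the scraped handwriting fonts.

```python
import random

from PIL import Image, ImageDraw, ImageFont

# Tiny hypothetical vocabularies, only for illustration.
FIRST_NAMES = ["Alice", "Marc", "Sofia", "Liam"]
LAST_NAMES = ["Martin", "Dubois", "Keller", "Rossi"]


def random_field(rng):
    """Return a (kind, text) pair for a randomly chosen field type."""
    kind = rng.choice(["name", "phone", "date"])
    if kind == "name":
        text = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    elif kind == "phone":
        # Five groups of two digits (an assumed format).
        text = " ".join(f"{rng.randint(0, 99):02d}" for _ in range(5))
    else:
        text = f"{rng.randint(1, 28):02d}/{rng.randint(1, 12):02d}/{rng.randint(2000, 2024)}"
    return kind, text


def render_field(text, font=None, size=(220, 32)):
    """Render the text in dark ink on a white grayscale canvas."""
    font = font or ImageFont.load_default()
    image = Image.new("L", size, 255)
    ImageDraw.Draw(image).text((4, 8), text, fill=0, font=font)
    return image


rng = random.Random(0)
kind, text = random_field(rng)
sample = render_field(text)   # the image is the input, `text` is the label
```

To get closer to handwriting, the `font` argument can be replaced with `ImageFont.truetype(path, size)` pointing at one of the cursive font files, and the rendered image can then go through the same perturbations as the check marks.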
- We gathered all the real images and labeled them as positive, then gathered all the generated images and labeled them as negative. We combined them into one dataset and trained a binary classifier to distinguish between real and fake inputs. We monitored the AUC rather than the F1-score, because what mattered to us was not the classification power of the model at a fixed threshold: we wanted the output probabilities to act as a ranking measure of how “likely” an input looks, and ranking quality is exactly what the AUC measures. The F1-score stayed low anyway, but for once that was a good thing: it means that, overall, the model finds it hard to tell fake and real inputs apart.
- Once the model was trained, we used it to score each fake image and rank them by how realistic they look to the neural network. We then kept the top-N images that seemed the most realistic to the model. As our generation process can produce an almost endless stream of synthetic inputs, we did not mind discarding some of them.
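The two steps above can be sketched with NumPy alone, assuming the classifier has already produced, for each image, a score that reads as "probability of being real". The function names `auc` and `keep_most_realistic` are ours, and the pairwise AUC below is only meant for small arrays; a library implementation would be preferable on a full dataset.

```python
import numpy as np


def auc(real_scores, fake_scores):
    """Probability that a random real image outranks a random fake one.

    Ties count for one half. This is the pairwise (Mann-Whitney) form
    of the ROC AUC, with real images as the positive class.
    """
    real_scores = np.asarray(real_scores, dtype=float)
    fake_scores = np.asarray(fake_scores, dtype=float)
    diff = real_scores[:, None] - fake_scores[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def keep_most_realistic(fake_scores, n):
    """Indices of the n fake images with the highest 'real' score."""
    fake_scores = np.asarray(fake_scores, dtype=float)
    order = np.argsort(-fake_scores, kind="stable")
    return order[:n]


# Toy scores, made up for the example.
real = [0.9, 0.8]
fake = [0.1, 0.85, 0.4]
print(auc(real, fake))               # 5 of the 6 real/fake pairs are well ordered
print(keep_most_realistic(fake, 2))  # fake images 1 and 2 are kept
```

An AUC close to 0.5 would mean the scores carry no ranking information at all, so the filter only makes sense when the classifier ranks clearly better than chance, even if its thresholded predictions remain poor.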
Here are a few examples of what our OCR models take as input:
To finish, we are left with the burning question: was all of this worth it? We were happy with our synthetic data generators, which were quite fun to implement, but did they really help our models be more robust? Fortunately, the answer is yes! Using data generation for our OCR models yielded a >30% increase in validation accuracy and drastically reduced the average Levenshtein character error rate (CER: something that you can interpret as “when my model fails, it doesn’t fail too much”). Yaay! 💪
As we trained our ML pipelines, we ended up with a bunch of different models. In the next section, we will present an efficient, yet not so well-known, trick that you can use to validate and assemble various ML models in order to boost performance. Stay tuned!
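For readers unfamiliar with the metric, here is a minimal pure-Python sketch of a Levenshtein-based CER: the number of character insertions, deletions and substitutions needed to turn the prediction into the reference, divided by the length of the reference. Averaging the per-sample CER, as `mean_cer` does, is one possible convention and an assumption on our side; the function names are ours as well.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                        # deletion
                    current[j - 1] + 1,                     # insertion
                    previous[j - 1] + (char_a != char_b),   # substitution
                )
            )
        previous = current
    return previous[-1]


def cer(prediction, reference):
    """Character error rate of one prediction against its reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(prediction, reference) / len(reference)


def mean_cer(pairs):
    """Average CER over (prediction, reference) pairs."""
    pairs = list(pairs)
    return sum(cer(pred, ref) for pred, ref in pairs) / len(pairs)


# One wrong character out of nine: the prediction is not exact, but close.
print(cer("AB-128-CD", "AB-123-CD"))
```

This is why the CER complements plain accuracy: a license plate read with a single wrong character counts as a full failure for accuracy, but only as a CER of about 0.11.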