Beyond simple OCR: an outlook of modern AI vision for insurance automation

Dec 22, 2020


By Elliot, Data Scientist at Zelros

Our mission at Zelros is to bring Artificial Intelligence to insurance companies and to enable better service for both insurers and their policyholders. Even though the digital era has already started to reshape the insurance industry, many processes still require heavy human intervention, from underwriting to claims handling. Full automation by AI algorithms is often an unreachable goal, but automating certain parts of a process with a high level of confidence already reduces processing delays. AI becomes an additional tool for striking the right balance between repetitive, low-value tasks and complex, high-value ones: insurers can automate part of the former and focus on the tasks that require full human experience and knowledge.

Zelros Documents2Insights is a module of our platform that automates the processing of the legal documents commonly handled in the insurance industry: national ID cards, driving licenses, car registrations, insurance statements, tax notices, and so on.

Optical Character Recognition (also known as OCR) is a Machine Learning task in which a model is given an image as input and asked to predict the text displayed in that image.

While OCR is obviously one of the main pillars of automated document parsing, many other ML tasks are just as crucial when analyzing these documents. Here are a few examples of the Computer Vision problems we work on at Zelros:

Checkbox analysis

Detection and recognition of signatures

Is the document correctly signed? Does the signature match the one registered for this user in the database?

Forgery detection, in order to prevent fraud

Observe how some characters appear to have been edited with software

Drawings classification

Were the two vehicles driving in the same direction? Did one of them cross a white line?

Automatic image processing to remove artefacts that are not OCR-friendly

The new French driving license encloses its fields with an unusual symbol, known as the “Alien character”. While it is an effective way to prevent forgery, it also confuses most OCR systems, which try to interpret it as a regular character. Here, ML algorithms are used to remove these symbols before the image goes through OCR.

An example of a French redemption request in which the strokes printed on the form ruin the OCR prediction: they are typically interpreted as ‘I’ or as continuations of handwritten characters. Removing them with Computer Vision methods increases OCR accuracy.
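To make the idea of stroke removal concrete, here is a minimal, dependency-free sketch. It is an illustrative assumption, not the method used in our platform: it treats the image as a binarized grid (1 = ink, 0 = background) and erases any vertical run of ink at least `min_run` pixels tall, on the assumption that printed separators are taller than handwritten strokes. The function name and the threshold are hypothetical.

```python
def remove_vertical_strokes(binary, min_run):
    """Erase vertical ink runs of at least `min_run` pixels.

    `binary` is a list of rows, each a list of 0 (background) / 1 (ink).
    Returns a new image; the input is left untouched.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]
    for x in range(width):
        y = 0
        while y < height:
            if binary[y][x] == 0:
                y += 1
                continue
            # Measure the vertical run of ink starting at row `start`.
            start = y
            while y < height and binary[y][x] == 1:
                y += 1
            # Long runs are assumed to be printed strokes, not handwriting.
            if y - start >= min_run:
                for yy in range(start, y):
                    cleaned[yy][x] = 0
    return cleaned


# Toy form: column 1 holds a full-height printed stroke,
# columns 3-4 hold a short handwritten fragment.
form = [
    [0, 1, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
]
cleaned = remove_vertical_strokes(form, min_run=4)
```

On real scans, a rule like this would also erase tall handwritten letters, which is one reason learned approaches tend to be preferred over fixed thresholds.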

Among all the documents handled during insurance processes, we chose to discuss in more detail how we tackled the automation of the European Accident Report. In Europe, when two drivers are involved in an accident, they must fill in a form that describes the circumstances of the accident and provides some information about both drivers.

It is an interesting document because it is very rich from a Computer Vision perspective. It features typed and handwritten content, structured and unstructured fields, along with drawings, checkboxes and signatures. Also, although insurance companies are free to decide on the content and appearance of this document, it has been standardized, so differences between companies and countries are often minor. In France, an estimated 5 million accident reports are produced each year, which represents a huge volume.

In this blog post series, we discuss more specifically how we automated the checkbox analysis, that is, predicting whether each checkbox is ticked or not. The task is easy to understand and also arises in many other kinds of documents. In fact, the methods presented here have also been applied successfully to other insurance forms, for both subscription and claims.
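To fix ideas about what "predicting whether a checkbox is ticked" means, here is a deliberately naive baseline, sketched under our own assumptions rather than taken from the approach developed in this series: it counts the fraction of dark pixels inside a grayscale crop of the checkbox and compares it to a threshold. The function names, the `margin` used to skip the printed border, and the threshold values are all hypothetical.

```python
def ink_ratio(crop, margin=1, dark_threshold=128):
    """Fraction of dark pixels inside a grayscale checkbox crop.

    `crop` is a list of rows of 0-255 intensities (0 = black).
    `margin` pixels are skipped on each side to ignore the printed border.
    """
    inner = [
        row[margin:len(row) - margin]
        for row in crop[margin:len(crop) - margin]
    ]
    total = sum(len(row) for row in inner)
    if total == 0:
        raise ValueError("crop is too small for the requested margin")
    dark = sum(1 for row in inner for pixel in row if pixel < dark_threshold)
    return dark / total


def is_ticked(crop, min_ratio=0.15, margin=1, dark_threshold=128):
    """Naive rule: the box is ticked if enough of its interior is dark."""
    return ink_ratio(crop, margin, dark_threshold) >= min_ratio


B, W = 0, 255  # black and white pixel intensities

empty_box = [
    [B, B, B, B, B],
    [B, W, W, W, B],
    [B, W, W, W, B],
    [B, W, W, W, B],
    [B, B, B, B, B],
]
ticked_box = [
    [B, B, B, B, B],
    [B, B, W, B, B],
    [B, W, B, W, B],
    [B, B, W, B, B],
    [B, B, B, B, B],
]
```

With these toy crops, `is_ticked(ticked_box)` is true and `is_ticked(empty_box)` is false. A fixed rule like this is fragile on real scans (stains, crossed-out ticks, strokes that overflow the box, misaligned crops), which is the kind of limitation that motivates a learned model.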

The checkbox analysis will serve as a pretext to discuss several general Machine Learning concepts:

(1/4) What are some good and bad ways to model an ML problem?
(2/4) How to fight dataset biases with synthetic data generation?
(3/4) How to assemble and validate various ML models?
(4/4) How can we explain and interpret our ML model predictions?

EfficientNet B3, one of the ConvNets featured in this blog post series

Enjoy the read!


Written by Zelros AI
