
Entity Matching: the secret sauce in people-centric AI

Capgemini
2020-06-02

What is Entity Matching and why is it important?

Entity matching, entity resolution and de-duplication are terms used to describe the transformation of entity records (normally people or organisations/groups) into a single version of the truth that can be used as input for machine learning models.

Entity Matching is needed for various reasons and here I’ll focus on its use for Artificial Intelligence (AI).

Entity Matching is essential for getting accurate predictions.  For example, a small change to how records are matched can double or halve the value of your target variable.  Any tuning to gain a 0.5% increase in AUC will be pointless without reliable Entity Matching.

Jump to the end for the recommended way to meet your Entity Matching needs.

Example

Predict the likelihood that an applicant will repay a loan

Typical matching example (Image source: D da Silva)

To build a predictive model, take the applicant’s characteristics and credit history and compare them with those of previous loan applicants using machine learning.  If the applicant is called John Smith and I want to run a credit check, how can I be sure that every “John Smith” in my data is the same person or a different one, when we rarely have reliable foreign keys or identifiers?  What about “J Smith”, “Smith, John” or “Jon Smith”?  I might consider other characteristics like date of birth or passport number, but now I need to account for different date formats, passport formats and, of course, the fact that my applicant has probably replaced passports in the past.  Also, what if they’re a fraudster who is deliberately altering or obscuring their information?

This is the kind of problem where effective Entity Matching is essential.

How to do Entity Matching

You can manually build matching rules.

This involves creating IF THEN ELSE statements that account for every possible scenario in your population of records to be matched.  You’ll want to use as many characteristics as possible to maximise correct matches (true-positives) whilst minimising false matches (false-positives).  You’ll use string re-formatting, Regular Expressions, look-up tables, translation and Soundex engines.
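As an illustrative sketch (the nickname look-up table and rules below are invented examples, not from any specific product), a deterministic rule set in Python might combine regular expressions and look-up tables like this:

    import re

    # Invented nickname look-up table (illustrative only)
    NICKNAMES = {"jon": "john", "jonathan": "john", "bill": "william"}

    def normalise_name(raw: str) -> str:
        """Apply simple deterministic rules to a person's name."""
        name = raw.strip().lower()
        # Rule: reorder "Smith, John" into "john smith"
        if "," in name:
            surname, _, forename = name.partition(",")
            name = f"{forename.strip()} {surname.strip()}"
        # Rule: drop punctuation and collapse whitespace
        name = re.sub(r"[^\w\s]", "", name)
        name = re.sub(r"\s+", " ", name).strip()
        # Rule: expand known nicknames via the look-up table
        return " ".join(NICKNAMES.get(p, p) for p in name.split())

    def names_match(a: str, b: str) -> bool:
        """Deterministic IF-THEN rule: exact match after normalisation."""
        return normalise_name(a) == normalise_name(b)

    print(names_match("Smith, John", "Jon Smith"))  # True
    print(names_match("J. Smith", "John Smith"))    # False - initials need another rule

Notice that every new variation (initials, typos, translated names) needs yet another explicit rule, which is why this approach scales poorly.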

Rules-based Entity Matching splits into Deterministic and Probabilistic, with the latter using fuzzy string similarity metrics (e.g. Levenshtein or other edit distances) and applying weights to different characteristics.
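A minimal probabilistic sketch, assuming hypothetical record fields and hand-picked weights (in practice the weights would be tuned on labelled data), might score candidate pairs like this:

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        """Fuzzy string similarity in [0, 1] (a stand-in for an edit-distance metric)."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Hand-picked field weights - illustrative only
    WEIGHTS = {"name": 0.5, "dob": 0.3, "postcode": 0.2}

    def match_score(rec_a: dict, rec_b: dict) -> float:
        """Weighted sum of per-field similarities between two records."""
        return (WEIGHTS["name"] * similarity(rec_a["name"], rec_b["name"])
                + WEIGHTS["dob"] * (1.0 if rec_a["dob"] == rec_b["dob"] else 0.0)
                + WEIGHTS["postcode"] * (1.0 if rec_a["postcode"] == rec_b["postcode"] else 0.0))

    a = {"name": "Jon Smith", "dob": "1980-01-31", "postcode": "SW1A 1AA"}
    b = {"name": "John Smith", "dob": "1980-01-31", "postcode": "SW1A 1AA"}
    print(match_score(a, b))  # ~0.97: above a chosen threshold, treat as a match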

This method is very straightforward, transparent and quick to implement.  It works very well in scenarios where there are a small number of possible differences in characteristics.

For example, there are only a few ways in which a UK postcode can be written, so UK postcode matching is a good use case for rules-based Entity Matching (a minimal sketch follows this paragraph).  However, this approach becomes very time-consuming and inaccurate as the number of differences grows or new differences regularly emerge.  Consider a matching engine for the customers of a retailer expanding into new geographies: the engine must be continually updated with new naming and address conventions, forming a cathedral of rules that becomes unmanageable (pictured below).
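Here is a rough sketch of the postcode case, using a deliberately simplified pattern (real UK postcodes have a few more edge cases than this regex covers):

    import re

    # Simplified UK postcode pattern: outward code, optional space, inward code
    POSTCODE = re.compile(r"^([A-Z]{1,2}\d[A-Z\d]?) ?(\d[A-Z]{2})$")

    def normalise_postcode(raw: str):
        """Canonicalise a UK postcode to 'OUTWARD INWARD', or None if invalid."""
        match = POSTCODE.match(" ".join(raw.upper().split()))
        return f"{match.group(1)} {match.group(2)}" if match else None

    print(normalise_postcode("sw1a1aa"))      # 'SW1A 1AA'
    print(normalise_postcode(" SW1A  1AA "))  # 'SW1A 1AA'
    print(normalise_postcode("not a code"))   # None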

Large corpus of matching rules (Image source: Medium.com)

When rules-based matching is unsuitable you could use machine learning.  This trains a model (in effect, generating the matching rules) from the input data provided, avoiding the need to build matching rules manually.  This method is extremely fast for building accurate Entity Matching engines (typical process shown below).  However, it has weaknesses: how do you obtain good training data, and how can you be sure that every possible permutation is captured in that training data?

Machine Learning workflow (Image source: D da Silva)
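As a rough sketch of this workflow, assuming scikit-learn is available and using invented, hand-labelled training pairs purely for illustration, a classifier can learn matching rules from per-field similarity features:

    from difflib import SequenceMatcher
    from sklearn.linear_model import LogisticRegression

    def features(a: dict, b: dict) -> list:
        """Per-field similarity features for a candidate record pair."""
        name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        dob_same = 1.0 if a["dob"] == b["dob"] else 0.0
        return [name_sim, dob_same]

    # Invented training pairs: 1 = match, 0 = non-match
    pairs = [
        ({"name": "John Smith", "dob": "1980-01-31"}, {"name": "Jon Smith", "dob": "1980-01-31"}, 1),
        ({"name": "Smith, John", "dob": "1980-01-31"}, {"name": "John Smith", "dob": "1980-01-31"}, 1),
        ({"name": "John Smith", "dob": "1980-01-31"}, {"name": "Jane Doe", "dob": "1975-06-02"}, 0),
        ({"name": "J Smith", "dob": "1990-12-25"}, {"name": "Joan Smythe", "dob": "1961-03-14"}, 0),
    ]
    X = [features(a, b) for a, b, _ in pairs]
    y = [label for _, _, label in pairs]

    model = LogisticRegression().fit(X, y)  # the model 'learns' the matching rules

    new_pair = ({"name": "Jonathan Smith", "dob": "1980-01-31"},
                {"name": "John Smith", "dob": "1980-01-31"})
    print(model.predict_proba([features(*new_pair)])[0][1])  # probability of a match

The classifier is only ever as good as its training pairs, which is exactly the weakness noted above.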

As such, the best enterprise Entity Matching solutions use a combination of the two approaches and:

  • Pre-load with existing knowledge – for reliably matching smaller datasets or skewed datasets and for speed improvements with larger datasets
  • Augment with known relationships – to improve matching, especially when you have limited or unreliable characteristics that regularly change
  • Built-in translation – for global coverage and for international populations. Not as simple as just using the Google Translate API as names have different conventions and meanings that aren’t easily translated between languages
  • Blocking strategy – essential for speed, countering the quadratic growth in pairwise comparisons as record volumes increase (see the sketch after this list)
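As an illustrative sketch of blocking (the blocking key below – surname initial plus postcode outward code – is an invented choice), candidate comparisons are restricted to records that share a block:

    from collections import defaultdict
    from itertools import combinations

    def blocking_key(record: dict) -> str:
        """Invented blocking key: surname initial + postcode outward code."""
        surname_initial = record["name"].split()[-1][0].upper()
        outward = record["postcode"].split()[0].upper()
        return f"{surname_initial}:{outward}"

    def candidate_pairs(records: list):
        """Yield only pairs sharing a block, instead of all n*(n-1)/2 pairs."""
        blocks = defaultdict(list)
        for rec in records:
            blocks[blocking_key(rec)].append(rec)
        for block in blocks.values():
            yield from combinations(block, 2)

    records = [
        {"name": "John Smith", "postcode": "SW1A 1AA"},
        {"name": "Jon Smith", "postcode": "SW1A 2BB"},
        {"name": "Jane Doe", "postcode": "EC1A 4EX"},
    ]
    # Only the two Smiths share a block; 'Jane Doe' is never compared
    for a, b in candidate_pairs(records):
        print(a["name"], "<->", b["name"])

The trade-off is that a badly chosen blocking key can hide true matches in different blocks, so the key itself needs care.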

How do you choose a commercial solution or build your own?

It’s not practical to build your own Entity Matching engine to rival commercial solutions, which have been developed over many years and with significant investment.  However, if you have a very well-understood and narrow population with a small number of simple differences (such as the UK postcode example earlier) then it may be cost-effective to build your own.  As a crude check, take one sheet of A4 paper and try to write all the rules that account for differences in characteristics.  If you can’t do this within 60 minutes then your use case is more complex.

Good news – this is a solved problem.  There are many excellent commercial off-the-shelf solutions.  Now your challenge is choosing the right one for your current and future use cases, IT landscape and budget.  When drafting your procurement business case, you must look beyond licence prices.  Consider set-up and maintenance effort, hardware requirements, tunability, extensibility, skills required, modularity and use of open standards (i.e. interfaces), stability of the company and/or size of the user community, matching speed, bulk/batch versus stream matching, etc.

Recommendation

To develop an Entity Matching solution to meet your needs:

  1. Investigate if your problem can be solved with rules-based matching (perhaps using the ‘A4 paper method’ above)
  2. Compare the Whole Life Cost (WLC) of implementing the different solutions versus the expected business value
    • Looking beyond just licence fees as discussed above
  3. Test / prototype your preferred solutions using (free) trials and your own data, not a manicured dataset provided by the vendor!

The above steps can be completed within a few weeks, taking you to a prototype stage with robust evidence to support your business case for the production solution.

Author


Dave da Silva

CTO for Data Science & AI

Dave works with clients to design and build useful, effective AI and Machine Learning solutions that meet their business needs.  He helps clients navigate the wide range of commercial and open-source solutions, and to see past the sales material by rapidly prototyping to guide future strategy and budgetary decisions.