“You have 60% fewer customers than you thought,” one of our analysts broke the news to a client’s CEO. It’s a sinking feeling, but duplicate customers are a real problem for many businesses.

In the days of face-to-face business, it was easier. You recognized regular customers and saw opportunity in new and unfamiliar faces. Now, even a business with a small online presence relies on tools to understand its customers. It’s an uncomfortable feeling when your tools are wrong.

The number of customers your business has is foundational to many business-critical calculations. Growth, churn rates, lifetime value and the true cost of customer acquisition all depend on it. Inaccuracies compound, and trusted dashboards begin to lie. Decisions are made on inaccurate analysis.

Duplicates cause problems for your customers too. Multiple marketing messages from the same campaign are quickly irritating. It’s difficult to reward loyalty or encourage growth when you don’t know who’s new and who’s not.



Strategies to avoid duplication involve a trade-off between ease of use and the friction introduced by collecting identifying information.

Most businesses, and the market-leading sales & CRM tools, rely on an email address to uniquely identify a customer. Double-opt-in email collection is the gold standard for a positive reputation (and good deliverability) in email marketing. It can also help prevent duplicate customer accounts, at the cost of a required “sign-up” process. Slow message delivery or emails lost to over-eager spam filters cost sales.

Double-opt-in comes with a cost — increased friction, particularly during checkout or conversion steps. You can choose to skip the confirmation step and rely on other identifying fields but options are limited.

Phone numbers can be a safer choice. Many people have multiple email accounts but only one cell phone number. The inconvenience of carrying two devices has encouraged a trend for a shared personal and business number. It’s not surprising home phones are decreasing in popularity too, with 52% of US adults having wireless service only.

A postal address can help, though conventions and formats differ from country to country more than you might expect. Depending on the service, customers may prefer to take delivery at their office. With almost 50% of 25-to-34-year-olds in the UK privately renting, expect addresses to change more than once over the lifetime of a multi-year customer.

No technique will completely prevent duplication, and there are diminishing gains as sign-up and checkout friction increases. The best approach is regular cleaning of your customer data, matching new customer records against your historical base.



Cleaning customer data requires taking a step back from your tools. Your CRM may differentiate by email address but what defines a unique customer for your business?

Products typically bought by households (Internet/TV services, furniture, home security) may have multiple people as their customer, with a lifetime value spanning multiple years and several delivery addresses. B2B sales often have a financial or billing contact alongside the consumer of the service, and either contact can change as staff move within or out of the company.

With a definition of a single customer that better fits your business, you can begin to build a process for cleaning your data.

This process must be both flexible and adaptable. Simple heuristics will have poor performance on real-world data: it’s unlikely you have many records with an exact match of name and address.


Human intelligence at scale

As humans, we’re excellent at identifying patterns. We quickly recognize when two customer records refer to the same person.


Recognizing duplicates among ten or twenty records is manageable, but even a few hundred records become a problem.

Machine learning may be a fashionable industry buzzword but the concepts and foundations come from decades-old research. We’ve seen a layered combination of rules-based and “human-in-the-loop” machine learning techniques produce excellent results.


Layered cleaning

For example, we can specify a rule to exclude salutations, punctuation, and less valuable terms (e.g., “Apartment” or “Building”), and to expand common abbreviations.
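A rule of this kind can be sketched in a few lines of Python. This is a minimal illustration, not a production rule set: the word lists, the `normalize` name, and the sample records are all hypothetical and would need tuning for your own data (for instance, "st" can also mean "saint").

```python
import re

# Hypothetical rule tables; tune these for your own data.
SALUTATIONS = {"mr", "mrs", "ms", "miss", "dr"}
LOW_VALUE_TERMS = {"apartment", "apt", "building", "bldg"}
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, drop noise words, expand abbreviations."""
    # Replace anything that is not a word character or whitespace with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = []
    for token in cleaned.split():
        if token in SALUTATIONS or token in LOW_VALUE_TERMS:
            continue  # salutations and low-value terms carry little signal
        tokens.append(ABBREVIATIONS.get(token, token))
    return " ".join(tokens)


# Two made-up records that a person would read as the same customer.
record_a = normalize("Mr. Jon Smith, Apt 4, 12 High St.")
record_b = normalize("John Smith, 4/12 High Street")
print(record_a)  # jon smith 4 12 high street
print(record_b)  # john smith 4 12 high street
```

The two normalized strings now differ by a single letter, which is far easier for a later matching step to handle than the raw input.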

After a rule like this is applied, two records for the same customer may still not match exactly, but there are fewer differences. But how to quantify fewer?

Levenshtein distance is a measure, devised by Vladimir Levenshtein in 1965, of how different two pieces of text are: the minimum number of single-character insertions, deletions, or substitutions needed to turn one into the other. It’s one of a whole suite of tools for making textual comparisons. Instead of demanding an exact letter-by-letter match, these measures reveal a “fuzzy” match, the kind of match humans do intuitively. In this instance, we get a Levenshtein distance of 7 (try for yourself with this online Levenshtein Distance calculator). We can use the values of these similarity measures as inputs when training a model.
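For readers who prefer code to a calculator, here is a small, dependency-free sketch of the standard dynamic-programming calculation. The `similarity` helper, which scales the distance to a 0-to-1 score, is our own illustrative addition rather than part of Levenshtein's definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the working row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a score between 0 (unrelated) and 1 (identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
```

A scaled score is often more convenient than the raw distance as a model input, because a distance of 3 means something very different for a five-letter name than for a full postal address.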

We’ve found that a training session of 100 carefully chosen pairs of records, in which a human sees two records side by side and says whether they are the same or different, can produce a model capable of cleaning a database of more than half a million users with high confidence.
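To make the idea of learning from human answers concrete, here is a deliberately tiny stand-in. It is not the model described above: it learns a single cut-off on one similarity score (Python's standard-library `SequenceMatcher`), whereas a real system would combine several measures in a proper classifier. The labelled pairs and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def fit_threshold(labelled_pairs):
    """Pick the score cut-off that agrees with the most human labels.

    labelled_pairs: iterable of (record_a, record_b, is_same) tuples.
    """
    scored = [(score(a, b), is_same) for a, b, is_same in labelled_pairs]
    best_threshold, best_correct = 1.0, -1
    # Only the observed scores need to be tried as candidate cut-offs.
    for threshold in sorted({s for s, _ in scored}):
        correct = sum((s >= threshold) == is_same for s, is_same in scored)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def is_duplicate(a: str, b: str, threshold: float) -> bool:
    return score(a, b) >= threshold


# Hypothetical answers from a reviewer shown two records side by side.
labelled = [
    ("john smith 12 high street", "jon smith 12 high street", True),
    ("john smith 12 high street", "john smith 12 high street", True),
    ("jane doe 3 mill lane", "jane do 3 mill lane", True),
    ("john smith 12 high street", "jane doe 3 mill lane", False),
    ("alice jones 9 park road", "bob brown 44 station road", False),
]

threshold = fit_threshold(labelled)
```

With only five near-identical examples the learned cut-off is very strict, which is exactly why the choice of training pairs matters: borderline cases teach the model far more than obvious ones.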


Getting started

If you suspect you have duplicates in your customer base, we recommend starting simple. Put together a small test framework: a subset of your users exported in a format that is convenient to work with (like CSV). Creating an environment with a fast feedback loop will allow you to tweak your model and validate results quickly.
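A starting point for such a framework might look like the sketch below. It assumes a CSV export with `id`, `name`, and `email` columns (hypothetical column names and made-up rows) and compares every pair of rows, which is fine for a small subset but grows quadratically with the number of records.

```python
import csv
import io
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical export; in practice, open your own CSV file instead.
SAMPLE_CSV = """id,name,email
1,John Smith,john.smith@example.com
2,Jon Smith,jsmith@example.com
3,Jane Doe,jane.doe@example.com
"""


def candidate_duplicates(rows, field, threshold):
    """Yield (id_a, id_b, score) for pairs whose field values look alike."""
    for a, b in combinations(rows, 2):
        score = SequenceMatcher(None, a[field].lower(), b[field].lower()).ratio()
        if score >= threshold:
            yield a["id"], b["id"], round(score, 2)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
matches = list(candidate_duplicates(rows, field="name", threshold=0.85))
print(matches)
```

Here the two "Smith" rows are flagged even though their email addresses differ, which is the kind of duplicate an email-keyed CRM would miss. Changing the field or the threshold and re-running takes seconds, giving you the fast feedback loop described above.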

We have experience across many clients and industries and can help you define and implement a deduplication process for your data.


Get in touch to find out more.