One of our analysts, Anthony, recently broke the news to a client’s CEO. “You have 40% fewer customers than you thought.” Ouch. Anthony had discovered duplicate customers in the client’s database. It’s a sinking feeling.

But this discovery wasn’t just a minor frustration, something the client could tackle “one day.”

Their bottom-line metrics, everything from business growth and churn rate to lifetime value and the true cost of customer acquisition, were, well, wrong.

Unfortunately, the existence of duplicate customers is a problem that affects a lot of online businesses, and it’s particularly prevalent in the world of ecommerce and SaaS.

According to a recent Gartner study, bad data could cost your business up to $15 million per year.

We’d hate to see you duped. So, here’s everything you need to know about duplicate customers, the revenue opportunities you’re leaving on the table, and our best tips for deduplication.


First, how do duplicate customers occur?

Back when businesses were mostly brick and mortar, it was easy to spot a fresh-faced customer walking into your store.

Now, most businesses use CRMs to uniquely identify new customers via their email addresses. This, unfortunately, is where the problem begins.  

For example, if you have an ecommerce site and you offer new customers a 10% discount to purchase a product, some customers will take advantage by signing up multiple times.

Great for them, not so good for you.

Figure: Same customer signing up for an offer twice on an ecommerce site

Or, sometimes customers simply forget which email ID they used when they made their first purchase or signed up for a service. Many people have multiple email IDs and use them interchangeably, with no particular motive. And finally, some customers simply mistype their email ID when checking out for an ecommerce purchase.


Duplicate customers warp mission-critical acquisition metrics

Duplicate customers can have a negative impact on your customer acquisition metrics and, ultimately, your growth. Let’s look at an example:

Imagine your CRM tells you that you have 100,000 customers and your average customer lifetime value (LTV) is $100. Based on your LTV, your target customer acquisition cost (CAC) is $33 (a good ballpark ratio for LTV to CAC is 3:1).

But what if half of your customer records were actually duplicates? You would, in fact, have 50,000 customers, and because the same revenue is spread across half as many people, their average LTV would be $200.

In this scenario, your company would be missing a huge opportunity to acquire more customers. Why? Because when your LTV metrics are wrong, your target CAC is also likely to be wrong. In this example, your target CAC would actually be about $67 (remember, a good ballpark ratio for LTV to CAC is 3:1). With a higher target CAC, you could increase your bids across relevant channels and campaigns, and acquire more of these profitable customers.
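The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the worked example above: the function names are our own for illustration, and the 3:1 ratio is the ballpark figure mentioned earlier, not a universal rule.

```python
def average_ltv(total_revenue, customer_count):
    """Average lifetime value per customer."""
    return total_revenue / customer_count


def target_cac(ltv, ltv_to_cac_ratio=3.0):
    """Target acquisition cost implied by an LTV:CAC ratio (3:1 assumed here)."""
    return ltv / ltv_to_cac_ratio


records_in_crm = 100_000
reported_ltv = 100.0
# Total lifetime revenue does not change when duplicates are merged.
total_lifetime_revenue = records_in_crm * reported_ltv

# Half of the records are duplicates, so only 50,000 real customers remain.
unique_customers = records_in_crm // 2
true_ltv = average_ltv(total_lifetime_revenue, unique_customers)

print(f"Reported:     LTV ${reported_ltv:.2f}, target CAC ${target_cac(reported_ltv):.2f}")
print(f"Deduplicated: LTV ${true_ltv:.2f}, target CAC ${target_cac(true_ltv):.2f}")
# Reported:     LTV $100.00, target CAC $33.33
# Deduplicated: LTV $200.00, target CAC $66.67
```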

The bottom line is that duplicate customers are costing you real, high-value customers.


Duplicate customers damage brand reputation

Consider this –

Every time you hit ‘send’ on a marketing message, your duplicate customers will receive multiple, mistimed messages. That’s annoying at best and spam-triggering at worst.

Your brand reputation and email sending reputation will both suffer.


Figure: Duplicate, mistimed messages are costly too


Can I prevent duplicate customers?

The short answer is no – there’s no way to completely prevent duplication. However, there are ways to reduce the number of duplicate customers you have. For example, you can require:

  • Double-opt-in email – but confirmation messages can be slow to deliver and may get lost to over-eager spam filters.
  • A phone number – but many customers are reluctant to share phone numbers.
  • A postal address – but addresses are often entered in different formats, and change frequently for multi-year lifetime customers.


If there’s no prevention, then what’s the cure?

The good news is there is a cure. You should regularly clean your customer data by matching new customer records against your historical base. This is not a straightforward task, as your records won’t be exact matches. For example, duplicate customers might format their details differently, like this:

Figure: Same customer, different address formats

As humans, we’re excellent at identifying patterns. We can quickly recognize that these two customer records refer to the same person. Manually identifying duplicates among ten or twenty records is achievable, but every record has to be compared against every other, so the work grows quickly: 300 records already means almost 45,000 possible pairs. Once you get into even the low hundreds of records, manual checking becomes impractical.

Machine learning may be a fashionable industry buzzword, but its concepts and foundations come from decades-old research. We’ve seen a layered combination of rules-based and “human-in-the-loop” machine learning techniques produce excellent results.
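To give a flavour of what a rules-based layer might look like, here is a minimal sketch that normalizes addresses before comparing them. It is an illustration rather than our production rules: the `ABBREVIATIONS` table and the `normalize_address` function are hypothetical, and a real rule set would be much larger and specific to your market.

```python
import re

# Hypothetical abbreviation table, for illustration only. Real data needs a far
# larger, locale-specific list ("St" can also mean "Saint", for example).
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue", "apt": "apartment"}


def normalize_address(raw):
    """Lowercase, replace punctuation with spaces, expand abbreviations, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", raw.lower())
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


first = normalize_address("12 Main St., Apt 4")
second = normalize_address("12 main street   apartment 4")
print(first)            # 12 main street apartment 4
print(first == second)  # True
```

Rules like these catch the easy cases cheaply, which leaves fuzzier techniques to deal with whatever is left.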

One technique we have explored is the Levenshtein distance, a measure devised by Vladimir Levenshtein in 1965 that counts the minimum number of single-character insertions, deletions and substitutions needed to turn one piece of text into another. The smaller the distance, the more similar the two texts. It’s one of a whole suite of tools for making textual comparisons. Instead of demanding an exact, letter-for-letter match, these measures reveal a “fuzzy” match, the kind of match humans make intuitively.
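For readers who want to see the idea in code, below is a small, dependency-free sketch of the Levenshtein distance using the standard dynamic-programming approach. The `similarity` helper, which scales the distance to a score between 0 and 1, is our own convenience for illustration and not part of the original definition.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("jon.smith@example.com", "john.smith@example.com"))  # 1
```

A mistyped email address that is one character away from an existing one is a strong hint, though not proof, that the two records belong to the same person.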

We’ve found that a training session in which a human reviews 100 carefully chosen pairs of records, marking each pair as the same or different, can produce a model capable of cleaning a database of more than half a million users with high confidence.
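The sketch below shows the general shape of learning from human-labelled pairs. It is deliberately simplistic and is not the model described above: it assumes a single similarity score per pair (here Python’s built-in `difflib.SequenceMatcher`, swapped in as a stand-in for a Levenshtein-style measure) and simply picks the score threshold that best agrees with the human labels. The labelled pairs are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Fuzzy similarity between 0 and 1 (standard-library stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical pairs a human has labelled: True means "same customer".
labeled_pairs = [
    ("Jane Doe, 12 Main St", "jane doe, 12 main street", True),
    ("Jon Smith, 4 Oak Rd", "John Smith, 4 Oak Road", True),
    ("Jane Doe, 12 Main St", "Jake Dole, 98 Pine Ave", False),
    ("Ann Lee, 7 Hill Ln", "Bob Ray, 31 Lake Dr", False),
]


def best_threshold(pairs):
    """Pick the similarity cut-off that agrees most often with the human labels."""
    scored = [(similarity(a, b), is_same) for a, b, is_same in pairs]

    def accuracy(threshold):
        hits = sum((score >= threshold) == is_same for score, is_same in scored)
        return hits / len(scored)

    candidates = sorted({score for score, _ in scored})
    return max(candidates, key=accuracy)


threshold = best_threshold(labeled_pairs)
for a, b, is_same in labeled_pairs:
    predicted = similarity(a, b) >= threshold
    print(f"{a!r} vs {b!r}: predicted same={predicted}, human said {is_same}")
```

A real setup would combine several fields and scores, and would route borderline pairs back to a human reviewer instead of trusting a single cut-off.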


Tips for getting started

If you or one of your technical teammates would like to give this a go, here are our tips for getting started:

  1. Define what a unique customer is for your business. Just because most CRMs differentiate by email address doesn’t mean you have to.
  2. Put together a small test framework and export a subset of your users into a simple format, like a CSV.
  3. Run your model in an environment with a fast feedback loop. This will allow you to tweak the rules of your model and validate results quickly.
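As a starting point, the three tips above could be sketched roughly as follows. Everything here is an assumption made for illustration: the column names, the inline sample data (standing in for a CSV exported from your CRM) and the choice of name plus postcode as the definition of a unique customer.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export; in practice you would open a CSV file from your CRM.
SAMPLE_CSV = """customer_id,name,email,postcode
1,Jane Doe,jane.doe@example.com,D02 X285
2,jane  doe,jdoe@example.org,d02x285
3,John Smith,john@example.com,T12 AB34
"""


def customer_key(row):
    """One possible definition of a unique customer: normalized name plus postcode."""
    name = " ".join(row["name"].lower().split())
    postcode = "".join(row["postcode"].lower().split())
    return (name, postcode)


def find_duplicate_groups(rows):
    """Group customer IDs that share a key; return only groups with more than one ID."""
    groups = defaultdict(list)
    for row in rows:
        groups[customer_key(row)].append(row["customer_id"])
    return [ids for ids in groups.values() if len(ids) > 1]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(find_duplicate_groups(rows))  # [['1', '2']]
```

Because the whole thing runs in well under a second on a small subset, you can change `customer_key`, rerun, and see straight away which groups appear or disappear.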

If you’d like further advice on how to get started, please let us know. We’d be happy to assist.


Reclaim your revenue, metrics and brand reputation

A deduplication process should never be a rainy day project – something to tick off the “business bucket list” one day. It’s something you need to address right now, so you can reclaim lost revenue opportunities and customer trust.

You might walk away with fewer “customers,” but they’ll be higher quality and happier.

P.S. If you’re stuck, we’d love to help you define and implement a deduplication process for your data. Get in touch to find out more.  

Brian White

VP of Engineering at Conjura
