Best Practices for Matching and Merging

Best Practices for Matching and Merging in CluedIn

Matching and merging is a core capability of CluedIn. It allows you to identify duplicate records across systems and consolidate them into a single golden record. A well-designed matching strategy ensures that the platform can reliably detect duplicates while maintaining high confidence in automated merges.

This article outlines the available approaches to matching in CluedIn and provides practical best practices for designing an effective matching strategy.

Approaches to Matching and Merging

CluedIn provides three primary approaches to identifying duplicates. These approaches can be used independently or together depending on the complexity and quality of your data.

1. Matching Rules (Deterministic and Probabilistic)

The most common approach to matching in CluedIn is through Matching Rules.

Matching rules allow you to define how records should be compared in order to determine if they represent the same entity. CluedIn supports both:

Deterministic matching – exact or strict comparisons
Probabilistic matching – similarity or fuzzy comparisons

You can create multiple matching rule groups, and CluedIn supports cascading matching.

This means:

The first rule group attempts to find matches.
If no match is found, CluedIn automatically attempts the next rule group.
This continues through all configured groups until a match is found or all groups have been evaluated.

This cascading approach allows you to start with highly confident rules and gradually fall back to broader matching logic.

Using Normalizers

Matching rules can also use Normalizers, which allow you to influence how values are compared without modifying the underlying data.

Examples include:

Ignoring casing differences
Standardizing whitespace
Normalizing formatting differences

For example:

John.Smith@Company.com

john.smith@company.com

With appropriate normalization, these values can be matched without requiring changes to the original data.

Normalizers allow you to improve matching accuracy while keeping your source data intact.

2. Matching Using Identifiers or Keys

Another approach is matching using identifiers or keys.

This is the simplest and most deterministic form of matching.

In this approach:

Records are matched based on a shared unique identifier
The identifier is typically sourced from upstream systems
Configuration is managed in the CluedIn mapping screens

Examples of identifiers include:

CRM IDs
Customer numbers
External system keys
Source system record IDs

When reliable identifiers exist, this method can produce extremely accurate matching with minimal configuration.

However, many real-world datasets do not always contain consistent identifiers, which is why rule-based or AI-driven matching may also be required.

3. AI Agents and the Find Duplicates Skill

CluedIn also supports AI-powered matching through AI Agents using the Find Duplicates skill.

This approach allows you to analyse large volumes of data without relying on traditional matching rules.

AI Agents use a rule-less matching approach, meaning they evaluate records based on semantic similarity rather than predefined rules.

This can identify potential duplicates that traditional rule-based methods would struggle to detect.

For example:

Value A	Value B
NASA	Space Company

Although these values are semantically related, a traditional rule-based approach based on string similarity would likely not detect them as related.

AI Agents can identify these kinds of relationships because they analyze meaning rather than just string distance.

It is important to understand that this approach does not replace rule-based matching. Instead, it complements it by identifying matches that rules may miss.

Best Practice: Clean Data Before Matching

One of the most effective but sometimes counterintuitive strategies when designing matching rules is:

Clean your data to fit your matching rules rather than creating increasingly complex rules to fit messy data.

Many teams initially attempt to solve data quality issues by building large numbers of matching rules. However, this often leads to:

Complex rule configurations
Higher maintenance overhead
Increased risk of incorrect matches

Instead, a better approach is to standardize and clean the data first, and then apply simpler matching rules.

CluedIn provides CluedIn Clean to support this process.

Examples of data cleaning include:

Standardizing phone number formats
Normalizing company names
Removing inconsistent casing
Aligning address structures
Removing punctuation or formatting inconsistencies

Benefits of Cleaning Data First

Cleaning your data before matching provides several advantages.

1. Improved Data Quality

The most obvious benefit is that your data becomes more consistent and easier to use across the platform.

2. Simpler Matching Rules

Once data is normalized, you can often use deterministic matching rules rather than fuzzy comparisons.

For example, instead of fuzzy matching:

Company Name ~ Company Name

You may be able to use deterministic matching:

Normalized Company Name = Normalized Company Name

3. Increased Automation

Deterministic rules are typically more reliable and easier to automate. This can reduce manual review and improve overall trust in the matching process.

Best Practice: Use AI Matching to Find Additional Candidates

AI-based matching works particularly well as a secondary layer of duplicate detection.

Because AI Agents analyze semantic relationships, they can identify connections between records that would not normally be detected through rule-based matching.

This is especially useful for:

Inconsistent naming conventions
Indirect entity relationships
Organizations with multiple naming variations
Industry or semantic similarities

For example:

Record A	Record B
NASA	Space Company

Although these values have little string similarity, they clearly represent related concepts. AI-based matching can detect these types of relationships.

This approach is particularly useful for:

Discovering hidden duplicates
Generating candidate matches for review
Augmenting rule-based matching strategies

Rather than replacing rules, AI Agents provide an additional technique that can surface unique matches that are otherwise difficult to find.

Summary

CluedIn supports three complementary approaches to matching and merging:

Approach	Description
Matching Rules	Deterministic and probabilistic rules with cascading logic
Identifier Matching	Matching based on unique system identifiers
AI Agents	Semantic matching using the Find Duplicates skill

To maximize matching accuracy:

Clean and standardize data before creating complex rules
Use deterministic rules wherever possible
Leverage identifiers when available
Use AI Agents to uncover additional matches that rules cannot detect

Combining these approaches allows organizations to build a robust and scalable matching str