Deduplication reference

Matching rules

Matching rules allow you to set up complex logic for detecting duplicates. In the deduplication project, matching rules are combined using the OR logical operator, while matching criteria within a rule are combined using the AND logical operator. So, if you want a record to be identified as a duplicate if it meets at least one of your criteria, then you need a separate rule for each criteria. On the other hand, if you want a record to be identified as a duplicate if it meets all of your criteria, then you need one rule that includes all of your criteria.

For example, in the following configuration, either records with the same email address or records with the same first and last name are identified as duplicates.

Matching functions

Matching functions specify how values of a vocabulary key or property should be compared. The following table provides the description of matching functions.

Function	Description
Equals	Identifies records as duplicates if two values are the same.
Fuzzy match - Sift4	Detects approximate duplicates using a string similarity algorithm. This algorithm takes into account the number of matching characters and the transpositions (swaps) needed to make the strings identical. For example, `algorithm` and `algorrithm` would be identified as fuzzy matches. In this method, you have to specify the threshold— a parameter that sets the maximum number of characters by which two strings can differ and still be considered a match.
Fuzzy match - phonetic, DoubleMetaphone	Detects approximate duplicates using a phonetic matching algorithm. This algorithm encodes words into a phonetic representation, allowing for matching based on pronunciation rather than spelling. For example, `night` and `knight` would be identified as fuzzy matches.
Contains	Matches two values if one contains the other.
Starts with	Matches two values if one starts with the other.
Ends with	Matches two values if one ends with the other.
First token equals	Extracts the first word from both values and compares for exact match ignoring casing.
Email	Matches two email address values if they are semantically equal.

Watch the following video where we provide an explanation of matching functions along with practical examples.

Normalization rules

Normalization rules clean up values before comparing them to identify duplicates. The following table provides the description of normalization rules.

Normalization rule	Description
To lowercase	Converts values to lover case.
Trim whitespace	Removes whitespace characters from values.
Remove punctuation	Removes punctuation marks from the values. For example, `CluedIn.!?,` will be converted to `CluedIn`.
Remove special characters	Removes special characters from the values. For example, `CluedIn”#€%&(/)€` will be converted to `CluedIn`.
Remove diacritic marks	Removes diacritic marks from the values. For example, `spăn'ĭsh` will be converted to `spanish`.
Person name variations	Generates person name variations. For example, `Timothy Daniel Ward` will be converted to `Tim D. Ward`, `Tim Ward`.
First name nickname variants	Generates nickname variations for the person’s first name. For example, `Timothy` will be converted to `Tim`, `Timmy`.
Organization name	Normalizes organization name. For example, `CluedIn ApS` will be converted to `CluedIn`.
Phone number	Removes well-known phone number formatting characters. For example, `+45-123 456 789 (001)` will be converted to `45123456789001`.
Email mask	Normalizes email address to match other variants.
Street address	Normalizes common street name tokens.
Geography country	Converts country to ISO code. For example, `Australia` will be converted to `AU`.
Replace text	Replaces text using configured pattern. You need to enter both the value to be replaced and the replacement value.
Regex replace text	Replaces text using configured regex pattern.
Split text	Splits text using configured separator.
Regex split text	Splits text using configured regex separator pattern.
Split text by whitespace	Splits text by whitespace into tokens.
Transliterate	Transliterates text value to an ASCII string. For example, `Ελληνική Δημοκρατία` will be converted to `Ellēnikē Dēmokratia`.

Watch the following video where we provide an explanation of normalization rules along with practical examples.

Street address normalization

Street address normalization converts input addresses into standardized versions before comparing them to identify duplicates.

For example, let’s consider two addresses: 123 Fantasy strabe and 123 Fantasy street. When street address normalization is selected, any time CluedIn sees strabe, strasse, street, or str, it turns it into [STR]. If the only thing that differs in an address is the way street is spelled, then the addresses will match.

Input	Normalized version
`strabe`, `strasse`, `street`, `str\.?`, `st`	`[STR]`
`Allee`, `aly`	`[ALY]`
`Boulevard`, `BLVD`	`[BLVD]`
`Lane`, `Ln`	`[LANE]`
`Drive`, `Dr`	`[DRIVE]`
`avenue`, `ave`	`[AVE]`
`University`, `univ`	`[UNIVERSITY]`
`West`, `w`	`[WEST]`
`North`, `n`	`[NORTH]`
`East`, `e`	`[EAST]`
`South`, `s`	`[SOUTH]`
`Floor`, `flr`	`[FLOOR]`
`First`, `1st`	`[1ST]`
`Second`, `2nd`, `2rd`	`[2ND]`
`Third`, `3rd`	`[3RD]`
`Fourth`, `4th`	`[4TH]`
`Fifth`, `5th`	`[5TH]`
`p\.?\so\.?\s+box`, `post\sbox`, `post\sboks`, `GPO\sBox`	`[POBOX]`
`Private\sbag`, `Locked\sBag`, `Private\sMail\sBag`, `Locked\sBag`, `PMB`	`[PMB]`
`Plaza`, `PLAZ`, `PLZ`	`[PLZ]`
`Suite`, `STE`	`[STE]`
`Trailer`, `TRLR`	`[TRLR]`
`Apartment`, `APT`	`[APT]`
`Basement`, `BSMT`	`[BSMT]`
`Building`, `BLDG`	`[BLDG]`
`Department`, `DEPT`	`[DEPT]`
`Hanger`, `HNGR`	`[HNGR]`
`Lobby`, `LBBY`	`[LBBY]`
`Lower`, `LOWR`	`[LOWR]`
`Office`, `OFC`	`[OFC]`
`Penthouse`, `PH`	`[PH]`
`Space`, `SPC`	`[SPC]`
`Level`, `LVL`	`[LEVEL]`
`Highway`, `Highwy`, `Hiway`, `Hiwy`, `Hway`, `Hwy`	`[HWY]`

Deduplication project

This section contains reference information about deduplication project statuses.

Project statuses

The following table provides descriptions of deduplication project statuses.

Status	Description
Requires configuration	The deduplication project has been created, no matching rules have been added yet.
Ready to generate	The deduplication project has been configured, no matches have been generated yet. This status also appears when you discard matches at any further stage of the project.
Generating	CluedIn is analyzing the specified set of golden records with the aim to detect duplicates based on your matching rules.
Aborting generation	The process of generating matches of the deduplication project is being cancelled. The status will shortly change to Ready to generate.
Ready for review	CluedIn has generated matches of the deduplication project, you can start processing groups of duplicates.
Committing	CluedIn is merging records from selected groups. This status is applicable whether you are merging a single group, multiple groups, or all groups in the project.
Merged	CluedIn has merged records from all groups in the project. If the project contains groups that were not selected for merge, the project status remains Ready for review.
Unmerging	CluedIn is reverting the results of merge. When records are unmerged, the project status becomes Ready for review.
Abort unmerging	The process of reverting changes is being cancelled. The status will shortly change to Ready to generate.

Project status workflow

The following diagram shows the deduplication project workflow along with its statuses and main activities.

Group of duplicates

This section contains reference information about the statuses of a group of duplicates.

Group statuses

The following table provides descriptions of the statuses of groups of duplicates.

Status	Description
New	The group with potential duplicates has been generated. This status is also applicable when you revoke the group’s approval.
Approved	The group has been approved and it is ready for merge. If you change your mind, you can reject the group. If you are uncertain about the selection of values in the group, you can revoke your approval and start processing the group from scratch.
Rejected	The group has been rejected and it cannot be merged. If you change your mind, you can approve the group.
Merge Committed	The group has been merged. When all groups in the project have this status, the status of the project becomes Merged. If you are not satisfied with the merged golden record, you can revert the changes by unmerging the records.
Unmerged	The changes made to the golden record through merging have been reverted, restoring duplicate records to their previous state before the merge.

Group status workflow

The following diagram shows the group of duplicates workflow along with its statuses and main activities.

Deduplication project audit log actions

Whenever some changes or actions are made in the deduplication project, they are recorded and can be found on the Audit Log tab. These actions include the following:

Create a deduplication project
Add users to owners
Update a deduplication project
Create a deduplication rule
Update a deduplication rule
Activate a matching rule
Deactivate a matching rule
Generate matches
Discard matches
Manual conflict resolution
Reset manual conflict resolutions
Remove entity from a group
Approve one group
Approve multiple groups
Approve all groups
Remove (revoke) approval from all groups
Remove (revoke) approval from groups (one or multiple groups)
Reject groups (one or multiple groups)
Merge groups (one or multiple selected groups)
Merge approved groups
Undo merged entities
Undo merge groups
Abort undo
Cancel generating matches
Cancel merge
Archive a deduplication project