Deduplicate data

The deduplication process helps you find and merge duplicate records based on a set of rules that you define. CluedIn then automatically identifies the changes and updates the stream with the deduplicated records.

In this article, you will learn how to deduplicate the data that you have ingested into CluedIn and streamed to a Microsoft SQL Server database.

Deduplicating the data in CluedIn involves creating a deduplication project, configuring the matching rules for identifying duplicates, and fixing duplicates.

Prerequisites

Before proceeding with the data deduplication process, ensure that you have completed the following tasks:

  1. Ingested some data into CluedIn. For more information, see Ingest data.

  2. Created a stream that keeps the data synchronized between CluedIn and the Microsoft SQL Server database. For more information, see Stream data.

Create deduplication project

As a first step, create a deduplication project. The project lets you check for duplicates among records of a specific entity type.

To create a deduplication project

  1. On the navigation pane, select Management. Then, select Deduplication.

  2. On the Actions dashboard, select Deduplication.

    dedup-1.png

  3. Select Create Deduplication Project.

  4. On the Create Deduplication Project pane, do the following:

    1. Enter the name of the deduplication project.

    2. Select the entity type that you want to use as a filter for all records.

      dedup-2.png

    3. In the lower-right corner, select Create.

    You created the deduplication project.

    dedup-3.png

    Now, you can proceed to define the rules for checking duplicates within the selected entity type.

Configure matching rule

When creating a matching rule, you need to specify certain criteria. CluedIn uses these criteria to check for matching values among records belonging to the selected entity type.
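To make the idea of a matching rule concrete, the following Python sketch groups records whose values for one key are equal after simple normalization. It is only an illustration and not CluedIn's implementation: the function name `find_duplicate_groups`, the key `customer.email`, the sample records, and the choice of "trimmed, case-insensitive equality" as the matching function are all assumptions made for this example.

```python
from collections import defaultdict


def find_duplicate_groups(records, key, normalize=lambda v: v.strip().lower()):
    """Group records whose value for `key` is equal after normalization.

    Hypothetical helper for illustration; not part of CluedIn.
    """
    groups = defaultdict(list)
    for record in records:
        value = record.get(key)
        if value is None:
            continue  # a record without the key cannot match on it
        groups[normalize(value)].append(record)
    # Keep only groups that contain two or more records (actual duplicates).
    return {name: members for name, members in groups.items() if len(members) > 1}


# Made-up sample data; the key name mimics a vocabulary key.
records = [
    {"id": 1, "customer.email": "Ann@example.com", "customer.city": "Oslo"},
    {"id": 2, "customer.email": "ann@example.com ", "customer.city": "Bergen"},
    {"id": 3, "customer.email": "bob@example.com", "customer.city": "Oslo"},
]

duplicate_groups = find_duplicate_groups(records, "customer.email")
for group_name, members in duplicate_groups.items():
    print(group_name, [m["id"] for m in members])
```

In this sketch, each group is named after the shared normalized value, which loosely mirrors how a group of duplicates is named after the value of the vocabulary key used in the matching rule.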

To configure a matching rule

  1. Go to the Matching Rules tab and select Add Matching Rule.

    The Add Matching Rule pane opens on the right side of the page.

  2. On the Matching Rule Name tab, enter the name of the matching rule, and then select Next.

    dedup-4.png

  3. On the Matching Criteria tab, do the following:

    1. Enter the name of the matching criteria.

    2. Select the vocabulary key. All values associated with this vocabulary key will be checked for duplicates.

    3. In the Matching Function dropdown list, select the method for detecting duplicates.

      dedup-5.png

    4. In the lower-right corner, select Next.

  4. On the Preview tab, review the defined matching criteria.

    dedup-6.png

    If you want to add more matching criteria to the rule, select Add Matching Criteria.

  5. After you have added the needed matching criteria, in the lower-right corner of the Preview tab, select Add Rule.

    The status of the deduplication project becomes Ready to generate.

    dedup-7.png

  6. In the upper-right corner, select Generate Results. Then, confirm that you want to generate the results for the deduplication project.

    The process of generating results may take some time.

    After the process is completed, you will receive a notification. If duplicates are detected, the results are displayed on the page, organized into groups of records that match your criteria. For example, in the following screenshot, the group consists of two duplicates. The name of the group corresponds to the value of the vocabulary key from the matching rule.

    dedup-8.png

    Now, you can proceed to fix the duplicates.

Fix duplicates

The process of fixing duplicates involves reviewing the values from duplicate records and selecting which values you want to merge into the deduplicated record.

To fix duplicates

  1. Select the name of the group.

    The Fix Conflicts tab opens. Here, you can view the details of the duplicate records. In the Conflicting section, you can find the properties that have different values in the duplicate records. In the Matching section, you can find the properties that have the same values in the duplicate records.

  2. In the Conflicting section, select the values that you want to merge into the deduplicated record.

    dedup-9.png

  3. In the upper-right corner of the page, select Next.

    The Preview Merge tab opens. Here, you can view the values that will be merged into the deduplicated record.

    dedup-10.png

  4. In the upper-right corner of the page, select Approve. Then, confirm that you want to approve your selection of values for the group.

  5. Select the checkbox next to the group name. Then, select Merge.

    dedup-11.png

  6. Confirm that you want to merge the records from the group:

    1. Review the group that will be merged and select Next.

    2. Select how the merge should be handled if more recent data becomes available for the entity. Then, select Confirm.

      dedup-12.png

      The process of merging data may take some time.

      After the process is completed, you will receive a notification. The duplicate records are now merged into one record.

    You fixed the duplicate records.
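The conflict-resolution steps above can be pictured with a short Python sketch. It splits the properties of a duplicate group into matching and conflicting ones, then builds the merged record from the values chosen for each conflict. This is a simplified model under stated assumptions, not how CluedIn performs merges: the helper names `split_properties` and `merge_records`, the property names, and the sample values are hypothetical.

```python
def split_properties(records):
    """Return (matching, conflicting) properties for a group of duplicates.

    Hypothetical helper for illustration; not part of CluedIn.
    """
    matching, conflicting = {}, {}
    keys = sorted({key for record in records for key in record})
    for key in keys:
        values = [record.get(key) for record in records]
        if all(value == values[0] for value in values):
            matching[key] = values[0]  # same value in every record
        else:
            conflicting[key] = values  # at least one record differs
    return matching, conflicting


def merge_records(records, chosen):
    """Build one record from matching values plus the chosen conflict values."""
    matching, conflicting = split_properties(records)
    merged = dict(matching)
    for key, candidates in conflicting.items():
        if key not in chosen:
            raise ValueError(f"No value chosen for conflicting property: {key}")
        if chosen[key] not in candidates:
            raise ValueError(f"Chosen value for {key} is not in the duplicates")
        merged[key] = chosen[key]
    return merged


# Made-up duplicate group with one conflicting property.
duplicates = [
    {"email": "ann@example.com", "city": "Oslo", "phone": "555-0100"},
    {"email": "ann@example.com", "city": "Bergen", "phone": "555-0100"},
]

matching, conflicting = split_properties(duplicates)
merged = merge_records(duplicates, {"city": "Oslo"})
print(matching, conflicting, merged)
```

The sketch refuses to merge until every conflicting property has a chosen value, which corresponds to reviewing and approving the selection before merging.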

All changes to the data records in CluedIn are tracked. To see how records were merged through the deduplication process, search for the data record and open the Topology pane, which shows a visual representation of the merged records.

Results

You have performed data deduplication in CluedIn.

Next steps