How to tag records with data quality issues
In this article, you will learn how to tag records with data quality issues using data part rules and CluedIn Copilot. We will use an invalid email address as an example of a data quality issue.
To begin with, we have ingested, mapped, and processed a file containing contact data. Some records include invalid email addresses. Note that the email addresses in rows 1–3 violate common email address formatting rules.
To tag records with such data quality issues, create a data part rule and add an action that tags records whose email value does not match the acceptable pattern of a common regular expression (for example, ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$). You can use the Conditions section of the action to specify the acceptable pattern for the vocabulary key value.
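Before creating the rule, it can help to try the pattern locally and see which values it would flag. The following Python sketch is only an illustration and is not part of CluedIn: the function name tags_for_email and the sample values are hypothetical, and the tag text is borrowed from the Copilot prompt used later in this article.

```python
import re

# The example pattern from this article, compiled once for reuse.
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$")


def tags_for_email(email: str) -> list[str]:
    """Return the tags a rule like the one described above might add.

    Hypothetical helper: it only mimics the rule logic locally and does
    not call any CluedIn API.
    """
    # fullmatch requires the whole value to fit the pattern.
    if EMAIL_PATTERN.fullmatch(email) is None:
        return ["Invalid email format"]
    return []


if __name__ == "__main__":
    # Made-up sample values, not taken from the ingested file.
    samples = [
        "jane.doe@example.com",   # fits the pattern
        "john.smith@example",     # no top-level domain
        "@example.com",           # empty local part
        "mary@@example.com",      # two @ signs
    ]
    for value in samples:
        print(value, tags_for_email(value))
```

Note that this pattern limits the top-level domain to 2 to 6 letters, so a valid address with a longer top-level domain would also be tagged. Adjust the pattern if your data contains such addresses.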
Alternatively, you can use CluedIn Copilot to create a data part rule. First, ask Copilot to generate a common regular expression that checks for a valid email format. For example, you can use the following prompt.
Write a common regular expression that would check for valid email format and shortly describe it.
Then, instruct Copilot to create a data part rule using a prompt similar to the following.
Create a data part rule named "User email format validation" for the TrainingContact entity type. This rule should use CluedIn AI action on the trainingcontact.email vocabulary key to tag with "Invalid email format" any email with a pattern that doesn't match the above regex.
Next, activate and re-process the rule. To verify that the rule has been applied, go to search and use the Tags filter.
As a result, you will see all records where the email value is in an invalid format. Each of these records contains the corresponding tag.
Once the records with data quality issues are tagged, you can create a clean project to fix such issues. To do this, in the upper-right corner of the search results page, open the three-dot menu, and select Clean.