Implement Data Quality Management

Data Quality in CluedIn — Step‑by‑Step Implementation Guide

A Data Quality (DQ) program in CluedIn gives you measurable, automated, and auditable control over your data. This guide walks you through: define scope → profile → standardize → validate → score → steward → monitor → publish.


Outcomes

  • Clear DQ policy (dimensions, rules, thresholds, SLAs).
  • Automated profiling, validation, and standardization with rules.
  • Scorecards & dashboards per entity/attribute and per source.
  • Stewardship workflows for remediation with full audit & rollback.
  • Continuous monitoring and alerts on regressions.

Prerequisites

  • Source access to data domains in scope (e.g., Customers, Products, Suppliers, Orders).
  • Business‑agreed definitions for DQ dimensions and pass/fail thresholds.
  • RACI: Stewards/Owners must have Accountable permissions.
  • A dedicated server/VM for heavy profiling, matching, and toolkit operations.

Step 1 — Define Scope & Quality Policy

Portal: Governance → Policies

  1. Pick entities/attributes in scope (e.g., Customer.Email, Product.GTIN, Address.Country).
  2. Choose dimensions & thresholds (examples):
    • Completeness: Email ≥ 98%, Phone ≥ 90%
    • Validity: GTIN checksum pass ≥ 99.5%
    • Uniqueness: unique CustomerID = 100%
    • Consistency: Country uses ISO‑2 = 100%
    • Timeliness: LastUpdated within 30 days for ≥ 95% of records
  3. Document SLAs and owners per rule (who fixes what, by when).
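
It helps to capture the policy as data from day one so that rules, scorecards, and alerts all read from one source of truth. The sketch below (Python) is illustrative only; the keys, field names, and values are assumptions to be replaced by your agreed policy, not a CluedIn schema.

```python
# Illustrative policy-as-data: one source of truth that rules, scorecards,
# and alerts can all read. Keys and fields are assumptions, not a CluedIn schema.
DQ_POLICY = {
    "Customer.Email":  {"dimension": "Completeness", "threshold": 0.98,  "owner": "CRM Steward", "sla_days": 5},
    "Product.GTIN":    {"dimension": "Validity",     "threshold": 0.995, "owner": "PIM Steward", "sla_days": 3},
    "CustomerID":      {"dimension": "Uniqueness",   "threshold": 1.00,  "owner": "MDM Owner",   "sla_days": 1},
    "Address.Country": {"dimension": "Consistency",  "threshold": 1.00,  "owner": "MDM Owner",   "sla_days": 5},
    "LastUpdated":     {"dimension": "Timeliness",   "threshold": 0.95,  "owner": "Data Ops",    "sla_days": 7},
}
```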

Step 2 — Connect & Ingest Sources

Portal: Data Sources → Add Source

  • Add CRM/ERP/PIM/e‑commerce/finance/file feeds as required.
  • Capture provenance: SourceSystem, SourceRecordID, IngestedAt.
  • Start with sample loads, then scale.

Step 3 — Baseline Profiling

Portal: Data Quality → Profiling

  • Run profiling per entity to see: null counts, distinct counts, pattern distributions, min/max/avg lengths, outliers.
  • Export the baseline as your "before" snapshot.
  • Tag anomalies: DataDrift, OutlierPattern, UnexpectedNulls.
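
If you also want a reproducible baseline outside the portal, a small script over a sample extract can produce the same profile metrics. A minimal sketch with pandas, assuming a CSV extract (the file names are hypothetical):

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Baseline profile per column: null share, distinct count, length stats."""
    rows = []
    for col in df.columns:
        lengths = df[col].astype("string").dropna().str.len()
        rows.append({
            "column": col,
            "null_pct": round(df[col].isna().mean(), 4),
            "distinct": df[col].nunique(dropna=True),
            "min_len": lengths.min(),
            "max_len": lengths.max(),
            "avg_len": round(lengths.mean(), 1) if len(lengths) else None,
        })
    return pd.DataFrame(rows)

# Hypothetical extract; keep the result as the "before" snapshot.
baseline = profile(pd.read_csv("customers_sample.csv"))
baseline.to_csv("dq_baseline_customers.csv", index=False)
```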

Step 4 — Standardization (Normalize Inputs)

Portal: Governance → Rules (Data Part Rules)
Create normalization rules before validation so that failures reflect real quality issues rather than formatting noise:

  • Names & Text: trim, case normalization, strip HTML.
  • Addresses: parse line components; normalize country codes (ISO‑2); optional geocoding.
  • Phones: convert to E.164; tag InvalidPhone on failure.
  • Emails: lower‑case, canonicalize dots for known providers (policy‑dependent).
  • Identifiers: remove spaces/dashes; compute checks (e.g., VAT/ABN/GTIN).
  • Units: convert to canonical (e.g., cm/kg) and tag UnitNormalized.

Recommended tags: Standardized, NeedsValidation, InvalidFormat, LegacySource.
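
In CluedIn these normalizations are configured as Data Part Rules in the portal, but the underlying logic is straightforward. A minimal sketch in Python, with a deliberately trimmed country set and naive phone handling (a production setup would use a full parser such as libphonenumber):

```python
import re

ISO2 = {"US", "GB", "DE", "FR", "AU"}                      # trimmed for the example
COUNTRY_ALIASES = {"UK": "GB", "U.S.": "US", "USA": "US"}  # extend per your sources

def standardize(record: dict) -> dict:
    """Normalize a record in place and collect tags; field names are illustrative."""
    tags = {"Standardized"}
    # Names & text: strip rudimentary HTML, collapse whitespace, trim.
    name = re.sub(r"<[^>]+>", "", record.get("Name", ""))
    record["Name"] = re.sub(r"\s+", " ", name).strip()
    # Emails: lower-case; provider dot canonicalization is policy-dependent, omitted.
    record["Email"] = record.get("Email", "").strip().lower()
    # Phones: naive E.164-ish cleanup; production code would use a real parser.
    phone = re.sub(r"[^\d+]", "", record.get("Phone", ""))
    if phone and not re.fullmatch(r"\+\d{8,15}", phone):
        tags.add("InvalidPhone")
    record["Phone"] = phone
    # Countries: upper-case, then map aliases into the ISO-2 set.
    country = record.get("Country", "").strip().upper()
    record["Country"] = COUNTRY_ALIASES.get(country, country)
    record["Tags"] = sorted(tags)
    return record
```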


Step 5 — Validation Rules (Attribute‑Level)

Portal: Governance → Rules (Data Part Rules)
Express the pass/fail logic per attribute and tag failures:

  • Email Validity: regex + MX (if available) → tag InvalidEmail.
  • Phone Validity: E.164 parse → tag InvalidPhone.
  • Country Consistency: enforce ISO‑2 set; map aliases; tag UnknownCountry.
  • Date Validity: reject future dates and dates outside the policy window; tag InvalidDate.
  • Identifier Checks: checksum (VAT/ABN/GTIN/IBAN); tag InvalidTaxID, InvalidGTIN.

Keep rules simple & modular; prefer many small, focused rules over one complex rule.
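
A sketch of the pass/fail logic for three of the rules above (email, date, GTIN). The regex is deliberately permissive, per the pitfall noted later; the GS1 mod-10 checksum is standard, while the record fields are illustrative:

```python
import re
from datetime import date

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately permissive

def gtin_checksum_ok(gtin: str) -> bool:
    """GS1 mod-10 check: weights 3/1 alternate leftwards from the check digit."""
    if not gtin.isdigit() or len(gtin) not in (8, 12, 13, 14):
        return False
    digits = [int(d) for d in gtin]
    total = sum(d * (3 if i % 2 == 0 else 1)
                for i, d in enumerate(reversed(digits[:-1])))
    return (10 - total % 10) % 10 == digits[-1]

def validate(record: dict) -> set:
    """Return failure tags; assumes BirthDate is a datetime.date (illustrative)."""
    tags = set()
    if record.get("Email") and not EMAIL_RE.match(record["Email"]):
        tags.add("InvalidEmail")
    if record.get("BirthDate") and record["BirthDate"] > date.today():
        tags.add("InvalidDate")
    gtin = re.sub(r"\D", "", record.get("GTIN", ""))
    if gtin and not gtin_checksum_ok(gtin):
        tags.add("InvalidGTIN")
    return tags
```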


Step 6 — Entity‑Level Constraints

Portal: Entity Matching / Rules

  • Uniqueness: assert unique keys (CustomerID, SKU). On conflict, tag DuplicateKey and route to stewardship.
  • Referential Integrity: child → parent exists (e.g., Order.CustomerID must exist). Tag OrphanReference.
  • Cross‑field Consistency: e.g., State valid for Country, EndDate ≥ StartDate.
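
These constraints are also easy to express against extracts, which is useful for UAT cross-checks. A sketch with pandas, assuming CustomerID keys and parsed StartDate/EndDate columns (column names are illustrative):

```python
import pandas as pd

def entity_checks(customers: pd.DataFrame, orders: pd.DataFrame) -> dict:
    """Uniqueness, referential integrity, and cross-field checks on extracts.
    Assumes CustomerID keys and parsed StartDate/EndDate date columns."""
    dup_keys = customers[customers.duplicated("CustomerID", keep=False)]
    orphans = orders[~orders["CustomerID"].isin(customers["CustomerID"])]
    bad_ranges = orders[orders["EndDate"] < orders["StartDate"]]
    return {
        "DuplicateKey": dup_keys.index.tolist(),      # route to stewardship
        "OrphanReference": orphans.index.tolist(),
        "InvalidDateRange": bad_ranges.index.tolist(),
    }
```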

Step 7 — Compute Scores & Scorecards

Portal: Data Quality → Scorecards (or implement via Rules)
Define per‑attribute and per‑entity scores; an example weighting is sketched after this list.

  • Store Score_AttributeX, Score_Entity, Score_Source.
  • Tag entities by band: DQ_Gold (≥ 0.95), DQ_Silver (0.90 to < 0.95), DQ_Bronze (< 0.90).
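
A sketch of the scoring and banding. The weights below are purely illustrative (agree real values with the business; CluedIn does not prescribe them):

```python
# Illustrative weights only; agree real values with the business.
WEIGHTS = {"completeness": 0.4, "validity": 0.4, "uniqueness": 0.2}

def entity_score(dimension_scores: dict) -> float:
    """Weighted average of per-dimension scores, each in [0, 1]."""
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)

def band(score: float) -> str:
    if score >= 0.95:
        return "DQ_Gold"
    if score >= 0.90:
        return "DQ_Silver"
    return "DQ_Bronze"

# entity_score({"completeness": 0.99, "validity": 0.97, "uniqueness": 1.0})
# -> 0.984, which band() maps to "DQ_Gold"
```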

Step 8 — Stewardship & Cleaning Projects

Portal: Data Stewardship & Data Quality → Cleaning Projects

  • Auto‑create work queues from rule failures (e.g., all InvalidEmail for Customer).
  • Use Cleaning Projects to bulk fix: standardize values, remap codes, correct formats.
  • Everything is logged in History; you can Undo Project or Restore Previous Value if needed.

Step 9 — Dashboards & Alerts

Portal: Data Quality → Dashboards
Track by entity, attribute, and source:

  • Completeness %, validity %, uniqueness violations, orphan counts.
  • Score trends (7/30/90‑day).
  • Top failing rules and top offending sources.

Set alerts (email/webhook) when:

  • A KPI breaches threshold (e.g., Valid Email < 98%).
  • Regression from previous week by > X%.
  • New pattern emerges (data drift).
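
A sketch of the alert check, posting to a hypothetical webhook URL; the THRESHOLDS values and the regression delta (the "X%" above) are placeholders to agree per KPI:

```python
import requests

THRESHOLDS = {"valid_email_pct": 0.98, "valid_phone_pct": 0.95}
MAX_WEEKLY_DROP = 0.02  # stands in for the "X%" above; agree per KPI

def check_and_alert(current: dict, previous: dict, webhook_url: str) -> None:
    """Post an alert for each KPI that breaches its threshold or regresses."""
    for kpi, threshold in THRESHOLDS.items():
        breach = current[kpi] < threshold
        regression = previous.get(kpi, current[kpi]) - current[kpi] > MAX_WEEKLY_DROP
        if breach or regression:
            requests.post(webhook_url, json={
                "kpi": kpi,
                "value": current[kpi],
                "threshold": threshold,
                "reason": "breach" if breach else "regression",
            })
```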

Step 10 — Automated Remediation (Where Safe)

Portal: Rules (Data Part & Golden)

  • Safe auto‑fixes (e.g., country alias mapping, phone formatting) can run continuously.
  • Risky fixes route to stewards; tag NeedsStewardReview.
  • For downstream suppression (e.g., invalid emails), tag DoNotContact.
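
The routing decision itself is simple to express. In the sketch below the rule names and queue labels are placeholders:

```python
# Illustrative routing: safe fixes apply automatically; everything else
# is tagged and sent to a steward queue. Rule names are placeholders.
SAFE_AUTO_FIXES = {"CountryAliasMapping", "PhoneE164Format", "TrimWhitespace"}

def route(failure: dict) -> str:
    if failure["rule"] in SAFE_AUTO_FIXES:
        return "auto-fix"
    failure.setdefault("tags", []).append("NeedsStewardReview")
    return "steward-queue"
```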

Step 11 — Governance, Access & Lifecycle

Portal: Governance → Permissions / Policies / Retention

  • Restrict access to sensitive fields (PII) and steward‑only attributes.
  • Apply Retention Policies where appropriate (e.g., delete stale logs after N days).
  • Maintain audit: every change is in History with who/when/why.

Step 12 — Publish Quality‑Assured Data

Portal: Exports

  • Publish conformed entity views plus quality flags/scores.
  • Provide suppression lists (e.g., invalid emails).
  • Offer wide tables for BI with Score_Entity and failure counts.

Start read‑only to validate, then wire quality gates into downstream loaders.
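
A quality gate in a downstream loader can be as simple as filtering on the published score and suppression tags. A sketch, assuming the export carries the Score_Entity and Tags columns from this step:

```python
import pandas as pd

MIN_SCORE = 0.90  # gate threshold; aligned here with the DQ_Bronze boundary

def quality_gate(export: pd.DataFrame) -> pd.DataFrame:
    """Pass only records above the gate and drop suppressed contacts.
    Assumes the export carries Score_Entity and Tags columns."""
    gated = export[export["Score_Entity"] >= MIN_SCORE]
    return gated[~gated["Tags"].astype(str).str.contains("DoNotContact")]
```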


Step 13 — Scheduling & Operations

  • Schedule profiling, validation, scoring, and exports (off‑peak for heavy runs).
  • For big jobs, use a dedicated VM/server to avoid timeouts and throttling.
  • Monitor job durations and queue depth; investigate spikes.

Step 14 — UAT & Go‑Live

  • Validate on a sample set across sources.
  • Confirm rule accuracy (precision/recall) with business owners.
  • Dry‑run alerts and dashboards; tune thresholds.
  • Package with Product Toolkit and promote (ensure Accountable access).

Go‑Live Checklist

  • DQ policy documented (dimensions, thresholds, SLAs, owners).
  • Normalization & validation rules active; tests passing.
  • Scorecards & dashboards show baseline and targets.
  • Steward queues configured; Cleaning Projects tested; Undo verified.
  • Exports include quality flags/scores; consumers validated.
  • Schedules & alerts configured; on‑call/triage process in place.

Example Rules (Snippets)

Email Validity

  • Condition: Email present
  • Actions: lower‑case; regex check; tag InvalidEmail on failure.

Phone Normalize

  • Condition: Phone present
  • Actions: parse to E.164; tag InvalidPhone on parse failure.

Country ISO‑2 Enforcement

  • Condition: Country not in ISO2 set
  • Actions: map alias (e.g., UK → GB, U.S. → US); if unknown, tag UnknownCountry.

Date Consistency

  • Condition: EndDate < StartDate
  • Actions: tag InvalidDateRange; send to stewardship.

Uniqueness (CustomerID)

  • Condition: duplicate CustomerID detected
  • Actions: tag DuplicateKey; route to steward queue.

GTIN Check

  • Condition: GTIN present
  • Actions: strip non‑digits; verify the checksum; tag InvalidGTIN on failure.

Timeliness

  • Condition: Now - LastUpdated > 30 days
  • Actions: tag StaleRecord.
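
As one concrete rendering, here is the Timeliness snippet as a standalone check (Python, assuming timezone-aware timestamps):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)

def timeliness_tags(last_updated: datetime) -> set:
    """Tag records not updated within the window; assumes tz-aware timestamps."""
    age = datetime.now(timezone.utc) - last_updated
    return {"StaleRecord"} if age > STALE_AFTER else set()
```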

KPIs & Targets (Examples)

  • Valid Email ≥ 98%, Valid Phone ≥ 95%.
  • Address Standardized ≥ 97%; Unknown Country < 0.1%.
  • Duplicate Key violations = 0.
  • Orphan References = 0.
  • Entity Score ≥ 0.95 sustained over 30 days.
  • Mean time to remediate stewarded issues < 5 days.

Common Pitfalls & How to Avoid Them

  • Skipping standardization before validation → false failures. Normalize first.
  • Over‑aggressive regex → valid values rejected as invalid; keep patterns realistic.
  • One giant rule → hard to debug; split into small rules.
  • No owner for each KPI → assign accountable roles.
  • Running profiling on laptops → use dedicated compute to avoid timeouts.
  • Fixes overwritten or re‑applied by other rules → adjust or disable the conflicting rule before reprocessing.

Summary

A successful Data Quality program in CluedIn combines clear policy, modular rules, transparent scoring, and active stewardship. Start small, measure relentlessly, automate safe fixes, and keep governance tight so improvements stick.