CluedIn for Data Stewards — Field Guide
Audience: Data Stewards, Data Quality Leads, Business Owners
Goal: Give stewards a clear, practical playbook for operating CluedIn: profiling, validation, cleaning, dedup review, metadata & glossary, and day‑to‑day rhythms with engineers and admins.
This is the “how we actually work” guide. It favors repeatable workflows, small changes shipped often, and evidence from metrics and audit trails.
0) Your First 48 Hours (Checklist)
Get access & context
- Sign in via SSO; verify your Steward role permissions.
- Join the #data-quality and #cluedin-ops channels (or equivalent).
- Open the Entities Explorer and Exports you own.
- Skim the Runbook and Use‑Case Brief for your domain.
Profile & label
- Run profiling on top entities (completeness, validity, uniqueness).
- Apply labels/classifications (PII, Restricted, Confidential).
- Note top 3 issues (e.g., invalid emails, duplicate orgs).
Set foundations
- Draft validation rules for the highest‑impact fields.
- Create a Cleaning Project with 1–2 safe normalizations.
- Review Dedup queue; define deterministic auto‑approve rules (with Admin/Engineer).
Observability
- Pin DQ metrics and set initial thresholds.
- Learn where to find logs and audit logs for your domain.
1) What a Steward Does in CluedIn
- Profiling & Monitoring — Understand data shapes, trends, outliers.
- Validation & Policy — Encode business rules as validations/policies.
- Cleaning — Build small, idempotent improvements that run on a cadence.
- Dedup & Golden Records — Review matches, define survivorship, and unmerge safely.
- Metadata & Glossary — Maintain definitions, semantics, classifications, and lineage notes.
- Governance & Evidence — Track DQ metrics, create change records, use audit logs.
- Collaboration — Work with Engineers (pipelines/mapping) and Admins (access/roles).
Principle: Ship smaller changes more often. Each step should be low‑risk, easy to revert, and measurable.
2) Finding and Understanding Your Data
2.1 Entities Explorer & Catalog
- Browse Entities (e.g., Person, Organization, Order) and their attributes.
- Check relationships (e.g., Person ↔ Organization) and lineage from sources to exports.
- Use labels (PII, Restricted) to drive access and masking policies.
2.2 Profiling (What to look at)
- Completeness: non‑null %, blank %, cardinality.
- Validity: regex/domain compliance (emails, country codes).
- Uniqueness: duplicates by natural/business keys.
- Distributions: skew, min/max, outliers.
- Timeliness: update latency vs SLAs.
Quick profiling questions
- Which 5 fields most impact downstream decisions?
- Where are the biggest drops in completeness/validity?
- Which sources disagree (consistency check)?
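The first three profiling dimensions above can be sketched in a few lines of Python. This is only an illustration over in-memory records, not CluedIn's profiler; the function name `profile_field` and the record shape are assumptions.

```python
import re
from collections import Counter

# Same email pattern as the validation rule template in section 3.1.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def profile_field(records, field, pattern=None):
    """Return completeness, validity, and uniqueness for one field (hypothetical helper)."""
    values = [r.get(field) for r in records]
    # Treat None and blank strings as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    total = len(values)
    completeness = len(present) / total if total else 0.0

    validity = None
    if pattern is not None and present:
        validity = sum(1 for v in present if pattern.match(str(v))) / len(present)

    uniqueness = None
    if present:
        counts = Counter(present)
        duplicated = sum(c for c in counts.values() if c > 1)
        uniqueness = 1 - duplicated / len(present)

    return {"completeness": completeness, "validity": validity, "uniqueness": uniqueness}

sample = [
    {"email": "a@example.com"},
    {"email": "a@example.com"},
    {"email": "not-an-email"},
    {"email": None},
]
report = profile_field(sample, "email", EMAIL_RE)
```

Here validity and uniqueness are computed over non-missing values only, so a sparse field is not penalized twice.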
3) Writing Validations (Guardrails)
3.1 Validation Rule Template
rule: email_must_be_valid
entity: Person
when:
  - field: email
    is_not_null: true
check:
  regex: "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$"
severity: high
on_fail:
  action: "flag"   # flag | quarantine | auto_fix
  message: "Invalid email format"
labels: ["PII","contactability"]
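To see what this rule does to a single record, here is a minimal Python sketch. It mirrors the `when`, `check`, and `on_fail` parts of the template; the function `check_email` and the shape of the returned failure are assumptions for illustration, not a CluedIn API.

```python
import re

# The regex from the rule, written as a Python raw string.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_email(record):
    """Return None when the record passes, else a dict mirroring on_fail (hypothetical)."""
    email = record.get("email")
    if email is None:
        # The `when` clause: the rule only applies to non-null emails.
        return None
    if EMAIL_RE.fullmatch(email):
        return None
    return {
        "rule": "email_must_be_valid",
        "action": "flag",
        "message": "Invalid email format",
    }
```

Testing the pattern against a handful of known-good and known-bad addresses before shipping the rule is a cheap way to estimate false positives.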
3.2 Cross‑Field Rules
rule: order_dates_consistent
entity: Order
check:
  expression: "order_date <= ship_date"
severity: medium
on_fail: { action: "flag", message: "Ship date precedes order date" }
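A cross-field rule needs a decision about missing values. The sketch below assumes that a record with either date missing passes this rule and is caught by a separate completeness rule instead; that choice and the function name are assumptions, not documented behavior.

```python
from datetime import date

def order_dates_consistent(order):
    """Return True when order_date <= ship_date, or when either date is missing."""
    order_date = order.get("order_date")
    ship_date = order.get("ship_date")
    if order_date is None or ship_date is None:
        # Assumption: missing dates are handled by a separate completeness rule.
        return True
    return order_date <= ship_date
```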
3.3 Domain Lists & Reference Data
rule: country_in_iso3166
entity: Address
check:
  in_list:
    field: country_code
    list: ["US","AU","GB","DE","FR","JP","NZ"]
on_fail: { action: "flag" }
Tips
- Start with allow‑listing (valid patterns); avoid complex negative logic.
- Add labels to rules to group them by policy (e.g., privacy, finance).
- Prefer flag or quarantine first; adopt auto_fix only when safe and deterministic.
4) Cleaning Projects (Small, Safe, Scheduled)
4.1 Design Principles
- Idempotent: Running twice doesn’t double‑change.
- Observable: Log counts changed, keep before/after examples.
- Revertible: Version the project; easy rollback.
- Scoped: 1–3 steps per release; avoid big‑bang refactors.
4.2 Starter Project
project: normalize_contacts
schedule: "0 * * * *"   # hourly
steps:
  - name: trim_names
    action: set
    field: first_name
    value: trim(first_name)
  - name: normalize_email
    action: set
    field: email
    value: lower(trim(email))
  - name: drop_impossible_birthdates
    when: "birthdate < '1900-01-01' or birthdate > now()"
    action: set
    field: birthdate
    value: null
observability:
  sample_before_after: 25
  emit_metrics: true
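The idempotency principle is easy to check: run the same steps twice and confirm the second pass changes nothing. The Python sketch below re-implements the three starter steps on plain dicts for illustration only; `clean_contact` and `run_cleaning` are hypothetical names, and `today` is passed in so the result is reproducible.

```python
from datetime import date

def clean_contact(record, today):
    """Apply the three starter steps to a copy of the record."""
    out = dict(record)
    if out.get("first_name") is not None:
        out["first_name"] = out["first_name"].strip()      # trim_names
    if out.get("email") is not None:
        out["email"] = out["email"].strip().lower()        # normalize_email
    birthdate = out.get("birthdate")
    if birthdate is not None and (birthdate < date(1900, 1, 1) or birthdate > today):
        out["birthdate"] = None                            # drop_impossible_birthdates
    return out

def run_cleaning(records, today):
    """Return the cleaned records and the number of records that changed."""
    cleaned = [clean_contact(r, today) for r in records]
    changed = sum(1 for before, after in zip(records, cleaned) if before != after)
    return cleaned, changed
```

A second run over the cleaned output should report zero changed records; a non-zero count is a sign that a step is not idempotent.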
4.3 Approval & Promotion
- Test in staging export. Compare row counts, nulls, and schema.
- Peer review from another Steward/Engineer.
- Promote with a short changelog entry and link to metrics.
5) Deduplication & Golden Records
5.1 Review Workflow
- Open Dedup queue for your entity (e.g., Person).
- Start with high‑confidence candidates; verify examples.
- Approve merges that are deterministic; send ambiguous ones to manual review with notes.
- If needed, unmerge with documented rationale.
5.2 Deterministic First, then Fuzzy
rules:
  - name: exact_email
    when: lower(email) == lower(other.email)
    confidence: 0.98
  - name: phone_e164
    when: e164(phone) == e164(other.phone)
    confidence: 0.95
auto_approve_threshold: 0.97
queue_threshold: 0.85
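With these numbers, an exact email match (0.98) clears the auto‑approve threshold, while a phone match (0.95) lands in the review queue. The Python sketch below shows that routing under two assumptions not stated in the source: a pair takes the highest confidence among the rules that fire, and thresholds are inclusive.

```python
AUTO_APPROVE_THRESHOLD = 0.97
QUEUE_THRESHOLD = 0.85

def same_email(a, b):
    ea, eb = a.get("email"), b.get("email")
    return bool(ea) and bool(eb) and ea.lower() == eb.lower()

def same_phone(a, b):
    # Assumption: phones were already normalized to E.164 by a cleaning step.
    pa, pb = a.get("phone"), b.get("phone")
    return bool(pa) and pa == pb

# (rule name, predicate, confidence) as in the config above.
RULES = [
    ("exact_email", same_email, 0.98),
    ("phone_e164", same_phone, 0.95),
]

def score_pair(a, b):
    """Return (rule_name, confidence) for the strongest rule that fires, else (None, 0.0)."""
    fired = [(name, conf) for name, match, conf in RULES if match(a, b)]
    if not fired:
        return None, 0.0
    return max(fired, key=lambda item: item[1])

def route(confidence):
    """Map a confidence score to a queue decision (inclusive thresholds assumed)."""
    if confidence >= AUTO_APPROVE_THRESHOLD:
        return "auto_approve"
    if confidence >= QUEUE_THRESHOLD:
        return "manual_review"
    return "no_match"
```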
5.3 Survivorship (Golden Record)
Define how fields are chosen on merge:
survivorship:
  precedence:
    - source: "crm"
    - source: "support"
    - source: "marketing"
  recency_win: true   # prefer latest updated_at when sources tie
  field_overrides:
    email: "most_recent_non_null"
    phone: "most_trusted_source_first"
- Maintain a decision log for merges/unmerges.
- Update dedup rules after recurring manual decisions to reduce future toil.
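The two field overrides in the survivorship config can be read as small selection functions. The sketch below is one plausible interpretation in Python, not CluedIn's merge engine: it assumes each source record carries `source` and an ISO‑formatted `updated_at` string, and that unknown sources rank last.

```python
PRECEDENCE = ["crm", "support", "marketing"]

def _rank(source):
    # Unknown sources sort after every listed source (assumption).
    return PRECEDENCE.index(source) if source in PRECEDENCE else len(PRECEDENCE)

def most_recent_non_null(records, field):
    """Pick the non-null value from the most recently updated record."""
    present = [r for r in records if r.get(field) is not None]
    if not present:
        return None
    return max(present, key=lambda r: r["updated_at"])[field]

def most_trusted_source_first(records, field):
    """Pick the non-null value from the highest-precedence source; recency breaks ties."""
    present = [r for r in records if r.get(field) is not None]
    if not present:
        return None
    # Sort newest first so that min() returns the newest record among equal ranks.
    present.sort(key=lambda r: r["updated_at"], reverse=True)
    return min(present, key=lambda r: _rank(r["source"]))[field]

def build_golden(records):
    return {
        "email": most_recent_non_null(records, "email"),
        "phone": most_trusted_source_first(records, "phone"),
    }
```

Writing the expected golden record for two or three real merge cases into the decision log makes it easier to spot when a precedence change alters outcomes.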
6) Mapping & Schema with a Steward’s Lens
- Propose new fields/semantics based on business needs.
- Keep business definitions (glossary) aligned with mapped fields.
- Avoid heavy transformation in mapping; use Cleaning Projects for standardization.
- When adding fields, coordinate with Engineers to run staging exports and diff results.
Change Checklist
- Impacted dashboards identified
- Export contracts updated (if needed)
- Backfill plan (if required)
- Rollback tested
7) Metadata, Glossary & Classifications
7.1 Glossary Entries
Include: business definition, calculation notes, owner, example, data quality caveats.
term: "Active Customer"
definition: "Customer with at least one completed order in the last 90 days"
owner: "Sales Ops"
sources: ["orders_v1","subscriptions_v2"]
dq_notes: "Relies on order status != 'cancelled'"
7.2 Classifications & Policy Hooks
- Tag fields/entities with PII, Restricted, Confidential, or Public.
- Policies can mask, hash, or deny reads based on labels.
- Steward task: keep labels accurate as new fields arrive.
8) Data Quality Metrics & Dashboards
Track a few meaningful KPIs per entity:
- Completeness: pct_non_null(email)
- Validity: pct_match(email, regex)
- Uniqueness: 1 - duplicate_rate(email)
- Timeliness: p95(now() - updated_at)
Thresholds
metrics:
  email_validity: { warn: "< 0.97", fail: "< 0.95" }
  duplicate_rate_email: { warn: "> 0.03", fail: "> 0.05" }
alerts:
  - metric: email_validity
    action: "notify #data-quality"
Habit: Improve by small, frequent increments (e.g., +0.5–1.0% per week).
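To make threshold behavior concrete, here is a small Python sketch. It assumes a metric is "ok" until it crosses the warn bound and "fail" once it crosses the fail bound, with the direction depending on whether higher or lower is better; the `evaluate` function is hypothetical.

```python
def evaluate(value, warn, fail, higher_is_better=True):
    """Classify a metric value as 'ok', 'warn', or 'fail'."""
    if higher_is_better:
        # Example: email validity, where dropping below a bound is bad.
        if value < fail:
            return "fail"
        if value < warn:
            return "warn"
    else:
        # Example: duplicate rate, where rising above a bound is bad.
        if value > fail:
            return "fail"
        if value > warn:
            return "warn"
    return "ok"

# Email validity: warn below 0.97, fail below 0.95.
validity_status = evaluate(0.96, warn=0.97, fail=0.95)
# Duplicate rate: warn above 0.03, fail above 0.05.
duplicate_status = evaluate(0.04, warn=0.03, fail=0.05, higher_is_better=False)
```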
9) Reading Logs & Audit Logs (Steward View)
9.1 Operational Logs
- Cleaning logs for before/after counts, failures.
- Validation logs for rule breaches and quarantines.
- Export logs for schema diffs and row counts.
Grab the correlation_id from UI errors and filter logs by it when triaging.
9.2 Audit Logs
- Who approved a merge? Who changed a rule? When was a policy updated?
- Use audit logs to evidence changes and for post‑incident analysis.
- Include audit links in your changelog notes.
10) Working with AI Agents (Your Co‑pilot)
10.1 Analysis Prompt Starters
“Analyze Person for missingness, invalid formats, and duplicates. Propose validation rules with example failing records. Suggest deterministic dedup keys first.”
“Given recent invalid email spikes, propose a stricter validation with back‑compat and estimate false positive rate.”
10.2 Safe Auto‑Fix Patterns
- Normalize casing/whitespace.
- Standardize ISO codes (countries, states).
- Convert phone numbers to E.164 (with default country).
Guardrails
- Keep auto‑fixes reversible; log original values.
- Restrict AI write scope to masked views where needed.
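The E.164 auto‑fix listed above can be sketched as follows. This is a deliberately simple Python illustration: it assumes a single default country calling code and strips a leading trunk `0`, which is wrong for some countries, so a dedicated phone‑number library is likely a better fit in production. The function returns the original value alongside the result so the fix stays reversible.

```python
import re

# Plausibility check: "+" then 8 to 15 digits with no leading zero (stricter than some rules).
E164_RE = re.compile(r"^\+[1-9]\d{7,14}$")

def to_e164(raw, default_country_code="1"):
    """Return (normalized_or_None, original). The default country code is an assumption."""
    if raw is None:
        return None, raw
    text = raw.strip()
    digits = re.sub(r"\D", "", text)
    if text.startswith("+"):
        candidate = "+" + digits
    elif text.startswith("00"):
        # International dialing prefix used in many countries.
        candidate = "+" + digits[2:]
    else:
        # National number: drop leading trunk zeros, then prepend the default country code.
        candidate = "+" + default_country_code + digits.lstrip("0")
    normalized = candidate if E164_RE.fullmatch(candidate) else None
    return normalized, raw
```

When `normalized` is `None`, leaving the field unchanged and flagging the record is safer than writing a guess.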
11) Collaboration & Change Management
- Submit PRs or change requests for new validations/cleaning steps.
- Pair with Engineers for mapping/export impacts.
- Ask Admins for labels/policy updates and role changes.
- Document user impact and rollback in each change.
Promotion flow
1. Draft in staging → 2. Peer review → 3. Scheduled deploy → 4. Post‑deploy checks → 5. Changelog & metric snapshot.
12) Day‑to‑Day Operating Rhythm
Daily (5–10 minutes)
- DQ dashboard glance; triage new alerts.
- Dedup queue: approve high‑confidence; flag edge cases.
- Check last cleaning run; scan error logs.
Weekly
- Ship 1–3 cleaning improvements.
- Review validation breach trends; tighten where safe.
- Update glossary entries if semantics changed.
Monthly
- Access review for your domain (with Admins).
- Revisit dedup thresholds; sample precision/recall.
- Retire deprecated fields/exports.
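The monthly dedup review above calls for sampling precision and recall. A minimal Python sketch, assuming you have a manually reviewed sample where each pair records what the rules predicted and what the reviewer decided:

```python
def precision_recall(sample):
    """sample: list of (predicted_match, is_true_match) booleans from a reviewed sample."""
    tp = sum(1 for predicted, actual in sample if predicted and actual)
    fp = sum(1 for predicted, actual in sample if predicted and not actual)
    fn = sum(1 for predicted, actual in sample if not predicted and actual)
    # None signals "not measurable" when the sample has no relevant pairs.
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall
```

Low precision suggests raising the auto‑approve threshold; low recall suggests adding a deterministic rule or lowering the queue threshold.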
13) Steward Runbook (Common Situations)
Spike in invalid emails
- Confirm via validation logs and sample records.
- Identify source; check recent mapping/cleaning changes.
- Add temporary quarantine rule; propose stricter regex.
- Communicate downstream impact; backfill if needed.
Duplicate surge after new source onboarded
- Raise the auto‑approve threshold or pause auto‑merge so fewer pairs merge without review.
- Add deterministic rule (e.g., exact phone or customer_id).
- Review queue; update survivorship precedence with the new source.
Export schema drift detected
- Compare staging vs prod exports; find mapping change.
- Coordinate rollback with Engineer; file a changelog entry.
- Add a validation to catch the offending pattern next time.
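For the schema drift case, the staging‑versus‑prod comparison boils down to a set difference. The sketch below assumes each export schema is available as a plain mapping of column name to type name; how you obtain that mapping depends on your export target and is not covered here.

```python
def schema_diff(staging, prod):
    """Compare two {column: type} mappings and report added, removed, and retyped columns."""
    staging_cols, prod_cols = set(staging), set(prod)
    return {
        "added": sorted(staging_cols - prod_cols),
        "removed": sorted(prod_cols - staging_cols),
        "retyped": sorted(c for c in staging_cols & prod_cols if staging[c] != prod[c]),
    }

# Hypothetical example schemas.
prod_schema = {"id": "string", "email": "string", "age": "int"}
staging_schema = {"id": "string", "email": "string", "age": "string", "phone": "string"}
drift = schema_diff(staging_schema, prod_schema)
```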
14) Templates & Snippets
14.1 Validation Pack (copy/paste)
- rule: phone_is_e164_or_null
  entity: Person
  when: [{ field: phone, is_not_null: true }]
  check: { regex: "^\\+\\d{8,15}$" }
  severity: medium
  on_fail: { action: "flag" }
- rule: name_min_length
  entity: Person
  check: { expression: "len(trim(first_name)) >= 1 and len(trim(last_name)) >= 1" }
  severity: low
  on_fail: { action: "flag" }
14.2 Cleaning Steps
- name: fix_common_email_typos
  action: set
  field: email
  value: replace_multi(lower(email), {"gmal.com":"gmail.com","hotnail.com":"hotmail.com"})
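One thing to verify before shipping a typo fix like this: if `replace_multi` does plain substring replacement (an assumption; the source does not say), a domain such as `notgmal.com` would also be rewritten. The Python sketch below shows a stricter variant that only rewrites an exact domain match; `fix_email_domain` is a hypothetical name.

```python
DOMAIN_FIXES = {"gmal.com": "gmail.com", "hotnail.com": "hotmail.com"}

def fix_email_domain(email):
    """Lowercase the email and correct the domain only when it matches a known typo exactly."""
    if email is None or "@" not in email:
        return email
    email = email.lower()
    # Split on the last "@" so the local part is never touched.
    local, _, domain = email.rpartition("@")
    return f"{local}@{DOMAIN_FIXES.get(domain, domain)}"
```

Because corrected domains are not keys in the mapping, running the fix a second time leaves the value unchanged.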
14.3 Dedup Reviewer Notes (template)
pair_id: "dup_7f3a"
decision: "approve"
evidence:
  - "exact_email"
  - "matching phone_e164"
comment: "Same person from CRM and Support; safe to auto‑approve rule next time."
14.4 Changelog Entry
date: "2025-08-24"
change: "Added phone E.164 normalization; tightened email regex"
impact: "Duplicate rate down from 5.2% → 2.1%; validity up 96.8% → 98.9%"
links: ["run logs", "audit event #1432"]
15) What Good Looks Like
- DQ metrics trending up, no long‑lived red alerts.
- Small, frequent improvements with audit evidence.
- Glossary & labels kept up to date.
- Dedup queue under control; high precision merges; rare unmerges.
- Change notes clear enough that a new Steward can follow the story.
Appendix — Steward Permission Profile (Typical)
Capability | Viewer | Steward | Engineer | Admin |
---|---|---|---|---|
Read entities/exports | ✅ | ✅ | ✅ | ✅ |
Edit validations | ❌ | ✅ | ✅ | ✅ |
Edit cleaning projects | ❌ | ✅ | ✅ | ✅ |
Review dedup & approve merges | ❌ | ✅ | ✅ | ✅ |
Configure ingestion/export | ❌ | ❌ | ✅ | ✅ |
Manage roles/policies | ❌ | ❌ | ❌ | ✅ |
Toggle features | ❌ | ❌ | ❌ | ✅ |
If you need elevated access temporarily, request a time‑boxed role with Admin approval.
You now have the Steward playbook: profile, validate, clean, dedup, document, measure—then repeat. Keep changes small, keep evidence, and keep the pipeline healthy.