CluedIn for Data Stewards — Field Guide

Audience: Data Stewards, Data Quality Leads, Business Owners
Goal: Give stewards a clear, practical playbook for operating CluedIn: profiling, validation, cleaning, dedup review, metadata & glossary, and day‑to‑day rhythms with engineers and admins.

This is the “how we actually work” guide. It favors repeatable workflows, small changes shipped often, and evidence from metrics and audit trails.


0) Your First 48 Hours (Checklist)

Get access & context

  • Sign in via SSO; verify your Steward role permissions.
  • Join the #data-quality and #cluedin-ops channels (or equivalent).
  • Open the Entities Explorer and Exports you own.
  • Skim the Runbook and Use‑Case Brief for your domain.

Profile & label

  • Run profiling on top entities (completeness, validity, uniqueness).
  • Apply labels/classifications (PII, Restricted, Confidential).
  • Note top 3 issues (e.g., invalid emails, duplicate orgs).

Set foundations

  • Draft validation rules for the highest‑impact fields.
  • Create a Cleaning Project with 1–2 safe normalizations.
  • Review Dedup queue; define deterministic auto‑approve rules (with Admin/Engineer).

Observability

  • Pin DQ metrics and set initial thresholds.
  • Learn where to find logs and audit logs for your domain.

1) What a Steward Does in CluedIn

  • Profiling & Monitoring — Understand data shapes, trends, outliers.
  • Validation & Policy — Encode business rules as validations/policies.
  • Cleaning — Build small, idempotent improvements that run on a cadence.
  • Dedup & Golden Records — Review matches, define survivorship, and unmerge safely.
  • Metadata & Glossary — Maintain definitions, semantics, classifications, and lineage notes.
  • Governance & Evidence — Track DQ metrics, create change records, use audit logs.
  • Collaboration — Work with Engineers (pipelines/mapping) and Admins (access/roles).

Principle: Ship smaller changes more often. Each step should be low‑risk, easy to revert, and measurable.


2) Finding and Understanding Your Data

2.1 Entities Explorer & Catalog

  • Browse Entities (e.g., Person, Organization, Order) and their attributes.
  • Check relationships (e.g., Person ↔ Organization) and lineage from sources to exports.
  • Use labels (PII, Restricted) to drive access and masking policies.

2.2 Profiling (What to look at)

  • Completeness: non‑null %, blank %, cardinality.
  • Validity: regex/domain compliance (emails, country codes).
  • Uniqueness: duplicates by natural/business keys.
  • Distributions: skew, min/max, outliers.
  • Timeliness: update latency vs SLAs.

Quick profiling questions

  1. Which 5 fields most impact downstream decisions?
  2. Where are the biggest drops in completeness/validity?
  3. Which sources disagree (consistency check)?
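
If you want to sanity-check these numbers outside the UI, the short sketch below computes completeness, uniqueness, and email validity on an exported sample with pandas; the CSV path and column names are illustrative assumptions, not CluedIn APIs.

# Quick profiling pass over an exported sample (assumed CSV; column names illustrative).
import pandas as pd

EMAIL_RE = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

df = pd.read_csv("person_export_sample.csv")  # hypothetical staging export

def profile_field(series: pd.Series) -> dict:
    """Completeness, cardinality, and duplicate rate for a single field."""
    non_null = series.notna() & (series.astype(str).str.strip() != "")
    return {
        "completeness": round(float(non_null.mean()), 4),
        "cardinality": int(series.nunique(dropna=True)),
        "duplicate_rate": round(float(series[non_null].duplicated().mean()), 4) if non_null.any() else 0.0,
    }

report = {col: profile_field(df[col]) for col in ["email", "phone", "customer_id"]}

# Validity is field-specific; here: share of non-empty emails matching the regex.
emails = df["email"].dropna().astype(str).str.strip()
report["email"]["validity"] = round(float(emails[emails != ""].str.match(EMAIL_RE).mean()), 4)

for field, stats in report.items():
    print(field, stats)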

3) Writing Validations (Guardrails)

3.1 Validation Rule Template

rule: email_must_be_valid
entity: Person
when:
  - field: email
    is_not_null: true
check:
  regex: "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$"
severity: high
on_fail:
  action: "flag"          # flag | quarantine | auto_fix
  message: "Invalid email format"
labels: ["PII","contactability"]

3.2 Cross‑Field Rules

rule: order_dates_consistent
entity: Order
check:
  expression: "order_date <= ship_date"
severity: medium
on_fail: { action: "flag", message: "Ship date precedes order date" }

3.3 Domain Lists & Reference Data

rule: country_in_iso3166
entity: Address
check:
  in_list:
    field: country_code
    list: ["US","AU","GB","DE","FR","JP","NZ"]
on_fail: { action: "flag" }

Tips

  • Start with allow‑lists of valid patterns; avoid complex negative logic.
  • Add labels to rules to group them by policy (e.g., privacy, finance).
  • Prefer flag or quarantine first; adopt auto_fix only when safe and deterministic.
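
If it helps to reason about the on_fail actions, here is a minimal sketch of how flag vs. quarantine routing could be applied to failing records. The rule shape mirrors the template above; the evaluator itself is an illustration over assumed record dicts, not CluedIn's validation engine.

# Illustrative evaluator for the rule templates above: failing records are either
# flagged (kept and annotated) or quarantined (held back from downstream use).
import re

RULES = [
    {"name": "email_must_be_valid", "field": "email",
     "regex": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
     "on_fail": "flag"},           # flag | quarantine
]

def apply_rules(record: dict) -> tuple:
    """Return (breached_rule_names, quarantined?) for a single record."""
    breaches, quarantined = [], False
    for rule in RULES:
        value = record.get(rule["field"])
        if value is None:          # the 'when: is_not_null' guard from the template
            continue
        if not rule["regex"].match(str(value)):
            breaches.append(rule["name"])
            quarantined = quarantined or rule["on_fail"] == "quarantine"
    return breaches, quarantined

print(apply_rules({"email": "not-an-email"}))   # (['email_must_be_valid'], False)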

4) Cleaning Projects (Small, Safe, Scheduled)

4.1 Design Principles

  • Idempotent: Running twice doesn’t double‑change.
  • Observable: Log counts changed, keep before/after examples.
  • Revertible: Version the project; easy rollback.
  • Scoped: 1–3 steps per release; avoid big‑bang refactors.
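
A cheap way to check the "idempotent" property before scheduling a step is to run the transform twice on a sample and assert that the second pass changes nothing. The normalizer below is a stand-in for a cleaning step, not CluedIn syntax.

# Idempotence check: applying the step twice must give the same result as applying it once.
def normalize_email(value):
    return value.strip().lower() if value is not None else None

sample = ["  Alice@Example.COM ", "bob@example.com", None]
once = [normalize_email(v) for v in sample]
twice = [normalize_email(v) for v in once]
assert once == twice, "step is not idempotent"
print(once)   # ['alice@example.com', 'bob@example.com', None]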

4.2 Starter Project

project: normalize_contacts
schedule: "0 * * * *"  # hourly
steps:
  - name: trim_names
    action: set
    field: first_name
    value: trim(first_name)
  - name: normalize_email
    action: set
    field: email
    value: lower(trim(email))
  - name: drop_impossible_birthdates
    when: "birthdate < '1900-01-01' or birthdate > now()"
    action: set
    field: birthdate
    value: null
observability:
  sample_before_after: 25
  emit_metrics: true

4.3 Approval & Promotion

  • Test in a staging export; compare row counts, nulls, and schema (see the diff sketch after this list).
  • Peer review from another Steward/Engineer.
  • Promote with a short changelog entry and link to metrics.
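
For the "compare row counts, nulls, and schema" step, a small diff script can back up the review. The file names and the pandas-based approach below are assumptions; the comparison logic is the point.

# Staging vs. production export comparison: row counts, schema changes, and null-rate shifts.
import pandas as pd

prod = pd.read_csv("person_export_prod.csv")        # hypothetical current export
staging = pd.read_csv("person_export_staging.csv")  # hypothetical candidate export

print("row count:", len(prod), "->", len(staging))
print("columns added:", sorted(set(staging.columns) - set(prod.columns)))
print("columns removed:", sorted(set(prod.columns) - set(staging.columns)))

for col in sorted(set(prod.columns) & set(staging.columns)):
    before, after = prod[col].isna().mean(), staging[col].isna().mean()
    if abs(after - before) > 0.01:   # surface null-rate shifts above one percentage point
        print(f"null-rate shift on {col}: {before:.2%} -> {after:.2%}")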

5) Deduplication & Golden Records

5.1 Review Workflow

  1. Open Dedup queue for your entity (e.g., Person).
  2. Start with high‑confidence candidates; verify examples.
  3. Approve merges that are deterministic; send ambiguous ones to manual review with notes.
  4. If needed, unmerge with documented rationale.

5.2 Deterministic First, then Fuzzy

rules:
  - name: exact_email
    when: lower(email) == lower(other.email)
    confidence: 0.98
  - name: phone_e164
    when: e164(phone) == e164(other.phone)
    confidence: 0.95
auto_approve_threshold: 0.97
queue_threshold: 0.85
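
The deterministic rules above boil down to comparing normalized keys. The sketch below builds exact-match keys for email and phone and groups candidate records; the phone normalization is deliberately naive (digits only, assumed default country code), so a production version should use a proper E.164 library.

# Deterministic candidate generation: records sharing a normalized key are merge candidates.
import re
from collections import defaultdict

def email_key(email):
    return email.strip().lower() if email else None

def phone_key(phone, default_cc="61"):
    """Naive E.164-ish key; assumes default_cc when no '+' country code is present."""
    if not phone:
        return None
    digits = re.sub(r"\D", "", phone)
    return digits if phone.strip().startswith("+") else default_cc + digits.lstrip("0")

records = [
    {"id": 1, "email": "Ana@Example.com",  "phone": "+61 412 345 678"},
    {"id": 2, "email": "ana@example.com ", "phone": "0412 345 678"},
]

candidates = defaultdict(list)
for r in records:
    for rule, key in (("exact_email", email_key(r["email"])), ("phone_key", phone_key(r["phone"]))):
        if key:
            candidates[(rule, key)].append(r["id"])

for (rule, key), ids in candidates.items():
    if len(ids) > 1:
        print(f"deterministic match on {rule}: records {ids}")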

5.3 Survivorship (Golden Record)

Define how fields are chosen on merge:

survivorship:
  precedence:
    - source: "crm"
    - source: "support"
    - source: "marketing"
  recency_win: true     # prefer latest updated_at when sources tie
  field_overrides:
    email: "most_recent_non_null"
    phone: "most_trusted_source_first"

  • Maintain a decision log for merges/unmerges.
  • Update dedup rules after recurring manual decisions to reduce future toil.
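
To make the interplay of precedence, recency, and overrides concrete, here is a sketch of how one golden-record field could be picked from contributing records under a config like the one above. The record shape and helper names are illustrative, not CluedIn internals.

# Survivorship sketch: choose a field value by source precedence, with recency as a
# tie-breaker, plus per-field overrides such as "most_recent_non_null".
from datetime import datetime

PRECEDENCE = ["crm", "support", "marketing"]          # earlier = more trusted
FIELD_OVERRIDES = {"email": "most_recent_non_null"}

def source_rank(source):
    return PRECEDENCE.index(source) if source in PRECEDENCE else len(PRECEDENCE)

def pick_value(field, contributions):
    """contributions: [{'source': ..., 'value': ..., 'updated_at': datetime}, ...]"""
    candidates = [c for c in contributions if c["value"] not in (None, "")]
    if not candidates:
        return None
    if FIELD_OVERRIDES.get(field) == "most_recent_non_null":
        return max(candidates, key=lambda c: c["updated_at"])["value"]
    # Default: most trusted source wins; recency breaks ties within the same rank.
    return min(candidates, key=lambda c: (source_rank(c["source"]), -c["updated_at"].timestamp()))["value"]

contribs = [
    {"source": "marketing", "value": "value-from-marketing", "updated_at": datetime(2025, 6, 1)},
    {"source": "crm",       "value": "value-from-crm",       "updated_at": datetime(2025, 1, 5)},
]
print(pick_value("email", contribs))   # override: most recent non-null -> value-from-marketing
print(pick_value("phone", contribs))   # default: source precedence -> value-from-crm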

6) Mapping & Schema with a Steward’s Lens

  • Propose new fields/semantics based on business needs.
  • Keep business definitions (glossary) aligned with mapped fields.
  • Avoid heavy transformation in mapping; use Cleaning Projects for standardization.
  • When adding fields, coordinate with Engineers to run staging exports and diff results.

Change Checklist

  • Impacted dashboards identified
  • Export contracts updated (if needed)
  • Backfill plan (if required)
  • Rollback tested

7) Metadata, Glossary & Classifications

7.1 Glossary Entries

Include: business definition, calculation notes, owner, example, data quality caveats.

term: "Active Customer"
definition: "Customer with at least one completed order in the last 90 days"
owner: "Sales Ops"
sources: ["orders_v1","subscriptions_v2"]
dq_notes: "Relies on order status != 'cancelled'"

7.2 Classifications & Policy Hooks

  • Tag fields/entities with PII, Restricted, Confidential, or Public.
  • Policies can mask, hash, or deny reads based on labels.
  • Steward task: keep labels accurate as new fields arrive.

8) Data Quality Metrics & Dashboards

Track a few meaningful KPIs per entity:

  • Completeness: pct_non_null(email)
  • Validity: pct_match(email, regex)
  • Uniqueness: 1 - duplicate_rate(email)
  • Timeliness: p95(now() - updated_at)

Thresholds

metrics:
  email_validity: { warn: ">= 0.97", fail: "< 0.95" }
  duplicate_rate_email: { warn: "<= 0.03", fail: "> 0.05" }
alerts:
  - metric: email_validity
    action: "notify #data-quality"

Habit: Improve by small, frequent increments (e.g., +0.5–1.0% per week).


9) Reading Logs & Audit Logs (Steward View)

9.1 Operational Logs

  • Cleaning logs for before/after counts, failures.
  • Validation logs for rule breaches and quarantines.
  • Export logs for schema diffs and row counts.

Grab the correlation_id from UI errors and filter logs by it when triaging.
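
If your logs land as JSON lines (an assumption here, as is every field name used below), filtering on a correlation_id takes only a few lines; adjust the fields to whatever your log schema actually contains.

# Pull every log event that shares a correlation_id (assumes JSON-lines log files).
import json
import sys

def filter_by_correlation(path, correlation_id):
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue                      # skip non-JSON lines
            if event.get("correlation_id") == correlation_id:
                yield event

if __name__ == "__main__":
    for event in filter_by_correlation(sys.argv[1], sys.argv[2]):
        print(event.get("timestamp"), event.get("level"), event.get("message"))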

9.2 Audit Logs

  • Who approved a merge? Who changed a rule? When was a policy updated?
  • Use audit logs to evidence changes and for post‑incident analysis.
  • Include audit links in your changelog notes.

10) Working with AI Agents (Your Co‑pilot)

10.1 Analysis Prompt Starters

“Analyze Person for missingness, invalid formats, and duplicates. Propose validation rules with example failing records. Suggest deterministic dedup keys first.”

“Given recent invalid email spikes, propose a stricter validation with back‑compat and estimate false positive rate.”

10.2 Safe Auto‑Fix Patterns

  • Normalize casing/whitespace.
  • Standardize ISO codes (countries, states).
  • Convert phone numbers to E.164 (with default country).

Guardrails

  • Keep auto‑fixes reversible; log original values.
  • Restrict AI write scope to masked views where needed.
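
One way to keep an auto‑fix reversible is to write a change-log entry holding the original value alongside every change. The sketch below applies a casing/whitespace fix and keeps that trail; the field name and change-log shape are illustrative.

# Reversible auto-fix: normalize casing/whitespace and record the original value
# so the change can be replayed backwards if needed.
from datetime import datetime, timezone

def auto_fix_email(record, change_log):
    original = record.get("email")
    fixed = original.strip().lower() if isinstance(original, str) else original
    if fixed != original:
        change_log.append({
            "record_id": record["id"],
            "field": "email",
            "before": original,
            "after": fixed,
            "applied_at": datetime.now(timezone.utc).isoformat(),
        })
        record = {**record, "email": fixed}
    return record

log = []
print(auto_fix_email({"id": 42, "email": "  Jane.Doe@Example.COM "}, log))
print(log)   # keep this alongside the run so the fix can be reverted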

11) Collaboration & Change Management

  • Submit PRs or change requests for new validations/cleaning steps.
  • Pair with Engineers for mapping/export impacts.
  • Ask Admins for labels/policy updates and role changes.
  • Document user impact and rollback in each change.

Promotion flow

  1. Draft in staging → 2. Peer review → 3. Scheduled deploy → 4. Post‑deploy checks → 5. Changelog & metric snapshot.

12) Day‑to‑Day Operating Rhythm

Daily (5–10 minutes)

  • DQ dashboard glance; triage new alerts.
  • Dedup queue: approve high‑confidence; flag edge cases.
  • Check last cleaning run; scan error logs.

Weekly

  • Ship 1–3 cleaning improvements.
  • Review validation breach trends; tighten where safe.
  • Update glossary entries if semantics changed.

Monthly

  • Access review for your domain (with Admins).
  • Revisit dedup thresholds; sample precision/recall.
  • Retire deprecated fields/exports.

13) Steward Runbook (Common Situations)

Spike in invalid emails

  1. Confirm via validation logs and sample records.
  2. Identify source; check recent mapping/cleaning changes.
  3. Add temporary quarantine rule; propose stricter regex.
  4. Communicate downstream impact; backfill if needed.
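
Step 2 (identify the source) often comes down to counting invalid values per source system; a quick cut like the one below is usually enough. The export file, the 'source' column, and pandas itself are assumptions here.

# Which source is driving the invalid-email spike? Count breaches per source system.
import pandas as pd

EMAIL_RE = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

df = pd.read_csv("person_export_sample.csv")    # hypothetical export with a 'source' column
emails = df["email"].fillna("").astype(str).str.strip()
df["email_invalid"] = (emails != "") & ~emails.str.match(EMAIL_RE)

print(df.groupby("source")["email_invalid"].agg(["sum", "mean"])
        .sort_values("sum", ascending=False))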

Duplicate surge after new source onboarded

  1. Raise the auto‑approve threshold or pause auto‑merge while you assess the new source.
  2. Add deterministic rule (e.g., exact phone or customer_id).
  3. Review queue; update survivorship precedence with the new source.

Export schema drift detected

  1. Compare staging vs prod exports; find mapping change.
  2. Coordinate rollback with Engineer; file a changelog entry.
  3. Add a validation to catch the offending pattern next time.

14) Templates & Snippets

14.1 Validation Pack (copy/paste)

- rule: phone_is_e164_or_null
  entity: Person
  when: [{ field: phone, is_not_null: true }]
  check: { regex: "^\\+\\d{8,15}$" }
  severity: medium
  on_fail: { action: "flag" }

- rule: name_min_length
  entity: Person
  check: { expression: "len(trim(first_name)) >= 1 and len(trim(last_name)) >= 1" }
  severity: low
  on_fail: { action: "flag" }

14.2 Cleaning Steps

- name: fix_common_email_typos
  action: set
  field: email
  value: replace_multi(lower(email), {"gmal.com":"gmail.com","hotnail.com":"hotmail.com"})
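
replace_multi above is written as a cleaning-step helper; if you want to prototype the same typo mapping locally, a minimal Python equivalent (a hypothetical helper, not a CluedIn function) could look like this:

# Hypothetical local equivalent of the replace_multi helper used above:
# apply exact-substring domain corrections after lowercasing.
def replace_multi(value, replacements):
    for wrong, right in replacements.items():
        value = value.replace(wrong, right)
    return value

print(replace_multi("Jane.Doe@GMAL.COM".lower(),
                    {"gmal.com": "gmail.com", "hotnail.com": "hotmail.com"}))
# -> jane.doe@gmail.com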

14.3 Dedup Reviewer Notes (template)

pair_id: "dup_7f3a"
decision: "approve"
evidence:
  - "exact_email"
  - "matching phone_e164"
comment: "Same person from CRM and Support; safe to auto‑approve rule next time."

14.4 Changelog Entry

date: "2025-08-24"
change: "Added phone E.164 normalization; tightened email regex"
impact: "Duplicate rate down from 5.2%  2.1%; validity up 96.8%  98.9%"
links: ["run logs", "audit event #1432"]

15) What Good Looks Like

  • DQ metrics trending up, no long‑lived red alerts.
  • Small, frequent improvements with audit evidence.
  • Glossary & labels kept up to date.
  • Dedup queue under control; high precision merges; rare unmerges.
  • Change notes clear enough that a new Steward can follow the story.

Appendix — Steward Permission Profile (Typical)

Capability                     | Viewer | Steward | Engineer | Admin
Read entities/exports          |   ✓    |    ✓    |    ✓     |   ✓
Edit validations               |        |    ✓    |    ✓     |   ✓
Edit cleaning projects         |        |    ✓    |    ✓     |   ✓
Review dedup & approve merges  |        |    ✓    |          |   ✓
Configure ingestion/export     |        |         |    ✓     |   ✓
Manage roles/policies          |        |         |          |   ✓
Toggle features                |        |         |          |   ✓

If you need elevated access temporarily, request a time‑boxed role with Admin approval.


You now have the Steward playbook: profile, validate, clean, dedup, document, measure—then repeat. Keep changes small, keep evidence, and keep the pipeline healthy.