Reference Data Management
Reference Data Management (RDM) in CluedIn — Step-by-Step Implementation Guide
Reference data (codes, lists, and hierarchies) underpins every analytics and operational workflow: countries, currencies, units of measure, cost centers, product categories, payment terms, GL accounts, etc. This guide shows how to build a governed, versioned, and syndicated RDM capability in CluedIn.
Outcomes
- Authoritative code sets (canonical values) for each domain (e.g., Country, Currency, UoM, ProductCategory, CostCenter).
- Mappings/crosswalks from source codes → canonical codes, with full lineage and audit.
- Versioning & effective dating for controlled change (draft → approved → published).
- Hierarchies (single or alternate rollups) with validation and impact awareness.
- Syndication of approved sets to ERP/CRM/data warehouse/BI via exports/APIs.
Prerequisites
- Read access to systems that currently host code lists (ERP, CRM, e-commerce, finance, spreadsheets).
- Agreement on canonical owners per domain (who approves changes).
- RACI: Implementers and data stewards need Accountable access to governed objects.
- A dedicated VM/server for heavy initial loads, matching, and toolkit operations.
Reference Model
Entities (suggested)
CodeSet
— e.g.,Country
,Currency
,UoM
,ProductCategory
,CostCenter
.CodeValue
— one row per canonical value (attributes:Code
,Label
,Description
,EffectiveFrom
,EffectiveTo
,Status
).SourceCodeValue
— raw values from each source system (attributes:SourceSystem
,SourceCode
,SourceLabel
).CodeMapping
— crosswalk fromSourceCodeValue
→CodeValue
(attributes:MatchConfidence
,ApprovedBy
,ApprovedAt
).HierarchyNode
— parent/child structure for hierarchical sets (attributes:ParentId
,ChildId
,AltHierarchyName
,Level
,EffectiveFrom/To
).
Statuses
Draft
→Proposed
→Approved
→Published
→Deprecated
→Retired
Step 1 — Define Scope & Ownership
Portal: Governance → Policies / Permissions
- List domains in scope (start with 2–3 high-impact sets, e.g.,
Country
,Currency
,ProductCategory
). - Assign Domain Owner(s) and Stewards per set.
- Document approval SLAs and release frequency (e.g., monthly).
Step 2 — Connect & Ingest Source Lists
Portal: Data Sources → Add Source
- Connect ERP/CRM/Finance lists and relevant spreadsheets.
- For each list, ingest into
SourceCodeValue
withSourceSystem
and load timestamp. - Profile data (duplicates, nulls, length/format distribution).
Tip: Use sampling to inspect variants early (e.g., US
vs USA
vs 840
).
Step 3 — Create Code Sets & Canonical Schemas
Portal: Entity Explorer → + Create Entity Type
- Create
CodeSet
,CodeValue
,SourceCodeValue
,CodeMapping
,HierarchyNode
. - Seed canonical
CodeValue
tables for stable domains (e.g., ISO 3166 country, ISO 4217 currency). - Add validation attributes:
Status
,EffectiveFrom/To
,Owner
,Version
.
Step 4 — Standardize & Validate Source Codes
Portal: Governance → Rules (Data Part Rules)
Create rules to clean and normalize incoming source codes:
- Trim/uppercase codes; collapse whitespace and punctuation.
- Normalize diacritics in labels; map known synonyms (
U.S.
→US
). - Validate against canonical patterns (length, charset).
- Tag issues:
InvalidFormat
,UnknownCode
,NeedsStewardReview
.
Recommended tags:
Standardized
,Alias
,DeprecatedSourceCode
.
Step 5 — Build the Mapping (Crosswalk)
Portal: Entity Matching / Data Stewardship
- Auto-match exact code or exact label to canonical
CodeValue
. - Fuzzy match for labels (e.g., similarity ≥ 0.95) as candidates only.
- Create/maintain
CodeMapping
for each(SourceSystem, SourceCode)
→CodeValue
. - Route low-confidence matches to Steward Review; capture
ApprovedBy/At
.
Pitfall: Do not auto-create new canonical values from fuzzy matches; require steward approval.
Step 6 — Golden Rules for Canonical Survivorship
Portal: Governance → Rules (Golden Record Rules)
- For shared domains (e.g., product categories coming from multiple apps), set precedence:
MasterTaxonomy
>ERP
>eCommerce
>CSV
. - Tie-breakers: most recent change, highest steward confidence, or owner’s source.
- Prevent conflicting edits by locking
Approved/Published
values to owner role only.
Action examples
- On approval → set
Status=Published
, add tagAuthoritative
. - On deprecation → set
EffectiveTo
, tagDeprecated
, update mappings to successor value.
Step 7 — Versioning & Effective Dating
Portal: Entity Explorer / Rules
- Add a
Version
toCodeSet
(e.g.,ProductCategory v2025.08
). - When changing codes/labels:
- Create new
CodeValue
withEffectiveFrom=YYYY-MM-DD
. - Set predecessor
EffectiveTo
. - Migrate
CodeMapping
entries (track change reason).
- Create new
- Keep Draft vs Published states; only publish after UAT.
Step 8 — Hierarchy Management
Portal: Entity Explorer → Hierarchies (via HierarchyNode
)
- Load parent/child edges for the set (e.g., Category tree, Cost Center rollup).
- Validate acyclic structure; enforce single parent (or allow alternate hierarchies using
AltHierarchyName
). - Add effective dates for time-variant hierarchies.
- Create validation rules: no orphans, root count = 1 (unless multi-root by design), max depth.
Step 9 — Stewardship Workflow
Portal: Data Stewardship
- Work queue for: new code proposals, unmapped source codes, hierarchy edits, deprecations.
- Approve/Reject with justification; changes recorded in History.
- Use tags to drive workflow:
NeedsApproval
,NeedsMapping
,BreakingChange
.
Step 10 — Quality Checks & Policies
Portal: Data Quality → Profiling / Rules
- Uniqueness: canonical
Code
unique perCodeSet
. - Completeness: required attributes populated for
Published
. - Integrity: every
SourceCodeValue
maps to aCodeValue
. - No deprecated usage in exports (flag if
EffectiveTo
< today). - Policy: new values must have owner + description; hierarchy edits require two-person approval.
Step 11 — Publish & Integrate
Portal: Exports
- Publish lookup tables and mappings to ERP/CRM/DW/ML features.
- Provide both wide (
CodeValue
) and narrow (SourceCode
→CanonicalCode
) feeds. - Offer read API for applications that need live lookups.
- Start read-only, validate with consumers, then make CluedIn authoritative.
Step 12 — Scheduling & Release Management
- Schedule source refresh, rematching, validation, and export jobs.
- Run large batch reconciliations off-peak on a dedicated server/VM.
- Use Product Toolkit to package
CodeSet
,CodeValue
, rules, and exports; ensure Accountable permissions before promoting to staging → prod.
Step 13 — UAT & Go-Live Checklist
- All in-scope source codes ingested and standardized.
- 100% of
SourceCodeValue
entries mapped or triaged. - Golden survivorship/precedence approved by domain owners.
- Hierarchies validated (no cycles/orphans; effective dates correct).
- Version labeled and Published; change log prepared.
- Downstream consumers validated against new lookups and mappings.
- Runbooks for deprecations, renames, and rollback documented.
Example Rules (Snippets)
Normalize Code (Data Part Rule)
- Condition:
SourceCode
present - Action: uppercase; strip punctuation; set
Code_Normalized
; tagStandardized
.
Auto-map Exact Codes
- Condition:
Code_Normalized
equals canonicalCode
in sameCodeSet
- Action: create
CodeMapping
withMatchConfidence=1.0
; tagAutoMapped
.
Candidate Fuzzy Label Match
- Condition: label similarity ≥ 0.95
- Action: create candidate
CodeMapping
withMatchConfidence=0.95
; tagNeedsStewardReview
.
Deprecate & Redirect
- Condition:
CodeValue.Status = Deprecated
- Action: enforce
SuccessorCode
; update downstream export to map deprecated → successor.
Monitoring & KPIs
- Mapping coverage:
% SourceCodeValue mapped
(target ≥ 99.5%). - Data quality: duplicates = 0; invalid/unknown codes trend ↓.
- Time to approve new code: median hours/days vs SLA.
- Consumer adoption: % downstream systems using CluedIn lookups.
- Incident rate from reference data changes (should trend to zero).
Common Pitfalls & How to Avoid Them
- Letting sources drift (new codes appear unmapped) → schedule nightly ingest + alert on
UnknownCode
. - Over-fuzzy mapping → require steward approval for < 1.0 confidence.
- Breaking changes mid-cycle → use
Draft
versions; publish on release cadence; communicate impact. - Hierarchy loops → enforce acyclic validation + tests before publish.
- Running heavy reconciliations on laptops → use dedicated compute to avoid timeouts.
Rollback & Audit
- Every change (new code, rename, mapping edit, hierarchy move) appears in History with who/when/why.
- Use Undo/Restore Previous Value for attribute fixes; Unmerge not typically needed for RDM but available for accidental consolidations.
- For erroneous releases: republish the prior Version (set its
EffectiveFrom
to now, mark failed versionDeprecated
).
Summary
With CluedIn you can centralize reference data, create reliable crosswalks, govern changes with versioning and approvals, and distribute authoritative values to every consumer. Start with high-impact sets, lock down ownership, automate the mundane, and publish on a predictable cadence to eliminate code chaos.