One trusted source of company data on Snowflake, so people can ask in plain English and AI agents can plan, retrieve, and act reliably, with every answer governed and cited.
One authoritative record for every team and AI system: no competing definitions.
Ask like a sentence, get a chart; no analyst queue.
Every answer grounded, cited and logged: fit for a regulated firm.
Five short, interactive examples. Each is introduced in one plain-language sentence before any detail; click through them.
In plain language: a manager types a normal sentence; the system writes the database query, returns a chart, and reads the answer aloud. Tap a question to ask it.
In plain language: the same portfolio company, a financial-services firm, appears three different ways across three systems. The platform reconciles them into one authoritative record. (Technically, this is Master Data Management.)
Same firm: three names, three identifiers. Which one is canonical?
In plain language: instead of guessing, the AI looks up the most relevant passages in a document and answers using only those, then shows exactly which passage it used. (The tech: retrieval-augmented generation, or RAG; in production, Snowflake Cortex Search.)
In plain language: the old way refreshes data once a night, so by midday it's stale. A streaming pipeline keeps the number live to the second. Watch the two diverge. (The tech: Apache Kafka → Snowpipe Streaming, vs a nightly batch load.)
By midday the batch number is already behind reality. Streaming closes that gap.
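The gap between the two numbers can be made concrete with a small simulation. This is a minimal sketch in plain Python, not Kafka or Snowpipe Streaming code; the event times, amounts, opening total, and midnight load time are all illustrative assumptions.

```python
# Toy comparison of a nightly batch load and a streaming feed (illustrative only).
# Each event is (hour_of_day, amount); every value here is made up for the sketch.
EVENTS = [(9.0, 5.0), (10.5, -2.0), (11.75, 4.0), (15.0, 1.0)]
OPENING_TOTAL = 100.0  # total as of the last nightly load at midnight


def batch_view(opening_total, events, load_hour=0.0):
    """Nightly batch: only events up to the last load are visible."""
    return opening_total + sum(amt for hour, amt in events if hour <= load_hour)


def streaming_view(opening_total, events, now_hour):
    """Streaming: every event that has already happened is visible."""
    return opening_total + sum(amt for hour, amt in events if hour <= now_hour)


def staleness_gap(opening_total, events, now_hour):
    """How far the batch number lags the live number at a given hour."""
    live = streaming_view(opening_total, events, now_hour)
    return live - batch_view(opening_total, events)


if __name__ == "__main__":
    for hour in (8, 12, 16):
        print(hour, batch_view(OPENING_TOTAL, EVENTS),
              streaming_view(OPENING_TOTAL, EVENTS, hour))
```

In this toy data the two views agree at 08:00 and drift apart with each event after that, which is the divergence the example describes.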
In plain language: most questions are answered inside Snowflake. But training large models on years of history is a different job: that's where a Databricks lakehouse and MLflow earn their place, reading the same governed golden records over open Iceberg tables. One source, two engines.
Iceberg golden records: no copy, no drift.
Serving, NL→SQL, RAG, agents: governed.
Heavy training, feature store, experiment tracking.
The discipline: pick the engine the workload needs, without ever forking the source of truth.
Pull data continuously from every system: Bloomberg, S&P, CRM, filings. (streaming + connectors)
Match duplicates and build one trusted "golden" record per company. (MDM)
Define metrics once so language maps to data the same way every time. (semantic layer)
Let people and AI agents ask, retrieve, and act on it. (RAG, agents, copilots)
Wrap everything in access controls, citations, logging, and quality checks, so every answer is safe and auditable. (governance & eval)
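The third step above (define metrics once) can be sketched as a small registry that maps different phrasings to a single governed definition. This is a hypothetical illustration in plain Python, not a Snowflake semantic-layer API; the metric names, SQL expressions, and synonyms are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Metric:
    name: str
    sql: str                      # the one governed definition of the metric
    synonyms: Tuple[str, ...]     # phrasings that should resolve to it


# Hypothetical registry; real definitions would live in the semantic layer.
REGISTRY = [
    Metric("net_revenue", "SUM(gross_amount - refunds)",
           ("net revenue", "revenue", "net sales")),
    Metric("headcount", "COUNT(DISTINCT employee_id)",
           ("headcount", "employees", "staff count")),
]


def resolve(question: str) -> Optional[Metric]:
    """Return the metric whose synonym appears in the question, if any."""
    q = question.lower()
    pairs = [(syn, m) for m in REGISTRY for syn in m.synonyms]
    # Try longer synonyms first so "net revenue" is preferred over "revenue".
    for syn, metric in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        if syn in q:
            return metric
    return None  # no governed metric matches; do not guess
```

Two differently worded questions, such as "What was revenue last quarter?" and "Show net sales by fund", resolve to the same definition, which is the point of defining the metric once.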
For technical reviewers: the underlying architecture, Snowflake SQL, design trade-offs, and a requirement-by-requirement mapping to the job description. Each section expands on request.
In plain language: an ontology turns rows in tables into the business objects Carlyle actually reasons about (a Fund, a Portfolio Company, a Deal), with the relationships between them and the governed actions you can take. It is the shared language analysts, applications, and AI agents all use. Foundry and the Snowflake foundation are complementary: the same golden records back both, giving one canonical source and two consumption planes.
Every object is backed by the same Snowflake golden records: one canonical source feeding both Foundry and Cortex.
Why it matters: agents and analysts operate on objects and actions, not raw tables, so AI work stays grounded, permissioned, and auditable. My Foundry experience (ontology design + integration on a healthcare supply-chain platform) maps directly to portfolio-operations objects here.
Read top to bottom: data flows from sources to consumers. Governance wraps every layer (left). MDM is the foundation: golden records before AI; everything Snowflake-native so RAG, NL→SQL, and agents stay inside the governance boundary.
The three steps behind Examples 2 and 3, explained simply. The actual Snowflake code sits under each step for engineers; you don't need to read it to get the idea.
Rather than compare every record to every other one (far too slow), the system first groups records that could be the same firm, then scores how alike their names are. A high score means it's the same company.
-- SOUNDEX blocking avoids an N² comparison; score with Jaro-Winkler (0 to 100)
WITH pairs AS (
  SELECT a.raw_id AS id_a, b.raw_id AS id_b, a.cik AS cik_a, b.cik AS cik_b,
         JAROWINKLER_SIMILARITY(LOWER(a.name), LOWER(b.name)) AS name_sim,
         EDITDISTANCE(a.domain, b.domain) AS domain_dist
  FROM bronze.raw_records a
  JOIN bronze.raw_records b
    ON a.raw_id < b.raw_id
   AND SOUNDEX(a.name) = SOUNDEX(b.name)   -- blocking key
),
scored AS (
  -- 70% name similarity (rescaled to 0..1) + 30% exact domain match
  SELECT id_a, id_b, cik_a, cik_b,
         (0.7 * name_sim / 100.0) + (0.3 * (domain_dist = 0)::INT) AS match_score
  FROM pairs
)
SELECT id_a, id_b, match_score
FROM scored
-- WHERE, not QUALIFY: QUALIFY requires a window function in the query
WHERE match_score > 0.85 OR cik_a = cik_b;  -- deterministic override (within the same block)
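The same block-then-score idea can be sketched outside the warehouse. This is a minimal plain-Python illustration, not the Snowflake implementation: it swaps in the first letter of the name as a stand-in blocking key (the SQL uses SOUNDEX) and `difflib.SequenceMatcher` as a stand-in similarity score (the SQL uses Jaro-Winkler). The sample records and identifiers are invented.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Invented sample records; field names mirror the SQL above.
RECORDS = [
    {"raw_id": 1, "name": "Acme Financial Services LLC", "domain": "acmefin.com", "cik": "0001"},
    {"raw_id": 2, "name": "Acme Financial Services", "domain": "acmefin.com", "cik": None},
    {"raw_id": 3, "name": "Acme Fin Svcs", "domain": None, "cik": "0001"},
    {"raw_id": 4, "name": "Apex Logistics", "domain": "apexlog.com", "cik": "0002"},
    {"raw_id": 5, "name": "Borealis Capital", "domain": "borealis.com", "cik": "0003"},
]


def block_key(name):
    """Stand-in blocking key: first letter of the name (not SOUNDEX)."""
    return name.strip().lower()[:1]


def name_similarity(a, b):
    """Stand-in for Jaro-Winkler: difflib ratio in the range 0..1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_matches(records, threshold=0.85):
    # Group records by blocking key so only plausible pairs are compared.
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec["name"])].append(rec)

    matches = []
    for group in blocks.values():
        ordered = sorted(group, key=lambda r: r["raw_id"])
        for a, b in combinations(ordered, 2):
            same_domain = a["domain"] is not None and a["domain"] == b["domain"]
            same_cik = a["cik"] is not None and a["cik"] == b["cik"]
            score = 0.7 * name_similarity(a["name"], b["name"]) + 0.3 * int(same_domain)
            # A shared CIK overrides the fuzzy score, as in the SQL.
            if score > threshold or same_cik:
                matches.append((a["raw_id"], b["raw_id"], round(score, 3)))
    return matches
```

With these weights a pair can only clear the 0.85 threshold when the domains match, so record 3 links to record 1 through the shared CIK rather than through its name.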
When the matched records disagree, the most authoritative source wins for each field (for example, S&P for the legal name). The result is a single “golden” record: the best of all three.
-- source_priority: S&P=1, Bloomberg=2, CRM=3; lowest number wins.
-- As written, this keeps the single best-ranked row per match_group (record-level survivorship).
SELECT match_group, name, cik, figi, internal_id, source_system
FROM silver.matched_records
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY match_group
  ORDER BY source_priority, updated_at DESC) = 1;
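Per-field survivorship, where each attribute is taken from the best-ranked source that actually has a value, can be sketched as follows. This is a plain-Python illustration under assumed data, not the production pipeline; the sample values (including the FIGI and internal ID) are placeholders.

```python
SOURCE_PRIORITY = {"S&P": 1, "Bloomberg": 2, "CRM": 3}  # lower number wins

# Invented records for one match group; None means the source lacks the field.
MATCHED = [
    {"source_system": "CRM", "updated_at": "2026-03-01", "name": "Acme Fin",
     "cik": None, "figi": None, "internal_id": "CRM-42"},
    {"source_system": "Bloomberg", "updated_at": "2026-02-10", "name": "ACME FINL SVCS",
     "cik": None, "figi": "BBG-EXAMPLE", "internal_id": None},
    {"source_system": "S&P", "updated_at": "2026-01-15", "name": "Acme Financial Services LLC",
     "cik": "0001", "figi": None, "internal_id": None},
]


def golden_record(records, fields):
    """For each field, take the value from the best-ranked source that has one."""
    # Stable two-pass sort: newest first, then by source priority.
    ranked = sorted(records, key=lambda r: r["updated_at"], reverse=True)
    ranked.sort(key=lambda r: SOURCE_PRIORITY[r["source_system"]])
    golden = {}
    for field in fields:
        golden[field] = next(
            (r[field] for r in ranked if r.get(field) is not None), None
        )
    return golden
```

Here the legal name and CIK come from S&P, the FIGI from Bloomberg, and the internal ID from the CRM, so the golden record combines fields that no single source row holds.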
This sets up the search that finds the right passages inside filings and reports, so the AI answers from real source documents and shows exactly where each fact came from.
CREATE OR REPLACE CORTEX SEARCH SERVICE portfolio_docs
  ON content_chunk
  ATTRIBUTES golden_id, doc_type, page_number
  WAREHOUSE = search_wh
  TARGET_LAG = '1 hour'
  EMBEDDING_MODEL = 'snowflake-arctic-embed-l-v2.0'
AS (
  SELECT content_chunk, golden_id, doc_type, page_number
  FROM silver.doc_chunks
);
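Conceptually, a search service like the one defined above ranks document chunks against a query and returns each passage together with the attributes needed to cite it. The toy sketch below shows that retrieve-and-cite pattern in plain Python. It uses simple word-overlap cosine similarity as a stand-in and is not how Cortex Search ranks results; the chunks and IDs are invented.

```python
import math
import re
from collections import Counter

# Invented chunks; keys mirror the columns indexed by the service above.
CHUNKS = [
    {"golden_id": "G-001", "page_number": 12,
     "content_chunk": "Roughly a fifth of revenue comes from defense and government contracts."},
    {"golden_id": "G-001", "page_number": 30,
     "content_chunk": "The company opened two new offices in Europe."},
    {"golden_id": "G-002", "page_number": 4,
     "content_chunk": "Government exposure is limited to a single state agency contract."},
]


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, limit=3):
    """Return the top passages with the source fields needed for a citation."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c["content_chunk"]))), c) for c in chunks]
    scored = [(s, c) for s, c in scored if s > 0]       # drop unrelated chunks
    scored.sort(key=lambda sc: sc[0], reverse=True)     # best match first
    return [
        {"passage": c["content_chunk"], "company": c["golden_id"],
         "page": c["page_number"], "score": round(s, 3)}
        for s, c in scored[:limit]
    ]
```

Because every result carries its company ID and page number, an answer built only from these passages can point back to exactly where each fact came from.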
-- retrieve grounded, cited passages for an agent or copilot
SELECT r.value:content_chunk::STRING AS passage,
       r.value:golden_id::STRING     AS company,
       r.value:page_number::INT      AS page
FROM TABLE(FLATTEN(input => PARSE_JSON(
       SNOWFLAKE.CORTEX.SEARCH_PREVIEW('portfolio_docs', '{
         "query": "defense and government exposure",
         "columns": ["content_chunk","golden_id","page_number"],
         "limit": 3 }'))['results'])) r;  -- matches are under the "results" key of the response
Default to Snowflake-native so AI stays inside the governance boundary. Reach outside Snowflake only with a clear reason: each external hop is a governance seam to defend.
The principle: if a governed view and a defined metric answer the question, that beats an agent. AI-forward by default, never AI for its own sake.
19+ years across financial services and federal regulated systems. Sanitized for confidentiality; the patterns are exactly what this role needs.
Cut manual document processing 70% at 95% accuracy with AI on Snowflake, in regulated financial data.
Built & governed a half-petabyte Snowflake platform; canonical models adopted across Finance, Marketing & Operations.
Annual savings from smarter warehouse sizing and cost controls, at full production scale.
Led a multi-source migration to a half-petabyte Snowflake lakehouse with Kimball-modeled marts and a shared semantic layer.
Built fuzzy-matching + survivorship pipelines producing one trusted record per entity across competing source systems.
Production retrieval-augmented generation over governed documents β cited, logged, and quality-checked for a regulated firm.
Senior AI & Data Architect: 19+ years, half-petabyte Snowflake in production with Cortex AI, master & reference data, RAG, and Palantir Foundry, across financial services and federal regulated systems. Reston, VA · the 4-day DC cadence works well.
Made by Harnoor Minhas · May 2026 · reference-architecture demo · illustrative data · not a live production system · Snowflake feature names & SQL accurate to current docs · Carlyle stats from public disclosures (NASDAQ: CG).