SPECINSIGNIA / DOSSIER № 0001 / FILE 2026.05.OVERVIEW / § 04 · CAPABILITIES


DOC · INS-CAP-DATA-INTELLIGENCE / REV · 2026.Q2 / CLASS · PARTNER / PRACTICE · Capability / § · 4

/ CAPABILITY · DATA-INTELLIGENCE

Data & Intelligence

Enterprise data + AI, shipped as intelligence. Data engineering, machine learning, LLM and agentic systems, computer vision, and vision-language models, all engineered to run in production.

Manifesto

Most ML lives and dies in a notebook. We aren't in that business.

We engineer the systems around the model: the pipelines that feed it, the evaluation harnesses that catch it failing, and the deployment surface that lets it ship without taking the rest of the platform down. The model is the easy part; the system around it is what makes it production-ready.

We treat eval as the live system, not a chart from before launch. We treat drift as inevitable, not surprising. We treat the production loop as the actual product, and the offline metric as a single signal in a much larger conversation.

The pillars

/ 01

Data pipelines

Streaming + batch, schema-enforced at the seams, with backpressure and replay. Idempotent transforms with end-to-end lineage so the question 'where did this number come from?' has a one-click answer.

Kafka
dbt
Spark
Flink
Lineage

/ 02

MLOps + LLMOps

From notebook to production for both classical ML and LLM systems. Training pipelines as first-class systems, model + prompt registries as source of truth, canary deploys and shadow traffic, instant rollback when the eval line crosses its threshold. LLMOps adds prompt versioning, eval-driven prompt iteration, and per-tenant cost governance on top of the MLOps base.

MLflow
DVC
Ray
Triton
LLMOps
Prompt registry

/ 03

Computer vision + VLM

Defect detection on real factory floors, OCR on real claims documents, identity verification on real ID cards. Plus vision-language models for multimodal use cases: visual question answering, document understanding, scene description. Not COCO benchmarks but the messy data that breaks them. Transformer-based architectures (ViT, SAM, CLIP, LLaVA, Florence-2) alongside the convolutional baselines that still win in production.

PyTorch
ONNX
TensorRT
CLIP
LLaVA
Transformers

/ 04

Agentic systems + chatbots

RAG that actually works at scale. Tool use you can audit. Token budgets enforced. Eval harnesses that score reasoning chains, not just final answers. Includes conversational AI and chatbot products built on the same agentic foundation: persistent memory, retrieval-grounded responses, multi-turn coherence, guardrails that survive prompt injection. Productionised, not demoed.

LangGraph
DSPy
vLLM
Chatbots
RAG
MCP

/ 05

Evaluation as first-class

Offline metrics segmented by cohort and edge case. Online metrics from shadow traffic and holdouts. Calibration checks. Drift detectors that fire before the business notices. Eval is the product.

Eval harness
Drift
Calibration
Cohort

What we ship

Data engineering + data science · pipeline design · stream processing
MLOps · model training, registry, deployment, monitoring
LLMOps · prompt registry, eval gates, inference cost governance
Computer vision + vision-language models (CLIP, LLaVA, Florence-2)
Agentic systems · AI agents, conversational AI, chatbots
RAG architectures · retrieval-augmented generation at enterprise scale
Transformer architectures (encoder, decoder, encoder-decoder, multimodal)

Stack

Snowflake · Databricks · BigQuery

PyTorch · ONNX · MLflow · vLLM

HuggingFace Transformers · LangGraph · DSPy

Evaluation · the live system

Eval is the product. The model is the easy part.

We integrate foundation models into production systems; we don't train them. Published benchmarks (MMLU, HumanEval, MT-Bench, LMSYS Arena Elo) inform which model we pick to start with. They tell us nothing about whether the system around that model will work on the customer's actual task, data, or adversarial surface. The eval that decides whether something ships to production is one layer up from the benchmark.

The four disciplines below run in parallel through every system lifecycle. The harness gates traffic. Drift detection runs in production. Calibration is checked at every retrain. Rollout patterns determine how the next version reaches a single real user. Skip any of the four and the system can fail silently for months before anyone notices.

Production eval harness · the live test

What we measure on the customer's actual data.

The benchmarks model trainers ship with (MMLU, HumanEval, MT-Bench) say a model is ready to leave the lab. The harness below says a system is ready to take production traffic. Four axes, all run on the customer's own data, all refreshed continuously, all version-controlled alongside the prompts and the code.

Task-specific held-out eval

Primary harness

Customer ground truth · quarterly refresh

Real customer prompts paired with gold-label answers held out of every prompt iteration, fine-tune, and few-shot example. Pass rate is the metric; benchmark scores are not. Rebuilt every quarter to track the customer's evolving task surface.

Retrieval + faithfulness

RAG systems

Recall@k · Precision@k · Faithfulness · Citation correctness

RAGAS-style axes computed on the customer's own corpus and questions, not the published benchmark dataset. Recall@k catches missed evidence; faithfulness catches hallucination; citation correctness catches cited-but-unsupported claims.
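As a minimal sketch of the two retrieval axes named above, the functions below compute Recall@k and Precision@k for one question. The document ids and the gold set are hypothetical; faithfulness and citation correctness need a judge model or annotated spans and are not shown.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


# Hypothetical ranking and gold labels, for illustration only.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d2", "d3"}
print(recall_at_k(retrieved, relevant, k=3))     # 2 of the 3 relevant docs were found
print(precision_at_k(retrieved, relevant, k=3))  # 2 of the 3 returned docs are relevant
```

Averaging both numbers over the held-out question set gives the per-release retrieval score.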

Tool-use correctness

Agentic systems

Right tool · right args · steps-to-completion

Every tool call is logged with the model's reasoning, the tool name, the arguments, and the result. Per-tool success rate, steps-to-completion vs optimal, cost-per-task. Failures categorized: wrong tool, wrong args, wrong sequencing.
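One way the three failure categories could be assigned is by comparing a logged trace against a reference trace. This is an illustrative sketch: the `ToolCall` record and the existence of a single "optimal" reference trace per task are assumptions, not a description of the actual harness.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any]


def categorize(actual: List[ToolCall], expected: List[ToolCall]) -> str:
    """Label a trace as ok, wrong_tool, wrong_sequencing, or wrong_args."""
    if actual == expected:
        return "ok"
    actual_names = [call.name for call in actual]
    expected_names = [call.name for call in expected]
    if sorted(actual_names) != sorted(expected_names):
        return "wrong_tool"        # a tool was missing, extra, or substituted
    if actual_names != expected_names:
        return "wrong_sequencing"  # right tools, wrong order
    return "wrong_args"            # right tools in the right order, bad arguments


def step_ratio(actual: List[ToolCall], expected: List[ToolCall]) -> float:
    """Steps-to-completion relative to the reference trace (1.0 is optimal)."""
    return len(actual) / len(expected)
```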

Guardrail effectiveness

Adversarial surface

Injection survival · PII leak · jailbreak resistance

Held-out adversarial test set covering known prompt injection patterns, PII probing, and jailbreak attempts. Measured as percentage of attacks blocked. Refreshed monthly against the OWASP Top 10 for LLM Applications threat categories and current jailbreak research.

Drift detection · the live signal

Models don't fail at launch. They fail three months in.

Production data drifts. Concept drifts. Feature distributions drift. We run statistical tests continuously and alert before the business metric does, because by the time the business metric flags it the issue is already weeks old and customer-visible.

KS-test

Continuous features

Kolmogorov-Smirnov

Two-sample distributional shift, no assumption on shape. Most sensitive to differences near the center of the distribution and less sensitive in the tails. Run per-feature per-day, flag p-value < 0.01 after Bonferroni correction.
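The per-feature check can be sketched with the standard library alone. The p-value here is the asymptotic Kolmogorov series, which is an approximation; in practice a library routine such as `scipy.stats.ks_2samp` would normally be used. The feature names and the dictionary layout are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest vertical gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d: float, n: int, m: int) -> float:
    """Asymptotic two-sided p-value from the Kolmogorov series (approximate)."""
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.3:  # the series converges slowly here and p is effectively 1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def drifted_features(reference: Dict[str, List[float]],
                     current: Dict[str, List[float]],
                     alpha: float = 0.01) -> List[str]:
    """Flag features whose p-value falls below the Bonferroni-corrected alpha."""
    threshold = alpha / len(reference)  # Bonferroni: one test per feature
    flagged = []
    for name, ref_values in reference.items():
        cur_values = current[name]
        d = ks_statistic(ref_values, cur_values)
        if ks_pvalue(d, len(ref_values), len(cur_values)) < threshold:
            flagged.append(name)
    return flagged
```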

PSI

Categorical + binned

Population Stability Index

PSI < 0.1 stable, 0.1-0.25 monitor, > 0.25 retrain. Industry standard for credit risk and operational scoring models. Cheap to compute, easy to chart.
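A minimal sketch of the PSI calculation and the three-band rule quoted above. It assumes the reference and current samples have already been binned into matching buckets; the `eps` floor for empty buckets is an assumption, and teams pick different smoothing rules.

```python
import math
from typing import Sequence


def psi(expected_counts: Sequence[float], actual_counts: Sequence[float],
        eps: float = 1e-6) -> float:
    """Population Stability Index over pre-binned counts."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # floor avoids log(0) on empty buckets
        a_pct = max(a / a_total, eps)
        value += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return value


def psi_action(value: float) -> str:
    """Map a PSI value onto the stable / monitor / retrain bands."""
    if value < 0.1:
        return "stable"
    if value <= 0.25:
        return "monitor"
    return "retrain"


print(psi_action(psi([50, 50], [70, 30])))  # a 20-point shift between two buckets
```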

EMD

Embeddings + ordinal

Earth Mover's Distance · Wasserstein-1

Works on distributions with little or no overlapping support and on ordinal features, where KL-divergence is undefined or blows up. The right primary metric for embedding-space drift in RAG and recommender pipelines.
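For intuition, the one-dimensional Wasserstein-1 distance is the area between the two empirical CDFs. The sketch below covers only that 1-D case; applying it to embeddings assumes a scalar projection first (for example distance to a reference centroid), which is an assumption of this example, since exact multi-dimensional EMD needs an optimal-transport solver.

```python
from typing import Sequence


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """Area between the empirical CDFs of two 1-D samples."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total = 0.0
    i = j = 0
    for left, right in zip(points, points[1:]):
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        # Both CDFs are constant on [left, right), so the gap times the width is exact.
        total += abs(i / len(a) - j / len(b)) * (right - left)
    return total


# Shifting every value by one unit moves the distribution by a distance of one.
print(wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

Unlike KL-divergence, the result stays finite and grows smoothly as two samples with no overlap move further apart.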

Page-Hinkley

Online detection

Streaming change-point

Cumulative-sum test for shift in mean of a streaming series. Fires faster than batched tests when a concept drift starts; tuned with magnitude + threshold per channel.
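A compact sketch of a Page-Hinkley detector for an upward shift in the mean. The parameter names `delta` (tolerated magnitude) and `threshold` follow common usage; the default values are placeholders, not tuned settings.

```python
class PageHinkley:
    """Streaming change-point detector for an increase in the mean."""

    def __init__(self, delta: float = 0.005, threshold: float = 50.0) -> None:
        self.delta = delta          # magnitude of change that is tolerated
        self.threshold = threshold  # alarm level for the cumulative deviation
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0
        self.min_cum = 0.0

    def update(self, x: float) -> bool:
        """Consume one observation; return True when a shift is detected."""
        self.n += 1
        self.mean += (x - self.mean) / self.n      # running mean
        self.cum += x - self.mean - self.delta     # cumulative deviation
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


def first_alarm(stream, detector: PageHinkley):
    """Index of the first observation that triggers the detector, or None."""
    for index, value in enumerate(stream):
        if detector.update(value):
            return index
    return None
```

A mirrored detector (tracking the maximum instead of the minimum) would be needed to catch downward shifts.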

Calibration · the honesty check

A confident wrong model is worse than an uncertain one.

A model that says 99% and is wrong 50% of the time is a liability, not an asset. We measure calibration at every retrain and recalibrate with post-hoc methods when the gap exceeds the threshold. Confidence is a signal downstream systems consume; it has to be true.

ECE

Primary metric

Expected Calibration Error · 15 bins

Bins predictions by confidence, measures gap between confidence and accuracy per bin, weighted by bin population. ECE > 5% means the model lies about how confident it is.
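The binning described above can be sketched as follows, assuming equal-width bins over [0, 1] and one confidence per prediction (the top-class probability).

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 15) -> float:
    """Population-weighted gap between accuracy and mean confidence per bin."""
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        counts[idx] += 1
        conf_sums[idx] += conf
        hit_sums[idx] += int(ok)
    total = len(confidences)
    ece = 0.0
    for count, conf_sum, hit_sum in zip(counts, conf_sums, hit_sums):
        if count:
            ece += (count / total) * abs(hit_sum / count - conf_sum / count)
    return ece


# A model that reports 99% confidence but is right half the time.
print(expected_calibration_error([0.99] * 10, [True] * 5 + [False] * 5))
```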

Brier score

Combined metric

Quadratic proper scoring rule

Penalizes both miscalibration and inaccuracy in one number. Lower is better. Used as the headline metric when stakeholders only have time for one.
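For the binary case the Brier score is the mean squared difference between the predicted probability and the 0/1 outcome. A minimal sketch:

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(probabilities)


print(brier_score([0.5, 0.5], [1, 0]))  # an uninformative coin-flip forecast
```

A perfect forecaster scores 0.0 and a constant 0.5 forecast scores 0.25, which gives stakeholders two fixed reference points.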

Temperature scaling

NN logit calibration

Single-parameter post-hoc

Divides pre-softmax logits by a learned scalar T. Fits in minutes on a held-out set, no retraining. First calibration we try on any classification head.
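A sketch of fitting the scalar T on a held-out set by minimizing negative log-likelihood. The golden-section search is a stand-in chosen to keep the example dependency-free; a gradient-based optimizer is the more common choice. The search bounds are assumptions.

```python
import math
from typing import Sequence


def nll(logits: Sequence[Sequence[float]], labels: Sequence[int], temperature: float) -> float:
    """Mean negative log-likelihood of the labels under softmax(logits / T)."""
    total = 0.0
    for row, label in zip(logits, labels):
        scaled = [z / temperature for z in row]
        peak = max(scaled)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
        total += log_norm - scaled[label]
    return total / len(labels)


def fit_temperature(logits, labels, lo: float = 0.05, hi: float = 20.0, iters: int = 60) -> float:
    """Golden-section search for the T that minimizes held-out NLL."""
    phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if nll(logits, labels, c) < nll(logits, labels, d):
            b = d
        else:
            a = c
    return (a + b) / 2


# Overconfident toy head: logit gap of 4 (about 98% confidence) but 75% accuracy.
held_out_logits = [[4.0, 0.0]] * 4
held_out_labels = [0, 0, 0, 1]
print(fit_temperature(held_out_logits, held_out_labels))
```

A fitted T above 1 softens the probabilities, which is the expected direction for an overconfident model.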

Isotonic regression

Post-hoc calibration

Non-parametric · monotonic

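Fits a piecewise-constant monotonic function from confidence to accuracy. Heavier than temperature scaling and needs more held-out data, but handles miscalibration that a single scalar or a sigmoid fit cannot, and works on any monotone score, including uncalibrated raw outputs.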
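The piecewise-constant fit can be produced with the pool-adjacent-violators algorithm. This is a minimal sketch under the assumption of binary outcomes and a plain step-function lookup at prediction time; production libraries usually interpolate between steps.

```python
from typing import List, Sequence, Tuple


def isotonic_fit(scores: Sequence[float], outcomes: Sequence[int]) -> List[Tuple[float, float]]:
    """Pool-adjacent-violators; returns (upper score bound, calibrated value) steps."""
    blocks = []  # each block is [sum of outcomes, count, highest score in block]
    for score, outcome in sorted(zip(scores, outcomes)):
        blocks.append([float(outcome), 1, score])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]


def isotonic_predict(model: List[Tuple[float, float]], score: float) -> float:
    """Look up the calibrated value for a raw score."""
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]


model = isotonic_fit([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 1, 0, 1, 1, 1])
print(isotonic_predict(model, 0.25))  # falls in the pooled block covering 0.2 and 0.3
```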

Rollout · the deployment gate

The new model doesn't ship to 100% on day one.

We use feature-flag patterns with eval gates. Every stage has a pre-registered metric, a hard guardrail, and a rollback plan. Promotion happens automatically when the gate passes; rollback happens automatically when the guardrail fires. No after-hours pages.

Shadow

Pre-rollout signal

Mirror traffic · no user impact

New model receives production traffic, outputs are logged and compared offline against the champion. No user sees the new model's output. The cheapest insurance policy in ML deployment.

Canary

Bounded production exposure

1% / 5% / 10% staged ramp

Small percentage of real production traffic to the new model. Gates between stages are eval-line + drift signal + business-metric guardrails. Roll back within seconds, not hours.
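The gate between canary stages could look roughly like the sketch below. The metric names, thresholds, and the final 100% step are hypothetical; the point is that promotion and rollback are both decided by pre-registered numbers rather than by a person.

```python
from dataclasses import dataclass
from typing import Tuple

RAMP = [1, 5, 10, 100]  # percent of traffic; the final 100 is an assumption


@dataclass
class StageMetrics:
    eval_pass_rate: float         # eval-line score on live canary traffic
    drift_alerts: int             # drift detectors fired during the stage
    business_metric_delta: float  # relative change versus the champion


def next_action(stage_index: int, metrics: StageMetrics,
                min_pass_rate: float = 0.95,
                max_business_drop: float = -0.02) -> Tuple[str, int]:
    """Return (action, traffic percent) for the stage that just finished."""
    guardrail_fired = (
        metrics.eval_pass_rate < min_pass_rate
        or metrics.drift_alerts > 0
        or metrics.business_metric_delta < max_business_drop
    )
    if guardrail_fired:
        return ("rollback", 0)
    if stage_index + 1 < len(RAMP):
        return ("promote", RAMP[stage_index + 1])
    return ("hold", RAMP[stage_index])


print(next_action(0, StageMetrics(0.97, 0, 0.001)))
```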

A/B test

Statistical comparison

Pre-registered hypothesis · power-analysis sized

Equal split with the success metric committed before launch. Pre-registration prevents moving goalposts post-hoc. Sample size derived from power analysis, not the time the PM wants to wait.
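As a worked example of sizing by power analysis, the sketch below uses the standard normal-approximation formula for comparing two proportions with an equal split. The baseline and target rates are made up for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_base: float, p_target: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed in each arm to detect p_base -> p_target (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_target) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_base * (1 - p_base) + p_target * (1 - p_target))
    ) ** 2
    return math.ceil(numerator / (p_target - p_base) ** 2)


# Hypothetical: detect a lift from 10% to 12% task success.
print(sample_size_per_arm(0.10, 0.12))
```

Halving the detectable effect roughly quadruples the required sample, which is why the duration has to come from the analysis rather than the calendar.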

Multi-armed bandit

Adaptive allocation

Thompson sampling · epsilon-greedy

Traffic allocation continuously shifts toward better-performing arms. Less traffic is spent on losing arms than in a fixed-split A/B test. Used when reward signal is dense and we can afford exploration.
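A minimal Thompson-sampling sketch for binary rewards, using a Beta(1, 1) prior per arm. The simulated success rates are invented for the example, and the seeds exist only to make the run repeatable.

```python
import random
from typing import List, Optional


class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of arms."""

    def __init__(self, n_arms: int, seed: Optional[int] = None) -> None:
        self.successes = [0] * n_arms
        self.failures = [0] * n_arms
        self.rng = random.Random(seed)

    def choose(self) -> int:
        """Sample a plausible success rate per arm and pick the best draw."""
        draws = [self.rng.betavariate(s + 1, f + 1)
                 for s, f in zip(self.successes, self.failures)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm: int, reward: bool) -> None:
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


def simulate(true_rates: List[float], rounds: int, seed: int = 0) -> List[int]:
    """Return how many times each arm was served in a simulated rollout."""
    sampler = ThompsonSampler(len(true_rates), seed=seed)
    env = random.Random(seed + 1)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        arm = sampler.choose()
        pulls[arm] += 1
        sampler.update(arm, env.random() < true_rates[arm])
    return pulls
```

In a simulation such as `simulate([0.1, 0.9], rounds=1000)` most traffic should end up on the second arm, since the posterior for the weaker arm quickly stops producing winning draws.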

Agentic systems · architecture

An agent is a system around a model. The model is the easy part.

Most agents fail not because the model can't reason, but because the system around the model can't recover. The framework decision, the tool surface, the memory architecture, and the orchestration topology are each a load-bearing choice that determines whether the agent survives the first ambiguous user input.

The four axes below are the architecture-level decisions we make on every agentic engagement, in roughly this order. Get the framework wrong and the abstractions fight every feature. Get the tool surface wrong and the model hallucinates actions. Get the memory wrong and the agent forgets why it was started. Get the orchestration wrong and you ship a chatbot with extra steps.

Frameworks · the control-flow layer

Pick the framework that matches the topology you actually need.

Every framework encodes a different assumption about what an agent is. LangGraph thinks agents are state machines. DSPy thinks agents are compiled programs. The OpenAI Agents SDK thinks agents are hand-off-routed functions. Provider tool-use thinks agents don't need a framework at all. We pick the one whose assumption matches the engagement, not the one trending this quarter.

LangGraph

Explicit control flow

Stateful graph · checkpointed · persistent

Agent as a directed graph of nodes with shared state and checkpoints. Right pick when control flow has branches, retries, and human-in-the-loop interrupts. Time-travel debugging via checkpoints; the graph is the spec.

DSPy

Prompt-as-code

Stanford NLP · prompt compilation · automatic optimization

Prompts and chains are Python modules. The optimizer compiles them against a training set with a metric. Right pick when the prompt space is large and we'd rather optimize than hand-tune. The compile step is the contract.

OpenAI Agents SDK

Simple multi-agent

Hand-off pattern · routing · opinionated

Lightweight successor to Swarm. Each agent is a function with tools and a hand-off rule. Right pick when the topology is shallow and the routing is the interesting part. Less expressive than LangGraph but a much smaller surface.

Provider tool-use directly

No framework

Anthropic / OpenAI / Bedrock native APIs

Function calling against the provider SDK, no framework abstraction. Right pick when the agent is one model + a handful of tools + no multi-step routing. Fastest to ship, easiest to audit, and where most production agents actually end up.

Tool use · the action layer

The agent's action space, made auditable.

Tools are how agents do anything other than write text. The patterns below cover the four primary tool-invocation styles: structured function calls, vendor-neutral MCP integration, parallel execution for latency optimization, and sandboxed code execution for open-ended actions. Every tool call is logged with arguments, results, and the model's reasoning trace.

Function calling

Default pattern

JSON-schema tools · structured outputs

Tools defined as JSON schemas, model returns a structured tool-call object. Provider-native on OpenAI, Anthropic, Bedrock, Vertex. Schema validation is the first defense against malformed args; we run it pre-execution every time.
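A sketch of the pre-execution check, written without a schema library so it stays self-contained. It only covers required keys, unknown keys, and primitive types; a full JSON Schema validator would be used in practice. The tool name and parameters are hypothetical.

```python
import json
from typing import Any, Dict, List

# Hypothetical tool definition in the JSON-schema style providers accept.
CLAIM_STATUS_TOOL: Dict[str, Any] = {
    "name": "get_claim_status",
    "parameters": {
        "type": "object",
        "properties": {
            "claim_id": {"type": "string"},
            "include_history": {"type": "boolean"},
        },
        "required": ["claim_id"],
    },
}

_JSON_TYPES = {"string": str, "boolean": bool, "integer": int, "number": (int, float)}


def validate_tool_args(tool: Dict[str, Any], raw_arguments: str) -> List[str]:
    """Return a list of problems; an empty list means the call may execute."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    params = tool["parameters"]
    errors = [f"missing required argument: {name}"
              for name in params.get("required", []) if name not in args]
    for name, value in args.items():
        spec = params["properties"].get(name)
        if spec is None:
            errors.append(f"unknown argument: {name}")
            continue
        # bool is a subclass of int in Python, so reject it for numeric fields.
        bool_as_number = isinstance(value, bool) and spec["type"] != "boolean"
        if bool_as_number or not isinstance(value, _JSON_TYPES[spec["type"]]):
            errors.append(f"argument {name} must be of type {spec['type']}")
    return errors


print(validate_tool_args(CLAIM_STATUS_TOOL, '{"include_history": "yes"}'))
```

Returning the error list to the model as the tool result, instead of raising, gives it a chance to repair the call on the next turn.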

MCP

Tool interop

Model Context Protocol · Anthropic spec · vendor-neutral

Open protocol for connecting agents to external systems. Vendor-neutral, so the same tool surface works across Anthropic, OpenAI, and self-hosted models. Right pick when the tool catalog is shared across multiple agents.

Parallel tool calls

Latency optimization

Concurrent execution · structured aggregation

Multiple tool calls executed in parallel from one model turn. Cuts agent latency on read-heavy workloads (lookups, retrievals, joins). Requires the framework or runtime to deduplicate, aggregate, and re-inject into context cleanly.
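The deduplicate, execute, and re-inject steps can be sketched with `asyncio.gather`. The two lookup tools are hypothetical stand-ins for real I/O, and the example assumes tool arguments are hashable so identical calls can be detected.

```python
import asyncio
from typing import Any, Callable, Dict, List, Tuple


async def lookup_policy(policy_id: str) -> Dict[str, Any]:
    await asyncio.sleep(0.01)  # stands in for a network round trip
    return {"policy_id": policy_id, "status": "active"}


async def lookup_claims(customer_id: str) -> Dict[str, Any]:
    await asyncio.sleep(0.01)
    return {"customer_id": customer_id, "open_claims": 2}


async def run_tool_calls(calls: List[Tuple[str, Callable, Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Run (call_id, tool, kwargs) requests concurrently, one execution per unique call."""
    keys = [(tool.__name__, tuple(sorted(kwargs.items()))) for _, tool, kwargs in calls]
    unique: Dict[Tuple, Tuple[Callable, Dict[str, Any]]] = {}
    for key, (_, tool, kwargs) in zip(keys, calls):
        unique.setdefault(key, (tool, kwargs))
    results = await asyncio.gather(
        *(tool(**kwargs) for tool, kwargs in unique.values()),
        return_exceptions=True,  # a failed tool becomes a result, not a crash
    )
    by_key = dict(zip(unique.keys(), results))
    # Re-inject in the order the model asked, keyed by its call ids.
    return [{"call_id": call_id, "result": by_key[key]}
            for (call_id, _, _), key in zip(calls, keys)]


requested = [
    ("c1", lookup_policy, {"policy_id": "P1"}),
    ("c2", lookup_claims, {"customer_id": "U9"}),
    ("c3", lookup_policy, {"policy_id": "P1"}),  # duplicate of c1
]
print(asyncio.run(run_tool_calls(requested)))
```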

Sandboxed code execution

Open-ended actions

WASM · E2B · Docker · ephemeral

Agent writes code, runtime executes it in a sandboxed environment with bounded CPU, memory, and network. Right pick when the action space is too large for a static tool catalog (data analysis, transformation, ad-hoc scripts).

Memory · the persistence layer

Four tiers of memory, each priced and budgeted.

Agent memory is a hierarchy, not a single store. Context-window memory is fastest and cheapest but ephemeral. Scratchpad memory carries reasoning across a single turn. Conversation summary survives a session. Vector store memory persists across sessions and users. Different facts belong in different tiers; mixing them is the most common cause of agent amnesia in production.

Context window

Working memory

Short-term · in-prompt · ephemeral

Whatever fits in the model's context window for the current turn. The cheapest memory and the only one with zero retrieval latency, but it's lost the moment the turn ends. Budget it like a register file.

Scratchpad

Reasoning trace

Inner monologue · chain-of-thought · structured

Model writes intermediate steps to a structured scratchpad inside the same context window. Improves multi-step reasoning quality, makes the chain auditable, and lets downstream tools key off named intermediate values.

Conversation summary

Medium-term

Rolling summary · episodic · learned compression

Long conversations compressed into a running summary that lives outside the immediate context window. Trade-off: the summary loses fidelity over time. We re-derive from raw transcript when the summary stops being trusted.

Vector store

Long-term recall

Semantic recall · long-term · cross-session

Episodic and semantic memory indexed by embedding. Retrieved by relevance to the current state, not chronology. Persists across sessions. The right home for the kind of memory that has to outlive any single conversation.

Orchestration · the coordination layer

How multiple agents (or multiple turns of one agent) coordinate.

Topology determines what failure modes are even possible. A supervisor topology keeps the routing centralized and debuggable; a sequential pipeline is the most testable; a swarm enables emergent routing at the cost of debug-ability; a debate pattern is the right hammer for high-stakes one-shot decisions. We pick by failure mode, not by what looks impressive in a demo.

Supervisor

Default topology

Router model + specialist workers · hierarchical

One supervisor classifies the request and routes to one of N specialist agents. Specialists return to the supervisor, supervisor returns to user. Easiest topology to reason about and the right pick for most production multi-agent systems.

Sequential pipeline

Workflow agents

Deterministic chain · fixed roles · staged output

Output of agent N is input of agent N+1, in a fixed order. Right pick when the workflow is genuinely linear (e.g., extract -> classify -> summarize -> publish). Less flexible than supervisor but trivially testable and resumable.

Swarm / hand-off

Open exploration

Peer-to-peer · agent-driven routing

Each agent decides which peer to hand off to. No supervisor; routing is emergent. Right pick when the right specialist isn't knowable from the request and has to be discovered. Harder to debug, easier to extend.

Debate / consensus

High-stakes decisions

Multiple independent attempts · judge or vote

N agents independently attempt the same task, a judge model or majority vote selects the answer. Right pick when one wrong answer is expensive and the marginal compute cost of N attempts is acceptable.
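The vote half of this pattern is small enough to sketch directly. The normalization rule and the strict-majority requirement are assumptions; a judge model would replace this function when answers are free-form.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the winning answer, or None when no answer clears the agreement bar."""
    counts = Counter(answer.strip().lower() for answer in answers)
    answer, votes = counts.most_common(1)[0]
    if votes / len(answers) <= min_agreement:
        return None  # no consensus: escalate to a judge model or a human
    return answer


print(majority_vote(["Approve", "approve", "deny"]))
```

Returning `None` rather than a plurality winner keeps split decisions visible, which matters most in the high-stakes cases this topology is meant for.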

The pipeline · interactive

Five stages. Stage 01, Ingest, is shown here.

/ Stage 01 · Ingest

The data front door

Every system upstream has a different definition of 'event.' We make sure they all land in ours, schema-enforced at the seam, with backpressure and replay for the days the upstream is having a worse day than we are.

Kafka, Kinesis, Pub/Sub, RabbitMQ for streams
S3 / GCS batch with manifest + checksum
Schema enforcement at the seam (Avro, Protobuf, JSON Schema)
Dead-letter queues with replay tooling, not just logs
Lineage capture from the first byte
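A toy version of the seam: validate each message against a schema, route rejects to a dead-letter list with the raw payload kept for replay, and drop duplicates so a replay is idempotent. The field names are hypothetical, and the in-memory lists stand in for a real topic and dead-letter queue.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical event contract: field name -> required Python type.
EVENT_SCHEMA = {"event_id": str, "occurred_at": str, "amount_cents": int}


def ingest(raw_messages: List[str]) -> Tuple[List[Dict[str, Any]], List[Dict[str, str]]]:
    """Split raw messages into accepted events and dead-letter records."""
    accepted: List[Dict[str, Any]] = []
    dead_letter: List[Dict[str, str]] = []
    seen_ids = set()
    for raw in raw_messages:
        try:
            event = json.loads(raw)  # JSONDecodeError is a ValueError subclass
            if not isinstance(event, dict):
                raise ValueError("event must be a JSON object")
            for field, expected in EVENT_SCHEMA.items():
                if not isinstance(event.get(field), expected):
                    raise ValueError(f"field {field!r} missing or not {expected.__name__}")
        except ValueError as exc:
            # Keep the raw payload so the message can be replayed after a fix.
            dead_letter.append({"raw": raw, "error": str(exc)})
            continue
        if event["event_id"] in seen_ids:
            continue  # duplicate delivery or replay: already processed
        seen_ids.add(event["event_id"])
        accepted.append(event)
    return accepted, dead_letter
```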

Throughput · last 30 hours

47 / hr · p99 200ms · OK

The pipeline, in the only chart it'll ever need.

AI governance · framework mapping

Twelve AI controls. Four standards. Forty-eight cross-references.

Governance is the part most ML practices skip. We treat it as a system, with each control mapped to the published standards a procurement, compliance, or AI safety reviewer will check against.

AI governance mapping · v2026.Q2

FIG. iii · controls × frameworks

AI control to standard mapping

Each cell names the specific sub-clause our control satisfies. Not checkmarks. Cells marked "—" mean the framework does not address that control area, not that we don't do it.

AI control area	NIST AI RMF 1.0	EU AI Act 2024/1689	ISO/IEC 42001:2023	OWASP LLM Top 10
/ 01 Model cards + system cards	GOVERN-1.6, MAP-1.1	Art. 11, Art. 13	A.7.4	—
/ 02 Datasheets for datasets	MAP-2.3	Art. 10	A.7.4	—
/ 03 Eval harness governance	MEASURE-1.1, MEASURE-2.3	Art. 9, Art. 15	A.9.2	—
/ 04 Drift detection + monitoring	MEASURE-2.4, MEASURE-2.7	Art. 15, Art. 17	A.9.2, A.10	—
/ 05 Prompt injection defense	MANAGE-2.1	Art. 15(5)	A.8.2	LLM01
/ 06 Training-data poisoning defense	MAP-2.2, MANAGE-2.1	Art. 10, Art. 15	A.7.4	LLM03
/ 07 Model extraction + theft defense	MANAGE-2.1, MEASURE-2.7	Art. 15(5)	A.8.2	LLM10
/ 08 Output filtering + safety classifier	MEASURE-2.6	Art. 13, Art. 15	A.7.4	LLM02, LLM06
/ 09 Human-in-the-loop oversight	MANAGE-1.4	Art. 14	A.8.2	LLM08, LLM09
/ 10 Red-team + pre-deployment review	MEASURE-2.7, MANAGE-2.1	Art. 9, Art. 15	A.9.2	LLM01-LLM10
/ 11 Bias + fairness assessment	MEASURE-2.11	Art. 10(2)(g), Art. 15	A.8.2	—
/ 12 Lineage + auditability	GOVERN-1.4, MEASURE-2.10	Art. 12	A.7.4, A.9.2	LLM05

12 controls · 4 frameworks · 48 mappings
Doc · INS-AIGOV-MAP-v2026.Q2

Standards versions: NIST AI RMF 1.0 (NIST.AI.100-1, January 2023), Regulation (EU) 2024/1689 (Artificial Intelligence Act, in force August 2024), ISO/IEC 42001:2023 (AI Management System), OWASP Top 10 for LLM Applications v1.1 (the LLM01-LLM10 identifiers in the table follow the v1.1 numbering; the 2025 edition renumbers several entries). Sub-clause references verified at build time.

MLOps stack · deployment tiers

Ten lifecycle stages. Three deployment postures. Thirty tool selections.

Picking the right tool depends on the deployment posture, not the stage. We ship different stacks for startup speed, enterprise control, and air-gapped regulation, each substitutable column-by-column without rebuilding the lifecycle.

MLOps stack · v2026.Q2

FIG. iv · stages × deployment tiers

Lifecycle stage to tooling map, by deployment posture

Each cell names the actual tool we ship at this stage for this deployment posture, not a category. Pattern-match your constraints to a column. Tiering is by operational burden and data residency, not feature parity.

Lifecycle stage	Startup / fast	Enterprise / controlled	Regulated / air-gapped
/ 01 Experiment tracking	Weights & Biases	MLflow v2.18+ on K8s	MLflow OSS · self-hosted
/ 02 Model registry + versioning	MLflow Registry · managed	MLflow Registry on K8s	MLflow OSS · S3-compatible backend
/ 03 Feature store	Tecton	Feast on K8s · Redis online	Feast OSS · Postgres + Parquet
/ 04 Data + dataset versioning	DVC + S3 · git-tracked	DVC + internal blob store	DVC + Git LFS · air-gapped
/ 05 Training orchestration	Ray on AWS	Ray on K8s · Kueue scheduling	Ray local · KubeRay self-hosted
/ 06 Model serving + inference	Bedrock · Vertex AI	Triton v24+ · vLLM on K8s	vLLM v0.6+ · TGI · llama.cpp
/ 07 Vector + embedding store	Pinecone	Weaviate on K8s · pgvector	Chroma · FAISS · Qdrant OSS
/ 08 Monitoring + observability	Datadog · LangSmith	Prometheus + Grafana · Phoenix	OpenTelemetry · self-hosted
/ 09 Drift + eval pipelines	Evidently Cloud	Evidently OSS on K8s	KS-test · PSI · custom harness
/ 10 Lineage + audit trail	OpenLineage + Marquez	Marquez on K8s · Atlan	OCSF event log · WORM storage

10 stages · 3 tiers · 30 tool selections
Doc · INS-MLOPS-STACK-v2026.Q2

Tool selections reflect what we ship in production. Versions listed where the choice is version-sensitive (vLLM v0.6+ for continuous batching, Triton v24+ for in-flight batching, MLflow v2.18+ for the new prompt registry). Not exhaustive: where multiple tools fit a cell, the primary is named.

Posture

99.7% · Inference uptime · 2025

200ms · p99 latency · production

Models in production

Active data pipelines

Monitoring active

Frameworks & methodology

MLflow Registry
DVC Lineage
Triton Serving
OpenTelemetry
Drift Detection

Tools & runtimes

PyTorch
TensorFlow
ONNX
HuggingFace Transformers
vLLM
LangGraph
DSPy
Bedrock
Vertex AI
Databricks
CLIP / LLaVA / Florence-2
Whisper

Coverage

Computer vision + VLM · industrial, claims, document
Retrieval & RAG · enterprise knowledge
Forecasting · supply chain + retail
Decision agents + chatbots · workflow automation
Conversational AI · customer-facing assistants

Engagement · redacted sample

Insurance engagement (client redacted): computer-vision claims pipeline. Document classification accuracy rose from a 76% baseline to 98.4% with calibrated abstain; the manual review queue was cut by 71%.

