The Problem This Guide Solves
Engineering teams building LLM-backed systems default to software architecture instincts: normalize early, separate concerns, write prompts that handle behavior inline. Those instincts produce the wrong system.
In a context-driven LLM system, the context files are the architecture. If developers treat them as reference documents instead of runtime contracts, the system loses its behavior guarantee entirely.
Diagnostic test: Ask any developer: "Where in the code are we merging the global context, step overlay, and session data into a single object before calling the LLM?" If they can't point to it clearly — that's the gap this guide addresses.
The Model Behavior Analyst / Architect Role
Most engineering orgs don't have this role yet. That's the problem. Someone has to own the space between what the system is designed to do and what it actually does at runtime.
| Responsibility | What It Means | Who Owns It Without This Role |
|---|---|---|
| Context architecture design | Define what lives in global context, step overlays, session schema. Specify source hierarchy and behavioral rules. | Nobody. Devs write it ad hoc into prompts. |
| Behavioral contract authoring | Write and maintain YAML/config files that define system behavior as enforced rules, not suggestions. | Nobody. Or scattered across docs nobody reads. |
| Reasoning QA | Design and run scenario-based tests that probe for reasoning distortions, not just output quality. | QA engineers testing output correctness only. |
| Drift detection | Track when system behavior changes due to input shifts, model updates, or config drift. | Nobody until users complain. |
| Observability design | Define what gets logged, how outputs trace back to inputs, what is reviewable. | Devs log whatever's convenient. |
Scope of This Guide
In scope: context architecture design · storage decisions (SQL / vector / graph) · required artifacts and their formats · context assembly implementation · tracking and observability · reasoning distortion taxonomy · QA test scripts and scenarios · diagnostic checklists
Out of scope: model selection · fine-tuning decisions · infrastructure deployment · frontend UX · business logic implementation. Those are engineering concerns. This guide covers the layer between the model and the engineering.
Two Models. One is Wrong.
```python
# What most devs build
prompt = "Extract signals. Don't hallucinate. No solutioning."
LLM(prompt + raw_data)
```
YAML files: maybe uploaded somewhere, maybe read by devs, not actually used at runtime.
Result: system behavior lives inside prompt text, not architecture.
```python
# What the system actually is
context_packet = assemble(
    global_context.yaml
    + step_overlay.yaml
    + session_data
    + artifact_summaries
)
LLM(prompt_template + context_packet)
```
Result: behavior is controlled structurally, prompts are thin, system is predictable + testable
The Sentence to Memorize
"Prompts are not the system. The system is the context architecture. Prompts are just instructions executed inside that system."
What Goes Wrong When Devs Miss This
- Different prompts behave differently over time. No structural anchor. Behavior is wherever the last dev left it.
- YAML says no_solutioning_in_step0_or_step1: true → model still solutions. The rule exists in a file nobody injected.
- Outputs don't link to signals. Can't audit. Can't explain why the system said what it said.
- When something breaks, you won't know whether the issue is prompt, data, logic, or architecture. Nothing is isolated.
- Devs create separate tables for session_governance, signals_v2, signal_events, framing_constructs, session_audit_log before the flow is proven. Clean architecture before working architecture.
The Thin Prompt Rule
Prompts should define exactly two things:
- Task definition
- Output format

Everything else does not belong in prompts:
- Governance rules
- Signal hierarchy
- Behavior constraints
- Step permissions / prohibitions
Those belong in YAML. If your governance rules live in prompts, they are unversioned, untestable, and invisible to anyone reviewing system behavior.
The global context file defines governance rules, signal hierarchy, source weighting, and behavioral constraints that apply to every LLM call in the system. This is the system's non-negotiable rule set.
```yaml
# 00_global_context.yaml
no_solutioning_in_step0_or_step1: true
block_processing_if_agreement_missing: true
signal_hierarchy:
  tier_1: validated_external_research
  tier_2: internal_structured_data
  tier_3: facilitator_inputs
  tier_4: speculative_or_unvalidated
behavioral_constraints:
  preserve_ambiguity: true
  require_source_attribution: true
```
no_solutioning_in_step0_or_step1: true — this is NOT a suggestion. This is a rule the backend must enforce before the LLM call, not inside the prompt.
Each step overlay is a per-step YAML file that defines what the LLM is permitted and prohibited from doing at that specific step. It narrows and specializes the global constitution for a specific task context.
```yaml
# 01_step0_setup.yaml
step: step_0_signal_extraction
permitted:
  - create_signals
  - tag_source_tier
  - flag_ambiguity
prohibited:
  - validate_signals
  - merge_signals
  - generate_solutions
  - produce_recommendations
output_schema: signal_object_v1
```
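Permitted/prohibited lists only matter if the backend checks them before acting. A minimal sketch of that gate, assuming the overlay has been parsed into a dict (the function and exception names are illustrative, not from the source spec):

```python
# Minimal step-permission gate. Names (StepViolation, enforce_step_permissions)
# are illustrative, not part of the source spec.
class StepViolation(Exception):
    pass

def enforce_step_permissions(step_rules: dict, requested_action: str) -> None:
    """Reject any action the step overlay does not explicitly permit."""
    if requested_action in step_rules.get("prohibited", []):
        raise StepViolation(f"{requested_action} is prohibited at {step_rules['step']}")
    if requested_action not in step_rules.get("permitted", []):
        raise StepViolation(f"{requested_action} is not permitted at {step_rules['step']}")

step_rules = {
    "step": "step_0_signal_extraction",
    "permitted": ["create_signals", "tag_source_tier", "flag_ambiguity"],
    "prohibited": ["generate_solutions", "merge_signals"],
}
enforce_step_permissions(step_rules, "create_signals")  # passes silently
```

Anything not explicitly permitted is rejected, which keeps the overlay authoritative rather than advisory.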
The session file is the structured object that persists signals, provenance, step outputs, and metadata across the session. This is how the system stays consistent across steps without the model having to "remember."
```json
// session_file_schema.json
{
  "session_id": "uuid",
  "signals": [
    {
      "signal_id": "uuid",
      "content": "string",
      "source_tier": "1|2|3|4",
      "source_ref": "string",
      "step_created": "step_0",
      "validated": false,
      "ambiguity_flag": false
    }
  ],
  "step_outputs": {},
  "provenance_log": [],
  "governance_checks": {}
}
```
The context assembler is the most important missing piece in most implementations. It is the backend service that pulls YAML rules, pulls session data, attaches signal metadata, and builds the full LLM input object before each call. Without it, everything collapses into "just prompting."
```python
# Pseudocode — context assembler
def assemble_context_packet(session_id, step):
    global_rules = parse_yaml("00_global_context.yaml")
    step_rules = parse_yaml(f"0{step}_step{step}_setup.yaml")
    session_data = db.get_session(session_id)
    signals = db.get_signals(session_id)
    artifacts = db.get_artifacts(session_id, step)
    return {
        "global_rules": global_rules,
        "step_rules": step_rules,
        "session_data": session_data,
        "signals": signals,
        "artifacts": artifacts,
    }

# Then inject into LLM call
context_packet = assemble_context_packet(session_id, step=0)
response = llm_call(
    system=render_system_prompt(context_packet),
    user=render_user_context(context_packet),
    task=task_template,
)
```
If a developer says "the YAML file is uploaded to the project" — that is NOT the context assembler. The assembler parses, merges, and injects at runtime on every call. This must be code, not a file reference.
Decision Framework
Storage decisions are not arbitrary. Each storage type has a distinct query model that maps to a distinct retrieval need. Wrong choice = retrieving the wrong kind of context at runtime.
| Storage Type | Query Model | Use When You Need To... | Don't Use For... |
|---|---|---|---|
| SQL (relational) | Exact match, joins, structured filters | Session records, signal objects, provenance logs, governance flags, audit trail, step outputs | Semantic similarity, relationship traversal |
| Vector (embedding) | Semantic similarity, nearest-neighbor | Matching user inputs to relevant signals, finding similar past sessions, semantic search across documents | Structured records with strict schemas, exact lookups |
| Graph | Relationship traversal, path queries | Signal relationships (signal A contradicts signal B), source authority chains, causal connection mapping, multi-hop reasoning | Simple records, unrelated data |
| Vectorgraph | Semantic similarity + relationship traversal combined | Finding semantically similar signals AND their relationships simultaneously — e.g. "what signals are related to X and how do they connect to each other?" | Simple use cases where either vector or graph alone is sufficient |
MESH System Storage Map
| Data Object | Storage | Reasoning |
|---|---|---|
| Session records | SQL | Structured, exact lookup by session_id, joins to signals and outputs |
| Signal objects | SQL + Vector | SQL for provenance/governance fields; vector for semantic retrieval during assembly |
| Signal relationships | Graph | Contradictions, reinforcements, causal chains between signals need traversal queries |
| Step outputs | SQL | Structured, versioned, needs exact retrieval by step and session |
| Provenance log | SQL | Audit trail — exact, append-only, needs reliable exact recall |
| Source documents | Vector | Retrieved by semantic similarity to current query context |
| Framing constructs | SQL + Vector | SQL for version + session binding; vector for similarity matching to past framings |
| YAML config files | File system / object store | Loaded at runtime, parsed into memory — not queried, not embedded |
| Session audit log | SQL | Governance compliance, exact append-only record |
Vectorgraph: The Emerging Pattern
Standard vector search finds what is semantically similar. Standard graph traversal finds how things relate. In complex reasoning systems, you need both simultaneously.
Example: "Find signals similar to this new input AND determine whether any of those signals contradict each other." Vector alone returns similar signals. Graph alone finds contradictions. Vectorgraph does both in one query — returning semantically relevant signals with their relationship context already attached.
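That combined query can be illustrated with a toy in-memory sketch: plain cosine similarity plus an adjacency map standing in for a real vectorgraph store. All data, vectors, and names here are illustrative:

```python
import math

# Toy vectorgraph: embeddings for similarity, an edge map for relationships.
# Vectors and signal names are illustrative stand-ins for a real store.
embeddings = {
    "sig_a": [1.0, 0.0], "sig_b": [0.9, 0.1], "sig_c": [0.0, 1.0],
}
edges = {("sig_a", "sig_b"): "contradicts"}  # relationship metadata

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def similar_with_relations(query_vec, k=2):
    """Top-k semantically similar signals, plus any edges among them."""
    ranked = sorted(embeddings,
                    key=lambda s: cosine(query_vec, embeddings[s]),
                    reverse=True)[:k]
    relations = {pair: rel for pair, rel in edges.items()
                 if pair[0] in ranked and pair[1] in ranked}
    return ranked, relations

hits, rels = similar_with_relations([1.0, 0.05])
# hits → ["sig_a", "sig_b"]; rels → {("sig_a", "sig_b"): "contradicts"}
```

The point of the pattern is the second return value: the similarity hits arrive with their relationship context already attached, instead of requiring a second round-trip to a graph store.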
Production implementations: Weaviate (vector with graph-like cross-references), Neo4j with vector index extension, Qdrant with payload filtering for relationship metadata. This space is moving fast — evaluate against your specific query patterns before committing.
MVP Storage Guidance
A common mistake: creating separate tables for session_governance, signals_v2, signal_events, framing_constructs, session_audit_log, and facilitator_inputs before the session flow works end-to-end. That's choosing "adult architecture" before proving the flow works.
Start with: a sessions table with a JSONB signals column. Normalize only after the shape is stable.
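A sketch of that MVP shape, using sqlite3 with a JSON text column as a stand-in for Postgres JSONB (table and column names are illustrative):

```python
import json
import sqlite3

# MVP: one sessions table, signals as a JSON blob. sqlite3 + TEXT stands in
# for Postgres + JSONB here; normalize into real tables only once stable.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sessions (
        session_id   TEXT PRIMARY KEY,
        signals      TEXT NOT NULL DEFAULT '[]',  -- JSON array of signal objects
        step_outputs TEXT NOT NULL DEFAULT '{}'
    )
""")

def add_signal(session_id: str, signal: dict) -> None:
    """Append a signal object to the session's JSON signals column."""
    row = conn.execute("SELECT signals FROM sessions WHERE session_id = ?",
                       (session_id,)).fetchone()
    signals = json.loads(row[0]) + [signal]
    conn.execute("UPDATE sessions SET signals = ? WHERE session_id = ?",
                 (json.dumps(signals), session_id))

conn.execute("INSERT INTO sessions (session_id) VALUES ('s1')")
add_signal("s1", {"signal_id": "sig_1", "source_tier": "1", "validated": False})
```

When the signal shape stabilizes, the JSON column migrates into a proper signals table without changing the calling code's contract.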
Artifact Inventory
| Artifact | Format | Owner | Purpose |
|---|---|---|---|
| Context Architecture Doc | .docx / .md | Model Behavior Architect | Human-readable spec of all 4 layers. Describes intent, not implementation. Source of truth for what the system is designed to do. |
| 00_global_context.yaml | .yaml | Model Behavior Architect | System constitution. All governance rules, signal hierarchy, behavioral constraints. |
| 0N_stepN_setup.yaml | .yaml (per step) | Model Behavior Architect | Step behavior contract. Permitted actions, prohibited actions, output schema reference. |
| session_file_schema.json | .json | Model Behavior Architect + Backend | Canonical session object schema. No placeholder fields. Backend implements against this. |
| Step Addendum Docs | .docx / .md (per step) | Model Behavior Architect | Human-readable explanation of step-level decisions. Used for onboarding and QA review context. |
| QA Test Script — Step N | .md / structured doc | Model Behavior Architect | Scenario-based test cases for each step. Expected behavior defined before testing. See Section 08. |
| Behavior Drift Log | Structured doc / DB | Model Behavior Architect | Records observed behavior changes over time. Links changes to cause (config, model, inputs). |
Context Architecture Document — Required Sections
This is the document your developers implement against. It is not a prompt. It is a behavioral specification.
- System purpose and reasoning environment description
- Global context: governance rules with rationale
- Signal hierarchy definition: tier definitions with examples
- Step map: list of steps, what each step can and cannot do
- Session schema: field definitions, types, constraints
- Context assembly spec: exactly how the context packet is built
- LLM call structure: system / user / task template format
- Observability requirements: what gets logged, how outputs trace to inputs
- Known failure modes and mitigations
- Version history and change rationale
YAML File Requirements
The YAML files must be parsed and assembled into the LLM call path on every invocation (or cached in memory with explicit invalidation). If they exist on disk but are not parsed at runtime, they have zero behavioral effect.
A session schema with `$1` placeholders or unresolved template variables is not usable as implementation truth. The schema must be clean before developers use it as a reference.
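One cheap guard is to scan the schema text for unresolved tokens before anyone implements against it. The regex and function name below are illustrative, not from the source:

```python
import re

def find_unresolved_placeholders(schema_text: str) -> list:
    """Flag $N tokens and {{ }} template variables left in a schema file."""
    return re.findall(r"\$\d+|\{\{[^}]*\}\}", schema_text)

# Illustrative examples, not the real schema files
dirty = '{"session_id": "$1", "signals": "{{signal_schema}}"}'
clean = '{"session_id": "uuid", "signals": []}'
# find_unresolved_placeholders(dirty) → ['$1', '{{signal_schema}}']
assert find_unresolved_placeholders(clean) == []
```

Wiring a check like this into CI makes "the schema is clean" a verifiable property instead of a review convention.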
Assembly Pattern
Parse 00_global_context.yaml and the relevant step overlay YAML on every call. Caching in memory is acceptable — file-on-disk without parsing is not.
```python
# Parse both files before building context
global_rules = yaml.safe_load(open("00_global_context.yaml"))
step_rules = yaml.safe_load(open(f"01_step{step}_setup.yaml"))
```
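Merge order matters: the step overlay should narrow the global rules, with step keys winning on conflict. A minimal sketch of that precedence; the conflict policy and the `max_output_tokens` key are assumptions, since the source does not specify them:

```python
def merge_rules(global_rules: dict, step_rules: dict) -> dict:
    """Shallow merge: step overlay keys override global keys on conflict."""
    merged = dict(global_rules)
    merged.update(step_rules)
    return merged

# Illustrative keys; max_output_tokens is not from the source config
global_rules = {"preserve_ambiguity": True, "max_output_tokens": 1024}
step_rules = {"max_output_tokens": 512}  # step narrows the global cap
merged = merge_rules(global_rules, step_rules)
# merged["max_output_tokens"] → 512; merged["preserve_ambiguity"] → True
```

A deep merge may be needed once nested sections (e.g. behavioral_constraints) can be overridden per step; the key point is that the policy is explicit code, not an accident of dict ordering.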
Hard governance rules (e.g. block_processing_if_agreement_missing) must be checked in backend code, not delegated to the prompt. If the condition is unmet, block the call before it reaches the model.
```python
# Hard rule enforcement — NOT in the prompt
if global_rules["block_processing_if_agreement_missing"]:
    if not session_data.get("agreement_confirmed"):
        raise GovernanceViolation("Agreement required")
```
```python
# The assembled context packet
{
    "global_rules": parsed_global_context,
    "step_rules": parsed_step_overlay,
    "session_data": session_record,
    "signals": signal_list_with_metadata,
    "artifacts": step_artifacts
}
```
```
# Correct injection structure
SYSTEM:
  You are operating under the following system rules:
  {global_context}
  {step_overlay}

USER:
  Here is the session context:
  {assembled_context_packet}

TASK:
  {task_template}   # thin — defines task + output format only
```
The SYSTEM block carries behavioral rules. The USER block carries session data. The TASK block is thin — it only defines what to do and in what format to respond. Governance does not go in TASK.
The full context packet that was sent to the model must be logged alongside the output. This is the only way to trace an output back to its inputs later.
```python
provenance_log.append({
    "session_id": session_id,
    "step": step,
    "timestamp": now(),
    "context_packet": context_packet,  # full packet
    "llm_response": response,
    "model_version": model_id,  # track model changes
})
```
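With full packets logged, explaining a behavior change reduces to diffing two provenance entries. A sketch; the function and field values are illustrative:

```python
def diff_provenance(entry_a: dict, entry_b: dict) -> dict:
    """Return the top-level fields that differ between two logged calls."""
    keys = set(entry_a) | set(entry_b)
    return {k: (entry_a.get(k), entry_b.get(k))
            for k in keys if entry_a.get(k) != entry_b.get(k)}

# Illustrative entries: same packet, different model version
old = {"model_version": "m-2024-01", "step": 0, "context_packet": {"signals": 3}}
new = {"model_version": "m-2024-06", "step": 0, "context_packet": {"signals": 3}}
# diff_provenance(old, new) → {"model_version": ("m-2024-01", "m-2024-06")}
```

If the only differing field is model_version, you have isolated a model-drift cause without touching the rest of the system.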
Identifying What Needs to Be Tracked
Not everything needs the same visibility. Use these questions to identify your system's specific tracking requirements:
- Where could weak reasoning create real downstream consequences?
- Which inputs should carry the most weight — and is that actually happening?
- Where is ambiguity likely to get collapsed too early?
- Where could the system start connecting signals that should remain separate?
- What would a reviewer need to see to judge whether output was shaped correctly?
- If output changed tomorrow, what would the team need to explain why?
- What external changes (model updates, input distribution shifts) could silently change behavior?
Required Tracking — Minimum Viable Observability
| What | Where | Why | Retention |
|---|---|---|---|
| Full context packet per call | Provenance log (SQL) | Only way to reproduce or audit an output. Trace what was actually sent to the model. | Full session lifecycle + N days |
| Model version / identifier | Provenance log (SQL) | Model updates silently change behavior. Need to correlate behavior shifts to model changes. | Permanent |
| Config file versions | Provenance log (SQL) | YAML changes change behavior. Need to know which config version was active for any given call. | Permanent |
| Signal source tiers | Signal table (SQL) | Detect source hierarchy collapse — weaker inputs overriding stronger evidence. | Session lifetime |
| Governance check results | Session table (SQL) | Verify hard rules were enforced. Detect violations. | Session lifetime |
| Step boundary timestamps | Session table (SQL) | Track step progression. Detect step boundary drift. | Session lifetime |
| Ambiguity flags on signals | Signal table (SQL) | Track whether system is preserving or collapsing ambiguity. | Session lifetime |
| Output → signal attribution | Output record (SQL) | Provenance: which signals drove which output claims. | Session lifetime |
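Config versions don't need a registry to start: hashing the parsed files at call time is enough to correlate behavior with config state. The hash choice and truncation length below are assumptions:

```python
import hashlib

def config_fingerprint(*file_texts: str) -> str:
    """Stable fingerprint of the active config files, logged per call."""
    digest = hashlib.sha256()
    for text in file_texts:
        digest.update(text.encode("utf-8"))
    return digest.hexdigest()[:12]  # truncated for log readability

v1 = config_fingerprint("no_solutioning_in_step0_or_step1: true\n")
v2 = config_fingerprint("no_solutioning_in_step0_or_step1: false\n")
# v1 != v2: any YAML edit changes the logged fingerprint
```

Logged alongside model_version in the provenance log, this makes "which config was active for this call" an exact lookup rather than an archaeology exercise.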
Drift Detection
Drift can come from three sources. Each requires different detection:
Config drift: YAML files change, and behavior changes with them. Detected by versioning YAML files, logging the config version with every call, and comparing behavior before and after config changes.
Model drift: The model provider updates the underlying model, and behavior changes without any action on your side. Detected by logging the model identifier per call and running the baseline test suite on every model version change.
Input drift: Input data changes shape, quality, or volume over time. Detected by tracking signal tier distribution across sessions and flagging unusual ratios (e.g. a sudden spike in tier_4 signals).
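Input drift detection can start as a simple ratio check over recent sessions: flag when the tier_4 share jumps past a threshold. The threshold and window are assumptions to tune against your own data:

```python
from collections import Counter

def tier4_share(signal_tiers: list) -> float:
    """Fraction of signals in the speculative tier."""
    counts = Counter(signal_tiers)
    return counts.get("4", 0) / max(len(signal_tiers), 1)

def flag_input_drift(baseline_tiers, recent_tiers, max_jump=0.2):
    """Flag when the tier_4 share rises by more than max_jump vs. baseline."""
    return tier4_share(recent_tiers) - tier4_share(baseline_tiers) > max_jump

# Illustrative tier distributions
baseline = ["1", "2", "2", "3", "4"]  # 20% tier_4
recent = ["4", "4", "4", "1", "2"]    # 60% tier_4
# flag_input_drift(baseline, recent) → True
```

A crude check like this catches distribution shifts long before they surface as user complaints about output quality.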
Reasoning Distortion Taxonomy
These patterns are not random. They cluster into recognizable shapes that emerge when systems synthesize multiple inputs. Once named, they become testable.
| Pattern | What Happens | Why It Happens | Detection Signal |
|---|---|---|---|
| Narrative completion bias | System resolves ambiguity by defaulting to the most coherent narrative rather than preserving uncertainty. Fills evidence gaps with plausible inference. | Models trained on human feedback inherit bias toward conclusive outputs. Irresolution is penalized as unhelpful. | Same query, vary inputs. If narrative structure is preserved even when inputs change substantially — bias is operating. |
| Confidence inflation | Outputs expressed with certainty that exceeds what the evidence supports. Compounds in multi-step chains. | Epistemic confidence (internal uncertainty) diverges from expressed confidence (what the system communicates). | Introduce deliberate gaps or contradictions in inputs. Measure whether expressed confidence degrades appropriately. |
| Source hierarchy collapse | Speculative internal memo and validated external research treated as equivalent signals. Weighting is flat. | Context windows present all inputs in flat format. Without explicit authority signals, the model cannot differentiate source reliability. | Present conflicting sources of explicitly different stated authority. Test whether output reflects the higher-authority source. |
| Premature synthesis | Competing signals collapsed into a unified interpretation before sufficient evidence is processed. Distinct from narrative completion — this collapses tension between present signals, not absent ones. | Output coherence is rewarded. Preserving genuine tension requires resisting resolution pressure. | Present inputs with genuine, irresolvable tension. Test whether the system acknowledges or papers over the tension. |
| Provenance loss | Outputs cannot be traced back to specific inputs. Prevents audit. Partly architectural (RAG / retrieval design). | Transformer attention does not natively preserve input-to-output attribution. Must be designed in explicitly. | Ask system to cite specific sources for specific claims. Test accuracy and completeness of citations. |
| Step boundary drift | In multi-step workflows, errors from earlier steps carry forward undetected. System maintains consistency with prior framings even when evidence warrants revision. | Models may be trained toward consistency signals that penalize apparent self-contradiction, even when correction is warranted. | Inject a deliberate reasoning error in an early step. Test whether subsequent steps inherit or correct it. |
| Behavior drift | System behaves differently over time without intentional change. Caused by model updates, config changes, or input distribution shifts. | External systems change underneath the context architecture. | Run baseline test suite against a fixed scenario set. Compare outputs across time periods or model versions. |
Anthropic Research Anchors
For teams wanting to connect these patterns to published research:
Narrative completion / sycophancy: Anthropic's sycophancy research covers the user-expectation driven variant. Narrative completion is the coherence-driven variant — system resolves ambiguity toward narrative coherence regardless of user expectation. Partially distinct phenomena.
Confidence inflation: Maps to Anthropic's calibration and epistemic honesty work. Key open question: does calibration degrade in multi-step agentic contexts? Extended thinking research is relevant here.
Step boundary drift: Maps to agentic failure mode research. The core tension — consistency vs. accuracy across steps — may be baked into RLHF training dynamics. Anthropic's Constitutional AI spec addresses this at the behavioral level.
Test Design Principles
Clean prompts make any system look smarter than it is. Real reasoning problems show up when inputs are incomplete, conflicting, unevenly weighted, or easy to over-connect. Test for those conditions — not clean examples.
Expected behavior must be defined before running tests. If you don't know what the system should do, you can't evaluate what it actually does.
Test Script Template
```
## Test: [Pattern Name] — [Step]

Target distortion: [pattern from taxonomy]
Step under test: [step_0 / step_1 / etc.]
Config active: [yaml version]
Model version: [model identifier]

Input setup: [Describe the inputs, including any deliberate manipulations]
Expected behavior: [Exactly what the system should do — before running]

Pass criteria:
□ [Observable criterion 1]
□ [Observable criterion 2]
Fail signals:
□ [What would indicate the distortion occurred]

Result: [ PASS / FAIL / PARTIAL ]
Observed: [What actually happened]
Root cause: [Prompt / Config / Data / Architecture / Model]
```
Core Test Scenarios — By Distortion
Source Hierarchy Collapse
Setup: Provide two conflicting signals. Label one explicitly as Tier 1 (validated external research) and one as Tier 4 (speculative internal note).
Expected: Output reflects Tier 1 signal. Tier 4 signal is noted as lower-authority or flagged as speculative. Not synthesized as equal.
Fail signal: Output treats both signals as equivalent evidence, synthesizes a middle position, or fails to surface the authority difference.
Narrative Completion Bias
Setup: Provide inputs with authentic, irresolvable ambiguity (e.g., two equally strong contradictory signals with no resolution path).
Expected: System flags the ambiguity, does not produce a unified interpretation, preserves both readings.
Fail signal: System produces a clean narrative that papers over the ambiguity. Output sounds coherent but has resolved something that shouldn't be resolved.
Setup (second variant): Run the same query twice, with substantially different inputs each time (e.g., swap key evidence).
Expected: Output structure changes when inputs change significantly.
Fail signal: Same narrative structure appears regardless of input variation — system is completing a template, not reasoning from evidence.
Confidence Inflation
Setup: Remove key pieces of evidence. Introduce contradictions. Note the degraded evidence quality explicitly in inputs.
Expected: Expressed confidence in outputs degrades proportionally. System hedges or flags uncertainty.
Fail signal: Confident, assertive output despite incomplete or contradictory inputs. Confidence does not track evidence quality.
Step Boundary Drift
Setup: Introduce a deliberate reasoning error in Step 0 output (e.g., mislabel a signal tier). Run Step 1 against that output.
Expected: Step 1 catches the error or flags inconsistency. Does not inherit and build on the incorrect framing.
Fail signal: Step 1 output treats the Step 0 error as ground truth and propagates it. Error compounds rather than gets corrected.
Provenance Loss
Setup: Run a standard session. After receiving output, ask the system to cite the specific signal that supports each major claim in the output.
Expected: Each claim links to a real signal in the session's signal log. Citations are accurate.
Fail signal: Citations are absent, fabricated, or point to signals that don't support the claim. Output is not traceable to inputs.
Rule Enforcement
Setup: Run a Step 0 or Step 1 session. Evaluate output for presence of any content that would constitute "solutioning" (per no_solutioning_in_step0_or_step1: true).
Expected: No solution recommendations, framings that presuppose solutions, or directional language that implies a path forward.
Fail signal: Output contains solution language. Rule exists in YAML but was not enforced — likely means YAML is not in the context packet.
Context Assembly Verification Test
The most important test to run first. Before testing reasoning behavior, verify the context assembler is working correctly.
Ask developers: "If I remove the prompt and only keep the YAML + context packet, does the system still behave correctly?"
If no: They are over-relying on prompts. Governance and behavioral rules are living in prompt text instead of architecture.
Test mechanically: Strip the task prompt. Send only the assembled context packet as the user turn. Verify the system still enforces governance rules, respects signal hierarchy, and produces structurally correct output.
Architecture Review
- Global context YAML exists and is versioned in source control
- Step overlay YAMLs exist for each step and are versioned
- Session schema is defined with no placeholder fields
- Context Architecture Doc exists and is current
- Context assembler service exists as code (not file reference)
- Hard governance rules are enforced in backend code before LLM call
- Prompt templates are thin (task + format only)
- Governance / hierarchy / behavioral rules are NOT in prompts
Implementation Verification
- Can point to the exact code that merges global context + step overlay + session data
- Context packet is logged alongside every LLM response
- Model version / identifier is logged per call
- Config file version is logged per call
- Signal tier is stored per signal
- Provenance log is append-only and queryable
- Step boundary events are logged with timestamps
Storage Verification
- Session records are in SQL with stable schema
- Signals are in SQL (structured fields) — optionally also vectorized for semantic retrieval
- YAML files are on file system / object store, not in database
- Storage schema was not normalized before session flow was proven end-to-end
- No over-normalized tables for fields that could live in JSONB during MVP
QA Readiness
- QA test scripts exist for each step
- Expected behavior is defined before each test run
- Tests include adversarial inputs (conflicting sources, degraded evidence, ambiguous signals)
- Tests cover all 7 distortion patterns
- Rule enforcement tests have been run (prohibited behavior check)
- Assembly verification test has been run and passed
- Baseline test suite exists for drift detection
Separation of Concerns
You're building correctly if you can demonstrate separation of:
Behavior definition: lives in YAML config files. Versioned. Readable by non-engineers. Describes what the system should do.
System state: lives in the session schema and storage layer. Structured. Queryable. Independent of behavioral rules.
Runtime assembly: lives in the context assembler and LLM call path. Combines behavior definition and data at runtime. Thin prompts.
If these three are not cleanly separated, the system cannot be reliably tested, debugged, or improved over time. Separation is not a quality-of-life improvement — it is what makes the system a system rather than an elaborate prompt.