Signals Context Architecture
Field Guide · Working Draft

AI Reasoning Systems —
Context Architecture

A practical guide for teams building LLM-backed systems — covering context architecture, reasoning distortion patterns, QA test design, and what it takes to make AI behavior predictable, testable, and maintainable over time.

Author Chrys Li
Series Context Architecture
Version v0.1 · Working Draft
Sections 00–09
Section 00
Introduction & Role Definition
Who owns this, what they own, and why it exists as a distinct function.

The Problem This Guide Solves

Engineering teams building LLM-backed systems default to software architecture instincts: normalize early, separate concerns, write prompts that handle behavior inline. Those instincts produce the wrong system.

In a context-driven LLM system, the context files are the architecture. If developers treat them as reference documents instead of runtime contracts, the system loses its behavior guarantee entirely.

Diagnostic test: Ask any developer: "Where in the code are we merging the global context, step overlay, and session data into a single object before calling the LLM?" If they can't point to it clearly — that's the gap this guide addresses.


The Model Behavior Analyst / Architect Role

Most engineering orgs don't have this role yet. That's the problem. Someone has to own the space between what the system is designed to do and what it actually does at runtime.

Responsibility · What It Means · Who Owns It Without This Role
Context architecture design · Define what lives in global context, step overlays, session schema. Specify source hierarchy and behavioral rules. · Nobody. Devs write it ad hoc into prompts.
Behavioral contract authoring · Write and maintain YAML/config files that define system behavior as enforced rules, not suggestions. · Nobody. Or scattered across docs nobody reads.
Reasoning QA · Design and run scenario-based tests that probe for reasoning distortions, not just output quality. · QA engineers testing output correctness only.
Drift detection · Track when system behavior changes due to input shifts, model updates, or config drift. · Nobody until users complain.
Observability design · Define what gets logged, how outputs trace back to inputs, what is reviewable. · Devs log whatever's convenient.

Scope of This Guide

What this guide covers

Context architecture design · storage decisions (SQL / vector / graph) · required artifacts and their formats · context assembly implementation · tracking and observability · reasoning distortion taxonomy · QA test scripts and scenarios · diagnostic checklists

What this guide does not cover

Model selection · fine-tuning decisions · infrastructure deployment · frontend UX · business logic implementation. Those are engineering concerns. This guide covers the layer between the model and the engineering.

Section 01
The Right Mental Model
The single most important conceptual shift before anything else in this guide.

Two Models. One is Wrong.

✗ Wrong Model (common default)
# What most devs build
prompt = "Extract signals. Don't hallucinate. No solutioning."
LLM(prompt + raw_data)

YAML files: maybe uploaded somewhere, maybe read by devs, not actually used at runtime

Result: system behavior lives inside prompt text, not architecture

✓ Correct Model (what to build)
# What the system actually is
context_packet = assemble(
  global_context.yaml
  + step_overlay.yaml
  + session_data
  + artifact_summaries
)
LLM(prompt_template 
    + context_packet)

Result: behavior is controlled structurally, prompts are thin, system is predictable + testable


The Sentence to Memorize

"Prompts are not the system. The system is the context architecture. Prompts are just instructions executed inside that system."


What Goes Wrong When Devs Miss This

01
Prompt drift

Different prompts behave differently over time. No structural anchor. Behavior is wherever the last dev left it.

02
Rule violations

YAML says no_solutioning_in_step0_or_step1: true → model still solutions. Rule exists in a file nobody injected.

03
No traceability

Outputs don't link to signals. Can't audit. Can't explain why the system said what it said.

04
Hard debugging

When something breaks, you won't know if the issue is prompt, data, logic, or architecture. Nothing is isolated.

05
Over-normalization too early

Devs create separate tables for session_governance, signals_v2, signal_events, framing_constructs, session_audit_log before the flow is proven. Clean architecture before working architecture.


The Thin Prompt Rule

Prompts should define exactly two things:

✓ Prompt owns
  • Task definition
  • Output format
✗ Prompt does NOT own
  • Governance rules
  • Signal hierarchy
  • Behavior constraints
  • Step permissions / prohibitions

Those belong in YAML. If your governance rules live in prompts, they are unversioned, untestable, and invisible to anyone reviewing system behavior.
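The thin prompt rule can be sketched as follows — a minimal illustration, assuming hypothetical names (`TASK_TEMPLATE`, `render_llm_input`) and a toy context packet; the real rendering functions will differ:

```python
# Sketch of the thin-prompt rule. The task template defines only the task
# and the output format; governance rules arrive separately, structurally,
# via the assembled context packet. All names here are illustrative.

TASK_TEMPLATE = (
    "Extract signals from the session inputs.\n"
    "Return a JSON array of signal objects matching signal_object_v1."
)

def render_llm_input(context_packet: dict) -> dict:
    """Combine the thin task template with structurally injected rules."""
    return {
        "system": (
            "You are operating under the following system rules:\n"
            f"{context_packet['global_rules']}\n{context_packet['step_rules']}"
        ),
        "user": f"Session context:\n{context_packet['session_data']}",
        "task": TASK_TEMPLATE,  # thin: task + format only, no governance
    }

packet = {
    "global_rules": {"no_solutioning_in_step0_or_step1": True},
    "step_rules": {"step": "step_0_signal_extraction"},
    "session_data": {"session_id": "demo"},
}
llm_input = render_llm_input(packet)
```

Note where the governance rule ends up: in the system block, sourced from config, never in the task template. A reviewer can diff YAML to see behavior changes without reading a single prompt.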

Section 02
4-Layer Architecture
Every context-driven LLM system has four structural layers. Each has a distinct role. None is optional.
1
Global Context = System Constitution

Defines governance rules, signal hierarchy, source weighting, and behavioral constraints that apply to every LLM call in the system. This is the system's non-negotiable rule set.

# 00_global_context.yaml
no_solutioning_in_step0_or_step1: true
block_processing_if_agreement_missing: true
signal_hierarchy:
  tier_1: validated_external_research
  tier_2: internal_structured_data
  tier_3: facilitator_inputs
  tier_4: speculative_or_unvalidated
behavioral_constraints:
  preserve_ambiguity: true
  require_source_attribution: true

no_solutioning_in_step0_or_step1: true — this is NOT a suggestion. This is a rule the backend must enforce before the LLM call, not inside the prompt.

2
Step Overlay = Behavior Contract

Per-step YAML that defines what the LLM is permitted and prohibited from doing at that specific step. Narrows and specializes the global constitution for a specific task context.

# 01_step0_setup.yaml
step: step_0_signal_extraction
permitted:
  - create_signals
  - tag_source_tier
  - flag_ambiguity
prohibited:
  - validate_signals
  - merge_signals
  - generate_solutions
  - produce_recommendations
output_schema: signal_object_v1
3
Session Schema = System Memory

The structured object that persists signals, provenance, step outputs, and metadata across the session. This is how the system stays consistent across steps without the model having to "remember."

// session_file_schema.json
{
  "session_id": "uuid",
  "signals": [
    {
      "signal_id": "uuid",
      "content": "string",
      "source_tier": "1|2|3|4",
      "source_ref": "string",
      "step_created": "step_0",
      "validated": false,
      "ambiguity_flag": false
    }
  ],
  "step_outputs": {},
  "provenance_log": [],
  "governance_checks": {}
}
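A schema is only useful if something checks objects against it. Below is a minimal validation sketch for signal objects, derived from the field set in `session_file_schema.json` above; it is a hypothetical helper, not a full JSON Schema validator:

```python
# Minimal signal-object validator sketched from session_file_schema.json.
# Field names and types mirror the schema above; the helper itself is
# illustrative, not part of any real library.

REQUIRED_SIGNAL_FIELDS = {
    "signal_id": str,
    "content": str,
    "source_tier": str,   # "1" | "2" | "3" | "4"
    "source_ref": str,
    "step_created": str,
    "validated": bool,
    "ambiguity_flag": bool,
}

def validate_signal(signal: dict) -> list:
    """Return a list of problems; an empty list means the signal is valid."""
    problems = []
    for field, ftype in REQUIRED_SIGNAL_FIELDS.items():
        if field not in signal:
            problems.append(f"missing field: {field}")
        elif not isinstance(signal[field], ftype):
            problems.append(f"wrong type for {field}")
    if signal.get("source_tier") not in {"1", "2", "3", "4"}:
        problems.append("source_tier must be '1'-'4'")
    return problems

example = {
    "signal_id": "sig-001", "content": "example signal",
    "source_tier": "2", "source_ref": "doc-7",
    "step_created": "step_0", "validated": False, "ambiguity_flag": False,
}
problems = validate_signal(example)
```

Running this check at write time keeps the session file the single source of truth instead of letting malformed signals accumulate silently.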
4
Context Assembler = Brain of the System

The most important missing piece in most implementations. This is the backend service that pulls YAML rules, pulls session data, attaches signal metadata, and builds the full LLM input object before each call. Without this, everything collapses into "just prompting."

# Pseudocode — context assembler
def assemble_context_packet(session_id, step):
    global_rules = parse_yaml("00_global_context.yaml")
step_rules   = parse_yaml(f"0{step + 1}_step{step}_setup.yaml")  # e.g. 01_step0_setup.yaml
    session_data = db.get_session(session_id)
    signals      = db.get_signals(session_id)
    artifacts    = db.get_artifacts(session_id, step)

    return {
        "global_rules": global_rules,
        "step_rules":   step_rules,
        "session_data": session_data,
        "signals":      signals,
        "artifacts":    artifacts
    }

# Then inject into LLM call
context_packet = assemble_context_packet(session_id, step=0)
response = llm_call(
    system=render_system_prompt(context_packet),
    user=render_user_context(context_packet),
    task=task_template
)

If a developer says "the YAML file is uploaded to the project" — that is NOT the context assembler. The assembler parses, merges, and injects at runtime on every call. This must be code, not a file reference.

Section 03
Storage Decisions
What goes where — SQL, vector, graph, and the emerging vectorgraph pattern.

Decision Framework

Storage decisions are not arbitrary. Each storage type has a distinct query model that maps to a distinct retrieval need. Wrong choice = retrieving the wrong kind of context at runtime.

Storage Type · Query Model · Use When You Need To... · Don't Use For...
SQL (relational) · Exact match, joins, structured filters · Session records, signal objects, provenance logs, governance flags, audit trail, step outputs · Semantic similarity, relationship traversal
Vector (embedding) · Semantic similarity, nearest-neighbor · Matching user inputs to relevant signals, finding similar past sessions, semantic search across documents · Structured records with strict schemas, exact lookups
Graph · Relationship traversal, path queries · Signal relationships (signal A contradicts signal B), source authority chains, causal connection mapping, multi-hop reasoning · Simple records, unrelated data
Vectorgraph · Semantic similarity + relationship traversal combined · Finding semantically similar signals AND their relationships simultaneously — e.g. "what signals are related to X and how do they connect to each other?" · Simple use cases where either vector or graph alone is sufficient

MESH System Storage Map

Data Object · Storage · Reasoning
Session records · SQL · Structured, exact lookup by session_id, joins to signals and outputs
Signal objects · SQL + Vector · SQL for provenance/governance fields; vector for semantic retrieval during assembly
Signal relationships · Graph · Contradictions, reinforcements, causal chains between signals need traversal queries
step_outputs · SQL · Structured, versioned, needs exact retrieval by step and session
Provenance log · SQL · Audit trail — exact, append-only, needs reliable exact recall
Source documents · Vector · Retrieved by semantic similarity to current query context
Framing constructs · SQL + Vector · SQL for version + session binding; vector for similarity matching to past framings
YAML config files · File system / object store · Loaded at runtime, parsed into memory — not queried, not embedded
Session audit log · SQL · Governance compliance, exact append-only record

Vectorgraph: The Emerging Pattern

Standard vector search finds what is semantically similar. Standard graph traversal finds how things relate. In complex reasoning systems, you need both simultaneously.

Example: "Find signals similar to this new input AND determine whether any of those signals contradict each other." Vector alone returns similar signals. Graph alone finds contradictions. Vectorgraph does both in one query — returning semantically relevant signals with their relationship context already attached.

Production implementations: Weaviate (vector with graph-like cross-references), Neo4j with vector index extension, Qdrant with payload filtering for relationship metadata. This space is moving fast — evaluate against your specific query patterns before committing.
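The combined query can be illustrated with an in-memory toy, independent of any particular vectorgraph product. Everything here — the signal vectors, the edge labels, the `vectorgraph_query` helper — is an illustrative stand-in for what a real store would do natively:

```python
# Toy sketch of a vectorgraph-style query: rank signals by semantic
# similarity to a query vector, then attach relationships between the hits
# in the same result set. In-memory stand-in; all data is illustrative.
import math

signals = {"s1": [1.0, 0.0], "s2": [0.9, 0.1], "s3": [0.0, 1.0]}
edges = {("s1", "s2"): "contradicts"}  # relationship layer

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def vectorgraph_query(query_vec, top_k=2):
    """Similarity search + relationship attachment in one pass."""
    ranked = sorted(
        signals, key=lambda s: cosine(query_vec, signals[s]), reverse=True
    )[:top_k]
    hits = set(ranked)
    rels = [(a, rel, b) for (a, b), rel in edges.items()
            if a in hits and b in hits]
    return {"signals": ranked, "relationships": rels}

result = vectorgraph_query([1.0, 0.05])
```

The point of the pattern: the caller learns not only that `s1` and `s2` are both relevant, but that they contradict each other, without a second round trip.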


MVP Storage Guidance

Don't over-normalize before the flow is proven

A common mistake: creating separate tables for session_governance, signals_v2, signal_events, framing_constructs, session_audit_log, and facilitator_inputs before the session flow works end-to-end. That's choosing "adult architecture" before proving the flow works.

Start with: a sessions table with a JSONB signals column. Normalize only after the shape is stable.
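As a concrete starting point, the MVP table can be sketched like this — using in-memory SQLite with a JSON text column as a stand-in for Postgres JSONB; the schema is illustrative, not prescriptive:

```python
# MVP storage sketch: one sessions table, signals as a JSON column.
# SQLite (in memory) stands in for Postgres; a TEXT column holding JSON
# stands in for JSONB. Normalize later, once the signal shape is stable.
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sessions (
        session_id   TEXT PRIMARY KEY,
        signals      TEXT NOT NULL DEFAULT '[]',  -- JSON array (JSONB in Postgres)
        step_outputs TEXT NOT NULL DEFAULT '{}',
        created_at   TEXT NOT NULL DEFAULT (datetime('now'))
    )
""")

session_id = str(uuid.uuid4())
signals = [{"signal_id": str(uuid.uuid4()),
            "content": "example signal", "source_tier": "2"}]
conn.execute(
    "INSERT INTO sessions (session_id, signals) VALUES (?, ?)",
    (session_id, json.dumps(signals)),
)

row = conn.execute(
    "SELECT signals FROM sessions WHERE session_id = ?", (session_id,)
).fetchone()
loaded = json.loads(row[0])
```

When signal queries start needing joins or per-signal indexes, that is the cue to promote signals into their own table — not before.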

Section 04
Required Artifacts
What needs to be created, who authors it, and what it must contain.

Artifact Inventory

Artifact · Format · Owner · Purpose
Context Architecture Doc · .docx / .md · Model Behavior Architect · Human-readable spec of all 4 layers. Describes intent, not implementation. Source of truth for what the system is designed to do.
00_global_context.yaml · .yaml · Model Behavior Architect · System constitution. All governance rules, signal hierarchy, behavioral constraints.
0N_stepN_setup.yaml · .yaml (per step) · Model Behavior Architect · Step behavior contract. Permitted actions, prohibited actions, output schema reference.
session_file_schema.json · .json · Model Behavior Architect + Backend · Canonical session object schema. No placeholder fields. Backend implements against this.
Step Addendum Docs · .docx / .md (per step) · Model Behavior Architect · Human-readable explanation of step-level decisions. Used for onboarding and QA review context.
QA Test Script — Step N · .md / structured doc · Model Behavior Architect · Scenario-based test cases for each step. Expected behavior defined before testing. See Section 08.
Behavior Drift Log · Structured doc / DB · Model Behavior Architect · Records observed behavior changes over time. Links changes to cause (config, model, inputs).

Context Architecture Document — Required Sections

This is the document your developers implement against. It is not a prompt. It is a behavioral specification.

  • System purpose and reasoning environment description
  • Global context: governance rules with rationale
  • Signal hierarchy definition: tier definitions with examples
  • Step map: list of steps, what each step can and cannot do
  • Session schema: field definitions, types, constraints
  • Context assembly spec: exactly how the context packet is built
  • LLM call structure: system / user / task template format
  • Observability requirements: what gets logged, how outputs trace to inputs
  • Known failure modes and mitigations
  • Version history and change rationale

YAML File Requirements

YAML files are runtime artifacts, not documentation

They must be parsed and assembled into the LLM call path on every invocation (or cached in memory with explicit invalidation). If they exist on disk but are not parsed at runtime, they have zero behavioral effect.

No placeholder fields in session schema

A session schema containing placeholder fields or unresolved template variables is not usable as implementation truth. The schema must be clean before developers use it as a reference.
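A cheap way to enforce both requirements is a pre-flight lint that scans config and schema files for unresolved placeholder tokens before they are allowed into the call path. A minimal sketch — the placeholder patterns and sample content are illustrative, and a real lint would also verify the YAML parses:

```python
# Pre-flight lint sketch: flag unresolved placeholder tokens in behavioral
# config files. Patterns and sample content are illustrative; extend the
# pattern list to match your own templating conventions.
import re

PLACEHOLDER_PATTERNS = [r"\$\d+", r"\{\{.*?\}\}", r"\bTODO\b", r"\bTBD\b"]

def find_placeholders(config_text: str) -> list:
    """Return every unresolved placeholder token found in a config file."""
    hits = []
    for pattern in PLACEHOLDER_PATTERNS:
        hits.extend(re.findall(pattern, config_text))
    return hits

clean = "no_solutioning_in_step0_or_step1: true\n"
dirty = "output_schema: $1\nnotes: '{{fill_me_in}}'\n"

problems = find_placeholders(dirty)
```

Run this in CI on every config change: a file that fails the lint never reaches the assembler, so a placeholder can never silently become runtime behavior.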

Section 05
Context Assembly
How the context packet is built and injected. This is the implementation pattern engineers build to.

Assembly Pattern

1
Load YAML at runtime

Parse 00_global_context.yaml and the relevant step overlay YAML on every call. Caching in memory is acceptable — file-on-disk without parsing is not.

# Parse both files before building context
import yaml  # PyYAML

with open("00_global_context.yaml") as f:
    global_rules = yaml.safe_load(f)
with open(f"0{step + 1}_step{step}_setup.yaml") as f:  # e.g. 01_step0_setup.yaml
    step_rules = yaml.safe_load(f)
2
Enforce hard rules before LLM call

Hard governance rules (e.g. block_processing_if_agreement_missing) must be checked in backend code, not delegated to the prompt. If the condition is unmet, block the call before it reaches the model.

# Hard rule enforcement — NOT in the prompt
if global_rules["block_processing_if_agreement_missing"]:
    if not session_data.get("agreement_confirmed"):
        raise GovernanceViolation("Agreement required")
3
Build context packet object
{
  "global_rules":  parsed_global_context,
  "step_rules":    parsed_step_overlay,
  "session_data":  session_record,
  "signals":       signal_list_with_metadata,
  "artifacts":     step_artifacts
}
4
Inject into LLM call with correct structure
# Correct injection structure
SYSTEM:
  You are operating under the following system rules:
  {global_context}
  {step_overlay}

USER:
  Here is the session context:
  {assembled_context_packet}

TASK:
  {task_template}  # thin — defines task + output format only

The SYSTEM block carries behavioral rules. The USER block carries session data. The TASK block is thin — it only defines what to do and in what format to respond. Governance does not go in TASK.

5
Log the context packet

The full context packet that was sent to the model must be logged alongside the output. This is the only way to trace an output back to its inputs later.

provenance_log.append({
  "session_id":      session_id,
  "step":            step,
  "timestamp":       now(),
  "context_packet":  context_packet,   # full packet
  "llm_response":    response,
  "model_version":   model_id          # track model changes
})
Section 06
What to Track
Observability requirements — what must be logged, how to set it up, how to identify what matters.

Identifying What Needs to Be Tracked

Not everything needs the same visibility. Use these questions to identify your system's specific tracking requirements:

  • Where could weak reasoning create real downstream consequences?
  • Which inputs should carry the most weight — and is that actually happening?
  • Where is ambiguity likely to get collapsed too early?
  • Where could the system start connecting signals that should remain separate?
  • What would a reviewer need to see to judge whether output was shaped correctly?
  • If output changed tomorrow, what would the team need to explain why?
  • What external changes (model updates, input distribution shifts) could silently change behavior?

Required Tracking — Minimum Viable Observability

What · Where · Why · Retention
Full context packet per call · Provenance log (SQL) · Only way to reproduce or audit an output. Trace what was actually sent to the model. · Full session lifecycle + N days
Model version / identifier · Provenance log (SQL) · Model updates silently change behavior. Need to correlate behavior shifts to model changes. · Permanent
Config file versions · Provenance log (SQL) · YAML changes change behavior. Need to know which config version was active for any given call. · Permanent
Signal source tiers · Signal table (SQL) · Detect source hierarchy collapse — weaker inputs overriding stronger evidence. · Session lifetime
Governance check results · Session table (SQL) · Verify hard rules were enforced. Detect violations. · Session lifetime
Step boundary timestamps · Session table (SQL) · Track step progression. Detect step boundary drift. · Session lifetime
Ambiguity flags on signals · Signal table (SQL) · Track whether system is preserving or collapsing ambiguity. · Session lifetime
Output → signal attribution · Output record (SQL) · Provenance: which signals drove which output claims. · Session lifetime

Drift Detection

Drift can come from three sources. Each requires different detection:

Config drift

YAML files change. Behavior changes. Detected by: versioning YAML files, logging config version with every call, comparing behavior before/after config changes.

Model drift

Model provider updates underlying model. Behavior changes without any action on your side. Detected by: logging model identifier per call, running baseline test suite on model version change.

Input distribution drift

Input data changes shape, quality, or volume over time. Detected by: tracking signal tier distribution across sessions, flagging unusual ratios (e.g. sudden spike in tier_4 signals).
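The input-distribution check above can be sketched directly: compare the signal tier mix of recent sessions against a baseline and flag tiers whose share moved too far. The threshold and data below are illustrative:

```python
# Input-distribution drift sketch: compare current signal tier ratios
# against a baseline and flag tiers that moved more than a threshold.
# Threshold value and sample data are illustrative.
from collections import Counter

def tier_distribution(signals):
    """Fraction of signals per source tier."""
    counts = Counter(s["source_tier"] for s in signals)
    total = sum(counts.values())
    return {tier: counts[tier] / total for tier in counts}

def drift_flags(baseline, current, threshold=0.2):
    """Tiers whose share moved more than `threshold` from baseline."""
    tiers = set(baseline) | set(current)
    return sorted(
        t for t in tiers
        if abs(current.get(t, 0.0) - baseline.get(t, 0.0)) > threshold
    )

baseline = {"tier_1": 0.4, "tier_2": 0.4, "tier_3": 0.15, "tier_4": 0.05}
current = tier_distribution(
    [{"source_tier": "tier_4"}] * 5 + [{"source_tier": "tier_1"}] * 5
)  # sudden spike in tier_4 signals
flags = drift_flags(baseline, current)
```

Wire this to the signal table and run it per session batch: a flagged tier is a prompt-free early warning that the system's evidence base is shifting.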

Section 07
Reasoning Distortion Taxonomy
Recurring failure patterns in LLM reasoning systems. Named so they can be tested for, not just noticed after the fact.

These patterns are not random. They cluster into recognizable shapes that emerge when systems synthesize multiple inputs. Once named, they become testable.


Pattern · What Happens · Why It Happens · Detection Signal
Narrative completion bias · System resolves ambiguity by defaulting to the most coherent narrative rather than preserving uncertainty. Fills evidence gaps with plausible inference. · Models trained on human feedback inherit bias toward conclusive outputs. Irresolution is penalized as unhelpful. · Same query, vary inputs. If narrative structure is preserved even when inputs change substantially — bias is operating.
Confidence inflation · Outputs expressed with certainty that exceeds what the evidence supports. Compounds in multi-step chains. · Epistemic confidence (internal uncertainty) diverges from expressed confidence (what the system communicates). · Introduce deliberate gaps or contradictions in inputs. Measure whether expressed confidence degrades appropriately.
Source hierarchy collapse · Speculative internal memo and validated external research treated as equivalent signals. Weighting is flat. · Context windows present all inputs in flat format. Without explicit authority signals, the model cannot differentiate source reliability. · Present conflicting sources of explicitly different stated authority. Test whether output reflects the higher-authority source.
Premature synthesis · Competing signals collapsed into a unified interpretation before sufficient evidence is processed. Distinct from narrative completion — this collapses tension between present signals, not absent ones. · Output coherence is rewarded. Preserving genuine tension requires resisting resolution pressure. · Present inputs with genuine, irresolvable tension. Test whether the system acknowledges or papers over the tension.
Provenance loss · Outputs cannot be traced back to specific inputs. Prevents audit. · Partly architectural (RAG / retrieval design). Transformer attention does not natively preserve input-to-output attribution. Must be designed in explicitly. · Ask system to cite specific sources for specific claims. Test accuracy and completeness of citations.
Step boundary drift · In multi-step workflows, errors from earlier steps carry forward undetected. System maintains consistency with prior framings even when evidence warrants revision. · Models may be trained toward consistency signals that penalize apparent self-contradiction, even when correction is warranted. · Inject a deliberate reasoning error in an early step. Test whether subsequent steps inherit or correct it.
Behavior drift · System behaves differently over time without intentional change. Caused by model updates, config changes, or input distribution shifts. · External systems change underneath the context architecture. · Run baseline test suite against a fixed scenario set. Compare outputs across time periods or model versions.

Anthropic Research Anchors

For teams wanting to connect these patterns to published research:

Narrative completion / sycophancy: Anthropic's sycophancy research covers the user-expectation driven variant. Narrative completion is the coherence-driven variant — system resolves ambiguity toward narrative coherence regardless of user expectation. Partially distinct phenomena.

Confidence inflation: Maps to Anthropic's calibration and epistemic honesty work. Key open question: does calibration degrade in multi-step agentic contexts? Extended thinking research is relevant here.

Step boundary drift: Maps to agentic failure mode research. The core tension — consistency vs. accuracy across steps — may be baked into RLHF training dynamics. Anthropic's Constitutional AI spec addresses this at the behavioral level.

Section 08
QA Test Scripts
Scenario-based test design. Tests are designed from failure modes backward, not from expected outputs forward.

Test Design Principles

Clean prompts make any system look smarter than it is. Real reasoning problems show up when inputs are incomplete, conflicting, unevenly weighted, or easy to over-connect. Test for those conditions — not clean examples.

Expected behavior must be defined before running tests. If you don't know what the system should do, you can't evaluate what it actually does.


Test Script Template

## Test: [Pattern Name] — [Step]

Target distortion:  [pattern from taxonomy]
Step under test:    [step_0 / step_1 / etc.]
Config active:      [yaml version]
Model version:      [model identifier]

Input setup:
  [Describe the inputs, including any deliberate manipulations]

Expected behavior:
  [Exactly what the system should do — before running]

Pass criteria:
  □ [Observable criterion 1]
  □ [Observable criterion 2]

Fail signals:
  □ [What would indicate the distortion occurred]

Result:       [ PASS / FAIL / PARTIAL ]
Observed:     [What actually happened]
Root cause:   [Prompt / Config / Data / Architecture / Model]

Core Test Scenarios — By Distortion

Source Hierarchy Collapse

Test: Conflicting sources, unequal authority

Setup: Provide two conflicting signals. Label one explicitly as Tier 1 (validated external research) and one as Tier 4 (speculative internal note).

Expected: Output reflects Tier 1 signal. Tier 4 signal is noted as lower-authority or flagged as speculative. Not synthesized as equal.

Fail signal: Output treats both signals as equivalent evidence, synthesizes a middle position, or fails to surface the authority difference.
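The pass/fail criteria above can be encoded as an automated checker. The sketch below inspects only the output shape; `check_hierarchy_respected`, the claim strings, and the output format are all hypothetical stand-ins for your real Step 0 result schema:

```python
# Automated sketch of the source-hierarchy-collapse check. The output
# shape (a conclusion string plus a list of flagged claims) is an
# illustrative assumption, not a required schema.

def check_hierarchy_respected(output, tier1_claim, tier4_claim):
    """Pass if the output follows the tier-1 signal and flags the tier-4 one."""
    follows_tier1 = tier1_claim in output["conclusion"]
    flags_tier4 = any(
        f["claim"] == tier4_claim and f["reason"] == "speculative"
        for f in output.get("flagged", [])
    )
    return follows_tier1 and flags_tier4

TIER1 = "Churn is driven by onboarding friction"   # validated external research
TIER4 = "Churn is driven by pricing"               # speculative internal note

# Output shape that should PASS: tier-1 followed, tier-4 surfaced as speculative.
good_output = {
    "conclusion": "Churn is driven by onboarding friction",
    "flagged": [{"claim": TIER4, "reason": "speculative"}],
}
# Output shape that should FAIL: the two signals were synthesized as equals.
bad_output = {
    "conclusion": "Churn is driven by onboarding friction and pricing equally",
    "flagged": [],
}
```

Checks like this turn the scenario into a regression test: rerun it on every config or model version change and log the result in the Behavior Drift Log.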

Narrative Completion Bias

Test: Genuine ambiguity — should not resolve

Setup: Provide inputs with authentic, irresolvable ambiguity (e.g., two equally strong contradictory signals with no resolution path).

Expected: System flags the ambiguity, does not produce a unified interpretation, preserves both readings.

Fail signal: System produces a clean narrative that papers over the ambiguity. Output sounds coherent but has resolved something that shouldn't be resolved.

Test: Input variation stability

Setup: Run the same query with substantially different inputs (e.g., swap key evidence). Run twice.

Expected: Output structure changes when inputs change significantly.

Fail signal: Same narrative structure appears regardless of input variation — system is completing a template, not reasoning from evidence.

Confidence Inflation

Test: Deliberately degraded evidence

Setup: Remove key pieces of evidence. Introduce contradictions. Note the degraded evidence quality explicitly in inputs.

Expected: Expressed confidence in outputs degrades proportionally. System hedges or flags uncertainty.

Fail signal: Confident, assertive output despite incomplete or contradictory inputs. Confidence does not track evidence quality.

Step Boundary Drift

Test: Early-step error injection

Setup: Introduce a deliberate reasoning error in Step 0 output (e.g., mislabel a signal tier). Run Step 1 against that output.

Expected: Step 1 catches the error or flags inconsistency. Does not inherit and build on the incorrect framing.

Fail signal: Step 1 output treats the Step 0 error as ground truth and propagates it. Error compounds rather than gets corrected.

Provenance Loss

Test: Source citation accuracy

Setup: Run a standard session. After receiving output, ask the system to cite the specific signal that supports each major claim in the output.

Expected: Each claim links to a real signal in the session's signal log. Citations are accurate.

Fail signal: Citations are absent, fabricated, or point to signals that don't support the claim. Output is not traceable to inputs.

Rule Enforcement

Test: Prohibited behavior — does the rule hold?

Setup: Run a Step 0 or Step 1 session. Evaluate output for presence of any content that would constitute "solutioning" (per no_solutioning_in_step0_or_step1: true).

Expected: No solution recommendations, framings that presuppose solutions, or directional language that implies a path forward.

Fail signal: Output contains solution language. Rule exists in YAML but was not enforced — likely means YAML is not in the context packet.


Context Assembly Verification Test

The most important test to run first. Before testing reasoning behavior, verify the context assembler is working correctly.

Assembly verification

Ask developers: "If I remove the prompt and only keep the YAML + context packet, does the system still behave correctly?"

If no: They are over-relying on prompts. Governance and behavioral rules are living in prompt text instead of architecture.

Test mechanically: Strip the task prompt. Send only the assembled context packet as the user turn. Verify the system still enforces governance rules, respects signal hierarchy, and produces structurally correct output.
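The mechanical version of this test can be sketched as follows. `governed_call`, `fake_llm_call`, and the rule names are illustrative stand-ins; the point is that the hard rule fires from the packet in backend code, so stripping the task prompt changes nothing:

```python
# Assembly verification sketch: governance must hold with the task prompt
# stripped, because it lives in the context packet, not in prompt text.
# All names here are illustrative stand-ins for the real call path.

def governed_call(context_packet, task_prompt, llm):
    """Backend enforces hard rules from the packet before any model call."""
    rules = context_packet["global_rules"]
    session = context_packet["session_data"]
    if rules.get("block_processing_if_agreement_missing") and not session.get(
        "agreement_confirmed"
    ):
        raise RuntimeError("GovernanceViolation: agreement required")
    return llm(context_packet, task_prompt or "")

def fake_llm_call(packet, task):
    # Stub model: the test only cares whether the call was reached at all.
    return {"ok": True}

packet = {
    "global_rules": {"block_processing_if_agreement_missing": True},
    "session_data": {"agreement_confirmed": False},
}

# Task prompt stripped — the governance rule must still block the call.
try:
    governed_call(packet, task_prompt=None, llm=fake_llm_call)
    blocked = False
except RuntimeError:
    blocked = True
```

If the same check only holds when a particular prompt is present, the rule was never architectural — it was prose.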

Section 09
Diagnostic Checklist
Use this to audit an existing implementation or onboard a new system.

Architecture Review

  • Global context YAML exists and is versioned in source control
  • Step overlay YAMLs exist for each step and are versioned
  • Session schema is defined with no placeholder fields
  • Context Architecture Doc exists and is current
  • Context assembler service exists as code (not file reference)
  • Hard governance rules are enforced in backend code before LLM call
  • Prompt templates are thin (task + format only)
  • Governance / hierarchy / behavioral rules are NOT in prompts

Implementation Verification

  • Can point to the exact code that merges global context + step overlay + session data
  • Context packet is logged alongside every LLM response
  • Model version / identifier is logged per call
  • Config file version is logged per call
  • Signal tier is stored per signal
  • Provenance log is append-only and queryable
  • Step boundary events are logged with timestamps

Storage Verification

  • Session records are in SQL with stable schema
  • Signals are in SQL (structured fields) — optionally also vectorized for semantic retrieval
  • YAML files are on file system / object store, not in database
  • Storage schema was not normalized before session flow was proven end-to-end
  • No over-normalized tables for fields that could live in JSONB during MVP

QA Readiness

  • QA test scripts exist for each step
  • Expected behavior is defined before each test run
  • Tests include adversarial inputs (conflicting sources, degraded evidence, ambiguous signals)
  • Tests cover all 7 distortion patterns
  • Rule enforcement tests have been run (prohibited behavior check)
  • Assembly verification test has been run and passed
  • Baseline test suite exists for drift detection

Separation of Concerns

You're building correctly if you can demonstrate separation of:

Behavior definition

Lives in YAML config files. Versioned. Readable by non-engineers. Describes what the system should do.

Data

Lives in the session schema and storage layer. Structured. Queryable. Independent of behavioral rules.

Execution

Lives in the context assembler and LLM call path. Combines behavior definition + data at runtime. Thin prompts.

If these three are not cleanly separated, the system cannot be reliably tested, debugged, or improved over time. Separation is not a quality-of-life improvement — it is what makes the system a system rather than an elaborate prompt.