Every query passes through a 5-tier routing system, fastest to slowest. Each tier acts as a short-circuit: if it can answer, it returns immediately and skips all slower tiers. Every result is cached so subsequent queries drop to Tier 0 or 1.
| Tier | Name | Latency | LLM Calls |
|------|------|---------|-----------|
| 0 | Exact cache | ~0ms | 0 |
| 1 | Fuzzy cache | ~50ms | 0 |
| 2 | Direct search | ~100–200ms | 0 |
| 3 | Single LLM | <5s | 1 |
| 4 | Agentic loop | 8–15s | Multiple |

Tier 0: Exact Cache Hit (0ms)

The fastest path. A normalized version of your query is looked up in an in-memory map.

Algorithm

  1. Normalize query: lowercase, trim, collapse whitespace
  2. Build cache key: normalized_query
  3. Map.get(key) — O(1) lookup
  4. Validate TTL: (now - storedAt) < 60s
  5. Validate fingerprint: MD5 hash of sorted path:mtime pairs must match current context tree state
  6. HIT — return cached response immediately
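
The steps above can be sketched in TypeScript; the entry shape and names are assumptions for illustration, not the actual implementation:

```typescript
// Minimal sketch of the Tier 0 lookup (entry shape and names are
// assumptions, not the actual implementation).
interface CacheEntry {
  response: string;
  storedAt: number;    // epoch ms at store time
  fingerprint: string; // context-tree fingerprint at store time
}

const TTL_MS = 60_000; // 60s TTL

function normalize(query: string): string {
  return query.toLowerCase().trim().replace(/\s+/g, " ");
}

function exactLookup(
  cache: Map<string, CacheEntry>,
  query: string,
  currentFingerprint: string,
  now = Date.now(),
): string | null {
  const entry = cache.get(normalize(query)); // O(1)
  if (!entry) return null;
  if (now - entry.storedAt >= TTL_MS) return null;           // TTL expired
  if (entry.fingerprint !== currentFingerprint) return null; // tree changed
  return entry.response; // HIT
}
```

Note that normalization makes "  How   does AUTH work " and "how does auth work" hit the same key.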

Context Tree Fingerprint

The cache uses a fingerprint of the context tree to detect changes:
  • Glob all .md files in .brv/context-tree/
  • Sort by path, concat path:mtime joined by |, MD5 hash, take first 16 hex chars
  • The fingerprint itself is cached for 30s to avoid repeated glob I/O
  • If any file is added, removed, or modified, the fingerprint changes and all cache entries are invalidated
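
The hashing step can be sketched as follows; the function takes pre-collected path/mtime pairs so the logic is shown without the glob I/O:

```typescript
import { createHash } from "node:crypto";

// Sketch of the fingerprint computation described above. The caller is
// assumed to have globbed .brv/context-tree/ and collected (path, mtime).
function fingerprint(files: Array<{ path: string; mtimeMs: number }>): string {
  const joined = files
    .slice()
    .sort((a, b) => a.path.localeCompare(b.path)) // sort by path
    .map((f) => `${f.path}:${f.mtimeMs}`)         // path:mtime pairs
    .join("|");                                   // joined by |
  return createHash("md5").update(joined).digest("hex").slice(0, 16);
}
```

Because input order is normalized by the sort, only an actual add, remove, or mtime change alters the fingerprint.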

Parameters

| Parameter | Value |
|-----------|-------|
| Max cache size | 50 entries |
| TTL | 60 seconds |
| Fingerprint cache TTL | 30 seconds |
| Eviction policy | LRU (oldest insertion order) |
Cost: O(1) map lookup. Zero LLM, zero search, zero I/O.
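
A sketch of the eviction policy above, using a Map's insertion-order iteration (a common way to approximate LRU; illustrative, not the actual implementation):

```typescript
const MAX_ENTRIES = 50;

// Insert into the cache, evicting the oldest entry once the cap is hit.
// A Map iterates keys in insertion order, so the first key is the oldest.
function put<K, V>(cache: Map<K, V>, key: K, value: V): void {
  if (cache.has(key)) cache.delete(key); // re-insert to refresh position
  cache.set(key, value);
  if (cache.size > MAX_ENTRIES) {
    const oldest = cache.keys().next().value as K;
    cache.delete(oldest);
  }
}
```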

Tier 1: Fuzzy Cache Match (~50ms)

When the exact lookup misses, the cache scans all entries for lexical overlap, computing Jaccard similarity on the tokenized queries.

Algorithm

  1. Tokenize query: split on whitespace, remove stopwords, filter tokens shorter than 2 chars
  2. Skip fuzzy matching if query has fewer than 2 meaningful tokens
  3. For each cache entry:
    • Skip if fingerprint mismatch (cheap string compare)
    • Skip if TTL expired (cheap timestamp compare)
    • Compute Jaccard similarity: |A ∩ B| / |A ∪ B|
  4. Return the highest-similarity match if similarity >= 0.6

Jaccard Similarity

Optimization: always iterate the smaller set, check membership in the larger set.
intersection = count of shared tokens
union = |A| + |B| - intersection
similarity = intersection / union
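
The optimized comparison can be sketched as:

```typescript
// Jaccard similarity over token sets, iterating the smaller set and
// checking membership in the larger, as the optimization above describes.
function jaccard(a: Set<string>, b: Set<string>): number {
  const [small, large] = a.size <= b.size ? [a, b] : [b, a];
  let intersection = 0;
  for (const token of small) if (large.has(token)) intersection++;
  const union = a.size + b.size - intersection;
  return union === 0 ? 0 : intersection / union;
}
```

For example, {jwt, refresh, auth} vs {jwt, auth, module} shares 2 of 4 distinct tokens, giving 0.5 — below the 0.6 threshold, so it would not match.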

Parameters

| Parameter | Value |
|-----------|-------|
| Similarity threshold | 0.6 |
| Min tokens for fuzzy | 2 |
Cost: O(n) over cache entries x O(min(|A|, |B|)) per comparison. No LLM, no I/O.

Out-of-Domain Short-Circuit

Between Tier 1 and Tier 2, if all searches return zero results, the query is classified as out-of-domain:
  • Returns: “This topic is not covered in the knowledge base.”
  • The OOD response is cached to prevent repeated misses from hitting the LLM
Cost: Zero. Saves an entire LLM call for irrelevant queries.

Entity Search

When the initial local search returns fewer than 3 results, the executor extracts key entities and runs additional searches to improve recall.

Algorithm

  1. Split query into words, filter stopwords, keep words with length >= 3
  2. Take top 3 entities
  3. Run searches in parallel via Promise.allSettled()
  4. Deduplicate results by path
  5. Merge into original result set
For example, the query “How does JWT refresh work in the auth module?” extracts the entities ["jwt", "refresh", "auth"] and runs 3 parallel MiniSearch lookups, merging any new unique results into the original set.

Cost: Up to 3 additional MiniSearch lookups (in parallel). No LLM.
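
The extraction-and-merge steps can be sketched as follows; the MiniSearch lookup is injected as a function, and the stopword list is illustrative:

```typescript
// Illustrative stopword list — the real list is assumed to be larger.
const STOPWORDS = new Set(["the", "how", "does", "in", "a", "an", "of", "is", "work"]);

type Result = { path: string; score: number };

// Step 1–2: split, filter stopwords and short words, take top 3 entities.
function extractEntities(query: string, max = 3): string[] {
  return query
    .toLowerCase()
    .split(/\W+/)
    .filter((w) => w.length >= 3 && !STOPWORDS.has(w))
    .slice(0, max);
}

// Steps 3–5: parallel searches, dedupe by path, merge into original set.
async function supplementarySearch(
  query: string,
  initial: Result[],
  search: (term: string) => Promise<Result[]>,
): Promise<Result[]> {
  if (initial.length >= 3) return initial; // enough recall already
  const settled = await Promise.allSettled(extractEntities(query).map(search));
  const seen = new Set(initial.map((r) => r.path));
  const merged = [...initial];
  for (const s of settled) {
    if (s.status !== "fulfilled") continue;
    for (const r of s.value) {
      if (!seen.has(r.path)) { seen.add(r.path); merged.push(r); }
    }
  }
  return merged;
}
```

Promise.allSettled (rather than Promise.all) keeps one failed lookup from discarding the others.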

Tier 2: Direct Search Response (~100-200ms)

When search results are highly confident, the system skips the LLM entirely and returns a formatted markdown response assembled from raw document content.

Algorithm

  1. Filter search results with score >= 0.7, take top 5
  2. Read full document content from .brv/context-tree/ in parallel
  3. Run canRespondDirectly() decision:
    • Gate 1: topResult.score >= 0.85 (minimum threshold). If not, fail.
    • Gate 2a: topResult.score >= 0.93 — score is so strong that dominance check is skipped. Pass.
    • Gate 2b: gap = topScore - secondScore >= 0.08 — clear separation between top and runner-up. Pass.
  4. Format response: Summary + Details (max 5000 chars/doc, max 5 docs) + Sources + Gaps
  5. Cache result and return
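
The gate sequence in step 3 can be expressed as a predicate over the sorted result scores (constant names are illustrative):

```typescript
const MIN_SCORE = 0.85;       // Gate 1: minimum threshold
const HIGH_CONFIDENCE = 0.93; // Gate 2a: skip dominance check
const MIN_GAP = 0.08;         // Gate 2b: top must clearly beat runner-up

// scores: result scores sorted descending.
function canRespondDirectly(scores: number[]): boolean {
  if (scores.length === 0) return false;
  const top = scores[0];
  const second = scores.length > 1 ? scores[1] : 0;
  if (top < MIN_SCORE) return false;       // Gate 1 fails
  if (top >= HIGH_CONFIDENCE) return true; // Gate 2a passes
  return top - second >= MIN_GAP;          // Gate 2b: dominance gap
}
```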

Why Gap-Based, Not Ratio-Based?

BM25 normalized scores cluster in the [0.8, 0.95] range. A ratio check like “2x the second result” is mathematically impossible in that range. A fixed gap of 0.08 correctly identifies dominant matches.

Parameters

| Parameter | Value |
|-----------|-------|
| Min score threshold | 0.85 |
| High confidence (skip dominance) | 0.93 |
| Min gap for dominance | 0.08 |
| Max content per doc | 5000 chars |
| Max docs in response | 5 |
Cost: File reads only (no LLM). ~100-200ms for disk I/O.

Tier 3: Optimized Single LLM Call (<5s)

When search found good results (score >= 0.7) but not confident enough for Tier 2, the system makes a single constrained LLM call with pre-fetched context embedded in the prompt.

Algorithm

  1. Filter results with score >= 0.7, build a pre-fetched context string (excerpts formatted as markdown sections)
  2. Inject search data into sandbox as variables:
    • __query_results_{taskId} = search results array
    • __query_meta_{taskId} = {resultCount, topScore, hasPreFetched}
    • Build prompt with pre-fetched context embedded directly
  3. Execute with tight LLM overrides: maxTokens=1024, temperature=0.3
  4. LLM answers from the embedded context. If insufficient, it can use code_exec with silent: true to read additional documents from the sandbox variables
  5. Cache result and return
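
An illustrative sketch of step 2; the sandbox's set API is an assumption, and only the variable-naming scheme comes from the text above:

```typescript
// Inject search data into the sandbox under the per-task variable names
// described in step 2. The sandbox interface here is hypothetical.
function injectQueryContext(
  sandbox: { set(name: string, value: unknown): void },
  taskId: string,
  results: Array<{ path: string; score: number; excerpt: string }>,
): void {
  sandbox.set(`__query_results_${taskId}`, results);
  sandbox.set(`__query_meta_${taskId}`, {
    resultCount: results.length,
    topScore: results[0]?.score ?? 0,
    hasPreFetched: results.length > 0,
  });
}
```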

Parameters

| Parameter | Value |
|-----------|-------|
| Max tokens | 1024 |
| Temperature | 0.3 |
| Max iterations | 50 |
| Pre-fetch score threshold | 0.7 |
| Max pre-fetched docs | 5 |
Cost: One LLM call with tight token and temperature constraints.

Tier 4: Full Agentic Loop (8-15s)

When no pre-fetched context is available (search returned nothing above the 0.7 threshold), the system falls back to the full agentic loop.

Algorithm

  1. Same sandbox variable injection as Tier 3
  2. Build prompt WITHOUT pre-fetched context — LLM must discover answers via tool use
  3. Execute with relaxed LLM overrides: maxTokens=2048, temperature=0.5
  4. LLM reads search results via code_exec, may call tools.readFile() to load documents, may loop through multiple tool calls
  5. Protected by doom-loop detection (max iterations limit)
  6. Cache result and return

Parameters

| Parameter | Value |
|-----------|-------|
| Max tokens | 2048 |
| Temperature | 0.5 |
| Max iterations | 50 |
Cost: Multiple LLM calls with tool use. Full agentic loop with loop detection.

Knowledge Scoring

Search results that feed into Tiers 2–4 are ranked by a compound scoring algorithm:
compoundScore = (W_RELEVANCE x BM25 + W_IMPORTANCE x importance/100 + W_RECENCY x recency) x tier_boost
Currently, only BM25 relevance is active. Importance and recency weights are reserved for future use:
| Signal | Weight | Source |
|--------|--------|--------|
| BM25 relevance | 100% (1.0) | MiniSearch full-text search |
| Importance | 0% (disabled) | Access hits (+3/search) + curate updates (+5/update) |
| Recency | 0% (disabled) | Exponential decay: e^(-days/30) |
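
With the weights above, the compound score currently reduces to the BM25 term; a sketch of the full formula so the disabled signals are visible:

```typescript
// Weights from the table above: only BM25 relevance is active today.
const W_RELEVANCE = 1.0;
const W_IMPORTANCE = 0.0; // reserved
const W_RECENCY = 0.0;    // reserved

function compoundScore(
  bm25: number,
  importance: number, // 0–100
  daysSinceAccess: number,
  tierBoost = 1.0,    // x1.00 for all maturities today
): number {
  const recency = Math.exp(-daysSinceAccess / 30);
  return (
    (W_RELEVANCE * bm25 +
      W_IMPORTANCE * (importance / 100) +
      W_RECENCY * recency) * tierBoost
  );
}
```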

Tier Boost Multipliers

All tiers currently use the same boost (reserved for future differentiation):
| Maturity | Boost |
|----------|-------|
| core | x1.00 |
| validated | x1.00 |
| draft | x1.00 |

Maturity Lifecycle (Hysteresis)

draft --(importance >= 65)--> validated --(importance >= 85)--> core
draft <--(importance < 35)-- validated <--(importance < 60)-- core
The hysteresis gap (e.g., promote at 65, demote at 35) prevents rapid oscillation between tiers.
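
The transitions can be sketched as a state step function, one evaluation per call, using the thresholds from the diagram above:

```typescript
type Maturity = "draft" | "validated" | "core";

// One hysteresis step: promote/demote at the asymmetric thresholds.
function step(current: Maturity, importance: number): Maturity {
  switch (current) {
    case "draft":
      return importance >= 65 ? "validated" : "draft";
    case "validated":
      if (importance >= 85) return "core";
      if (importance < 35) return "draft";
      return "validated";
    case "core":
      return importance < 60 ? "validated" : "core";
  }
}
```

Because the demotion threshold (35) sits well below the promotion threshold (65), an entry hovering near 65 stays validated instead of flapping.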

Decay Functions

  • Importance decay: importance x 0.995^days (~78% remaining after 50 days of non-use)
  • Recency decay: e^(-days/30) (half-life of ~21 days)
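
Both curves as one-liners, confirming the figures quoted above:

```typescript
// Importance decays multiplicatively per day of non-use.
const importanceAfter = (importance: number, days: number) =>
  importance * Math.pow(0.995, days);

// Recency decays exponentially with a ~21-day half-life (30 * ln 2 ≈ 20.8).
const recency = (days: number) => Math.exp(-days / 30);
```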

Complete Query Flow

User Query
    |
    |--- Fire parallel searches (local)
    |
    v
[Tier 0] Exact Cache Lookup
    |--- HIT --> Return (0ms)
    |
   MISS
    v
[Tier 1] Fuzzy Cache (Jaccard >= 0.6)
    |--- HIT --> Return (~50ms)
    |
   MISS
    v
[OOD] All searches returned 0 results?
    |--- YES --> "Not covered" response (0ms, cached)
    |
    NO
    v
[Entity Search] Initial results < 3?
    |--- YES --> Run supplementary entity searches (parallel)
    |
    v
[Tier 2] Direct Response (score >= 0.85 + dominant)
    |--- PASS --> Return formatted markdown (100-200ms)
    |
   NOT DOMINANT
    v
[Tier 3] Single LLM + Pre-fetched Context (score >= 0.7 exists)
    |--- PASS --> Return LLM response (<5s)
    |
   NO CONTEXT
    v
[Tier 4] Full Agentic Loop
    |--- Return LLM response (8-15s)
All tier results are cached, so subsequent similar queries resolve at Tier 0 or 1.