# Priority-Based Context Packing
The `ContextWindow` is the most critical component for RAG implementations. It accepts content blocks with assigned priorities and packs them into a token-limited prompt, gracefully dropping lower-priority items when space runs out.
## Priority Levels
| Priority | Numeric | Guarantee | Typical Usage |
|---|---|---|---|
| CRITICAL | 0 | Always included (fails if it doesn't fit) | System prompt |
| HIGH | 1 | Included before MEDIUM/LOW | User query, essential context |
| MEDIUM | 2 | Included if space allows | RAG documents, history |
| LOW | 3 | First to be dropped | Nice-to-have context, examples |
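The table above maps naturally onto an integer enum, where a lower value means "packed earlier." A minimal Python sketch (names follow the table; the library's actual definition may differ):

```python
from enum import IntEnum

class Priority(IntEnum):
    CRITICAL = 0  # must fit; the assembler fails otherwise
    HIGH = 1      # included before MEDIUM/LOW
    MEDIUM = 2    # included if space allows
    LOW = 3       # first to be dropped

# Lower numeric value = higher priority, so sorting blocks by this
# enum reproduces the inclusion order in the table.
levels = sorted(Priority)
print([p.name for p in levels])
```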
## The Algorithm
1. Sort all blocks by priority (CRITICAL first).
2. Initialize `token_budget = max_tokens`.
3. For each block in priority order:
   a. Count the tokens in the block.
   b. If tokens ≤ remaining budget: include the block and subtract its tokens from the budget.
   c. If tokens > remaining budget and priority == CRITICAL: fail (critical content must fit).
   d. If tokens > remaining budget and priority != CRITICAL: skip the block (mark as excluded).
4. Concatenate the included blocks into the final prompt.
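The steps above can be sketched as a self-contained function. This is an illustrative reimplementation, not the library's source; the `Block` dataclass and the whitespace-based `count_tokens` stand-in are assumptions for the sketch (a real packer would use the target model's tokenizer, e.g. `tiktoken` for OpenAI models):

```python
from dataclasses import dataclass
from enum import IntEnum

class Priority(IntEnum):
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3

@dataclass
class Block:
    text: str
    priority: Priority
    id: str

def count_tokens(text: str) -> int:
    # Stand-in tokenizer: whitespace word count. Swap in the real
    # model tokenizer for accurate budgets.
    return len(text.split())

def pack(blocks: list[Block], max_tokens: int) -> tuple[str, list[str], list[str]]:
    budget = max_tokens
    included, excluded = [], []
    # sorted() is stable, so insertion order is preserved within a level.
    for block in sorted(blocks, key=lambda b: b.priority):
        tokens = count_tokens(block.text)
        if tokens <= budget:
            included.append(block)      # step 3b: include, charge the budget
            budget -= tokens
        elif block.priority == Priority.CRITICAL:
            # Step 3c: critical content must fit — fail loudly.
            raise ValueError(f"CRITICAL block {block.id!r} does not fit")
        else:
            excluded.append(block.id)   # step 3d: skip, mark as excluded
    prompt = "\n\n".join(b.text for b in included)
    return prompt, [b.id for b in included], excluded
```

Note that blocks are packed greedily in priority order: a large MEDIUM block can be skipped while a smaller LOW block later still fits.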
:::danger Critical Priority
If a CRITICAL block doesn't fit in the context window, the assembler will raise an error rather than silently drop it. This is intentional — your system prompt being truncated is never acceptable.
:::
## Usage
**Python**

```python
from llm_context_forge import ContextWindow, Priority

window = ContextWindow("gpt-4o")

# System prompt — CRITICAL (never dropped)
window.add_block(
    "You are an expert Python developer. Answer concisely.",
    Priority.CRITICAL,
    "system",
)

# User's question — HIGH priority
window.add_block(
    "How do I implement a binary search tree in Python?",
    Priority.HIGH,
    "query",
)

# RAG documents — MEDIUM (dropped if space is tight)
for i, doc in enumerate(retrieved_documents):
    window.add_block(doc.text, Priority.MEDIUM, f"rag_{i}")

# Nice-to-have examples — LOW (first to go)
window.add_block(
    "Here's an example implementation...",
    Priority.LOW,
    "example",
)

# Build the final prompt
final_prompt = window.assemble(max_tokens=8000)

stats = window.usage()
print(f"Included: {stats.included} blocks")
print(f"Excluded: {stats.excluded} blocks")
print(f"Tokens used: {stats.tokens_used}")
```
**TypeScript**

```typescript
import { ContextWindow } from "llm-context-forge";

const window = new ContextWindow("gpt-4o");

// System prompt — CRITICAL (priority 0)
window.addBlock(
  "You are an expert Python developer. Answer concisely.",
  0, // CRITICAL
  "system"
);

// User's question — HIGH (priority 1)
window.addBlock(
  "How do I implement a binary search tree in Python?",
  1, // HIGH
  "query"
);

// RAG documents — MEDIUM (priority 2)
for (let i = 0; i < retrievedDocs.length; i++) {
  window.addBlock(retrievedDocs[i].text, 2, `rag_${i}`);
}

// Nice-to-have — LOW (priority 3)
window.addBlock("Here's an example implementation...", 3, "example");

// Build
const finalPrompt = window.assemble({ maxTokens: 8000 });

const stats = window.usage();
console.log(`Included: ${stats.included} blocks`);
console.log(`Excluded: ${stats.excluded} blocks`);
console.log(`Tokens: ${stats.tokensUsed}`);
```
## Best Practices

- **Use CRITICAL sparingly** — Only system prompts and absolute essentials. If you mark everything as CRITICAL, nothing can be dropped and you'll get overflow errors.
- **Label your blocks** — The `id` parameter helps you debug which blocks were included or excluded: `Included: system, query, rag_0, rag_1` / `Excluded: rag_2, rag_3, example`.
- **Reserve output tokens** — If your model needs 2,000 tokens for its response, set `max_tokens` to `context_window - 2000`, not the full context window.
- **Combine with chunking** — If a single document is too large, chunk it first with `DocumentChunker`, then add each chunk as a separate MEDIUM-priority block. The packer will include as many chunks as fit.
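The output-token reservation above is simple arithmetic, but it's easy to get backwards. A small helper makes the intent explicit (a hypothetical function for illustration, not part of the library; the 8,192-token window is likewise an example figure):

```python
def prompt_budget(context_window: int, output_reserve: int) -> int:
    """Tokens available for the prompt after reserving response space."""
    if output_reserve >= context_window:
        raise ValueError("output reservation exceeds the context window")
    return context_window - output_reserve

# e.g. an 8,192-token window, reserving 2,000 tokens for the reply:
budget = prompt_budget(8192, 2000)
print(budget)  # pass this as max_tokens, not the full 8192
```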