# Priority-Based Context Packing

The `ContextWindow` is the most critical component for RAG implementations. It accepts content blocks with assigned priorities and packs them into a token-limited prompt, gracefully dropping lower-priority items when space runs out.

## Priority Levels

| Priority | Numeric | Guarantee | Typical Usage |
| --- | --- | --- | --- |
| `CRITICAL` | 0 | Always included (fails if it doesn't fit) | System prompt |
| `HIGH` | 1 | Included before `MEDIUM`/`LOW` | User query, essential context |
| `MEDIUM` | 2 | Included if space allows | RAG documents, history |
| `LOW` | 3 | First to be dropped | Nice-to-have context, examples |

## The Algorithm

1. Sort all blocks by priority (`CRITICAL` first).
2. Initialize `token_budget = max_tokens`.
3. For each block in priority order:
   - Count the tokens in the block.
   - If tokens ≤ remaining budget: include the block and subtract its tokens from the budget.
   - If tokens > remaining budget and the priority is `CRITICAL`: fail (critical content must fit).
   - If tokens > remaining budget and the priority is not `CRITICAL`: skip the block and mark it as excluded.
4. Concatenate the included blocks into the final prompt.
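The steps above can be sketched in plain Python. This is a simplified stand-alone version, not the library's implementation: the real `ContextWindow` counts tokens with a model-specific tokenizer, while this sketch counts whitespace-separated words.

```python
from enum import IntEnum

class Priority(IntEnum):
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3

def count_tokens(text: str) -> int:
    # Stand-in tokenizer for illustration; real packers use the model's tokenizer.
    return len(text.split())

def pack(blocks: list[tuple[str, Priority]], max_tokens: int) -> list[str]:
    """Greedily include blocks by priority; raise if a CRITICAL block won't fit."""
    budget = max_tokens
    included = []
    # sorted() is stable, so insertion order is preserved within each priority level.
    for text, priority in sorted(blocks, key=lambda b: b[1]):
        tokens = count_tokens(text)
        if tokens <= budget:
            included.append(text)
            budget -= tokens
        elif priority == Priority.CRITICAL:
            raise ValueError("CRITICAL block does not fit in the context window")
        # Otherwise the block is skipped (excluded).
    return included
```

Because the sort is stable, two blocks with the same priority are considered in the order they were added, which keeps RAG documents in their retrieval order.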

:::danger Critical Priority

If a `CRITICAL` block doesn't fit in the context window, the assembler will raise an error rather than silently drop it. This is intentional: your system prompt being truncated is never acceptable.

:::

## Usage

```python
from llm_context_forge import ContextWindow, Priority

window = ContextWindow("gpt-4o")

# System prompt — CRITICAL (never dropped)
window.add_block(
    "You are an expert Python developer. Answer concisely.",
    Priority.CRITICAL,
    "system",
)

# User's question — HIGH priority
window.add_block(
    "How do I implement a binary search tree in Python?",
    Priority.HIGH,
    "query",
)

# RAG documents — MEDIUM (dropped if space is tight)
for i, doc in enumerate(retrieved_documents):
    window.add_block(doc.text, Priority.MEDIUM, f"rag_{i}")

# Nice-to-have examples — LOW (first to go)
window.add_block(
    "Here's an example implementation...",
    Priority.LOW,
    "example",
)

# Build the final prompt
final_prompt = window.assemble(max_tokens=8000)
stats = window.usage()

print(f"Included: {stats.included} blocks")
print(f"Excluded: {stats.excluded} blocks")
print(f"Tokens used: {stats.tokens_used}")
```

## Best Practices

1. **Use `CRITICAL` sparingly** — only system prompts and absolute essentials. If you mark everything as `CRITICAL`, nothing can be dropped and you'll get overflow errors.

2. **Label your blocks** — the `id` parameter helps you debug which blocks were included or excluded:

   ```
   Included: system, query, rag_0, rag_1
   Excluded: rag_2, rag_3, example
   ```

3. **Reserve output tokens** — if your model needs 2,000 tokens for its response, set `max_tokens` to `context_window - 2000`, not the full context window.

4. **Combine with chunking** — if a single document is too large, chunk it first with `DocumentChunker`, then add each chunk as a separate `MEDIUM`-priority block. The packer will include as many chunks as fit.
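To illustrate the chunking pattern, here is a deliberately naive word-count chunker. The library's `DocumentChunker` is more sophisticated (the details are not shown on this page), so treat this as a sketch of the idea, not its implementation:

```python
def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# Each chunk would then become its own MEDIUM-priority block, e.g.:
#
#     for i, chunk in enumerate(chunk_words(doc.text, 512)):
#         window.add_block(chunk, Priority.MEDIUM, f"doc_chunk_{i}")
#
# so the packer can include a prefix of the document when the full text won't fit.
```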