Intelligent Chunking

Chunking is the process of splitting large documents into smaller pieces that fit within token limits. Naive splitting (e.g., every 1,000 characters) destroys meaning. LLM Context Forge provides five strategies that respect natural language boundaries.
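To see why boundary-aware splitting matters, here is a minimal sketch (plain Python, not the library's internals) contrasting a naive fixed-character split with a sentence-aware one:

```python
import re

text = "The quarterly revenue grew by 14 percent. Costs fell."

# Naive fixed-size split: cuts mid-word and mid-sentence.
naive = [text[i:i + 25] for i in range(0, len(text), 25)]

# Sentence-aware split: break after ., !, or ? followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())
```

The naive version leaves a chunk ending in the fragment `gre`; the sentence-aware version keeps each statement intact.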

The Five Strategies

| Strategy | Best For | How It Works |
| --- | --- | --- |
| `SENTENCE` | Articles, emails | Splits at sentence boundaries (`.` `!` `?`) |
| `PARAGRAPH` | Long-form docs, reports | Splits at double newlines |
| `SEMANTIC` | Research papers, mixed content | Groups semantically related sentences together |
| `CODE` | Source code, configs | Splits at function/class/block boundaries |
| `FIXED` | Uniform processing pipelines | Fixed token count per chunk (hard split) |
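The `PARAGRAPH` boundary rule (split at double newlines) is simple enough to sketch directly; this is an illustration of the rule, not the library's implementation:

```python
import re

doc = "Intro paragraph.\n\nSecond paragraph with details.\n\nConclusion."

# PARAGRAPH-style split: break wherever one or more blank lines occur.
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]
```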

Overlap Tokens

Every strategy supports an overlap_tokens parameter that duplicates context at chunk boundaries. This is critical for RAG pipelines where a relevant passage might be split across two chunks:

```text
Chunk 1: [............content............|--overlap--|]
Chunk 2: [--overlap--|............content............]
```

Without overlap, a question about the boundary region would fail to retrieve the full context. A typical value is 50–150 tokens of overlap.
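The overlap mechanic can be sketched as a sliding window whose step is `max_tokens - overlap_tokens` (a toy model using a list of tokens, not the library's tokenizer-aware code):

```python
def chunk_with_overlap(tokens, max_tokens, overlap_tokens):
    """FIXED-style split where each chunk begins with the last
    `overlap_tokens` tokens of the previous chunk."""
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
    return chunks

words = [f"w{i}" for i in range(10)]
chunks = chunk_with_overlap(words, max_tokens=4, overlap_tokens=2)
# Each chunk shares its first two tokens with the tail of the previous one.
```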

Usage

```python
from llm_context_forge import DocumentChunker, ChunkStrategy

chunker = DocumentChunker("gpt-4o")

chunks = chunker.chunk(
    text="Your very long document goes here...",
    strategy=ChunkStrategy.PARAGRAPH,
    max_tokens=800,
    overlap_tokens=100,
)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} chars")
```

Available Strategies

```python
from llm_context_forge import ChunkStrategy

ChunkStrategy.SENTENCE   # Split at sentence boundaries
ChunkStrategy.PARAGRAPH  # Split at paragraph breaks
ChunkStrategy.SEMANTIC   # Group semantically related content
ChunkStrategy.CODE       # Split at code block boundaries
ChunkStrategy.FIXED      # Fixed token count per chunk
```

Choosing a Strategy

```text
Is your content code?
├─ Yes → CODE
└─ No → Do you need exact-size chunks?
        ├─ Yes → FIXED
        └─ No → Is it structured with clear sections?
                ├─ Yes → PARAGRAPH
                └─ No → Is semantic coherence critical (e.g., RAG)?
                        ├─ Yes → SEMANTIC
                        └─ No → SENTENCE
```
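The decision tree translates directly into code. A minimal sketch (plain strings stand in for the `ChunkStrategy` members; `choose_strategy` is a hypothetical helper, not part of the library):

```python
def choose_strategy(is_code, need_exact_size, has_clear_sections, needs_semantic):
    # Checks mirror the decision tree, top to bottom.
    if is_code:
        return "CODE"
    if need_exact_size:
        return "FIXED"
    if has_clear_sections:
        return "PARAGRAPH"
    if needs_semantic:
        return "SEMANTIC"
    return "SENTENCE"
```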

:::tip Performance
SENTENCE and PARAGRAPH are O(n) and essentially free. SEMANTIC is more expensive because it scores sentence similarity; use it only when retrieval quality is worth the latency.
:::