# Intelligent Chunking
Chunking is the process of splitting large documents into smaller pieces that fit within token limits. Naive splitting (e.g., every 1,000 characters) destroys meaning. LLM Context Forge provides five strategies that respect natural language boundaries.
## The Five Strategies
| Strategy | Best For | How It Works |
|---|---|---|
| SENTENCE | Articles, emails | Splits at sentence boundaries (. ! ?) |
| PARAGRAPH | Long-form docs, reports | Splits at double newlines |
| SEMANTIC | Research papers, mixed content | Groups semantically related sentences together |
| CODE | Source code, configs | Splits at function/class/block boundaries |
| FIXED | Uniform processing pipelines | Fixed token count per chunk (hard split) |
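As an illustration of the simplest boundary-respecting strategy, here is a minimal sketch of PARAGRAPH-style splitting: split on blank lines, then greedily pack paragraphs into chunks under a token budget. This is standalone example code, not the library's implementation, and it approximates token counts with whitespace-separated words.

```python
def paragraph_chunks(text: str, max_tokens: int) -> list[str]:
    """Greedily pack paragraphs (blank-line separated) into chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())  # crude token estimate: word count
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because chunk boundaries only ever fall between paragraphs, no paragraph is ever cut mid-sentence; an oversized single paragraph simply becomes its own over-budget chunk.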
## Overlap Tokens
Every strategy supports an `overlap_tokens` parameter that duplicates context at chunk boundaries. This is critical for RAG pipelines, where a relevant passage might be split across two chunks:
```text
Chunk 1: [............content............|--overlap--|]
Chunk 2: [--overlap--|............content............]
```
Without overlap, a question about the boundary region would fail to retrieve the full context. A typical value is 50–150 tokens of overlap.
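The overlap mechanism can be sketched as a sliding window: each chunk starts `overlap_tokens` before the previous one ended. This is an illustrative standalone snippet, not the library's code, and it operates on a pre-tokenized list (here, plain words stand in for tokens).

```python
def fixed_chunks(tokens: list[str], max_tokens: int,
                 overlap_tokens: int) -> list[list[str]]:
    """FIXED-style split with overlap; requires max_tokens > overlap_tokens."""
    step = max_tokens - overlap_tokens  # how far the window advances each time
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), step)]
```

The last `overlap_tokens` of each chunk are repeated at the start of the next, so a passage straddling a boundary appears whole in at least one chunk.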
## Usage
### Python
```python
from llm_context_forge import DocumentChunker, ChunkStrategy

chunker = DocumentChunker("gpt-4o")
chunks = chunker.chunk(
    text="Your very long document goes here...",
    strategy=ChunkStrategy.PARAGRAPH,
    max_tokens=800,
    overlap_tokens=100,
)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} chars")
```
### Available Strategies

```python
from llm_context_forge import ChunkStrategy

ChunkStrategy.SENTENCE   # Split at sentence boundaries
ChunkStrategy.PARAGRAPH  # Split at paragraph breaks
ChunkStrategy.SEMANTIC   # Group semantically related content
ChunkStrategy.CODE       # Split at code block boundaries
ChunkStrategy.FIXED      # Fixed token count per chunk
```
### TypeScript

```typescript
import { DocumentChunker } from "llm-context-forge";

const chunker = new DocumentChunker();
const chunks = chunker.chunk("Your very long document goes here...", {
  maxTokens: 800,
  overlapTokens: 100,
});

console.log(`Split into ${chunks.length} chunks.`);
chunks.forEach((chunk, i) => {
  console.log(`Chunk ${i}: ${chunk.length} chars`);
});
```
## Choosing a Strategy

```text
Is your content code?
├─ Yes → CODE
└─ No → Do you need exact-size chunks?
        ├─ Yes → FIXED
        └─ No → Is it structured with clear sections?
                ├─ Yes → PARAGRAPH
                └─ No → Is semantic coherence critical (e.g., RAG)?
                        ├─ Yes → SEMANTIC
                        └─ No → SENTENCE
```
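The decision tree above can be expressed as a small helper. This is a hypothetical convenience function, not part of the library's API, and it returns strategy names as plain strings rather than `ChunkStrategy` enum members.

```python
def pick_strategy(is_code: bool = False,
                  need_exact_size: bool = False,
                  has_clear_sections: bool = False,
                  semantic_critical: bool = False) -> str:
    """Walk the decision tree top to bottom; first matching branch wins."""
    if is_code:
        return "CODE"
    if need_exact_size:
        return "FIXED"
    if has_clear_sections:
        return "PARAGRAPH"
    if semantic_critical:
        return "SEMANTIC"
    return "SENTENCE"
```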
:::tip Performance
SENTENCE and PARAGRAPH are O(n) and essentially free. SEMANTIC is more expensive because it scores sentence similarity; use it only when retrieval quality is worth the latency.
:::
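To see where SEMANTIC's extra cost comes from, here is a toy sketch of similarity-based grouping: every boundary decision requires scoring adjacent sentences. The Jaccard word-overlap score below is a cheap stand-in; real implementations typically use embeddings, which is where the latency goes. This is illustrative code, not the library's algorithm.

```python
def semantic_groups(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Group consecutive sentences whose word-overlap similarity is high."""
    def jaccard(a: str, b: str) -> float:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if jaccard(prev, cur) >= threshold:
            groups[-1].append(cur)  # similar enough: stay in the same chunk
        else:
            groups.append([cur])    # topic shift: start a new chunk
    return groups
```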