# REST API Server
LLM Context Forge includes a fast, standalone HTTP server built on FastAPI. This allows you to deploy context management as an independent microservice that supports any language — Go, Rust, Ruby, or internal tools — without maintaining multiple tokenizer implementations.
## Starting the Server
The server is included in the base Python package.
```shell
# Start on the default host (127.0.0.1) and port (8000)
python -m llm_context_forge.server

# Configure the host and port
python -m llm_context_forge.server --host 0.0.0.0 --port 8080
```
## Endpoints
### POST /v1/tokens/count
Returns the exact token count for the given text and model.
Request:
```json
{
  "model": "gpt-4o",
  "text": "Hello, world!"
}
```
Response (200 OK):
```json
{
  "tokens": 4,
  "model": "gpt-4o",
  "encoder": "o200k_base"
}
```
### POST /v1/documents/chunk
Splits text into chunks, each no larger than `max_tokens`, using the requested strategy and overlap.
Request:
```json
{
  "model": "claude-3-5-sonnet",
  "text": "Very long document...",
  "strategy": "paragraph",
  "max_tokens": 1000,
  "overlap": 100
}
```
Response (200 OK):
```json
{
  "total_chunks": 5,
  "chunks": [
    "Chunk 1 text...",
    "Chunk 2 text..."
  ]
}
```
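The server's internal chunker is not specified here, but the `paragraph` strategy's semantics can be sketched locally: greedily merge paragraphs until the budget would be exceeded, then start a new chunk. This sketch uses word count as a stand-in for real token counts and omits `overlap` handling for brevity:

```python
def chunk_paragraphs(text: str, max_tokens: int) -> list[str]:
    """Illustrative paragraph chunking: merge consecutive paragraphs
    until adding the next one would exceed the budget.
    Word count stands in for real token counts."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        para_tokens = len(para.split())
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```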
### POST /v1/context/pack
Assembles a prompt by packing blocks in priority order until `max_tokens` is reached; blocks that do not fit are reported in `excluded_ids`.
Request:
```json
{
  "model": "gpt-4o",
  "max_tokens": 4000,
  "blocks": [
    {"content": "System prompt", "priority": 0, "id": "system"},
    {"content": "User query", "priority": 1, "id": "query"},
    {"content": "RAG chunk", "priority": 2, "id": "rag_0"}
  ]
}
```
Response (200 OK):
```json
{
  "prompt": "System prompt\n\nUser query\n\nRAG chunk",
  "usage": {
    "tokens_used": 150,
    "included_ids": ["system", "query", "rag_0"],
    "excluded_ids": []
  }
}
```
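The packing semantics can be sketched locally. This is an illustration, not the server's implementation: it assumes lower priority numbers are packed first (consistent with the example, where the priority-0 system block leads the prompt), and it uses word count as a stand-in for real token counts:

```python
def pack_blocks(blocks: list[dict], max_tokens: int) -> dict:
    """Illustrative priority packing: visit blocks in ascending priority
    and keep each one that still fits the budget.
    Word count stands in for real token counts."""
    included, excluded = [], []
    used = 0
    for block in sorted(blocks, key=lambda b: b["priority"]):
        cost = len(block["content"].split())
        if used + cost <= max_tokens:
            included.append(block)
            used += cost
        else:
            excluded.append(block)
    return {
        "prompt": "\n\n".join(b["content"] for b in included),
        "usage": {
            "tokens_used": used,
            "included_ids": [b["id"] for b in included],
            "excluded_ids": [b["id"] for b in excluded],
        },
    }
```

With a tight budget, low-priority blocks drop out first while their ids still appear in `excluded_ids`, so callers can tell exactly what was cut.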
### GET /v1/models
Returns the complete model registry.
Response (200 OK):
```json
{
  "models": {
    "gpt-4o": {
      "provider": "openai",
      "context_window": 128000,
      "tokenizer": "o200k_base"
    }
  }
}
```
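A client can consult the registry before packing, for example to pick a `max_tokens` budget from a model's context window. A small sketch that parses the response body above (the field names come from the example; the helper name is hypothetical):

```python
import json

# Sample response body, copied from the example above.
MODELS_RESPONSE = """
{
  "models": {
    "gpt-4o": {
      "provider": "openai",
      "context_window": 128000,
      "tokenizer": "o200k_base"
    }
  }
}
"""


def context_window(models_json: str, model: str) -> int:
    """Read a model's context window out of a /v1/models response body."""
    registry = json.loads(models_json)["models"]
    return registry[model]["context_window"]
```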