
REST API Server

LLM Context Forge includes a fast, standalone HTTP server built on FastAPI. This lets you deploy context management as an independent microservice, so clients written in any language — Go, Rust, Ruby, or internal tooling — can use it without maintaining multiple tokenizer implementations.

Starting the Server

The server is included in the base Python package.

# Start on default host (127.0.0.1) and port (8000)
python -m llm_context_forge.server

# Configure host and port
python -m llm_context_forge.server --host 0.0.0.0 --port 8080
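Once the server is running, any HTTP client can call it. Below is a minimal Python client sketch using only the standard library; the base URL matches the defaults above, and the helper names (`build_request`, `count_tokens`) are illustrative, not part of the package:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # default host/port; adjust if you passed --host/--port


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a POST request with a JSON body for the given endpoint."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def count_tokens(model: str, text: str) -> int:
    """Call POST /v1/tokens/count and return the token count."""
    req = build_request("/v1/tokens/count", {"model": model, "text": text})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["tokens"]
```

The same `build_request` helper works for the other POST endpoints documented below; only the path and payload change.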

Endpoints

POST /v1/tokens/count

Returns the exact token count for the given text under the specified model's tokenizer.

Request:

{
  "model": "gpt-4o",
  "text": "Hello, world!"
}

Response (200 OK):

{
  "tokens": 4,
  "model": "gpt-4o",
  "encoder": "o200k_base"
}

POST /v1/documents/chunk

Splits text into token-bounded chunks using the requested strategy, with optional overlap between consecutive chunks.

Request:

{
  "model": "claude-3-5-sonnet",
  "text": "Very long document...",
  "strategy": "paragraph",
  "max_tokens": 1000,
  "overlap": 100
}

Response (200 OK):

{
  "total_chunks": 5,
  "chunks": [
    "Chunk 1 text...",
    "Chunk 2 text..."
  ]
}
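The paragraph strategy can be pictured as greedy accumulation: paragraphs are appended to the current chunk until adding the next one would exceed max_tokens, then a new chunk begins. Here is a simplified sketch of that idea — not the server's actual implementation — using word count as a stand-in for real token counting, with overlap handling omitted:

```python
def chunk_paragraphs(text: str, max_tokens: int) -> list[str]:
    """Greedy paragraph chunking sketch; the real server counts model tokens."""

    def count(s: str) -> int:
        return len(s.split())  # placeholder for a real tokenizer

    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = (current + "\n\n" + para) if current else para
        if current and count(candidate) > max_tokens:
            # Current chunk is full; start a new one with this paragraph
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```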

POST /v1/context/pack

Assembles a prompt from prioritized blocks, including them in priority order until the token budget is exhausted.

Request:

{
  "model": "gpt-4o",
  "max_tokens": 4000,
  "blocks": [
    {"content": "System prompt", "priority": 0, "id": "system"},
    {"content": "User query", "priority": 1, "id": "query"},
    {"content": "RAG chunk", "priority": 2, "id": "rag_0"}
  ]
}

Response (200 OK):

{
  "prompt": "System prompt\n\nUser query\n\nRAG chunk",
  "usage": {
    "tokens_used": 150,
    "included_ids": ["system", "query", "rag_0"],
    "excluded_ids": []
  }
}
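Priority packing can be understood as a greedy fit: judging from the example above, lower priority numbers are more important, so blocks are considered in ascending priority order and included while they still fit the budget. A simplified sketch of that behavior (word count stands in for real token counting; this is an illustration, not the server's implementation):

```python
def pack(blocks: list[dict], max_tokens: int) -> dict:
    """Greedy priority packing sketch. Lower priority number = more important."""

    def count(s: str) -> int:
        return len(s.split())  # placeholder for a real tokenizer

    included, excluded, used = [], [], 0
    for block in sorted(blocks, key=lambda b: b["priority"]):
        tokens = count(block["content"])
        if used + tokens <= max_tokens:
            included.append(block)
            used += tokens
        else:
            excluded.append(block)
    # Join included blocks with blank lines, as in the example response
    return {
        "prompt": "\n\n".join(b["content"] for b in included),
        "usage": {
            "tokens_used": used,
            "included_ids": [b["id"] for b in included],
            "excluded_ids": [b["id"] for b in excluded],
        },
    }
```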

GET /v1/models

Returns the complete model registry.

Response (200 OK):

{
  "models": {
    "gpt-4o": {
      "provider": "openai",
      "context_window": 128000,
      "tokenizer": "o200k_base"
    }
  }
}
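A client can use the registry to pick a prompt budget at runtime, for example by subtracting a reply reserve from the model's context window. A small sketch against a response in the shape shown above (the `budget_for` helper and its `reserve` parameter are illustrative, not part of the package):

```python
import json

# A response in the documented shape, truncated to one model
registry_json = """
{
  "models": {
    "gpt-4o": {
      "provider": "openai",
      "context_window": 128000,
      "tokenizer": "o200k_base"
    }
  }
}
"""

registry = json.loads(registry_json)["models"]


def budget_for(model: str, reserve: int = 1000) -> int:
    """Leave `reserve` tokens for the model's reply; the rest is prompt budget."""
    return registry[model]["context_window"] - reserve
```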