Content Retrieval Management
When splitting extracted text into chunks for embedding and retrieval, you can work in characters or tokens. Below are guidelines for both approaches, along with a brief overview of how they differ.
Configuring Chunk Size & Overlap
Tokens vs. Characters
- Characters
  - Literally each letter, number, punctuation mark, or space in your text.
  - Easy to reason about in terms of text length (e.g. a 1 000-character chunk is roughly a paragraph).
  - Doesn’t depend on any language model’s encoding scheme.
- Tokens
  - The units used internally by language models (words or sub-words).
  - One token averages about 4 characters in English, but varies by language and punctuation.
  - Ideal when you know your model’s token limit (e.g. 4 096, 16 384, 32 768, 200 000 tokens) and want to fit precisely within it.
Why choose one over the other?
- Characters: simpler, universal, and stable across models.
- Tokens: more efficient if you need to tightly pack a known token window, or when you mix languages with different tokenization behavior.
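To see the difference concretely, the short sketch below counts the same string in characters and in tokens. It assumes the tiktoken package is installed; OpenWebUI itself does not require it, and the snippet is purely illustrative.

```python
# Illustrative only: compare character count vs. token count for one string.
import tiktoken

text = "Chunking strategy depends on how your model tokenizes text."

# cl100k_base is the encoding used by gpt-3.5-turbo / gpt-4-era models.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode(text)

print(len(text))    # character count
print(len(tokens))  # token count -- usually close to len(text) / 4 for English
```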
Character-Based Splitting
- Chunk size: 2 000–4 000 characters
- Overlap: 200–800 characters (≈ 10 %–20 %)
Why these ranges?
- Semantic coherence: ~2 000–4 000 chars covers one to two paragraphs.
- Model efficiency: At ~4 chars/token, that’s 500–1 000 tokens—small enough for fast processing, large enough to reduce total chunks.
- Context preservation: 200–800-char overlap ensures fragments at edges appear in both chunks.
- Resource balance: Keeps memory/compute in check while avoiding thousands of tiny chunks.
- Tunable: Shrink toward 2 000 chars for latency-sensitive use cases; expand if you have large-context models.
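To make the mechanics concrete, here is a minimal character-based splitter whose defaults are drawn from the ranges above. OpenWebUI performs this step internally; treat the function as a sketch of the logic, not its actual implementation.

```python
# Sketch of character-based chunking with overlap (not OpenWebUI's internal code).
def split_by_characters(text: str, chunk_size: int = 3000, overlap: int = 400) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # advance less than chunk_size so edges land in two chunks
    return chunks
```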
Recommended Character Ranges by Document Type
Document Type | Chunk Size (chars) | Overlap (chars) |
---|---|---|
News Articles | 1 500–2 500 | 150–250 |
Blog Posts | 2 000–3 000 | 200–300 |
Academic Papers | 3 000–4 500 | 300–450 |
Legal Contracts | 2 000–3 000 | 400–600 |
Technical Docs / Specs | 2 500–4 000 | 300–500 |
Books / Long-Form | 4 000–6 000 | 400–800 |
Transcripts / Dialogue | 1 000–2 000 | 100–200 |
Token-Based Splitting
- Chunk size: 500–1 000 tokens
- Overlap: 50–200 tokens (≈ 10 %–20 %)
Why these ranges?
- Semantic coherence: 500–1 000 tokens (~1–2 paragraphs).
- Model efficiency: Fits easily within most context windows (4 096–200 000 tokens).
- Context preservation: 50–200-token overlap covers sentence tails and maintains continuity.
- Resource balance: Balances compute/latency with number of chunks.
- Tunable: Shrink for small-context models; expand for very large-context ones.
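The token-based equivalent slices tokens instead of characters. The sketch below again assumes tiktoken for tokenization and is not OpenWebUI's internal splitter.

```python
# Sketch of token-based chunking with overlap, using tiktoken for tokenization.
import tiktoken

def split_by_tokens(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))  # decode the token window back to text
        start += chunk_size - overlap
    return chunks
```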
Recommended Token Ranges by Document Type
Document Type | Chunk Size (tokens) | Overlap (tokens) |
---|---|---|
News Articles | 300–600 | 30–60 |
Blog Posts | 500–800 | 50–100 |
Academic Papers | 700–1 000 | 70–150 |
Legal Contracts | 500–800 | 100–160 |
Technical Docs / Specs | 600–1 000 | 60–120 |
Books / Long-Form | 1 000–1 500 | 100–300 |
Transcripts / Dialogue | 200–400 | 20–80 |
Tuning Tips
- Start low if you’re latency-sensitive or using smaller-context models.
- Scale up for large-context models (≥ 32 k tokens) or when preserving document coherence is critical.
- Monitor retrieval hits—if you see misses at chunk edges, bump overlap by 10 %–20 % of your chunk size.
Typical OpenAI Model Context Limits
Model | Context Window (tokens) |
---|---|
gpt-3.5-turbo | 4 096 |
gpt-3.5-turbo-16k | 16 384 |
gpt-4 | 8 192 |
gpt-4-32k | 32 768 |
o4-mini (reasoning model) | 200 000
Use token-based settings for precise control within these limits; otherwise, character-based gives a simpler, model-agnostic approach.
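As a rough sanity check, the helper below estimates how many retrieved chunks of a given size fit into a context window alongside the prompt and the model's answer. The 500-token prompt reserve and 1 000-token answer reserve are illustrative assumptions, not OpenWebUI settings.

```python
# Back-of-the-envelope budgeting: how many chunks fit in a given context window?
def max_chunks(context_window: int, chunk_tokens: int,
               prompt_reserve: int = 500, answer_reserve: int = 1000) -> int:
    usable = context_window - prompt_reserve - answer_reserve
    return max(usable // chunk_tokens, 0)

print(max_chunks(4_096, 800))    # gpt-3.5-turbo: room for ~3 chunks of 800 tokens
print(max_chunks(32_768, 800))   # gpt-4-32k: room for ~39 chunks
```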
Content Extraction Engines
OpenWebUI supports multiple backends for extracting text (and structure) from documents. Choose the one that best matches your input type:
- Default
  A hybrid: uses Apache Tika on born-digital files (PDF, DOCX, HTML) and automatically falls back to Mistral OCR for scanned or image-only pages.
- Tika
  The pure Apache Tika extractor.
  - Pros: Lightning-fast on PDFs/Office files with embedded text.
  - Cons: No OCR, so it can’t read images or scans.
- Mistral OCR
  A vision-based OCR engine powered by Mistral’s open vision models.
  - Pros: Excellent on scanned documents or photos.
  - Cons: Slower, and less accurate on very low-quality images.
- Document Intelligence
  Integrates a deep-learning document parser (e.g. Azure Form Recognizer).
  - Pros: OCR plus semantic structuring (tables, key/value pairs, forms).
  - Cons: Higher latency and cost; best reserved for documents that need structured output, such as invoices, receipts, and contracts.
- Docling
  A commercial add-on layering custom ML enhancements on top of Tika.
  - Pros: Improved layout detection and text cleanup.
  - Cons: Paid service; additional configuration needed.
- External
  Sends your file to any third-party or custom endpoint.
  - Pros: Total flexibility: you bring your own extractor.
  - Cons: You manage uptime and compatibility.
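To make the Default engine's hybrid behaviour concrete, here is a rough sketch: try Tika first, then fall back to OCR when no embedded text comes back. The tika Python package is a real client for a running Tika server; run_mistral_ocr is a hypothetical placeholder for whichever OCR client you wire in.

```python
# Sketch of the hybrid "Default" flow described above (not OpenWebUI's actual code).
from tika import parser

def run_mistral_ocr(path: str) -> str:
    # Hypothetical placeholder: call your OCR backend (e.g. Mistral OCR) here.
    raise NotImplementedError

def extract_text(path: str) -> str:
    parsed = parser.from_file(path)              # Apache Tika extraction
    text = (parsed.get("content") or "").strip()
    if text:
        return text                              # born-digital file with embedded text
    return run_mistral_ocr(path)                 # scanned/image-only page: fall back to OCR
```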
Embedding Configuration
Control how OpenWebUI converts text chunks into vector embeddings. By default, SentenceTransformers is used.
Embedding Engines
- SentenceTransformers (default)
  - Runs locally using HuggingFace-compatible models via the Sentence-Transformers library.
  - Pros: No per-request costs; full control over model choice (e.g. all-MiniLM-L6-v2, multi-qa-MiniLM-L6-cos-v1).
  - Cons: Requires CPU/GPU resources; you manage model downloads and updates.
- OpenAI
  - Uses OpenAI’s Embeddings API (text-embedding-ada-002, text-embedding-3-small, etc.).
  - Pros: Fully managed, high-quality service with multi-lingual support.
  - Cons: Usage-based billing; network latency; requires an API key.
- Ollama
  - Leverages locally hosted LLMs via the Ollama daemon (e.g. lmsys/vicuna-7b).
  - Pros: Runs on your infrastructure, no per-token charges, low latency for on-prem deployments.
  - Cons: Dependent on available Ollama models; requires local compute resources.
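As a quick illustration of the default engine, the snippet below embeds two chunks locally with the sentence-transformers package and the all-MiniLM-L6-v2 model mentioned above.

```python
# Local embedding with Sentence-Transformers (the default engine's approach).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = ["First chunk of extracted text.", "Second chunk of extracted text."]
embeddings = model.encode(chunks)  # one 384-dimensional vector per chunk

print(embeddings.shape)  # (2, 384)
```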
Embedding Model Selection
Once you choose an engine, pick a specific model based on your quality vs. throughput vs. cost needs:
Engine | Model Example | Dims | Max Tokens | Notes |
---|---|---|---|---|
SentenceTransformers | all-MiniLM-L6-v2 | 384 | 512 | Very fast, compact vectors. |
SentenceTransformers | paraphrase-mpnet-base-v2 | 768 | 512 | Higher quality for semantic tasks. |
OpenAI | text-embedding-ada-002 | 1 536 | 8 191 | Inexpensive, good general-purpose. |
OpenAI | text-embedding-3-large | 3 072 | 8 191 | Best quality & multi-lingual support. |
Ollama | Varies by model (e.g. vicuna-7b) | Varies | Varies | Depends on the Ollama model you install. |
Warning: Changing your embedding engine or model requires re-importing all documents to regenerate vectors.
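For comparison, the same two chunks embedded through the OpenAI Embeddings API with the official Python client would look roughly like this (requires an API key; text-embedding-3-small is just one of the models listed above):

```python
# Hosted embedding via the OpenAI Embeddings API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["First chunk of extracted text.", "Second chunk of extracted text."],
)
vectors = [item.embedding for item in response.data]
print(len(vectors[0]))  # vector dimensionality depends on the chosen model
```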
Embedding Batch Size
- Default: 100
- Range: 50–500
Larger batches improve throughput but increase memory usage; tune to fit your environment.
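In practice, batching just means slicing the chunk list before encoding, as in this sketch (model can be any encoder with an encode method, such as the SentenceTransformer instance shown earlier):

```python
# Sketch of batched embedding so memory stays bounded; batch_size mirrors the default of 100.
def embed_in_batches(chunks: list[str], model, batch_size: int = 100) -> list:
    vectors = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        vectors.extend(model.encode(batch))  # larger batches raise throughput and memory use
    return vectors
```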
With these guidelines on chunking, model limits, extraction engines, and embeddings, you’ll have everything you need to configure OpenWebUI for robust, high-quality document ingestion and retrieval.