Content Retrieval Management
When splitting extracted text into chunks for embedding and retrieval, you can work in characters or tokens. Below are guidelines for both approaches, along with a brief overview of how they differ.
Configuring Chunk Size & Overlap
Tokens vs. Characters
- Characters
  - Literally each letter, number, punctuation mark, or space in your text.
  - Easy to reason about in terms of text length (e.g. a 1 000-character chunk is roughly a paragraph).
  - Doesn't depend on any language model's encoding scheme.
- Tokens
  - The units used internally by language models (words or sub-words).
  - One token averages about 4 characters in English, but varies by language and punctuation.
  - Ideal when you know your model's token limit (e.g. 4 096, 16 384, 32 768, or 200 000 tokens) and want to fit precisely within it.
Why choose one over the other?
- Characters: simpler, universal, and stable across models.
- Tokens: more efficient if you need to tightly pack a known token window, or when you mix languages with different tokenization behavior.
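To make the difference concrete, here is a minimal sketch that measures the same text both ways. It assumes the `tiktoken` package and the `cl100k_base` encoding; OpenWebUI does not require this, it is purely illustrative.

```python
import tiktoken

text = "OpenWebUI splits extracted text into chunks before embedding."

# Character count: stable across models, no tokenizer needed.
char_count = len(text)

# Token count: depends on the tokenizer; cl100k_base is used by many OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")
token_count = len(enc.encode(text))

print(f"{char_count} characters, {token_count} tokens "
      f"(~{char_count / token_count:.1f} chars/token)")
```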
Character-Based Splitting
- Chunk size: 2 000–4 000 characters
- Overlap: 200–800 characters (≈ 10 %–20 %)
Why these ranges?
- Semantic coherence: ~2 000–4 000 chars covers one to two paragraphs.
- Model efficiency: At ~4 chars/token, that's 500–1 000 tokens: small enough for fast processing, large enough to reduce total chunks.
- Context preservation: A 200–800-char overlap ensures fragments at chunk edges appear in both chunks.
- Resource balance: Keeps memory/compute in check while avoiding thousands of tiny chunks.
- Tunable: Shrink toward 2 000 chars for latency-sensitive use cases; expand if you have large-context models.
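As a point of reference, a character-based splitter with overlap can be sketched in a few lines. The `chunk_size` and `overlap` defaults below simply reflect the ranges above; this is not an OpenWebUI API, just an illustration of the mechanism.

```python
def split_by_characters(text: str, chunk_size: int = 3000, overlap: int = 400) -> list[str]:
    """Naive fixed-window splitter: each chunk shares `overlap` characters with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    # Drop a trailing fragment that is fully contained in the previous chunk.
    if len(chunks) > 1 and len(chunks[-1]) <= overlap:
        chunks.pop()
    return chunks
```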
Recommended Character Ranges by Document Type
Document Type | Chunk Size (chars) | Overlap (chars)
---|---|---
News Articles | 1 500–2 500 | 150–250
Blog Posts | 2 000–3 000 | 200–300
Academic Papers | 3 000–4 500 | 300–450
Legal Contracts | 2 000–3 000 | 400–600
Technical Docs / Specs | 2 500–4 000 | 300–500
Books / Long-Form | 4 000–6 000 | 400–800
Transcripts / Dialogue | 1 000–2 000 | 100–200
Token-Based Splitting
- Chunk size: 500–1 000 tokens
- Overlap: 50–200 tokens (≈ 10 %–20 %)
Why these ranges?
- Semantic coherence: 500–1 000 tokens (~1–2 paragraphs).
- Model efficiency: Fits easily within most context windows (4 096–200 000 tokens).
- Context preservation: A 50–200-token overlap covers sentence tails and maintains continuity.
- Resource balance: Balances compute/latency with the number of chunks.
- Tunable: Shrink for small-context models; expand for very large-context ones.
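A token-based splitter works the same way, except windows are measured in tokens. A minimal sketch, again assuming `tiktoken` and the `cl100k_base` encoding (swap in your model's tokenizer):

```python
import tiktoken

def split_by_tokens(text: str, chunk_size: int = 750, overlap: int = 100,
                    encoding_name: str = "cl100k_base") -> list[str]:
    """Split text into overlapping windows of `chunk_size` tokens, decoded back to strings."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
    return chunks
```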
Recommended Token Ranges by Document Type
Document Type | Chunk Size (tokens) | Overlap (tokens)
---|---|---
News Articles | 300–600 | 30–60
Blog Posts | 500–800 | 50–100
Academic Papers | 700–1 000 | 70–150
Legal Contracts | 500–800 | 100–160
Technical Docs / Specs | 600–1 000 | 60–120
Books / Long-Form | 1 000–1 500 | 100–300
Transcripts / Dialogue | 200–400 | 20–80
Tuning Tips
- Start low if you're latency-sensitive or using smaller-context models.
- Scale up for large-context models (≥ 32 k tokens) or when preserving document coherence is critical.
- Monitor retrieval hits: if you see misses at chunk edges, bump overlap by 10 %–20 % of your chunk size.
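The overlap bump in the last tip is simple arithmetic; the values below are illustrative, not an OpenWebUI setting:

```python
chunk_size = 3000        # characters (or tokens; the math is the same)
overlap = 300            # current overlap

# Bump overlap by 10 %-20 % of the chunk size when edge misses show up.
bump_fraction = 0.15     # anywhere in the 0.10-0.20 range
overlap += int(chunk_size * bump_fraction)
print(overlap)           # 300 + 450 = 750
```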
Typical OpenAI Model Context Limits
Model | Context Window (tokens)
---|---
gpt-3.5-turbo | 4 096 |
gpt-3.5-turbo-16k | 16 384 |
gpt-4 | 8 192 |
gpt-4-32k | 32 768 |
o4-mini (internal reasoning) | 200 000 |
Use token-based settings for precise control within these limits; otherwise, character-based splitting gives a simpler, model-agnostic approach.
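When retrieved chunks are packed into a prompt, it helps to budget against the model's window explicitly. A rough sketch; the reserve sizes are assumptions, not fixed rules:

```python
def max_chunks_per_prompt(context_window: int, chunk_tokens: int,
                          system_and_query_tokens: int = 500,
                          answer_reserve_tokens: int = 1000) -> int:
    """Estimate how many retrieved chunks of `chunk_tokens` fit alongside the prompt and answer."""
    available = context_window - system_and_query_tokens - answer_reserve_tokens
    return max(available // chunk_tokens, 0)

# e.g. gpt-4 (8 192-token window) with 750-token chunks:
print(max_chunks_per_prompt(8192, 750))   # -> 8
```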
Content Extraction Engines
OpenWebUI supports multiple backends for extracting text (and structure) from documents. Choose the one that best matches your input type:
- Default
  A hybrid: uses Apache Tika on born-digital files (PDF, DOCX, HTML) and automatically falls back to Mistral OCR for scanned/image-only pages.
- Tika
  The pure Apache Tika extractor.
  - Pros: Lightning-fast on PDFs/Office files with embedded text.
  - Cons: No OCR; can't read images or scans.
- Mistral OCR
  A vision-based OCR engine powered by Mistral's open vision models.
  - Pros: Excellent on scanned documents or photos.
  - Cons: Slower and less accurate on very low-quality images.
- Document Intelligence
  Integrates a deep-learning document parser (e.g. Azure Form Recognizer).
  - Pros: OCR + semantic structuring (tables, key/value pairs, forms).
  - Cons: Higher latency and cost; ideal for invoices, receipts, contracts.
- Docling
  A commercial add-on layering custom ML enhancements on top of Tika.
  - Pros: Improved layout detection and text cleanup.
  - Cons: Paid service; additional configuration needed.
- External
  Sends your file to any third-party or custom endpoint.
  - Pros: Total flexibility: you bring your own extractor.
  - Cons: You manage uptime and compatibility.
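To illustrate the routing idea behind the Default engine (not OpenWebUI's internal implementation), here is a rough sketch. The Tika call uses the real `tika-python` client, which needs a running Tika server or its bundled jar; the OCR fallback is a placeholder you would wire to your own backend.

```python
from tika import parser  # Apache Tika client (tika-python package)

def extract_with_tika(path: str) -> str:
    """Born-digital extraction via Apache Tika."""
    return (parser.from_file(path).get("content") or "").strip()

def ocr_with_mistral(path: str) -> str:
    """Placeholder for a vision-OCR call (e.g. Mistral OCR); plug in your own client here."""
    raise NotImplementedError("wire this to your OCR backend")

def extract_text(path: str) -> str:
    """Hybrid routing sketch: try fast text extraction first, fall back to OCR for scans."""
    text = extract_with_tika(path)
    return text if text else ocr_with_mistral(path)
```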
Embedding Configuration
Control how OpenWebUI converts text chunks into vector embeddings. By default, SentenceTransformers is used.
Embedding Engines
- SentenceTransformers (default)
  - Runs locally using HuggingFace-compatible models via the Sentence-Transformers library.
  - Pros: No per-request costs, full control over model choice (e.g. `all-MiniLM-L6-v2`, `multi-qa-MiniLM-L6-cos-v1`).
  - Cons: Requires CPU/GPU resources; you manage model downloads and updates.
- OpenAI
  - Uses OpenAI's Embeddings API (`text-embedding-ada-002`, `text-embedding-3-small`, etc.).
  - Pros: Fully managed, high-quality service with multi-lingual support.
  - Cons: Usage-based billing; network latency; requires an API key.
- Ollama
  - Leverages locally hosted LLMs via the Ollama daemon (e.g. `lmsys/vicuna-7b`).
  - Pros: Runs on your infrastructure, no per-token charges, low latency for on-prem deployments.
  - Cons: Dependent on available Ollama models; requires local compute resources.
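For orientation, generating vectors with the default engine looks roughly like this outside OpenWebUI; a minimal sketch assuming the `sentence-transformers` package and the `all-MiniLM-L6-v2` model named above:

```python
from sentence_transformers import SentenceTransformer

# Downloads the model on first use; subsequent calls run fully locally.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "OpenWebUI splits documents into chunks before embedding.",
    "Each chunk is converted into a fixed-size vector for retrieval.",
]
vectors = model.encode(chunks, normalize_embeddings=True)
print(vectors.shape)   # (2, 384) for this model
```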
Embedding Model Selection
Once you choose an engine, pick a specific model based on your quality vs. throughput vs. cost needs:
Engine | Model Example | Dims | Max Tokens | Notes
---|---|---|---|---
SentenceTransformers | all-MiniLM-L6-v2 | 384 | 512 | Very fast, compact vectors.
SentenceTransformers | paraphrase-mpnet-base-v2 | 768 | 512 | Higher quality for semantic tasks.
OpenAI | text-embedding-ada-002 | 1 536 | 8 191 | Inexpensive, good general-purpose.
OpenAI | text-embedding-3-large | 3 072 | 8 191 | Best quality & multi-lingual support.
Ollama | Varies by model (e.g. vicuna-7b) | Varies | Varies | Depends on the Ollama model you install.
Warning: Changing your embedding engine or model requires re-importing all documents to regenerate vectors.
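One practical reason for this: different models produce vectors of different dimensionality, so a model switch invalidates the existing index. A quick sanity check, assuming `sentence-transformers` and an existing index built with 384-dimensional vectors:

```python
from sentence_transformers import SentenceTransformer

index_dims = 384   # dimensionality your vector store was built with (assumed)
new_model = SentenceTransformer("paraphrase-mpnet-base-v2")

if new_model.get_sentence_embedding_dimension() != index_dims:
    print("Dimension mismatch: re-import documents to rebuild the vector index.")
```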
Embedding Batch Size
- Default: 100
- Range: 50–500
Larger batches improve throughput but increase memory usage; tune to fit your environment.
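Batch size simply controls how many chunks are embedded per forward pass. A sketch of the trade-off, using the `batch_size` argument of `sentence-transformers` as a stand-in for the setting described above:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = [f"chunk {i}" for i in range(1000)]   # stand-in for real document chunks

# Larger batch_size -> fewer forward passes and higher throughput, but more memory per pass.
vectors = model.encode(chunks, batch_size=100, show_progress_bar=False)
print(vectors.shape)
```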
With these guidelines on chunking, model limits, extraction engines, and embeddings, you'll have everything you need to configure OpenWebUI for robust, high-quality document ingestion and retrieval.