Content Retrieval Management
When splitting extracted text into chunks for embedding and retrieval, you can work in characters or tokens. Below are guidelines for both approaches, along with a brief overview of how they differ.
Configuring Chunk Size & Overlap
Tokens vs. Characters
- Characters
  - Literally each letter, number, punctuation mark, or space in your text.
  - Easy to reason about in terms of text length (e.g. a 1 000-character chunk is roughly a paragraph).
  - Doesn’t depend on any language model’s encoding scheme.
- Tokens
  - The units used internally by language models (words or sub-words).
  - One token averages about 4 characters in English, but varies by language and punctuation.
  - Ideal when you know your model’s token limit (e.g. 4 096, 16 384, 32 768, 200 000 tokens) and want to fit precisely within it.
Why choose one over the other?
- Characters: simpler, universal, and stable across models.
- Tokens: more efficient if you need to tightly pack a known token window, or when you mix languages with different tokenization behavior.
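To see the difference concretely, the short sketch below counts the same string in characters and in tokens. It assumes the tiktoken package is installed; OpenWebUI itself does not require it, and the snippet is purely illustrative.

```python
# Illustrative only: compare character count vs. token count for one string.
import tiktoken

text = "Chunking strategy depends on how your model tokenizes text."

# cl100k_base is the encoding used by gpt-3.5-turbo / gpt-4-era models.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode(text)

print(len(text))    # character count
print(len(tokens))  # token count -- usually close to len(text) / 4 for English
```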
Character-Based Splitting
- Chunk size: 2 000–4 000 characters
- Overlap: 200–800 characters (≈ 10 %–20 %)
Why these ranges?
- Semantic coherence: ~2 000–4 000 chars covers one to two paragraphs.
- Model efficiency: At ~4 chars/token, that’s 500–1 000 tokens—small enough for fast processing, large enough to reduce total chunks.
- Context preservation: 200–800-char overlap ensures fragments at edges appear in both chunks.
- Resource balance: Keeps memory/compute in check while avoiding thousands of tiny chunks.
- Tunable: Shrink toward 2 000 chars for latency-sensitive use cases; expand if you have large-context models.
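To make the mechanics concrete, here is a minimal character-based splitter whose defaults are drawn from the ranges above. OpenWebUI performs this step internally; treat the function as a sketch of the logic, not its actual implementation.

```python
# Sketch of character-based chunking with overlap (not OpenWebUI's internal code).
def split_by_characters(text: str, chunk_size: int = 3000, overlap: int = 400) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # advance less than chunk_size so edges land in two chunks
    return chunks
```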
Recommended Character Ranges by Document Type
Document Type | Chunk Size (chars) | Overlap (chars) |
---|---|---|
News Articles | 1 500–2 500 | 150–250 |
Blog Posts | 2 000–3 000 | 200–300 |
Academic Papers | 3 000–4 500 | 300–450 |
Legal Contracts | 2 000–3 000 | 400–600 |
Technical Docs / Specs | 2 500–4 000 | 300–500 |
Books / Long-Form | 4 000–6 000 | 400–800 |
Transcripts / Dialogue | 1 000–2 000 | 100–200 |
Token-Based Splitting
- Chunk size: 500–1 000 tokens
- Overlap: 50–200 tokens (≈ 10 %–20 %)
Why these ranges?
- Semantic coherence: 500–1 000 tokens (~1–2 paragraphs).
- Model efficiency: Fits easily within most context windows (4 096–200 000 tokens).
- Context preservation: 50–200-token overlap covers sentence tails and maintains continuity.
- Resource balance: Balances compute/latency with number of chunks.
- Tunable: Shrink for small-context models; expand for very large-context ones.
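The token-based equivalent slices tokens instead of characters. The sketch below again assumes tiktoken for tokenization and is not OpenWebUI's internal splitter.

```python
# Sketch of token-based chunking with overlap, using tiktoken for tokenization.
import tiktoken

def split_by_tokens(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))  # decode the token window back to text
        start += chunk_size - overlap
    return chunks
```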
Recommended Token Ranges by Document Type
Document Type | Chunk Size (tokens) | Overlap (tokens) |
---|---|---|
News Articles | 300–600 | 30–60 |
Blog Posts | 500–800 | 50–100 |
Academic Papers | 700–1 000 | 70–150 |
Legal Contracts | 500–800 | 100–160 |
Technical Docs / Specs | 600–1 000 | 60–120 |
Books / Long-Form | 1 000–1 500 | 100–300 |
Transcripts / Dialogue | 200–400 | 20–80 |
Tuning Tips
- Start low if you’re latency-sensitive or using smaller-context models.
- Scale up for large-context models (≥ 32 k tokens) or when preserving document coherence is critical.
- Monitor retrieval hits—if you see misses at chunk edges, bump overlap by 10 %–20 % of your chunk size.
Typical OpenAI Model Context Limits
Model | Context Window (tokens) |
---|---|
gpt-3.5-turbo | 4 096 |
gpt-3.5-turbo-16k | 16 384 |
gpt-4 | 8 192 |
gpt-4-32k | 32 768 |
o4-mini (reasoning model) | 200 000
Use token-based settings for precise control within these limits; otherwise, character-based gives a simpler, model-agnostic approach.
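As a rough sanity check, the helper below estimates how many retrieved chunks of a given size fit into a context window alongside the prompt and the model's answer. The 500-token prompt reserve and 1 000-token answer reserve are illustrative assumptions, not OpenWebUI settings.

```python
# Back-of-the-envelope budgeting: how many chunks fit in a given context window?
def max_chunks(context_window: int, chunk_tokens: int,
               prompt_reserve: int = 500, answer_reserve: int = 1000) -> int:
    usable = context_window - prompt_reserve - answer_reserve
    return max(usable // chunk_tokens, 0)

print(max_chunks(4_096, 800))    # gpt-3.5-turbo: room for ~3 chunks of 800 tokens
print(max_chunks(32_768, 800))   # gpt-4-32k: room for ~39 chunks
```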
Content Extraction Engines
OpenWebUI supports multiple backends for extracting text (and structure) from documents. Choose the one that best matches your input type:
- Default
  A hybrid: uses Apache Tika on born-digital files (PDF, DOCX, HTML) and automatically falls back to Mistral OCR for scanned or image-only pages.
- Tika
  The pure Apache Tika extractor.
  - Pros: Lightning-fast on PDFs/Office files with embedded text.
  - Cons: No OCR, so it can’t read images or scans.
- Mistral OCR
  A vision-based OCR engine powered by Mistral’s open vision models.
  - Pros: Excellent on scanned documents or photos.
  - Cons: Slower, and less accurate on very low-quality images.
- Document Intelligence
  Integrates a deep-learning document parser (e.g. Azure Form Recognizer).
  - Pros: OCR plus semantic structuring (tables, key/value pairs, forms).
  - Cons: Higher latency and cost; best reserved for documents that need structured output, such as invoices, receipts, and contracts.
- Docling
  A commercial add-on layering custom ML enhancements on top of Tika.
  - Pros: Improved layout detection and text cleanup.
  - Cons: Paid service; additional configuration needed.
- External
  Sends your file to any third-party or custom endpoint.
  - Pros: Total flexibility: you bring your own extractor.
  - Cons: You manage uptime and compatibility.
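To make the Default engine's hybrid behaviour concrete, here is a rough sketch: try Tika first, then fall back to OCR when no embedded text comes back. The tika Python package is a real client for a running Tika server; run_mistral_ocr is a hypothetical placeholder for whichever OCR client you wire in.

```python
# Sketch of the hybrid "Default" flow described above (not OpenWebUI's actual code).
from tika import parser

def run_mistral_ocr(path: str) -> str:
    # Hypothetical placeholder: call your OCR backend (e.g. Mistral OCR) here.
    raise NotImplementedError

def extract_text(path: str) -> str:
    parsed = parser.from_file(path)              # Apache Tika extraction
    text = (parsed.get("content") or "").strip()
    if text:
        return text                              # born-digital file with embedded text
    return run_mistral_ocr(path)                 # scanned/image-only page: fall back to OCR
```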
Embedding Configuration
Control how OpenWebUI converts text chunks into vector embeddings. By default, SentenceTransformers is used.
Embedding Engines
- SentenceTransformers (default)
  - Runs locally using HuggingFace-compatible models via the Sentence-Transformers library.
  - Pros: No per-request costs; full control over model choice (e.g. all-MiniLM-L6-v2, multi-qa-MiniLM-L6-cos-v1).
  - Cons: Requires CPU/GPU resources; you manage model downloads and updates.
- OpenAI
  - Uses OpenAI’s Embeddings API (text-embedding-ada-002, text-embedding-3-small, etc.).
  - Pros: Fully managed, high-quality service with multi-lingual support.
  - Cons: Usage-based billing; network latency; requires an API key.
- Ollama
  - Leverages locally hosted LLMs via the Ollama daemon (e.g. lmsys/vicuna-7b).
  - Pros: Runs on your infrastructure, no per-token charges, low latency for on-prem deployments.
  - Cons: Dependent on available Ollama models; requires local compute resources.
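As a quick illustration of the default engine, the snippet below embeds two chunks locally with the sentence-transformers package and the all-MiniLM-L6-v2 model mentioned above.

```python
# Local embedding with Sentence-Transformers (the default engine's approach).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = ["First chunk of extracted text.", "Second chunk of extracted text."]
embeddings = model.encode(chunks)  # one 384-dimensional vector per chunk

print(embeddings.shape)  # (2, 384)
```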
Embedding Model Selection
Once you choose an engine, pick a specific model based on your quality vs. throughput vs. cost needs:
Engine | Model Example | Dims | Max Tokens | Notes |
---|---|---|---|---|
SentenceTransformers | all-MiniLM-L6-v2 | 384 | 512 | Very fast, compact vectors. |
SentenceTransformers | paraphrase-mpnet-base-v2 | 768 | 512 | Higher quality for semantic tasks. |
OpenAI | text-embedding-ada-002 | 1 536 | 8 191 | Inexpensive, good general-purpose. |
OpenAI | text-embedding-3-large | 3 072 | 8 191 | Best quality & multi-lingual support. |
Ollama | Varies by model (e.g. vicuna-7b) | Varies | Varies | Depends on the Ollama model you install. |
Warning: Changing your embedding engine or model requires re-importing all documents to regenerate vectors.
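For comparison, the same two chunks embedded through the OpenAI Embeddings API with the official Python client would look roughly like this (requires an API key; text-embedding-3-small is just one of the models listed above):

```python
# Hosted embedding via the OpenAI Embeddings API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["First chunk of extracted text.", "Second chunk of extracted text."],
)
vectors = [item.embedding for item in response.data]
print(len(vectors[0]))  # vector dimensionality depends on the chosen model
```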
Embedding Batch Size
- Default: 100
- Range: 50–500
Larger batches improve throughput but increase memory usage; tune to fit your environment.
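In practice, batching just means slicing the chunk list before encoding, as in this sketch (model can be any encoder with an encode method, such as the SentenceTransformer instance shown earlier):

```python
# Sketch of batched embedding so memory stays bounded; batch_size mirrors the default of 100.
def embed_in_batches(chunks: list[str], model, batch_size: int = 100) -> list:
    vectors = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        vectors.extend(model.encode(batch))  # larger batches raise throughput and memory use
    return vectors
```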
With these guidelines on chunking, model limits, extraction engines, and embeddings, you’ll have everything you need to configure OpenWebUI for robust, high-quality document ingestion and retrieval.