Content Retrieval Management

When splitting extracted text into chunks for embedding and retrieval, you can work in characters or tokens. Below are guidelines for both approaches, along with a brief comparison of how they differ.

Configuring Chunk Size & Overlap

Tokens vs. Characters

  • Characters

    • Literally each letter, number, punctuation mark, or space in your text.
    • Easy to reason about in terms of text length (e.g. a 1 000-character chunk is roughly a paragraph).
    • Doesn’t depend on any language model’s encoding scheme.
  • Tokens

    • The units used internally by language models (words or sub-words).
    • One token averages about 4 characters in English, though the ratio varies by language and punctuation.
    • Ideal when you know your model’s token limit (e.g. 4 096, 16 384, 32 768, or 200 000 tokens) and want to fit precisely within it.

Why choose one over the other?

  • Characters: simpler, universal, and stable across models.
  • Tokens: more efficient if you need to tightly pack a known token window, or when you mix languages with different tokenization behavior. (A quick comparison sketch follows below.)
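
To see the difference concretely, the short sketch below counts both units for the same string. It assumes the tiktoken library and its cl100k_base encoding as a stand-in tokenizer; your embedding model’s tokenizer may count slightly differently.

```python
# pip install tiktoken  (assumed tokenizer; used here only for illustration)
import tiktoken

text = (
    "Retrieval-augmented generation splits extracted text into chunks, "
    "embeds each chunk, and retrieves the closest matches at query time."
)

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by recent OpenAI models
tokens = enc.encode(text)

print(f"{len(text)} characters, {len(tokens)} tokens")
print(f"~{len(text) / len(tokens):.1f} characters per token")
```

For typical English prose this prints a ratio close to 4, which is where the “~4 chars/token” rule of thumb used below comes from.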

Character-Based Splitting

  • Chunk size: 2 000–4 000 characters
  • Overlap: 200–800 characters (≈ 10 %–20 %)

Why these ranges?

  • Semantic coherence: ~2 000–4 000 chars covers one to two paragraphs.
  • Model efficiency: At ~4 chars/token, that’s 500–1 000 tokens—small enough for fast processing, large enough to reduce total chunks.
  • Context preservation: 200–800-char overlap ensures fragments at edges appear in both chunks.
  • Resource balance: Keeps memory/compute in check while avoiding thousands of tiny chunks.
  • Tunable: Shrink toward 2 000 chars for latency-sensitive use cases; expand if you have large-context models. (A minimal splitter sketch follows this list.)
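
As promised above, here is a minimal character-based splitter, a sketch in plain Python with no framework assumed. Production splitters usually also snap chunk boundaries to sentence or paragraph breaks rather than cutting mid-word:

```python
def split_by_chars(text: str, chunk_size: int = 3000, overlap: int = 300) -> list[str]:
    """Split text into fixed-size character chunks with the given overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):       # final chunk reached the end of the text
            break
        start = end - overlap      # step back by the overlap before continuing
    return chunks
```

With chunk_size=3000 and overlap=300, every chunk repeats the last 300 characters of its predecessor, so a sentence cut at a boundary still appears whole in one of the two chunks.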

Recommended Character Ranges by Document Type

Document Type          | Chunk Size (chars) | Overlap (chars)
News Articles          | 1 500–2 500        | 150–250
Blog Posts             | 2 000–3 000        | 200–300
Academic Papers        | 3 000–4 500        | 300–450
Legal Contracts        | 2 000–3 000        | 400–600
Technical Docs / Specs | 2 500–4 000        | 300–500
Books / Long-Form      | 4 000–6 000        | 400–800
Transcripts / Dialogue | 1 000–2 000        | 100–200

Token-Based Splitting

  • Chunk size: 500–1 000 tokens
  • Overlap: 50–200 tokens (≈ 10 %–20 %)

Why these ranges?

  • Semantic coherence: 500–1 000 tokens (~1–2 paragraphs).
  • Model efficiency: Fits easily within most context windows (4 096–200 000 tokens).
  • Context preservation: 50–200-token overlap covers sentence tails and maintains continuity.
  • Resource balance: Balances compute/latency with number of chunks.
  • Tunable: Shrink for small-context models; expand for very large-context ones. (A token-based version of the splitter follows this list.)
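
The token-based version is the same loop run over token IDs instead of characters; again, tiktoken’s cl100k_base encoding is an assumption standing in for your model’s actual tokenizer:

```python
import tiktoken

def split_by_tokens(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into token-count chunks with overlap, decoded back to text."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(enc.decode(tokens[start:end]))  # back to plain text
        if end >= len(tokens):
            break
        start = end - overlap
    return chunks
```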

Recommended Token Ranges by Document Type

Document Type          | Chunk Size (tokens) | Overlap (tokens)
News Articles          | 300–600             | 30–60
Blog Posts             | 500–800             | 50–100
Academic Papers        | 700–1 000           | 70–150
Legal Contracts        | 500–800             | 100–160
Technical Docs / Specs | 600–1 000           | 60–120
Books / Long-Form      | 1 000–1 500         | 100–300
Transcripts / Dialogue | 200–400             | 20–80

Tuning Tips

  1. Start low if you’re latency-sensitive or using smaller-context models.
  2. Scale up for large-context models (≥ 32 k tokens) or when preserving document coherence is critical.
  3. Monitor retrieval hits—if you see misses at chunk edges, bump overlap by 10 %–20 % of your chunk size (e.g. on 3 000-character chunks, raise overlap from 300 to 600 characters).

Typical OpenAI Model Context Limits

Model               | Context Window (tokens)
gpt-3.5-turbo       | 4 096
gpt-3.5-turbo-16k   | 16 384
gpt-4               | 8 192
gpt-4-32k           | 32 768
o4-mini (reasoning) | 200 000

Use token-based settings for precise control within these limits; otherwise, character-based splitting is a simpler, model-agnostic approach.
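
Keep in mind that retrieved chunks share the window with the system prompt, the user’s question, and the model’s answer. Below is a back-of-the-envelope budget; the overhead figures are illustrative assumptions, not OpenWebUI defaults:

```python
def max_chunk_tokens(context_window: int, retrieved_chunks: int,
                     prompt_overhead: int = 500, answer_reserve: int = 1000) -> int:
    """Largest chunk size (in tokens) that still fits `retrieved_chunks` chunks."""
    available = context_window - prompt_overhead - answer_reserve
    return available // retrieved_chunks

# e.g. gpt-3.5-turbo (4 096-token window), retrieving 4 chunks per query:
print(max_chunk_tokens(4096, 4))  # -> 649, so ~600-token chunks are a safe fit
```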

Content Extraction Engines

OpenWebUI supports multiple backends for extracting text (and structure) from documents. Choose the one that best matches your input type:

  • Default
    A hybrid pipeline: Apache Tika for born-digital files (PDF, DOCX, HTML), with automatic fallback to Mistral OCR for scanned or image-only pages.

  • Tika
    The pure Apache Tika extractor.

    • Pros: Lightning-fast on PDFs/Office files with embedded text.
    • Cons: No OCR—can’t read images or scans.
  • Mistral OCR
    A vision-based OCR engine powered by Mistral AI’s vision models.

    • Pros: Excellent on scanned documents or photos.
    • Cons: Slower than Tika; accuracy degrades on very low-quality images.
  • Document Intelligence
    Integrates a deep-learning document parser (e.g. Azure AI Document Intelligence, formerly Form Recognizer).

    • Pros: OCR + semantic structuring (tables, key/value pairs, forms).
    • Cons: Higher latency and cost; ideal for invoices, receipts, contracts.
  • Docling
    An open-source document-parsing toolkit (from IBM Research) with ML-based layout analysis.

    • Pros: Strong layout detection, table-structure recovery, and text cleanup.
    • Cons: Extra setup and compute; heavier than plain Tika.
  • External
    Sends your file to any third-party or custom extraction endpoint (see the sketch after this list).

    • Pros: Total flexibility—you bring your own extractor.
    • Cons: You manage uptime and compatibility.
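
To make the External option concrete, a bring-your-own extractor is usually a small HTTP service that accepts an uploaded file and returns its text. The sketch below is purely illustrative: the /extract route, the response fields, and the trivial UTF-8 “extraction” are all hypothetical placeholders, not OpenWebUI’s actual endpoint contract, which is defined by your configuration.

```python
# pip install fastapi uvicorn python-multipart
# Hypothetical extractor stub; route and response shape are placeholders.
from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/extract")
async def extract(file: UploadFile):
    raw = await file.read()
    # Stand-in for a real extractor (Tika, OCR, etc.): just decode bytes as text.
    text = raw.decode("utf-8", errors="replace")
    return {"filename": file.filename, "text": text}

# Run with: uvicorn extractor:app --port 8000
```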

Embedding Configuration

Control how OpenWebUI converts text chunks into vector embeddings. By default, SentenceTransformers is used.

Embedding Engines

  • SentenceTransformers (default)

    • Runs locally using Hugging Face-compatible models via the Sentence-Transformers library (see the sketch after this list).
    • Pros: No per-request costs, full control over model choice (e.g. all-MiniLM-L6-v2, multi-qa-MiniLM-L6-cos-v1).
    • Cons: Requires CPU/GPU resources; you manage model downloads and updates.
  • OpenAI

    • Uses OpenAI’s Embeddings API (text-embedding-ada-002, text-embedding-3-small, etc.).
    • Pros: Fully managed, high-quality service with multi-lingual support.
    • Cons: Usage-based billing; network latency; requires API key.
  • Ollama

    • Leverages embedding models served by a locally hosted Ollama daemon (e.g. nomic-embed-text).
    • Pros: Runs on your infrastructure, no per-token charges, low latency for on-prem deployments.
    • Cons: Dependent on available Ollama models; requires local compute resources.
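
With the default SentenceTransformers engine, the underlying call is roughly equivalent to the sketch below (using all-MiniLM-L6-v2, one of the example models from the table that follows):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # downloads the model on first use
chunks = [
    "OpenWebUI splits documents into chunks before embedding.",
    "Each chunk becomes one vector in the index.",
]
embeddings = model.encode(chunks)  # NumPy array of shape (2, 384)
print(embeddings.shape)
```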

Embedding Model Selection

Once you choose an engine, pick a specific model based on your quality vs. throughput vs. cost needs:

Engine               | Model Example                            | Dims   | Max Tokens | Notes
SentenceTransformers | all-MiniLM-L6-v2                         | 384    | 512        | Very fast, compact vectors.
SentenceTransformers | paraphrase-mpnet-base-v2                 | 768    | 512        | Higher quality for semantic tasks.
OpenAI               | text-embedding-ada-002                   | 1 536  | 8 191      | Inexpensive, good general-purpose.
OpenAI               | text-embedding-3-large                   | 3 072  | 8 191      | Best quality & multilingual support.
Ollama               | Varies by model (e.g. nomic-embed-text)  | Varies | Varies     | Depends on the Ollama model you install.

Warning: Changing your embedding engine or model requires re-importing all documents to regenerate vectors.


Embedding Batch Size

  • Default: 100
  • Range: 50–500

Larger batches improve throughput but increase memory usage; tune to fit your environment.
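
As a sketch of what the batch size controls, here is the embedding step reduced to its essentials with the SentenceTransformers engine; encode() processes the list internally in slices of batch_size, so larger batches mean fewer forward passes but higher peak memory:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = [f"chunk {i}" for i in range(1_000)]  # stand-in for real document chunks

# batch_size=100 mirrors the default above; raise it toward 500 if memory allows.
embeddings = model.encode(chunks, batch_size=100, show_progress_bar=True)
```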


With these guidelines on chunking, model limits, extraction engines, and embeddings, you’ll have everything you need to configure OpenWebUI for robust, high-quality document ingestion and retrieval.