Document Creation Best Practices
This guide provides recommendations for authoring documents optimized for ingestion and retrieval by OpenWebUI, ensuring that only the most relevant content is processed and minimizing retrieval inaccuracies.
1. Focus on Essential Content
1.1 Define Clear Objectives
- List key questions or points the document should address.
- Prune unrelated material to avoid retrieval noise.
1.2 Structured Authoring
- Use Concise Sections: Label sections (e.g. `## Overview`, `## Key Findings`) and keep them to 300–800 words each.
- Semantic Headings: Descriptive headings (e.g. `## Q1 Sales Performance`) help retrieval engines surface the right chunks.
- Summaries & Highlights: Start sections with a one-paragraph summary and bullet points of critical insights.
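The section-length guideline can be checked mechanically before upload. The following is a minimal sketch, assuming the draft is Markdown and that sections are delimited by level-2 (`## `) headings; the function names and the default bounds are illustrative and not part of OpenWebUI.

```python
import re

# Bounds taken from the 300-800 word guideline; adjust to your own policy.
MIN_WORDS, MAX_WORDS = 300, 800

def section_word_counts(markdown_text):
    """Map each level-2 ('## ') heading to the word count of the text below it."""
    counts = {}
    current = None
    for line in markdown_text.splitlines():
        # Matches '## Title' only; '### ...' lines fail because '\s+' must follow '##'.
        match = re.match(r"^##\s+(.*\S)\s*$", line)
        if match:
            current = match.group(1)
            counts[current] = 0
        elif current is not None:
            counts[current] += len(line.split())
    return counts

def out_of_range_sections(markdown_text, low=MIN_WORDS, high=MAX_WORDS):
    """Return headings whose sections fall outside the [low, high] word range."""
    counts = section_word_counts(markdown_text)
    return [heading for heading, n in counts.items() if not low <= n <= high]
```

Text before the first `## ` heading is ignored, and deeper headings are counted as part of their parent section.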
1.3 Curated Inserts
- Tables & Charts: Embed only summarized tables (≤10 rows) with captions.
- Key-Value Pairs: Use `key: value` for vital parameters (e.g. `QPS_limit: 1000`).
- Callouts: Use blockquotes or markdown alerts (`> **Warning:** ...`) to highlight priority items.
1.4 Avoid Retrieval Traps
- Don't rely on hidden text (comments, collapsibles) that extractors may skip.
- Limit code snippets to relevant sections; link to full repos if needed.
- Clean formatting: remove excessive whitespace, unsupported fonts, or embedded objects.
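Hidden text can be flagged with a quick scan before upload. This sketch assumes the draft is Markdown or HTML and treats only two constructs as hidden: HTML comments and `<details>` collapsibles. The pattern table and function name are illustrative, and other formats (DOCX comments, PDF annotations) would need their own checks.

```python
import re

# Constructs that some extractors skip; extend this table for your own toolchain.
HIDDEN_PATTERNS = {
    "html_comment": re.compile(r"<!--.*?-->", re.DOTALL),
    "collapsible": re.compile(r"<details\b.*?</details>", re.DOTALL | re.IGNORECASE),
}

def find_hidden_spans(text):
    """Return (kind, snippet) pairs for content an extractor might skip."""
    found = []
    for kind, pattern in HIDDEN_PATTERNS.items():
        for match in pattern.finditer(text):
            # Keep only the first 60 characters so reports stay readable.
            found.append((kind, match.group(0)[:60]))
    return found
```

An empty result means none of the listed constructs were found; it does not guarantee that every extractor sees all remaining text.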
2. Minimize Historical Noise
- Prune Obsolete Background: Summarize past events in 1–2 sentences with links to archives.
- Version Indicators: Mark versioned sections (e.g. `### v2.0 Change Log`) to target the latest content.
- Avoid Redundancy: Extract only actionable decisions from meeting minutes or log dumps.
3. Pre-Upload & Batch Preparation
3.1 Supported Formats
- Born-Digital (preferred): PDF (text-embedded), DOCX, PPTX, HTML, TXT
- Scanned: Image-based PDFs, JPEG, PNG, TIFF (when necessary)
Tip: Generate PDFs at ≥300 DPI; avoid pure scans when possible.
3.2 File Naming & Manifests
- Consistent Filenames: `project_section_v{version}_{YYYYMMDD}.ext` (e.g. `Invoice_Payment_v2_20250521.pdf`).
- Batch Manifest: Maintain a CSV/JSON manifest with `filename`, `checksum` (SHA256), and metadata.
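A manifest of this kind can be generated with the Python standard library. The sketch below assumes all files of a batch sit in one folder. The `size_bytes` and `name_ok` fields and the filename regex are illustrative choices for the "metadata" part, not a format required by OpenWebUI.

```python
import hashlib
import json
import re
from pathlib import Path

# Hypothetical regex for the project_section_v{version}_{YYYYMMDD}.ext convention.
FILENAME_RE = re.compile(
    r"^[A-Za-z0-9]+_[A-Za-z0-9]+_v\d+(\.\d+)*_\d{8}\.[A-Za-z0-9]+$"
)

def sha256_of(path):
    """Hash a file in 64 KiB blocks so large documents are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()

def build_manifest(folder):
    """Return one manifest entry per regular file in `folder`, sorted by path."""
    entries = []
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        entries.append({
            "filename": path.name,
            "checksum": sha256_of(path),
            "size_bytes": path.stat().st_size,
            "name_ok": bool(FILENAME_RE.match(path.name)),
        })
    return entries

def write_manifest(folder, out_path):
    """Write the manifest as JSON; keep out_path outside `folder` to avoid self-listing."""
    out_path.write_text(json.dumps(build_manifest(folder), indent=2), encoding="utf-8")
```

Calling `write_manifest(Path("batch"), Path("manifest.json"))` would produce a JSON list with one object per file. Recomputing the checksums after upload and comparing them against the manifest is one way to detect missing or corrupted documents.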
3.3 File Size & Splitting
- Target Size: 1–10 MB per document; split files larger than 50 MB into logical parts (`Chapter1.pdf`, etc.).
- Chunk & Size Optimization: Ensure each part aligns with your chunking strategy (≈2,000–4,000 chars or 500–1,000 tokens).
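To see how a draft would fall into chunks of roughly this size, a simple paragraph packer is often enough. This is a naive sketch that assumes paragraphs are separated by blank lines and measures size in characters; it is not the splitter OpenWebUI itself uses, and `chunk_paragraphs` is an illustrative name.

```python
def chunk_paragraphs(text, max_chars=4000):
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars."""
    chunks = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single paragraph longer than max_chars is kept whole, not cut mid-sentence.
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Chunks that come out much shorter than about 2,000 characters, or single paragraphs that exceed the limit, are usually a sign that the section boundaries in the draft need adjusting.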
4. Document Creation Quality Checks
- Chunk Simulation: Pre-chunk draft and review that each chunk is self-contained and relevant.
- Search Testing: Run sample queries to verify expected sections surface correctly.
- Peer Review: Have colleagues test retrieval accuracy and flag noise.
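Search testing can be rehearsed offline before anything is uploaded. The sketch below is a stand-in that ranks chunks by plain keyword overlap; real retrieval in OpenWebUI typically relies on embeddings, so treat a pass here as a sanity check only. All function names and the expectation format are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank_chunks(query, chunks):
    """Return indices of chunks sharing at least one term with the query, best first."""
    query_terms = tokenize(query)
    scored = [(len(query_terms & tokenize(chunk)), i) for i, chunk in enumerate(chunks)]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in ordered if score > 0]

def check_expectations(chunks, expectations, top_k=1):
    """`expectations` maps a query to a phrase expected in a top_k chunk; returns failing queries."""
    failures = []
    for query, phrase in expectations.items():
        top = rank_chunks(query, chunks)[:top_k]
        if not any(phrase.lower() in chunks[i].lower() for i in top):
            failures.append(query)
    return failures
```

A query that lands in the failure list suggests that the expected section lacks the vocabulary users would search with, which is often fixed by a more descriptive heading or summary.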
5. Maintenance & Cleanup
- Archive Originals: Store raw documents and manifests in cold storage (e.g. AWS Glacier).
- Version Retention: Keep the latest N versions; archive or delete older versions.
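A retention plan can be derived from filenames alone when they follow the naming convention from section 3.2. The sketch below only decides which files to keep and which to archive; it moves and deletes nothing. The regex and `plan_retention` are illustrative, and files that do not match the convention are always kept so that nothing is archived by accident.

```python
import re
from collections import defaultdict

# Assumes names of the form project_section_v{version}_{YYYYMMDD}.ext
NAME_RE = re.compile(
    r"^(?P<stem>.+)_v(?P<version>\d+(?:\.\d+)*)_(?P<date>\d{8})\.(?P<ext>\w+)$"
)

def plan_retention(filenames, keep=3):
    """Return (kept, archived) lists, keeping the newest `keep` versions per document."""
    groups = defaultdict(list)
    kept = []
    for name in filenames:
        match = NAME_RE.match(name)
        if match is None:
            kept.append(name)  # unknown naming scheme: never archive automatically
            continue
        # Compare versions numerically so that v10 sorts after v2.
        version = tuple(int(part) for part in match.group("version").split("."))
        key = (match.group("stem"), match.group("ext"))
        groups[key].append((version, match.group("date"), name))
    archived = []
    for entries in groups.values():
        entries.sort(reverse=True)
        kept.extend(name for _, _, name in entries[:keep])
        archived.extend(name for _, _, name in entries[keep:])
    return sorted(kept), sorted(archived)
```

The `archived` list can then be handed to whatever cold-storage or deletion process you already use.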
By combining document-focused authoring techniques with batch-optimized workflows (manifests, sized splits, and robust validation), you'll ensure scalable ingestion, precise retrieval, and no lost documents in OpenWebUI.