Document Creation Best Practices
1. Focus on Essential Content
1.1 Define Clear Objectives
- List key questions or points the document should address.
- Prune unrelated material to avoid retrieval noise.
1.2 Structured Authoring
- Use Concise Sections: Label sections (e.g.
## Overview,## Key Findings) and keep them to 300β800 words each. - Semantic Headings: Descriptive headings (e.g.
## Q1 Sales Performance) help retrieval engines surface the right chunks. - Summaries & Highlights: Start sections with a one-paragraph summary and bullet points of critical insights.
1.3 Curated Inserts
- Tables & Charts: Embed only summarized tables (β€10 rows) with captions.
- Key-Value Pairs: Use
key: valuefor vital parameters (e.g.QPS_limit: 1000). - Callouts: Use blockquotes or markdown alerts (
> **Warning:** ...) to highlight priority items.
1.4 Avoid Retrieval Traps
- Donβt rely on hidden text (comments, collapsibles) that extractors may skip.
- Limit code snippets to relevant sections; link to full repos if needed.
- Clean formatting: remove excessive whitespace, unsupported fonts, or embedded objects.
2. Minimize Historical Noise
- Prune Obsolete Background: Summarize past events in 1β2 sentences with links to archives.
- Version Indicators: Mark versioned sections (e.g.
### v2.0 Change Log) to target the latest content. - Avoid Redundancy: Extract only actionable decisions from meeting minutes or log dumps.
3. Pre-Upload & Batch Preparation
3.1 Supported Formats
- Born-Digital (preferred): PDF (text-embedded), DOCX, PPTX, HTML, TXT
- Scanned: Image-based PDFs, JPEG, PNG, TIFF (when necessary)
Tip: Generate PDFs at β₯300 DPI; avoid pure scans when possible.
3.2 File Naming & Manifests
- Consistent Filenames:
project_section_v{version}_{YYYYMMDD}.ext(e.g.Invoice_Payment_v2_20250521.pdf). - Batch Manifest: Maintain a CSV/JSON manifest with
filename,checksum(SHA256), and metadata. Example:
3.3 File Size & Splitting
- Target Size: 1β10β―MB per document; split >50β―MB into logical parts (
Chapter1.pdf, etc.). - Chunk & Size Optimization: Ensure each part aligns with your chunking strategy (β2β―000β4β―000β―chars or 500β1β―000β―tokens).
4. Document Creation Quality Checks
- Chunk Simulation: Pre-chunk draft and review that each chunk is self-contained and relevant.
- Search Testing: Run sample queries to verify expected sections surface correctly.
- Peer Review: Have colleagues test retrieval accuracy and flag noise.
5) Maintenance & Cleanup
- Archive Originals: Store raw documents and manifests in cold storage (e.g. AWS Glacier).
- Version Retention: Keep the latest N versions; archive or delete older versions.
By combining document-focused authoring techniques with batch-optimized workflowsβmanifests, sized splits, and robust validationβyouβll ensure scalable ingestion, precise retrieval, and no lost documents in OpenWebUI.