Document Creation Best Practices

This guide provides recommendations for authoring documents optimized for ingestion and retrieval by OpenWebUI, ensuring that only the most relevant content is processed and minimizing retrieval inaccuracies.

1. Focus on Essential Content

1.1 Define Clear Objectives

  • List key questions or points the document should address.
  • Prune unrelated material to avoid retrieval noise.

1.2 Structured Authoring

  • Use Concise Sections: Label sections (e.g. ## Overview, ## Key Findings) and keep them to 300–800 words each.
  • Semantic Headings: Descriptive headings (e.g. ## Q1 Sales Performance) help retrieval engines surface the right chunks.
  • Summaries & Highlights: Start sections with a one-paragraph summary and bullet points of critical insights.

1.3 Curated Inserts

  • Tables & Charts: Embed only summarized tables (≀10 rows) with captions.
  • Key-Value Pairs: Use key: value for vital parameters (e.g. QPS_limit: 1000).
  • Callouts: Use blockquotes or markdown alerts (> **Warning:** ...) to highlight priority items.

1.4 Avoid Retrieval Traps

  • Don’t rely on hidden text (comments, collapsibles) that extractors may skip.
  • Limit code snippets to relevant sections; link to full repos if needed.
  • Clean formatting: remove excessive whitespace, unsupported fonts, or embedded objects.

2. Minimize Historical Noise

  • Prune Obsolete Background: Summarize past events in 1–2 sentences with links to archives.
  • Version Indicators: Mark versioned sections (e.g. ### v2.0 Change Log) to target the latest content.
  • Avoid Redundancy: Extract only actionable decisions from meeting minutes or log dumps.

3. Pre-Upload & Batch Preparation

3.1 Supported Formats

  • Born-Digital (preferred): PDF (text-embedded), DOCX, PPTX, HTML, TXT
  • Scanned: Image-based PDFs, JPEG, PNG, TIFF (when necessary)

Tip: Generate PDFs at β‰₯300 DPI; avoid pure scans when possible.

3.2 File Naming & Manifests

  • Consistent Filenames: project_section_v{version}_{YYYYMMDD}.ext (e.g. Invoice_Payment_v2_20250521.pdf).
  • Batch Manifest: Maintain a CSV/JSON manifest with filename, checksum (SHA256), and metadata. Example:

3.3 File Size & Splitting

  • Target Size: 1–10β€―MB per document; split >50β€―MB into logical parts (Chapter1.pdf, etc.).
  • Chunk & Size Optimization: Ensure each part aligns with your chunking strategy (β‰ˆ2β€―000–4β€―000β€―chars or 500–1β€―000β€―tokens).

4. Document Creation Quality Checks

  1. Chunk Simulation: Pre-chunk draft and review that each chunk is self-contained and relevant.
  2. Search Testing: Run sample queries to verify expected sections surface correctly.
  3. Peer Review: Have colleagues test retrieval accuracy and flag noise.

5) Maintenance & Cleanup

  • Archive Originals: Store raw documents and manifests in cold storage (e.g. AWS Glacier).
  • Version Retention: Keep the latest N versions; archive or delete older versions.

By combining document-focused authoring techniques with batch-optimized workflowsβ€”manifests, sized splits, and robust validationβ€”you’ll ensure scalable ingestion, precise retrieval, and no lost documents in OpenWebUI.