What is document chunking and why do I need it?
Document chunking splits large documents into smaller, semantically meaningful pieces for AI processing. LLMs have context window limits (typically 8K-128K tokens), so a 500-page PDF needs to be chunked before it can be searched, embedded, or analyzed. Poor chunking means lost context, and lost context means bad AI responses. Our smart chunking ensures each chunk is a complete, coherent unit of meaning.
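A rough back-of-the-envelope sketch of why this matters, assuming the common (but tokenizer-dependent) heuristic of ~4 characters per token:

```python
# Why chunk? A quick token estimate shows large documents blow past LLM
# context windows. Assumes ~4 characters per token (a common heuristic;
# actual counts depend on the tokenizer).

def estimated_tokens(text: str) -> int:
    return len(text) // 4

book = "word " * 200_000       # ~1,000,000 characters of text
print(estimated_tokens(book))  # ~250,000 tokens: over even a 128K window
```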
What makes your chunking algorithm better?
Our algorithm combines multiple techniques for superior results (a simplified code sketch follows this list):
Smart Sentence Boundaries: We detect sentence endings accurately, handling abbreviations (Dr., U.S., Inc., etc.), initials (J.K. Rowling), and decimal numbers (3.14).
Topic Shift Detection: We analyze keyword density to detect when content shifts to a new topic, creating natural break points.
Never Breaks Mid-Sentence: Unlike character-based splitters, we always end chunks at complete sentences.
Sentence Overlap: Each chunk includes the last sentence of the previous chunk for context continuity.
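Here is an illustrative sketch of the boundary and overlap logic (not our production code; topic shift detection is omitted, and the abbreviation list, regex, and target size are deliberately minimal):

```python
import re

# Illustrative sentence-boundary chunker with one-sentence overlap.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "u.s.", "inc.", "etc.", "e.g.", "i.e."}

def split_sentences(text: str) -> list[str]:
    """Split after ., !, or ? followed by whitespace, then repair false
    splits caused by known abbreviations or initials like "J.K."."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences: list[str] = []
    for part in parts:
        prev = sentences[-1].rsplit(None, 1) if sentences else []
        prev_word = prev[-1] if prev else ""
        if prev_word.lower() in ABBREVIATIONS or re.fullmatch(r"(?:[A-Z]\.)+", prev_word):
            sentences[-1] += " " + part  # false split: glue it back together
        else:
            sentences.append(part)
    return sentences

def chunk_text(text: str, target: int = 1200) -> list[str]:
    """Greedily pack whole sentences up to ~target characters; each new
    chunk begins with the previous chunk's last sentence (the overlap)."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for sentence in split_sentences(text):
        if current and size + len(sentence) > target:
            chunks.append(" ".join(current))
            current, size = [current[-1]], len(current[-1])  # 1-sentence overlap
        current.append(sentence)
        size += len(sentence) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Note that decimal numbers like 3.14 survive automatically here: the split pattern requires whitespace after the terminator, so a period inside a number is never treated as a sentence end.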
How does payment work?
We accept USDC on Solana with two payment methods:
1. Manual (Web UI): Connect your Phantom wallet on this page, select a paid tier, and approve the USDC transfer. The transaction signature is automatically sent with your file.
2. Programmatic (x402 API): For AI agents and developers: call /estimate to get pricing, execute a USDC transfer on Solana, then include the TX signature in the X-PAYMENT header when calling the chunking endpoint.
Payment is per page, based on document size (~500 characters = 1 page). The Demo tier is free (100 pages/day limit).
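A minimal sketch of the programmatic flow using Python's requests library. The base URL, the /chunk path, the price_usdc field name, and the wallet helper are placeholders (assumptions), not our published API; /estimate and the X-PAYMENT header are as described above:

```python
import requests

BASE = "https://api.example.com"  # placeholder: use the real base URL

def send_usdc_on_solana(amount_usdc: float) -> str:
    """Placeholder: execute the USDC transfer with your Solana wallet
    library and return the transaction signature string."""
    raise NotImplementedError("wire up your wallet here")

# 1. Get a price quote for the document.
with open("report.pdf", "rb") as f:
    est = requests.post(f"{BASE}/estimate", files={"file": f}).json()

# 2. Pay on Solana (the 'price_usdc' field name is assumed).
tx_sig = send_usdc_on_solana(est["price_usdc"])

# 3. Chunk, presenting the TX signature via the X-PAYMENT header.
with open("report.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE}/chunk",               # placeholder endpoint path
        files={"file": f},
        headers={"X-PAYMENT": tx_sig},
    )
print(resp.json())
```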
What's the difference between the tiers?
Demo (Free): Basic recursive character splitting with paragraph detection. Good for testing and simple documents. Quality: 6/10.
Standard ($0.001/page): Smart sentence boundary detection, topic shift analysis, abbreviation handling (Dr., U.S., Inc., etc.), and sentence-level overlap. Never breaks mid-sentence. Best value for most RAG use cases. Quality: 8.5/10.
Professional ($0.008/page): All Standard features plus document context injection, entity extraction and classification, and cross-reference detection. When a chunk mentions "He," we prepend a note identifying who "He" refers to. Best for legal, medical, and complex documents. Quality: 9.5/10.
What file formats are supported?
We support: PDF, DOCX, TXT, RTF, HTML, and Markdown.
PDF: Text-based PDFs (not scanned images). We use PyMuPDF for accurate extraction.
DOCX: Full extraction including headers, footers, text boxes, and structured content.
TXT/RTF/HTML/MD: Direct text processing with format-specific parsing.
Pages are calculated at ~500 characters per page.
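For reference, a one-liner matching that convention (rounding up and the one-page minimum are our assumptions for illustration, not a billing guarantee):

```python
import math

# Billed pages at ~500 characters of extracted text per page.
def billed_pages(text: str, chars_per_page: int = 500) -> int:
    return max(1, math.ceil(len(text) / chars_per_page))

print(billed_pages("x" * 1_250))  # -> 3
```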
Can AI agents use this API autonomously?
Yes! That's exactly what we're built for. The workflow is:
1. Call POST /estimate with the file to get page count and pricing
2. Execute a USDC transfer on Solana to our payment address
3. Call the chunking endpoint with the TX signature in the X-PAYMENT header
The flow is fully autonomous; no human intervention is needed. The Demo tier requires no payment, so agents can test without funding a wallet.
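For a free smoke test before wiring up payments, an agent can hit the Demo tier. In this sketch the base URL and /chunk path are placeholders, and we assume that omitting the X-PAYMENT header routes the request to the Demo tier:

```python
import requests

BASE = "https://api.example.com"  # placeholder base URL

# Demo-tier call: no X-PAYMENT header, subject to the 100 pages/day limit.
with open("sample.txt", "rb") as f:
    resp = requests.post(f"{BASE}/chunk", files={"file": f})  # placeholder path
print(resp.json())
```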
What are the chunk size parameters?
Our chunking is optimized for RAG (Retrieval Augmented Generation):
Target Size: ~1200 characters (ideal for embedding models)
Minimum Size: 400 characters (avoids tiny fragments)
Maximum Size: 2000 characters (prevents oversized chunks)
Overlap: 1 sentence (maintains context between chunks)
These parameters are tuned for optimal performance with OpenAI, Cohere, and other embedding APIs.
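Expressed as a config (field names here are illustrative, not our API's parameter names):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkParams:
    target_chars: int = 1200    # sweet spot for embedding models
    min_chars: int = 400        # avoid tiny fragments
    max_chars: int = 2000       # avoid oversized chunks
    overlap_sentences: int = 1  # context carried between chunks

def within_bounds(chunk: str, p: ChunkParams = ChunkParams()) -> bool:
    """Sanity check that a produced chunk respects the size bounds."""
    return p.min_chars <= len(chunk) <= p.max_chars
```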
Is my data stored or logged?
No document content is stored. Files are processed in memory and immediately discarded; we keep no copies of your files, chunks, or text. Only basic request metadata (IP, timestamp, file size) is logged for rate limiting and abuse prevention. Your documents never touch disk storage.