What is document chunking and why do I need it?
Document chunking splits large documents into smaller, semantically meaningful pieces for AI processing. LLMs have context window limits (typically 8K-128K tokens), so a 500-page PDF must be chunked before it can be searched or analyzed. Poor chunking loses context, and lost context produces bad AI responses.
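To make the idea concrete, here is a minimal sketch of the simplest possible strategy: fixed-size character splitting with a small overlap. This is an illustration only, not this service's implementation; the function name and parameters are hypothetical.

```python
# Illustrative only: the simplest chunking strategy, a fixed-size
# character splitter with overlap (hypothetical helper, not this
# service's actual algorithm).
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of roughly chunk_size characters.

    Overlapping the boundaries preserves some context that a hard
    cut would otherwise lose mid-sentence.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```

Even this naive version shows why chunking matters: each piece fits a model's context window, and the overlap keeps boundary sentences from being severed entirely.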
How does payment work?
We accept USDC on Solana with two payment methods:
1. Manual (Web UI): Connect your Phantom wallet on this page, select a paid tier, and approve the USDC transfer. The transaction signature is automatically sent with your file.
2. Programmatic (x402 API): For AI agents and developers. Call /estimate to get pricing, execute a USDC transfer on Solana, then include the TX signature in the X-PAYMENT header when calling the chunking endpoint.
Payment is per page, based on document size (~500 characters = 1 page). The Demo tier is free, with a limit of 100 pages per day.
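The per-page arithmetic can be sketched as follows. The 500-characters-per-page rule and the tier rates come from this FAQ; the function itself is a hypothetical helper (a zero-length document simply yields zero pages here).

```python
import math

# Per-page rates in USD, taken from the tier list on this page.
RATES_USD = {"demo": 0.0, "standard": 0.001, "professional": 0.008}

def estimate_cost(char_count: int, tier: str) -> tuple[int, float]:
    """Return (pages, cost_usd) for a document of char_count characters,
    at ~500 characters per page."""
    pages = math.ceil(char_count / 500)
    return pages, pages * RATES_USD[tier]
```

For example, a 1,200-character document rounds up to 3 pages, so Standard-tier processing costs about $0.003.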
What's the difference between the tiers?
Demo (Free): Basic character splitting. Good for testing.
Standard ($0.001/page): Sentence-aware chunking with semantic boundaries. Best for most use cases.
Professional ($0.008/page): Adds context injection and entity extraction. When a chunk mentions "He," we prepend the name "He" refers to.
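The Standard tier's sentence-aware behavior can be approximated like this. This is a sketch under simple assumptions (a naive punctuation-based sentence boundary), not the service's actual algorithm:

```python
import re

def sentence_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole sentences into chunks of up to max_chars characters.

    Unlike plain character splitting, a sentence is never cut in half,
    so each chunk stays semantically coherent.
    """
    # Naive boundary rule: sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

The difference from the Demo tier's character splitting is that chunk boundaries land between sentences rather than in the middle of one.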
What file formats are supported?
Currently: PDF and TXT files. PDFs must be text-based (not scanned images). We extract text and calculate pages at ~500 characters per page.
Can AI agents use this API autonomously?
Yes! That's what we're built for. Agents can call /estimate to get pricing, execute a USDC transfer on Solana, then call the chunking endpoint with the TX signature in the X-PAYMENT header. Fully autonomous, no human intervention needed.
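A minimal agent-side sketch of that flow is below. The /estimate path and the X-PAYMENT header come from this FAQ; the chunking endpoint path, the response field names, and the injected `post`/`pay_usdc` callables are assumptions, stubbed so the request assembly is visible end to end.

```python
from typing import Any, Callable

def chunk_document(
    post: Callable[..., Any],          # any requests.post-like callable (injected)
    pay_usdc: Callable[[float], str],  # agent's own Solana USDC transfer, returns TX signature
    file_bytes: bytes,
    tier: str = "standard",
) -> Any:
    """Autonomous flow: estimate -> pay -> chunk.

    Endpoint paths other than /estimate and all field names are
    hypothetical; adapt them to the service's actual API.
    """
    # 1. Ask the service what this document will cost.
    estimate = post("/estimate", files={"file": file_bytes}, data={"tier": tier}).json()
    # 2. Pay the quoted amount in USDC on Solana; keep the TX signature.
    tx_signature = pay_usdc(estimate["cost_usd"])
    # 3. Call the chunking endpoint with the signature in the X-PAYMENT header.
    response = post(
        "/chunk",
        files={"file": file_bytes},
        data={"tier": tier},
        headers={"X-PAYMENT": tx_signature},
    )
    return response.json()
```

Injecting the HTTP client and the payment routine keeps the sketch testable and lets an agent swap in its own wallet logic.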
Is my data stored or logged?
No. Documents are processed in memory and immediately discarded. We don't store your files, chunks, or content. Only basic request logs (IP, timestamp, file size) are kept for rate limiting and abuse prevention.