LLM Text Chunker
Split long text into overlapping chunks for embeddings, RAG, or long prompts. Split by characters, tokens, words, paragraphs, or recursive separators. Runs entirely in your browser.
Split options
Strategy
Recursive separators: split by paragraph, line, sentence, word, then character. RAG default.
Chunk size
Maximum size of each chunk in the selected unit.
Overlap
Each chunk shares this many units with its neighbor.
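To make the size and overlap settings concrete, here is a minimal sketch of a fixed character window. The function name and the validation rules are illustrative assumptions, not the tool's actual code.

```python
def chunk_by_characters(text: str, size: int, overlap: int) -> list[str]:
    """Slice text into windows of `size` characters that share `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap  # how far each window advances past the previous one
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_by_characters("abcdefghij", size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts `size - overlap` characters after the previous one, so neighbors repeat exactly `overlap` characters.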
Result
Chunks, Avg chars, Avg tokens, Smallest, Largest, Over budget
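The summary figures in the Result panel could be computed roughly as follows. This is a sketch under assumptions: sizes are measured in characters, tokens use the 4 chars per token heuristic, and the function and key names are hypothetical.

```python
import math


def summarize(chunks: list[str], budget: int) -> dict[str, int]:
    """Summary stats for a list of chunks; `budget` is the size limit in characters."""
    if not chunks:
        return {"chunks": 0, "avg_chars": 0, "avg_tokens": 0,
                "smallest": 0, "largest": 0, "over_budget": 0}
    sizes = [len(c) for c in chunks]
    tokens = [math.ceil(n / 4) for n in sizes]  # assumed 4 chars per token
    return {
        "chunks": len(chunks),
        "avg_chars": round(sum(sizes) / len(sizes)),
        "avg_tokens": round(sum(tokens) / len(tokens)),
        "smallest": min(sizes),
        "largest": max(sizes),
        "over_budget": sum(1 for n in sizes if n > budget),
    }


print(summarize(["abcd", "abcdefgh", "ab"], budget=5))
```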
Export
Copy or download every chunk. JSON is a single array, JSONL is one chunk per line, Plain text is a human-readable preview with chunk separators.
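The three export shapes can be sketched like this. It assumes each chunk is exported as a bare string and that the plain-text separator looks like `--- chunk N ---`; the tool's real JSON fields and separator text may differ.

```python
import json


def export_chunks(chunks: list[str], fmt: str) -> str:
    """Serialize chunks as a JSON array, JSONL, or a plain-text preview."""
    if fmt == "json":
        # One array holding every chunk.
        return json.dumps(chunks, ensure_ascii=False, indent=2)
    if fmt == "jsonl":
        # One JSON value per line; newlines inside a chunk are escaped by json.dumps.
        return "\n".join(json.dumps(c, ensure_ascii=False) for c in chunks)
    if fmt == "text":
        # Hypothetical separator format for human review.
        parts = [f"--- chunk {i} ---\n{c}" for i, c in enumerate(chunks, start=1)]
        return "\n\n".join(parts)
    raise ValueError(f"unknown format: {fmt}")


print(export_chunks(["first chunk", "second chunk"], "jsonl"))
```

JSONL suits line-by-line processing because each line parses on its own, while the JSON array has to be read whole.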
Quick tips
Embedding pipelines
Start with Recursive separators, 1000 characters (about 250 tokens at 4 chars per token), and 100 overlap. As a rule of thumb, embedding models work well on passages of 200 to 1000 tokens with 10 to 20 percent overlap.
Token estimates
The token counts here use a 4 chars per token heuristic. For an exact BPE count against a specific model, use the GPT Token Counter tool after splitting.
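The heuristic amounts to a single division. A minimal sketch, assuming the estimate is rounded up (the tool's rounding is not stated):

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text, not an exact BPE count


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


print(estimate_tokens("x" * 1000))  # 250
```

Real tokenizers can land well above or below this figure for code, non-English text, or unusual punctuation.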
Structured docs
Paragraph mode keeps headings, lists, and code blocks intact when they fit in the budget. Switch to Recursive for very long paragraphs that need a hard split.
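Paragraph packing can be sketched as a greedy loop: keep appending whole paragraphs until the next one would exceed the budget. This version assumes paragraphs are separated by blank lines, measures the budget in characters, and omits overlap; the names are hypothetical.

```python
import re


def chunk_by_paragraphs(text: str, budget: int) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most `budget` characters."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= budget or not current:
            # Either it fits, or an oversized paragraph starts a chunk on its own.
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


print(chunk_by_paragraphs("aaaa\n\nbbbb\n\ncccc", budget=10))
# ['aaaa\n\nbbbb', 'cccc']
```

A paragraph longer than the budget is kept whole here, which is the case where a hard split from another strategy is needed.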
Privacy
Splitting runs entirely in your browser. The text you paste is never uploaded or logged, so it never leaves your device.
How to use
- Paste your document into the input area, or click Load sample to try a sample RAG-style text.
- Pick a strategy. Recursive separators is the safest default for documents with paragraphs and headings; Characters suits raw text with no clear structure; Tokens suits a fixed token budget.
- Set the chunk size and overlap. A common starting point is 1000 characters with 100 overlap, or use one of the presets at the top.
- Scroll the chunk preview to read every chunk with source offsets, character and word counts, and approximate token counts. Use the per-chunk Copy button to grab one chunk.
- Pick JSON, JSONL, or Plain text for the export format, then click Copy export or Download to save the full set of chunks to your machine.
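The per-chunk details shown in the preview (source offsets, character and word counts, approximate tokens) can be derived while slicing. A sketch for the Characters strategy, with hypothetical field names and the 4 chars per token heuristic assumed:

```python
import math


def describe_chunks(text: str, size: int, overlap: int) -> list[dict[str, int]]:
    """Return offset and size metadata for fixed character windows."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    rows: list[dict[str, int]] = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        rows.append({
            "index": len(rows) + 1,
            "start": start,                 # offset of the first character
            "end": start + len(piece),      # offset just past the last character
            "chars": len(piece),
            "words": len(piece.split()),    # whitespace-delimited words
            "approx_tokens": math.ceil(len(piece) / 4),
        })
        if start + size >= len(text):
            break
    return rows


for row in describe_chunks("one two three four", size=10, overlap=2):
    print(row)
```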
About this tool
LLM Text Chunker splits a long document into smaller, overlapping pieces that fit into an embedding model, a retrieval-augmented generation (RAG) pipeline, or a constrained prompt context. Paste the text into the input area, pick a strategy, and see every chunk numbered and ready to copy.
Five strategies cover the common cases. Characters slices a fixed character window with a configurable overlap, useful when the input has no clear structure. Tokens uses the same window sized in approximate tokens (a 4 chars per token heuristic, matching the rule of thumb OpenAI publishes for English text). Words groups whitespace-delimited words into a fixed count with overlap, useful for prose. Paragraphs packs whole paragraphs into a chunk under the size budget so headings and lists stay intact. Recursive separators follows the approach of LangChain's RecursiveCharacterTextSplitter: it tries double newlines first, then single newlines, sentence-ending punctuation, semicolons, commas, single spaces, and finally a character-level fallback. This strategy is a common default in RAG pipelines.
The tool reports the number of chunks, the average chunk size in characters and approximate tokens, the smallest and largest chunk, and an Over budget count of chunks that exceeded the size limit because no smaller separator was available. Each chunk shows its source offset, character and word counts, and a one-click Copy button. The full set can be exported as JSON for direct ingestion into a vector store, JSONL for streaming embedding APIs, or plain text with chunk separators for human review.
Quick presets cover the four most common configurations: Embeddings (1000 characters, 100 overlap, recursive) for general RAG, Long context (4000 characters, 400 overlap) for long-context models, Tweet-size (280 characters), and SMS chunks (160 characters). The token counts here are estimates; pair this tool with the GPT Token Counter when you need an exact BPE count against a specific model.
Everything runs locally in your browser so the documents you split (knowledge base content, internal docs, private corpora) never leave your device or pass through a third-party API.
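The recursive strategy described above can be sketched as follows. This is a simplified illustration, not the tool's or LangChain's code: the separator strings are assumptions, overlap is omitted, and a separator that falls on a chunk boundary is dropped.

```python
# Assumed separator order, coarsest first; "" means hard character cuts.
SEPARATORS = ("\n\n", "\n", ". ", "; ", ", ", " ", "")


def recursive_split(text: str, size: int,
                    separators: tuple[str, ...] = SEPARATORS) -> list[str]:
    """Split on the coarsest separator, recursing to finer ones for oversized pieces."""
    if len(text) <= size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Character-level fallback: cut every `size` characters.
        return [text[i:i + size] for i in range(0, len(text), size)]
    if sep not in text:
        return recursive_split(text, size, finer)
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= size:
            current = piece
        else:
            # The piece alone is too large: retry it with finer separators.
            chunks.extend(recursive_split(piece, size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb ccc", size=7))
# ['aaa bbb', 'ccc']
```

Because this sketch always ends with a character-level cut, no chunk it returns exceeds `size`.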
Free to use. Works in your browser. No signup, no login.
Related tools
GPT Token Counter
Estimate GPT tokens, context window usage, and OpenAI API cost.
Word Counter
Live word, character, sentence, paragraph, and reading time stats.
Character Counter
Detailed character, letter, number, space, and line counts.
Text Cleaner
Remove duplicate lines, blank lines, extra spaces, tabs, and invisible characters.
Markdown TOC Generator
Build a GitHub-style Markdown table of contents from any document.