LLM Text Chunker

Split long text into overlapping chunks for embeddings, RAG, or long prompts. Characters, tokens, words, paragraphs, or recursive separators. Browser only.

Split options

Strategy

Recursive separators (default): splits by paragraph, line, sentence, word, then character. The usual choice for RAG.

Chunk size (characters)

Maximum size of each chunk in the selected unit.

Overlap (characters)

Each chunk shares this many units with its neighbor.
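As a rough illustration of how chunk size and overlap interact, here is a minimal sketch of a fixed character window. The function name `chunk_by_characters` is hypothetical and is not taken from the tool's source; it only assumes that each new chunk starts `size - overlap` characters after the previous one.

```python
def chunk_by_characters(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Slice text into windows of `size` characters that share `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")

    step = size - overlap  # distance between the starts of neighboring chunks
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_by_characters("abcdefghij", size=4, overlap=1):
        print(repr(chunk))
```

With a size of 4 and an overlap of 1, each chunk begins with the last character of the one before it.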

Trim whitespace at chunk edges

Removes leading and trailing whitespace from each chunk after splitting.

Keep separators with the preceding chunk

Recursive only. Mirrors LangChain's default behavior so sentence-ending punctuation stays attached to the sentence it belongs to.
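The two options above can be sketched with a regular-expression split. This is an illustration under assumptions, not the tool's actual code: the helper name `split_keep_separator` and the default separator pattern (sentence-ending punctuation followed by whitespace) are made up for the example.

```python
import re


def split_keep_separator(text: str,
                         separator_pattern: str = r"[.!?]\s+",
                         trim: bool = True) -> list[str]:
    """Split on a separator and glue each separator onto the piece before it.

    Assumes `separator_pattern` contains no capture groups of its own.
    """
    # A capturing group makes re.split return the separators as well:
    # [text, sep, text, sep, ..., text]
    parts = re.split("(" + separator_pattern + ")", text)

    pieces = []
    for i in range(0, len(parts), 2):
        separator = parts[i + 1] if i + 1 < len(parts) else ""
        pieces.append(parts[i] + separator)

    if trim:
        # Mirrors the "trim whitespace at chunk edges" option.
        pieces = [p.strip() for p in pieces]
    return [p for p in pieces if p]


if __name__ == "__main__":
    print(split_keep_separator("Hello world. How are you? Fine!"))
```

With trimming on, the punctuation stays with its sentence while the whitespace that followed it is dropped.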

Result

Summary statistics for the split: Chunks, Avg chars, Avg tokens, Smallest, Largest, and Over budget.
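The result figures can be reproduced from any list of chunks. The sketch below is one plausible way to compute them; the `ChunkStats` and `summarize` names are hypothetical, the budget is assumed to be in characters, and tokens are approximated at 4 characters each.

```python
from dataclasses import dataclass


@dataclass
class ChunkStats:
    count: int
    avg_chars: float
    avg_tokens: float
    smallest: int
    largest: int
    over_budget: int


def summarize(chunks: list[str], budget: int, chars_per_token: int = 4) -> ChunkStats:
    """Compute summary figures for a list of chunks (sizes in characters)."""
    lengths = [len(c) for c in chunks]
    if not lengths:
        return ChunkStats(0, 0.0, 0.0, 0, 0, 0)

    avg_chars = sum(lengths) / len(lengths)
    return ChunkStats(
        count=len(lengths),
        avg_chars=avg_chars,
        avg_tokens=avg_chars / chars_per_token,  # heuristic estimate, not a BPE count
        smallest=min(lengths),
        largest=max(lengths),
        over_budget=sum(1 for n in lengths if n > budget),
    )


if __name__ == "__main__":
    print(summarize(["aaaa", "bb", "cccccc"], budget=4))
```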

Export

Copy or download every chunk. JSON is a single array, JSONL is one chunk per line, Plain text is a human-readable preview with chunk separators.
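For readers who want to reproduce the three export shapes in their own pipeline, here is a minimal sketch. It assumes each chunk is exported as a bare string and that the plain-text separator looks like `--- chunk N ---`; the tool's real export schema and separator may differ, and `export_chunks` is a hypothetical name.

```python
import json


def export_chunks(chunks: list[str], fmt: str = "json") -> str:
    """Serialize chunks as a JSON array, JSONL, or a readable text preview."""
    if fmt == "json":
        # One array holding every chunk.
        return json.dumps(chunks, ensure_ascii=False, indent=2)
    if fmt == "jsonl":
        # One JSON value per line; json.dumps escapes newlines inside a chunk.
        return "\n".join(json.dumps(c, ensure_ascii=False) for c in chunks)
    if fmt == "text":
        # Separator format is an assumption made for this example.
        return "\n\n".join(
            f"--- chunk {i} ---\n{c}" for i, c in enumerate(chunks, start=1)
        )
    raise ValueError(f"unknown format: {fmt}")


if __name__ == "__main__":
    print(export_chunks(["first chunk", "second chunk"], "jsonl"))
```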

Quick tips

Embedding pipelines

Start with Recursive separators, 1000 characters, 100 overlap. Most embedding models perform best on 200 to 1000 token passages with 10 to 20 percent overlap.

Token estimates

The token counts here use a 4 chars per token heuristic. For an exact BPE count against a specific model, use the GPT Token Counter tool after splitting.
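The heuristic is simple enough to write down. The sketch below assumes the estimate is rounded up, which may not match how the tool rounds; `estimate_tokens` is a hypothetical name.

```python
import math


def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Approximate a token count from character length (not an exact BPE count)."""
    return math.ceil(len(text) / chars_per_token)


if __name__ == "__main__":
    # A 1000-character chunk comes out at roughly 250 tokens under this rule.
    print(estimate_tokens("a" * 1000))
```

Real tokenizers can differ noticeably from this estimate, especially on code or non-English text.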

Structured docs

Paragraph mode keeps headings, lists, and code blocks intact when they fit in the budget. Switch to Recursive for very long paragraphs that need a hard split.
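Paragraph packing can be sketched as a greedy loop: keep adding whole paragraphs to the current chunk until the next one would exceed the budget. This is an assumption about the general technique, not the tool's code; `pack_paragraphs` is a hypothetical name, paragraphs are assumed to be separated by blank lines, and overlap is left out for brevity.

```python
import re


def pack_paragraphs(text: str, budget: int = 1000) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most `budget` characters.

    A single paragraph longer than the budget is kept intact as its own
    (over-budget) chunk rather than being cut.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= budget or not current:
            current = candidate      # fits, or it is the first paragraph of a chunk
        else:
            chunks.append(current)   # close the chunk and start a new one
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "aaaa\n\nbbbb\n\ncccccccccccc\n\ndd"
    print(pack_paragraphs(sample, budget=10))
```

In the sample, the 12-character paragraph exceeds the budget of 10 and ends up alone in its own chunk, which is the case where switching to Recursive would give a hard split instead.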

Privacy

Splitting runs entirely in your browser. The text you paste is never uploaded or logged, and stays on your device after the page closes.

How to use

1. Paste your document into the input area, or click Load sample to try a sample RAG-style text.
2. Pick a strategy. Recursive separators is the safest default for documents with paragraphs and headings; Characters is best for raw text; Tokens is best when you have a token budget in mind.
3. Set the chunk size and overlap. A common starting point is 1000 characters with 100 overlap, or use one of the presets at the top.
4. Scroll the chunk preview to read every chunk with source offsets, character and word counts, and approximate token counts. Use the per-chunk Copy button to grab one chunk.
5. Pick JSON, JSONL, or Plain text for the export format, then click Copy export or Download to save the full set of chunks to your machine.
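The per-chunk details listed in the preview (source offsets, character count, word count) can be derived while splitting. Here is a minimal sketch using the Words strategy; the `ChunkRecord` fields and the `chunk_by_words` name are hypothetical and do not reflect the tool's actual data model.

```python
import re
from dataclasses import dataclass


@dataclass
class ChunkRecord:
    index: int   # 1-based chunk number
    start: int   # offset of the first character in the source text
    end: int     # offset just past the last character
    text: str
    chars: int
    words: int


def chunk_by_words(text: str, size: int = 200, overlap: int = 20) -> list[ChunkRecord]:
    """Group whitespace-delimited words into windows of `size` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    # (start, end) offsets of every whitespace-delimited word
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    step = size - overlap
    records: list[ChunkRecord] = []
    for first in range(0, len(spans), step):
        window = spans[first:first + size]
        start, end = window[0][0], window[-1][1]
        piece = text[start:end]
        records.append(
            ChunkRecord(len(records) + 1, start, end, piece, len(piece), len(window))
        )
        if first + size >= len(spans):
            break  # the last window already covers the final word
    return records


if __name__ == "__main__":
    for record in chunk_by_words("one two  three four five", size=3, overlap=1):
        print(record)
```

Because the offsets index into the original text, `text[record.start:record.end]` always reproduces the chunk, including any irregular spacing between words.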

About this tool

LLM Text Chunker splits a long document into smaller, overlapping pieces that fit into an embedding model, a retrieval-augmented generation pipeline, or a constrained prompt context. Paste the text into the input area, pick a strategy, and see every chunk numbered and ready to copy.

Five strategies cover the common cases:

Characters slices a fixed character window with a configurable overlap, useful when the input has no clear structure.
Tokens uses the same window sized in approximate tokens (a 4 chars per token heuristic, matching the rule of thumb published by OpenAI).
Words groups whitespace-delimited words into a fixed count with overlap, useful for prose.
Paragraphs packs whole paragraphs into a chunk under the size budget so headings and lists stay intact.
Recursive separators follows the approach of LangChain's RecursiveCharacterTextSplitter: it tries double newlines first, then single newlines, sentence-ending punctuation, semicolons, commas, single spaces, and finally a character-level fallback. This is a common default in RAG pipelines.

The tool reports the number of chunks, the average chunk size in characters and approximate tokens, the smallest and largest chunk, and an Over budget count of chunks that exceeded the size limit because no smaller separator was available. Each chunk shows its source offset, character and word counts, and a one-click Copy button. The full set can be exported as JSON for direct ingestion into a vector store, JSONL for streaming embedding APIs, or plain text with chunk separators for human review.

Quick presets cover four common configurations: Embeddings (1000 characters, 100 overlap, recursive) for general RAG, Long context (4000 characters, 400 overlap) for long-context models, Tweet-size (280 characters), and SMS chunks (160 characters).

The token counts here are estimates; pair this tool with the GPT Token Counter when you need an exact BPE count against a specific model.
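The Recursive separators strategy described above can be sketched as follows. This is a simplified illustration, not the tool's code and not LangChain's implementation: the separator tuple is an assumed approximation of the order listed above, separators always stay with the preceding piece, and overlap and whitespace trimming are left out. The name `recursive_split` is hypothetical.

```python
# Assumed separator order: paragraphs, lines, sentence endings, semicolons,
# commas, spaces, then the empty string as the character-level fallback.
SEPARATORS = ("\n\n", "\n", ". ", "? ", "! ", "; ", ", ", " ", "")


def recursive_split(text: str, size: int = 1000,
                    separators: tuple[str, ...] = SEPARATORS) -> list[str]:
    """Split on the coarsest separator that works, recursing into oversized pieces."""
    if len(text) <= size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Character-level fallback: hard cut every `size` characters.
        return [text[i:i + size] for i in range(0, len(text), size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, size, finer)

    # Keep each separator attached to the piece that precedes it.
    raw = text.split(sep)
    pieces = [p + sep for p in raw[:-1]] + [raw[-1]]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        if len(piece) > size:
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, size, finer))
        elif len(current) + len(piece) <= size:
            current += piece  # merge small pieces up to the size limit
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "Alpha beta. Gamma delta.\n\nEpsilon zeta eta theta iota kappa."
    for chunk in recursive_split(sample, size=20):
        print(repr(chunk))
```

Because this sketch ends with a character-level cut, it never produces an over-budget chunk, and joining the chunks reproduces the input exactly. A splitter that stops at a coarser separator could return chunks above the limit, which is what the Over budget count reports.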
Everything runs locally in your browser so the documents you split (knowledge base content, internal docs, private corpora) never leave your device or pass through a third-party API.

Free to use. Works in your browser. No signup, no login.

Related tools

GPT Token Counter

Word Counter

Character Counter

Text Cleaner

Markdown TOC Generator