A fast AI token counter and cost estimator for the major LLM APIs — GPT-4o, GPT-4 Turbo, GPT-3.5, Claude Opus 4, Claude Sonnet 4, and Claude Haiku 4. Paste or type into the textarea and see live estimates of token count, character count, and word count, plus a per-million-token cost breakdown for input, output, and total spend. Pick a model, optionally specify expected output tokens, and get an instant budget for any prompt. Ideal for sizing up long prompts before you send them and for estimating batch-job costs without having to run them first.
For exact counts, use the tiktoken library or Anthropic's count_tokens endpoint. This tool uses the industry-standard 4-characters-per-token heuristic for English text.

"Token" is the unit of work for every modern LLM. You pay per token, you fit prompts into a context window measured in tokens, and you get rate-limited per token. Yet most developers using these APIs never look under the hood at how text becomes tokens, why the same prompt costs different amounts across providers, or what actually drives output cost. This guide walks through the byte-pair-encoding mechanics, the per-model tokenizer differences in 2026, the cost-modeling math you should use for budgets, and the specific patterns (prompt caching, batch APIs, smaller models for routing) that cut bills by 50–90%.
An LLM does not see characters or words; it sees integer IDs. Behind that is a fixed vocabulary — typically 50,000 to 200,000 entries — built once during model training, where each entry maps to a sequence of bytes. Tokenization is the algorithm that splits incoming text into the longest matching pieces from that vocabulary.
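Almost every modern LLM uses Byte-Pair Encoding (BPE) or a close variant (SentencePiece, WordPiece). BPE works bottom-up: starting from individual bytes, it repeatedly merges the most frequent adjacent pair of symbols into a new vocabulary entry until the vocabulary reaches its target size.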
The result: common words become single tokens (" the", " and", " of"), rarer words split into 2–4 sub-tokens ("unbelievable" → "un" + "believable"), and unusual sequences fall back to per-byte tokens (a long random hex string can become 30+ tokens).
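The bottom-up merging can be illustrated with a toy trainer. This is a minimal sketch, not how any production tokenizer is implemented: it works on characters rather than bytes, skips pre-tokenization, and the names `train_bpe`, `merge_pair`, and `encode` are hypothetical helpers chosen for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Rewrite the corpus with the new merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[tuple(merge_pair(list(symbols), best))] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(words, num_merges=4)
print(encode("lowest", merges))  # ['low', 'est']
```

Frequent fragments such as "low" and "est" end up as single symbols after only four merges, which is the same effect that turns common words into single tokens in a real vocabulary.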
| Model | Tokenizer | Vocabulary | Avg. chars/token (English) |
|---|---|---|---|
| GPT-4 / GPT-4 Turbo / GPT-3.5 | cl100k_base (BPE) | 100 277 | ~4.0 |
| GPT-4o / GPT-4o-mini / o3 | o200k_base (BPE) | 200 019 | ~4.4 (English) / much better for non-English |
| Claude 3 / 3.5 / Opus 4 / Sonnet 4 / Haiku 4 | Anthropic proprietary BPE | ~65 000 | ~3.7 |
| Gemini 1.5 / 2.0 | SentencePiece | ~256 000 | ~4.5 |
| Llama 3 / 3.1 | BPE (tiktoken-based) | 128 256 | ~4.1 |
| Mistral Large / Mixtral | SentencePiece | 32 768 / 32 000 | ~3.9 |
Practical impact: the same 1,000-word English article runs about 1,250 tokens on GPT-4o (o200k_base) but ~1,400 tokens on Claude. A 10,000-word document can differ by 1,500+ tokens, which is meaningful for both cost and context-window planning.
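The characters-per-token averages in the table above can be turned into a quick estimator. This is a rough sketch of the heuristic only: the dictionary keys are hypothetical labels, the ratios are the English averages from the table, and the result will drift for code, JSON, or non-English text.

```python
import math

# Average characters per token for English text, copied from the table above.
CHARS_PER_TOKEN = {
    "gpt-4-turbo": 4.0,  # cl100k_base
    "gpt-4o": 4.4,       # o200k_base
    "claude": 3.7,
    "gemini": 4.5,
}


def estimate_tokens(text: str, model: str = "gpt-4-turbo") -> int:
    """Rough token estimate: character count divided by the model's average."""
    if not text:
        return 0
    return math.ceil(len(text) / CHARS_PER_TOKEN[model])


sample = "a" * 4000
for name in CHARS_PER_TOKEN:
    print(name, estimate_tokens(sample, name))
```

Rounding up is a deliberate choice here: for budgeting, a slight overestimate is safer than an underestimate.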
The punctuation characters {, }, ", comma, and : each become their own token. JSON typically uses about 50% more tokens per character than equivalent prose.

| Model | Input ($/M tokens) | Output ($/M tokens) | Context window |
|---|---|---|---|
| GPT-4o | $2.50 | $10 | 128 K |
| GPT-4o-mini | $0.15 | $0.60 | 128 K |
| GPT-4 Turbo | $10 | $30 | 128 K |
| GPT-3.5 Turbo | $0.50 | $1.50 | 16 K |
| Claude Opus 4 | $15 | $75 | 200 K |
| Claude Sonnet 4 | $3 | $15 | 200 K |
| Claude Haiku 4 | $1 | $5 | 200 K |
| Gemini 2.0 Flash | $0.075 | $0.30 | 1 M |
| Gemini 2.0 Pro | $1.25 | $5 | 2 M |
The 3:1 to 5:1 output-to-input price ratio is consistent across providers. Generation is more expensive than prefill: the model must produce output autoregressively, one token at a time, while input tokens are processed in parallel.
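The per-million-token prices translate into a budget with simple arithmetic. The sketch below assumes the list prices from the pricing table above (a subset, for brevity); `PRICES` and `estimate_cost` are hypothetical names, and real invoices may differ once caching or batch discounts apply.

```python
# USD per million tokens as (input, output), copied from the pricing table above.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-opus-4": (15.00, 75.00),
    "claude-sonnet-4": (3.00, 15.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Return the input, output, and total cost in USD for one request."""
    in_price, out_price = PRICES[model]
    input_cost = input_tokens * in_price / 1_000_000
    output_cost = output_tokens * out_price / 1_000_000
    return {
        "input": input_cost,
        "output": output_cost,
        "total": input_cost + output_cost,
    }


# A 10,000-token prompt with a 1,000-token reply on Claude Sonnet 4.
print(estimate_cost("claude-sonnet-4", 10_000, 1_000))
```

In this example the reply is a tenth the size of the prompt yet accounts for a third of the total, which is why output length deserves attention when budgeting.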
Both OpenAI and Anthropic now support prompt caching — pay full price the first time you send a long system prompt, then up to 90% off for repeat reads in the next 5 minutes. The math:
If your system prompt is 5,000 tokens and you fire 100 conversations against it per hour, caching saves about $1.35/hour on Sonnet 4 input (5,000 tokens × 0.9 discount × $3/M × 100 calls = $1.35).
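The same calculation can be written as a small function. This sketch makes several simplifying assumptions: the cache stays warm between calls (100 calls per hour is one every 36 seconds, well inside a 5-minute window), the first call pays full price, the discount is a flat 90%, and any cache-write surcharge is ignored. `caching_savings` is a hypothetical helper name.

```python
def caching_savings(prompt_tokens, calls, input_price_per_m, discount=0.9):
    """Input-cost savings in USD when every call after the first hits the cache."""
    full_cost = prompt_tokens * input_price_per_m / 1_000_000  # uncached cost per call
    cached_calls = max(calls - 1, 0)  # the first call is billed in full
    return cached_calls * full_cost * discount


# 5,000-token system prompt, 100 calls per hour, Sonnet 4 input at $3/M.
print(round(caching_savings(5_000, 100, 3.00), 4))  # 1.3365
```

The result lands slightly under the $1.35 figure because only 99 of the 100 calls read from the cache.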
Both OpenAI and Anthropic offer Batch APIs at 50% discount for jobs that don't need real-time responses. The API accepts a JSONL file of prompts and returns results within 24 hours. Use cases:
Combined with caching, this can cut bills 60–70% on batch workloads.
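Preparing the JSONL input is mostly a matter of serializing one request per line. The sketch below assumes the per-line layout used by OpenAI's Batch API (`custom_id`, `method`, `url`, `body`); verify the field names against the provider's current documentation before relying on it. `build_batch_lines` and the default model are illustrative choices, not part of any SDK.

```python
import json


def build_batch_lines(prompts, model="gpt-4o-mini", max_tokens=500):
    """Return one JSON string per prompt, ready to be written to a .jsonl file."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",  # used to match results back to prompts
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,  # cap output, the expensive side
            },
        }
        lines.append(json.dumps(request))
    return lines


lines = build_batch_lines(["Summarize document A.", "Summarize document B."])
print("\n".join(lines))
```

Writing the lines to a file and uploading it is left out on purpose, since that step depends on the SDK version in use.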
```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # resolves to o200k_base
tokens = enc.encode("Hello, world!")
print(len(tokens))  # exact count
```

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
result = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello"}],
)
print(result.input_tokens)
```

```javascript
import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4o');
const tokens = enc.encode('Hello, world!');
console.log(tokens.length);
enc.free(); // tiktoken uses WASM; call free() to release memory
```

```javascript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const result = await client.messages.countTokens({
  model: 'claude-sonnet-4-20250514',
  messages: [{ role: 'user', content: 'Hello' }],
});
console.log(result.input_tokens);
```

```go
import (
	"fmt"
	"github.com/pkoukk/tiktoken-go"
)
// Inside main(); handle the error instead of discarding it in real code.
enc, _ := tiktoken.EncodingForModel("gpt-4o")
tokens := enc.Encode("Hello, world!", nil, nil)
fmt.Println(len(tokens))
```

Set max_tokens. Output is 3–5× more expensive than input. A 500-token cap saves money even when the model could keep going.

GPT-4o uses o200k_base, GPT-3.5 and GPT-4 Turbo use cl100k_base, and Claude uses Anthropic's proprietary BPE tokenizer. Expect accuracy within 10–15% for typical English text. Code, JSON, non-English languages, unusual punctuation, and long unique identifiers can all produce counts that differ by 30% or more from the estimate.

For OpenAI, install the tiktoken Python library (pip install tiktoken) and call encoding_for_model("gpt-4o") to get the exact tokenizer. For Anthropic, use the count_tokens endpoint on the Messages API, which returns exact token counts without consuming quota. For JavaScript/Node, gpt-tokenizer and @anthropic-ai/tokenizer provide client-side equivalents (the latter is only a rough approximation for Claude 3 and later models). Run these in your prompt pipeline before calling the API if precise counting matters for your budget.

GPT-3.5 and GPT-4 Turbo use cl100k_base (100,277 tokens), GPT-4o uses the newer o200k_base (200,019 tokens), which is roughly 20% more efficient on many non-English languages, and Claude uses its own BPE variant. The same sentence can produce different token counts across providers, which affects both cost (billed per token) and context window usage (tokens consumed against the model's max context).

Setting max_tokens, requesting terse responses, and caching long system prompts have an outsized effect on cost.

All tools run in your browser, no signup required, nothing sent to a server.