tokenx

Fast and lightweight token count estimation for any LLM without requiring a full tokenizer. This library provides quick approximations that are good enough for most use cases while keeping your bundle size minimal.
For advanced use cases requiring precise token counts, please use a full tokenizer like gpt-tokenizer.
Benchmarks
The following table shows the accuracy of the token count approximation for different input texts:
| Description | Actual GPT Token Count | Estimated Token Count | Token Count Deviation |
| --- | --- | --- | --- |
| Short English text | 10 | 11 | 10.00% |
| German text with umlauts | 48 | 49 | 2.08% |
| Metamorphosis by Franz Kafka (English) | 31796 | 32325 | 1.66% |
| Die Verwandlung by Franz Kafka (German) | 35309 | 33970 | 3.79% |
| 道德經 by Laozi (Chinese) | 11712 | 11427 | 2.43% |
| TypeScript ES5 Type Declarations (~ 4000 loc) | 49293 | 51599 | 4.68% |
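The deviation column appears to be the absolute difference between the estimated and actual counts, relative to the actual count. This formula is inferred from the numbers in the table and is not stated by the library; `deviationPercent` is a hypothetical helper, not part of tokenx. A minimal sketch that reproduces three of the rows:

```typescript
// Hypothetical helper (not part of tokenx): relative deviation in percent,
// inferred from the values in the benchmark table.
function deviationPercent(actual: number, estimated: number): number {
  return (Math.abs(estimated - actual) / actual) * 100
}

// [description, actual GPT token count, estimated token count]
const rows: Array<[string, number, number]> = [
  ['Short English text', 10, 11],
  ['German text with umlauts', 48, 49],
  ['Die Verwandlung by Franz Kafka (German)', 35309, 33970],
]

for (const [description, actual, estimated] of rows) {
  // Prints 10.00%, 2.08% and 3.79% for the three rows above
  console.log(`${description}: ${deviationPercent(actual, estimated).toFixed(2)}%`)
}
```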
Features
About 96% accuracy compared to full tokenizers (see Benchmarks above)
Just 2kB bundle size with zero dependencies
Multi-language support with configurable language rules
Built-in support for accented characters (German, French, Spanish, etc.)
Configurable and extensible
Installation
Run one of the following commands to add tokenx to your project, depending on your package manager.
npm install tokenx
pnpm add tokenx
yarn add tokenx
Usage
import { estimateTokenCount, isWithinTokenLimit, splitByTokens } from 'tokenx'

const text = 'Your text goes here.'

// Estimate the number of tokens in the text
const estimatedTokens = estimateTokenCount(text)
console.log(`Estimated token count: ${estimatedTokens}`)

// Check whether the text fits within a token limit
const tokenLimit = 1024
const withinLimit = isWithinTokenLimit(text, tokenLimit)
console.log(`Is within token limit: ${withinLimit}`)

// Split the text into chunks of roughly 100 tokens each
const chunks = splitByTokens(text, 100)
console.log(`Split into ${chunks.length} chunks`)

// Estimate with custom options
const customOptions = {
  defaultCharsPerToken: 4,
  languageConfigs: [
    { pattern: /[你我他]/g, averageCharsPerToken: 1.5 },
  ]
}
const customEstimate = estimateTokenCount(text, customOptions)
console.log(`Custom estimate: ${customEstimate}`)
API
estimateTokenCount
Estimates the number of tokens in a given input string using heuristic rules that work across multiple languages and text types.
Usage:
// Estimate with the default options
const estimatedTokens = estimateTokenCount('Hello, world!')

// Estimate with a custom rule for accented characters
const customEstimate = estimateTokenCount('Bonjour le monde!', {
  defaultCharsPerToken: 4,
  languageConfigs: [
    { pattern: /[éèêëàâîï]/i, averageCharsPerToken: 3 }
  ]
})
Type Declaration:
function estimateTokenCount(
  text?: string,
  options?: TokenEstimationOptions
): number

interface TokenEstimationOptions {
  defaultCharsPerToken?: number
  languageConfigs?: LanguageConfig[]
}

interface LanguageConfig {
  pattern: RegExp
  averageCharsPerToken: number
}
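To make the roles of `defaultCharsPerToken` and `languageConfigs` more concrete, here is a minimal sketch of a characters-per-token heuristic. It is not tokenx's actual implementation: `roughEstimate` is a hypothetical function, the whitespace splitting and the default ratio of 4 are assumptions made for illustration, and the real library applies additional rules.

```typescript
// Hypothetical sketch, NOT tokenx's implementation.
interface LanguageConfig {
  pattern: RegExp
  averageCharsPerToken: number
}

interface TokenEstimationOptions {
  defaultCharsPerToken?: number
  languageConfigs?: LanguageConfig[]
}

function roughEstimate(text = '', options: TokenEstimationOptions = {}): number {
  // Assumed default of 4 characters per token for this sketch
  const { defaultCharsPerToken = 4, languageConfigs = [] } = options

  // Split on whitespace and estimate each non-empty segment on its own
  const segments = text.split(/\s+/).filter(Boolean)

  let total = 0
  for (const segment of segments) {
    // Pick the first language rule whose pattern occurs in the segment.
    // String.prototype.search ignores lastIndex, so patterns with the
    // global flag behave the same on every call.
    const config = languageConfigs.find(c => segment.search(c.pattern) !== -1)
    const charsPerToken = config?.averageCharsPerToken ?? defaultCharsPerToken
    total += Math.ceil(segment.length / charsPerToken)
  }
  return total
}

console.log(roughEstimate('Hello, world!')) // 4 (two 6-character segments, 2 tokens each)
```

A lower `averageCharsPerToken` makes matching segments count for more tokens, which is why the Chinese rule in the usage example uses a smaller ratio than the default.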
isWithinTokenLimit
Checks if the estimated token count of the input is within a specified token limit.
Usage:
const withinLimit = isWithinTokenLimit('Check this text against a limit', 100)
const customCheck = isWithinTokenLimit('Text', 50, { defaultCharsPerToken: 3 })
Type Declaration:
function isWithinTokenLimit(
  text: string,
  tokenLimit: number,
  options?: TokenEstimationOptions
): boolean
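A common reason to check a limit is to leave room for the model's reply inside a fixed context window. The sketch below illustrates that pattern under stated assumptions: `fitsPromptBudget` and `roughEstimator` are hypothetical names, the stand-in estimator simply assumes about 4 characters per token, and treating a prompt that lands exactly on the limit as fitting is a choice made here, not confirmed behavior of tokenx.

```typescript
// Hypothetical sketch, not part of tokenx.
type Estimator = (text: string) => number

// Stand-in estimator for this sketch only (about 4 characters per token)
const roughEstimator: Estimator = text => Math.ceil(text.length / 4)

function fitsPromptBudget(
  prompt: string,
  contextWindow: number,
  reservedForReply: number,
  estimate: Estimator = roughEstimator
): boolean {
  // The prompt fits if its estimate plus the reserved reply tokens
  // does not exceed the context window (inclusive boundary assumed)
  return estimate(prompt) + reservedForReply <= contextWindow
}

console.log(fitsPromptBudget('a'.repeat(400), 200, 100)) // true: 100 + 100 <= 200
console.log(fitsPromptBudget('a'.repeat(404), 200, 100)) // false: 101 + 100 > 200
```

With tokenx itself, the same budget could presumably be expressed as `isWithinTokenLimit(prompt, contextWindow - reservedForReply)`.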
sliceByTokens
Extracts a portion of text based on token positions, similar to Array.prototype.slice(). Supports both positive and negative indices.
Usage:
const text = 'Hello, world! This is a test sentence.'

// First three tokens
const firstThree = sliceByTokens(text, 0, 3)

// From token index 2 to the end
const fromSecond = sliceByTokens(text, 2)

// Last two tokens
const lastTwo = sliceByTokens(text, -2)

// Everything except the first and last token
const middle = sliceByTokens(text, 1, -1)

// Slice with custom estimation options
const customSlice = sliceByTokens(text, 0, 5, {
  defaultCharsPerToken: 4,
  languageConfigs: [
    { pattern: /[éèêëàâîï]/i, averageCharsPerToken: 3 }
  ]
})
Type Declaration:
function sliceByTokens(
  text: string,
  start?: number,
  end?: number,
  options?: TokenEstimationOptions
): string
Parameters:
text - The input text to slice
start - The start token index (inclusive). If negative, treated as offset from end. Default: 0
end - The end token index (exclusive). If negative, treated as offset from end. If omitted, slices to the end
options - Token estimation options (same as estimateTokenCount)
Returns:
The sliced text portion corresponding to the specified token range.
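The index rules above mirror `Array.prototype.slice()`. The following sketch shows how negative and omitted indices can be resolved against a known token total. `resolveTokenRange` is a hypothetical helper, not part of tokenx, and whitespace-separated words stand in for tokens purely for illustration.

```typescript
// Hypothetical helper, not part of tokenx: resolves slice-style indices
// to a concrete [from, to) range for a given number of tokens.
function resolveTokenRange(
  totalTokens: number,
  start = 0,
  end?: number
): [number, number] {
  // Negative indices count back from the end; everything is clamped to [0, totalTokens]
  const clamp = (index: number) =>
    index < 0 ? Math.max(totalTokens + index, 0) : Math.min(index, totalTokens)

  const from = clamp(start)
  const to = end === undefined ? totalTokens : clamp(end)

  // An inverted range yields an empty slice, as with Array.prototype.slice()
  return [from, Math.max(from, to)]
}

// Words stand in for tokens here; real token boundaries differ
const words = 'Hello, world! This is a test sentence.'.split(' ')

const [from, to] = resolveTokenRange(words.length, -2)
console.log(words.slice(from, to)) // [ 'test', 'sentence.' ]
```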
splitByTokens
Splits text into chunks based on token count. Useful for chunking documents for RAG, batch processing, or staying within context windows.
Usage:
const text = 'Long text that needs to be split into smaller chunks...'

// Split into chunks of roughly 100 tokens each
const chunks = splitByTokens(text, 100)
console.log(`Split into ${chunks.length} chunks`)

// Let neighboring chunks overlap by 10 tokens
const overlappedChunks = splitByTokens(text, 100, { overlap: 10 })

// Combine custom estimation options with an overlap
const customChunks = splitByTokens(text, 50, {
  defaultCharsPerToken: 4,
  overlap: 5
})
Type Declaration:
interface SplitByTokensOptions extends TokenEstimationOptions {
  overlap?: number
}

function splitByTokens(
  text: string,
  tokensPerChunk: number,
  options?: SplitByTokensOptions
): string[]
Parameters:
text - The input text to split
tokensPerChunk - Maximum number of tokens per chunk
options - Token estimation options with optional overlap
Returns:
An array of text chunks, each containing approximately tokensPerChunk tokens.
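To illustrate what an overlap means for chunking in general, the sketch below splits an array of units into windows whose tails are repeated at the start of the next window. `chunkWithOverlap` is a hypothetical helper, not tokenx's implementation; the library works on estimated tokens in text, so its exact chunk boundaries and its handling of invalid `overlap` values may differ.

```typescript
// Hypothetical sketch, not tokenx's implementation: fixed-size windows
// over an array, where each window repeats the last `overlap` items of
// the previous one.
function chunkWithOverlap<T>(items: T[], size: number, overlap = 0): T[][] {
  if (size <= 0)
    throw new RangeError('size must be positive')
  if (overlap < 0 || overlap >= size)
    throw new RangeError('overlap must be at least 0 and smaller than size')

  // Each new chunk starts `size - overlap` items after the previous one
  const step = size - overlap
  const chunks: T[][] = []

  for (let start = 0; start < items.length; start += step) {
    chunks.push(items.slice(start, start + size))
    // Stop once a chunk has reached the end of the input
    if (start + size >= items.length)
      break
  }
  return chunks
}

console.log(chunkWithOverlap([1, 2, 3, 4, 5, 6, 7], 3, 1))
// [ [ 1, 2, 3 ], [ 3, 4, 5 ], [ 5, 6, 7 ] ]
```

A larger overlap preserves more context across chunk boundaries at the cost of producing more chunks for the same input.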
License
MIT License © 2023-PRESENT Johann Schopplich