Show HN: TokenDagger – A tokenizer faster than OpenAI's Tiktoken

Jun 30, 2025 - 14:00

TokenDagger: High-Performance Implementation of OpenAI's TikToken

A fast, drop-in implementation of OpenAI's TikToken, designed for large-scale text processing. It delivers 2x the throughput of TikToken and tokenizes code samples 4x faster.

Benchmarks

Performed on an AMD EPYC 4584PX - 16c/32t - 4.2 GHz.

  • Fast Regex Parsing: Optimized PCRE2 regex engine for efficient token pattern matching
  • Drop-In Replacement: Full compatibility with OpenAI's TikToken tokenizer
  • Simplified BPE: Simplified algorithm that reduces the performance impact of a large special-token vocabulary.
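
To make the two stages named above concrete, here is a minimal sketch of a TikToken-style encoder: a regex splits the text into pieces, then byte-pair merges run inside each piece, lowest rank first. This is not TokenDagger's implementation. It uses Python's built-in `re` module instead of PCRE2, and the pattern `PRETOKENIZE` and the vocabulary `TOY_RANKS` are made-up stand-ins for illustration only.

```python
import re

# Hypothetical, heavily simplified pre-tokenization pattern. Real vocabularies
# ship much longer patterns; this one only needs to cover every character.
PRETOKENIZE = re.compile(r"\s?\w+|\s?[^\w\s]+|\s+")

# Toy vocabulary (an assumption for this sketch): all 256 single bytes,
# plus three merged tokens. Lower rank means the merge is applied earlier.
TOY_RANKS = {bytes([i]): i for i in range(256)}
TOY_RANKS.update({b"lo": 256, b"low": 257, b"er": 258})


def bpe_merge(piece: bytes, ranks: dict[bytes, int]) -> list[bytes]:
    """Repeatedly merge the adjacent pair with the lowest rank."""
    parts = [bytes([b]) for b in piece]
    while len(parts) > 1:
        best_idx, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        if best_idx is None:
            break  # no mergeable pair is left
        parts[best_idx:best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    return parts


def encode(text: str, ranks: dict[bytes, int]) -> list[int]:
    """Split with the regex, then run BPE inside each piece."""
    tokens: list[int] = []
    for match in PRETOKENIZE.finditer(text):
        piece = match.group().encode("utf-8")
        if piece in ranks:
            tokens.append(ranks[piece])  # whole piece is already a token
        else:
            tokens.extend(ranks[p] for p in bpe_merge(piece, ranks))
    return tokens


def decode(tokens: list[int], ranks: dict[bytes, int]) -> str:
    """Map token ids back to bytes and join them."""
    inverse = {rank: piece for piece, rank in ranks.items()}
    return b"".join(inverse[t] for t in tokens).decode("utf-8")


if __name__ == "__main__":
    ids = encode("low lower", TOY_RANKS)
    print(ids)
    print(decode(ids, TOY_RANKS))
```

Both stages run once per input, so the cost of the regex engine and of the merge loop each show up directly in throughput, which is why the list above calls them out separately.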

Run Tests

make clean && make
pip3 install tiktoken
python3 tests/test_tokendagger_vs_tiktoken.py --tokenizer llama
python3 tests/test_tokendagger_vs_tiktoken.py --tokenizer mistral
python3 tests/performance_benchmark.py --tokenizer llama
python3 tests/performance_benchmark.py --tokenizer mistral
python3 tests/code_performance_benchmark.py --tokenizer llama
================================================================================
🎉 CONCLUSION: TokenDagger is 4.02x faster on code tokenization!
================================================================================
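
A speedup figure like the one printed above is a ratio of throughputs. The sketch below shows one way such a comparison can be made: first check that both tokenizers return identical token ids, then time each one over the same inputs. It is not the repository's benchmark script. The names `throughput` and `compare` are hypothetical, and the encoders are passed in as plain callables so nothing is assumed about either library's API.

```python
import time
from typing import Callable, Sequence

Encoder = Callable[[str], Sequence[int]]


def throughput(encode: Encoder, texts: Sequence[str], repeats: int = 5) -> float:
    """Return the best observed throughput in bytes per second."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for t in texts:
            encode(t)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return total_bytes / max(best, 1e-12)


def compare(baseline: Encoder, candidate: Encoder, texts: Sequence[str]) -> float:
    """Check output parity, then return candidate speed relative to baseline."""
    for t in texts:
        if list(baseline(t)) != list(candidate(t)):
            raise ValueError(f"token mismatch on input {t!r}")
    return throughput(candidate, texts) / throughput(baseline, texts)


if __name__ == "__main__":
    # Stand-in encoders so the sketch runs without any tokenizer installed.
    def reference(s: str) -> list[int]:
        return list(s.encode("utf-8"))

    def fast(s: str) -> list[int]:
        return list(s.encode("utf-8"))

    samples = ["def add(a, b):\n    return a + b\n"] * 1000
    print(f"speedup: {compare(reference, fast, samples):.2f}x")
```

Taking the best of several repeats reduces noise from other processes. Checking parity before timing matters because a faster tokenizer that produces different ids is not a drop-in replacement.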

📦 Installation

From PyPI (Recommended)

pip install tokendagger

🛠️ Dev Install

git clone git@github.com:M4THYOU/TokenDagger.git
cd TokenDagger
sudo apt install libpcre2-dev
git submodule update --init --recursive
sudo apt update && sudo apt install -y python3-dev

And optionally for running the tests:

pip3 install tiktoken

Dependencies

  • PCRE2: Perl Compatible Regular Expressions - GitHub
