Microsoft.ML.Tokenizers 2.0.0-preview.1.25127.4
About
Microsoft.ML.Tokenizers provides an abstraction for tokenizers as well as implementations of common tokenization algorithms.
Key Features
- Extensible tokenizer architecture that allows for specialization of the Normalizer, PreTokenizer, Model/Encoder, and Decoder components
- BPE - Byte pair encoding model
- English Roberta model
- Tiktoken model
- Llama model
- Phi2 model
How to Use
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using Microsoft.ML.Tokenizers;
//
// Using Tiktoken Tokenizer
//
// Initialize the tokenizer for the `gpt-4o` model. This instance should be cached for all subsequent use.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
string source = "Text tokenization is the process of splitting a string into a list of tokens.";
Console.WriteLine($"Tokens: {tokenizer.CountTokens(source)}");
// prints: Tokens: 16
var trimIndex = tokenizer.GetIndexByTokenCountFromEnd(source, 5, out string processedText, out _);
// processedText may be null when the tokenizer applies no normalization; fall back to the original text.
processedText ??= source;
Console.WriteLine($"5 tokens from end: {processedText.Substring(trimIndex)}");
// prints: 5 tokens from end: a list of tokens.
trimIndex = tokenizer.GetIndexByTokenCount(source, 5, out processedText, out _);
processedText ??= source;
Console.WriteLine($"5 tokens from start: {processedText.Substring(0, trimIndex)}");
// prints: 5 tokens from start: Text tokenization is the
IReadOnlyList<int> ids = tokenizer.EncodeToIds(source);
Console.WriteLine(string.Join(", ", ids));
// prints: 1199, 4037, 2065, 374, 279, 1920, 315, 45473, 264, 925, 1139, 264, 1160, 315, 11460, 13
//
// Using Llama Tokenizer
//
// Open a stream to the remote Llama tokenizer model data file.
using HttpClient httpClient = new();
const string modelUrl = @"https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model";
using Stream remoteStream = await httpClient.GetStreamAsync(modelUrl);
// Create the Llama tokenizer using the remote stream. This should be cached for all subsequent use.
Tokenizer llamaTokenizer = LlamaTokenizer.Create(remoteStream);
string input = "Hello, world!";
ids = llamaTokenizer.EncodeToIds(input);
Console.WriteLine(string.Join(", ", ids));
// prints: 1, 15043, 29892, 3186, 29991
Console.WriteLine($"Tokens: {llamaTokenizer.CountTokens(input)}");
// prints: Tokens: 5
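A common use of the index-based calls above is trimming text to a token budget before sending it to a model. The following is a minimal sketch of that idea, not an API provided by the library: `TruncateToTokenBudget` is a hypothetical helper name, and the budget of 5 tokens is arbitrary. It relies only on `GetIndexByTokenCount` as used in the sample above. Depending on the package version, the vocabulary data for `gpt-4o` may need to be available from a separate data package.

```csharp
#nullable enable
using System;
using Microsoft.ML.Tokenizers;

// Create the tokenizer once and reuse it for every call.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

string source = "Text tokenization is the process of splitting a string into a list of tokens.";

// Keep only as much of the text as fits in a 5-token budget.
string truncated = TruncateToTokenBudget(tokenizer, source, 5);
Console.WriteLine(truncated);

// Hypothetical helper (not part of Microsoft.ML.Tokenizers):
// returns the longest prefix of `text` that encodes to at most `maxTokens` tokens.
static string TruncateToTokenBudget(Tokenizer tokenizer, string text, int maxTokens)
{
    // The returned index is relative to the normalized text when normalization
    // was applied; otherwise the out value can be null and the index refers to `text`.
    int index = tokenizer.GetIndexByTokenCount(text, maxTokens, out string? normalizedText, out _);
    string processed = normalizedText ?? text;
    return processed.Substring(0, index);
}
```

Returning the prefix of the processed text, rather than of the raw input, keeps the index and the string consistent for tokenizers that normalize their input.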
Main Types
The main types provided by this library are:
- Microsoft.ML.Tokenizers.Tokenizer
- Microsoft.ML.Tokenizers.BpeTokenizer
- Microsoft.ML.Tokenizers.EnglishRobertaTokenizer
- Microsoft.ML.Tokenizers.TiktokenTokenizer
- Microsoft.ML.Tokenizers.Normalizer
- Microsoft.ML.Tokenizers.PreTokenizer
Feedback & Contributing
Microsoft.ML.Tokenizers is released as open source under the MIT license. Bug reports and contributions are welcome at the GitHub repository.
No packages depend on Microsoft.ML.Tokenizers.
Release notes: https://aka.ms/mlnetreleasenotes
Dependencies
.NET 8.0
- Google.Protobuf (>= 3.27.1)
- System.Text.Json (>= 8.0.5)
.NET Standard 2.0
- Google.Protobuf (>= 3.27.1)
- Microsoft.Bcl.HashCode (>= 6.0.0)
- Microsoft.Bcl.Memory (>= 9.0.0)
- System.Text.Json (>= 8.0.5)
| Version | Downloads | Last updated |
|---|---|---|
| 2.0.0-preview.1.25127.4 | 6 | 07/10/2025 |
| 2.0.0-preview.1.25125.4 | 6 | 07/10/2025 |
| 1.0.2 | 6 | 07/10/2025 |
| 1.0.1 | 6 | 07/10/2025 |
| 1.0.0 | 6 | 07/10/2025 |
| 0.22.0 | 6 | 07/10/2025 |
| 0.22.0-preview.24526.1 | 6 | 07/10/2025 |
| 0.22.0-preview.24522.7 | 6 | 07/10/2025 |
| 0.22.0-preview.24378.1 | 6 | 07/09/2025 |
| 0.22.0-preview.24271.1 | 7 | 07/10/2025 |
| 0.22.0-preview.24179.1 | 6 | 07/10/2025 |
| 0.22.0-preview.24162.2 | 6 | 07/10/2025 |
| 0.21.1 | 6 | 07/10/2025 |
| 0.21.0 | 6 | 07/10/2025 |
| 0.21.0-preview.23511.1 | 6 | 07/10/2025 |
| 0.21.0-preview.23266.6 | 6 | 07/10/2025 |
| 0.21.0-preview.22621.2 | 6 | 07/10/2025 |
| 0.20.1 | 7 | 07/10/2025 |
| 0.20.1-preview.22573.9 | 6 | 07/10/2025 |
| 0.20.0 | 6 | 07/10/2025 |
| 0.20.0-preview.22551.1 | 6 | 07/10/2025 |