Microsoft.ML.Tokenizers 0.22.0

About

Microsoft.ML.Tokenizers provides implementations of several tokenization algorithms used in NLP transforms.

Key Features

  • Extensible tokenizer architecture that allows for specialization of Normalizer, PreTokenizer, Model/Encoder, Decoder
  • BPE - Byte pair encoding model
  • English Roberta model
  • Tiktoken model
  • Llama model
  • Phi2 model
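
To make the byte pair encoding entry in the list above more concrete, here is a minimal, self-contained sketch of a BPE merge loop. It does not use the library and is not a description of its internals: `ToyBpe`, `Encode`, and the merge table format are hypothetical names chosen for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy illustration only: NOT how Microsoft.ML.Tokenizers implements BPE.
public static class ToyBpe
{
    // Splits a word into characters, then repeatedly merges the adjacent pair
    // with the best (lowest) rank until no known pair remains.
    public static List<string> Encode(string word, IReadOnlyList<(string Left, string Right)> merges)
    {
        // A merge rule's rank is its position in the list (earlier = higher priority).
        var rank = new Dictionary<(string, string), int>();
        for (int i = 0; i < merges.Count; i++)
            rank[merges[i]] = i;

        // Start from single characters.
        var parts = word.Select(c => c.ToString()).ToList();

        while (parts.Count > 1)
        {
            int bestIndex = -1, bestRank = int.MaxValue;
            for (int i = 0; i < parts.Count - 1; i++)
            {
                if (rank.TryGetValue((parts[i], parts[i + 1]), out int r) && r < bestRank)
                {
                    bestRank = r;
                    bestIndex = i;
                }
            }

            // No adjacent pair matches any merge rule: done.
            if (bestIndex < 0) break;

            // Merge the winning pair into a single token.
            parts[bestIndex] += parts[bestIndex + 1];
            parts.RemoveAt(bestIndex + 1);
        }

        return parts;
    }
}
```

With the made-up merge rules ("l", "o"), ("lo", "w"), ("e", "r"), calling `ToyBpe.Encode("lower", merges)` yields the two tokens `low` and `er`. Real tokenizers additionally map each resulting token to an integer id from a vocabulary.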

How to Use

using Microsoft.ML.Tokenizers;
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;

//
// Using Tiktoken Tokenizer
//

// Initialize the tokenizer for the `gpt-4` model
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");

string source = "Text tokenization is the process of splitting a string into a list of tokens.";

Console.WriteLine($"Tokens: {tokenizer.CountTokens(source)}");
// prints: Tokens: 16

var trimIndex = tokenizer.GetIndexByTokenCountFromEnd(source, 5, out string processedText, out _);
Console.WriteLine($"5 tokens from end: {processedText.Substring(trimIndex)}");
// prints: 5 tokens from end:  a list of tokens.

trimIndex = tokenizer.GetIndexByTokenCount(source, 5, out processedText, out _);
Console.WriteLine($"5 tokens from start: {processedText.Substring(0, trimIndex)}");
// prints: 5 tokens from start: Text tokenization is the

IReadOnlyList<int> ids = tokenizer.EncodeToIds(source);
Console.WriteLine(string.Join(", ", ids));
// prints: 1199, 4037, 2065, 374, 279, 1920, 315, 45473, 264, 925, 1139, 264, 1160, 315, 11460, 13

//
// Using Llama Tokenizer
//

// Open a stream to the remote Llama tokenizer model file
using HttpClient httpClient = new();
const string modelUrl = @"https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model";
using Stream remoteStream = await httpClient.GetStreamAsync(modelUrl);

// Create the Llama tokenizer using the remote stream
Tokenizer llamaTokenizer = LlamaTokenizer.Create(remoteStream);
string input = "Hello, world!";
ids = llamaTokenizer.EncodeToIds(input);
Console.WriteLine(string.Join(", ", ids));
// prints: 1, 15043, 29892, 3186, 29991

Console.WriteLine($"Tokens: {llamaTokenizer.CountTokens(input)}");
// prints: Tokens: 5

Main Types

The main types provided by this library are:

  • Microsoft.ML.Tokenizers.Tokenizer
  • Microsoft.ML.Tokenizers.BpeTokenizer
  • Microsoft.ML.Tokenizers.EnglishRobertaTokenizer
  • Microsoft.ML.Tokenizers.TiktokenTokenizer
  • Microsoft.ML.Tokenizers.Normalizer
  • Microsoft.ML.Tokenizers.PreTokenizer

Additional Documentation

Feedback & Contributing

Microsoft.ML.Tokenizers is released as open source under the MIT license. Bug reports and contributions are welcome at the GitHub repository.

Release notes: https://aka.ms/mlnetreleasenotes

Target frameworks: .NET 8.0, .NET Standard 2.0

Version Downloads Last updated
2.0.0-preview.1.25127.4 7 07/10/2025
2.0.0-preview.1.25125.4 6 07/10/2025
1.0.2 6 07/10/2025
1.0.1 6 07/10/2025
1.0.0 6 07/10/2025
0.22.0 6 07/10/2025
0.22.0-preview.24526.1 6 07/10/2025
0.22.0-preview.24522.7 6 07/10/2025
0.22.0-preview.24378.1 6 07/09/2025
0.22.0-preview.24271.1 7 07/10/2025
0.22.0-preview.24179.1 6 07/10/2025
0.22.0-preview.24162.2 6 07/10/2025
0.21.1 6 07/10/2025
0.21.0 6 07/10/2025
0.21.0-preview.23511.1 6 07/10/2025
0.21.0-preview.23266.6 6 07/10/2025
0.21.0-preview.22621.2 6 07/10/2025
0.20.1 7 07/10/2025
0.20.1-preview.22573.9 6 07/10/2025
0.20.0 6 07/10/2025
0.20.0-preview.22551.1 6 07/10/2025