🇩🇪 German Tokenizer Benchmark

A curated collection of German datasets for comprehensive tokenizer evaluation across diverse domains and text types.

This organization hosts datasets used by the GerTokEval framework to evaluate German tokenizers with standardized metrics and fairness analysis.
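The metrics GerTokEval actually reports are not listed on this page. As a rough illustration of what a standardized tokenizer metric can look like, the sketch below computes fertility (average tokens per whitespace-separated word), a commonly used measure. The names `fertility` and `toy_chunk_tokenizer` are hypothetical and not part of GerTokEval or TokEval; the toy tokenizer only stands in for a real one.

```python
from typing import Callable, List


def fertility(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def toy_chunk_tokenizer(text: str, size: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into chunks of at most `size` characters."""
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]


# Long German compounds are split into many chunks, which raises fertility.
samples = ["Die Donaudampfschifffahrt ist lang.", "Das ist kurz."]
score = fertility(samples, toy_chunk_tokenizer)
print(f"fertility: {score:.2f}")
```

To try this with a real tokenizer, pass any callable that maps a string to a list of tokens in place of `toy_chunk_tokenizer`.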

🔎 Datasets

The following datasets are currently supported in the main framework:

These datasets were chosen so that tokenizers are evaluated across a broad range of domains.

Many thanks to Clara Meister for releasing the amazing TokEval framework!