A curated collection of German datasets for comprehensive tokenizer evaluation across diverse domains and text types.
This organization hosts datasets used by the GerTokEval framework to evaluate German tokenizers with standardized metrics and fairness analysis.
The following datasets are currently supported in the main framework:
These datasets were chosen so that tokenizers are evaluated across a broad range of domains.
Many thanks to Clara Meister for releasing the amazing TokEval framework!