Is your feature request related to a problem? Please describe:
The requested functionality is, given a text tokenizer (e.g. a BPE tokenizer such as the Llama one), to determine for each token the positions of the text characters corresponding to it.
For background: I need this for projecting token-level predictions from my text encoder back to spans of the original text.
In easy cases (pure ASCII text), I can achieve this by calling `tokenizer.create_encoder(...).encode_as_tokens(text)` and then simply adding up the lengths of the text-form tokens, as sketched below. However, with more difficult input (e.g. non-Latin script), this method results in a `UnicodeDecodeError` if a token boundary happens to fall in the middle of a multi-byte character.
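For concreteness, here is a minimal sketch of that naive approach, assuming the `create_encoder` / `encode_as_tokens` API mentioned above; it is fine for ASCII-only input but breaks down once a token boundary splits a multi-byte character:

```python
# Minimal sketch of the naive character-accumulation approach described above.
# Assumes `tokenizer` is a fairseq2 text tokenizer and `text` is the input string;
# works for plain-ASCII text, but can fail (UnicodeDecodeError) when a token
# boundary falls inside a multi-byte UTF-8 character.
encoder = tokenizer.create_encoder()
tokens = encoder.encode_as_tokens(text)  # text-form tokens, e.g. ["He", "llo", ...]

spans = []
offset = 0
for tok in tokens:
    spans.append((offset, offset + len(tok)))  # (char_start, char_end) of this token
    offset += len(tok)
```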
Describe the solution you would like:
An ideal solution would be either to implement a method like `token_to_chars` in HF tokenizers that provides the requested mapping, or at least to publicly expose a method for converting token ids into byte sequences (currently, with tiktoken-based tokenizers, I can get that with `tokenizer._encoding.decode_tokens_bytes(text_tokens)`, but this wouldn't work with other tokenizer types, and anyway, relying on a private attribute doesn't feel very good).
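For reference, this is the Hugging Face behaviour being referred to: with any fast (Rust-backed) tokenizer, `BatchEncoding.token_to_chars` maps a token index to a character span in the original text. The model name below is only an example:

```python
from transformers import AutoTokenizer

# Any fast tokenizer exposes offset information; the model name is illustrative only.
hf_tok = AutoTokenizer.from_pretrained("bert-base-cased")

enc = hf_tok("Ceci n'est pas une pipe.")
span = enc.token_to_chars(3)  # CharSpan(start=..., end=...) for token 3 in the original text
print(enc.tokens()[3], span.start, span.end)
```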
Describe the alternatives you have considered:
- Using `tokenizer._encoding` to convert tokens into byte sequences and then manually aligning them with the text (see the sketch after this list): this works but feels dirty.
- Not using fairseq2 tokenizers at all: this is alright, but I'd like to keep using them, if only for backward compatibility, in some well-established projects like SONAR.
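As a sketch of the first alternative, this is roughly what the manual byte-level alignment looks like. It relies on the private `_encoding` attribute, so it only covers tiktoken-based tokenizers, and it assumes `text_tokens` is the list of token ids for `text` without special tokens such as BOS/EOS:

```python
# Rough sketch of the manual byte-level alignment; relies on the private
# tiktoken encoding, so it only applies to tiktoken-based fairseq2 tokenizers.
token_bytes = tokenizer._encoding.decode_tokens_bytes(text_tokens)

# Precompute a byte-offset -> character-offset table for the original text.
byte_to_char = []
for char_idx, ch in enumerate(text):
    byte_to_char.extend([char_idx] * len(ch.encode("utf-8")))
byte_to_char.append(len(text))  # sentinel for the end-of-text position

# Walk through the tokens' byte lengths and translate them into character spans.
char_spans = []
byte_offset = 0
for tb in token_bytes:
    start = byte_to_char[byte_offset]
    end = byte_to_char[byte_offset + len(tb)]
    char_spans.append((start, end))
    byte_offset += len(tb)
```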