
Feature request: make tokenizers map between characters and tokens #1469

@avidale

Description


Is your feature request related to a problem? Please describe:
The requested functionality: given a text tokenizer (e.g. a BPE tokenizer such as the Llama one), determine for each token the positions of the text characters that correspond to it.

For background: I need this for projecting token-level predictions from my text encoder back to spans of the original text.

In easy cases (pure ASCII text), I can achieve this by applying tokenizer.create_encoder(...).encode_as_tokens(text) and then simply accumulating the lengths of the text-form tokens. However, on harder input (e.g. non-Latin scripts), this method raises a UnicodeDecodeError whenever a token boundary falls in the middle of a multi-byte character.
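For illustration, this is roughly what the naive approach looks like (a minimal sketch; it assumes `tokenizer` is an already-loaded fairseq2 text tokenizer and ignores any special tokens the encoder may add):

```python
# Minimal sketch of the ASCII-only approach described above.
# Assumes `tokenizer` is a loaded fairseq2 text tokenizer.
encoder = tokenizer.create_encoder()

# With non-Latin scripts, this call can raise UnicodeDecodeError when a
# token boundary splits a multi-byte UTF-8 character.
tokens = encoder.encode_as_tokens(text)

# Accumulate text-form token lengths to get character spans.
spans = []
pos = 0
for tok in tokens:
    spans.append((pos, pos + len(tok)))
    pos += len(tok)
```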

Describe the solution you would like:
An ideal solution would be either to implement a method like token_to_chars in HF tokenizers, which provides exactly the requested mapping, or at least to publicly expose a method for converting token ids into byte sequences. (Currently, with tiktoken-based tokenizers, I can get the latter via tokenizer._encoding.decode_tokens_bytes(text_tokens), but this wouldn't work with other tokenizer types, and relying on a private attribute doesn't feel right.)
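For reference, a minimal example of the HF behavior being asked for (requires the `transformers` package and a fast tokenizer; the model name here is just an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any fast tokenizer works
enc = tok("Привет, мир!")

# token_to_chars maps a token index to a (start, end) character span
# in the original text, even for multi-byte characters.
for i in range(len(enc["input_ids"])):
    print(enc.token_to_chars(i))
```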

Describe the alternatives you have considered:

  1. Using tokenizer._encoding to convert tokens into byte sequences and then manually aligning them with the text - this works but feels dirty (see the sketch after this list).
  2. Not using fairseq2 tokenizers at all - this is workable, but I'd like to keep using them, if only for backward compatibility, in well-established projects like SONAR.
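A sketch of workaround 1, for completeness. It assumes a tiktoken-backed tokenizer (since `_encoding` and its `decode_tokens_bytes` method are tiktoken specifics) and that `text_tokens` is the plain id sequence the encoder produced for `text`, with no special tokens:

```python
import bisect

# Byte offset at which each character of `text` begins.
char_starts = []
b = 0
for ch in text:
    char_starts.append(b)
    b += len(ch.encode("utf-8"))

def byte_span_to_char_span(start: int, end: int) -> tuple[int, int]:
    # Round the byte span outward to whole characters, so a token
    # boundary inside a multi-byte character still yields a valid span.
    c_start = bisect.bisect_right(char_starts, start) - 1
    c_end = bisect.bisect_left(char_starts, end)
    return c_start, c_end

# Private tiktoken detail, as noted above; may change without notice.
token_bytes = tokenizer._encoding.decode_tokens_bytes(text_tokens)

spans = []
pos = 0
for tb in token_bytes:
    spans.append(byte_span_to_char_span(pos, pos + len(tb)))
    pos += len(tb)
```

Note that when a token boundary does split a character, the resulting character spans of adjacent tokens overlap by one character, which is usually acceptable for projecting predictions back onto text.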
