
Feature request: make tokenizers map between characters and tokens #1469

@avidale

Description


Is your feature request related to a problem? Please describe:
The requested functionality: given a text tokenizer (e.g. a BPE tokenizer such as the Llama one), determine for each token the positions of the text characters that correspond to it.

For background: I need this for projecting token-level predictions from my text encoder back to spans of the original text.

In easy cases (pure ASCII text), I can achieve this by applying tokenizer.create_encoder(...).encode_as_tokens(text) and then simply accumulating the lengths of the text-form tokens. However, on harder input (e.g. non-Latin scripts), this method raises a UnicodeDecodeError whenever a token boundary falls in the middle of a multi-byte character.
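For illustration, this is roughly what the naive approach looks like (a minimal sketch; it assumes `tokenizer` is an already-loaded fairseq2 text tokenizer and ignores any special tokens the encoder may add):

```python
# Minimal sketch of the ASCII-only approach described above.
# Assumes `tokenizer` is a loaded fairseq2 text tokenizer.
encoder = tokenizer.create_encoder()

# With non-Latin scripts, this call can raise UnicodeDecodeError when a
# token boundary splits a multi-byte UTF-8 character.
tokens = encoder.encode_as_tokens(text)

# Accumulate text-form token lengths to get character spans.
spans = []
pos = 0
for tok in tokens:
    spans.append((pos, pos + len(tok)))
    pos += len(tok)
```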

Describe the solution you would like:
An ideal solution would be either to implement a method like token_to_chars in HF tokenizers, which provides exactly the requested mapping, or at least to publicly expose a method for converting token ids into byte sequences. (Currently, with tiktoken-based tokenizers, I can get the latter via tokenizer._encoding.decode_tokens_bytes(text_tokens), but this wouldn't work with other tokenizer types, and relying on a private attribute doesn't feel right.)
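For reference, a minimal example of the HF behavior being asked for (requires the `transformers` package and a fast tokenizer; the model name here is just an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any fast tokenizer works
enc = tok("Привет, мир!")

# token_to_chars maps a token index to a (start, end) character span
# in the original text, even for multi-byte characters.
for i in range(len(enc["input_ids"])):
    print(enc.token_to_chars(i))
```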

Describe the alternatives you have considered:

  1. Using tokenizer._encoding to convert tokens into byte sequences and then manually aligning them with the text - this works but feels dirty (see the sketch after this list).
  2. Not using fairseq2 tokenizers at all - this is workable, but I'd like to keep using them, if only for backward compatibility, in well-established projects like SONAR.
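A sketch of workaround 1, for completeness. It assumes a tiktoken-backed tokenizer (since `_encoding` and its `decode_tokens_bytes` method are tiktoken specifics) and that `text_tokens` is the plain id sequence the encoder produced for `text`, with no special tokens:

```python
import bisect

# Byte offset at which each character of `text` begins.
char_starts = []
b = 0
for ch in text:
    char_starts.append(b)
    b += len(ch.encode("utf-8"))

def byte_span_to_char_span(start: int, end: int) -> tuple[int, int]:
    # Round the byte span outward to whole characters, so a token
    # boundary inside a multi-byte character still yields a valid span.
    c_start = bisect.bisect_right(char_starts, start) - 1
    c_end = bisect.bisect_left(char_starts, end)
    return c_start, c_end

# Private tiktoken detail, as noted above; may change without notice.
token_bytes = tokenizer._encoding.decode_tokens_bytes(text_tokens)

spans = []
pos = 0
for tb in token_bytes:
    spans.append(byte_span_to_char_span(pos, pos + len(tb)))
    pos += len(tb)
```

Note that when a token boundary does split a character, the resulting character spans of adjacent tokens overlap by one character, which is usually acceptable for projecting predictions back onto text.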
