@zharinov

Adds `from_raw_parts()` for constructing models from pre-parsed components; `from_pretrained()` now delegates to it.

Also fixes a bug where loading would fail if the tokenizer doesn't define an unk_token (not all tokenizers have one).

- `from_pretrained` now delegates to `from_raw_parts`
- Fixes BPE tokenizer support (unk_token_id now optional)
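A minimal sketch of what an optional unknown-token id can look like. The `Vocab` type, its fields, and the choice to drop unmapped tokens are assumptions made for illustration, not the crate's actual code:

```rust
use std::collections::HashMap;

/// Hypothetical minimal vocabulary; not the crate's real tokenizer type.
pub struct Vocab {
    token_to_id: HashMap<String, u32>,
    /// `None` when the tokenizer defines no unk token (e.g. some BPE tokenizers).
    unk_token_id: Option<u32>,
}

impl Vocab {
    /// Builds the vocabulary; a missing unk token is not an error.
    pub fn new(tokens: &[&str], unk_token: Option<&str>) -> Self {
        let token_to_id: HashMap<String, u32> = tokens
            .iter()
            .enumerate()
            .map(|(i, t)| (t.to_string(), i as u32))
            .collect();
        // Resolve the unk id only if an unk token was supplied and is in the vocabulary.
        let unk_token_id = unk_token.and_then(|t| token_to_id.get(t).copied());
        Self { token_to_id, unk_token_id }
    }

    /// Maps tokens to ids. Unknown tokens fall back to the unk id when one
    /// exists and are dropped otherwise (an assumption of this sketch).
    pub fn ids(&self, tokens: &[&str]) -> Vec<u32> {
        tokens
            .iter()
            .filter_map(|t| self.token_to_id.get(*t).copied().or(self.unk_token_id))
            .collect()
    }
}
```

With `Vocab::new(&["hello", "world"], None)`, construction succeeds and `ids` skips tokens it cannot map instead of failing at load time.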
Member

@Pringled left a comment

Thanks for making this PR @zharinov! This is a nice functionality to have I think, and good catch about the unk_token. I have two small (but nice to have) improvements; if you could implement those this is good to go. Thanks for updating the tests as well 👍

@Pringled
Member

@zharinov one additional comment, could you also run clippy to fix the formatting issues?

@zharinov changed the title from "feat: Add from_raw_parts() constructor" to "feat: Add from_borrowed() constructor" on Jan 26, 2026
@zharinov
Author

Hey,

I wanted to support zero-copy initialization with include_bytes!, yet missed the array clone in my own code 🤦‍♂️

The second attempt transforms StaticModel to use Cow, allowing for both owned and static scenarios. This adds some performance penalty, but hopefully it's minimal thanks to CPU branch prediction.
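To illustrate the `Cow` idea, here is a rough sketch under stated assumptions. `Embeddings`, its fields, and the method signatures are hypothetical stand-ins (only the name `from_borrowed` comes from the PR title), and the real `StaticModel` layout may differ:

```rust
use std::borrow::Cow;

/// Hypothetical, simplified stand-in for the model's embedding storage.
pub struct Embeddings {
    /// Either an owned buffer or a borrowed `'static` slice.
    data: Cow<'static, [f32]>,
    dim: usize,
}

impl Embeddings {
    /// Takes ownership of a heap-allocated buffer (e.g. loaded from disk).
    pub fn from_owned(data: Vec<f32>, dim: usize) -> Self {
        Self { data: Cow::Owned(data), dim }
    }

    /// Wraps a `'static` slice without copying it.
    pub fn from_borrowed(data: &'static [f32], dim: usize) -> Self {
        Self { data: Cow::Borrowed(data), dim }
    }

    /// Returns row `i`; `Cow` dereferences to a slice in both cases.
    pub fn row(&self, i: usize) -> &[f32] {
        &self.data[i * self.dim..(i + 1) * self.dim]
    }

    /// Reports whether the storage is borrowed (no allocation was made).
    pub fn is_borrowed(&self) -> bool {
        matches!(self.data, Cow::Borrowed(_))
    }
}

/// Static data standing in for embeddings baked into the binary.
static TABLE: [f32; 4] = [1.0, 2.0, 3.0, 4.0];
```

`Embeddings::from_borrowed(&TABLE, 2)` and `Embeddings::from_owned(vec![5.0, 6.0], 2)` expose the same `row` API. The sketch uses an `f32` static rather than `include_bytes!`, since turning raw bytes into `f32` values also involves alignment and endianness concerns that are out of scope here.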

Also, I applied the suggestion for unk_token.

@zharinov
Author

Update: once I had benchmarks set up locally, I did some additional research, in case you're interested. Here is the report:

1. ndarray SIMD vectorization in `pool_ids`

   ```rust
   // Before: manual loops
   let mut sum = vec![0.0; dim];
   for (i, &v) in row.iter().enumerate() {
       sum[i] += v * scale;
   }

   // After: ndarray vectorized ops
   let mut sum = Array1::<f32>::zeros(dim);
   sum.scaled_add(scale, &row);
   ```

2. `.copied()` pattern for hash lookups

   ```rust
   // Before
   *m.get(tok).unwrap_or(&tok)

   // After
   m.get(tok).copied().unwrap_or(tok)
   ```

3. `AsRef<str>` generics to avoid allocations

   ```rust
   // Before
   pub fn encode(&self, sentences: &[String]) -> Vec<Vec<f32>>

   // After
   pub fn encode<S: AsRef<str>>(&self, sentences: &[S]) -> Vec<Vec<f32>>
   ```

4. `pool_ids` takes a slice instead of a `Vec`

   ```rust
   // Before
   fn pool_ids(&self, ids: Vec<u32>) -> Vec<f32>

   // After
   fn pool_ids(&self, ids: &[u32], max_length: Option<usize>) -> Vec<f32>
   ```
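For readers who want to try two of these patterns in isolation, here is a small standalone sketch using only the standard library. The function names `remap` and `char_counts` are made up for illustration and are not part of the crate:

```rust
use std::collections::HashMap;

/// Hypothetical id remapping: returns the mapped id, or the input id
/// unchanged when the map has no entry. `.copied()` turns `Option<&u32>`
/// into `Option<u32>`, so the fallback can be a plain value.
pub fn remap(m: &HashMap<u32, u32>, tok: u32) -> u32 {
    m.get(&tok).copied().unwrap_or(tok)
}

/// Hypothetical stand-in for an encode-style function: accepts any slice
/// whose items can be viewed as `&str`, so callers holding `&str` values
/// do not need to allocate `String`s first.
pub fn char_counts<S: AsRef<str>>(sentences: &[S]) -> Vec<usize> {
    sentences
        .iter()
        .map(|s| s.as_ref().chars().count())
        .collect()
}
```

Both `char_counts(&["ab", "cde"])` and `char_counts(&[String::from("hello")])` compile against the same signature, which is the convenience the `AsRef<str>` bound is meant to provide.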

Benchmark Results:

| Benchmark | Improvement |
| --- | --- |
| encode_single/short | -8% |
| encode_single/medium | -8% |
| encode_single/long | -7% |
| encode_batch/100/short | -12% |
| encode_batch/100/medium | -16% |
| encode_batch/100/long | -16% |

Branch: https://github.com/zharinov/model2vec-rs/tree/opt/all-optimizations
