Proposal Summary
Dear Olive team,
First, thank you for developing and maintaining the Olive repository—it’s been incredibly useful for optimizing and quantizing large language models (LLMs) in my workflow. However, I’ve encountered a gap while working with vision-language models (VLMs) like InternVL and Qwen2 VL: there are currently no end-to-end examples or explicit guidance for quantizing, converting, or optimizing these multimodal models using Olive.
VLMs are becoming increasingly important for tasks like image-text understanding, visual question answering, and multimodal generation. Models such as InternVL and Qwen2 VL have gained traction in the community, but their optimization pipelines differ from those of text-only LLMs because they combine two components (a vision encoder plus a language decoder) and take mixed image-and-text inputs.
I’m writing to request:
- End-to-end examples for quantizing and converting popular VLMs (e.g., InternVL, Qwen2 VL) with Olive, similar to the examples provided for text-only LLMs. This would help users navigate the steps needed to handle both the visual and linguistic components (a rough sketch of the kind of workflow I have in mind is included further below).
- Clarification of the key differences between optimizing VLMs and traditional LLMs in Olive. For instance:
  - Do visual encoders require distinct quantization strategies (e.g., different bit-widths or calibration data)?
  - How should input pipelines that handle both images and text be adapted to Olive's optimization workflows? (See the calibration sketch right after this list.)
  - Are there special considerations for preserving multimodal alignment during quantization?
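To make the input-pipeline question concrete, here is a minimal sketch of the kind of multimodal calibration data I would expect to need, built with the Hugging Face processor for Qwen2 VL. The `calibration_batches` helper and the `(image_path, prompt)` sample format are my own illustration; how (or whether) such image+text batches plug into Olive's calibration hooks is exactly the part I am hoping to see documented.

```python
from PIL import Image
from transformers import AutoProcessor

# Illustrative only: any representative multimodal dataset could supply the samples.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

def calibration_batches(samples):
    """Yield preprocessed image+text batches from (image_path, prompt) pairs."""
    for image_path, prompt in samples:
        image = Image.open(image_path).convert("RGB")
        # Use the chat template so the image placeholder tokens end up in the text.
        messages = [{
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": prompt},
            ],
        }]
        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        # A single processor call produces both pixel_values (vision encoder input)
        # and input_ids (language decoder input).
        yield processor(text=[text], images=[image], return_tensors="pt")
```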
Adding this support would make Olive more versatile for the growing field of multimodal AI and help the community leverage your tooling for VLMs effectively.
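For concreteness, the end-to-end example I am hoping for would look roughly like the sketch below, written against Olive's Python entry point (`olive.workflows.run`). The model type, pass names, options, and output layout are guesses modeled on the existing text-only LLM examples, not a tested recipe; whether a VLM needs separate conversion and quantization passes for the vision encoder and the language decoder is one of the open questions above.

```python
# Rough sketch only -- config keys and pass options are assumptions
# based on the text-only LLM examples, not a working VLM recipe.
from olive.workflows import run as olive_run

workflow = {
    "input_model": {
        "type": "HfModel",  # assumption: load the VLM directly from Hugging Face
        "model_path": "Qwen/Qwen2-VL-2B-Instruct",
    },
    "passes": {
        # Export to ONNX -- unclear whether the vision encoder and the
        # language decoder should be exported together or as separate graphs.
        "conversion": {"type": "OnnxConversion", "target_opset": 17},
        # Quantize -- possibly with different settings (bit-width, calibration
        # data) for the vision tower vs. the decoder, per the questions above.
        "quantization": {"type": "OnnxQuantization"},
    },
    "output_dir": "models/qwen2-vl-quantized",
}

olive_run(workflow)
```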
What component(s) does this request affect?
- OliveModels
- OliveSystems
- OliveEvaluator
- Metrics
- Engine
- Passes
- Other