Description: LLM inference acceleration is increasingly important for cloud-edge collaborative AI deployments. Speculative decoding can improve end-to-end generation speed by using a lightweight draft model to propose candidate tokens and a larger target model to verify them, but its real-world gains depend heavily on cloud-edge constraints such as network latency, bandwidth limits, and heterogeneous compute.
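As a quick illustration of the draft/verify loop, the sketch below uses Hugging Face Transformers' assisted generation, which realizes speculative decoding through the `assistant_model` argument of `generate`. The model names are placeholders (any target/draft pair sharing a tokenizer works), and the actual speedup depends on how many draft tokens the target accepts.

```python
# Minimal speculative decoding sketch via Hugging Face assisted generation.
# Model names are illustrative assumptions, not a prescribed pairing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "Qwen/Qwen2.5-7B-Instruct"    # larger "cloud" target model (assumption)
draft_name = "Qwen/Qwen2.5-0.5B-Instruct"   # lightweight "edge" draft model (assumption)

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(target_name, torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_name, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Explain speculative decoding in one sentence.", return_tensors="pt").to(target.device)

# assistant_model enables assisted generation: the draft proposes candidate tokens,
# the target verifies them in one forward pass and keeps the accepted prefix.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```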
Ianvs provides a unified benchmarking framework, and KubeEdge scenarios often require evaluating AI workloads under cloud-edge conditions. This project proposes a single-host cloud-edge simulation benchmark in Ianvs to evaluate speculative decoding for LLM inference. The benchmark will simulate edge (draft) and cloud (verify) roles as separate processes, inject configurable network constraints, and report standardized throughput and latency metrics, enabling reproducible comparison between baseline decoding and speculative decoding under different cloud-edge budgets.
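One simple way to inject the network budget in a single-host simulation is to delay every edge-to-cloud exchange by a configured round-trip latency plus a serialization term derived from payload size and bandwidth. The helper below is a hedged sketch of that idea; the function name and default values are assumptions, not part of the Ianvs API.

```python
import time

def simulate_link(payload_bytes: int, rtt_ms: float = 50.0, bandwidth_mbps: float = 10.0) -> None:
    """Block for roughly the time a payload would spend on a constrained cloud-edge link.

    rtt_ms and bandwidth_mbps are illustrative defaults; a real run would read them
    from the benchmark configuration.
    """
    transfer_s = (payload_bytes * 8) / (bandwidth_mbps * 1e6)  # serialization delay
    time.sleep(rtt_ms / 1000.0 + transfer_s)                   # propagation + transfer

# Example: sending ~4 KB of draft-token metadata over a 50 ms / 10 Mbps link.
simulate_link(4096)
```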
Expected Outcome:
- An Ianvs benchmark test case for cloud-edge speculative decoding.
- A single-host simulation runner that emulates edge/cloud roles and configurable network constraints (e.g., latency, bandwidth).
- Benchmark reports comparing baseline vs. speculative decoding on key metrics (e.g., TTFT, end-to-end latency, tokens/s), plus reproducible configs and scripts (see the metric sketch after this list).
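The headline metrics can be derived from per-request timestamps. A minimal sketch follows; the trace fields and function names are assumptions for illustration, not Ianvs interfaces.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    start: float          # request submission time (seconds)
    first_token: float    # time the first output token was produced
    end: float            # time the last output token was produced
    output_tokens: int    # number of generated tokens

def summarize(trace: RequestTrace) -> dict:
    """Compute TTFT, end-to-end latency, and decode throughput for one request."""
    e2e = trace.end - trace.start
    return {
        "ttft_s": trace.first_token - trace.start,
        "e2e_latency_s": e2e,
        "tokens_per_s": trace.output_tokens / e2e if e2e > 0 else 0.0,
    }
```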
Recommended Skills:
Python, PyTorch, HuggingFace Transformers, LLM inference/decoding, benchmarking & performance profiling, KubeEdge/Ianvs basics