Description
Brief Introduction
This issue scopes and provides evidence that the `single_task_bench` example is currently not runnable and that its metric logic is distorted, affecting the validity of the benchmark.
Background
Selected example
- Path: `examples/llm-edge-benchmark-suite/single_task_bench`
- Purpose: edge-side performance benchmarking of LLM inference (latency / throughput / prefill_latency / mem_usage).
Minimal reproduction
From the repo root:
```bash
python benchmarking.py -f examples/llm-edge-benchmark-suite/single_task_bench/benchmarkingjob.yaml
```
Consistently reproducible first errors (triggered multiple times across environments):
```text
ModuleNotFoundError: No module named 'colorlog'
ImportError: cannot import name 'JSONDataParse'
FileNotFoundError: Model file not found: models/qwen/qwen_1_5_0_5b.gguf
```
Two key error screenshots are included in Evidence:
- Evidence 1: No module named 'sedna' (run is blocked).
- Evidence 2: cannot import name 'JSONDataParse' (parser missing).
Debugging clues
- Running the official entry command reaches `core/testenvmanager/dataset/dataset.py`; importing `sedna.datasources` fails immediately (see Evidence 1).
- After explicitly setting `PYTHONPATH` to point to the bundled sedna snapshot, it still fails on a missing `JSONDataParse` (see Evidence 2), showing that the parser itself is not included in the current snapshot.
- Reviewing the example YAML and algorithm configs reveals dataset paths pointing to other examples and model paths pointing to non-existent files, further blocking execution.
- Metric semantics issues observed (facts only): the throughput formula divides by total latency measured in ms, and `basemodel` hard-truncates samples with `data[:10]`. Even if the example ran, these would harm metric credibility (a small numeric sketch follows this list).
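To make the unit issue concrete, here is a tiny numeric sketch with made-up values; the variable names are hypothetical and do not come from the example's code:

```python
# Made-up numbers, only to show the effect of dividing by a millisecond total.
num_requests = 10
total_latency_ms = 2_000.0  # 10 requests finishing in 2 seconds overall

# Dividing by the millisecond total, as described above, yields 0.005,
# which is off by a factor of 1000 if the result is read as req/s.
throughput_ms_based = num_requests / total_latency_ms

# Converting to seconds first gives the intended requests-per-second value: 5.0.
throughput_req_per_s = num_requests / (total_latency_ms / 1000.0)

print(throughput_ms_based, throughput_req_per_s)  # 0.005 5.0
```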
Problem categories
- Configuration: path/paradigm/sort vs metrics inconsistencies; dataset points to a wrong directory, etc.
- Dependencies: the example does not ship or declare the sedna parsers and third-party packages required to run.
- Resources: no usable model file and no minimal prompts dataset.
- Metrics: unit/semantics unclear (formula uses ms directly); sample truncation distorts statistics.
- Documentation: example README lacks a minimal Install — Run — Troubleshooting guide.
Impact and urgency
- The example cannot complete a single effective inference; as an "entry example," it is not runnable by the public.
- Metric math and sample handling show clear bias; results are not comparable or explainable.
- New users and contributors cannot reproduce or extend; this directly reduces the value and attractiveness of the LLM edge example to the community.
Conclusion: This is an “example asset health” issue. The core framework (core/*) parsing/scheduling chain works and throws accurate errors. No core code changes are required.
Goals
Community-facing goals:
- Run-ability: the example can run with one command in a fresh environment, without private resources.
- Metric semantics: unify latency/prefill (ms), throughput (req/s), mem_usage (clearly defined and explainable).
- Reproducible assets: provide a small, semantically reasonable prompts dataset; allow sample size to be controlled via config/env.
- Documentation: README and Troubleshooting cover install, run, metric explanation, and common error triage.
- Anti-regression: provide a smoke test/minimal self-check path for later CI or local validation.
Acceptance criteria:
- Running the official entry command stably produces all four metrics.
- Metric units and sort directions match expectations.
- When changing sample size, latency/throughput change in a reasonable relationship (no hidden hard truncation).
- Steps in docs can be independently reproduced by a first-time user.
Expected Community Contribution
- Example health repair: make `single_task_bench` runnable in a fresh environment (fix example-level YAML paths, sorting semantics, and metric unit semantics).
- Metric semantics clarification: unify latency/prefill (ms), throughput (req/s), and mem_usage definitions, including boundaries and assumptions (documentation focus).
- Reproducible experiment assets: ship a small JSONL prompts dataset and guidance to configure sample size, lowering the entry barrier.
- Documentation: enrich README and Troubleshooting to cover install, run, metric explanation, common error triage, and self-check steps.
- Optional CI / self-check: provide a tiny smoke flow (run a small sample set, assert presence and basic sanity of four metrics and their units) to help community PRs quickly assess health.
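A sketch of what such a smoke flow could assert; the results path, file format, and key names below are assumptions for illustration only, not the example's actual output schema:

```python
"""Minimal smoke-check sketch: load a metrics dump and assert the four metrics
exist and look sane. Path and key names are hypothetical placeholders."""
import json
import math
from pathlib import Path

RESULTS_FILE = Path("workspace/single_task_bench/results.json")  # hypothetical location
EXPECTED_METRICS = ("latency", "throughput", "prefill_latency", "mem_usage")


def check(results: dict) -> None:
    for name in EXPECTED_METRICS:
        assert name in results, f"missing metric: {name}"
        value = float(results[name])
        assert math.isfinite(value) and value >= 0, f"suspicious value for {name}: {value}"
    # Unit sanity under the proposed conventions (latency/prefill in ms):
    assert results["prefill_latency"] <= results["latency"], "prefill exceeds total latency"


if __name__ == "__main__":
    check(json.loads(RESULTS_FILE.read_text()))
    print("smoke check passed")
```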
High-level Approach
- Change boundary: fix and complete at the examples layer only; do not modify `core/*`.
- Config: correct YAML dataset paths, sort directions, and alignment with the metrics list.
- Dependencies: in example scope, declare/point to third-party deps; for sedna parser gaps, offer a temporary shim/stub within the example or clearly document optional install sources (this issue does not submit an implementation).
- Data: provide a tiny JSONL prompts dataset as the minimal working set; make sample size configurable (see the sketch after this list).
- Metrics: unify time units; make throughput second-based; clarify mem metric definition and scope.
- Docs: complete README with the shortest Install — Run — Metrics — Troubleshooting loop.
- Quality: add a smoke test or minimal validation script guidance to ease quick verification and future regression checks.
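As referenced in the Data item above, here is a sketch of a tiny prompts dataset and a configurable sample cap; the `prompt` field name and the `SAMPLE_LIMIT` environment variable are assumptions for illustration, not the example's current contract:

```python
"""Sketch: write a tiny JSONL prompts file and load it with a configurable
sample limit instead of a hard-coded data[:10]."""
import json
import os
from pathlib import Path

PROMPTS = [
    "Explain what edge inference is in one sentence.",
    "List two benefits of running an LLM on an edge device.",
    "Summarize the difference between prefill and decode latency.",
]


def write_prompts(path: Path) -> None:
    with path.open("w", encoding="utf-8") as f:
        for prompt in PROMPTS:
            f.write(json.dumps({"prompt": prompt}, ensure_ascii=False) + "\n")


def load_prompts(path: Path) -> list:
    lines = path.read_text(encoding="utf-8").splitlines()
    samples = [json.loads(line) for line in lines if line.strip()]
    limit = int(os.getenv("SAMPLE_LIMIT", "0"))  # 0 means "use all samples"
    return samples[:limit] if limit > 0 else samples


if __name__ == "__main__":
    dataset = Path("prompts.jsonl")
    write_prompts(dataset)
    print(len(load_prompts(dataset)), "samples loaded")
```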
Scope
Expected Users
- Edge AI/LLM engineers and researchers: need quick inference performance assessment on local/edge devices.
- Ianvs maintainers and contributors: need a reliable “example health” baseline for regression and showcasing.
- New Ianvs users: need a reproducible, low-barrier entry example with troubleshooting hints.
In Scope
- Example assets and configs: problems and semantics in YAML, metrics scripts, and `basemodel.py` under `examples/llm-edge-benchmark-suite/single_task_bench`.
- Minimal data and deps statements: the example ships a minimal working dataset (or clearly explains how to obtain it) and lists required deps.
- Docs: README/Troubleshooting must cover the shortest Install — Run — Metrics — Troubleshooting path.
Out of Scope
- No changes to `ianvs/core/*` interfaces or architecture; no new core APIs.
- No heavyweight model weights or complex training flows; the example can use placeholders or instruct users to provide resources.
- No discussion on algorithm superiority or model accuracy methodology (this topic focuses on run-ability and clear metric semantics).
Uniqueness vs Existing Issues
- Track of Runnable Examples #194 (example run-ability tracking): that is an overview; this issue focuses on one specific example combining “not runnable + metric semantics anomalies,” with a systematic evidence chain.
- Issue with preparing dataset failed: The dataset preparation failure caused by the Ianvs project update #164 (data download failures): that concerns the resource acquisition path; here the core issue is systemic flaws in the example's own path/deps/metric definitions.
- Pipeline fails due to missing 'query' field when running government/objective example #211 (government schema field): applies to a different example’s data schema; here it’s about the LLM edge benchmark example’s execution and metric health.
- PRE-TASK: MOT17 Multi-Edge Inference Benchmark (LFX TERM 3 2025) #231 (CV docs missing): documentation-focused; here it spans “run-blocking + metric calculation semantics + resource and doc gaps.”
Assumptions & Constraints
- Default to no core framework changes; problem location and improvements are limited to the example and docs.
- Environment is a typical local dev setup (macOS/Linux) with basic Python.
- Use a small prompts dataset as the minimal working set; avoid large dependencies or weights.
Evidence
| Category | Representative log / snippet | Impact |
|---|---|---|
| ImportError | `ImportError: cannot import name 'JSONDataParse'` | Data parsing init fails; pipeline aborts |
| Missing model | `FileNotFoundError: ... qwen_1_5_0_5b.gguf` | Model instantiation fails; inference impossible |
| Metric formula | `avg_throughput = num_requests / total_latency` (total_latency in ms) | Values off by a factor of ~1000 when read as req/s |
| Data truncation | `data = data[:10]` | Sample size hard-capped; distortion and opacity |
| Sort logic | `latency: descend` / `throughput: ascend` | Sort semantics inverted; misleading results |
| Docs gap | README only 2 lines | Users lack a runnable path |
Evidence 1 — run blocked: missing sedna module
Command and key error (reproducible):
```bash
python benchmarking.py -f examples/llm-edge-benchmark-suite/single_task_bench/benchmarkingjob.yaml
```
```text
Traceback (most recent call last):
  ...
  File ".../core/testenvmanager/dataset/dataset.py", line 21, in <module>
    from sedna.datasources import (
  ...
ModuleNotFoundError: No module named 'sedna'
```
Evidence 2 — parser missing: JSONDataParse not found
Even after setting `PYTHONPATH`, it fails, which means sedna is importable but does not provide `JSONDataParse`:
```bash
PYTHONPATH=/.../ianvs/examples/resources:$PYTHONPATH \
python benchmarking.py -f examples/llm-edge-benchmark-suite/single_task_bench/benchmarkingjob.yaml
```
```text
Traceback (most recent call last):
  ...
  File ".../core/testenvmanager/dataset/dataset.py", line 21, in <module>
    from sedna.datasources import (
  ...
ImportError: cannot import name 'JSONDataParse' from 'sedna.datasources' (.../examples/resources/sedna/datasources/__init__.py)
```
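To support the claim above, a quick diagnostic (run with the same `PYTHONPATH`) can list which parsers the bundled snapshot actually exposes; this is an illustrative sketch, not part of the example:

```python
# Diagnostic sketch: show where sedna.datasources resolves from and which
# *DataParse classes it actually exports under the current PYTHONPATH.
import sedna.datasources as datasources

print("loaded from:", datasources.__file__)
print("available parsers:", sorted(n for n in dir(datasources) if n.endswith("DataParse")))
```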
Detailed Design
```text
examples/llm-edge-benchmark-suite/single_task_bench/
    benchmarkingjob.yaml           # sort and metrics statements have semantic/consistency issues
    testenv/testenv.yaml           # dataset path incorrectly points to other examples
    testalgorithms/algorithm.yaml  # model path and paradigm affect execution
    testalgorithms/basemodel.py    # sample truncation and output schema hurt metric credibility
    testenv/*.py (metrics)         # throughput formula and memory definition distort results
    (missing: data/model assets and docs)
```
The core framework call chain (YAML parse → ClassFactory registration → metrics execution) is triggerable but blocked by example-level issues. The problems are concentrated in the “example resources and logic” layer.
Architecture
- Orchestrator: `benchmarking.py` parses `benchmarkingjob.yaml` and assembles TestEnv/TestObject/Rank/Visualization.
- Dataset: `core/testenvmanager/dataset/dataset.py` loads data via `sedna.datasources` parsers (e.g., CSV/TXT/JSONL).
- Algorithm: the example's `basemodel.py` wraps inference and emits unified fields to metrics (e.g., total_latency, prefill_latency, mem_usage).
- Metrics plugins: `testenv/*.py` are invoked as plugins to compute latency/throughput/mem.
- Result & Rank: per the sort directions and metrics list in `benchmarkingjob.yaml`, produce comparisons and visualization.
In this issue, the failure points concentrate on the Dataset/Algorithm/Metrics “example-level implementations and configs.” The Ianvs core orchestration works and surfaces accurate errors.
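To make the Algorithm step above more concrete, here is a sketch of the per-request fields such a wrapper could hand to the metric plugins; the `generate` callable, its return shape, and the memory measurement are hypothetical illustrations, not the example's actual implementation:

```python
"""Sketch of per-request fields handed to metric plugins. The field names echo
the ones listed above (total_latency, prefill_latency, mem_usage); everything
else is assumed for illustration."""
import time
import tracemalloc


def run_one_request(generate, prompt: str) -> dict:
    # `generate` stands in for the real inference call; it is assumed to return
    # (timestamp_of_first_token, generated_text), timestamps from time.perf_counter().
    tracemalloc.start()
    start = time.perf_counter()
    first_token_at, _text = generate(prompt)
    end = time.perf_counter()
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "total_latency": (end - start) * 1000.0,                # ms
        "prefill_latency": (first_token_at - start) * 1000.0,   # ms, time to first token
        # Python-heap peak only; a real run would more likely report process RSS.
        "mem_usage": peak_bytes / (1024 * 1024),                # MB
    }
```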
Module details
- Config (YAML)
  - `single_task_bench/benchmarkingjob.yaml`: ensure rank sort directions match metric intentions; align `selected_dataitem.metrics` with actual plugins; keep the visualization section consistent with artifact paths.
  - `single_task_bench/testenv/testenv.yaml`: the dataset should point to the example's own minimal JSONL prompts; avoid referencing other examples.
  - `single_task_bench/testalgorithms/algorithm.yaml`: model-related fields must match the paradigm; avoid non-existent weight paths.
- Data and parsing
  - Prefer parsers that Ianvs currently integrates (e.g., JSONL). If the example demands JSON metadata parsing, document the installation source and options explicitly to avoid implicit dependencies on an "incomplete snapshot" in-repo.
  - To avoid heavy resources, ship a tiny prompts set that satisfies the minimal runnable loop and reproducibility.
- Algorithm wrapper (`basemodel.py`)
  - Constrain the output schema to meet the metrics' contract; remove opaque sample truncation. Expose sample size as a config/env option instead of hard-coding it.
- Metric scripts (`testenv/*.py`)
  - Clarify units and semantics (latency/prefill: ms; throughput: req/s; mem: define window and statistics), avoiding ms-based reciprocals and cross-device non-comparability (a hedged sketch follows at the end of this section).
- Documentation
  - The example README/Troubleshooting must provide the shortest Install — Run — Metrics — Troubleshooting path and explicitly list third-party deps and options (e.g., the sedna install path).
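Finally, as an illustration of the unit conventions proposed for the metric scripts, a minimal sketch using plain helper functions; the per-request result structure and field names are assumptions (matching the per-request sketch above), and the actual plugin registration mechanism is left to the existing example conventions:

```python
"""Sketch of unit-explicit metric helpers. The input (a list of per-request
dicts with millisecond timings and MB memory readings) is assumed for
illustration; the real example may structure results differently."""


def avg_latency_ms(results: list) -> float:
    # Mean end-to-end latency per request; values are assumed to be in milliseconds.
    return sum(r["total_latency"] for r in results) / len(results)


def avg_prefill_latency_ms(results: list) -> float:
    # Mean prefill (time-to-first-token) latency, in milliseconds.
    return sum(r["prefill_latency"] for r in results) / len(results)


def throughput_req_per_s(results: list) -> float:
    # Requests per second: convert the accumulated millisecond total to seconds first.
    total_latency_s = sum(r["total_latency"] for r in results) / 1000.0
    return len(results) / total_latency_s


def peak_mem_usage_mb(results: list) -> float:
    # Peak memory across the run, in MB; the measurement window must be stated explicitly.
    return max(r["mem_usage"] for r in results)
```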