
PRE-TASK: llm-edge-benchmark-suite/single_task_bench example initially not runnable: config / resource / metric issues break the benchmark #250

@NIKOPACK

Description

Brief Introduction

This issue documents, with evidence, that the single_task_bench example is currently not runnable and that its metric logic is distorted, undermining the validity of the benchmark.

Background

Selected example

  • Path: examples/llm-edge-benchmark-suite/single_task_bench
  • Purpose: Edge-side performance benchmarking of LLM inference (latency / throughput / prefill_latency / mem_usage).

Minimal reproduction

From the repo root:

python benchmarking.py -f examples/llm-edge-benchmark-suite/single_task_bench/benchmarkingjob.yaml

Consistently reproducible first errors (triggered multiple times across environments):

ModuleNotFoundError: No module named 'colorlog'
ImportError: cannot import name 'JSONDataParse'
FileNotFoundError: Model file not found: models/qwen/qwen_1_5_0_5b.gguf
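
Before retrying, a quick pre-flight check of the modules named in these errors can save a round trip. A minimal sketch (the module list is taken from the errors above; everything else is illustrative):

import importlib.util

# Check whether the modules whose absence blocks the run are importable.
for mod in ("colorlog", "sedna", "sedna.datasources"):
    try:
        found = importlib.util.find_spec(mod) is not None
    except ModuleNotFoundError:  # raised when the parent package itself is missing
        found = False
    print(f"{mod}: {'found' if found else 'MISSING'}")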

Two key errors are captured in the Evidence section below:

  • Evidence 1: No module named 'sedna' (run is blocked).
  • Evidence 2: cannot import name 'JSONDataParse' (parser missing).

Debugging clues

  1. Running the official entry command reaches core/testenvmanager/dataset/dataset.py; importing sedna.datasources fails immediately (see Evidence 1).
  2. After explicitly setting PYTHONPATH to point to the bundled sedna snapshot, it still fails on missing JSONDataParse (see Evidence 2), showing that the parser itself is not included in the current snapshot.
  3. Reviewing example YAML and algorithm configs reveals dataset paths pointing to other examples, and model paths pointing to non-existent files, further blocking execution.
  4. Metric semantics issues were also observed: the throughput formula divides by a total latency measured in milliseconds, and the basemodel hard-truncates samples with data[:10]. Even if the example ran, these would undermine metric credibility (only the observed behavior is stated here; a unit sketch follows just below).
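
To make clue 4 concrete, here is a minimal unit sketch; the numbers are illustrative assumptions, not measurements taken from the example:

# Illustrative values only; not taken from the example's output.
latencies_ms = [120.0, 95.0, 130.0]            # per-request latency in milliseconds
num_requests = len(latencies_ms)
total_latency_ms = sum(latencies_ms)

# Formula as observed: dividing by a millisecond total yields requests per
# millisecond, roughly 1000x smaller than the intended req/s figure.
throughput_as_observed = num_requests / total_latency_ms            # ~0.0087

# Second-based variant with explicit units (req/s).
throughput_req_per_s = num_requests / (total_latency_ms / 1000.0)   # ~8.7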

Problem categories

  • Configuration: inconsistencies among paths, paradigm, sort directions, and the metrics list; the dataset points to another example's directory.
  • Dependencies: the example neither ships nor declares the sedna parsers and third-party packages it needs to run.
  • Resources: no usable model file and no minimal prompts dataset.
  • Metrics: unit/semantics unclear (formula uses ms directly); sample truncation distorts statistics.
  • Documentation: example README lacks a minimal Install — Run — Troubleshooting guide.

Impact and urgency

  • The example cannot complete a single effective inference; as an “entry example,” it is not runnable for the public.
  • Metric math and sample handling show clear bias; results are not comparable or explainable.
  • New users and contributors cannot reproduce or extend; this directly reduces the value and attractiveness of the LLM edge example to the community.

Conclusion: This is an “example asset health” issue. The core framework (core/*) parsing/scheduling chain works and throws accurate errors. No core code changes are required.

Goals

Community-facing goals:

  • Runnability: the example can run with one command in a fresh environment, without private resources.
  • Metric semantics: unify latency/prefill (ms), throughput (req/s), mem_usage (clearly defined and explainable).
  • Reproducible assets: provide a small, semantically reasonable prompts dataset; allow sample size to be controlled via config/env.
  • Documentation: README and Troubleshooting cover install, run, metric explanation, and common error triage.
  • Anti-regression: provide a smoke test/minimal self-check path for later CI or local validation.

Acceptance criteria:

  • Running the official entry command stably produces all four metrics.
  • Metric units and sort directions match expectations.
  • When the sample size is changed, latency and throughput scale in a plausible relationship (no hidden hard truncation).
  • Steps in docs can be independently reproduced by a first-time user.

Expected Community Contribution

  • Example health repair: make single_task_bench runnable in a fresh environment (fix example-level YAML paths, sorting semantics, and metric unit semantics).
  • Metric semantics clarification: unify latency/prefill (ms), throughput (req/s), and mem_usage definitions, including boundaries and assumptions (documentation focus).
  • Reproducible experiment assets: ship a small JSONL prompts dataset and guidance to configure sample size, lowering the entry barrier.
  • Documentation: enrich README and Troubleshooting to cover install, run, metric explanation, common error triage, and self-check steps.
  • Optional CI / self-check: provide a tiny smoke flow (run a small sample set, assert presence and basic sanity of the four metrics and their units) to help community PRs quickly assess health (a minimal sketch follows this list).
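
A minimal local self-check along those lines could look like the sketch below. The metric names mirror the four metrics discussed in this issue, while the value ranges and the results dict are illustrative assumptions rather than the example's confirmed output schema:

# smoke_check.py: minimal sanity check over a benchmark result dict.
# Metric names and plausibility ranges are assumptions for illustration.
EXPECTED_METRICS = {
    "latency": (0.0, 60_000.0),          # ms per request, loose upper bound
    "prefill_latency": (0.0, 60_000.0),  # ms
    "throughput": (0.0, 10_000.0),       # req/s
    "mem_usage": (0.0, None),            # MB, only checked for non-negativity
}

def check_metrics(results: dict) -> None:
    for name, (low, high) in EXPECTED_METRICS.items():
        assert name in results, f"missing metric: {name}"
        value = float(results[name])
        assert value >= low, f"{name} should be >= {low}, got {value}"
        if high is not None:
            assert value <= high, f"{name} looks implausible: {value}"
    print("smoke check passed:", results)

if __name__ == "__main__":
    # Hypothetical values; replace with metrics parsed from the rank output.
    check_metrics({"latency": 850.0, "prefill_latency": 120.0,
                   "throughput": 1.2, "mem_usage": 512.0})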

High-level Approach

  • Change boundary: fix and complete at the examples layer only; do not modify core/*.
  • Config: correct YAML dataset paths, sort directions, and alignment with the metrics list.
  • Dependencies: in example scope, declare/point to third-party deps; for sedna parser gaps, offer a temporary shim/stub within the example or clearly document optional install sources (this issue does not submit an implementation).
  • Data: provide a tiny JSONL prompts dataset as the minimal working set; make sample size configurable (see the JSONL sketch after this list).
  • Metrics: unify time units; make throughput second-based; clarify mem metric definition and scope.
  • Docs: complete README with the shortest Install — Run — Metrics — Troubleshooting loop.
  • Quality: add a smoke test or minimal validation script guidance to ease quick verification and future regression checks.
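
As a concrete illustration of the data bullet above, a tiny JSONL prompts file could be generated and read as sketched below; the field names ("prompt", "max_tokens") and the file path are assumptions for illustration, not the schema the example's parser currently expects:

import json

# Three short prompts are enough for a minimal runnable loop.
SAMPLE_PROMPTS = [
    {"prompt": "Explain what edge inference is in one sentence.", "max_tokens": 64},
    {"prompt": "List three benefits of running an LLM on-device.", "max_tokens": 64},
    {"prompt": "Summarize: latency and throughput measure different things.", "max_tokens": 64},
]

def write_prompts(path: str) -> None:
    # One JSON object per line, the usual JSONL convention.
    with open(path, "w", encoding="utf-8") as f:
        for item in SAMPLE_PROMPTS:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

def read_prompts(path: str) -> list:
    # Read the JSONL file back, skipping blank lines.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]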

Scope

Expected Users

  • Edge AI/LLM engineers and researchers: need quick inference performance assessment on local/edge devices.
  • Ianvs maintainers and contributors: need a reliable “example health” baseline for regression and showcasing.
  • New Ianvs users: need a reproducible, low-barrier entry example with troubleshooting hints.

In Scope

  • Example assets and configs: problems and semantics in YAML, metrics scripts, and basemodel.py under examples/llm-edge-benchmark-suite/single_task_bench.
  • Minimal data and deps statements: example ships a minimal working dataset (or clearly explains how to obtain it) and lists required deps.
  • Docs: README/Troubleshooting must cover the shortest Install — Run — Metrics — Troubleshooting path.

Out of Scope

  • No changes to ianvs/core/* interfaces or architecture; no new core APIs.
  • No heavy-weight model weights or complex training flows; example can use placeholders or instruct users to provide resources.
  • No discussion of algorithm superiority or model-accuracy methodology (this issue focuses on runnability and clear metric semantics).

Uniqueness vs Existing Issues

Assumptions & Constraints

  • Default to no core framework changes; problem location and improvements are limited to the example and docs.
  • Environment is a typical local dev setup (macOS/Linux) with basic Python.
  • Use a small prompts dataset as the minimal working set; avoid large dependencies or weights.

Evidence

Category | Representative log / snippet | Impact
ImportError | ImportError: cannot import name 'JSONDataParse' | Data parsing init fails; pipeline aborts
Missing model | FileNotFoundError: ... qwen_1_5_0_5b.gguf | Model instantiation fails; inference impossible
Metric formula | avg_throughput = num_requests / total_latency (total_latency in ms) | Values off by ~1000x relative to req/s
Data truncation | data = data[:10] | Sample size hard-capped; distortion and opacity
Sort logic | latency: descend / throughput: ascend | Sort semantics inverted; misleading results
Docs gap | README is only 2 lines | Users lack a runnable path

Evidence 1 — run blocked: missing sedna module

Command and key error (reproducible):

python benchmarking.py -f examples/llm-edge-benchmark-suite/single_task_bench/benchmarkingjob.yaml
Traceback (most recent call last):
    ...
    File ".../core/testenvmanager/dataset/dataset.py", line 21, in <module>
        from sedna.datasources import (
        ...
ModuleNotFoundError: No module named 'sedna'

Evidence 2 — parser missing: JSONDataParse not found

Even after setting PYTHONPATH, the run fails, which shows that sedna is locatable but does not export JSONDataParse:

PYTHONPATH=/.../ianvs/examples/resources:$PYTHONPATH \
python benchmarking.py -f examples/llm-edge-benchmark-suite/single_task_bench/benchmarkingjob.yaml
Traceback (most recent call last):
    ...
    File ".../core/testenvmanager/dataset/dataset.py", line 21, in <module>
        from sedna.datasources import (
        ...
ImportError: cannot import name 'JSONDataParse' from 'sedna.datasources' (.../examples/resources/sedna/datasources/__init__.py)
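
A quick way to confirm what the bundled snapshot actually exports is the diagnostic sketch below (run with the same PYTHONPATH as above; the suffix filter is just a convenience):

import importlib

# Show where sedna.datasources resolves to and which parser classes it provides.
ds = importlib.import_module("sedna.datasources")
print(ds.__file__)
print(sorted(name for name in dir(ds) if name.endswith("DataParse")))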

Detailed Design

examples/llm-edge-benchmark-suite/single_task_bench/
    benchmarkingjob.yaml        # Sort and metrics statements have semantic/consistency issues
    testenv/testenv.yaml        # Dataset path incorrectly points to other examples
    testalgorithms/algorithm.yaml  # Model path and paradigm affect execution
    testalgorithms/basemodel.py # Sample truncation and output schema hurt metric credibility
    testenv/*.py (metrics)      # Throughput formula and memory definition distort results
    (Missing data/model assets and docs)

The core framework call chain (YAML parse → ClassFactory registration → metrics execution) is triggerable but blocked by example-level issues. The problems are concentrated in the “example resources and logic” layer.

Architecture

  • Orchestrator: benchmarking.py parses benchmarkingjob.yaml and assembles TestEnv/TestObject/Rank/Visualization.
  • Dataset: core/testenvmanager/dataset/dataset.py loads data via sedna.datasources parsers (e.g., CSV/TXT/JSONL).
  • Algorithm: example basemodel.py wraps inference and emits unified fields to metrics (e.g., total_latency, prefill_latency, mem_usage, etc.).
  • Metrics plugins: testenv/*.py are invoked as plugins to compute latency/throughput/mem.
  • Result & Rank: per benchmarkingjob.yaml sort directions and metrics list, produce comparisons and visualization.

In this issue, the failure points concentrate on the Dataset/Algorithm/Metrics “example-level implementations and configs.” The Ianvs core orchestration works and surfaces accurate errors.
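
For orientation, metric plugins in other Ianvs examples register through sedna's ClassFactory; a sketch of that pattern for a latency metric follows. The content of y_pred (a dict carrying per-request latencies in milliseconds under "latency_ms") is an assumption for illustration, not this example's confirmed contract:

from sedna.common.class_factory import ClassType, ClassFactory

@ClassFactory.register(ClassType.GENERAL, alias="latency")
def latency(y_true, y_pred):
    # Assumed contract: y_pred carries the wrapper's measurements.
    latencies_ms = y_pred.get("latency_ms", [])
    # Average end-to-end latency per request, reported in milliseconds.
    return sum(latencies_ms) / len(latencies_ms) if latencies_ms else 0.0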

Module details

  • Config (YAML)
    • single_task_bench/benchmarkingjob.yaml: ensure rank sort directions match metric intentions; align selected_dataitem.metrics with actual plugins; keep visualization section consistent with artifact paths.
    • single_task_bench/testenv/testenv.yaml: dataset should point to the example’s own minimal JSONL prompts; avoid referencing other examples.
    • single_task_bench/testalgorithms/algorithm.yaml: model-related fields must match the paradigm; avoid non-existent weight paths.
  • Data and parsing
    • Prefer parsers that Ianvs currently integrates (e.g., JSONL). If the example demands JSON metadata parsing, document the installation source and options explicitly to avoid implicit dependencies on an “incomplete snapshot” in-repo.
    • To avoid heavy resources, ship a tiny prompts set to satisfy the minimal runnable loop and reproducibility.
  • Algorithm wrapper (basemodel.py)
    • Constrain the output schema to meet the metrics' contract; remove opaque sample truncation and expose sample size as a config/env option instead of hard-coding it (see the sketch after this list).
  • Metric scripts (testenv/*.py)
    • Clarify units and semantics (latency/prefill: ms; throughput: req/s; mem: define window and statistics), avoiding ms-based reciprocals and cross-device non-comparability.
  • Documentation
    • Example README/Troubleshooting must provide the shortest Install — Run — Metrics — Troubleshooting path and explicitly list third-party deps and options (e.g., sedna install path).
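
For the sample-truncation point above, one way to replace a hard-coded data[:10] with an explicit, user-controlled cap is sketched below; the environment variable name IANVS_BENCH_MAX_SAMPLES is a hypothetical choice, not an existing option:

import os

def limit_samples(data):
    # No hidden truncation by default; cap only when the user explicitly asks for it.
    max_samples = os.getenv("IANVS_BENCH_MAX_SAMPLES")
    if max_samples is None:
        return data
    return data[: max(int(max_samples), 0)]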
