
PRE-TASK: llm-edge-benchmark-suite/single_task_bench example initially not runnable: config / resource / metric issues break the benchmark #250

@NIKOPACK

Description

Brief Introduction

This issue documents, with evidence, that the single_task_bench example is currently not runnable and that its metric logic is distorted, undermining the validity of the benchmark.

Background

Selected example

  • Path: examples/llm-edge-benchmark-suite/single_task_bench
  • Purpose: Edge-side performance benchmarking of LLM inference (latency / throughput / prefill_latency / mem_usage).

Minimal reproduction

From the repo root:

python benchmarking.py -f examples/llm-edge-benchmark-suite/single_task_bench/benchmarkingjob.yaml

Consistently reproducible first errors (triggered multiple times across environments):

ModuleNotFoundError: No module named 'colorlog'
ImportError: cannot import name 'JSONDataParse'
FileNotFoundError: Model file not found: models/qwen/qwen_1_5_0_5b.gguf
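
Before retrying, a quick pre-flight check of the modules named in these errors can save a round trip. A minimal sketch (the module list is taken from the errors above; everything else is illustrative):

import importlib.util

# Check whether the modules whose absence blocks the run are importable.
for mod in ("colorlog", "sedna", "sedna.datasources"):
    try:
        found = importlib.util.find_spec(mod) is not None
    except ModuleNotFoundError:  # raised when the parent package itself is missing
        found = False
    print(f"{mod}: {'found' if found else 'MISSING'}")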

Two key errors are captured in the Evidence section below:

  • Evidence 1: No module named 'sedna' (run is blocked).
  • Evidence 2: cannot import name 'JSONDataParse' (parser missing).

Debugging clues

  1. Running the official entry command reaches core/testenvmanager/dataset/dataset.py; importing sedna.datasources fails immediately (see Evidence 1).
  2. After explicitly setting PYTHONPATH to point to the bundled sedna snapshot, it still fails on missing JSONDataParse (see Evidence 2), showing that the parser itself is not included in the current snapshot.
  3. Reviewing example YAML and algorithm configs reveals dataset paths pointing to other examples, and model paths pointing to non-existent files, further blocking execution.
  4. Metric semantics issues were also observed: the throughput formula divides by a total latency measured in milliseconds, and the basemodel hard-truncates samples with data[:10]. Even if the example ran, these would undermine metric credibility (only the observed behavior is stated here; a unit sketch follows just below).
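
To make clue 4 concrete, here is a minimal unit sketch; the numbers are illustrative assumptions, not measurements taken from the example:

# Illustrative values only; not taken from the example's output.
latencies_ms = [120.0, 95.0, 130.0]            # per-request latency in milliseconds
num_requests = len(latencies_ms)
total_latency_ms = sum(latencies_ms)

# Formula as observed: dividing by a millisecond total yields requests per
# millisecond, roughly 1000x smaller than the intended req/s figure.
throughput_as_observed = num_requests / total_latency_ms            # ~0.0087

# Second-based variant with explicit units (req/s).
throughput_req_per_s = num_requests / (total_latency_ms / 1000.0)   # ~8.7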

Problem categories

  • Configuration: inconsistencies among paths, paradigm, sort directions, and the metrics list; the dataset points to another example's directory.
  • Dependencies: the example neither ships nor declares the sedna parsers and third-party packages it needs to run.
  • Resources: no usable model file and no minimal prompts dataset.
  • Metrics: unit/semantics unclear (formula uses ms directly); sample truncation distorts statistics.
  • Documentation: example README lacks a minimal Install — Run — Troubleshooting guide.

Impact and urgency

  • The example cannot complete a single effective inference; as an “entry example,” it is not runnable for the public.
  • Metric math and sample handling show clear bias; results are not comparable or explainable.
  • New users and contributors cannot reproduce or extend; this directly reduces the value and attractiveness of the LLM edge example to the community.

Conclusion: This is an “example asset health” issue. The core framework (core/*) parsing/scheduling chain works and throws accurate errors. No core code changes are required.

Goals

Community-facing goals:

  • Runnability: the example can run with one command in a fresh environment, without private resources.
  • Metric semantics: unify latency/prefill (ms), throughput (req/s), mem_usage (clearly defined and explainable).
  • Reproducible assets: provide a small, semantically reasonable prompts dataset; allow sample size to be controlled via config/env.
  • Documentation: README and Troubleshooting cover install, run, metric explanation, and common error triage.
  • Anti-regression: provide a smoke test/minimal self-check path for later CI or local validation.

Acceptance criteria:

  • Running the official entry command stably produces all four metrics.
  • Metric units and sort directions match expectations.
  • When the sample size is changed, latency and throughput scale in a plausible relationship (no hidden hard truncation).
  • Steps in docs can be independently reproduced by a first-time user.

Expected Community Contribution

  • Example health repair: make single_task_bench runnable in a fresh environment (fix example-level YAML paths, sorting semantics, and metric unit semantics).
  • Metric semantics clarification: unify latency/prefill (ms), throughput (req/s), and mem_usage definitions, including boundaries and assumptions (documentation focus).
  • Reproducible experiment assets: ship a small JSONL prompts dataset and guidance to configure sample size, lowering the entry barrier.
  • Documentation: enrich README and Troubleshooting to cover install, run, metric explanation, common error triage, and self-check steps.
  • Optional CI / self-check: provide a tiny smoke flow (run a small sample set, assert presence and basic sanity of the four metrics and their units) to help community PRs quickly assess health (a minimal sketch follows this list).
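
A minimal local self-check along those lines could look like the sketch below. The metric names mirror the four metrics discussed in this issue, while the value ranges and the results dict are illustrative assumptions rather than the example's confirmed output schema:

# smoke_check.py: minimal sanity check over a benchmark result dict.
# Metric names and plausibility ranges are assumptions for illustration.
EXPECTED_METRICS = {
    "latency": (0.0, 60_000.0),          # ms per request, loose upper bound
    "prefill_latency": (0.0, 60_000.0),  # ms
    "throughput": (0.0, 10_000.0),       # req/s
    "mem_usage": (0.0, None),            # MB, only checked for non-negativity
}

def check_metrics(results: dict) -> None:
    for name, (low, high) in EXPECTED_METRICS.items():
        assert name in results, f"missing metric: {name}"
        value = float(results[name])
        assert value >= low, f"{name} should be >= {low}, got {value}"
        if high is not None:
            assert value <= high, f"{name} looks implausible: {value}"
    print("smoke check passed:", results)

if __name__ == "__main__":
    # Hypothetical values; replace with metrics parsed from the rank output.
    check_metrics({"latency": 850.0, "prefill_latency": 120.0,
                   "throughput": 1.2, "mem_usage": 512.0})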

High-level Approach

  • Change boundary: fix and complete at the examples layer only; do not modify core/*.
  • Config: correct YAML dataset paths, sort directions, and alignment with the metrics list.
  • Dependencies: in example scope, declare/point to third-party deps; for sedna parser gaps, offer a temporary shim/stub within the example or clearly document optional install sources (this issue does not submit an implementation).
  • Data: provide a tiny JSONL prompts dataset as the minimal working set; make sample size configurable (see the JSONL sketch after this list).
  • Metrics: unify time units; make throughput second-based; clarify mem metric definition and scope.
  • Docs: complete README with the shortest Install — Run — Metrics — Troubleshooting loop.
  • Quality: add a smoke test or minimal validation script guidance to ease quick verification and future regression checks.
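
As a concrete illustration of the data bullet above, a tiny JSONL prompts file could be generated and read as sketched below; the field names ("prompt", "max_tokens") and the file path are assumptions for illustration, not the schema the example's parser currently expects:

import json

# Three short prompts are enough for a minimal runnable loop.
SAMPLE_PROMPTS = [
    {"prompt": "Explain what edge inference is in one sentence.", "max_tokens": 64},
    {"prompt": "List three benefits of running an LLM on-device.", "max_tokens": 64},
    {"prompt": "Summarize: latency and throughput measure different things.", "max_tokens": 64},
]

def write_prompts(path: str) -> None:
    # One JSON object per line, the usual JSONL convention.
    with open(path, "w", encoding="utf-8") as f:
        for item in SAMPLE_PROMPTS:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

def read_prompts(path: str) -> list:
    # Read the JSONL file back, skipping blank lines.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]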

Scope

Expected Users

  • Edge AI/LLM engineers and researchers: need quick inference performance assessment on local/edge devices.
  • Ianvs maintainers and contributors: need a reliable “example health” baseline for regression and showcasing.
  • New Ianvs users: need a reproducible, low-barrier entry example with troubleshooting hints.

In Scope

  • Example assets and configs: problems and semantics in YAML, metrics scripts, and basemodel.py under examples/llm-edge-benchmark-suite/single_task_bench.
  • Minimal data and deps statements: example ships a minimal working dataset (or clearly explains how to obtain it) and lists required deps.
  • Docs: README/Troubleshooting must cover the shortest Install — Run — Metrics — Troubleshooting path.

Out of Scope

  • No changes to ianvs/core/* interfaces or architecture; no new core APIs.
  • No heavy-weight model weights or complex training flows; example can use placeholders or instruct users to provide resources.
  • No discussion of algorithm superiority or model-accuracy methodology (this issue focuses on runnability and clear metric semantics).

Uniqueness vs Existing Issues

Assumptions & Constraints

  • Default to no core framework changes; problem location and improvements are limited to the example and docs.
  • Environment is a typical local dev setup (macOS/Linux) with basic Python.
  • Use a small prompts dataset as the minimal working set; avoid large dependencies or weights.

Evidence

Category | Representative log / snippet | Impact
ImportError | ImportError: cannot import name 'JSONDataParse' | Data parsing init fails; pipeline aborts
Missing model | FileNotFoundError: ... qwen_1_5_0_5b.gguf | Model instantiation fails; inference impossible
Metric formula | avg_throughput = num_requests / total_latency (total_latency in ms) | Values off by ~1000x relative to req/s
Data truncation | data = data[:10] | Sample size hard-capped; distortion and opacity
Sort logic | latency: descend / throughput: ascend | Sort semantics inverted; misleading results
Docs gap | README is only 2 lines | Users lack a runnable path

Evidence 1 — run blocked: missing sedna module

Command and key error (reproducible):

python benchmarking.py -f examples/llm-edge-benchmark-suite/single_task_bench/benchmarkingjob.yaml
Traceback (most recent call last):
    ...
    File ".../core/testenvmanager/dataset/dataset.py", line 21, in <module>
        from sedna.datasources import (
        ...
ModuleNotFoundError: No module named 'sedna'

Evidence 2 — parser missing: JSONDataParse not found

Even after setting PYTHONPATH, the run fails, which shows that sedna is locatable but does not export JSONDataParse:

PYTHONPATH=/.../ianvs/examples/resources:$PYTHONPATH \
python benchmarking.py -f examples/llm-edge-benchmark-suite/single_task_bench/benchmarkingjob.yaml
Traceback (most recent call last):
    ...
    File ".../core/testenvmanager/dataset/dataset.py", line 21, in <module>
        from sedna.datasources import (
        ...
ImportError: cannot import name 'JSONDataParse' from 'sedna.datasources' (.../examples/resources/sedna/datasources/__init__.py)
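
A quick way to confirm what the bundled snapshot actually exports is the diagnostic sketch below (run with the same PYTHONPATH as above; the suffix filter is just a convenience):

import importlib

# Show where sedna.datasources resolves to and which parser classes it provides.
ds = importlib.import_module("sedna.datasources")
print(ds.__file__)
print(sorted(name for name in dir(ds) if name.endswith("DataParse")))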

Detailed Design

examples/llm-edge-benchmark-suite/single_task_bench/
    benchmarkingjob.yaml        # Sort and metrics statements have semantic/consistency issues
    testenv/testenv.yaml        # Dataset path incorrectly points to other examples
    testalgorithms/algorithm.yaml  # Model path and paradigm affect execution
    testalgorithms/basemodel.py # Sample truncation and output schema hurt metric credibility
    testenv/*.py (metrics)      # Throughput formula and memory definition distort results
    (Missing data/model assets and docs)

The core framework call chain (YAML parse → ClassFactory registration → metrics execution) is triggerable but blocked by example-level issues. The problems are concentrated in the “example resources and logic” layer.

Architecture

  • Orchestrator: benchmarking.py parses benchmarkingjob.yaml and assembles TestEnv/TestObject/Rank/Visualization.
  • Dataset: core/testenvmanager/dataset/dataset.py loads data via sedna.datasources parsers (e.g., CSV/TXT/JSONL).
  • Algorithm: example basemodel.py wraps inference and emits unified fields to metrics (e.g., total_latency, prefill_latency, mem_usage, etc.).
  • Metrics plugins: testenv/*.py are invoked as plugins to compute latency/throughput/mem.
  • Result & Rank: per benchmarkingjob.yaml sort directions and metrics list, produce comparisons and visualization.

In this issue, the failure points concentrate on the Dataset/Algorithm/Metrics “example-level implementations and configs.” The Ianvs core orchestration works and surfaces accurate errors.
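
For orientation, metric plugins in other Ianvs examples register through sedna's ClassFactory; a sketch of that pattern for a latency metric follows. The content of y_pred (a dict carrying per-request latencies in milliseconds under "latency_ms") is an assumption for illustration, not this example's confirmed contract:

from sedna.common.class_factory import ClassType, ClassFactory

@ClassFactory.register(ClassType.GENERAL, alias="latency")
def latency(y_true, y_pred):
    # Assumed contract: y_pred carries the wrapper's measurements.
    latencies_ms = y_pred.get("latency_ms", [])
    # Average end-to-end latency per request, reported in milliseconds.
    return sum(latencies_ms) / len(latencies_ms) if latencies_ms else 0.0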

Module details

  • Config (YAML)
    • single_task_bench/benchmarkingjob.yaml: ensure rank sort directions match metric intentions; align selected_dataitem.metrics with actual plugins; keep visualization section consistent with artifact paths.
    • single_task_bench/testenv/testenv.yaml: dataset should point to the example’s own minimal JSONL prompts; avoid referencing other examples.
    • single_task_bench/testalgorithms/algorithm.yaml: model-related fields must match the paradigm; avoid non-existent weight paths.
  • Data and parsing
    • Prefer parsers that Ianvs currently integrates (e.g., JSONL). If the example demands JSON metadata parsing, document the installation source and options explicitly to avoid implicit dependencies on an “incomplete snapshot” in-repo.
    • To avoid heavy resources, ship a tiny prompts set to satisfy the minimal runnable loop and reproducibility.
  • Algorithm wrapper (basemodel.py)
    • Constrain the output schema to meet the metrics' contract; remove opaque sample truncation and expose sample size as a config/env option instead of hard-coding it (see the sketch after this list).
  • Metric scripts (testenv/*.py)
    • Clarify units and semantics (latency/prefill: ms; throughput: req/s; mem: define window and statistics), avoiding ms-based reciprocals and cross-device non-comparability.
  • Documentation
    • Example README/Troubleshooting must provide the shortest Install — Run — Metrics — Troubleshooting path and explicitly list third-party deps and options (e.g., sedna install path).
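
For the sample-truncation point above, one way to replace a hard-coded data[:10] with an explicit, user-controlled cap is sketched below; the environment variable name IANVS_BENCH_MAX_SAMPLES is a hypothetical choice, not an existing option:

import os

def limit_samples(data):
    # No hidden truncation by default; cap only when the user explicitly asks for it.
    max_samples = os.getenv("IANVS_BENCH_MAX_SAMPLES")
    if max_samples is None:
        return data
    return data[: max(int(max_samples), 0)]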
