# 🔬 Improve Experiment Infrastructure

This is a sub-issue of the `daily-news` experiment campaign issue, proposing concrete improvements to the `gh-aw` experiment infrastructure across three areas: frontmatter schema, reporting/dashboards, and audit/observability integration.
## Area 1: Frontmatter Schema Enhancements

### Current State

The compiler (`pkg/workflow/compiler_experiments.go`) already accepts the rich object form with `variants`, `description`, `metric`, `weight`, `start_date`, `end_date`, and `issue`. However, several useful fields are missing:
- No `secondary_metrics` list (only one primary `metric` string)
- No `guardrail_metrics` to define thresholds that must not degrade
- No `min_samples` to declare the required sample size per variant
- No `hypothesis` text (forces researchers to write it only in the tracking issue)
- No `owner` field to tag the team/person responsible
### Proposed Enhanced Schema

```yaml
experiments:
  prompt_style:
    variants: [concise, detailed]
    description: "Test whether a concise prompt reduces token cost without quality loss"
    hypothesis: "H0: no change in effective_tokens. H1: concise reduces tokens by >=15%"
    metric: effective_tokens
    secondary_metrics: [duration_ms, discussion_word_count]
    guardrail_metrics:
      - name: success_rate
        threshold: ">=0.95"
      - name: empty_output_rate
        threshold: "==0"
    weight: [50, 50]
    min_samples: 25
    start_date: "2026-05-05"
    end_date: "2026-07-25"
    issue: 1234
    owner: "@team-agents"
```
### Implementation Changes

- Extend the `ExperimentConfig` struct in `compiler_experiments.go` to add `Hypothesis string`, `SecondaryMetrics []string`, `GuardrailMetrics []GuardrailMetric`, `MinSamples int`, and `Owner string`.
- Update `extractOneExperimentConfig` to parse the new fields.
- Update `pick_experiment.cjs` to write `hypothesis`, `secondary_metrics`, `guardrail_metrics`, `min_samples`, and `owner` into `state.json` so downstream reporting can consume them.
- Surface `hypothesis` and `guardrail_metrics` in the Markdown step summary written by `writeSummary()`.
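The struct extension above can be sketched in Go. The field layout of the new `GuardrailMetric` type and the `evalThreshold` helper are illustrative assumptions, not existing code in `compiler_experiments.go`:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// GuardrailMetric mirrors the proposed frontmatter object: a metric name
// plus a threshold expression such as ">=0.95" or "==0".
type GuardrailMetric struct {
	Name      string
	Threshold string
}

// ExperimentConfig sketches the extension; only the new fields proposed in
// this issue are shown alongside the existing Metric field.
type ExperimentConfig struct {
	Metric           string
	Hypothesis       string
	SecondaryMetrics []string
	GuardrailMetrics []GuardrailMetric
	MinSamples       int
	Owner            string
}

// evalThreshold is a hypothetical helper: it checks an observed value
// against a threshold expression like ">=0.95", "<=10", or "==0".
func evalThreshold(expr string, value float64) (bool, error) {
	for _, op := range []string{">=", "<=", "==", ">", "<"} {
		if strings.HasPrefix(expr, op) {
			want, err := strconv.ParseFloat(strings.TrimSpace(expr[len(op):]), 64)
			if err != nil {
				return false, err
			}
			switch op {
			case ">=":
				return value >= want, nil
			case "<=":
				return value <= want, nil
			case "==":
				return value == want, nil
			case ">":
				return value > want, nil
			default:
				return value < want, nil
			}
		}
	}
	return false, fmt.Errorf("unsupported threshold: %q", expr)
}

func main() {
	cfg := ExperimentConfig{
		Metric:     "effective_tokens",
		MinSamples: 25,
		GuardrailMetrics: []GuardrailMetric{
			{Name: "success_rate", Threshold: ">=0.95"},
			{Name: "empty_output_rate", Threshold: "==0"},
		},
	}
	observed := map[string]float64{"success_rate": 0.97, "empty_output_rate": 0}
	for _, g := range cfg.GuardrailMetrics {
		ok, _ := evalThreshold(g.Threshold, observed[g.Name])
		fmt.Printf("%s: pass=%v\n", g.Name, ok)
	}
}
```

Keeping the threshold as a string in the schema (rather than separate operator/value keys) keeps the frontmatter compact, at the cost of needing this small parser.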
## Area 2: Reporting & Dashboards

### Current State

`daily-experiment-report.md` already exists and likely aggregates some experiment data, but there is no automated significance detection or visual comparison.
### Proposed `daily-experiment-report` Enhancements

#### Step 1 - Aggregate experiment artifacts

Download all `experiment` artifacts for each workflow that has an active experiment, using `gh run list` and `gh run download`:
```sh
gh run list --workflow="daily-news.lock.yml" --limit 100 --json databaseId \
  | jq -r '.[].databaseId' \
  | xargs -I{} gh run download {} -n "experiment" -D /tmp/exp/{}
```
#### Step 2 - Compute running statistics per variant

For each experiment, read `assignments.json` from each run artifact and cross-reference it with the effective token count from the run log or OTEL span. Compute:

- `n` (sample size per variant)
- `mean` and `variance` of the primary metric per variant
- p-value via Welch's t-test
- guardrail status (pass/fail per threshold)
#### Step 3 - Detect significance

When `p < 0.05` and `n >= min_samples` for all variants, mark the experiment as ready for promotion and post a notification comment to the tracking issue.
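The promotion rule can be captured in a small predicate; the name and signature are illustrative, and guardrail pass/fail would still be checked separately before actually promoting:

```go
package main

import "fmt"

// readyForPromotion applies the proposed rule: significance (p < 0.05)
// plus every variant reaching min_samples.
func readyForPromotion(p float64, counts map[string]int, minSamples int) bool {
	if p >= 0.05 {
		return false
	}
	for _, n := range counts {
		if n < minSamples {
			return false
		}
	}
	return true
}

func main() {
	counts := map[string]int{"concise": 25, "detailed": 27}
	fmt.Println(readyForPromotion(0.031, counts, 25))
}
```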
#### Step 4 - ASCII comparison table

Emit an artifact like:

```
Experiment   | Variant  | n  | Mean tokens | p-value
prompt_style | detailed | 14 | 48 320      | 0.031
             | concise  | 14 | 39 870      |
```
#### Step 5 - Post to discussion

Create or update a pinned "Experiment Dashboard" discussion in the `audits` category with the latest stats table, significance status, and a recommendation (promote / continue / abort).
## Area 3: Audit & OTEL Observability Integration

### Current State

`pick_experiment.cjs` writes `assignments.json` to `/tmp/gh-aw/experiments/`, but this data is not yet surfaced in OTEL spans or `gh aw audit` output.
### Proposed Changes

#### 3a - OTEL span attributes

In the shared OTEL step (`shared/observability-otlp.md`), read `/tmp/gh-aw/experiments/assignments.json` (if present) and add one span attribute per experiment:

```
gh_aw.experiment.prompt_style = "concise"
gh_aw.experiment.names = "prompt_style"
```

This enables filtering in Datadog/Honeycomb to compare failure rates, latency, and token usage by variant.
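A sketch of the attribute mapping, assuming `assignments.json` is a flat `{"experiment": "variant"}` object; both the file layout and the `spanAttrs` helper are assumptions about the eventual implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// spanAttrs turns the assignments.json payload into the proposed
// gh_aw.experiment.* span attributes, plus a names attribute listing
// every active experiment.
func spanAttrs(assignmentsJSON []byte) (map[string]string, error) {
	var assignments map[string]string
	if err := json.Unmarshal(assignmentsJSON, &assignments); err != nil {
		return nil, err
	}
	attrs := make(map[string]string)
	names := make([]string, 0, len(assignments))
	for name, variant := range assignments {
		attrs["gh_aw.experiment."+name] = variant
		names = append(names, name)
	}
	sort.Strings(names) // deterministic order for the names attribute
	attrs["gh_aw.experiment.names"] = strings.Join(names, ",")
	return attrs, nil
}

func main() {
	attrs, _ := spanAttrs([]byte(`{"prompt_style": "concise"}`))
	fmt.Println(attrs["gh_aw.experiment.prompt_style"], attrs["gh_aw.experiment.names"])
}
```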
#### 3b - `gh aw audit` output

Extend the audit command to show experiment assignments in the run detail view:

```
Run #25240168844  daily-news  success  2026-05-02T09:03Z
  Duration: 4m 12s   Tokens: 39 870
  Experiments: prompt_style=concise
```

This lets engineers filter audit logs by variant: `gh aw audit --experiment prompt_style=concise`.
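The proposed `--experiment` filter could be implemented roughly as follows; `matchesExperimentFilter` is a hypothetical helper sketching the `name=variant` flag syntax:

```go
package main

import (
	"fmt"
	"strings"
)

// matchesExperimentFilter checks a run's experiment assignments against a
// --experiment flag value of the form "name=variant".
func matchesExperimentFilter(assignments map[string]string, filter string) bool {
	name, variant, ok := strings.Cut(filter, "=")
	if !ok {
		return false // malformed filter matches nothing
	}
	return assignments[name] == variant
}

func main() {
	run := map[string]string{"prompt_style": "concise"}
	fmt.Println(matchesExperimentFilter(run, "prompt_style=concise"))
	fmt.Println(matchesExperimentFilter(run, "prompt_style=detailed"))
}
```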
#### 3c - Step summary enrichment

Update `writeSummary()` in `pick_experiment.cjs` to also output:

- Guardrail metric thresholds
- `min_samples` required vs. current counts
- A progress bar per variant: `detailed: -------- 14/25 (56%)`
- A "ready for analysis" flag when `n >= min_samples` for all variants
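The progress line could be rendered along these lines (a sketch; the bar characters and integer rounding are cosmetic choices, and the real implementation would live in `pick_experiment.cjs`):

```go
package main

import "fmt"

// progressLine renders a per-variant progress entry such as
// "detailed: #####----- 14/25 (56%)".
func progressLine(variant string, n, minSamples int) string {
	pct := n * 100 / minSamples
	filled := pct / 10
	if filled > 10 {
		filled = 10 // cap the bar once min_samples is exceeded
	}
	bar := ""
	for i := 0; i < 10; i++ {
		if i < filled {
			bar += "#"
		} else {
			bar += "-"
		}
	}
	return fmt.Sprintf("%s: %s %d/%d (%d%%)", variant, bar, n, minSamples, pct)
}

func main() {
	fmt.Println(progressLine("detailed", 14, 25))
}
```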
#### 3d - Experiment lifecycle labels

Automate GitHub label management on tracking issues:

- `experiment:active` when `start_date <= today <= end_date`
- `experiment:ready-for-analysis` when all variants hit `min_samples`
- `experiment:concluded` after the winning variant is promoted
## Implementation Steps

1. Extend the `ExperimentConfig` struct and parser for the new schema fields (`hypothesis`, `secondary_metrics`, `guardrail_metrics`, `min_samples`, `owner`)
2. Update `pick_experiment.cjs` to write the new fields to `state.json` and enrich the step summary
3. Add experiment span attributes in `shared/observability-otlp.md`
4. Extend `gh aw audit` to show and filter by experiment variant
5. Enhance `daily-experiment-report.md` to aggregate artifacts, compute significance, and post the dashboard to a discussion
## References

- `pkg/workflow/compiler_experiments.go` - experiment config parsing and step generation
- `actions/setup/js/pick_experiment.cjs` - variant selection, state persistence, step summary
- `.github/workflows/daily-experiment-report.md` - existing report workflow to enhance
- `.github/workflows/shared/observability-otlp.md` - OTEL integration point
Generated by Daily A/B Testing Advisor