[ab-advisor] Improve experiment infrastructure: schema, reporting & audit #29661


🔬 Improve Experiment Infrastructure

This is a sub-issue of the daily-news experiment campaign issue, proposing concrete improvements to the gh-aw experiment infrastructure across three areas: frontmatter schema, reporting/dashboards, and audit/observability integration.


Area 1: Frontmatter Schema Enhancements

Current State

The compiler (pkg/workflow/compiler_experiments.go) already accepts the rich object form with variants, description, metric, weight, start_date, end_date, and issue. However, several useful fields are missing:

  • No secondary_metrics list (only one primary metric string)
  • No guardrail_metrics to define thresholds that must not degrade
  • No min_samples to declare the required sample size per variant
  • No hypothesis text (forces researchers to write it only in the tracking issue)
  • No owner field to tag the team/person responsible

Proposed Enhanced Schema

```yaml
experiments:
  prompt_style:
    variants: [concise, detailed]
    description: "Test whether a concise prompt reduces token cost without quality loss"
    hypothesis: "H0: no change in effective_tokens. H1: concise reduces tokens by >=15%"
    metric: effective_tokens
    secondary_metrics: [duration_ms, discussion_word_count]
    guardrail_metrics:
      - name: success_rate
        threshold: ">=0.95"
      - name: empty_output_rate
        threshold: "==0"
    weight: [50, 50]
    min_samples: 25
    start_date: "2026-05-05"
    end_date: "2026-07-25"
    issue: 1234
    owner: "@team-agents"
```

Implementation Changes

  1. Extend ExperimentConfig struct in compiler_experiments.go to add Hypothesis string, SecondaryMetrics []string, GuardrailMetrics []GuardrailMetric, MinSamples int, Owner string.
  2. Update extractOneExperimentConfig to parse the new fields.
  3. Update pick_experiment.cjs to write hypothesis, secondary_metrics, guardrail_metrics, min_samples, and owner into state.json so downstream reporting can consume them.
  4. Surface hypothesis and guardrail_metrics in the Markdown step summary written by writeSummary().

Area 2: Reporting & Dashboards

Current State

daily-experiment-report.md already exists and likely aggregates some experiment data, but there is no automated significance detection or visual comparison.

Proposed daily-experiment-report Enhancements

Step 1 - Aggregate experiment artifacts

Download all experiment artifacts for each workflow that has an active experiment using gh run list and gh run download:

```bash
gh run list --workflow="daily-news.lock.yml" --limit 100 --json databaseId \
  | jq -r '.[].databaseId' \
  | xargs -I{} gh run download {} -n "experiment" -D "/tmp/exp/{}"
```

Step 2 - Compute running statistics per variant

For each experiment, read assignments.json from each run artifact and cross-reference with the effective token count from the run log or OTEL span. Compute:

  • n (sample size per variant)
  • mean, variance of primary metric per variant
  • p-value via Welch t-test
  • guardrail status (pass/fail per threshold)
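The per-variant statistics above can be sketched in plain Node with no dependencies. Note that the p-value below uses a normal approximation to the t distribution, which is only reasonable once samples approach min_samples (~25); a production report should use a proper t CDF or a stats library:

```javascript
// Welch's t-test sketch: a and b are per-variant arrays of the primary metric.
function meanVar(xs) {
  const n = xs.length;
  const mean = xs.reduce((a, b) => a + b, 0) / n;
  const variance = xs.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1);
  return { n, mean, variance };
}

// Abramowitz-Stegun erf approximation (max error ~1.5e-7).
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  x = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * x);
  const y = 1 - (((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t) * Math.exp(-x * x);
  return sign * y;
}

function welch(a, b) {
  const A = meanVar(a), B = meanVar(b);
  const se2 = A.variance / A.n + B.variance / B.n;
  const t = (A.mean - B.mean) / Math.sqrt(se2);
  // Welch-Satterthwaite degrees of freedom
  const df = se2 ** 2 /
    ((A.variance / A.n) ** 2 / (A.n - 1) + (B.variance / B.n) ** 2 / (B.n - 1));
  // Two-sided p via normal approximation (adequate only for larger n).
  const p = 2 * (1 - 0.5 * (1 + erf(Math.abs(t) / Math.SQRT2)));
  return { t, df, p };
}
```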

Step 3 - Detect significance

When p < 0.05 AND n >= min_samples for all variants, mark the experiment as ready for promotion. Post a notification comment to the tracking issue.
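The promotion decision combines the p-value, per-variant counts, and guardrail thresholds. A hypothetical evaluator for threshold strings of the form ">=0.95" / "==0" (the field names mirror the proposed schema; the function names are assumptions):

```javascript
// Parse and apply guardrail threshold strings such as ">=0.95" or "==0".
function checkThreshold(threshold, value) {
  const m = threshold.match(/^(>=|<=|==|>|<)\s*([\d.]+)$/);
  if (!m) throw new Error(`unparseable threshold: ${threshold}`);
  const [, op, rawLimit] = m;
  const limit = Number(rawLimit);
  switch (op) {
    case '>=': return value >= limit;
    case '<=': return value <= limit;
    case '==': return value === limit;
    case '>':  return value > limit;
    case '<':  return value < limit;
  }
}

// Ready for promotion when p < 0.05, every variant has enough samples,
// and every guardrail holds.
function readyForPromotion({ pValue, counts, minSamples, guardrails }) {
  const enoughSamples = Object.values(counts).every((n) => n >= minSamples);
  const guardrailsPass = guardrails.every((g) => checkThreshold(g.threshold, g.observed));
  return pValue < 0.05 && enoughSamples && guardrailsPass;
}
```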

Step 4 - ASCII comparison table

Emit an artifact like:

```
Experiment   | Variant  | n  | Mean tokens | p-value
-------------|----------|----|-------------|--------
prompt_style | detailed | 14 | 48 320      | 0.031
             | concise  | 14 | 39 870      |
```

Step 5 - Post to discussion

Create or update a pinned Experiment Dashboard discussion in the audits category with the latest stats table, significance status, and a recommendation (promote / continue / abort).


Area 3: Audit & OTEL Observability Integration

Current State

pick_experiment.cjs writes assignments.json to /tmp/gh-aw/experiments/ but this data is not yet surfaced in OTEL spans or gh aw audit output.

Proposed Changes

3a - OTEL span attributes

In the shared OTEL step (shared/observability-otlp.md), read /tmp/gh-aw/experiments/assignments.json (if present) and add one span attribute per experiment:

```
gh_aw.experiment.prompt_style = "concise"
gh_aw.experiment.names = "prompt_style"
```

This enables filtering in Datadog/Honeycomb to compare failure rates, latency, and token usage by variant.

3b - gh aw audit output

Extend the audit command to show experiment assignments in the run detail view:

```
Run #25240168844  daily-news  success  2026-05-02T09:03Z
  Duration: 4m 12s   Tokens: 39 870
  Experiments: prompt_style=concise
```

This lets engineers filter audit logs by variant: gh aw audit --experiment prompt_style=concise.

3c - Step summary enrichment

Update writeSummary() in pick_experiment.cjs to also output:

  • Guardrail metric thresholds
  • min_samples required vs. current counts
  • A progress bar per variant: detailed: -------- 14/25 (56%)
  • A "ready for analysis" flag when n >= min_samples for all variants
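The progress line could be rendered along these lines (bar width and fill characters are illustrative choices, not part of the proposal):

```javascript
// Render one per-variant progress line, e.g. "detailed: ####---- 14/25 (56%)".
function progressLine(variant, n, minSamples, width = 8) {
  const pct = Math.min(100, Math.round((n / minSamples) * 100));
  const filled = Math.min(width, Math.round((n / minSamples) * width));
  const bar = '#'.repeat(filled) + '-'.repeat(width - filled);
  return `${variant}: ${bar} ${n}/${minSamples} (${pct}%)`;
}
```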

3d - Experiment lifecycle labels

Automate GitHub label management on tracking issues:

  • experiment:active when start_date <= today <= end_date
  • experiment:ready-for-analysis when all variants hit min_samples
  • experiment:concluded after the winning variant is promoted
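A hypothetical label-selection rule implementing the precedence implied above (concluded > ready-for-analysis > active); dates are the frontmatter ISO strings, which compare correctly as plain strings:

```javascript
// Pick the single lifecycle label a tracking issue should carry right now.
// counts maps variant -> current sample size; concluded is set once a winner
// has been promoted.
function lifecycleLabel({ startDate, endDate, today, counts, minSamples, concluded }) {
  if (concluded) return 'experiment:concluded';
  const allReady = Object.values(counts).every((n) => n >= minSamples);
  if (allReady) return 'experiment:ready-for-analysis';
  if (startDate <= today && today <= endDate) return 'experiment:active';
  return null; // outside the window and not yet ready: no lifecycle label
}
```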

Implementation Steps

  • Extend ExperimentConfig struct and parser for new schema fields (hypothesis, secondary_metrics, guardrail_metrics, min_samples, owner)
  • Update pick_experiment.cjs to write new fields to state.json and enrich step summary
  • Add OTEL span attribute injection for experiment assignments in shared/observability-otlp.md
  • Extend gh aw audit to show and filter by experiment variant
  • Enhance daily-experiment-report.md to aggregate artifacts, compute significance, and post dashboard to discussion
  • Add lifecycle label automation to tracking issues

References

  • pkg/workflow/compiler_experiments.go - experiment config parsing and step generation
  • actions/setup/js/pick_experiment.cjs - variant selection, state persistence, step summary
  • .github/workflows/daily-experiment-report.md - existing report workflow to enhance
  • .github/workflows/shared/observability-otlp.md - OTEL integration point

Generated by Daily A/B Testing Advisor
