# 🔬 Improve Experiment Infrastructure

This is a sub-issue of the `daily-news` experiment campaign issue, proposing concrete improvements to the `gh-aw` experiment infrastructure across three areas: frontmatter schema, reporting/dashboards, and audit/observability integration.
## Area 1: Frontmatter Schema Enhancements

### Current State

The compiler (`pkg/workflow/compiler_experiments.go`) already accepts the rich object form with `variants`, `description`, `metric`, `weight`, `start_date`, `end_date`, and `issue`. However, several useful fields are missing:
- No `secondary_metrics` list (only one primary `metric` string)
- No `guardrail_metrics` to define thresholds that must not degrade
- No `min_samples` to declare the required sample size per variant
- No `hypothesis` text (forces researchers to write it only in the tracking issue)
- No `owner` field to tag the team/person responsible
### Proposed Enhanced Schema

```yaml
experiments:
  prompt_style:
    variants: [concise, detailed]
    description: "Test whether a concise prompt reduces token cost without quality loss"
    hypothesis: "H0: no change in effective_tokens. H1: concise reduces tokens by >=15%"
    metric: effective_tokens
    secondary_metrics: [duration_ms, discussion_word_count]
    guardrail_metrics:
      - name: success_rate
        threshold: ">=0.95"
      - name: empty_output_rate
        threshold: "==0"
    weight: [50, 50]
    min_samples: 25
    start_date: "2026-05-05"
    end_date: "2026-07-25"
    issue: 1234
    owner: "@team-agents"
```
### Implementation Changes

- Extend the `ExperimentConfig` struct in `compiler_experiments.go` to add `Hypothesis string`, `SecondaryMetrics []string`, `GuardrailMetrics []GuardrailMetric`, `MinSamples int`, and `Owner string`.
- Update `extractOneExperimentConfig` to parse the new fields.
- Update `pick_experiment.cjs` to write `hypothesis`, `secondary_metrics`, `guardrail_metrics`, `min_samples`, and `owner` into `state.json` so downstream reporting can consume them.
- Surface `hypothesis` and `guardrail_metrics` in the Markdown step summary written by `writeSummary()`.
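The struct extension above can be sketched in Go. The field layout of the new `GuardrailMetric` type and the `evalThreshold` helper are illustrative assumptions, not existing code in `compiler_experiments.go`:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// GuardrailMetric mirrors the proposed frontmatter object: a metric name
// plus a threshold expression such as ">=0.95" or "==0".
type GuardrailMetric struct {
	Name      string
	Threshold string
}

// ExperimentConfig sketches the extension; only the new fields proposed in
// this issue are shown alongside the existing Metric field.
type ExperimentConfig struct {
	Metric           string
	Hypothesis       string
	SecondaryMetrics []string
	GuardrailMetrics []GuardrailMetric
	MinSamples       int
	Owner            string
}

// evalThreshold is a hypothetical helper: it checks an observed value
// against a threshold expression like ">=0.95", "<=10", or "==0".
func evalThreshold(expr string, value float64) (bool, error) {
	for _, op := range []string{">=", "<=", "==", ">", "<"} {
		if strings.HasPrefix(expr, op) {
			want, err := strconv.ParseFloat(strings.TrimSpace(expr[len(op):]), 64)
			if err != nil {
				return false, err
			}
			switch op {
			case ">=":
				return value >= want, nil
			case "<=":
				return value <= want, nil
			case "==":
				return value == want, nil
			case ">":
				return value > want, nil
			default:
				return value < want, nil
			}
		}
	}
	return false, fmt.Errorf("unsupported threshold: %q", expr)
}

func main() {
	cfg := ExperimentConfig{
		Metric:     "effective_tokens",
		MinSamples: 25,
		GuardrailMetrics: []GuardrailMetric{
			{Name: "success_rate", Threshold: ">=0.95"},
			{Name: "empty_output_rate", Threshold: "==0"},
		},
	}
	observed := map[string]float64{"success_rate": 0.97, "empty_output_rate": 0}
	for _, g := range cfg.GuardrailMetrics {
		ok, _ := evalThreshold(g.Threshold, observed[g.Name])
		fmt.Printf("%s: pass=%v\n", g.Name, ok)
	}
}
```

Keeping the threshold as a string in the schema (rather than separate operator/value keys) keeps the frontmatter compact, at the cost of needing this small parser.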
## Area 2: Reporting & Dashboards

### Current State

`daily-experiment-report.md` already exists and likely aggregates some experiment data, but there is no automated significance detection or visual comparison.
### Proposed `daily-experiment-report` Enhancements

#### Step 1 - Aggregate experiment artifacts

Download all `experiment` artifacts for each workflow that has an active experiment, using `gh run list` and `gh run download`:
```sh
gh run list --workflow="daily-news.lock.yml" --limit 100 --json databaseId \
  | jq -r '.[].databaseId' \
  | xargs -I{} gh run download {} -n "experiment" -D /tmp/exp/{}
```
#### Step 2 - Compute running statistics per variant

For each experiment, read `assignments.json` from each run artifact and cross-reference it with the effective token count from the run log or OTEL span. Compute:

- `n` (sample size per variant)
- `mean` and `variance` of the primary metric per variant
- p-value via Welch's t-test
- guardrail status (pass/fail per threshold)
#### Step 3 - Detect significance

When `p < 0.05` and `n >= min_samples` for all variants, mark the experiment as ready for promotion and post a notification comment to the tracking issue.
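The promotion rule can be captured in a small predicate; the name and signature are illustrative, and guardrail pass/fail would still be checked separately before actually promoting:

```go
package main

import "fmt"

// readyForPromotion applies the proposed rule: significance (p < 0.05)
// plus every variant reaching min_samples.
func readyForPromotion(p float64, counts map[string]int, minSamples int) bool {
	if p >= 0.05 {
		return false
	}
	for _, n := range counts {
		if n < minSamples {
			return false
		}
	}
	return true
}

func main() {
	counts := map[string]int{"concise": 25, "detailed": 27}
	fmt.Println(readyForPromotion(0.031, counts, 25))
}
```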
#### Step 4 - ASCII comparison table

Emit an artifact like:

```
Experiment   | Variant  | n  | Mean tokens | p-value
prompt_style | detailed | 14 | 48 320      | 0.031
             | concise  | 14 | 39 870      |
```
#### Step 5 - Post to discussion

Create or update a pinned "Experiment Dashboard" discussion in the `audits` category with the latest stats table, significance status, and a recommendation (promote / continue / abort).
## Area 3: Audit & OTEL Observability Integration

### Current State

`pick_experiment.cjs` writes `assignments.json` to `/tmp/gh-aw/experiments/`, but this data is not yet surfaced in OTEL spans or `gh aw audit` output.
### Proposed Changes

#### 3a - OTEL span attributes

In the shared OTEL step (`shared/observability-otlp.md`), read `/tmp/gh-aw/experiments/assignments.json` (if present) and add one span attribute per experiment:

```
gh_aw.experiment.prompt_style = "concise"
gh_aw.experiment.names = "prompt_style"
```

This enables filtering in Datadog/Honeycomb to compare failure rates, latency, and token usage by variant.
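A sketch of the attribute mapping, assuming `assignments.json` is a flat `{"experiment": "variant"}` object; both the file layout and the `spanAttrs` helper are assumptions about the eventual implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// spanAttrs turns the assignments.json payload into the proposed
// gh_aw.experiment.* span attributes, plus a names attribute listing
// every active experiment.
func spanAttrs(assignmentsJSON []byte) (map[string]string, error) {
	var assignments map[string]string
	if err := json.Unmarshal(assignmentsJSON, &assignments); err != nil {
		return nil, err
	}
	attrs := make(map[string]string)
	names := make([]string, 0, len(assignments))
	for name, variant := range assignments {
		attrs["gh_aw.experiment."+name] = variant
		names = append(names, name)
	}
	sort.Strings(names) // deterministic order for the names attribute
	attrs["gh_aw.experiment.names"] = strings.Join(names, ",")
	return attrs, nil
}

func main() {
	attrs, _ := spanAttrs([]byte(`{"prompt_style": "concise"}`))
	fmt.Println(attrs["gh_aw.experiment.prompt_style"], attrs["gh_aw.experiment.names"])
}
```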
#### 3b - `gh aw audit` output

Extend the audit command to show experiment assignments in the run detail view:

```
Run #25240168844  daily-news  success  2026-05-02T09:03Z
  Duration: 4m 12s   Tokens: 39 870
  Experiments: prompt_style=concise
```

This lets engineers filter audit logs by variant: `gh aw audit --experiment prompt_style=concise`.
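The proposed `--experiment` filter could be implemented roughly as follows; `matchesExperimentFilter` is a hypothetical helper sketching the `name=variant` flag syntax:

```go
package main

import (
	"fmt"
	"strings"
)

// matchesExperimentFilter checks a run's experiment assignments against a
// --experiment flag value of the form "name=variant".
func matchesExperimentFilter(assignments map[string]string, filter string) bool {
	name, variant, ok := strings.Cut(filter, "=")
	if !ok {
		return false // malformed filter matches nothing
	}
	return assignments[name] == variant
}

func main() {
	run := map[string]string{"prompt_style": "concise"}
	fmt.Println(matchesExperimentFilter(run, "prompt_style=concise"))
	fmt.Println(matchesExperimentFilter(run, "prompt_style=detailed"))
}
```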
#### 3c - Step summary enrichment

Update `writeSummary()` in `pick_experiment.cjs` to also output:

- Guardrail metric thresholds
- `min_samples` required vs. current counts
- A progress bar per variant: `detailed: -------- 14/25 (56%)`
- A "ready for analysis" flag when `n >= min_samples` for all variants
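The progress line could be rendered along these lines (a sketch; the bar characters and integer rounding are cosmetic choices, and the real implementation would live in `pick_experiment.cjs`):

```go
package main

import "fmt"

// progressLine renders a per-variant progress entry such as
// "detailed: #####----- 14/25 (56%)".
func progressLine(variant string, n, minSamples int) string {
	pct := n * 100 / minSamples
	filled := pct / 10
	if filled > 10 {
		filled = 10 // cap the bar once min_samples is exceeded
	}
	bar := ""
	for i := 0; i < 10; i++ {
		if i < filled {
			bar += "#"
		} else {
			bar += "-"
		}
	}
	return fmt.Sprintf("%s: %s %d/%d (%d%%)", variant, bar, n, minSamples, pct)
}

func main() {
	fmt.Println(progressLine("detailed", 14, 25))
}
```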
#### 3d - Experiment lifecycle labels

Automate GitHub label management on tracking issues:

- `experiment:active` when `start_date <= today <= end_date`
- `experiment:ready-for-analysis` when all variants hit `min_samples`
- `experiment:concluded` after the winning variant is promoted
## Implementation Steps

1. Extend the `ExperimentConfig` struct and parser for the new schema fields (`hypothesis`, `secondary_metrics`, `guardrail_metrics`, `min_samples`, `owner`)
2. Update `pick_experiment.cjs` to write the new fields to `state.json` and enrich the step summary
3. Add experiment span attributes in `shared/observability-otlp.md`
4. Extend `gh aw audit` to show and filter by experiment variant
5. Enhance `daily-experiment-report.md` to aggregate artifacts, compute significance, and post the dashboard to a discussion
## References

- `pkg/workflow/compiler_experiments.go` - experiment config parsing and step generation
- `actions/setup/js/pick_experiment.cjs` - variant selection, state persistence, step summary
- `.github/workflows/daily-experiment-report.md` - existing report workflow to enhance
- `.github/workflows/shared/observability-otlp.md` - OTEL integration point
Generated by Daily A/B Testing Advisor