Commit 4270b02
feat(evals): add llm driven workflow evals with llm as a judge (#383)
* feat(evals): add llm driven workflow evals with llm as a judge
Add workflow evaluation system for testing AI agents in multi-turn
conversations using Apify MCP tools, with LLM-based evaluation.
Core Components:
- Multi-turn conversation executor with dynamic tool discovery
- LLM judge for evaluating agent performance against requirements
- Isolated MCP server per test (prevents state contamination)
- OpenRouter integration (agent + judge models)
- Configurable tool timeout (default: 60s, MCP SDK integration)
Architecture:
• MCP server spawned fresh per test → test isolation
• Tools refreshed after each turn → supports dynamic registration (add-actor)
• Strict pass/fail → all tests must pass for CI success
• Raw error propagation → LLM receives MCP SDK errors unchanged
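The architecture points above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the executor from this commit: the `ToolServer` and `Agent` interfaces, `runConversation`, and the `maxTurns` limit are hypothetical names introduced here.

```typescript
// Hypothetical shapes; the real executor in evals/workflows/ may differ.
type ToolCall = { name: string; args: Record<string, unknown> };
type AgentReply = { text: string; toolCalls: ToolCall[] };
type Message = { role: "user" | "assistant" | "tool"; content: string };

interface ToolServer {
    listTools(): Promise<string[]>;
    callTool(call: ToolCall): Promise<string>;
    close(): Promise<void>;
}

interface Agent {
    next(history: Message[], toolNames: string[]): Promise<AgentReply>;
}

async function runConversation(
    query: string,
    agent: Agent,
    spawnServer: () => Promise<ToolServer>, // called once per test for isolation
    maxTurns = 10, // assumed safety limit, not taken from the commit
): Promise<Message[]> {
    const server = await spawnServer();
    const history: Message[] = [{ role: "user", content: query }];
    try {
        for (let turn = 0; turn < maxTurns; turn++) {
            // Refresh the tool list every turn so newly registered tools are visible.
            const toolNames = await server.listTools();
            const reply = await agent.next(history, toolNames);
            history.push({ role: "assistant", content: reply.text });
            if (reply.toolCalls.length === 0) break; // no tool calls: the agent is done

            for (const call of reply.toolCalls) {
                let content: string;
                try {
                    content = await server.callTool(call);
                } catch (err) {
                    // Pass the error text through unchanged so the agent can react to it.
                    content = err instanceof Error ? err.message : String(err);
                }
                history.push({ role: "tool", content });
            }
        }
    } finally {
        // Always shut the per-test server down, even if the agent throws.
        await server.close();
    }
    return history;
}
```

Passing `spawnServer` as a factory keeps the isolation rule explicit: each test gets its own server instance, and the `finally` block closes it regardless of the outcome.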
CLI Usage:
npm run evals:workflow
npm run evals:workflow -- --tool-timeout 300 --category search
CLI Options:
--tool-timeout <seconds>   Tool call timeout (default: 60)
--agent-model <model>      Agent model (default: claude-haiku-4.5)
--judge-model <model>      Judge model (default: grok-4.1-fast)
--category <name>          Filter by category
--id <id>                  Run specific test
--verbose                  Show full conversations
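As an illustration of how these options and defaults fit together, here is a minimal hand-rolled parser. It is a sketch only: `WorkflowCliOptions` and `parseWorkflowArgs` are hypothetical names, and the commit's actual argument handling may use a library instead.

```typescript
type WorkflowCliOptions = {
    toolTimeoutSeconds: number;
    agentModel: string;
    judgeModel: string;
    category?: string;
    id?: string;
    verbose: boolean;
};

// Defaults are taken from the option list above.
const DEFAULTS: WorkflowCliOptions = {
    toolTimeoutSeconds: 60,
    agentModel: "claude-haiku-4.5",
    judgeModel: "grok-4.1-fast",
    verbose: false,
};

function parseWorkflowArgs(argv: string[]): WorkflowCliOptions {
    const options: WorkflowCliOptions = { ...DEFAULTS };
    for (let i = 0; i < argv.length; i++) {
        const flag = argv[i];
        if (flag === "--verbose") {
            options.verbose = true; // boolean flag, takes no value
            continue;
        }
        // Every other option consumes the next argument as its value.
        const value = argv[i + 1];
        if (value === undefined) throw new Error(`Missing value for ${flag}`);
        i++;
        switch (flag) {
            case "--tool-timeout": {
                const seconds = Number(value);
                if (!Number.isFinite(seconds) || seconds <= 0) {
                    throw new Error(`Invalid --tool-timeout: ${value}`);
                }
                options.toolTimeoutSeconds = seconds;
                break;
            }
            case "--agent-model":
                options.agentModel = value;
                break;
            case "--judge-model":
                options.judgeModel = value;
                break;
            case "--category":
                options.category = value;
                break;
            case "--id":
                options.id = value;
                break;
            default:
                throw new Error(`Unknown option: ${flag}`);
        }
    }
    return options;
}
```

With this sketch, the second usage line above would be handled as `parseWorkflowArgs(["--tool-timeout", "300", "--category", "search"])`.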
Environment:
APIFY_TOKEN - Required for MCP server
OPENROUTER_API_KEY - Required for LLM calls
This enables systematic testing of MCP tools, agent tool-calling behavior,
and automated quality evaluation without manual verification.
* refactor(evals): extract shared utilities and unify test case format
This commit refactors the evaluation system to eliminate code duplication
and standardize test case formats across both tool selection and workflow
evaluation systems.
## Changes
### Created shared module (evals/shared/)
- types.ts: Unified type definitions for test cases and tools
- config.ts: Shared OpenRouter configuration and environment validation
- openai-tools.ts: Consolidated tool transformation utilities
- test-case-loader.ts: Unified test case loading and filtering functions
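To make the role of openai-tools.ts concrete, the following sketch converts MCP tool definitions into the OpenAI function-tool shape. The function name `mcpToolsToOpenAiTools` appears in this commit message, but the body and the two type names are assumptions, not the shared module's actual code.

```typescript
// Shape of a tool as returned by an MCP tools/list call (only the fields used here).
type McpTool = {
    name: string;
    description?: string;
    inputSchema: Record<string, unknown>;
};

// Shape of a function tool in the OpenAI chat completions format.
type OpenAiTool = {
    type: "function";
    function: {
        name: string;
        description: string;
        parameters: Record<string, unknown>;
    };
};

function mcpToolsToOpenAiTools(tools: McpTool[]): OpenAiTool[] {
    return tools.map((tool) => ({
        type: "function" as const,
        function: {
            name: tool.name,
            // MCP descriptions are optional; fall back to an empty string.
            description: tool.description ?? "",
            // The MCP input schema is already JSON Schema, so it is passed through.
            parameters: tool.inputSchema,
        },
    }));
}
```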
### Unified test case format
- Standardized on 'query' (previously 'prompt' in workflows)
- Standardized on 'reference' (previously 'requirements' in workflows)
- Added version tracking to workflows/test-cases.json
- Maintains backwards compatibility through type exports
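The field renames above can be illustrated with a small migration helper. Only the field names `query`, `reference`, `prompt`, and `requirements` come from this commit message; the type names, the remaining fields, and `toUnifiedTestCase` are hypothetical.

```typescript
// Unified format: 'query' and 'reference' are the standardized field names.
type TestCase = {
    id: string;
    category: string;
    query: string;
    reference: string;
};

// Earlier workflow format, which used 'prompt' and 'requirements'.
type LegacyWorkflowTestCase = {
    id: string;
    category: string;
    prompt: string;
    requirements: string;
};

function toUnifiedTestCase(legacy: LegacyWorkflowTestCase): TestCase {
    return {
        id: legacy.id,
        category: legacy.category,
        query: legacy.prompt, // 'prompt' is now 'query'
        reference: legacy.requirements, // 'requirements' is now 'reference'
    };
}
```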
### Eliminated duplicates
Removed 7 duplicate functions across the codebase:
- Test case loading (evaluation-utils.ts vs test-cases-loader.ts)
- Test case filtering (filterById, filterByCategory, filterTestCases)
- OpenAI tool transformation (transformToolsToOpenAIFormat vs mcpToolsToOpenAiTools)
- OpenRouter configuration (OPENROUTER_CONFIG duplicated)
- Environment validation (validateEnvVars duplicated)
### Configuration improvements
- OPENROUTER_BASE_URL is now optional (defaults to https://openrouter.ai/api/v1)
- Created Phoenix-specific validation (validatePhoenixEnvVars)
- Separated concerns between shared and system-specific config
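A minimal sketch of the optional base URL, assuming the environment is passed in as a plain record. `resolveOpenRouterConfig` and the returned field names are hypothetical; the default URL and variable names are the ones stated in this commit message.

```typescript
type Env = Record<string, string | undefined>;

const DEFAULT_OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1";

function resolveOpenRouterConfig(env: Env): { baseURL: string; apiKey: string } {
    const apiKey = env["OPENROUTER_API_KEY"];
    if (!apiKey) {
        throw new Error("OPENROUTER_API_KEY is required for LLM calls");
    }
    return {
        // Fall back to the public endpoint when the variable is unset or empty.
        baseURL: env["OPENROUTER_BASE_URL"] || DEFAULT_OPENROUTER_BASE_URL,
        apiKey,
    };
}
```

In a Node.js script this would typically be called as `resolveOpenRouterConfig(process.env)`; taking the record as a parameter keeps the function easy to test.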
### Files modified
- Updated 11 existing files to use shared utilities
- Deleted evals/workflows/convert-mcp-tools.ts (replaced by shared)
- All imports updated to reference shared modules
## Impact
- Reduced config code by ~37%
- Eliminated 100% of duplicate functions
- Improved maintainability and consistency
- No breaking changes to external APIs
## Validation
- TypeScript compilation: ✓
- Project build: ✓
- All imports verified: ✓
* feat(evals): add parallel execution and fix linting for workflows
- Add --concurrency/-c flag to run workflow evals in parallel (default: 4)
- Add p-limit dependency for concurrency control
- Enable ESLint for evals/workflows/ and evals/shared/ directories
- Fix all linting issues (117 errors):
  - Convert interfaces to types per project convention
  - Fix import ordering with simple-import-sort
  - Remove trailing spaces
  - Fix comma-dangle, arrow-parens, operator-linebreak
  - Prefer node: protocol for built-in imports
  - Fix nested ternary in output-formatter.ts
- Add logWithPrefix() helper for prefixed live output
- Extract runSingleTest() function from main evaluation loop
- Remove empty line after test completion in output
Breaking changes: None (all changes backward compatible)
Usage:
npm run evals:workflow -- -c 10 # Run 10 tests in parallel
npm run evals:workflow -- -c 1 # Sequential mode
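The effect of the concurrency flag can be sketched without any dependency. The commit itself uses p-limit; the hand-rolled `mapWithConcurrency` below is a hypothetical stand-in that shows the same idea of capping the number of tests in flight.

```typescript
async function mapWithConcurrency<T, R>(
    items: T[],
    concurrency: number,
    worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
    if (!Number.isInteger(concurrency) || concurrency < 1) {
        throw new Error("concurrency must be a positive integer");
    }
    const results = new Array<R>(items.length);
    let next = 0; // index of the next item to claim

    // Each lane claims items one at a time until none are left.
    async function runLane(): Promise<void> {
        while (next < items.length) {
            const index = next++;
            results[index] = await worker(items[index] as T, index);
        }
    }

    const laneCount = Math.min(concurrency, items.length);
    await Promise.all(Array.from({ length: laneCount }, () => runLane()));
    return results; // same order as the input, regardless of completion order
}
```

With a concurrency of 1 the single lane processes items strictly in order, which corresponds to the sequential mode shown above.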
* feat(evals): use structured output for judge LLM and fix test filtering
- Refactor judge to use OpenAI's structured output (JSON schema) for robust evaluation
- Replace fragile text parsing with guaranteed JSON validation
- Fix test case filtering to support wildcard patterns (--category) and regex (--id)
- Add responseFormat parameter to LLM client for structured outputs
- Update judge prompt to remove manual format instructions
- Add test case for weather MCP Actor
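To show what replacing text parsing with a JSON schema can look like, here is a minimal sketch. The `response_format` object follows the OpenAI chat completions structured-output shape; the verdict fields, `judgeResponseFormat`, and `parseJudgeVerdict` are assumptions, since the commit's actual schema is not shown here.

```typescript
// Hypothetical verdict shape; the judge's real schema fields may differ.
type JudgeVerdict = { verdict: "PASS" | "FAIL"; reason: string };

// Value for the `response_format` field of a chat completions request.
const judgeResponseFormat = {
    type: "json_schema",
    json_schema: {
        name: "judge_verdict",
        strict: true,
        schema: {
            type: "object",
            properties: {
                verdict: { type: "string", enum: ["PASS", "FAIL"] },
                reason: { type: "string" },
            },
            required: ["verdict", "reason"],
            additionalProperties: false,
        },
    },
} as const;

// Validate the model output defensively instead of trusting it blindly.
function parseJudgeVerdict(raw: string): JudgeVerdict {
    const data: unknown = JSON.parse(raw);
    if (typeof data !== "object" || data === null) {
        throw new Error("Judge output is not a JSON object");
    }
    const fields = data as Record<string, unknown>;
    const isPass = fields["verdict"] === "PASS";
    if (!isPass && fields["verdict"] !== "FAIL") {
        throw new Error("Judge verdict must be PASS or FAIL");
    }
    const reason = fields["reason"];
    if (typeof reason !== "string") {
        throw new Error("Judge reason must be a string");
    }
    return { verdict: isPass ? "PASS" : "FAIL", reason };
}
```

Because the schema constrains the response, the judge prompt no longer needs manual format instructions, and the parser only has to confirm the contract rather than search free text for a verdict.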
* feat(evals): MCP instructions, test tracking, and expanded test coverage
1 parent 6dd3b10 commit 4270b02
24 files changed
Lines changed: 3010 additions & 82 deletions
File tree
- evals
- workflows