Commit 4270b02

feat(evals): add llm driven workflow evals with llm as a judge (#383)
* feat(evals): add llm driven workflow evals with llm as a judge

Add workflow evaluation system for testing AI agents in multi-turn conversations using Apify MCP tools, with LLM-based evaluation.

Core Components:
- Multi-turn conversation executor with dynamic tool discovery
- LLM judge for evaluating agent performance against requirements
- Isolated MCP server per test (prevents state contamination)
- OpenRouter integration (agent + judge models)
- Configurable tool timeout (default: 60s, MCP SDK integration)

Architecture:
• MCP server spawned fresh per test → test isolation
• Tools refreshed after each turn → supports dynamic registration (add-actor)
• Strict pass/fail → all tests must pass for CI success
• Raw error propagation → LLM receives MCP SDK errors unchanged

CLI Usage:
  npm run evals:workflow
  npm run evals:workflow -- --tool-timeout 300 --category search

CLI Options:
  --tool-timeout <seconds>  Tool call timeout (default: 60)
  --agent-model <model>     Agent model (default: claude-haiku-4.5)
  --judge-model <model>     Judge model (default: grok-4.1-fast)
  --category <name>         Filter by category
  --id <id>                 Run specific test
  --verbose                 Show full conversations

Environment:
  APIFY_TOKEN - Required for MCP server
  OPENROUTER_API_KEY - Required for LLM calls

This enables systematic testing of MCP tools, agent tool-calling behavior, and automated quality evaluation without manual verification.

* refactor(evals): extract shared utilities and unify test case format

This commit refactors the evaluation system to eliminate code duplication and standardize test case formats across both tool selection and workflow evaluation systems.

## Changes

### Created shared module (evals/shared/)
- types.ts: Unified type definitions for test cases and tools
- config.ts: Shared OpenRouter configuration and environment validation
- openai-tools.ts: Consolidated tool transformation utilities
- test-case-loader.ts: Unified test case loading and filtering functions

### Unified test case format
- Standardized on 'query' (previously 'prompt' in workflows)
- Standardized on 'reference' (previously 'requirements' in workflows)
- Added version tracking to workflows/test-cases.json
- Maintains backwards compatibility through type exports

### Eliminated duplicates
Removed 7 duplicate functions across the codebase:
- Test case loading (evaluation-utils.ts vs test-cases-loader.ts)
- Test case filtering (filterById, filterByCategory, filterTestCases)
- OpenAI tool transformation (transformToolsToOpenAIFormat vs mcpToolsToOpenAiTools)
- OpenRouter configuration (OPENROUTER_CONFIG duplicated)
- Environment validation (validateEnvVars duplicated)

### Configuration improvements
- OPENROUTER_BASE_URL is now optional (defaults to https://openrouter.ai/api/v1)
- Created Phoenix-specific validation (validatePhoenixEnvVars)
- Separated concerns between shared and system-specific config

### Files modified
- Updated 11 existing files to use shared utilities
- Deleted evals/workflows/convert-mcp-tools.ts (replaced by shared)
- All imports updated to reference shared modules

## Impact
- Reduced config code by ~37%
- Eliminated 100% of duplicate functions
- Improved maintainability and consistency
- No breaking changes to external APIs

## Validation
- TypeScript compilation: ✓
- Project build: ✓
- All imports verified: ✓

* feat(evals): add parallel execution and fix linting for workflows

- Add --concurrency/-c flag to run workflow evals in parallel (default: 4)
- Add p-limit dependency for concurrency control
- Enable ESLint for evals/workflows/ and evals/shared/ directories
- Fix all linting issues (117 errors):
  - Convert interfaces to types per project convention
  - Fix import ordering with simple-import-sort
  - Remove trailing spaces
  - Fix comma-dangle, arrow-parens, operator-linebreak
  - Prefer node: protocol for built-in imports
  - Fix nested ternary in output-formatter.ts
- Add logWithPrefix() helper for prefixed live output
- Extract runSingleTest() function from main evaluation loop
- Remove empty line after test completion in output

Breaking changes: None (all changes backward compatible)

Usage:
  npm run evals:workflow -- -c 10  # Run 10 tests in parallel
  npm run evals:workflow -- -c 1   # Sequential mode

* feat(evals): use structured output for judge LLM and fix test filtering

- Refactor judge to use OpenAI's structured output (JSON schema) for robust evaluation
- Replace fragile text parsing with guaranteed JSON validation
- Fix test case filtering to support wildcard patterns (--category) and regex (--id)
- Add responseFormat parameter to LLM client for structured outputs
- Update judge prompt to remove manual format instructions
- Add test case for weather MCP Actor

* feat(evals): MCP instructions, test tracking, and expanded test coverage
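The `--concurrency` behaviour described in the message can be pictured with a small limiter. This is only a sketch: the commit uses the `p-limit` package, whose code is not shown here, so `createLimiter` and `demo` below are hypothetical stand-ins written to keep the example self-contained.

```typescript
// Hypothetical stand-in for p-limit: allow at most `limit` tasks to run at once.
function createLimiter(limit: number) {
    let active = 0;
    const waiting: Array<() => void> = [];

    const release = (): void => {
        const resume = waiting.shift();
        if (resume) {
            resume(); // hand the freed slot directly to the next waiter
        } else {
            active--;
        }
    };

    return async function run<T>(task: () => Promise<T>): Promise<T> {
        if (active < limit) {
            active++;
        } else {
            // Wait until a finishing task hands over its slot.
            await new Promise<void>((resolve) => { waiting.push(resolve); });
        }
        try {
            return await task();
        } finally {
            release();
        }
    };
}

// Run five placeholder "tests" with a concurrency of 2 and record the peak parallelism.
async function demo(): Promise<{ results: string[]; maxActive: number }> {
    const limit = createLimiter(2);
    let running = 0;
    let maxActive = 0;
    const ids = ['t1', 't2', 't3', 't4', 't5'];

    const results = await Promise.all(ids.map((id) => limit(async () => {
        running++;
        maxActive = Math.max(maxActive, running);
        await new Promise((resolve) => setTimeout(resolve, 10));
        running--;
        return `${id}: done`;
    })));

    return { results, maxActive };
}

demo().then(({ maxActive }) => console.log(`peak parallel tasks: ${maxActive}`));
```

With a limit of 1 the same wrapper degrades to sequential execution, which matches the `-c 1` usage line above.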
1 parent 6dd3b10 commit 4270b02

24 files changed

Lines changed: 3010 additions & 82 deletions

eslint.config.mjs

Lines changed: 3 additions & 1 deletion
@@ -24,7 +24,9 @@ export default [
         ignores: [
             '**/dist', // Build output directory
             '**/.venv', // Python virtual environment (if present)
-            'evals/**', // Evaluation scripts directory
+            'evals/*.ts', // Top-level evaluation scripts
+            'evals/*.md', // Documentation files
+            'evals/*.json', // Test case data files
         ],
     },
     // Apply the shared Apify TypeScript ESLint configuration

evals/config.ts

Lines changed: 14 additions & 11 deletions
@@ -6,6 +6,9 @@ import { readFileSync } from 'node:fs';
 import { dirname, join } from 'node:path';
 import { fileURLToPath } from 'node:url';

+// Re-export shared config
+export { OPENROUTER_CONFIG, sanitizeHeaderValue, validateEnvVars, getRequiredEnvVars } from './shared/config.js';
+
 // Read version from test-cases.json
 function getTestCasesVersion(): string {
     const currentFilename = fileURLToPath(import.meta.url);
@@ -156,24 +159,24 @@ The response must be exactly:
 Decision: either "correct" or "incorrect".
 Explanation: brief explanation of the decision.
 `
-export function getRequiredEnvVars(): Record<string, string | undefined> {
+/**
+ * Get required environment variables for Phoenix-based evaluations
+ * Extends shared config with Phoenix-specific variables
+ * Note: OPENROUTER_BASE_URL is optional (defaults to https://openrouter.ai/api/v1)
+ */
+export function getPhoenixEnvVars(): Record<string, string | undefined> {
     return {
         PHOENIX_BASE_URL: process.env.PHOENIX_BASE_URL,
         PHOENIX_API_KEY: process.env.PHOENIX_API_KEY,
         OPENROUTER_API_KEY: process.env.OPENROUTER_API_KEY,
-        OPENROUTER_BASE_URL: process.env.OPENROUTER_BASE_URL,
     };
 }

-// Removes newlines and trims whitespace. Useful for Authorization header values
-// because CI secrets sometimes include trailing newlines or quotes.
-export function sanitizeHeaderValue(value?: string): string | undefined {
-    if (value == null) return value;
-    return value.replace(/[\r\n]/g, '').trim().replace(/^"|"$/g, '');
-}
-
-export function validateEnvVars(): boolean {
-    const envVars = getRequiredEnvVars();
+/**
+ * Validate Phoenix-specific environment variables
+ */
+export function validatePhoenixEnvVars(): boolean {
+    const envVars = getPhoenixEnvVars();
     const missing = Object.entries(envVars)
         .filter(([, value]) => !value)
         .map(([key]) => key);

evals/create-dataset.ts

Lines changed: 2 additions & 2 deletions
@@ -14,7 +14,7 @@ import { hideBin } from 'yargs/helpers';

 import log from '@apify/log';

-import { sanitizeHeaderValue, validateEnvVars } from './config.js';
+import { sanitizeHeaderValue, validatePhoenixEnvVars } from './config.js';
 import { loadTestCases, filterByCategory, filterById, type TestCase } from './evaluation-utils.js';

 // Set log level to debug
@@ -81,7 +81,7 @@ async function createDatasetFromTestCases(
     log.info('Creating Phoenix dataset from test cases...');

     // Validate environment variables
-    if (!validateEnvVars()) {
+    if (!validatePhoenixEnvVars()) {
         process.exit(1);
     }

evals/evaluation-utils.ts

Lines changed: 15 additions & 56 deletions
@@ -2,10 +2,6 @@
  * Shared evaluation utilities extracted from run-evaluation.ts
  */

-import { readFileSync } from 'node:fs';
-import { dirname as pathDirname, join } from 'node:path';
-import { fileURLToPath } from 'node:url';
-
 import OpenAI from 'openai';
 import { createOpenAI } from '@ai-sdk/openai';
 import { asEvaluator } from '@arizeai/phoenix-client/experiments';
@@ -24,50 +20,24 @@ import {
     TEMPERATURE,
     sanitizeHeaderValue
 } from './config.js';
+import { loadTestCases as loadTestCasesShared, filterByCategory, filterById } from './shared/test-case-loader.js';
+import { transformToolsToOpenAIFormat } from './shared/openai-tools.js';
+import type { ToolSelectionTestCase, TestData } from './shared/types.js';

-type ExampleInputOnly = { input: Record<string, unknown>, metadata?: Record<string, unknown>, output?: never };
+// Re-export types for backwards compatibility
+export type TestCase = ToolSelectionTestCase;
+export type { TestData } from './shared/types.js';

-export type TestCase = {
-    id: string;
-    category: string;
-    query: string;
-    context?: string | string[];
-    expectedTools?: string[];
-    reference?: string;
-};
-
-export type TestData = {
-    version: string;
-    testCases: TestCase[];
-};
-
-// eslint-disable-next-line consistent-return
-export function loadTestCases(filePath: string): TestData {
-    const filename = fileURLToPath(import.meta.url);
-    const dirname = pathDirname(filename);
-    const testCasesPath = join(dirname, filePath);
-
-    try {
-        const fileContent = readFileSync(testCasesPath, 'utf-8');
-        return JSON.parse(fileContent) as TestData;
-    } catch {
-        log.error(`Error: Test cases file not found at ${testCasesPath}`);
-        process.exit(1);
-    }
-}
-
-export function filterByCategory(testCases: TestCase[], category: string): TestCase[] {
-    // Convert wildcard pattern to regex
-    const pattern = category.replace(/\*/g, '.*');
-    const regex = new RegExp(`^${pattern}$`);
-
-    return testCases.filter((testCase) => regex.test(testCase.category));
-}
+// Re-export shared functions for backwards compatibility
+export { filterByCategory, filterById } from './shared/test-case-loader.js';

-export function filterById(testCases: TestCase[], idPattern: string): TestCase[] {
-    const regex = new RegExp(idPattern);
+type ExampleInputOnly = { input: Record<string, unknown>, metadata?: Record<string, unknown>, output?: never };

-    return testCases.filter((testCase) => regex.test(testCase.id));
+/**
+ * Load test cases from a JSON file (wrapper around shared function)
+ */
+export function loadTestCases(filePath: string): TestData {
+    return loadTestCasesShared(filePath);
 }

 export async function loadTools(): Promise<ToolBase[]> {
@@ -76,22 +46,11 @@ export async function loadTools(): Promise<ToolBase[]> {
     return urlTools.map((t: ToolEntry) => getToolPublicFieldOnly(t)) as ToolBase[];
 }

-export function transformToolsToOpenAIFormat(tools: ToolBase[]): OpenAI.Chat.Completions.ChatCompletionTool[] {
-    return tools.map((tool) => ({
-        type: 'function',
-        function: {
-            name: tool.name,
-            description: tool.description,
-            parameters: tool.inputSchema as OpenAI.Chat.ChatCompletionTool['function']['parameters'],
-        },
-    }));
-}
-
 export function createOpenRouterTask(modelName: string, tools: ToolBase[]) {
     const toolsOpenAI = transformToolsToOpenAIFormat(tools);

     return async (example: ExampleInputOnly): Promise<{
-        tool_calls: Array<{ function?: { name?: string } }>;
+        tool_calls: OpenAI.Chat.Completions.ChatCompletionMessageToolCall[];
         llm_response: string;
         query: string;
         context: string;

evals/run-evaluation.ts

Lines changed: 2 additions & 2 deletions
@@ -28,7 +28,7 @@ import {
     EVALUATOR_NAMES,
     type EvaluatorName,
     sanitizeHeaderValue,
-    validateEnvVars
+    validatePhoenixEnvVars
 } from './config.js';

 type EvaluatorResult = {
@@ -202,7 +202,7 @@ function printResults(results: EvaluatorResult[]): void {
 async function main(datasetName: string): Promise<number> {
     log.info('Starting MCP tool calling evaluation');

-    if (!validateEnvVars()) {
+    if (!validatePhoenixEnvVars()) {
         return 1;
     }

evals/shared/config.ts

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+/**
+ * Shared configuration for evaluation systems
+ * Contains OpenRouter config, environment validation, and common utilities
+ */
+
+/**
+ * OpenRouter API configuration
+ * OPENROUTER_BASE_URL is optional and defaults to the standard OpenRouter API URL
+ */
+export const OPENROUTER_CONFIG = {
+    baseURL: process.env.OPENROUTER_BASE_URL || 'https://openrouter.ai/api/v1',
+    apiKey: process.env.OPENROUTER_API_KEY || '',
+};
+
+/**
+ * Get required environment variables
+ * Note: OPENROUTER_BASE_URL is optional (defaults to https://openrouter.ai/api/v1)
+ */
+export function getRequiredEnvVars(): Record<string, string | undefined> {
+    return {
+        OPENROUTER_API_KEY: process.env.OPENROUTER_API_KEY,
+    };
+}
+
+/**
+ * Removes newlines and trims whitespace. Useful for Authorization header values
+ * because CI secrets sometimes include trailing newlines or quotes.
+ */
+export function sanitizeHeaderValue(value?: string): string | undefined {
+    if (value == null) return value;
+    return value.replace(/[\r\n]/g, '').trim().replace(/^"|"$/g, '');
+}
+
+/**
+ * Validate that all required environment variables are present
+ */
+export function validateEnvVars(): boolean {
+    const envVars = getRequiredEnvVars();
+    const missing = Object.entries(envVars)
+        .filter(([, value]) => !value)
+        .map(([key]) => key);
+
+    if (missing.length > 0) {
+        // eslint-disable-next-line no-console
+        console.error(`Missing required environment variables: ${missing.join(', ')}`);
+        return false;
+    }
+
+    return true;
+}
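To make the behaviour of the shared config concrete, here is a small sketch. `sanitizeHeaderValue` is copied from the file above; `resolveOpenRouterConfig` and `findMissing` are hypothetical helpers (not part of the commit) that mirror the logic of `OPENROUTER_CONFIG` and `validateEnvVars` but take the environment as an argument, so they can be tried without touching `process.env`.

```typescript
// Copied from evals/shared/config.ts as shown in this commit.
function sanitizeHeaderValue(value?: string): string | undefined {
    if (value == null) return value;
    return value.replace(/[\r\n]/g, '').trim().replace(/^"|"$/g, '');
}

// Hypothetical helper: same fallback rules as OPENROUTER_CONFIG,
// with the environment passed in explicitly.
function resolveOpenRouterConfig(env: Record<string, string | undefined>) {
    return {
        baseURL: env.OPENROUTER_BASE_URL || 'https://openrouter.ai/api/v1',
        apiKey: env.OPENROUTER_API_KEY || '',
    };
}

// Hypothetical helper: same missing-variable check as validateEnvVars,
// but it returns the missing names instead of logging them.
function findMissing(required: Record<string, string | undefined>): string[] {
    return Object.entries(required)
        .filter(([, value]) => !value)
        .map(([key]) => key);
}

// A CI secret that arrived wrapped in quotes with a trailing newline.
console.log(sanitizeHeaderValue('"sk-example"\n')); // sk-example

// No base URL set: the default OpenRouter URL is used.
console.log(resolveOpenRouterConfig({ OPENROUTER_API_KEY: 'sk-example' }).baseURL);

// An empty string counts as missing, just like an unset variable.
console.log(findMissing({ OPENROUTER_API_KEY: '' })); // [ 'OPENROUTER_API_KEY' ]
```

Because `OPENROUTER_CONFIG` is a module-level constant, its values are read once when the module is first imported, so the environment needs to be populated before that import happens.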

evals/shared/openai-tools.ts

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+/**
+ * Convert tool definitions to OpenAI format
+ * Unified function for both MCP tools and internal ToolBase types
+ */
+
+import type OpenAI from 'openai';
+
+import type { McpTool } from './types.js';
+
+/**
+ * Generic tool interface that matches both ToolBase and McpTool
+ */
+type GenericTool = {
+    name: string;
+    description?: string;
+    inputSchema: Record<string, unknown>;
+}
+
+/**
+ * Convert tools to OpenAI Chat Completion format
+ * Works with both MCP tools and ToolBase from the server
+ */
+export function transformToolsToOpenAIFormat(tools: GenericTool[]): OpenAI.Chat.Completions.ChatCompletionTool[] {
+    return tools.map((tool) => ({
+        type: 'function' as const,
+        function: {
+            name: tool.name,
+            description: tool.description || '',
+            parameters: tool.inputSchema,
+        },
+    }));
+}
+
+/**
+ * Alias for MCP-specific usage (keeps backwards compatibility)
+ */
+export function mcpToolsToOpenAiTools(mcpTools: McpTool[]): OpenAI.Chat.Completions.ChatCompletionTool[] {
+    return transformToolsToOpenAIFormat(mcpTools);
+}
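A quick way to see what the transformation produces is to run it on a single tool. The sketch below repeats the mapping from the file above but swaps the `openai` type import for a local `FunctionToolLike` type so it runs without that dependency; that type and the sample tool definition are assumptions for illustration only.

```typescript
type GenericTool = {
    name: string;
    description?: string;
    inputSchema: Record<string, unknown>;
};

// Local stand-in (assumption) for OpenAI.Chat.Completions.ChatCompletionTool.
type FunctionToolLike = {
    type: 'function';
    function: {
        name: string;
        description: string;
        parameters: Record<string, unknown>;
    };
};

// Same mapping as transformToolsToOpenAIFormat in the diff above.
function transformToolsToOpenAIFormat(tools: GenericTool[]): FunctionToolLike[] {
    return tools.map((tool) => ({
        type: 'function' as const,
        function: {
            name: tool.name,
            description: tool.description || '', // a missing description becomes an empty string
            parameters: tool.inputSchema, // the JSON Schema is passed through unchanged
        },
    }));
}

// Hypothetical tool definition; the name and schema are made up for the example.
const exampleTools: GenericTool[] = [
    {
        name: 'search-actors',
        inputSchema: {
            type: 'object',
            properties: { search: { type: 'string' } },
            required: ['search'],
        },
    },
];

const openAiTools = transformToolsToOpenAIFormat(exampleTools);
console.log(JSON.stringify(openAiTools[0], null, 2));
```

Accepting a structural `GenericTool` type rather than `ToolBase` or `McpTool` directly is what lets one function serve both evaluation systems.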

evals/shared/test-case-loader.ts

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
+/**
+ * Shared test case loading and filtering utilities
+ */
+
+import { readFileSync } from 'node:fs';
+import { dirname as pathDirname, join } from 'node:path';
+import { fileURLToPath } from 'node:url';
+
+import type { BaseTestCase, TestData } from './types.js';
+
+/**
+ * Load test cases from a JSON file
+ * Supports both relative and absolute paths
+ *
+ * @param filePath - Path to test cases JSON file (relative to caller or absolute)
+ * @returns Test data with version and test cases
+ */
+export function loadTestCases(filePath: string): TestData {
+    const filename = fileURLToPath(import.meta.url);
+    const dirname = pathDirname(filename);
+
+    // Support both relative (from evals/) and absolute paths
+    let testCasesPath: string;
+    if (filePath.startsWith('/')) {
+        testCasesPath = filePath;
+    } else {
+        // Relative to evals/ directory (two levels up from shared/)
+        testCasesPath = join(dirname, '..', filePath);
+    }
+
+    const fileContent = readFileSync(testCasesPath, 'utf-8');
+    return JSON.parse(fileContent) as TestData;
+}
+
+/**
+ * Filter test cases by category
+ * Supports wildcard patterns (e.g., "search-actors*" matches "search-actors-1", "search-actors-2", etc.)
+ *
+ * @param testCases - Array of test cases to filter
+ * @param category - Category pattern (supports * wildcard)
+ * @returns Filtered test cases
+ */
+export function filterByCategory<T extends BaseTestCase>(testCases: T[], category: string): T[] {
+    // Convert wildcard pattern to regex
+    const pattern = category.replace(/\*/g, '.*');
+    const regex = new RegExp(`^${pattern}$`);
+
+    return testCases.filter((testCase) => regex.test(testCase.category));
+}
+
+/**
+ * Filter test cases by ID using regex pattern
+ *
+ * @param testCases - Array of test cases to filter
+ * @param idPattern - Regex pattern to match against test case IDs
+ * @returns Filtered test cases
+ */
+export function filterById<T extends BaseTestCase>(testCases: T[], idPattern: string): T[] {
+    const regex = new RegExp(idPattern);
+    return testCases.filter((testCase) => regex.test(testCase.id));
+}
+
+/**
+ * Filter test cases by ID or category
+ * Generic filter function for workflow evaluations
+ *
+ * @param testCases - Array of test cases to filter
+ * @param options - Filter options (id and/or category)
+ * @returns Filtered test cases
+ */
+export function filterTestCases<T extends BaseTestCase>(
+    testCases: T[],
+    options: { id?: string; category?: string },
+): T[] {
+    let filtered = testCases;
+
+    if (options.id) {
+        filtered = filterById(filtered, options.id);
+    }
+
+    if (options.category) {
+        filtered = filterByCategory(filtered, options.category);
+    }
+
+    return filtered;
+}
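The two filters match differently, which is easy to miss: category patterns are anchored and treat `*` as a wildcard, while ID patterns are plain unanchored regular expressions. The sketch below repeats the three functions from the file above so it is self-contained. `BaseTestCase` is assumed to have at least `id` and `category` (the real definition lives in `types.ts`, which is not shown here), and the sample test cases are invented.

```typescript
// Assumed minimal shape; the real BaseTestCase in evals/shared/types.ts may have more fields.
type BaseTestCase = { id: string; category: string };

function filterByCategory<T extends BaseTestCase>(testCases: T[], category: string): T[] {
    // '*' becomes '.*'; other regex metacharacters in the pattern are not escaped.
    const pattern = category.replace(/\*/g, '.*');
    const regex = new RegExp(`^${pattern}$`);
    return testCases.filter((testCase) => regex.test(testCase.category));
}

function filterById<T extends BaseTestCase>(testCases: T[], idPattern: string): T[] {
    const regex = new RegExp(idPattern);
    return testCases.filter((testCase) => regex.test(testCase.id));
}

function filterTestCases<T extends BaseTestCase>(
    testCases: T[],
    options: { id?: string; category?: string },
): T[] {
    let filtered = testCases;
    if (options.id) filtered = filterById(filtered, options.id);
    if (options.category) filtered = filterByCategory(filtered, options.category);
    return filtered;
}

// Invented sample data for the example.
const cases: BaseTestCase[] = [
    { id: 'search-actors-1', category: 'search' },
    { id: 'search-actors-2', category: 'search-advanced' },
    { id: 'fetch-docs-1', category: 'docs' },
];

const ids = (list: BaseTestCase[]): string[] => list.map((testCase) => testCase.id);

console.log(ids(filterByCategory(cases, 'search')));  // exact match only: search-actors-1
console.log(ids(filterByCategory(cases, 'search*'))); // wildcard: search-actors-1, search-actors-2
console.log(ids(filterById(cases, 'actors')));        // substring match: search-actors-1, search-actors-2
console.log(ids(filterById(cases, '^fetch')));        // anchored by the caller: fetch-docs-1
console.log(ids(filterTestCases(cases, { id: 'actors', category: 'search-*' }))); // search-actors-2
```

When both options are given, `filterTestCases` applies them in sequence, so a test case must satisfy the ID pattern and the category pattern to be kept.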
