
Commit 8c091af

Squashed commit of the following:
commit d1f7dc7
Author: Jakub Kopecký <themq37@gmail.com>
Date: Wed Jan 7 14:03:21 2026 +0100

    fix: update @modelcontextprotocol/sdk to version 1.25.1 in package.json and package-lock.json (#384)

    * fix: update @modelcontextprotocol/sdk to version 1.25.1 in package.json and package-lock.json

    * fix: remove pollInterval from task creation in tool call request

commit 4270b02
Author: Jakub Kopecký <themq37@gmail.com>
Date: Wed Jan 7 12:10:14 2026 +0100

    feat(evals): add llm driven workflow evals with llm as a judge (#383)

    * feat(evals): add llm driven workflow evals with llm as a judge

    Add workflow evaluation system for testing AI agents in multi-turn conversations using Apify MCP tools, with LLM-based evaluation.

    Core Components:
    - Multi-turn conversation executor with dynamic tool discovery
    - LLM judge for evaluating agent performance against requirements
    - Isolated MCP server per test (prevents state contamination)
    - OpenRouter integration (agent + judge models)
    - Configurable tool timeout (default: 60s, MCP SDK integration)

    Architecture:
    • MCP server spawned fresh per test → test isolation
    • Tools refreshed after each turn → supports dynamic registration (add-actor)
    • Strict pass/fail → all tests must pass for CI success
    • Raw error propagation → LLM receives MCP SDK errors unchanged

    CLI Usage:
    npm run evals:workflow
    npm run evals:workflow -- --tool-timeout 300 --category search

    CLI Options:
    --tool-timeout <seconds>  Tool call timeout (default: 60)
    --agent-model <model>     Agent model (default: claude-haiku-4.5)
    --judge-model <model>     Judge model (default: grok-4.1-fast)
    --category <name>         Filter by category
    --id <id>                 Run specific test
    --verbose                 Show full conversations

    Environment:
    APIFY_TOKEN - Required for MCP server
    OPENROUTER_API_KEY - Required for LLM calls

    This enables systematic testing of MCP tools, agent tool-calling behavior, and automated quality evaluation without manual verification.

    * refactor(evals): extract shared utilities and unify test case format

    This commit refactors the evaluation system to eliminate code duplication and standardize test case formats across both tool selection and workflow evaluation systems.

    - types.ts: Unified type definitions for test cases and tools
    - config.ts: Shared OpenRouter configuration and environment validation
    - openai-tools.ts: Consolidated tool transformation utilities
    - test-case-loader.ts: Unified test case loading and filtering functions
    - Standardized on 'query' (previously 'prompt' in workflows)
    - Standardized on 'reference' (previously 'requirements' in workflows)
    - Added version tracking to workflows/test-cases.json
    - Maintains backwards compatibility through type exports

    Removed 7 duplicate functions across the codebase:
    - Test case loading (evaluation-utils.ts vs test-cases-loader.ts)
    - Test case filtering (filterById, filterByCategory, filterTestCases)
    - OpenAI tool transformation (transformToolsToOpenAIFormat vs mcpToolsToOpenAiTools)
    - OpenRouter configuration (OPENROUTER_CONFIG duplicated)
    - Environment validation (validateEnvVars duplicated)

    - OPENROUTER_BASE_URL is now optional (defaults to https://openrouter.ai/api/v1)
    - Created Phoenix-specific validation (validatePhoenixEnvVars)
    - Separated concerns between shared and system-specific config
    - Updated 11 existing files to use shared utilities
    - Deleted evals/workflows/convert-mcp-tools.ts (replaced by shared)
    - All imports updated to reference shared modules
    - Reduced config code by ~37%
    - Eliminated 100% of duplicate functions
    - Improved maintainability and consistency
    - No breaking changes to external APIs
    - TypeScript compilation: ✓
    - Project build: ✓
    - All imports verified: ✓

    * feat(evals): add parallel execution and fix linting for workflows

    - Add --concurrency/-c flag to run workflow evals in parallel (default: 4)
    - Add p-limit dependency for concurrency control
    - Enable ESLint for evals/workflows/ and evals/shared/ directories
    - Fix all linting issues (117 errors):
      - Convert interfaces to types per project convention
      - Fix import ordering with simple-import-sort
      - Remove trailing spaces
      - Fix comma-dangle, arrow-parens, operator-linebreak
      - Prefer node: protocol for built-in imports
      - Fix nested ternary in output-formatter.ts
    - Add logWithPrefix() helper for prefixed live output
    - Extract runSingleTest() function from main evaluation loop
    - Remove empty line after test completion in output

    Breaking changes: None (all changes backward compatible)

    Usage:
    npm run evals:workflow -- -c 10 # Run 10 tests in parallel
    npm run evals:workflow -- -c 1 # Sequential mode

    * feat(evals): use structured output for judge LLM and fix test filtering

    - Refactor judge to use OpenAI's structured output (JSON schema) for robust evaluation
    - Replace fragile text parsing with guaranteed JSON validation
    - Fix test case filtering to support wildcard patterns (--category) and regex (--id)
    - Add responseFormat parameter to LLM client for structured outputs
    - Update judge prompt to remove manual format instructions
    - Add test case for weather MCP Actor

    * feat(evals): MCP instructions, test tracking, and expanded test coverage

commit 6dd3b10
Author: Apify Release Bot <noreply@apify.com>
Date: Tue Jan 6 14:28:55 2026 +0000

    chore(release): Update changelog, package.json, manifest.json and server.json versions [skip ci]

commit eaeb57b
Author: Jiří Spilka <jiri.spilka@apify.com>
Date: Tue Jan 6 15:27:51 2026 +0100

    fix: Improve README for clarity and MCP clients info at the top (#382)
1 parent fa518e2 commit 8c091af

29 files changed

Lines changed: 3063 additions & 98 deletions

CHANGELOG.md

Lines changed: 7 additions & 0 deletions
@@ -2,6 +2,13 @@

 All notable changes to this project will be documented in this file.

+## [0.6.7](https://github.com/apify/apify-mcp-server/releases/tag/v0.6.7) (2026-01-06)
+
+### 🐛 Bug Fixes
+
+- Improve README for clarity and MCP clients info at the top ([#382](https://github.com/apify/apify-mcp-server/pull/382)) ([eaeb57b](https://github.com/apify/apify-mcp-server/commit/eaeb57b8a8a088dc75400a48fb8cc9d8e088fd08)) by [@jirispilka](https://github.com/jirispilka)
+
+
 ## [0.6.6](https://github.com/apify/apify-mcp-server/releases/tag/v0.6.6) (2026-01-05)

 ### 🚀 Features

README.md

Lines changed: 6 additions & 5 deletions
@@ -17,15 +17,16 @@
 </p>


-The Apify Model Context Protocol (MCP) server at [**mcp.apify.com**](https://mcp.apify.com) enables your AI agents to extract data from social media, search engines, maps, e-commerce sites, or any other website using thousands of ready-made scrapers, crawlers, and automation tools available on the [Apify Store](https://apify.com/store).
+The Apify Model Context Protocol (MCP) server at [**mcp.apify.com**](https://mcp.apify.com) enables your AI agents to extract data from social media, search engines, maps, e-commerce sites, and any other website using thousands of ready-made scrapers, crawlers, and automation tools from the [Apify Store](https://apify.com/store). It supports OAuth, allowing you to connect from clients like Claude.ai or Visual Studio Code using just the URL.

-> **🚀 Try the hosted Apify MCP Server!**
+> **🚀 Use the hosted Apify MCP Server!**
 >
-> For the easiest setup and most powerful features, including the ability to find and use any Actor from Apify Store, connect your AI assistant to our hosted server:
+> For the easiest setup and most powerful features, connect your AI assistant to our hosted server:
 >
 > **[`https://mcp.apify.com`](https://mcp.apify.com)**
->
-> It supports OAuth, so you can connect from clients like Claude.ai or Visual Studio Code with just the URL.
+
+Apify MCP Server is compatible with `Claude Code, Claude.ai, Cursor, VS Code` and any client that adheres to the Model Context Protocol.
+Check out the [MCP clients section](#-mcp-clients) for more details or visit the [MCP configuration page](https://mcp.apify.com).

 ![Apify-MCP-server](https://github.com/__raw/apify/apify-mcp-server/refs/heads/master/docs/apify-mcp-server.png)

eslint.config.mjs

Lines changed: 3 additions & 1 deletion
@@ -24,7 +24,9 @@ export default [
         ignores: [
             '**/dist', // Build output directory
             '**/.venv', // Python virtual environment (if present)
-            'evals/**', // Evaluation scripts directory
+            'evals/*.ts', // Top-level evaluation scripts
+            'evals/*.md', // Documentation files
+            'evals/*.json', // Test case data files
         ],
     },
     // Apply the shared Apify TypeScript ESLint configuration

evals/config.ts

Lines changed: 14 additions & 11 deletions
@@ -6,6 +6,9 @@ import { readFileSync } from 'node:fs';
 import { dirname, join } from 'node:path';
 import { fileURLToPath } from 'node:url';

+// Re-export shared config
+export { OPENROUTER_CONFIG, sanitizeHeaderValue, validateEnvVars, getRequiredEnvVars } from './shared/config.js';
+
 // Read version from test-cases.json
 function getTestCasesVersion(): string {
     const currentFilename = fileURLToPath(import.meta.url);
@@ -156,24 +159,24 @@ The response must be exactly:
 Decision: either "correct" or "incorrect".
 Explanation: brief explanation of the decision.
 `
-export function getRequiredEnvVars(): Record<string, string | undefined> {
+/**
+ * Get required environment variables for Phoenix-based evaluations
+ * Extends shared config with Phoenix-specific variables
+ * Note: OPENROUTER_BASE_URL is optional (defaults to https://openrouter.ai/api/v1)
+ */
+export function getPhoenixEnvVars(): Record<string, string | undefined> {
     return {
         PHOENIX_BASE_URL: process.env.PHOENIX_BASE_URL,
         PHOENIX_API_KEY: process.env.PHOENIX_API_KEY,
         OPENROUTER_API_KEY: process.env.OPENROUTER_API_KEY,
-        OPENROUTER_BASE_URL: process.env.OPENROUTER_BASE_URL,
     };
 }

-// Removes newlines and trims whitespace. Useful for Authorization header values
-// because CI secrets sometimes include trailing newlines or quotes.
-export function sanitizeHeaderValue(value?: string): string | undefined {
-    if (value == null) return value;
-    return value.replace(/[\r\n]/g, '').trim().replace(/^"|"$/g, '');
-}
-
-export function validateEnvVars(): boolean {
-    const envVars = getRequiredEnvVars();
+/**
+ * Validate Phoenix-specific environment variables
+ */
+export function validatePhoenixEnvVars(): boolean {
+    const envVars = getPhoenixEnvVars();
     const missing = Object.entries(envVars)
         .filter(([, value]) => !value)
         .map(([key]) => key);

evals/create-dataset.ts

Lines changed: 2 additions & 2 deletions
@@ -14,7 +14,7 @@ import { hideBin } from 'yargs/helpers';

 import log from '@apify/log';

-import { sanitizeHeaderValue, validateEnvVars } from './config.js';
+import { sanitizeHeaderValue, validatePhoenixEnvVars } from './config.js';
 import { loadTestCases, filterByCategory, filterById, type TestCase } from './evaluation-utils.js';

 // Set log level to debug
@@ -81,7 +81,7 @@ async function createDatasetFromTestCases(
     log.info('Creating Phoenix dataset from test cases...');

     // Validate environment variables
-    if (!validateEnvVars()) {
+    if (!validatePhoenixEnvVars()) {
         process.exit(1);
     }

evals/evaluation-utils.ts

Lines changed: 15 additions & 56 deletions
@@ -2,10 +2,6 @@
  * Shared evaluation utilities extracted from run-evaluation.ts
  */

-import { readFileSync } from 'node:fs';
-import { dirname as pathDirname, join } from 'node:path';
-import { fileURLToPath } from 'node:url';
-
 import OpenAI from 'openai';
 import { createOpenAI } from '@ai-sdk/openai';
 import { asEvaluator } from '@arizeai/phoenix-client/experiments';
@@ -24,50 +20,24 @@ import {
     TEMPERATURE,
     sanitizeHeaderValue
 } from './config.js';
+import { loadTestCases as loadTestCasesShared, filterByCategory, filterById } from './shared/test-case-loader.js';
+import { transformToolsToOpenAIFormat } from './shared/openai-tools.js';
+import type { ToolSelectionTestCase, TestData } from './shared/types.js';

-type ExampleInputOnly = { input: Record<string, unknown>, metadata?: Record<string, unknown>, output?: never };
+// Re-export types for backwards compatibility
+export type TestCase = ToolSelectionTestCase;
+export type { TestData } from './shared/types.js';

-export type TestCase = {
-    id: string;
-    category: string;
-    query: string;
-    context?: string | string[];
-    expectedTools?: string[];
-    reference?: string;
-};
-
-export type TestData = {
-    version: string;
-    testCases: TestCase[];
-};
-
-// eslint-disable-next-line consistent-return
-export function loadTestCases(filePath: string): TestData {
-    const filename = fileURLToPath(import.meta.url);
-    const dirname = pathDirname(filename);
-    const testCasesPath = join(dirname, filePath);
-
-    try {
-        const fileContent = readFileSync(testCasesPath, 'utf-8');
-        return JSON.parse(fileContent) as TestData;
-    } catch {
-        log.error(`Error: Test cases file not found at ${testCasesPath}`);
-        process.exit(1);
-    }
-}
-
-export function filterByCategory(testCases: TestCase[], category: string): TestCase[] {
-    // Convert wildcard pattern to regex
-    const pattern = category.replace(/\*/g, '.*');
-    const regex = new RegExp(`^${pattern}$`);
-
-    return testCases.filter((testCase) => regex.test(testCase.category));
-}
+// Re-export shared functions for backwards compatibility
+export { filterByCategory, filterById } from './shared/test-case-loader.js';

-export function filterById(testCases: TestCase[], idPattern: string): TestCase[] {
-    const regex = new RegExp(idPattern);
+type ExampleInputOnly = { input: Record<string, unknown>, metadata?: Record<string, unknown>, output?: never };

-    return testCases.filter((testCase) => regex.test(testCase.id));
+/**
+ * Load test cases from a JSON file (wrapper around shared function)
+ */
+export function loadTestCases(filePath: string): TestData {
+    return loadTestCasesShared(filePath);
 }

 export async function loadTools(): Promise<ToolBase[]> {
@@ -76,22 +46,11 @@ export async function loadTools(): Promise<ToolBase[]> {
     return urlTools.map((t: ToolEntry) => getToolPublicFieldOnly(t)) as ToolBase[];
 }

-export function transformToolsToOpenAIFormat(tools: ToolBase[]): OpenAI.Chat.Completions.ChatCompletionTool[] {
-    return tools.map((tool) => ({
-        type: 'function',
-        function: {
-            name: tool.name,
-            description: tool.description,
-            parameters: tool.inputSchema as OpenAI.Chat.ChatCompletionTool['function']['parameters'],
-        },
-    }));
-}
-
 export function createOpenRouterTask(modelName: string, tools: ToolBase[]) {
     const toolsOpenAI = transformToolsToOpenAIFormat(tools);

     return async (example: ExampleInputOnly): Promise<{
-        tool_calls: Array<{ function?: { name?: string } }>;
+        tool_calls: OpenAI.Chat.Completions.ChatCompletionMessageToolCall[];
         llm_response: string;
         query: string;
         context: string;
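
The filtering helpers now imported from `./shared/test-case-loader.js` are not part of the diff shown here. As a minimal sketch, assuming the shared versions behave like the implementations removed from `evaluation-utils.ts` above, the matching rules would look roughly like this. The `SketchTestCase` type and the sample data are hypothetical and exist only for illustration.

```typescript
// Hypothetical, trimmed-down test case shape used only for this sketch.
type SketchTestCase = { id: string; category: string; query: string };

// Wildcard match: each `*` becomes `.*`, and the pattern is anchored at both ends.
function filterByCategory(testCases: SketchTestCase[], category: string): SketchTestCase[] {
    const pattern = category.replace(/\*/g, '.*');
    const regex = new RegExp(`^${pattern}$`);
    return testCases.filter((testCase) => regex.test(testCase.category));
}

// Regex match: the id pattern is used as-is and is NOT anchored.
function filterById(testCases: SketchTestCase[], idPattern: string): SketchTestCase[] {
    const regex = new RegExp(idPattern);
    return testCases.filter((testCase) => regex.test(testCase.id));
}

// Hypothetical sample data (not taken from test-cases.json).
const sampleCases: SketchTestCase[] = [
    { id: 'search-actors-1', category: 'search-actors', query: 'find a scraper' },
    { id: 'search-docs-1', category: 'search-docs', query: 'find documentation' },
    { id: 'call-actor-1', category: 'call-actor', query: 'run a scraper' },
];

// Matches both "search-*" categories.
const searchOnly = filterByCategory(sampleCases, 'search-*');
// Matches nothing: without a wildcard the category must match exactly.
const exactOnly = filterByCategory(sampleCases, 'search');
// Matches every id containing "actor", because the regex is unanchored.
const actorIds = filterById(sampleCases, 'actor').map((testCase) => testCase.id);
```

Only `*` is translated in the category pattern; other regex metacharacters in the input (for example `.`) keep their regex meaning, which may or may not be intended.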

evals/run-evaluation.ts

Lines changed: 2 additions & 2 deletions
@@ -28,7 +28,7 @@ import {
     EVALUATOR_NAMES,
     type EvaluatorName,
     sanitizeHeaderValue,
-    validateEnvVars
+    validatePhoenixEnvVars
 } from './config.js';

 type EvaluatorResult = {
@@ -202,7 +202,7 @@ function printResults(results: EvaluatorResult[]): void {
 async function main(datasetName: string): Promise<number> {
     log.info('Starting MCP tool calling evaluation');

-    if (!validateEnvVars()) {
+    if (!validatePhoenixEnvVars()) {
         return 1;
     }

evals/shared/config.ts

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+/**
+ * Shared configuration for evaluation systems
+ * Contains OpenRouter config, environment validation, and common utilities
+ */
+
+/**
+ * OpenRouter API configuration
+ * OPENROUTER_BASE_URL is optional and defaults to the standard OpenRouter API URL
+ */
+export const OPENROUTER_CONFIG = {
+    baseURL: process.env.OPENROUTER_BASE_URL || 'https://openrouter.ai/api/v1',
+    apiKey: process.env.OPENROUTER_API_KEY || '',
+};
+
+/**
+ * Get required environment variables
+ * Note: OPENROUTER_BASE_URL is optional (defaults to https://openrouter.ai/api/v1)
+ */
+export function getRequiredEnvVars(): Record<string, string | undefined> {
+    return {
+        OPENROUTER_API_KEY: process.env.OPENROUTER_API_KEY,
+    };
+}
+
+/**
+ * Removes newlines and trims whitespace. Useful for Authorization header values
+ * because CI secrets sometimes include trailing newlines or quotes.
+ */
+export function sanitizeHeaderValue(value?: string): string | undefined {
+    if (value == null) return value;
+    return value.replace(/[\r\n]/g, '').trim().replace(/^"|"$/g, '');
+}
+
+/**
+ * Validate that all required environment variables are present
+ */
+export function validateEnvVars(): boolean {
+    const envVars = getRequiredEnvVars();
+    const missing = Object.entries(envVars)
+        .filter(([, value]) => !value)
+        .map(([key]) => key);
+
+    if (missing.length > 0) {
+        // eslint-disable-next-line no-console
+        console.error(`Missing required environment variables: ${missing.join(', ')}`);
+        return false;
+    }
+
+    return true;
+}
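
To make the behavior of the two helpers above concrete, here is a small sketch. `sanitizeHeaderValue` is copied from the diff; `findMissingEnvVars` is a hypothetical helper (not in the commit) that applies the same filter/map as `validateEnvVars` but takes the variables as an argument, so the example does not depend on `process.env`. The token value is made up.

```typescript
// Copied from evals/shared/config.ts as shown in the diff.
function sanitizeHeaderValue(value?: string): string | undefined {
    if (value == null) return value;
    return value.replace(/[\r\n]/g, '').trim().replace(/^"|"$/g, '');
}

// Hypothetical helper: the same "missing" computation as validateEnvVars,
// but it returns the missing names instead of logging them.
function findMissingEnvVars(envVars: Record<string, string | undefined>): string[] {
    return Object.entries(envVars)
        .filter(([, value]) => !value)
        .map(([key]) => key);
}

// A CI secret pasted with surrounding quotes and a trailing newline (made-up value).
const cleanedToken = sanitizeHeaderValue('"sk-example-token"\n');

// undefined passes through unchanged.
const cleanedUndefined = sanitizeHeaderValue(undefined);

// Empty strings count as missing, just like undefined.
const missingVars = findMissingEnvVars({
    OPENROUTER_API_KEY: '',
    PHOENIX_API_KEY: 'set',
    PHOENIX_BASE_URL: undefined,
});
```

Note that the final `replace` strips at most one leading and one trailing double quote; quotes inside the value are left alone.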

evals/shared/openai-tools.ts

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
/**
2+
* Convert tool definitions to OpenAI format
3+
* Unified function for both MCP tools and internal ToolBase types
4+
*/
5+
6+
import type OpenAI from 'openai';
7+
8+
import type { McpTool } from './types.js';
9+
10+
/**
11+
* Generic tool interface that matches both ToolBase and McpTool
12+
*/
13+
type GenericTool = {
14+
name: string;
15+
description?: string;
16+
inputSchema: Record<string, unknown>;
17+
}
18+
19+
/**
20+
* Convert tools to OpenAI Chat Completion format
21+
* Works with both MCP tools and ToolBase from the server
22+
*/
23+
export function transformToolsToOpenAIFormat(tools: GenericTool[]): OpenAI.Chat.Completions.ChatCompletionTool[] {
24+
return tools.map((tool) => ({
25+
type: 'function' as const,
26+
function: {
27+
name: tool.name,
28+
description: tool.description || '',
29+
parameters: tool.inputSchema,
30+
},
31+
}));
32+
}
33+
34+
/**
35+
* Alias for MCP-specific usage (keeps backwards compatibility)
36+
*/
37+
export function mcpToolsToOpenAiTools(mcpTools: McpTool[]): OpenAI.Chat.Completions.ChatCompletionTool[] {
38+
return transformToolsToOpenAIFormat(mcpTools);
39+
}
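
The following sketch shows what the conversion above produces for a tool with and without a description. To stay self-contained it replaces the `openai` import with a local structural type (`FunctionToolSketch`, an assumption that mirrors only the fields used here), and the tool names and schemas are hypothetical.

```typescript
// Same shape as GenericTool in the diff.
type GenericTool = {
    name: string;
    description?: string;
    inputSchema: Record<string, unknown>;
};

// Local stand-in for OpenAI.Chat.Completions.ChatCompletionTool (assumption:
// only the fields used by the conversion are modeled here).
type FunctionToolSketch = {
    type: 'function';
    function: { name: string; description: string; parameters: Record<string, unknown> };
};

// Same mapping as transformToolsToOpenAIFormat in the diff.
function transformToolsToOpenAIFormat(tools: GenericTool[]): FunctionToolSketch[] {
    return tools.map((tool) => ({
        type: 'function' as const,
        function: {
            name: tool.name,
            description: tool.description || '',
            parameters: tool.inputSchema,
        },
    }));
}

// Hypothetical tool definitions for illustration.
const exampleTools: GenericTool[] = [
    {
        name: 'example-search',
        description: 'Search for something',
        inputSchema: { type: 'object', properties: { query: { type: 'string' } }, required: ['query'] },
    },
    // No description: the converter falls back to an empty string.
    { name: 'example-no-description', inputSchema: { type: 'object', properties: {} } },
];

const converted = transformToolsToOpenAIFormat(exampleTools);
```

The input schema is passed through by reference rather than copied, so mutating `converted[0].function.parameters` would also change the original tool definition.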
