
fix: Improve README for clarity and MCP clients info at the top #382

Merged
jirispilka merged 1 commit into master from fix/improve-readme on Jan 6, 2026
Conversation

@jirispilka
Collaborator

No description provided.

@jirispilka jirispilka requested a review from MQ37 January 6, 2026 14:12
@github-actions bot added the t-ai label (Issues owned by the AI team) on Jan 6, 2026
@jirispilka jirispilka merged commit eaeb57b into master Jan 6, 2026
2 checks passed
@jirispilka jirispilka deleted the fix/improve-readme branch January 6, 2026 14:27
MQ37 added a commit that referenced this pull request Jan 7, 2026
commit d1f7dc7
Author: Jakub Kopecký <themq37@gmail.com>
Date:   Wed Jan 7 14:03:21 2026 +0100

    fix: update @modelcontextprotocol/sdk to version 1.25.1 in package.json and package-lock.json (#384)

    * fix: update @modelcontextprotocol/sdk to version 1.25.1 in package.json and package-lock.json

    * fix: remove pollInterval from task creation in tool call request

commit 4270b02
Author: Jakub Kopecký <themq37@gmail.com>
Date:   Wed Jan 7 12:10:14 2026 +0100

    feat(evals): add llm driven workflow evals with llm as a judge (#383)

    * feat(evals): add llm driven workflow evals with llm as a judge

    Add workflow evaluation system for testing AI agents in multi-turn
    conversations using Apify MCP tools, with LLM-based evaluation.

    Core Components:
    - Multi-turn conversation executor with dynamic tool discovery
    - LLM judge for evaluating agent performance against requirements
    - Isolated MCP server per test (prevents state contamination)
    - OpenRouter integration (agent + judge models)
    - Configurable tool timeout (default: 60s, MCP SDK integration)

    Architecture:
    • MCP server spawned fresh per test → test isolation
    • Tools refreshed after each turn → supports dynamic registration (add-actor)
    • Strict pass/fail → all tests must pass for CI success
    • Raw error propagation → LLM receives MCP SDK errors unchanged

    CLI Usage:
    npm run evals:workflow
    npm run evals:workflow -- --tool-timeout 300 --category search

    CLI Options:
    --tool-timeout <seconds>  Tool call timeout (default: 60)
    --agent-model <model>     Agent model (default: claude-haiku-4.5)
    --judge-model <model>     Judge model (default: grok-4.1-fast)
    --category <name>         Filter by category
    --id <id>                 Run specific test
    --verbose                 Show full conversations

    Environment:
    APIFY_TOKEN - Required for MCP server
    OPENROUTER_API_KEY - Required for LLM calls
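    Putting the two variables and a run together, a minimal sketch (placeholder values, not project documentation):

    ```shell
    # Both variables must be set before invoking the workflow evals.
    export APIFY_TOKEN="<your-apify-token>"          # used by the spawned MCP server
    export OPENROUTER_API_KEY="<your-openrouter-key>" # used for agent + judge LLM calls
    npm run evals:workflow -- --category search
    ```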

    This enables systematic testing of MCP tools, agent tool-calling behavior,
    and automated quality evaluation without manual verification.

    * refactor(evals): extract shared utilities and unify test case format

    This commit refactors the evaluation system to eliminate code duplication
    and standardize test case formats across both tool selection and workflow
    evaluation systems.
    - types.ts: Unified type definitions for test cases and tools
    - config.ts: Shared OpenRouter configuration and environment validation
    - openai-tools.ts: Consolidated tool transformation utilities
    - test-case-loader.ts: Unified test case loading and filtering functions
    - Standardized on 'query' (previously 'prompt' in workflows)
    - Standardized on 'reference' (previously 'requirements' in workflows)
    - Added version tracking to workflows/test-cases.json
    - Maintains backwards compatibility through type exports
    Removed 7 duplicate functions across the codebase:
    - Test case loading (evaluation-utils.ts vs test-cases-loader.ts)
    - Test case filtering (filterById, filterByCategory, filterTestCases)
    - OpenAI tool transformation (transformToolsToOpenAIFormat vs mcpToolsToOpenAiTools)
    - OpenRouter configuration (OPENROUTER_CONFIG duplicated)
    - Environment validation (validateEnvVars duplicated)
    - OPENROUTER_BASE_URL is now optional (defaults to https://openrouter.ai/api/v1)
    - Created Phoenix-specific validation (validatePhoenixEnvVars)
    - Separated concerns between shared and system-specific config
    - Updated 11 existing files to use shared utilities
    - Deleted evals/workflows/convert-mcp-tools.ts (replaced by shared)
    - All imports updated to reference shared modules
    - Reduced config code by ~37%
    - Eliminated 100% of duplicate functions
    - Improved maintainability and consistency
    - No breaking changes to external APIs
    - TypeScript compilation: ✓
    - Project build: ✓
    - All imports verified: ✓

    * feat(evals): add parallel execution and fix linting for workflows

    - Add --concurrency/-c flag to run workflow evals in parallel (default: 4)
    - Add p-limit dependency for concurrency control
    - Enable ESLint for evals/workflows/ and evals/shared/ directories
    - Fix all linting issues (117 errors):
      - Convert interfaces to types per project convention
      - Fix import ordering with simple-import-sort
      - Remove trailing spaces
      - Fix comma-dangle, arrow-parens, operator-linebreak
      - Prefer node: protocol for built-in imports
      - Fix nested ternary in output-formatter.ts
    - Add logWithPrefix() helper for prefixed live output
    - Extract runSingleTest() function from main evaluation loop
    - Remove empty line after test completion in output

    Breaking changes: None (all changes backward compatible)

    Usage:
      npm run evals:workflow -- -c 10  # Run 10 tests in parallel
      npm run evals:workflow -- -c 1   # Sequential mode
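    For reference, the p-limit pattern the runner relies on can be sketched in self-contained TypeScript. This is an illustrative reimplementation (the project uses the real p-limit package; runAll and the test names are hypothetical):

    ```typescript
    // pLimit(n) returns a wrapper that lets at most n promises run at once;
    // excess calls wait in a FIFO queue until a slot frees up.
    function pLimit(concurrency: number) {
      let active = 0;
      const queue: Array<() => void> = [];
      const next = () => {
        active--;
        queue.shift()?.(); // start the oldest queued task, if any
      };
      return <T>(fn: () => Promise<T>): Promise<T> =>
        new Promise<T>((resolve, reject) => {
          const run = () => {
            active++;
            fn().then(resolve, reject).finally(next);
          };
          if (active < concurrency) run();
          else queue.push(run);
        });
    }

    // Hypothetical runner: execute all tests with at most 4 in flight.
    async function runAll(tests: string[]): Promise<string[]> {
      const limit = pLimit(4);
      return Promise.all(tests.map((t) => limit(async () => `${t}: pass`)));
    }

    runAll(['t1', 't2', 't3']).then((r) => console.log(r.join(', ')));
    // prints: t1: pass, t2: pass, t3: pass
    ```

    Promise.all preserves input order, so results line up with test cases even when tasks finish out of order.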

    * feat(evals): use structured output for judge LLM and fix test filtering

    - Refactor judge to use OpenAI's structured output (JSON schema) for robust evaluation
    - Replace fragile text parsing with guaranteed JSON validation
    - Fix test case filtering to support wildcard patterns (--category) and regex (--id)
    - Add responseFormat parameter to LLM client for structured outputs
    - Update judge prompt to remove manual format instructions
    - Add test case for weather MCP Actor
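    A minimal sketch of what such a structured-output request can look like (OpenAI-style response_format with a JSON schema; the pass/reasoning field names are assumptions, not the project's actual judge schema):

    ```typescript
    // Passed as responseFormat to the LLM client; 'strict' asks the provider
    // to guarantee the reply conforms to the schema.
    const judgeResponseFormat = {
      type: 'json_schema',
      json_schema: {
        name: 'judge_verdict',
        strict: true,
        schema: {
          type: 'object',
          properties: {
            pass: { type: 'boolean' },
            reasoning: { type: 'string' },
          },
          required: ['pass', 'reasoning'],
          additionalProperties: false,
        },
      },
    };

    // The model's reply is then guaranteed-parseable JSON, so no fragile
    // text parsing is needed:
    const verdict = JSON.parse('{"pass": true, "reasoning": "met all requirements"}');
    console.log(judgeResponseFormat.json_schema.name, verdict.pass);
    ```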

    * feat(evals): MCP instructions, test tracking, and expanded test coverage

commit 6dd3b10
Author: Apify Release Bot <noreply@apify.com>
Date:   Tue Jan 6 14:28:55 2026 +0000

    chore(release): Update changelog, package.json, manifest.json and server.json versions [skip ci]

commit eaeb57b
Author: Jiří Spilka <jiri.spilka@apify.com>
Date:   Tue Jan 6 15:27:51 2026 +0100

    fix: Improve README for clarity and MCP clients info at the top (#382)
MQ37 added a commit that referenced this pull request Jan 8, 2026
commit c1c415f
Author: Apify Release Bot <noreply@apify.com>
Date:   Thu Jan 8 09:53:59 2026 +0000

    chore(release): Update changelog, package.json, manifest.json and server.json versions [skip ci]

commit 31c3bdd
Author: Jakub Kopecký <themq37@gmail.com>
Date:   Thu Jan 8 10:53:06 2026 +0100

    fix: update @modelcontextprotocol/sdk to version 1.25.2 in package.json and package-lock.json (#385)

MQ37 added a commit that referenced this pull request Jan 14, 2026
#387)

* plan.md

* Squashed commit of the following: d1f7dc7, 4270b02, 6dd3b10, eaeb57b (listed in full above)

* refactor: simplify call-actor tool to single-step workflow and enhance fetch-actor-details

## Summary

This commit simplifies the Actor calling workflow from a mandatory two-step process to a more intuitive single-step approach, while enhancing tool capabilities and improving evaluation infrastructure.

## Core Changes

### call-actor Tool Simplification (src/tools/actor.ts)
- **Removed**: Mandatory two-step workflow (step='info' then step='call')
- **Changed**: Input is now required (was optional in step='call')
- **Updated**: Tool description to guide users to fetch-actor-details first
- **Simplified**: Removed conditional logic for step parameter
- **Added**: Support for MCP server Actors using 'actorName:toolName' format
- **Added**: taskSupport: 'optional' execution annotation for long-running tasks

### fetch-actor-details Enhancement (src/tools/fetch-actor-details.ts)
- **Added**: 'output' parameter to control response content (description, stats, pricing, input-schema, readme, mcp-tools)
- **Added**: Support for listing available MCP tools directly (output=['mcp-tools'])
- **Enhanced**: Better token efficiency with output=['input-schema'] for minimal responses
- **Updated**: Documentation with usage examples and parameter descriptions
- **Added**: MCP client connection and tool listing logic

### Tool Architecture Refactoring
- **Created**: src/tools/categories.ts - Separate module for tool categories to avoid circular dependencies
- **Created**: src/utils/tool-categories-helpers.ts - Helper functions for category operations
- **Refactored**: src/tools/index.ts - Now uses string constants and imports from categories.ts
- **Fixed**: Circular dependency: tools/index.ts → utils/tools.ts → tools/categories.ts → tools/index.ts

### Evaluation System Improvements
- **Updated**: evals/config.ts - Tool selection guidelines reflect new single-step workflow
- **Removed**: evals/run-evaluation.ts - Normalization of call-actor step='info' to fetch-actor-details (tools now independent)
- **Enhanced**: evals/workflows/mcp-client.ts:
  - Better error handling with error field population
  - Timeout-based cleanup (2s default) to prevent indefinite waiting
  - Force kill of transport process if graceful shutdown fails
  - More robust state cleanup

### Test Infrastructure Enhancements
- **Created**: evals/shared/line-range-parser.ts - Parse line ranges from strings
- **Created**: evals/shared/line-range-filter.ts - Filter test cases by line numbers
- **Enhanced**: evals/workflows/test-cases-loader.ts - New loadTestCasesWithLineNumbers() function
- **Added**: evals/workflows/run-workflow-evals.ts - Line range filtering support (--lines flag)
- **Added**: evals/workflows/output-formatter.ts - Tool result display in verbose mode
- **Updated**: evals/workflows/README.md - Documentation for line range filtering

### Documentation Updates
- **Updated**: README.md - call-actor description to emphasize fetch-actor-details requirement
- **Updated**: AGENTS.md - Added quick validation workflow section

## Breaking Changes

- **call-actor**: No longer supports step='info' parameter. Use fetch-actor-details instead.
- **call-actor**: Input parameter is now required (was optional before).

## Benefits

1. **Simpler workflow**: Users call the Actor directly without intermediate schema fetch
2. **Clearer tool division**: fetch-actor-details handles all documentation/schema needs
3. **Better UX**: fetch-actor-details with output=['input-schema'] provides token-efficient schema retrieval
4. **MCP tool discovery**: fetch-actor-details with output=['mcp-tools'] lists available tools
5. **Cleaner code**: Removed two-step orchestration logic from call-actor
6. **Better testing**: Line range filtering for test case evaluation
7. **Robust cleanup**: Timeout-based cleanup prevents hanging processes
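
To illustrate the new single-step shape, a hypothetical pair of call-actor inputs (argument names are assumed from this description, not taken from the real tool schema):

```typescript
// Plain Actor call: no step='info' round trip; input is now required.
const plainCall = {
  actor: 'apify/rag-web-browser',
  input: { query: 'model context protocol' },
};

// MCP server Actors expose tools addressed as 'actorName:toolName'
// (the Actor and tool names here are made up for illustration).
const mcpCall = {
  actor: 'example-user/weather-mcp:get-forecast',
  input: { city: 'Prague' },
};

console.log(plainCall.actor, mcpCall.actor);
```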

* Squashed commit of the following: c1c415f, 31c3bdd, d1f7dc7, 4270b02, 6dd3b10, eaeb57b (listed in full above)

* feat: add granular output controls for Actor card sections

Implement granular control over Actor card output to enable token-efficient
information retrieval. Users can now request specific sections (description,
stats, pricing, rating, metadata) independently instead of receiving all
information bundled together.

Changes:
- Add ActorCardOptions type with 5 boolean flags for granular control
- Update formatActorToActorCard() and formatActorToStructuredCard() with
  conditional rendering based on options
- Fix rating and bookmarkCount to check both ActorStoreList and Actor.stats
  locations for better compatibility
- Add comprehensive unit tests (32 tests) using real apify/rag-web-browser data
- Update integration tests to validate granular output functionality
- Update fetch-actor-details tool schema to include 'rating' and 'metadata'
  as separate output options

Benefits:
- Reduces token usage by allowing users to request only needed information
- Maintains backwards compatibility (all options default to true)
- Improves flexibility for different use cases (e.g., pricing-only queries)

* remove plan.md

* refactor: use object with boolean flags for fetch-actor-details output parameter

Changes the output parameter from array of enums to object with boolean flags
for better LLM performance and simpler internal logic. Also includes related
improvements to error handling and documentation.

Key changes:
- Replace output array (e.g., ['input-schema']) with object flags (e.g., { inputSchema: true })
- Rework missing input error to include schema directly (reduces LLM round trips)
- Remove cross-file duplicate instructions in actor.ts parameter descriptions
- Simplify fetch-actor-details description by ~29% (remove redundant text)
- Update all tests to use new object format
- Add "do not run" instructions to all search category eval test cases

Benefits:
- Better LLM performance and reliability with explicit boolean flags
- Consistent error handling pattern (missing input now matches invalid input)
- Reduced token usage in tool descriptions
- Cleaner, more maintainable code (no .includes() checks)
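
A before/after sketch of the output parameter change (type and field names are inferred from this description, not copied from the actual fetchActorDetailsToolArgsSchema):

```typescript
// Boolean flags replacing the old array-of-enums output parameter.
type ActorDetailsOutput = {
  description?: boolean;
  stats?: boolean;
  pricing?: boolean;
  inputSchema?: boolean;
  readme?: boolean;
  mcpTools?: boolean;
};

// Before: array of enum strings — needed .includes() checks internally.
const before = { actor: 'apify/rag-web-browser', output: ['input-schema'] };

// After: explicit boolean flags — simpler for LLMs to emit and for code to read.
const after: { actor: string; output: ActorDetailsOutput } = {
  actor: 'apify/rag-web-browser',
  output: { inputSchema: true },
};

console.log(before.output.length, after.output.inputSchema);
```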

* Squashed commit of the following:

commit efc6ade
Merge: c1c415f 878ada4
Author: jakcinmarina <52315405+jakcinmarina@users.noreply.github.com>
Date:   Wed Jan 14 13:57:21 2026 +0100

    feat(gpt-apps): add tools and widget descriptors (#375)

commit 878ada4
Author: Marina Jakčin <marina.jakcin@applifting.cz>
Date:   Wed Jan 14 10:41:46 2026 +0100

    feat(gpt-apps): auto-include get-actor-run for call-actor and uiMode

commit a9241e2
Author: Marina Jakčin <marina.jakcin@applifting.cz>
Date:   Wed Jan 14 09:04:20 2026 +0100

    refactor(gpt-apps): centralize and improve widget setup

commit a2339ff
Author: Marina Jakčin <marina.jakcin@applifting.cz>
Date:   Fri Jan 9 10:51:46 2026 +0100

    chore(gpt-apps): update widget CSP and widgetDomain configuration

commit 118037f
Author: Marina Jakčin <marina.jakcin@applifting.cz>
Date:   Fri Jan 9 10:25:03 2026 +0100

    refactor(gpt-apps): merge separate widget tools with existing ones

commit fa46c09
Author: Marina Jakčin <marina.jakcin@applifting.cz>
Date:   Mon Dec 22 16:36:35 2025 +0100

    feat(gpt-apps): add tools and widget descriptors

* refactor: make output parameters optional in fetchActorDetailsToolArgsSchema

Labels

t-ai Issues owned by the AI team.


2 participants