
feat: add giskard-llm as lean litellm replacement #2329

Merged
kevinmessiaen merged 68 commits into main from feat/giskard-llm on Apr 29, 2026


Conversation


@Hartorn Hartorn commented Mar 24, 2026

Summary

  • Introduces a new giskard-llm library that replaces litellm as the default routing layer for giskard-agents, wrapping native provider SDKs (OpenAI, Google GenAI, Anthropic, Azure OpenAI, Azure AI Foundry) and eliminating 200+ transitive dependencies.
  • Updates giskard-agents to import from giskard.llm via GiskardLLMGenerator, with pass-through extras so users install only the provider SDKs they need (pip install giskard-agents[openai]); a short usage sketch follows this list.
  • Keeps LiteLLMGenerator available as a first-class alternative backend behind an optional litellm extra, so both backends can coexist in the same process.
  • Adds provider-specific pytest marks (@pytest.mark.google, @pytest.mark.litellm, etc.) with auto-skip logic and a CI matrix that installs/tests each backend + provider independently.
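
As a rough illustration of the install-and-use flow described above (the import path and constructor argument are assumptions, not taken from the diff):

```python
# Assumes: pip install "giskard-agents[openai]"
# (pulls in giskard-llm plus only the OpenAI SDK through the pass-through extra)
from giskard.agents.generators import GiskardLLMGenerator  # import path assumed

# The "provider/model" string is routed by giskard-llm to the matching provider adapter.
generator = GiskardLLMGenerator(model="openai/gpt-4.1-nano")
```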

What changed

New: libs/giskard-llm/

  • Protocol-based provider system (providers/base.py) — CompletionProvider, EmbeddingProvider, ResponseProvider protocols. Providers implement only the protocols they support (e.g. Anthropic implements only CompletionProvider). A sketch follows this list.
  • Provider adapters (providers/openai.py, providers/google.py, providers/anthropic.py, providers/azure_openai.py, providers/azure_ai.py) — thin wrappers over native SDKs with unified error mapping via _map_error(), structured output via native SDK features, and class-level _PROVIDER attribute for correct error attribution in Azure subclasses.
  • Response/Interactions API — ResponseProvider protocol with respond() method, supporting OpenAI Responses API and Gemini Interactions API with unified ResponseResult, ResponseOutputText, ResponseOutputFunctionCall types. Tool definitions are converted from nested Chat Completions format to the flat format expected by each API.
  • Canonical tool result format — FunctionCallOutput TypedDict as the unified input format for feeding back tool results. GoogleProvider normalizes this to function_result format internally.
  • Unified error hierarchy (errors.py) — LLMError base with AuthenticationError, RateLimitError, ServerError, LLMTimeoutError, BadRequestError, UnsupportedOperationError, ProviderNotAvailableError, each carrying a status_code for retry decisions. ProviderNotAvailableError accepts an optional extra hint (e.g. extra="azure" → pip install giskard-llm[azure]).
  • Routing (routing.py) — parses "provider/model" strings and dispatches to the correct provider adapter. LLMClient supports named configurations via configure(). Module-level configure(), reset(), acompletion(), aembedding(), aresponse() convenience functions. A usage sketch follows this list.
  • Response types (types.py) — Pydantic v2 models inheriting from _BaseModel (which defaults model_dump to exclude_none=True). CompletionResponse, EmbeddingResponse, ResponseResult. ToolCallFunction.arguments is str (JSON) for wire-compatible round-trips. TypedDicts (ToolDef, ChatMessage, FunctionCallOutput) for user-constructed inputs.
  • Retry helper (retry.py) — class-based RETRYABLE_ERRORS = frozenset({LLMTimeoutError, RateLimitError, ServerError}) and a should_retry(exc) helper — no HTTP-status branching at the retry layer.
  • Design documentation (docs/design.md) — documents type conventions (TypedDict vs Pydantic), tool format differences, canonical tool result format, and key design decisions.
  • Unit tests — cover routing, retry, types, errors, error mapping completeness, message conversion, input normalization, and mocked provider integration including protocol conformance checks.
  • Functional tests — per-provider tests for completion, embedding, response API, tool roundtrips (hardcoded + parallel), multi-system messages, and behavior config, auto-skipped if the SDK extra isn't installed.
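
A minimal sketch of the protocol-based provider system from the first bullet above (the real definitions live in providers/base.py; the method names and signatures below are simplified assumptions):

```python
from typing import Any, Protocol, runtime_checkable

from giskard.llm import CompletionResponse, EmbeddingResponse  # types named in this PR


@runtime_checkable
class CompletionProvider(Protocol):
    async def acompletion(
        self, model: str, messages: list[dict[str, Any]], **kwargs: Any
    ) -> CompletionResponse: ...


@runtime_checkable
class EmbeddingProvider(Protocol):
    async def aembedding(
        self, model: str, inputs: list[str], **kwargs: Any
    ) -> EmbeddingResponse: ...


# A provider implements only the protocols it supports: an Anthropic adapter satisfies
# CompletionProvider but not EmbeddingProvider, and the router can check support with
# isinstance(provider, EmbeddingProvider) before dispatching, raising
# UnsupportedOperationError otherwise.
```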
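
And a hedged usage sketch of the routing layer together with the retry helper (acompletion and should_retry are importable from giskard.llm per this PR; the errors module path, keyword arguments, and response shape are assumptions):

```python
import asyncio

from giskard.llm import acompletion, should_retry
from giskard.llm.errors import LLMError


async def main() -> None:
    for attempt in range(3):
        try:
            # "provider/model" is parsed by routing.py and dispatched to the OpenAI adapter.
            response = await acompletion(
                model="openai/gpt-4.1-nano",
                messages=[{"role": "user", "content": "Say hello"}],
            )
            print(response.choices[0].message)
            break
        except LLMError as exc:
            # Retry decisions are based on the exception class (RETRYABLE_ERRORS),
            # not on HTTP status codes.
            if not should_retry(exc) or attempt == 2:
                raise


asyncio.run(main())
```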

Changed: libs/giskard-agents/

  • GiskardLLMGenerator (renamed from LiteLLMGenerator) and litellm_embedding_model.py now import acompletion, aembedding, should_retry from giskard.llm.
  • pyproject.toml replaces the litellm hard dep with giskard-llm>=1.0.0a1 and re-exports its provider extras (openai, google, anthropic, all). A new litellm extra re-enables the optional LiteLLMGenerator.
  • LiteLLMGenerator restored as a first-class alternative backend using the litellm SDK. Importing it without the extra raises ImportError with a clear install hint; it is lazily re-exported from generators/__init__.py via __getattr__ (sketched after this list) so plain imports of giskard.agents never require litellm. GiskardLLMGenerator no longer claims the "litellm" discriminator kind — each class owns its own kind.
  • Test fixtures updated to use CompletionResponse / EmbeddingResponse from giskard.llm.
  • conftest.py adds a pytest_collection_modifyitems hook to auto-skip provider-marked tests when the SDK isn't installed (sketched after this list), plus an autouse fixture to clear the provider cache between tests.
  • New test_litellm_generator.py — unit tests for the LiteLLM adapter (mocked acompletion, retry middleware status-code mapping, discriminator round-trip).
  • New test_generator_backends.py — functional tests exercising both backends through direct complete() and ChatWorkflow, plus a test_backends_coexist_in_same_process test that only runs when both SDKs are installed.
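
A sketch of the lazy re-export pattern described above for generators/__init__.py (module file names and the hint wording are assumptions):

```python
# giskard/agents/generators/__init__.py (sketch)
from .giskard_llm_generator import GiskardLLMGenerator  # hypothetical module name

__all__ = ["GiskardLLMGenerator", "LiteLLMGenerator"]


def __getattr__(name: str):
    # LiteLLMGenerator is imported only when explicitly requested, so plain imports
    # of giskard.agents never require the litellm SDK.
    if name == "LiteLLMGenerator":
        try:
            from .litellm_generator import LiteLLMGenerator  # hypothetical module name
        except ImportError as exc:
            raise ImportError(
                "LiteLLMGenerator requires the optional litellm extra: "
                'pip install "giskard-agents[litellm]"'
            ) from exc
        return LiteLLMGenerator
    raise AttributeError(name)
```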
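
And a sketch of the conftest.py auto-skip hook (the mark names come from this PR; the SDK module mapping and probing logic are assumptions):

```python
# conftest.py (sketch)
import importlib.util

import pytest

# Maps each provider mark to the SDK module whose absence should skip the test.
_PROVIDER_SDKS = {
    "openai": "openai",
    "google": "google.genai",
    "anthropic": "anthropic",
    "litellm": "litellm",
}


def _sdk_available(module_name: str) -> bool:
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:  # parent package of a dotted name not installed
        return False


def pytest_collection_modifyitems(config, items):
    for item in items:
        for mark, module_name in _PROVIDER_SDKS.items():
            if item.get_closest_marker(mark) and not _sdk_available(module_name):
                item.add_marker(pytest.mark.skip(reason=f"{module_name} is not installed"))
```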

Changed: CI

  • ci.yml adds giskard-llm to the unit test matrix (Python 3.12–3.14) and a test-no-providers job for SDK-less error handling.
  • integration-tests.yml:
    • test-agents-functional uses a 3-layout × 3-python matrix (giskard-llm / litellm / both) installed via top-level giskard-agents[...] extras so the pass-through declarations are exercised.
    • Adds test-llm-functional: matrix over 5 providers (openai, google, anthropic, azure, azure_ai), secrets scoped to ci environment.
    • Adds test-checks-functional: matrix over google provider, secrets scoped to ci environment.
    • Adds workflow_dispatch trigger for manual runs with head.sha || github.ref fallback.
    • All checkout steps use persist-credentials: false.
  • labeler.yml adds a Scope: LLM label for libs/giskard-llm/**.
  • Makefile gains test-no-providers and install-no-providers targets.
  • Default models: gpt-4.1-nano (OpenAI/Azure), gemini-2.5-flash (Google), claude-haiku-4-5-20251001 (Anthropic). Azure OpenAI default api_version updated to 2024-10-21.
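
Combined with the default models above, a provider-marked functional test in this layout might look roughly like the following (assuming pytest-asyncio is in use; the assertion shape is illustrative):

```python
import pytest

from giskard.llm import acompletion


@pytest.mark.openai  # auto-skipped by the conftest hook if the OpenAI SDK extra is missing
@pytest.mark.asyncio
async def test_openai_completion_smoke():
    response = await acompletion(
        model="openai/gpt-4.1-nano",
        messages=[{"role": "user", "content": "Reply with the single word: pong"}],
    )
    assert response.choices[0].message.content
```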

Bug fixes (from review follow-up)

  • Google Interactions API input normalization — rewrote _normalize_input_items to convert all items to Google's preferred flat ContentParam format (Pattern A). function_call uses id (not call_id) with arguments as dict; function_call_output → function_result with resolved name; role-tagged turns → TextContentParam. Docstring links to SDK type definitions, official example, and known issue #1906.
  • Google _to_response_result — prioritize output.id over output.call_id for function call IDs (matching what Google actually returns).
  • Google _convert_messages — resolve tool_call_id to actual function name from preceding assistant tool_calls; set role="user" for role="tool" messages.
  • Google _map_error — handle API_KEY_INVALID string check for AuthenticationError; fall back to status 500 on SDK APIError.
  • Anthropic _validate_messages — skip alternation check when either role is "tool" (consecutive tool messages are valid since they merge).
  • Anthropic _convert_messages — merge consecutive role="tool" messages into a single user message with multiple tool_result blocks (see the sketch after this list).
  • OpenAI _normalize_input_item — serialize function_call.arguments from dict to JSON string (OpenAI Responses API rejects dicts).
  • OpenAI _map_error — tolerate APIConnectionError missing a status_code attribute.
  • Moved pyright ignores to the from line so they apply to optional SDK imports without leaking to the rest of the module.
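
To make the Anthropic merge concrete, here is a standalone sketch of collapsing consecutive role="tool" messages into a single user turn carrying multiple tool_result blocks (an illustration of the behavior, not the actual _convert_messages implementation):

```python
from typing import Any


def merge_tool_messages(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collapse consecutive role="tool" messages (e.g. results of parallel tool calls)
    into one Anthropic-style user message with multiple tool_result blocks."""
    converted: list[dict[str, Any]] = []
    previous_was_tool = False
    for message in messages:
        if message["role"] == "tool":
            block = {
                "type": "tool_result",
                "tool_use_id": message["tool_call_id"],
                "content": message["content"],
            }
            if previous_was_tool:
                converted[-1]["content"].append(block)  # merge into the previous tool turn
            else:
                converted.append({"role": "user", "content": [block]})
            previous_was_tool = True
        else:
            converted.append(dict(message))
            previous_was_tool = False
    return converted
```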

Misc

  • pyrightconfig.json adds libs/giskard-llm/src to execution environments.
  • Workspace pyproject.toml adds the new package.
  • uv.lock regenerated after the litellm removal and follow-up dependency upgrades (anthropic, cryptography, google-genai, logfire-api, numpy, and other transitive bumps).
  • .secrets.baseline refreshed to track new test-file line numbers already flagged as false positives.

Test plan

To re-run integration tests manually: Dispatch workflow with ref feat/giskard-llm.

Made with Cursor

@Hartorn Hartorn requested a review from a team as a code owner March 24, 2026 18:24
@Hartorn Hartorn marked this pull request as draft March 24, 2026 18:24
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the LLM integration within the project by replacing the litellm dependency with a custom-built giskard-llm library. This change aims to streamline LLM interactions, reduce the overall dependency footprint, and provide more granular control over provider-specific functionalities. The update also includes improvements to the testing framework, enabling more efficient and targeted testing of different LLM providers.

Highlights

  • New giskard-llm Library: Introduced a new giskard-llm library to replace litellm, providing a lightweight routing layer over native provider SDKs (OpenAI, Google GenAI, Anthropic) and significantly reducing transitive dependencies.
  • Updated giskard-agents Integration: giskard-agents now imports directly from giskard.llm and supports pass-through extras, allowing users to install only the necessary provider SDKs (e.g., pip install giskard-agents[openai]).
  • Enhanced Testing Infrastructure: Added provider-specific pytest marks (@pytest.mark.google, etc.) with automatic skip logic and a CI matrix to install and test each provider independently.


Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/ci.yml
    • .github/workflows/integration-tests.yml


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new giskard-llm library, serving as a lightweight routing layer for various LLM providers (OpenAI, Google Gemini, Anthropic). The giskard-agents library has been refactored to depend on and utilize this new giskard-llm library, replacing its direct dependency on litellm. This change involves updating imports, adapting to the new unified response types for completions and embeddings, and modifying the retry middleware to use giskard-llm's error handling. Additionally, the build system and test suite have been updated to include the new library and support provider-specific functional testing, allowing tests to be skipped if a provider's SDK is not installed. No feedback to provide.

kevinmessiaen and others added 29 commits April 24, 2026 09:37
- Introduced a new `_coerce_json` function to handle JSON string coercion in argument parsing.
- Updated `ToolCallFunction` and `ResponseFunctionToolCall` classes to use `ArgumentDict`, improving type safety and consistency in argument handling.
- Added necessary imports to support the new type definitions and validation logic.
…ment

- Changed the error message in `AnthropicProvider` to specify that messages must contain at least one non-system message, improving clarity for users regarding message requirements.
- Modified the `transcript` property in `AssistantMessage` to handle empty output text gracefully, ensuring it defaults to an empty string instead of including a colon.
- Added new tests to verify the correct formatting of the transcript, including cases with tool calls, ensuring no duplicated role prefixes in the output.
…onMessage transcript handling

- Revised the documentation for output types to include `AssistantMessage` in the list of Pydantic models.
- Modified the `transcript` property in `FunctionMessage` to return an empty string when content is None, improving output consistency.
- Update design.md to reflect ArgumentDict (dict) contract instead of stale str claim
- Add exhaustive ValueError fallback in _completion_content_to_block and _completion_content_to_parts to prevent silent None returns
- Fix _validate_messages to count developer role alongside system in instruction message limit (system+developer combo now correctly requires merge_system=True)
- Add FunctionMessage transcript tests and translator tests (raises for Anthropic/Google, passes through for OpenAI)
- Remove no-op deserialize_arguments calls where arguments is already a dict (ArgumentDict)
- Add test_anthropic_validate_system_and_developer_raises_without_merge for the previously unchecked mixed case

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Raise ValueError for unsupported content type/role in OpenAI chat translator instead of returning None implicitly
- Wrap translator calls in try/except ValueError -> BadRequestError in Anthropic, Google, and OpenAI providers so translation errors flow through the error hierarchy
- Treat developer role as instruction-only in Google provider _validate_messages to prevent empty contents list
- Use dedicated fc_idx counter for Google tool call IDs instead of content part index
- Remove dead ResponseTranslator class from openai_chat.py
- Add case _ guard in chat.message() match statement

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Updated numpy and giskard-llm version specifications to include upper bounds, ensuring compatibility and preventing potential breaking changes.
- Refactored test assertions in test_generator_backends.py to align with updated response structure from the API.
- Introduced a new utility function to extract system messages, enhancing code clarity and reusability across translators.
- Changed the return type in LoggingMiddleware from Response to CompletionResponse to align with the updated response structure.
- Updated the workflow step runner to correctly handle tool calls from the response message.
- Refactored GoogleChatTranslator to generate unique tool call IDs using uuid4 instead of a simple counter.
- Enhanced error handling in deserialize_arguments to raise a ValueError for invalid JSON input.
- Adjusted imports and removed unused Response references in generator initialization.
…erators

- Changed the type of messages in LoggingMiddleware from Message to ChatMessage for better type alignment.
- Enhanced LiteLLMGenerator to serialize tool call arguments correctly, ensuring proper handling of function arguments.
- Updated GoogleChatTranslator to default function descriptions to an empty string if None, improving robustness.
- Updated exception handling in AnthropicProvider and OpenAIProvider to catch all exceptions, allowing for broader error management.
- Ensured that the _map_error method is invoked for all exceptions, improving robustness in response processing.
…orkflow and generators

- Added RuntimeError for empty choices list in _StepRunner to enhance error handling during response processing.
- Updated type hints in BaseGenerator to reflect CompletionResponse instead of Response, ensuring consistency across the codebase.
- Enhanced GoogleProvider to validate message roles more accurately, preventing empty system messages.
…I format

- Refactored LiteLLMGenerator to utilize OpenAIChatTranslator for message serialization, ensuring compatibility with OpenAI's expected input format.
- Updated type hints in the _serialize_messages method to reflect the new return type, enhancing type safety and clarity.
- Removed outdated serialization logic to streamline the codebase.
…rialization

- Removed unused imports and outdated serialization methods in OpenAIChatTranslator to simplify the codebase.
- Updated tool and message serialization to utilize model_dump for improved consistency with OpenAI's expected format.
- Enhanced ArgumentDict to include a PlainSerializer for JSON serialization based on context, ensuring proper handling of arguments.
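
For context, a context-aware PlainSerializer on an Annotated dict alias might look like this (a minimal sketch under assumptions; the field names and the "provider" context key are illustrative, not the actual giskard-llm code):

```python
import json
from typing import Annotated, Any

from pydantic import BaseModel, PlainSerializer, SerializationInfo


def _dump_arguments(value: dict[str, Any], info: SerializationInfo) -> Any:
    # Emit a JSON string when the serialization context asks for a wire format that
    # expects one (e.g. OpenAI chat tool calls); otherwise keep the dict as-is.
    context = info.context or {}
    if context.get("provider") == "openai":
        return json.dumps(value)
    return value


ArgumentDict = Annotated[dict[str, Any], PlainSerializer(_dump_arguments)]


class ToolCallFunction(BaseModel):
    name: str
    arguments: ArgumentDict


call = ToolCallFunction(name="get_weather", arguments={"city": "Paris"})
print(call.model_dump())                                # arguments stays a dict
print(call.model_dump(context={"provider": "openai"}))  # arguments becomes a JSON string
```
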
…ranslator

- Eliminated outdated tool call methods and imports in OpenAIChatTranslator to simplify the codebase.
- Streamlined the class by focusing on essential functionality, enhancing maintainability and clarity.
…response handling

- Renamed `prompt_tokens` and `completion_tokens` to `input_tokens` and `output_tokens` across various translators and response models for consistency.
- Simplified the OpenAIResponseTranslator by removing outdated input handling methods and utilizing `model_dump` for serialization.
- Updated tests to reflect the new naming conventions, ensuring alignment with the refactored usage structure.
…r implementations

- Introduced a new serialization framework for various models, improving consistency across Google and OpenAI translators.
- Updated the GoogleResponseTranslator and OpenAIResponseTranslator to utilize the new serialization methods, enhancing maintainability.
- Refactored the Anthropic and Google Chat translators to improve message handling and serialization logic.
- Added new utility functions for text content serialization and streamlined input handling across multiple response models.
- Updated tests to ensure compatibility with the new serialization structure and improved overall test coverage.
- Removed dependency on OpenAIChatTranslator for message serialization in LiteLLMGenerator.
- Implemented a new serialization method using model_dump for converting messages to OpenAI's expected format.
- Enhanced type safety by updating return type hints in the _serialize_messages method.
…esult

- Enhanced the output_text method to filter out None values from response outputs, ensuring only valid text is concatenated.
- Improved readability by restructuring the list comprehension for better clarity and maintainability.
…atParams

- Updated the response_format field validator to use class method syntax for improved clarity and consistency.
- Changed type hints from `dict[str, object]` to `dict[str, Any]` to enhance type safety.
- Simplified the validation logic to ensure proper handling of BaseModel subclasses.
…oss translators

- Updated message content handling in OpenAI, Anthropic, and Google Chat translators to support sequences of text content.
- Introduced new utility methods for converting content blocks to Giskard format, improving consistency in message serialization.
- Refactored message types to accommodate both string and sequence types for content, enhancing flexibility in message representation.
- Removed unused utility functions and streamlined imports to improve code clarity and maintainability.
- Updated tests to validate new content handling and serialization logic, ensuring compatibility with the revised structure.
…across providers

- Updated error handling in Google, OpenAI, and Anthropic providers to consistently raise BadRequestError on validation failures.
- Refactored message validation logic to enhance clarity and maintainability.
- Adjusted GoogleChatTranslator to include message count in tool call IDs for better traceability.
- Simplified usage token handling in GoogleResponseTranslator to align with updated response structures.
- Enhanced tests to validate new error handling and response formats, ensuring robustness across providers.
…rkflow components

- Replaced `output_text` with `text` in various classes to standardize message content retrieval.
- Enhanced message extraction logic to handle None values and improve clarity in message representation.
- Updated tests to reflect changes in message handling, ensuring robust validation of text content across components.
… update test assertions

- Updated `model_dump` calls in `GiskardLLMGenerator` and `LiteLLMGenerator` to include `exclude_unset=True` for improved parameter handling.
- Refactored test assertions to check for `text` instead of `content` in message responses, ensuring consistency across tests.
- Enhanced type hints in test functions to improve code clarity and maintainability.
- Introduced a new class method `register_serializer` in `_BaseModel` to facilitate the registration of serializers.
- Updated import statements to include the new registration function, enhancing the model's serialization capabilities.
…sponse translators

- Replaced the `register_serializer` decorator with a class method approach for `ToolDef`, `RefusalContent`, `TextContent`, and other message types across Anthropic, Google, and OpenAI translators.
- Introduced a `_PROVIDER` constant in each translator file to standardize provider naming in serialization context.
- Updated `model_dump` calls to utilize the new `_PROVIDER` constant for improved consistency in context handling.
refactor(giskard-llm): translator layer, split types, and provider slim-down
