Testing strategy

How to structure your AI agent test suite using Tenro alongside other testing approaches.

The testing pyramid for AI agents

                    ┌───────────────────┐
                    │   End-to-End      │  Few, slow, expensive
                    │   (Real APIs)     │  Smoke tests only
                    ├───────────────────┤
                    │   Integration     │  More tests
                    │   (Tenro)         │  Fast, free, reliable
                    ├───────────────────┤
                    │   Unit Tests      │  Many tests
                    │   (Pure logic)    │  Instant, no dependencies
                    └───────────────────┘

When to use each approach

Unit tests (no simulation needed)

Test pure logic that doesn't involve LLMs or external tools:

def test_parse_response():
    """Test response parsing logic."""
    raw = '{"name": "Alice", "age": 30}'
    result = parse_user_response(raw)
    assert result.name == "Alice"

Use for: parsers, formatters, validators, data transformations.
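
For context, a minimal implementation that would satisfy this test might look like the sketch below. The `User` dataclass shape is an assumption; the real `parse_user_response` is not shown in this guide.

```python
import json
from dataclasses import dataclass

# Hypothetical shape of the parsed result -- an assumption for illustration.
@dataclass
class User:
    name: str
    age: int

def parse_user_response(raw: str) -> User:
    """Parse a raw JSON user payload into a typed User object."""
    data = json.loads(raw)
    return User(name=data["name"], age=int(data["age"]))
```

Because the function is pure (no LLM, no network), the test above runs instantly with no simulation.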

Integration tests (Tenro simulation)

Test agent behaviour with simulated dependencies:

from tenro import link_tool, Provider, ToolCall
from tenro.simulate import llm, tool
from tenro.testing import tenro

@link_tool("search")
def search(query: str) -> list[str]:
    return api.search(query)

@tenro
def test_agent_workflow():
    """Test the agent calls tools and produces output."""
    tool.simulate(search, result=["doc1", "doc2"])
    llm.simulate(Provider.OPENAI, responses=[
        ToolCall(search, query="AI trends"),
        "Summary: AI is advancing rapidly.",
    ])

    result = research_agent.run("AI trends")

    tool.verify_many(search, count=1)
    assert "Summary" in result

Use for: agent workflows, tool calling, multi-turn conversations, error handling.

End-to-end tests (real APIs)

Smoke tests with real providers (run sparingly):

import pytest

@pytest.mark.e2e
def test_real_api_smoke():
    """Verify real API integration works."""
    result = research_agent.run("Hello")
    assert result is not None

Use for: deployment verification, API compatibility checks.

What to test with Tenro

1. Tool calling

Verify your agent calls the right tools with correct arguments:

from tenro import link_tool, Provider, ToolCall
from tenro.simulate import llm, tool
from tenro.testing import tenro

@link_tool("get_weather")
def get_weather(city: str) -> dict:
    return weather_api.fetch(city)

@tenro
def test_calls_correct_tool():
    # Simulate tool result
    tool.simulate(get_weather, result={"temp": 72})
    # LLM decides to call tool, then responds with final answer
    llm.simulate(Provider.OPENAI, responses=[
        ToolCall(get_weather, city="Paris"),
        "The weather in Paris is 72°F.",
    ])

    weather_agent.run("What's the weather in Paris?")

    tool.verify(get_weather, city="Paris")

2. Response handling

Verify your agent processes LLM responses correctly:

from tenro import Provider
from tenro.simulate import llm
from tenro.testing import tenro

@tenro
def test_handles_refusal():
    llm.simulate(
        provider=Provider.OPENAI,
        response="I cannot help with that request.",
    )

    result = agent.run("Do something harmful")

    assert "cannot help" in result

3. Multi-step workflows

Verify complex agent sequences where the LLM decides which tools to call:

from tenro import link_tool, Provider, ToolCall
from tenro.simulate import llm, tool
from tenro.testing import tenro

@link_tool("search")
def search(query: str) -> list[str]:
    return api.search(query)

@link_tool("fetch")
def fetch(doc_id: str) -> str:
    return api.fetch(doc_id)

@tenro
def test_research_workflow():
    # Simulate tool results
    tool.simulate(search, result=["doc1"])
    tool.simulate(fetch, result="Document content...")
    # LLM decides to search, then fetch, then summarize
    llm.simulate(Provider.OPENAI, responses=[
        ToolCall(search, query="AI papers"),
        ToolCall(fetch, doc_id="doc1"),
        "Summary of the document.",
    ])

    research_agent.run("Summarize AI papers")

    # Verify sequence
    tool.verify_many(search, count=1)
    tool.verify_many(fetch, count=1)

4. Error handling

Verify graceful degradation:

from tenro import Provider
from tenro.simulate import llm
from tenro.testing import tenro

@tenro
def test_handles_api_error():
    llm.simulate(
        provider=Provider.OPENAI,
        responses=[ConnectionError("API unavailable")],
        use_http=False,
    )

    result = agent.run("Hello")

    assert "sorry" in result.lower()  # Graceful fallback

Structuring your test suite

tests/
├── unit/                    # Pure logic tests
│   ├── test_parsers.py
│   └── test_validators.py
├── integration/             # Tenro simulation tests
│   ├── test_agents.py
│   ├── test_workflows.py
│   └── test_error_handling.py
└── e2e/                     # Real API tests (sparse)
    └── test_smoke.py

Run different test tiers:

# Fast feedback (unit + integration)
uv run pytest tests/unit tests/integration

# Full suite including E2E
uv run pytest --run-e2e
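
One way to wire up the `--run-e2e` flag is a small `conftest.py` hook. This is a sketch, not part of Tenro itself; the flag and the `e2e` marker name match the examples above, but adapt them to your project.

```python
# conftest.py -- skip @pytest.mark.e2e tests unless --run-e2e is passed.
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--run-e2e", action="store_true", default=False,
        help="also run tests marked e2e (they hit real provider APIs)",
    )

def pytest_configure(config):
    # Register the marker so pytest doesn't warn about it being unknown.
    config.addinivalue_line("markers", "e2e: test calls real provider APIs")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-e2e"):
        return  # E2E explicitly requested: run the full suite
    skip_e2e = pytest.mark.skip(reason="needs --run-e2e")
    for item in items:
        if "e2e" in item.keywords:
            item.add_marker(skip_e2e)
```

With this in place, `pytest` alone runs only the fast tiers, and `pytest --run-e2e` includes the smoke tests.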

Best practices

Do

  • Test observable behaviour (outputs, tool calls)
  • Use descriptive test names explaining the scenario
  • Keep simulations minimal (only what the test needs)
  • Test error paths, not just happy paths

Don't

  • Test internal implementation details
  • Over-simulate (simulate things that don't matter)
  • Write flaky tests that depend on timing
  • Skip error handling tests
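
The first "Do" and the first "Don't" can be illustrated with a plain pytest-style pair. The `slugify` helper here is hypothetical, chosen only because any pure function makes the contrast clear:

```python
# Hypothetical helper used to illustrate observable-behaviour testing.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())

# Do: assert on observable output. This test survives any internal
# refactor of slugify as long as the output stays the same.
def test_slugify_output():
    assert slugify("Hello World") == "hello-world"

# Don't: assert on implementation details (e.g. which string methods ran,
# or private intermediate state). Such a test breaks the moment the
# function is rewritten, even when its behaviour is unchanged.
```

The same principle applies to agents: assert on final outputs and on the tool calls Tenro records, not on how the agent arrived at them.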

Next steps