Testing strategy

How to structure your AI agent test suite using Tenro alongside other testing approaches.

The testing pyramid for AI agents

                    ┌───────────────────┐
                    │   End-to-End      │  Few, slow, expensive
                    │   (Real APIs)     │  Smoke tests only
                    ├───────────────────┤
                    │   Integration     │  More tests
                    │   (Tenro)         │  Fast, free, reliable
                    ├───────────────────┤
                    │   Unit Tests      │  Many tests
                    │   (Pure logic)    │  Instant, no dependencies
                    └───────────────────┘

When to use each approach

Unit tests (no simulation needed)

Test pure logic that doesn't involve LLMs or external tools:

def test_parse_response():
    """Test response parsing logic."""
    raw = '{"name": "Alice", "age": 30}'
    result = parse_user_response(raw)
    assert result.name == "Alice"

Use for: parsers, formatters, validators, data transformations.
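
For context, a minimal implementation that would satisfy this test might look like the sketch below. The `User` dataclass shape is an assumption; the real `parse_user_response` is not shown in this guide.

```python
import json
from dataclasses import dataclass

# Hypothetical shape of the parsed result -- an assumption for illustration.
@dataclass
class User:
    name: str
    age: int

def parse_user_response(raw: str) -> User:
    """Parse a raw JSON user payload into a typed User object."""
    data = json.loads(raw)
    return User(name=data["name"], age=int(data["age"]))
```

Because the function is pure (no LLM, no network), the test above runs instantly with no simulation.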

Integration tests (Tenro simulation)

Test agent behaviour with simulated dependencies:

from tenro import link_tool, Provider, ToolCall
from tenro.simulate import llm, tool
from tenro.testing import tenro

@link_tool("search")
def search(query: str) -> list[str]:
    return api.search(query)

@tenro
def test_agent_workflow():
    """Test the agent calls tools and produces output."""
    tool.simulate(search, result=["doc1", "doc2"])
    llm.simulate(Provider.OPENAI, responses=[
        ToolCall(search, query="AI trends"),
        "Summary: AI is advancing rapidly.",
    ])

    result = research_agent.run("AI trends")

    tool.verify_many(search, count=1)
    assert "Summary" in result

Use for: agent workflows, tool calling, multi-turn conversations, error handling.

End-to-end tests (real APIs)

Smoke tests with real providers (run sparingly):

import pytest

@pytest.mark.e2e
def test_real_api_smoke():
    """Verify real API integration works."""
    result = research_agent.run("Hello")
    assert result is not None

Use for: deployment verification, API compatibility checks.

What to test with Tenro

1. Tool calling

Verify your agent calls the right tools with correct arguments:

from tenro import link_tool, Provider, ToolCall
from tenro.simulate import llm, tool
from tenro.testing import tenro

@link_tool("get_weather")
def get_weather(city: str) -> dict:
    return weather_api.fetch(city)

@tenro
def test_calls_correct_tool():
    # Simulate tool result
    tool.simulate(get_weather, result={"temp": 72})
    # LLM decides to call tool, then responds with final answer
    llm.simulate(Provider.OPENAI, responses=[
        ToolCall(get_weather, city="Paris"),
        "The weather in Paris is 72°F.",
    ])

    weather_agent.run("What's the weather in Paris?")

    tool.verify(get_weather, city="Paris")

2. Response handling

Verify your agent processes LLM responses correctly:

from tenro import Provider
from tenro.simulate import llm
from tenro.testing import tenro

@tenro
def test_handles_refusal():
    llm.simulate(
        provider=Provider.OPENAI,
        response="I cannot help with that request.",
    )

    result = agent.run("Do something harmful")

    assert "cannot help" in result

3. Multi-step workflows

Verify complex agent sequences where the LLM decides which tools to call:

from tenro import link_tool, Provider, ToolCall
from tenro.simulate import llm, tool
from tenro.testing import tenro

@link_tool("search")
def search(query: str) -> list[str]:
    return api.search(query)

@link_tool("fetch")
def fetch(doc_id: str) -> str:
    return api.fetch(doc_id)

@tenro
def test_research_workflow():
    # Simulate tool results
    tool.simulate(search, result=["doc1"])
    tool.simulate(fetch, result="Document content...")
    # LLM decides to search, then fetch, then summarize
    llm.simulate(Provider.OPENAI, responses=[
        ToolCall(search, query="AI papers"),
        ToolCall(fetch, doc_id="doc1"),
        "Summary of the document.",
    ])

    research_agent.run("Summarize AI papers")

    # Verify sequence
    tool.verify_many(search, count=1)
    tool.verify_many(fetch, count=1)

4. Error handling

Verify graceful degradation:

from tenro import Provider
from tenro.simulate import llm
from tenro.testing import tenro

@tenro
def test_handles_api_error():
    llm.simulate(
        provider=Provider.OPENAI,
        responses=[ConnectionError("API unavailable")],
        use_http=False,
    )

    result = agent.run("Hello")

    assert "sorry" in result.lower()  # Graceful fallback

Structuring your test suite

tests/
├── unit/                    # Pure logic tests
│   ├── test_parsers.py
│   └── test_validators.py
├── integration/             # Tenro simulation tests
│   ├── test_agents.py
│   ├── test_workflows.py
│   └── test_error_handling.py
└── e2e/                     # Real API tests (sparse)
    └── test_smoke.py

Run different test tiers:

# Fast feedback (unit + integration)
uv run pytest tests/unit tests/integration

# Full suite including E2E
uv run pytest --run-e2e
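
One way to wire up the `--run-e2e` flag is a small `conftest.py` hook. This is a sketch, not part of Tenro itself; the flag and the `e2e` marker name match the examples above, but adapt them to your project.

```python
# conftest.py -- skip @pytest.mark.e2e tests unless --run-e2e is passed.
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--run-e2e", action="store_true", default=False,
        help="also run tests marked e2e (they hit real provider APIs)",
    )

def pytest_configure(config):
    # Register the marker so pytest doesn't warn about it being unknown.
    config.addinivalue_line("markers", "e2e: test calls real provider APIs")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-e2e"):
        return  # E2E explicitly requested: run the full suite
    skip_e2e = pytest.mark.skip(reason="needs --run-e2e")
    for item in items:
        if "e2e" in item.keywords:
            item.add_marker(skip_e2e)
```

With this in place, `pytest` alone runs only the fast tiers, and `pytest --run-e2e` includes the smoke tests.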

Best practices

Do

  • Test observable behaviour (outputs, tool calls)
  • Use descriptive test names explaining the scenario
  • Keep simulations minimal (only what the test needs)
  • Test error paths, not just happy paths

Don't

  • Test internal implementation details
  • Over-simulate (simulate things that don't matter)
  • Write flaky tests that depend on timing
  • Skip error handling tests
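
The first "Do" and the first "Don't" can be illustrated with a plain pytest-style pair. The `slugify` helper here is hypothetical, chosen only because any pure function makes the contrast clear:

```python
# Hypothetical helper used to illustrate observable-behaviour testing.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())

# Do: assert on observable output. This test survives any internal
# refactor of slugify as long as the output stays the same.
def test_slugify_output():
    assert slugify("Hello World") == "hello-world"

# Don't: assert on implementation details (e.g. which string methods ran,
# or private intermediate state). Such a test breaks the moment the
# function is rewritten, even when its behaviour is unchanged.
```

The same principle applies to agents: assert on final outputs and on the tool calls Tenro records, not on how the agent arrived at them.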

Next steps