Testing strategy¶
How to structure your AI agent test suite using Tenro alongside other testing approaches.
The testing pyramid for AI agents¶
┌───────────────────┐
│   End-to-End      │  Few, slow, expensive
│   (Real APIs)     │  Smoke tests only
├───────────────────┤
│   Integration     │  More tests
│   (Tenro)         │  Fast, free, reliable
├───────────────────┤
│   Unit Tests      │  Many tests
│   (Pure logic)    │  Instant, no dependencies
└───────────────────┘
When to use each approach¶
Unit tests (no simulation needed)¶
Test pure logic that doesn't involve LLMs or external tools:
def test_parse_response():
    """Test response parsing logic."""
    raw = '{"name": "Alice", "age": 30}'
    result = parse_user_response(raw)
    assert result.name == "Alice"
Use for: parsers, formatters, validators, data transformations.
Integration tests (Tenro simulation)¶
Test agent behaviour with simulated dependencies:
from tenro import link_tool, Provider, ToolCall
from tenro.simulate import llm, tool
from tenro.testing import tenro
@link_tool("search")
def search(query: str) -> list[str]:
return api.search(query)
@tenro
def test_agent_workflow():
"""Test the agent calls tools and produces output."""
tool.simulate(search, result=["doc1", "doc2"])
llm.simulate(Provider.OPENAI, responses=[
ToolCall(search, query="AI trends"),
"Summary: AI is advancing rapidly.",
])
result = research_agent.run("AI trends")
tool.verify_many(search, count=1)
assert "Summary" in result
Use for: agent workflows, tool calling, multi-turn conversations, error handling.
End-to-end tests (real APIs)¶
Smoke tests with real providers (run sparingly):
@pytest.mark.e2e
def test_real_api_smoke():
    """Verify real API integration works."""
    result = research_agent.run("Hello")
    assert result is not None
Use for: deployment verification, API compatibility checks.
What to test with Tenro¶
1. Tool calling¶
Verify your agent calls the right tools with correct arguments:
from tenro import link_tool, Provider, ToolCall
from tenro.simulate import llm, tool
from tenro.testing import tenro
@link_tool("get_weather")
def get_weather(city: str) -> dict:
return weather_api.fetch(city)
@tenro
def test_calls_correct_tool():
# Simulate tool result
tool.simulate(get_weather, result={"temp": 72})
# LLM decides to call tool, then responds with final answer
llm.simulate(Provider.OPENAI, responses=[
ToolCall(get_weather, city="Paris"),
"The weather in Paris is 72°F.",
])
weather_agent.run("What's the weather in Paris?")
tool.verify(get_weather, city="Paris")
2. Response handling¶
Verify your agent processes LLM responses correctly:
from tenro import Provider
from tenro.simulate import llm
from tenro.testing import tenro


@tenro
def test_handles_refusal():
    llm.simulate(
        provider=Provider.OPENAI,
        response="I cannot help with that request.",
    )

    result = agent.run("Do something harmful")

    assert "cannot help" in result
3. Multi-step workflows¶
Verify complex agent sequences where the LLM decides which tools to call:
from tenro import link_tool, Provider, ToolCall
from tenro.simulate import llm, tool
from tenro.testing import tenro
@link_tool("search")
def search(query: str) -> list[str]:
return api.search(query)
@link_tool("fetch")
def fetch(doc_id: str) -> str:
return api.fetch(doc_id)
@tenro
def test_research_workflow():
# Simulate tool results
tool.simulate(search, result=["doc1"])
tool.simulate(fetch, result="Document content...")
# LLM decides to search, then fetch, then summarize
llm.simulate(Provider.OPENAI, responses=[
ToolCall(search, query="AI papers"),
ToolCall(fetch, doc_id="doc1"),
"Summary of the document.",
])
research_agent.run("Summarize AI papers")
# Verify sequence
tool.verify_many(search, count=1)
tool.verify_many(fetch, count=1)
4. Error handling¶
Verify graceful degradation:
from tenro import Provider
from tenro.simulate import llm
from tenro.testing import tenro


@tenro
def test_handles_api_error():
    llm.simulate(
        provider=Provider.OPENAI,
        responses=[ConnectionError("API unavailable")],
        use_http=False,
    )

    result = agent.run("Hello")

    assert "sorry" in result.lower()  # Graceful fallback
Structuring your test suite¶
tests/
├── unit/ # Pure logic tests
│ ├── test_parsers.py
│ └── test_validators.py
├── integration/ # Tenro simulation tests
│ ├── test_agents.py
│ ├── test_workflows.py
│ └── test_error_handling.py
└── e2e/ # Real API tests (sparse)
└── test_smoke.py
Run different test tiers:
# Fast feedback (unit + integration)
uv run pytest tests/unit tests/integration
# Full suite including E2E
uv run pytest --run-e2e
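The `--run-e2e` flag is not built into pytest, so the suite needs to register it. One way to wire it is a `conftest.py` at the project root; the hook implementations below are a sketch of a common pattern, not something Tenro provides:

```python
# conftest.py — sketch of wiring the --run-e2e flag and the e2e marker.
import pytest


def pytest_addoption(parser):
    # Register the opt-in flag for end-to-end tests.
    parser.addoption(
        "--run-e2e",
        action="store_true",
        default=False,
        help="Run tests marked @pytest.mark.e2e against real APIs",
    )


def pytest_configure(config):
    # Register the marker so pytest does not warn about unknown markers.
    config.addinivalue_line("markers", "e2e: test against real provider APIs")


def pytest_collection_modifyitems(config, items):
    # Without the flag, e2e tests are collected but skipped.
    if config.getoption("--run-e2e"):
        return
    skip_e2e = pytest.mark.skip(reason="need --run-e2e option to run")
    for item in items:
        if "e2e" in item.keywords:
            item.add_marker(skip_e2e)
```

With this in place, `uv run pytest` skips the `e2e`-marked smoke tests by default, and `uv run pytest --run-e2e` includes them.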
Best practices¶
Do¶
- Test observable behaviour (outputs, tool calls)
- Use descriptive test names explaining the scenario
- Keep simulations minimal (only what the test needs)
- Test error paths, not just happy paths
Don't¶
- Test internal implementation details
- Over-simulate (simulate things that don't matter)
- Write flaky tests that depend on timing
- Skip error handling tests
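To make the first rule (and its mirror in the Don't list) concrete, here is a plain-pytest sketch. `GreeterAgent` and its `_last_prompt` attribute are made up purely for illustration and are not part of Tenro:

```python
# A made-up, minimal "agent" used only to illustrate the point.
class GreeterAgent:
    def run(self, name: str) -> str:
        self._last_prompt = f"Greet {name}"  # internal detail
        return f"Hello, {name}!"


def test_greets_user():
    # Good: asserts on observable output. Survives refactors
    # as long as behaviour is unchanged.
    agent = GreeterAgent()
    assert agent.run("Alice") == "Hello, Alice!"


def test_prompt_wording():
    # Brittle: asserts on a private attribute. Breaks when the
    # internal prompt is reworded, even though behaviour is identical.
    agent = GreeterAgent()
    agent.run("Alice")
    assert agent._last_prompt == "Greet Alice"
```

The second test passes today but couples the suite to implementation details; prefer the first style, plus tool-call verification for agents with side effects.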
Next steps¶
- Testing patterns: Common patterns for simulation and verification
- Examples: Real-world test examples