Testing
The suite is organised into four tiers, from fastest/simplest to slowest/most comprehensive. Start with the fastest tier that covers your change; add higher tiers if they're warranted.
Tier summary
| Tier | Marker | Count | What it tests | Requirements |
|---|---|---|---|---|
| Unit | (default) | 109 | Helper functions, SpanTracker, tracer init, hook callbacks | None |
| Integration | (default) | 19 | Full span pipeline with InMemorySpanExporter, hierarchy, token roll-up, metrics | None |
| E2E | -m e2e | 6 | OTLP export to real Phoenix/Langfuse, queried via GraphQL/REST | Docker |
| Smoke | -m smoke | 6 | Full pipeline — hermes API → plugin → Langfuse | Hermes gateway + Langfuse |
Default pytest runs unit + integration (no Docker, < 1 second). E2E and smoke are opt-in.
Running
# Unit + integration (default)
uv run --extra dev pytest
# All E2E
uv run --extra dev --extra e2e pytest -m e2e
# Single backend
uv run --extra dev --extra e2e pytest -m phoenix
uv run --extra dev --extra e2e pytest -m langfuse
# Smoke tests (full pipeline)
uv run --extra dev --extra e2e pytest -m smoke
# Coverage
uv run --extra dev pytest --cov=hermes_otel --cov-report=term-missing
CI fails if coverage falls below 85%.
Unit tests (tests/unit/)
Cover:
- Helper functions:
_safe_str,_to_int,_detect_session_kind SpanTrackerclass: span lifecycle, parent stack,end_allHermesOTelPlugin.init(): environment detection, backend priorityNoopSpangraceful degradation when OTel is unavailable- All 8 hook callbacks with mocked tracer (span names, attributes, metric recording, module-state management)
Fast, no external deps, deterministic. Run them on every save.
Integration tests (tests/integration/)
Use a real OTel SDK with InMemorySpanExporter — no network:
- Individual hook pairs produce correctly-attributed spans
- Parent-child nesting: Session → LLM → API → Tool (via span context)
- Full session lifecycle with token aggregation and session I/O roll-up
- Metric counters and histograms via
InMemoryMetricReader
Useful for regression-testing attribute conventions — if you change a span attribute name, an integration test will catch it.
E2E tests (tests/e2e/)
Invoke hooks directly against real backends:
- Phoenix — fires hooks, queries Phoenix GraphQL API at
/graphqlto verify spans - Langfuse — fires hooks, queries Langfuse REST API at
GET /api/public/observations
Docker fixtures start/stop the containers automatically. If the container is already running on the expected port, it's reused.
# Start containers manually if preferred
docker compose -f docker-compose/phoenix.yaml up -d
docker compose -f docker-compose/langfuse.yaml up -d
Pre-seeded Langfuse credentials:
- Public key:
lf_pk_test_hermes_otel - Secret key:
lf_sk_test_hermes_otel - OTEL endpoint:
http://localhost:3000/api/public/otel
Smoke tests (tests/smoke/)
Exercise the complete production pipeline:
OpenAI SDK --> hermes API server --> LLM --> OTEL plugin --> Langfuse
(port 8642) (hooks.py) (port 3000)
\ /
`--- pytest sends chat here pytest queries here ---`
Requirements:
-
Hermes Agent API server running with the OTEL plugin loaded. Add to
~/.hermes/.env:API_SERVER_ENABLED=trueStart the gateway:
hermes gateway -
Langfuse running with credentials configured in
~/.hermes/.env(OTEL_LANGFUSE_*variables).
Tests auto-skip with a helpful message if either service isn't reachable. The smoke tests poll the Langfuse observations API (up to 60–90s) to account for async trace ingestion.
Adding tests for a new backend
For a new OTLP-compatible backend:
- Unit: resolver function — env var precedence, header construction, URL normalization. Mock the env, assert on the resulting exporter config.
- Integration: attribute snapshot via
InMemorySpanExporter— spin up the plugin with the backend selected, fire a hook sequence, assert on the exported spans. - E2E (optional): if the backend has backend-specific attributes or a custom auth mechanism worth verifying against a real instance.
See tests/unit/test_backends.py and tests/integration/test_fanout.py for examples.
CI behavior
GitHub Actions runs unit + integration + ruff + black on every push to main and every PR. Coverage is posted as a workflow artifact. E2E and smoke tiers aren't in CI — they need Docker and maintainers run them locally before releases.