Orphan-span sweep
What happens if Hermes crashes mid-turn, or the host is power-cycled, or the session-end hook never fires for some other reason?
Without cleanup, the root session.* span would stay open forever in the plugin's in-memory state, and in the backend UI you'd see a trace that starts but never ends. That's bad both operationally (active-span state leak) and visually (every orphaned turn is a "what's still running??" red herring).
The orphan sweep is a simple TTL-based cleanup that runs at the top of every pre_* hook.
How it works
The SpanTracker keeps a dict of open root spans, each with its start timestamp. On every pre_* hook (there's always at least one per turn), the tracker scans the dict:
```python
for session_id, span_info in list(self._roots.items()):
    age_ms = now_ms() - span_info.started_at_ms
    if age_ms > self.root_span_ttl_ms:
        # Finalize this orphan: mark it timed out, end it, and drop it
        # from the tracker so it can't be swept twice.
        span_info.span.set_attribute("hermes.turn.final_status", "timed_out")
        span_info.span.set_status(Status(StatusCode.OK))  # OK, not ERROR; from opentelemetry.trace
        span_info.span.end()
        del self._roots[session_id]
```
Any root older than root_span_ttl_ms (default: 10 minutes) gets:
- the `hermes.turn.final_status = "timed_out"` attribute
- `StatusCode.OK` (not `ERROR`; timeouts shouldn't pollute error dashboards)
- `span.end()` called, which enqueues the span for export
The sweep runs on the hot path but is O(n) in the number of currently-open sessions, which is typically ~1.
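The mechanics above can be sketched end to end. This is an illustrative reconstruction, not the plugin's actual source: `SpanTracker`, `_RootInfo`, `open_root`, and `sweep` are assumed names, and the span is only required to expose `set_attribute()`/`end()` as OpenTelemetry spans do. In the real plugin, `sweep()` would be invoked at the top of every `pre_*` hook.

```python
import time
from dataclasses import dataclass

def now_ms() -> int:
    # Monotonic clock so a wall-clock jump can't mass-expire spans.
    return int(time.monotonic() * 1000)

@dataclass
class _RootInfo:
    span: object          # an opentelemetry Span in the real plugin
    started_at_ms: int

class SpanTracker:
    def __init__(self, root_span_ttl_ms: int = 600_000):
        self.root_span_ttl_ms = root_span_ttl_ms
        self._roots: dict[str, _RootInfo] = {}

    def open_root(self, session_id: str, span) -> None:
        self._roots[session_id] = _RootInfo(span, now_ms())

    def sweep(self) -> int:
        """Finalize every root older than the TTL; return the count swept."""
        swept = 0
        now = now_ms()
        for session_id, info in list(self._roots.items()):
            if now - info.started_at_ms > self.root_span_ttl_ms:
                info.span.set_attribute("hermes.turn.final_status", "timed_out")
                info.span.end()
                del self._roots[session_id]
                swept += 1
        return swept
```

Iterating over `list(self._roots.items())` is what makes the in-loop `del` safe; the scan stays O(n) in open sessions, as noted above.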
Configuration
```yaml
# config.yaml
root_span_ttl_ms: 600000  # 10 minutes (default)
```
Or:
```bash
export HERMES_OTEL_ROOT_SPAN_TTL_MS=300000  # 5 minutes
```
Pick a TTL longer than your longest reasonable turn. A cron job that walks a large codebase might legitimately take 20+ minutes; set the TTL to ~1 hour if your agents have long-running turns.
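A minimal sketch of how the two settings above could be resolved, assuming the environment variable takes precedence over `config.yaml` (the setting names mirror the examples above; the plugin's real config loader may differ):

```python
import os

DEFAULT_ROOT_SPAN_TTL_MS = 600_000  # 10 minutes

def resolve_root_span_ttl_ms(config: dict) -> int:
    # Env var wins over the yaml key; fall back to the built-in default.
    env = os.environ.get("HERMES_OTEL_ROOT_SPAN_TTL_MS")
    if env is not None:
        return int(env)
    return int(config.get("root_span_ttl_ms", DEFAULT_ROOT_SPAN_TTL_MS))
```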
What it's NOT
- Not a liveness check. The sweep only triggers when another turn fires a hook. If Hermes is completely idle, a stale span stays open until the next user interaction wakes the sweeper.
- Not a heartbeat. There's no background thread polling for expiry. This keeps the plugin simple, and the latency is bounded by the next-hook cadence, which is at most a few seconds under real load.
- Not a replacement for crash handling. Buffered spans in the `BatchSpanProcessor` queue are lost on a hard crash (SIGKILL, OOM), because `atexit` doesn't run. After a restart, the new process's sweep finalizes the previous process's orphaned root once a hook fires, but any spans that were sitting in the old process's queue are already gone.
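A tiny experiment illustrating the crash caveat above. The child's `atexit` handler stands in for the span flush (all names here are illustrative): it runs on a clean exit but never runs when the process is hard-killed, so anything held only in memory is lost.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

CHILD = textwrap.dedent("""
    import atexit, sys
    path = sys.argv[1]
    # Stand-in for flushing buffered spans: create a marker file on exit.
    atexit.register(lambda: open(path, "w").close())
    print("ready", flush=True)
    sys.stdin.readline()   # block until the parent closes our stdin
""")

def atexit_ran(hard_kill: bool) -> bool:
    """Spawn a child with an atexit handler; return True if the handler ran."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "out")
        with subprocess.Popen(
            [sys.executable, "-c", CHILD, path],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
        ) as p:
            p.stdout.readline()    # wait until the child is up
            if hard_kill:
                p.kill()           # SIGKILL-style termination: atexit never runs
            else:
                p.stdin.close()    # child sees EOF, exits normally, atexit runs
            p.wait()
        return os.path.exists(path)
```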
Interaction with graceful shutdown
On graceful shutdown (SIGTERM, `hermes gateway stop`), the `atexit` handler:

- Calls `SpanTracker.end_all()`, which finalizes any still-open roots with `final_status = "incomplete"`.
- Calls `force_flush()` on every `BatchSpanProcessor` so buffered spans get exported.
The orphan sweep isn't involved in graceful shutdown — it only runs on the hook hot path.
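The shutdown path above can be sketched as follows. This is a hedged reconstruction: `install_shutdown_hook` and the `_roots`/`.span` shapes are assumed for illustration, while `set_attribute()`, `end()`, and `force_flush()` mirror the real OpenTelemetry span and processor methods.

```python
import atexit

def install_shutdown_hook(tracker, processors):
    """Register the graceful-shutdown path: finalize open roots, then flush."""
    def _shutdown():
        # Mirror of SpanTracker.end_all(): mark still-open roots incomplete.
        for session_id, info in list(tracker._roots.items()):
            info.span.set_attribute("hermes.turn.final_status", "incomplete")
            info.span.end()
            del tracker._roots[session_id]
        # Drain every batch processor's queue so buffered spans get exported.
        for proc in processors:
            proc.force_flush()
    atexit.register(_shutdown)
    return _shutdown  # returned so it can also be invoked directly in tests
```

Ordering matters: roots must be ended before the flush, or the just-finalized spans would still be sitting in the processor queue when the process exits.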
Seeing it in action
The simplest way to verify: start Hermes, begin a turn that does nothing (just sits at a prompt), hard-kill the process, restart Hermes, and start a new turn. A few seconds after the new turn's first hook fires, check the backend UI: the previous orphan should show up as `timed_out` with a duration close to the TTL.
If it's showing up as indefinitely open, either:

- The new turn hasn't fired a `pre_*` hook yet (wait a moment, do something).
- The TTL is too long (the orphan is still within the grace window).

When the sweep does fire, the debug log shows it:
```
[hermes-otel] swept 1 orphan root(s)
```