Troubleshooting
Symptom-first. Find what you're seeing, get the diagnosis and the fix. Every entry here maps to a failure mode observed in real deployments.
Agent turns fail with 502 and context deadline exceeded at ~120s
Cause: the managed-tool mediation budget. The whole mediated turn — every inference round plus every tool execution — shares one total_timeout_ms budget (default 120s). Slow reasoning models exhaust it mid-chain.
Fix: raise the budget in pod YAML and redeploy:
    x-claw:
      tool-policy-defaults:
        total-timeout-ms: 300000

    claw up -d

See Managed Tools § Budget limits. Check which model is slow with claw audit --json: latency_ms is on each response event.
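To spot the slow model, the latency_ms values can be aggregated per model. A minimal sketch, assuming claw audit --json emits one JSON object per line and that response events carry type and model fields next to latency_ms. Only latency_ms is confirmed above; the type and model field names, the line-delimited format, and the function name are assumptions to adjust against real output.

```python
import json
from collections import defaultdict


def latency_by_model(lines):
    """Return {model: (count, mean_ms, max_ms)} for response events.

    Assumes one JSON object per line with "type", "model" and
    "latency_ms" keys; "type" and "model" are assumed field names.
    """
    samples = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON event
        if not isinstance(event, dict) or event.get("type") != "response":
            continue
        latency = event.get("latency_ms")
        if isinstance(latency, (int, float)):
            samples[event.get("model", "unknown")].append(latency)
    return {
        model: (len(values), sum(values) / len(values), max(values))
        for model, values in samples.items()
    }
```

Feed it the lines of a saved audit dump (for example, an open file object) and sort the result by mean or max latency. A model whose max approaches the budget is the likely culprit.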
A feed is missing from agent context / feed_fetch errors at ~3s
Cause: the feed provider responded slower than the fetch timeout (default 3s). The agent sees a [Feed unavailable] notice (or stale cached content) where fresh data should be; the proxy logs the underlying failure.
Fix: raise the fetch timeout via cllama env:
    x-claw:
      cllama-defaults:
        env:
          CLLAMA_FEED_FETCH_TIMEOUT_MS: "10000"

Then fix the slow provider: the timeout knob buys headroom, but it doesn't make a synchronous upstream computation fast. Audit trail: claw audit shows feed_fetch errors and feed_injection events with a skipped (...) notice.
Managed tool calls are rejected with schema_validation errors
Cause: the model emitted arguments that violate the tool's declared inputSchema — most commonly a required field at the wrong nesting level. cllama rejects these before the service is called and tells the model exactly what's wrong (missing required property "x" at top level; found at "ctx.x").
Fix: this is working as intended — the model corrects itself in-round. If a valid call is being rejected, the descriptor's inputSchema doesn't match what the service actually accepts: fix the service's claw.describe descriptor. Emergency bypass: CLLAMA_TOOL_SCHEMA_VALIDATION: "off" in cllama env. Audit trail: managed_tool_schema_rejected interventions in claw audit; full arguments in the session-history tool_trace.
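The kind of check behind that rejection message can be sketched in a few lines. This is an illustration only, not cllama's validator: the function names are hypothetical, and it covers just the "required field at the wrong nesting level" case against a JSON-Schema-style required list.

```python
def find_nested(obj, key, path=""):
    """Return the dotted path of the first nested occurrence of key, or None."""
    if not isinstance(obj, dict):
        return None
    for name, value in obj.items():
        here = f"{path}.{name}" if path else name
        if isinstance(value, dict):
            if key in value:
                return f"{here}.{key}"
            found = find_nested(value, key, here)
            if found:
                return found
    return None


def check_required(args, schema):
    """Return one error string per required top-level property missing from args."""
    errors = []
    for key in schema.get("required", []):
        if key in args:
            continue
        message = f'missing required property "{key}" at top level'
        nested = find_nested(args, key)
        if nested:
            # Tell the caller where the misplaced field actually is.
            message += f'; found at "{nested}"'
        errors.append(message)
    return errors
```

Running a rejected call's arguments (from tool_trace) through a check like this against the descriptor's inputSchema shows quickly whether the model or the descriptor is wrong.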
Managed tool calls repeatedly rejected by the providing service (4xx)
Cause: the service validates more strictly than its descriptor declares, so schema validation passes but the provider refuses. The model retries by guessing and burns mediation rounds.
Fix: make the descriptor's inputSchema declare everything the service enforces (required, nesting, enums). The schema is the model's only contract — anything enforced but undeclared turns into guess-and-retry. Read the rejection bodies in tool_trace in session history.
Agent has no provider access / "credential starvation" preflight failures
Cause: provider API keys placed in the agent's environment: block. Agents must never hold provider keys — cllama does.
Fix: move keys to x-claw.cllama-env (service level) or x-claw.cllama-defaults.env (pod level):
    x-claw:
      cllama: passthrough
      cllama-env:
        ANTHROPIC_API_KEY: "${ANTHROPIC_API_KEY}"

Discord bot connected but never replies
Diagnosis first:
    claw compose exec <service> cat /root/.hermes/logs/gateway.log   # Hermes
    claw logs <service>                                              # any driver

Zero gateway entries after startup = connected but not receiving events.
Causes, in order of likelihood:
- MESSAGE CONTENT intent not enabled in the Discord developer portal — the bot sees mentions but no message text.
- Stale gateway session: run claw compose restart <service>.
- The bot requires a mention (mention_only is set by all drivers to prevent multi-agent loops) and the message didn't mention it.
Agents mention-looping each other in a multi-agent pod
Cause: a driver config that dropped require_mention, or a runner replying with an auto-mention (Hermes's reply feature pings the original author unless patched).
Fix: Clawdapus sets mention_only/requireMention and suppresses reply mentions in all built-in drivers — if you see loops, check for a hand-edited runner config overriding the generated one, and confirm your images are built from current runner bases (claw pull, then claw build).
claw ps / claw logs / claw health refuse to run
Symptom: "pod file is newer than compose.generated.yml".
Cause: you edited claw-pod.yml after the last compile. The generated compose file is the single source of truth, and stale state is fail-closed.
Fix: claw up -d to recompile. (claw down is exempt — you can always tear down.)
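The staleness rule can be reproduced outside claw, for example in a CI guard. A minimal sketch, assuming the comparison is based on file modification times, which the error message suggests but this page does not confirm; the function name is hypothetical.

```python
import os


def pod_is_stale(pod_path, generated_path):
    """True when the pod file was modified after the generated compose file.

    A missing generated file counts as stale, since nothing has been
    compiled yet. The mtime comparison is an assumption about how the
    check works.
    """
    if not os.path.exists(generated_path):
        return True
    return os.path.getmtime(pod_path) > os.path.getmtime(generated_path)
```

Calling pod_is_stale("claw-pod.yml", "compose.generated.yml") before scripted claw ps or claw logs calls lets a wrapper run claw up -d first instead of failing.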
cllama returns "missing API key for provider" 502s, but the key is configured
Cause: a pre-v0.2.2 cllama image silently loading a v2-format providers.json with empty key pools, or a stale container running an old image.
Fix: the four-verb refresh:
    claw pull
    claw up -d   # recreates the proxy container from the pulled image

Verify key state live:

    curl -N -H "Authorization: Bearer <ui_token>" http://<host>:8181/events

The initial payload has providers[name].maskedKey; an empty string means no active key loaded.
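Checking the payload by eye gets tedious with several providers. A minimal sketch for listing providers with no active key, assuming the endpoint streams server-sent events with JSON in data: lines (suggested by curl -N, not confirmed here) and that providers is a map of name to an object with maskedKey. Both function names are hypothetical.

```python
import json


def parse_sse_data(line):
    """Decode one 'data: {...}' line into a dict; return None for other lines."""
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):].strip())


def providers_without_key(payload):
    """Return sorted provider names whose maskedKey is missing or empty."""
    providers = payload.get("providers", {})
    return sorted(
        name
        for name, info in providers.items()
        if not (isinstance(info, dict) and info.get("maskedKey"))
    )
```

Pass the first decoded event to providers_without_key; any name it returns is a provider for which cllama has no usable key loaded.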
claw build fails closed asking for claw pull
Cause: the service image's runner base (openclaw:latest etc.) has no versioned sibling tag — usually because it was built with a manual docker build instead of claw pull.
Fix: run claw pull (refreshes runner aliases properly), then claw build.
claw audit quick reference
| Event type | Meaning |
|---|---|
| request / response | A proxied LLM call: agent, model, latency, tokens, cost |
| error | Upstream/provider failure (timeouts, 4xx/5xx from the LLM provider) |
| intervention | Governance event; see below |
| feed_fetch / feed_injection | Feed provider fetch results and what was injected into context |
| tool_call | Managed tool execution from session-history tool_trace |
| tool_manifest_loaded | Whether a compiled tool manifest reached the proxy for a request (manifest_present, tools_count) |
| channel_context_op | Channel-context feed and claw-wall tool activity |
| memory_op | Memory plane recall/retain operations |
| provider_pool | Provider key pool state changes (cooldowns, failover) |

Intervention reasons you may see:
| Intervention | Meaning |
|---|---|
| managed_tool_schema_rejected:<tool> | Tool call rejected pre-dispatch for schema violations |
| duplicate_managed_tool_call:<tool> | Same tool + args repeated in a turn; the earlier cached result is replayed (default replay), or the legacy data-less 409 is returned under CLLAMA_MANAGED_DUPLICATE_POLICY=reject |
| duplicate_managed_tool_call_finalization:<tool> | After CLLAMA_MANAGED_DUPLICATE_STREAK_CUTOFF consecutive identical duplicate calls (default 3), cllama disabled tools and forced a final answer before the round budget ran out |
| mixed_tool_order_internal_retry | Model mixed native-first/managed-later tool order; cllama replanned internally |
| managed_prefix_native_suffix_serialized | Managed prefix executed internally before a runner-native suffix in one response |
| managed_tool_budget_finalization | Budget exhausted; cllama forced a final text turn instead of returning empty |
| bare_model_normalized | Runner sent a bare model name; the proxy normalized it to the agent's declared slot |

In JSON output (claw audit --json), tool_manifest_loaded events carry manifest_present and tools_count; use them to confirm a compiled tool manifest actually reached cllama for an agent.
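That confirmation can be scripted. A minimal sketch, assuming one JSON object per line and that each event carries type and agent fields; manifest_present and tools_count come from the text above, while type, agent, and the function name are assumptions.

```python
import json


def agents_missing_manifest(lines):
    """Return sorted agents that have a tool_manifest_loaded event with no usable manifest.

    "Usable" here means manifest_present is true and tools_count is non-zero.
    """
    missing = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON event
        if not isinstance(event, dict) or event.get("type") != "tool_manifest_loaded":
            continue
        if not event.get("manifest_present") or not event.get("tools_count"):
            missing.add(event.get("agent", "unknown"))
    return sorted(missing)
```

An agent that declares no managed tools will also show up here, so compare the result against the agents you expect to have tools.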
Still stuck?
- claw doctor: environment sanity checks
- claw inspect <service>: what was actually compiled for a service
- Open an issue with your claw doctor output, driver type, and the relevant pod YAML snippet
