Hydra Article 8: Production Resilience — Making the Mesh Fail Loudly, Not Silently
The most dangerous failure mode in an autonomous system is not the one that crashes — it is the one that says "Ok" and does nothing.
This is the eighth article in the AI to Web3 series. We have built all seven layers of Hydra: LangGraph orchestration (Article 1), n8n execution (Article 2), RAG knowledge (Article 3), LangFuse observability (Article 4), fine-tuned specialist (Article 5), cost-aware swarm coordination (Article 6), and the Guardian security gate (Article 7).
Today we harden the system for production. Every silent failure becomes a visible log line. Every external dependency gets a retry or a graceful fallback. And we fix the seven Docker configuration issues that prevent LangFuse 3.x from starting on macOS.
Why this article exists
After wiring all seven layers and adding OpenRouter credits, the orchestrator ran — but not well. The output was:
decisions=6 risks=1 signals=1
Six decisions when there should be two. One risk that was actually an analyst node failure, surfaced nowhere. The oracle had silently returned nothing (n8n not running). The analyst had silently fallen back to the cloud (Ollama not running). The strategist had fired twice due to a LangGraph fan-in behaviour.
None of this was visible. The system had succeeded at the wrong thing.
The pattern: nodes must never fail silently
The canonical pattern for every Hydra node:
# Pattern applied across all nodes — commit 3f63f16
async def some_node(state: HydraState) -> HydraState:
    try:
        result = await external_dependency()
        log.info("some_node: success — %s items", len(result))
        return {"signals": [{"source": "some_node", "content": result}]}
    except SpecificError as e:
        log.warning("some_node: degraded — %s", e)
        return {"signals": [{"source": "some_node", "status": "degraded",
                             "content": f"Node unavailable ({e}). Strategist has no context from this source."}]}
    except Exception as e:
        log.error("some_node: failed — %s", e)
        return {"risks": [{"type": "node_error", "node": "some_node",
                           "error": str(e), "message": "..."}]}
Three rules:
- Never return empty on failure. Return a signal or risk that explains what happened.
- Log at the right level. Soft degradation is WARNING. Hard failure is ERROR. Success is INFO (not DEBUG — you want to see the count in production logs).
- Propagate to state. The strategist sees the node_error in risks. It can reason about incomplete information.
Oracle: n8n failures become explicit signals
# hydra/oracle.py — commit 3f63f16
async def oracle_node(state: HydraState) -> HydraState:
    raw_signals: list[dict] = []
    fetch_errors: list[str] = []
    async with httpx.AsyncClient() as client:
        for query in queries:  # query construction elided from this excerpt
            try:
                response = await client.get(N8N_RESEARCH_WEBHOOK, params={"q": query}, timeout=15.0)
                response.raise_for_status()
                raw_signals.extend(response.json().get("results", []))
            except httpx.HTTPStatusError as e:
                msg = f"n8n webhook {e.response.status_code} for query '{query}'"
                log.warning("oracle: %s", msg)
                fetch_errors.append(msg)
            except Exception as e:
                log.warning("oracle: n8n unreachable for '%s': %s", query, e)
                fetch_errors.append(str(e))
    if not raw_signals:
        log.warning("oracle: no raw signals (%d fetch errors)", len(fetch_errors))
        return {"signals": [{"source": "oracle", "status": "no_data",
                             "confidence": 0.0, "errors": fetch_errors,
                             "content": "Oracle returned no signals. Strategist should proceed with Sentinel data only."}]}
When n8n is down, the strategist now sees a no_data signal with confidence 0.0 rather than an empty signals list. It can weight this accordingly.
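As a sketch of how the consuming side can use that contract, a strategist could partition incoming signals by status and confidence before prompting. The names `filter_usable_signals` and `MIN_CONFIDENCE` below are illustrative, not part of hydra:

```python
MIN_CONFIDENCE = 0.2  # illustrative threshold, not the real strategist setting

def filter_usable_signals(signals: list[dict]) -> tuple[list[dict], list[str]]:
    """Split signals into usable inputs and the names of degraded sources."""
    usable: list[dict] = []
    degraded: list[str] = []
    for s in signals:
        low_confidence = s.get("confidence", 1.0) < MIN_CONFIDENCE
        if s.get("status") in ("no_data", "degraded") or low_confidence:
            degraded.append(s.get("source", "unknown"))
        else:
            usable.append(s)
    return usable, degraded
```

The strategist prompt can then name the degraded sources explicitly instead of reasoning over an unexplained gap.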
Guardian: LLM failure → escalate, not approve
The security posture choice matters:
# hydra/guardian.py — commit 3f63f16
try:
    assessment = await (GUARDIAN_PROMPT | llm).ainvoke({...})
    verdict = assessment.content.split("\n")[0].strip().upper()
except Exception as e:
    # LLM failure: escalate rather than silently approve (conservative)
    log.error("guardian: LLM assessment failed — escalating to human: %s", e)
    safe_decisions.append({
        **decision,
        "require_human_approval": True,
        "guardian_note": f"Guardian LLM failed ({e}). Escalated to human by default.",
    })
    continue
The wrong default is safe_decisions.append(decision) (silently approve if LLM is down). The right default is escalate to human — which routes the decision to n8n for manual review. Security gates should fail closed, not open.
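The fail-closed behaviour is easy to pin down with a stub. A minimal sketch, assuming a simplified `guardian_gate` and an LLM stub that always raises (both names are illustrative, not the hydra API):

```python
import asyncio

async def failing_llm(_prompt: str) -> str:
    # Stub: simulates any provider failure (timeout, rate limit, outage)
    raise RuntimeError("provider timeout")

async def guardian_gate(decision: dict, llm) -> dict:
    """Assess one decision; on any LLM failure, escalate to a human (fail closed)."""
    try:
        verdict = (await llm(decision["summary"])).split("\n")[0].strip().upper()
        return {**decision, "require_human_approval": verdict != "APPROVE"}
    except Exception as e:
        return {**decision, "require_human_approval": True,
                "guardian_note": f"Guardian LLM failed ({e}). Escalated to human by default."}

out = asyncio.run(guardian_gate({"summary": "swap 10 ETH"}, failing_llm))
```

Whatever the failure, the decision leaves the gate flagged for human review rather than silently approved.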
Executor: tenacity retry for transient n8n failures
# hydra/executor.py — commit d21bf0f
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

@retry(
    retry=retry_if_exception_type((httpx.TransportError, httpx.TimeoutException)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    reraise=True,
)
async def _post_to_n8n(client: httpx.AsyncClient, payload: dict) -> dict:
    response = await client.post(EXECUTOR_WEBHOOK_URL, json=payload, timeout=60.0)
    response.raise_for_status()
    return response.json()
Three attempts with exponential backoff (2s, then 4s, capped at 10s). Only retries on transient network errors — not on HTTP 400/500 responses (those are deterministic failures, not transient). On persistent failure, the error lands in messages with the decisions preserved in state for manual replay.
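For intuition, the same semantics can be written out in stdlib asyncio: up to three attempts, exponential backoff capped at a maximum, and the final error re-raised. This is an illustration of what tenacity configures declaratively, not the hydra code:

```python
import asyncio

async def retry_async(fn, attempts=3, base=2.0, cap=10.0, waits=None):
    """Up to `attempts` calls, exponential backoff capped at `cap`, final error re-raised."""
    for attempt in range(1, attempts + 1):
        try:
            return await fn()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise                       # mirrors tenacity's reraise=True
            delay = min(cap, base ** attempt)
            if waits is not None:
                waits.append(delay)         # record the schedule instead of sleeping
            else:
                await asyncio.sleep(delay)

calls = {"n": 0}

async def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return {"ok": True}

waits: list[float] = []
result = asyncio.run(retry_async(flaky, waits=waits))  # succeeds on the third attempt
```

Note that three attempts mean only two waits — 2s and 4s — before the call either succeeds or the last exception surfaces.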
Ingestor: three defensive fixes
# hydra/ingestor.py — commit d21bf0f
# 1. JSON-RPC error in body — HTTP 200 does not mean success
body = resp.json()
if "error" in body:
    raise RuntimeError(f"JSON-RPC error {body['error'].get('code')}: {body['error'].get('message')}")
return body.get("result")

# 2. Block field access — use .get() to avoid KeyError on partial responses
block_number = int(block.get("number", "0x0"), 16) if block else 0
block_timestamp = int(block.get("timestamp", "0x0"), 16) if block else 0

# 3. DB upsert — wrap so one bad row doesn't abort the loop
try:
    with engine.begin() as conn:
        conn.execute(text("INSERT INTO defi_knowledge ..."), {...})
except Exception as e:
    log.error("ingestor: DB upsert failed for hash %s (%s): %s",
              content_hash, metadata.get("protocol"), e)
Orchestrator: fan-in join node and safe_decisions channel
Two LangGraph state design issues required structural fixes.
Problem 1: strategist fires twice. In LangGraph, multiple incoming edges to a node create individual triggers — not a barrier. When both oracle → strategist and analyst → strategist exist, strategist fires once per completing predecessor. The fix is an explicit no-op join node:
# hydra/orchestrator.py — commit d879df2
async def _join(_state: HydraState) -> dict:
    """Force fan-in: wait for both oracle and analyst before strategist fires."""
    return {}

graph.add_node("join", _join)
graph.add_edge("analyst", "join")
graph.add_edge("oracle", "join")
graph.add_edge("join", "strategist")  # single edge in — fires once
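The barrier the join node provides is the same one asyncio.gather gives natively. A stdlib sketch with stand-in nodes makes the "fires once" property concrete (the node bodies are illustrative placeholders, not hydra code):

```python
import asyncio

fired = {"strategist": 0}

async def oracle() -> dict:
    return {"signals": ["oracle-signal"]}

async def analyst() -> dict:
    return {"signals": ["analyst-signal"]}

async def strategist(signals: list) -> dict:
    fired["strategist"] += 1
    return {"decisions": [f"decision from {len(signals)} signals"]}

async def cycle() -> dict:
    # gather is the barrier: BOTH predecessors finish before strategist runs
    results = await asyncio.gather(oracle(), analyst())
    signals = [s for r in results for s in r["signals"]]
    return await strategist(signals)

out = asyncio.run(cycle())
```

Without the barrier, each predecessor completing would trigger a strategist run of its own — exactly the double-fire observed in the graph.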
Problem 2: guardian doubles decisions. The decisions channel uses operator.add so multiple writers can contribute. If guardian reads the accumulated decisions and writes them back as safe_decisions into the same channel, the count doubles. The fix is a separate safe_decisions plain list — owned solely by guardian, read solely by executor:
class HydraState(TypedDict):
    signals: Annotated[list[dict], operator.add]    # parallel writers
    risks: Annotated[list[dict], operator.add]      # parallel writers
    decisions: Annotated[list[dict], operator.add]  # strategist adds
    messages: Annotated[list, operator.add]         # any node adds
    safe_decisions: list[dict]                      # guardian owns; executor reads
    human_approved: bool
    threats: list[dict]
    trace_id: str
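The doubling is a direct consequence of the operator.add reducer: every write to an accumulating channel concatenates. A minimal sketch of the reducer semantics outside LangGraph shows why writing the same list back yields four decisions:

```python
import operator

channel: list[dict] = []

def write(update: list[dict]) -> None:
    """What a channel annotated with operator.add does on each node write."""
    global channel
    channel = operator.add(channel, update)  # list concatenation, never replacement

strategist_out = [{"action": "rebalance"}, {"action": "hedge"}]
write(strategist_out)  # strategist contributes 2 decisions
write(strategist_out)  # guardian writing them back doubles the channel to 4
```

A plain (non-annotated) channel is replaced on write instead of concatenated, which is why moving guardian's output to safe_decisions fixes the count.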
Structured logging from the first line
# hydra/orchestrator.py — commit d879df2
def _configure_logging() -> None:
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S",
        stream=sys.stdout,
    )
    logging.getLogger("httpx").setLevel(logging.WARNING)
    logging.getLogger("httpcore").setLevel(logging.WARNING)
    logging.getLogger("llama_index").setLevel(logging.WARNING)
The result — every node failure is visible, every degraded path is named:
2026-04-14T16:03:37 [WARNING] hydra.oracle: oracle: n8n webhook 404 for query 'Uniswap v3 ...'
2026-04-14T16:03:37 [WARNING] hydra.oracle: oracle: no raw signals collected (3 fetch errors)
2026-04-14T16:03:37 [WARNING] hydra.analyst: analyst: local Ollama model unavailable — falling back to cloud
2026-04-14T16:04:36 [INFO] __main__: hydra: cycle complete — pending=2 approved=2 risks=1 signals=1
2026-04-14T16:04:36 [WARNING] __main__: hydra: 1 node(s) reported errors:
2026-04-14T16:04:36 [WARNING] __main__: [analyst] Analyst unavailable — strategist proceeded without domain analysis.
Exit code 0 with full visibility into what degraded. Exit code 1 on fatal errors.
The self-hosted LangFuse stack on macOS
LangFuse 3.x introduced ClickHouse as a required dependency. Getting it running on macOS Docker Desktop requires resolving seven distinct issues that surface one after another.
Issue 1: CLICKHOUSE_MIGRATION_URL is not configured
LangFuse 3.x added a separate env var for its Go-based migration runner. It is distinct from CLICKHOUSE_URL (HTTP, used at runtime) and must use the native TCP scheme:
# docker-compose.yml — commit c221d89
langfuse-server:
  environment:
    CLICKHOUSE_URL: http://clickhouse:8123  # HTTP — runtime queries
    CLICKHOUSE_MIGRATION_URL: clickhouse://default:hydra@clickhouse:9000/default  # TCP — Go migration runner
The error unknown driver http confirms the scheme mismatch: the Go ClickHouse driver registers as clickhouse://, not http://.
Issue 2: get_mempolicy — Operation not permitted
ClickHouse calls get_mempolicy to detect NUMA memory topology. Docker Desktop for macOS blocks this syscall via its default seccomp profile:
clickhouse:
  security_opt:
    - seccomp:unconfined  # allows get_mempolicy on macOS Docker Desktop
Safe for local development. In production on Linux, this is not needed.
Issue 3: ClickHouse binds IPv6 only
ClickHouse's default config tries [::] on all ports. Docker Desktop for macOS disables IPv6 inside containers, so nothing binds:
<!-- docker/clickhouse/listen.xml — commit c221d89 -->
<clickhouse>
    <listen_host>0.0.0.0</listen_host>
</clickhouse>
Mount this into /etc/clickhouse-server/config.d/listen.xml.
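In compose terms, that mount might look like this (the host path is assumed from the docker/clickhouse layout used in these excerpts):

```yaml
# docker-compose.yml — host path assumed from the docker/clickhouse layout above
clickhouse:
  volumes:
    - ./docker/clickhouse/listen.xml:/etc/clickhouse-server/config.d/listen.xml:ro
```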
Issue 4: Healthcheck uses localhost → IPv6
Alpine Linux resolves localhost to ::1 (IPv6). Since ClickHouse now only binds IPv4, wget localhost:8123/ping fails with "connection refused" even though wget 127.0.0.1:8123/ping succeeds:
healthcheck:
  # localhost resolves to ::1 in alpine; use 127.0.0.1 explicitly
  test: ["CMD-SHELL", "wget -qO- http://127.0.0.1:8123/ping | grep -q 'Ok.'"]
Issue 5: ReplicatedMergeTree requires Zookeeper
LangFuse's ClickHouse migrations create ReplicatedMergeTree tables, which require a coordination service. For a single-node setup, use ClickHouse's built-in Keeper:
<!-- docker/clickhouse/keeper.xml — commit c221d89 -->
<clickhouse>
    <keeper_server>
        <tcp_port>2181</tcp_port>
        <server_id>1</server_id>
        <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
        <coordination_settings>
            <operation_timeout_ms>10000</operation_timeout_ms>
            <session_timeout_ms>30000</session_timeout_ms>
            <raft_logs_level>warning</raft_logs_level>
        </coordination_settings>
        <raft_configuration>
            <server><id>1</id><hostname>localhost</hostname><port>9234</port></server>
        </raft_configuration>
    </keeper_server>
    <zookeeper>
        <node><host>127.0.0.1</host><port>2181</port></node>
    </zookeeper>
    <macros><shard>01</shard><replica>01</replica></macros>
</clickhouse>
Issue 6: CLICKHOUSE_PASSWORD must be non-empty
LangFuse 3.x validates all ClickHouse credentials at startup and rejects empty strings. Set a password in both places:
<!-- docker/clickhouse/users.xml — commit c221d89 -->
<clickhouse>
    <users>
        <default>
            <password>hydra</password>
            <networks><ip>::/0</ip></networks>
        </default>
    </users>
</clickhouse>
langfuse-server:
  environment:
    CLICKHOUSE_PASSWORD: hydra  # must match users.xml
Issue 7: LANGFUSE_S3_EVENT_UPLOAD_BUCKET required
LangFuse 3.x uses blob storage for event data. There is no flag to disable it — you must provide an S3-compatible endpoint. MinIO is the standard self-hosted answer:
# docker-compose.yml — commit c221d89
minio:
  image: minio/minio:latest
  command: server /data --console-address ":9001"
  environment:
    MINIO_ROOT_USER: minioadmin
    MINIO_ROOT_PASSWORD: minioadmin
  healthcheck:
    test: ["CMD-SHELL", "curl -sf http://127.0.0.1:9000/minio/health/live"]
    interval: 10s

minio-create-bucket:
  image: minio/mc:latest
  depends_on:
    minio: { condition: service_healthy }
  entrypoint: >
    sh -c "mc alias set local http://minio:9000 minioadmin minioadmin &&
           mc mb --ignore-existing local/langfuse-events"

langfuse-server:
  depends_on:
    minio: { condition: service_healthy }
  environment:
    LANGFUSE_S3_EVENT_UPLOAD_BUCKET: langfuse-events
    LANGFUSE_S3_EVENT_UPLOAD_ACCESS_KEY_ID: minioadmin
    LANGFUSE_S3_EVENT_UPLOAD_SECRET_ACCESS_KEY: minioadmin
    LANGFUSE_S3_EVENT_UPLOAD_ENDPOINT: http://minio:9000
    LANGFUSE_S3_EVENT_UPLOAD_REGION: us-east-1
    LANGFUSE_S3_EVENT_UPLOAD_FORCE_PATH_STYLE: "true"
After resolving all seven issues, the stack starts cleanly:
$ curl http://localhost:3000/api/public/health
{"status":"OK","version":"3.167.4"}
What the full stack looks like running
$ docker compose ps
hydra-clickhouse-1 Up (healthy) 0.0.0.0:8123->8123/tcp
hydra-langfuse-1 Up 0.0.0.0:3000->3000/tcp
hydra-minio-1 Up (healthy) 0.0.0.0:9001->9001/tcp
hydra-n8n-1 Up 0.0.0.0:5678->5678/tcp
hydra-postgres-1 Up (healthy) 0.0.0.0:5433->5432/tcp
$ python -m hydra.orchestrator
2026-04-14T16:03:34 [INFO] __main__: hydra: starting orchestration cycle
2026-04-14T16:03:37 [INFO] hydra.sentinel: sentinel: retrieved 0 signal(s) across 5 queries
2026-04-14T16:03:37 [WARNING] hydra.oracle: oracle: no raw signals collected (3 fetch errors)
2026-04-14T16:03:37 [WARNING] hydra.analyst: analyst: local Ollama unavailable — falling back to cloud
2026-04-14T16:04:36 [INFO] hydra.strategist: strategist: produced 1 decision(s)
2026-04-14T16:04:36 [INFO] __main__: hydra: cycle complete — pending=2 approved=2 risks=1 signals=1
Every degraded node is named. The system continues. The operator knows what to fix.
Access the running services
| Service | URL | Notes |
|---|---|---|
| LangFuse | http://localhost:3000 | Create account on first visit; generate API keys under Project Settings |
| n8n | http://localhost:5678 | Import n8n/hydra-executor.workflow.json |
| MinIO Console | http://localhost:9001 | minioadmin / minioadmin |
| Postgres | localhost:5433 | hydra / hydra / hydra db |
| ClickHouse | http://localhost:8123 | default / hydra |
Wire the LangFuse self-hosted keys into .env:
LANGFUSE_BASE_URL=http://localhost:3000
LANGFUSE_PUBLIC_KEY=pk-lf-... # from Project Settings
LANGFUSE_SECRET_KEY=sk-lf-...
Then run python -m hydra.orchestrator — every agent cycle produces a full trace in LangFuse showing which nodes degraded, what the strategist reasoned over, and what the guardian approved.
Framework assessment: should Hydra migrate?
Now that all seven layers are running, it is worth asking: does a higher-level framework eliminate this boilerplate?
LangGraph (already used) remains the right tool. It is rated S-tier for stateful, production-grade systems with explicit control flow. The Guardian's security gate requires the explicit graph — you cannot express "block transactions that fail simulation, unless LLM is unavailable, in which case escalate" cleanly in role-based frameworks like CrewAI.
Agno (formerly Phidata) would replace the Sentinel's LlamaIndex + pgvector plumbing with its native Knowledge abstraction — roughly 60 lines to 6. Not compatible with LangGraph's state machine model.
PydanticAI would improve the analyst.py output schema validation — replacing the JSON parser with strict typed output. Compatible with LangGraph, worth adding in a future article.
The pattern worth extracting as a reusable package: router.py's cost-aware complexity routing. It is generic — any multi-agent system on OpenRouter benefits from a complexity classifier that routes to the cheapest capable model. When there is a second use case in the workspace, that becomes @yannvr/agent-router.
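As a sketch of the idea, the classifier can be as simple as a scored heuristic that picks the cheapest plausible tier. The model names and thresholds below are illustrative, not the actual router.py values:

```python
# Illustrative model tiers — not the actual router.py configuration
CHEAP, MID, FRONTIER = "llama-3.1-8b", "qwen-2.5-72b", "claude-sonnet"

def route(task: str) -> str:
    """Score task complexity crudely and pick the cheapest plausible model tier."""
    text = task.lower()
    score = 0
    score += 2 if len(task) > 500 else 0                                 # long context
    score += 2 if any(k in text for k in ("simulate", "audit", "multi-step")) else 0
    score += 1 if "code" in text else 0
    if score >= 4:
        return FRONTIER
    if score >= 2:
        return MID
    return CHEAP
```

The real version would use an LLM or embedding classifier rather than keywords, but the shape — score, then route to the cheapest capable model — is the reusable part.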
What comes next
The system is running and observable. The natural next improvement is replacing the analyst.py JSON output parser with PydanticAI structured output — so the DeFiAnalysis schema is validated at the LLM boundary, not guessed at. That removes the last "parse a dict and hope" step from the critical path.
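To make the goal concrete, here is a stdlib stand-in for what a typed boundary buys: parse once, validate ranges, fail loudly. The DeFiAnalysis fields below are assumptions for illustration, not the real schema, and PydanticAI would generate and enforce this at the model call itself:

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class DeFiAnalysis:
    # Assumed fields for illustration — not the actual hydra schema
    protocol: str
    risk_score: float
    summary: str

    def __post_init__(self):
        if not 0.0 <= self.risk_score <= 1.0:
            raise ValueError(f"risk_score out of range: {self.risk_score}")

def parse_analysis(raw: str) -> DeFiAnalysis:
    """Fail loudly on malformed LLM output instead of passing a raw dict downstream."""
    data = json.loads(raw)
    return DeFiAnalysis(protocol=data["protocol"],
                        risk_score=float(data["risk_score"]),
                        summary=data["summary"])

ok = parse_analysis('{"protocol": "uniswap-v3", "risk_score": 0.3, "summary": "stable"}')
```

A malformed field becomes a node_error risk at the boundary, in keeping with the fail-loudly pattern from the top of this article.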
The complete stack
| Layer | Technology | Article |
|---|---|---|
| Orchestration | LangGraph 1.1 | Article 1 |
| Automation | n8n 2.0 | Article 2 |
| Knowledge | pgvector + LlamaIndex + GraphRAG | Article 3 |
| Observability | LangFuse (self-hosted) + W&B Weave | Article 4 |
| Specialization | Fine-tuned Qwen 7B via Ollama | Article 5 |
| Coordination | Multi-agent swarm + OpenRouter routing | Article 6 |
| Security | SOAR + Guardian + Tenderly simulation | Article 7 |
| Resilience | Structured logging · Tenacity retries · LangFuse self-hosted | This article |
AI to Web3 series — building Hydra, a sovereign multi-agent DeFi intelligence mesh:
1 — LangChain orchestration · 2 — n8n execution · 3 — RAG at scale · 4 — LLM observability · 5 — Fine-tuning · 6 — Agent swarms · 7 — SOAR · 8 — Production resilience
Get weekly intel — courtesy of intel.hyperdrift.io