LLM Observability: Seeing Inside the Black Box with LangFuse and W&B Weave

You cannot debug what you cannot see. And an AI agent that fails silently, confidently, is the most expensive kind of failure.


This is part four of the AI to Web3 series. We have built the LangGraph orchestration scaffold (Article 1), the n8n execution layer (Article 2), and the RAG knowledge system (Article 3) for Hydra, our sovereign DeFi intelligence mesh. Today we wire in the observability layer.


Why we are writing about this

Three agents. Several tool calls per cycle. Retrieval from a live vector database. An n8n workflow at the end. Every step is a point of failure that does not produce an exception — it produces a confident wrong answer, a missed retrieval, a stale cached response, or a transaction that looked reasonable to the model but was based on data from fifteen minutes ago.

Traditional APM tools (Datadog, New Relic) tell you when your server is down. They do not tell you when your agent reasoned incorrectly. Latency is normal. Token counts are normal. The output just happens to be wrong.

LLM observability is the discipline of making AI systems debuggable. The 2026 State of Agent Engineering report puts adoption at 89% among teams with agents in production. It is not optional for anything handling money.


What makes LLM observability different from traditional APM

Classical observability is infrastructure-focused: CPU, memory, latency, error rates, request counts. These metrics tell you if the system is running. They do not tell you if it is correct.

LLM observability adds a semantic layer:

  • Prompt drift. Did the prompt change between deployments? If so, did output quality change?
  • Retrieval quality. Are the retrieved chunks actually relevant to the query? Are they current?
  • Hallucination detection. Did the model assert something that is not in the retrieved context?
  • Reasoning chain integrity. Did the agent take the steps it was supposed to take, in the right order?
  • Token cost per decision. How much did this agent cycle cost, at the provider level?
  • Tool call accuracy. When the agent called a tool, did it call the right one with the right parameters?

These are not things you can measure by watching CPU graphs. They require tracing the full execution — input, retrieved context, model response, tool calls, final output — as a structured tree, and then evaluating the quality of each node.
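That execution tree can be made concrete with a toy data structure. This is a minimal illustration only; the node fields, names, and numbers are assumptions for the sketch, not any vendor's trace schema:

```python
from dataclasses import dataclass, field

@dataclass
class TraceNode:
    """One span in the execution tree: an agent run, LLM call, or tool call."""
    name: str
    input: str
    output: str
    tokens: int = 0
    cost_usd: float = 0.0
    children: list["TraceNode"] = field(default_factory=list)

    def total_cost(self) -> float:
        # This node plus every descendant: "token cost per decision".
        return self.cost_usd + sum(c.total_cost() for c in self.children)

# Root = the agent cycle; children = the calls it made, in order.
run = TraceNode("hydra_run", "check pool X", "rebalance", children=[
    TraceNode("retrieve", "pool X TVL", "3 chunks"),
    TraceNode("llm_reason", "<prompt>", "<response>", tokens=1200, cost_usd=0.012),
])
print(f"${run.total_cost():.3f}")  # → $0.012
```

Evaluating "each node" then means walking this tree and attaching a quality score per span, not per request.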


LangFuse

LangFuse is the open-source answer to LLM observability. MIT license, self-hostable via Docker Compose or Kubernetes, backed by ClickHouse for high-throughput ingestion.

Tracing. Every LLM call in your application is logged as a trace with a parent-child tree structure: the outer agent run is the root, each LLM call and tool invocation is a child node. You see the full execution path, the exact prompts sent, the exact responses received, token counts, latency, and cost — for every run, forever.

Evaluation. LangFuse supports three evaluation modes:

  • LLM-as-a-judge — asynchronously scores outputs for relevance, faithfulness, and hallucination. Runs after the user response is already sent, so zero added latency.
  • Human annotation — flag individual traces for review, rate quality, annotate with ground truth. Builds the dataset for future automated evaluation.
  • Custom eval pipelines — define your own scoring functions via the API, run them on any subset of traces, track scores over time.
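The zero-added-latency property of LLM-as-a-judge comes from scheduling the judge only after the user response already exists. A minimal asyncio sketch of that pattern, with a string-matching stand-in instead of a real model call (`judge`, `handle`, and `SCORES` are illustrative names, not the LangFuse API):

```python
import asyncio

SCORES: dict[str, float] = {}

async def judge(query: str, response: str) -> None:
    # Stand-in for an LLM-as-a-judge call: in production this would prompt a
    # model for relevance/faithfulness and upload the score to the backend.
    await asyncio.sleep(0)  # simulate the judge's own latency
    SCORES[query] = 1.0 if query.split()[-1] in response else 0.0

async def handle(query: str) -> str:
    response = f"Answer about {query}"  # stand-in for the user-facing LLM call
    # The judge is scheduled only AFTER the response exists, so the
    # user-facing path pays no latency for evaluation.
    scoring = asyncio.create_task(judge(query, response))
    await scoring  # a real server would let this task finish in the background
    return response

result = asyncio.run(handle("pool TVL"))
```

In a server the `await scoring` line would be dropped and the task left to complete on its own; it is awaited here only so the demo finishes deterministically.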

Prompt management. Prompts are version-controlled in LangFuse and served via SDK. Each trace is linked to the exact prompt version that produced it. When you change a prompt, you can see immediately whether output quality improved or regressed across all downstream agents.

Dataset management. Curate datasets from production traces — flag a trace as a positive or negative example, build a test set, run it before every deployment. This is how you catch prompt regressions before they reach production.
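The curation step reduces to a filter over human-labeled traces. A sketch of the shape of that operation; the trace and label fields here are assumptions, not the LangFuse dataset API:

```python
def build_dataset(traces: list[dict]) -> list[dict]:
    """Keep only human-reviewed traces and convert them into eval examples."""
    return [
        {"input": t["input"], "expected": t["output"], "label": t["label"]}
        for t in traces
        if t.get("label") in ("positive", "negative")
    ]

traces = [
    {"input": "APY of pool X?", "output": "6%", "label": "positive"},
    {"input": "APY of pool X?", "output": "14%", "label": "negative"},
    {"input": "TVL of pool Y?", "output": "$2M", "label": None},  # unreviewed
]
dataset = build_dataset(traces)
print(len(dataset))  # → 2
```

The negative examples are as valuable as the positive ones: they become the regression cases a new prompt must not reproduce.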

TypeScript SDK. The JavaScript/TypeScript SDK reached v4 general availability this year, making LangFuse viable for Next.js and Node.js agent systems.

LangFuse integrates natively with LangChain, LlamaIndex, OpenAI SDK, LiteLLM, and OpenTelemetry. For our Hydra stack it slots in with a single decorator:

from langfuse.langchain import CallbackHandler  # langfuse.callback in SDK v2, removed in v4
langfuse_handler = CallbackHandler()
# Pass to any LangChain or LangGraph invocation as a callback

W&B Weave

W&B Weave is Weights & Biases' answer to the same problem — built for teams already in the W&B ecosystem for ML experiment tracking. Apache 2.0 license, self-hostable, Python and TypeScript SDKs.

The signature primitive is the @weave.op() decorator:

import weave
weave.init("hydra-production")

@weave.op()
async def sentinel_node(state: HydraState) -> HydraState:
    # Every call to this function is automatically traced:
    # inputs, outputs, duration, any exceptions
    ...

Wrap any function and every call is logged as a nested trace tree — inputs, outputs, metadata, duration. The hierarchy is inferred from the call stack: if sentinel_node calls retriever.retrieve(), which calls the embedding model, which calls the LLM for reranking, all of it appears as a single trace tree.

OpenTelemetry-native. Weave ingests OTLP traces natively, so any OTEL-compatible tool (Jaeger, Grafana Tempo, your existing observability stack) can send data to Weave alongside LLM-specific traces.

Evaluation framework. Define scorers (correctness, safety, retrieval faithfulness), run them over curated datasets, track scores across time and model versions. Weave links evaluation results to the exact code version and model snapshot that produced them — critical for regulated environments that need audit trails.

Governance. The audit trail linking code versions, model snapshots, evaluation datasets, and production traces is the core differentiator for teams under regulatory scrutiny.


The full observability landscape

Platform | Type | Open source | Best for
LangFuse | LLM-first | Yes (MIT) | Self-hosted, data sovereignty, framework-agnostic
W&B Weave | LLM-first | Yes (Apache 2) | ML teams already on W&B, OTEL-native
LangSmith | LLM-first | No | LangChain/LangGraph users who want zero-config
Braintrust | Eval-first | No | CI/CD eval-gated deployments
Helicone | Proxy | Partial | Zero-code cost tracking, multi-provider
Arize Phoenix | APM-native | Partial | OTEL shops, embedding drift detection
Datadog LLM | APM-native | No | Enterprises with existing Datadog
Confident AI | Eval-first | No | 50+ research-backed metrics, auto-dataset curation

Selection heuristics:

  • Self-hosted / data sovereignty → LangFuse
  • Already on W&B for ML training runs → Weave
  • Pure LangChain shop, zero config → LangSmith
  • Eval-heavy CI/CD → Braintrust
  • Zero-code multi-provider cost tracking → Helicone

For Hydra we use LangFuse — it is MIT-licensed, ClickHouse-backed for high throughput, and self-hostable alongside the rest of our infrastructure. No production traces leave our network.


DeFi-specific observability requirements

Running an AI agent over financial data introduces observability requirements that general-purpose LLM tools do not cover out of the box. The five-stage pipeline we track in Hydra:

1. Data freshness. Every signal that enters the agent is tagged with a data_freshness timestamp. If the Sentinel retrieves a pool TVL figure older than 10 minutes, that is flagged in the trace. Stale data is a silent failure mode.

2. Confidence scoring. Each retrieval result carries a confidence score. The Strategist's reasoning is tagged with the minimum confidence of its inputs. Low-confidence decisions are routed to human review rather than the Executor.

3. Signed attestations. Agent decisions are logged with a hash of the input state, allowing any decision to be reconstructed and verified after the fact. This is the foundation for on-chain auditability — the attestation can be published on-chain as proof of the reasoning chain.

4. Execution trace. When the Executor calls the n8n webhook, the webhook response (including any on-chain transaction hash) is logged back to the LangFuse trace. The full path from "retrieved pool data" to "submitted transaction" is one linked trace.

5. Silent failure detection. Tenderly's Simulation API is called before any transaction is signed. If the simulation fails, the LangFuse trace captures the failure reason. The agent never learns "the transaction failed" — it learns "the transaction would have failed, and here is why."


Production use cases

Prompt regression testing. You change the Strategist's system prompt to improve yield analysis. Before deploying, you run the new prompt against your curated dataset of 200 historical scenarios. LangFuse shows you whether average faithfulness scores went up or down. The deployment is gated on the eval result.
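The gate itself is a one-line comparison once the eval scores exist. A sketch with made-up numbers; the threshold and tolerance are illustrative:

```python
from statistics import mean

def gate_deployment(scores: list[float], baseline: float,
                    tolerance: float = 0.02) -> bool:
    """Deploy only if the new prompt's mean eval score has not regressed
    more than `tolerance` below the current baseline."""
    return mean(scores) >= baseline - tolerance

# Faithfulness scores from replaying the new prompt over historical scenarios.
new_prompt_scores = [0.91, 0.88, 0.95, 0.90]
print(gate_deployment(new_prompt_scores, baseline=0.90))  # → True
```

In CI this boolean decides whether the prompt version is promoted; a failing gate blocks the deploy rather than producing a dashboard alert after the fact.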

Cost attribution per agent. With per-trace token counting, you know exactly what each agent costs per decision cycle. When you discover the Sentinel is spending 40% of the token budget on low-value queries, you refactor the retrieval logic. Measurable impact.
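Per-agent attribution is a plain aggregation over trace spans. A sketch with illustrative span fields (real spans would carry separate prompt/completion token counts and per-model pricing):

```python
from collections import defaultdict

def cost_share_by_agent(spans: list[dict]) -> dict[str, float]:
    """Aggregate token usage per agent, then express each agent's share."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span["agent"]] += span["tokens"]
    grand_total = sum(totals.values()) or 1.0  # avoid division by zero
    return {agent: t / grand_total for agent, t in totals.items()}

spans = [
    {"agent": "sentinel", "tokens": 4000},
    {"agent": "strategist", "tokens": 5000},
    {"agent": "executor", "tokens": 1000},
]
shares = cost_share_by_agent(spans)
print(f"{shares['sentinel']:.0%}")  # → 40%
```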

Hallucination detection in DeFi context. An LLM-as-judge scorer checks whether the Analyst's output is grounded in the retrieved context. If the Analyst claims a pool has 14% APY but the retrieved document says 6%, that trace is flagged for review. Over time, these flags identify systematic weaknesses in the retrieval quality.
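A deliberately crude version of that grounding check can be written with a regex over numeric claims. Production judges use an LLM rather than string matching, and the trace fields here are assumptions, but the principle is the same:

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?%?")

def faithfulness_score(trace: dict) -> float:
    """Fraction of numeric claims in the output that appear in the context."""
    claims = NUMBER.findall(trace["output"])
    if not claims:
        return 1.0  # nothing to ground
    return sum(1 for c in claims if c in trace["context"]) / len(claims)

trace = {
    "output": "Pool X offers 14% APY on $2.1M TVL.",
    "context": "Pool X: APY 6%, TVL $2.1M as of the latest snapshot.",
}
print(faithfulness_score(trace))  # → 0.5 ("14%" is ungrounded)
```

Any trace scoring below a threshold gets flagged for human review, which in turn feeds the annotation dataset described earlier.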


Hydra — Article 4 contribution: the observability layer

Every node in the Hydra LangGraph graph is now wrapped with LangFuse callbacks. The trace tree mirrors the agent graph: the outer Hydra run is the root, each agent node is a child, each LLM call and tool invocation within an agent is a grandchild.

# hydra/observer.py — commit 3f63f16
import hashlib, json, logging
from langfuse import Langfuse
from langfuse.langchain import CallbackHandler  # langfuse.callback removed in v4
from hydra.orchestrator import HydraState

log = logging.getLogger(__name__)

# LangFuse 4.x reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_BASE_URL
# from the environment automatically — no constructor args needed.
langfuse = Langfuse()

def get_langfuse_callback() -> CallbackHandler:
    return CallbackHandler()

def log_decision_attestation(state: HydraState, trace_id: str) -> str:
    min_confidence = min(
        (s.get("confidence", 0) for s in state["signals"]), default=0.0
    )
    attestation_data = {
        "decisions": state["decisions"],
        "signal_count": len(state["signals"]),
        "risk_count": len(state["risks"]),
        "min_confidence": round(min_confidence, 4),
        "trace_id": trace_id,
    }
    attestation_hash = hashlib.sha256(
        json.dumps(attestation_data, sort_keys=True).encode()
    ).hexdigest()

    try:
        langfuse.create_score(  # langfuse.score() removed in LangFuse v3
            trace_id=trace_id,
            name="decision_confidence",
            value=min_confidence,
            comment=f"attestation:{attestation_hash}",
        )
    except Exception as e:
        # Observability must never break the critical path — but failures
        # should be visible so operators know telemetry was lost.
        log.warning("observer: LangFuse score upload failed (trace %s): %s", trace_id, e)

    return attestation_hash

The LangGraph graph now passes the callback to every node invocation:

# hydra/orchestrator.py — commit d879df2
import uuid
from hydra.observer import get_langfuse_callback

# Fresh thread_id per run prevents stale checkpoint state from causing
# concurrent-write conflicts on operator.add channels (risks, decisions).
config = {
    "configurable": {"thread_id": str(uuid.uuid4())},
    "callbacks": [get_langfuse_callback()],
}
result = await hydra.ainvoke(initial_state, config=config)

Every run produces a full trace in your self-hosted LangFuse instance at http://localhost:3000. See Article 8 for the complete Docker Compose setup — LangFuse 3.x requires ClickHouse, MinIO, and several non-obvious configuration steps on macOS. You see the Sentinel's retrieval queries, the chunks it found, the Strategist's reasoning, the Executor's webhook call, and the on-chain result — all in one linked view.

Updated project structure:

hydra/
├── orchestrator.py
├── executor.py
├── sentinel.py
├── observer.py          # LangFuse integration (this article)
├── n8n/
│   ├── hydra-executor.workflow.json
│   └── hydra-ingestor.workflow.json
├── docker-compose.yml   # includes LangFuse + ClickHouse + Postgres
├── requirements.txt
└── .env.example
# docker-compose.yml (relevant services)
services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: hydra
  langfuse-server:
    image: langfuse/langfuse:latest
    depends_on: [postgres, clickhouse]
    environment:
      DATABASE_URL: postgresql://postgres@postgres/langfuse
      # ClickHouse and MinIO settings omitted here; see Article 8 for the full setup
    ports: ["3000:3000"]
  clickhouse:
    image: clickhouse/clickhouse-server:latest

The stack so far

Layer | Technology | Status
Orchestration | LangGraph 1.1 | Done — Article 1
Automation | n8n 2.0 | Done — Article 2
Knowledge | pgvector + LlamaIndex + GraphRAG | Done — Article 3
Observability | LangFuse (self-hosted) + W&B Weave | Done — this article
Specialization | Fine-tuned SLM | Article 5
Coordination | Multi-agent swarm + routing | Article 6
Security | SOAR + Guardian | Article 7
Resilience | Structured logging · Tenacity retries · LangFuse self-hosted | Article 8

Next in this series: Domain-specific fine-tuning — now that Hydra can see what its agents are doing, we can see where general-purpose models fail at DeFi reasoning. How to train a specialist model that outperforms GPT-4 on protocol analysis at 1% of the inference cost.


AI to Web3 series — building Hydra, a sovereign multi-agent DeFi intelligence mesh:

1 — LangGraph orchestration · 2 — n8n execution · 3 — RAG at scale · 4 — LLM observability · 5 — Fine-tuning · 6 — Agent swarms · 7 — SOAR · 8 — Production resilience

Get weekly intel — courtesy of intel.hyperdrift.io