AI session

Wide events over metrics, logs, and traces

Exported on 07/17/2024 at 00:00:00 GMT+0 from Codex CLI

Source material

Substack highlights

  • Burmistrov argues that the industry's obsession with the "three pillars" of observability (metrics, logs, traces) hides the real goal: being able to explore high-cardinality production data quickly when the failure mode is still an unknown unknown.
  • At Meta, the Scuba system treats everything as a wide event: a single, very wide record (think JSON object) that captures every bit of context about what happened. Analysts can slice, dice, group, and filter those events along any dimension without pre-aggregation penalties.
  • The post walks through a debugging story where a conversion drop is isolated by iteratively filtering wide events until just one OS version and app build remain. Because nothing was pre-aggregated, no detail was lost, and the people who own the failing component are immediately obvious (a rough sketch of that filtering loop follows this list).
  • Wide events subsume the three pillars because traces are just events with TraceId/SpanId, logs are structured fields with a message, and metrics are periodic state snapshots. Keeping everything raw defers schema decisions to query time and makes cross-correlation much easier (see the second sketch below).
  • The author's frustration is mainly with OpenTelemetry education: the glossary of 60+ terms and pillar-first framing make observability feel complex and gated, when the underlying idea could be as simple as "capture rich events and explore them".
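
A minimal sketch of the query side, assuming nothing fancier than an in-memory list of dicts; the field names and values are invented for illustration and are not Meta's actual Scuba schema:

    from collections import Counter

    # Each event is one flat, very wide record carrying all available context,
    # including identifiers that would otherwise live in a separate tracing system.
    events = [
        {"event": "checkout", "status": "error", "os": "iOS 17.4",
         "app_build": "512.0", "region": "eu-west", "duration_ms": 840,
         "trace_id": "a1b2", "span_id": "01"},
        {"event": "checkout", "status": "ok", "os": "iOS 17.3",
         "app_build": "511.0", "region": "us-east", "duration_ms": 120,
         "trace_id": "c3d4", "span_id": "01"},
        # ...millions more rows in a real columnar store
    ]

    def facet(rows, group_by, **filters):
        """Filter raw events on arbitrary fields, then count by one dimension."""
        hits = [r for r in rows if all(r.get(k) == v for k, v in filters.items())]
        return Counter(r.get(group_by) for r in hits)

    # The debugging loop from the post: keep narrowing until one segment remains.
    print(facet(events, "os", event="checkout", status="error"))
    print(facet(events, "app_build", event="checkout", status="error", os="iOS 17.4"))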

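The pillar mapping is mechanical enough to show directly. The span, log, and snapshot shapes below are simplified stand-ins, not the OpenTelemetry data model:

    import time

    def span_to_event(span):
        # A trace is just a set of wide events that happen to carry TraceId/SpanId.
        return {"trace_id": span["trace_id"], "span_id": span["span_id"],
                "parent_id": span.get("parent_id"), "duration_ms": span["duration_ms"],
                **span["attributes"]}

    def log_to_event(record):
        # A log line is a wide event whose fields include a human-readable message.
        return {"message": record["message"], **record["fields"]}

    def metric_snapshot(events, field):
        # A metric is a periodic snapshot of state derived from the raw events.
        values = [e[field] for e in events if field in e]
        return {"ts": int(time.time()), "field": field,
                "count": len(values), "sum": sum(values)}

    span = {"trace_id": "a1b2", "span_id": "01", "duration_ms": 840,
            "attributes": {"event": "checkout", "status": "error"}}
    print(span_to_event(span))
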
HN discussion themes

  • Commenters largely agree the concept is powerful, but emphasize that storing and querying every field for every event is expensive, especially when you pay a vendor. Cost, not confusion, is the reason many teams lean on pre-aggregated metrics.
  • Several readers note that "wide events" map closely to structured logging, Kafka streams feeding Elastic/ClickHouse, or event-sourcing patterns. The novelty is less technical than operational maturity and great tooling.
  • Engineers from smaller shops point out that on-prem pipelines can be cost-effective (e.g., Kafka + Elastic on a handful of servers), but once the pipeline moves to a cloud vendor the bill climbs steeply as traffic grows, so success itself gets expensive.
  • A few warn about data swamps: "just put it there" easily leads to uncontrolled retention, unclear ownership, and shadow dependencies when other teams start automating against your event stream.
  • Others explore hybrids: generating metrics from the event firehose, batching wide events into Parquet, or using OpenTelemetry spans but improving the UI so the wide-event mental model is accessible (the first two ideas are sketched after this list).
  • Meta-level criticism surfaces too: vendors bend "observability" into whatever they can sell, and the "all you need is X" framing ignores the trade-offs that real teams juggle.
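
A rough sketch of the first two hybrids, assuming the pyarrow package is available; batch boundaries, file names, and metric keys are placeholders:

    from collections import Counter
    import pyarrow as pa
    import pyarrow.parquet as pq

    def flush_batch(events, path):
        """Batch a window of wide events into a Parquet file for cheap cold storage."""
        pq.write_table(pa.Table.from_pylist(events), path)

    def derive_counters(events):
        """Pre-aggregate a handful of cheap metrics from the same firehose."""
        return Counter((e.get("event"), e.get("status")) for e in events)

    batch = [
        {"event": "checkout", "status": "error", "os": "iOS 17.4", "duration_ms": 840},
        {"event": "checkout", "status": "ok", "os": "iOS 17.3", "duration_ms": 120},
    ]
    flush_batch(batch, "events-batch-0001.parquet")
    print(derive_counters(batch))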

Takeaways for our stack

  • Wide events shine when you have the infrastructure to store raw context cheaply and a UI that makes faceted exploration fast; without both, the approach stalls.
  • Cost governance and retention policies have to evolve alongside the data model, or the wide-event lake turns into an unsearchable swamp.
  • Even if we cannot afford Scuba-level depth, adopting structured events as the default emission format keeps the door open for richer tooling later. Metrics can always be derived downstream, but lost detail is gone forever (a minimal emission helper is sketched below).
  • Communicating the mental model matters: pitch observability as "capturing richly structured events you can interrogate" rather than "learn these three disjoint pillars".
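
A minimal emission helper along those lines, assuming events are written to stdout as one JSON object per line; the emit wrapper and its field names are ours, not any particular library's API:

    import json
    import sys
    import time
    import uuid

    def emit(event_name, **context):
        """Write one wide, structured event per unit of work; metrics or traces
        can be derived from these records downstream."""
        record = {"ts": time.time(), "event": event_name,
                  "request_id": str(uuid.uuid4()), **context}
        sys.stdout.write(json.dumps(record, default=str) + "\n")

    # Attach every piece of context available at emission time: dropping a field
    # later is easy, recovering one that was never recorded is impossible.
    emit("checkout", status="error", os="iOS 17.4", app_build="512.0",
         region="eu-west", duration_ms=840)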