AI session

Wide events over metrics, logs, and traces

Exported on 07/17/2024 at 00:00:00 GMT+0 from Codex CLI

Source material

Substack highlights

  • Burmistrov argues that the industry's obsession with the "three pillars" of observability (metrics, logs, traces) hides the real goal: being able to explore high-cardinality production data quickly when the failure mode is still an unknown unknown.
  • At Meta, the Scuba system treats everything as a wide event: a single, very wide record (think JSON object) that captures every bit of context about what happened. Analysts can slice, dice, group, and filter those events along any dimension without pre-aggregation penalties.
  • The post walks through a debugging story where a conversion drop is isolated by iteratively filtering wide events until just one OS version and app build remain. Because nothing was pre-aggregated, no detail was lost, and the people who own the failing component are immediately obvious (a rough sketch of that filtering loop follows this list).
  • Wide events subsume the three pillars because traces are just events with TraceId/SpanId, logs are structured fields with a message, and metrics are periodic state snapshots. Keeping everything raw defers schema decisions to query time and makes cross-correlation much easier (see the second sketch below).
  • The author's frustration is mainly with OpenTelemetry education: the glossary of 60+ terms and pillar-first framing make observability feel complex and gated, when the underlying idea could be as simple as "capture rich events and explore them".
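
A minimal sketch of the query side, assuming nothing fancier than an in-memory list of dicts; the field names and values are invented for illustration and are not Meta's actual Scuba schema:

    from collections import Counter

    # Each event is one flat, very wide record carrying all available context,
    # including identifiers that would otherwise live in a separate tracing system.
    events = [
        {"event": "checkout", "status": "error", "os": "iOS 17.4",
         "app_build": "512.0", "region": "eu-west", "duration_ms": 840,
         "trace_id": "a1b2", "span_id": "01"},
        {"event": "checkout", "status": "ok", "os": "iOS 17.3",
         "app_build": "511.0", "region": "us-east", "duration_ms": 120,
         "trace_id": "c3d4", "span_id": "01"},
        # ...millions more rows in a real columnar store
    ]

    def facet(rows, group_by, **filters):
        """Filter raw events on arbitrary fields, then count by one dimension."""
        hits = [r for r in rows if all(r.get(k) == v for k, v in filters.items())]
        return Counter(r.get(group_by) for r in hits)

    # The debugging loop from the post: keep narrowing until one segment remains.
    print(facet(events, "os", event="checkout", status="error"))
    print(facet(events, "app_build", event="checkout", status="error", os="iOS 17.4"))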

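The pillar mapping is mechanical enough to show directly. The span, log, and snapshot shapes below are simplified stand-ins, not the OpenTelemetry data model:

    import time

    def span_to_event(span):
        # A trace is just a set of wide events that happen to carry TraceId/SpanId.
        return {"trace_id": span["trace_id"], "span_id": span["span_id"],
                "parent_id": span.get("parent_id"), "duration_ms": span["duration_ms"],
                **span["attributes"]}

    def log_to_event(record):
        # A log line is a wide event whose fields include a human-readable message.
        return {"message": record["message"], **record["fields"]}

    def metric_snapshot(events, field):
        # A metric is a periodic snapshot of state derived from the raw events.
        values = [e[field] for e in events if field in e]
        return {"ts": int(time.time()), "field": field,
                "count": len(values), "sum": sum(values)}

    span = {"trace_id": "a1b2", "span_id": "01", "duration_ms": 840,
            "attributes": {"event": "checkout", "status": "error"}}
    print(span_to_event(span))
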
HN discussion themes

  • Commenters largely agree the concept is powerful, but emphasize that storing and querying every field for every event is expensive, especially when you pay a vendor. Cost, not confusion, is the reason many teams lean on pre-aggregated metrics.
  • Several readers note that "wide events" map closely to structured logging, Kafka streams feeding Elastic/ClickHouse, or event-sourcing patterns. The novelty is less technical than operational maturity and great tooling.
  • Engineers from smaller shops point out that on-prem pipelines can be cost-effective (e.g., Kafka + Elastic on a handful of servers), but once the pipeline moves to a cloud vendor the bill climbs steeply as traffic grows, so success itself gets expensive.
  • A few warn about data swamps: "just put it there" easily leads to uncontrolled retention, unclear ownership, and shadow dependencies when other teams start automating against your event stream.
  • Others explore hybrids: generating metrics from the event firehose, batching wide events into Parquet, or using OpenTelemetry spans but improving the UI so the wide-event mental model is accessible (the first two ideas are sketched after this list).
  • Meta-level criticism surfaces too: vendors bend "observability" into whatever they can sell, and the "all you need is X" framing ignores the trade-offs that real teams juggle.
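
A rough sketch of the first two hybrids, assuming the pyarrow package is available; batch boundaries, file names, and metric keys are placeholders:

    from collections import Counter
    import pyarrow as pa
    import pyarrow.parquet as pq

    def flush_batch(events, path):
        """Batch a window of wide events into a Parquet file for cheap cold storage."""
        pq.write_table(pa.Table.from_pylist(events), path)

    def derive_counters(events):
        """Pre-aggregate a handful of cheap metrics from the same firehose."""
        return Counter((e.get("event"), e.get("status")) for e in events)

    batch = [
        {"event": "checkout", "status": "error", "os": "iOS 17.4", "duration_ms": 840},
        {"event": "checkout", "status": "ok", "os": "iOS 17.3", "duration_ms": 120},
    ]
    flush_batch(batch, "events-batch-0001.parquet")
    print(derive_counters(batch))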

Takeaways for our stack

  • Wide events shine when you have the infrastructure to store raw context cheaply and a UI that makes faceted exploration fast; without both, the approach stalls.
  • Cost governance and retention policies have to evolve alongside the data model, or the wide-event lake turns into an unsearchable swamp.
  • Even if we cannot afford Scuba-level depth, adopting structured events as the default emission format keeps the door open for richer tooling later. Metrics can always be derived downstream, but lost detail is gone forever (a minimal emission helper is sketched below).
  • Communicating the mental model matters: pitch observability as "capturing richly structured events you can interrogate" rather than "learn these three disjoint pillars".
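
A minimal emission helper along those lines, assuming events are written to stdout as one JSON object per line; the emit wrapper and its field names are ours, not any particular library's API:

    import json
    import sys
    import time
    import uuid

    def emit(event_name, **context):
        """Write one wide, structured event per unit of work; metrics or traces
        can be derived from these records downstream."""
        record = {"ts": time.time(), "event": event_name,
                  "request_id": str(uuid.uuid4()), **context}
        sys.stdout.write(json.dumps(record, default=str) + "\n")

    # Attach every piece of context available at emission time: dropping a field
    # later is easy, recovering one that was never recorded is impossible.
    emit("checkout", status="error", os="iOS 17.4", app_build="512.0",
         region="eu-west", duration_ms=840)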