Source material
- All you need is Wide Events, not "Metrics, Logs and Traces" by Ivan Burmistrov (Feb 15, 2024)
- Hacker News discussion (Feb 27, 2024)
Substack highlights
- Burmistrov argues that the industry's obsession with the "three pillars" of observability (metrics, logs, traces) hides the real goal: being able to explore high-cardinality production data quickly when the failure mode is still an unknown unknown.
- At Meta, the Scuba system treats everything as a wide event: a single, very wide record (think JSON object) that captures every bit of context about what happened. Analysts can slice, dice, group, and filter those events along any dimension without pre-aggregation penalties; a minimal sketch of the record shape and that workflow follows this list.
- The post walks through a debugging story in which a conversion drop is isolated by iteratively filtering wide events until only one OS version and app build remain. Because nothing was pre-aggregated, no detail was lost, and it is immediately obvious which team owns the failing component.
- Wide events subsume the three pillars: traces are just events carrying a TraceId/SpanId, logs are events with structured fields plus a message, and metrics are periodic snapshots of state (compare the second sketch below). Keeping everything raw defers schema decisions to query time and makes cross-correlation much easier.
- The author's frustration is mainly with OpenTelemetry education: a glossary of 60+ terms and pillar-first framing make observability feel complex and gated, when the underlying idea could be as simple as "capture rich events and explore them".
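A minimal sketch of the wide-event idea in Python, assuming made-up field names (os_version, app_build, status, and so on); it illustrates the mental model, not Scuba's actual schema or query engine.

```python
from collections import Counter

# One wide event: a single flat record carrying every bit of context the
# emitting service had at hand. All field names are illustrative.
event = {
    "timestamp": "2024-02-15T12:00:00Z",
    "event": "checkout_submitted",
    "status": "error",
    "user_id": "u_42",
    "os": "android",
    "os_version": "14",
    "app_build": "512.0",
    "region": "eu-west",
    "payment_provider": "acme_pay",
    "latency_ms": 840,
}

# A tiny corpus so the example runs end to end.
all_events = [
    event,
    {**event, "user_id": "u_43", "status": "ok", "app_build": "511.9"},
    {**event, "user_id": "u_44", "os_version": "13"},
]

def narrow(events, **filters):
    """Keep events whose fields match every filter -- the 'slice' step,
    applied over raw, un-aggregated records."""
    return [e for e in events if all(e.get(k) == v for k, v in filters.items())]

def group_count(events, *fields):
    """Group surviving events by a tuple of dimensions and count them."""
    return Counter(tuple(e.get(f) for f in fields) for e in events)

# The debugging story in miniature: filter down to failures, then keep
# grouping by new dimensions until one (os_version, app_build) stands out.
failures = narrow(all_events, event="checkout_submitted", status="error")
print(group_count(failures, "os_version", "app_build").most_common(3))
```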
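The "pillars are just special cases" claim can also be made concrete: a span, a log line, and a metric are the same record shape with different fields. Only TraceId/SpanId come from the post; every other field name below is an assumption for illustration.

```python
# Three 'pillars', one record shape; only the fields differ, so the same
# store and query path can serve all three.

trace_event = {               # a trace is an event carrying span identity
    "TraceId": "4bf92f35",
    "SpanId": "00f067aa",
    "ParentSpanId": None,
    "operation": "GET /checkout",
    "duration_ms": 118,
}

log_event = {                 # a log is an event with a message field;
    "TraceId": "4bf92f35",    # sharing TraceId turns correlation into a join
    "SpanId": "00f067aa",
    "severity": "warning",
    "message": "payment provider retry exhausted",
}

metric_event = {              # a metric is a periodic snapshot of state
    "metric": "open_db_connections",
    "value": 87,
    "host": "web-17",
}

# All three can land in the same table and be queried the same way.
wide_events = [trace_event, log_event, metric_event]
```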
HN discussion themes
- Commenters largely agree the concept is powerful, but emphasize that storing and querying every field for every event is expensive, especially when you pay a vendor. Cost, not confusion, is the reason many teams lean on pre-aggregated metrics.
- Several readers note that "wide events" map closely to structured logging, Kafka streams feeding Elastic/ClickHouse, or event-sourcing patterns (see the first sketch after this list). The novelty lies less in the technique itself than in the operational maturity and tooling around it.
- Engineers from smaller shops point out that on-prem pipelines can be cost-effective (e.g., Kafka + Elastic on a handful of servers), but once you move to usage-billed cloud vendors, the bill climbs steeply with traffic, so success itself becomes expensive.
- A few warn about data swamps: "just put it there" easily leads to uncontrolled retention, unclear ownership, and shadow dependencies when other teams start automating against your event stream.
- Others explore hybrids: generating metrics from the event firehose, batching wide events into Parquet, or keeping OpenTelemetry spans but improving the UI so the wide-event mental model becomes accessible (the second sketch after this list shows the first two ideas).
- Meta-level criticism surfaces too: vendors bend "observability" into whatever they can sell, and the "all you need is X" framing ignores the trade-offs that real teams juggle.
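One way the "structured logging into Kafka" pattern from the thread might look, sketched with kafka-python; the broker address, topic name, and event fields are all assumptions, and the Elastic/ClickHouse side is whatever consumer bulk-loads the topic.

```python
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

# Assumed local broker; in the pattern commenters describe, a consumer on
# the other side bulk-loads these records into Elastic or ClickHouse.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda evt: json.dumps(evt).encode("utf-8"),
)

def emit(event_name, **context):
    """Structured logging as wide events: one flat JSON record per occurrence."""
    record = {"ts": time.time(), "event": event_name, **context}
    producer.send("wide-events", record)  # topic name is an assumption

emit("checkout_submitted", status="error", os_version="14", app_build="512.0")
producer.flush()
```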
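Two of the hybrid ideas can be sketched together: rolling a metric up from the event stream instead of emitting it at the source, and batching the raw events into Parquet for cheap retention. pyarrow is one library that can do the latter; the field names reuse the illustrative ones above.

```python
from collections import Counter

import pyarrow as pa
import pyarrow.parquet as pq

events = [
    {"ts": 1, "event": "checkout_submitted", "status": "error", "os_version": "14"},
    {"ts": 2, "event": "checkout_submitted", "status": "ok", "os_version": "14"},
    {"ts": 3, "event": "checkout_submitted", "status": "error", "os_version": "13"},
]

# Hybrid 1: derive a pre-aggregated metric downstream of the firehose,
# so the raw events stay intact while dashboards get cheap counters.
error_count = Counter(e["os_version"] for e in events if e["status"] == "error")
print(dict(error_count))  # {'14': 1, '13': 1}

# Hybrid 2: batch the raw events into a columnar Parquet file so full
# detail is retained cheaply for later ad-hoc queries.
table = pa.Table.from_pylist(events)
pq.write_table(table, "wide_events_batch.parquet")
```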
Takeaways for our stack
- Wide events shine when you have the infrastructure to store raw context cheaply and a UI that makes faceted exploration fast; without both, the approach stalls.
- Cost governance and retention policies have to evolve alongside the data model, or the wide-event lake turns into an unsearchable swamp.
- Even if we cannot afford Scuba-level depth, adopting structured events as the default emission format keeps the door open for richer tooling later (a minimal emitter follows this list). Metrics can always be derived downstream, but detail we never capture is gone forever.
- Communicating the mental model matters: pitch observability as "capturing richly structured events you can interrogate" rather than "learn these three disjoint pillars".
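A zero-dependency version of "structured events as the default emission format" could be as small as the helper below; everything about it (function name, field names, stdout as the sink) is an assumption about our stack, not an existing API. Any log shipper can pick up the JSON lines, and metrics can be derived downstream as in the hybrid sketch above.

```python
import json
import sys
import time

def emit_event(event_name: str, **context) -> None:
    """Write one wide event as a JSON line to stdout.

    Keeping emission this dumb preserves every field; aggregation,
    sampling, and retention decisions all move downstream, where they
    can change without touching application code.
    """
    record = {"ts": time.time(), "event": event_name, **context}
    sys.stdout.write(json.dumps(record) + "\n")

emit_event(
    "checkout_submitted",
    status="error",
    os_version="14",
    app_build="512.0",
    region="eu-west",
)
```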