This document is an outgrowth of our experience building systems which are meant to be readily diagnosable while simultaneously highly performant and scalable.
We have taken everything we have learned from this and distilled it into a series of facets. Applying these facets will allow one to build:
- A system that is readily diagnosable when issues and questions arise
- A system that is actively and accurately monitored for errors and exception states
- A system that performs and scales
Much of this may seem obvious, but nonetheless what we have more often seen are engineers and ops people starting at the most primitive state (multiple log files strewn across several servers or VMs) and working their way towards implementing all of these as exigencies arise. This Manifesto instead proposes developers start from the end – incorporate these facets from the outset so that expensive re-writes, re-tooling and emergency patches are never required.
This is best achieved by baking diagnostics into the platform or the application itself, so that developers and ops people do not have to think about it. Ideally, code is written in as close to a normal fashion as possible, placing as little burden as possible on developers to provide diagnostics, which they will almost invariably incorporate only when needed – i.e., when the application is crashing or having a significant outage. By then, it’s already too late.
Thus, the Anti-Heisenberg Principle – this is how we closely observe and measure systems without affecting them.
Facet 1: Exhaustive, Not Exhausting
“All key information is captured for every transaction, from every source, by default.”
A system should capture all core transactional information including:
- Custom Tags
Capturing this by default means that:
- Transactions do not slip through the cracks
- Systems do not need to be re-deployed to capture this information
- Systems do not need to be restarted to capture this information
- Servers do not need to be accessed, nor configuration files changed, to capture this information
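One way to picture capture by default is a single wrapper around every transaction entry point. The sketch below is illustrative only: the `transaction` helper, the `CAPTURED` list and the field names are assumptions made for this example, not part of any particular platform. It records one entry per transaction, including custom tags, whether the transaction succeeds or fails.

```python
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory sink; a real system would stream records elsewhere.
CAPTURED = []


@contextmanager
def transaction(name, **tags):
    """Record one entry per transaction, whether it succeeds or fails."""
    record = {
        "id": uuid.uuid4().hex,
        "name": name,
        "started_at": time.time(),
        "tags": dict(tags),  # custom tags supplied by the caller
        "status": "ok",
    }
    try:
        yield record
    except Exception as exc:
        # Failures are captured too, then re-raised unchanged.
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        record["ended_at"] = time.time()
        CAPTURED.append(record)


def run_examples():
    """Run one successful and one failing transaction; both get captured."""
    with transaction("checkout", customer="c-42"):
        pass
    try:
        with transaction("refund", customer="c-7"):
            raise ValueError("card declined")
    except ValueError:
        pass
    return CAPTURED
```

Because the wrapper sits at the entry point, the business code inside the `with` block needs no logging calls of its own.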
Facet 2: No-Fault Performance
“The logging system should be performant with minimal overhead. And it should NEVER be a bottleneck.”
It’s important that any diagnostic system operate efficiently. It should affect processing times only marginally or not at all – i.e., it should not be sending off multiple sets of logs or metrics over different HTTP connections, turning a light 100-millisecond transaction into a 2-second behemoth. As much as possible, data should be batched and streamed to a collection point.
Additionally, the diagnostics should never be a point of failure. It’s all well and good if the application runs efficiently while the collection system is performing correctly. But if the collection system fails, it should have little to no impact on the application being instrumented. This can be a painful lesson for developers who rely on a remote logging system in their transaction flow. To this end, it is important that the system be stress-tested against slow response times as well as outages at collection points.
In any circumstance, any fault within the diagnostics system should not impact the behavior of the application.
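A minimal sketch of this idea, assuming a bounded in-memory queue drained by a background thread. The `SafeShipper` class and its `send_batch` callable are hypothetical names for this example. The transaction path only enqueues, so it never blocks and never raises, and a failing collector is counted rather than propagated.

```python
import queue
import threading


class SafeShipper:
    """Batch events off the transaction path; collector faults stay contained."""

    def __init__(self, send_batch, max_pending=1000, batch_size=50):
        self._send_batch = send_batch  # hypothetical callable that ships a list of events
        self._queue = queue.Queue(maxsize=max_pending)
        self._batch_size = batch_size
        self.dropped = 0          # events discarded because the queue was full
        self.failed_batches = 0   # batches the collector rejected or never received
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def emit(self, event):
        """Called by the application: never blocks, never raises."""
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            self.dropped += 1  # shed diagnostics rather than slow the transaction

    def _run(self):
        while True:
            batch = [self._queue.get()]  # wait for at least one event
            while len(batch) < self._batch_size:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            try:
                self._send_batch(batch)
            except Exception:
                self.failed_batches += 1  # swallow the fault; the app is unaffected
            finally:
                for _ in batch:
                    self._queue.task_done()

    def flush(self):
        """Block until everything queued so far has been attempted (for tests or shutdown)."""
        self._queue.join()
```

Dropping events when the queue is full is a deliberate trade-off in this sketch: losing some diagnostics is preferred over slowing or failing the transaction.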
Facet 3: Frictionless Insight Anywhere
“From any device and location, the system should be readily and quickly diagnosable.”
At a basic level, this means no longer:
- Trying to figure out which of five servers to log into in order to see the relevant logs
- Hunting down keys or passwords for access to the particular server
- Rummaging around file systems to find the right set of logs
All of these are examples of significant friction, obstacles that can slow down or stymie the diagnosis of even relatively simple issues. The goal in removing this friction is to enable use-cases where one can:
- Receive an alert while at a restaurant that something is wrong with the system
- Open the relevant logging information on a phone to see what is happening
- Identify the issue based on the information presented (as well as ancillary queries)
- Send an email or change a setting that resolves the issue
- Problem solved – take a sip of your Chilcano and enjoy the entree 🙂
The dinner scenario may sound frivolous, but the key component is that information is available on mobile devices (or any other type of device for that matter), does not require memorizing keys or tags to access the desired information, and can be easily searched and navigated.
Facet 4: Pithy By Default, Verbose When Needed
“Information should be categorized hierarchically, and it should be easy to navigate through these tiers.”
For example, sometimes the only information needed is error logs – and sometimes the complete trace of program execution is required.
It is important to be able to filter out noise quickly and efficiently. At other times, “spelunking” is required and there is a need to trawl through everything. Being able to do both means properly categorizing logs, as well as surfacing the key information from them in a summary view.
At a basic level, this means categorizing log output as INFO, WARN, or ERROR, a technique familiar to most developers.
At scale, though, this becomes much trickier, as often the data required for debugging is in tension with performance needs. Many systems are carefully instrumented for naught, as actually enabling the instrumentation in production requires re-deployment or re-configuration. By leveraging best practices with event streaming as well as selective capture techniques, an “Anti-Heisenberg” system needs neither.
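One possible reading of "selective capture" is to buffer verbose records in memory and forward them only when something goes wrong. The sketch below assumes that interpretation. `RingBufferHandler` and `ListHandler` are hypothetical classes written for this example on top of Python's standard `logging` module.

```python
import logging
from collections import deque


class RingBufferHandler(logging.Handler):
    """Keep the most recent records in memory; forward them only when an error arrives."""

    def __init__(self, target, capacity=200, trigger_level=logging.ERROR):
        super().__init__(level=logging.DEBUG)
        self._target = target                   # handler that receives flushed records
        self._buffer = deque(maxlen=capacity)   # oldest records fall off automatically
        self._trigger_level = trigger_level

    def emit(self, record):
        self._buffer.append(record)
        if record.levelno >= self._trigger_level:
            # An error occurred: release the buffered context along with it.
            for buffered in self._buffer:
                self._target.handle(buffered)
            self._buffer.clear()


class ListHandler(logging.Handler):
    """Stand-in sink for the example; collects formatted messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def demo():
    sink = ListHandler()
    log = logging.getLogger("anti_heisenberg_demo")
    log.setLevel(logging.DEBUG)
    log.propagate = False
    log.handlers.clear()
    log.addHandler(RingBufferHandler(sink, capacity=3))
    log.debug("step 1")
    log.debug("step 2")
    log.info("step 3")
    log.debug("step 4")
    quiet = list(sink.messages)  # nothing has been forwarded yet
    log.error("boom")            # triggers a flush of the last 3 records
    return quiet, sink.messages
```

With a capacity of 3, `demo()` forwards nothing until the error, then forwards the two records preceding it plus the error itself. The application logs at full verbosity throughout, with no re-deployment or re-configuration needed to see the detail.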
Facet 5: Structured Yet Flexible
“When the logs contain structured data, it is indexed and easily searchable – but this is not required”
Often, it is useful to create logs that use JSON or XML or some other structured format for data. This ensures data is captured in a consistent and complete way. And once captured, it should be easy to query based on these structures. The fields should be indexed, as well as viewable in a UI so a user can quickly filter on them. For example, take a payload that records a transaction type and a user for each entry.
A diagnostician looking at the logs should be able to quickly filter by transaction types or users. At the same time, any logging system needs to be able to capture free-form text – essentially, just throw anything at it and let it sort it out.
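As a rough illustration, the sketch below indexes structured lines by chosen fields while still accepting free-form text. The sample lines and the field names `transaction_type` and `user` are invented for this example, as are the `parse_line` and `build_index` helpers.

```python
import json

# Hypothetical log lines: three JSON payloads and one free-form message.
LINES = [
    '{"transaction_type": "purchase", "user": "alice", "amount": 30}',
    '{"transaction_type": "refund", "user": "bob", "amount": 12}',
    "cache warmed in 3 tries",
    '{"transaction_type": "purchase", "user": "bob", "amount": 5}',
]


def parse_line(line):
    """Return a dict for JSON object lines; wrap anything else as free-form text."""
    try:
        data = json.loads(line)
    except json.JSONDecodeError:
        return {"message": line}
    return data if isinstance(data, dict) else {"message": line}


def build_index(lines, fields):
    """Parse every line and map each indexed field value to matching entry positions."""
    entries = [parse_line(line) for line in lines]
    index = {field: {} for field in fields}
    for position, entry in enumerate(entries):
        for field in fields:
            if field in entry:
                key = str(entry[field])  # stringify so any JSON value can be a key
                index[field].setdefault(key, []).append(position)
    return entries, index
```

A filter such as "all entries for user bob" then becomes a dictionary lookup, `index["user"]["bob"]`, while the free-form line is kept as a plain message and stays available for text search.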
Facet 6: Instrumented, Effortlessly
“Key function points are instrumented with performance metrics, out of the gate. No additional work is necessary to get a quick read on the performance of any transaction or top-level routine.”
This means all transactions, plus top-level routines, have their processing times captured. Reports can then be generated on averages, max, min, 95th percentile, etc.
Instrumentation can be directed by the developer via annotations or code, but it should also be included by default for all application entry points.
With this data properly and consistently collected, we can quickly identify bottlenecks in the system via visual graphs, as well as set threshold alarms for system performance. Without it, the process of identifying performance issues, especially at scale, can turn into a tedious game of half-measures and guesses, as developers iterate through hypotheses on where the problem lies.
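A minimal sketch of annotation-style instrumentation, assuming timings are kept in process. The `timed` decorator, the `TIMINGS` store and the `summarize` helper are hypothetical names for this example. The 95th percentile uses the nearest-rank method, which is one of several common definitions.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-process store: routine name -> list of durations in milliseconds.
TIMINGS = defaultdict(list)


def timed(func):
    """Record the wall-clock duration of every call, including calls that raise."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            TIMINGS[func.__qualname__].append(elapsed_ms)
    return wrapper


def summarize(samples):
    """Return count, min, max, average and nearest-rank 95th percentile."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples to summarize")
    count = len(ordered)
    rank = (95 * count + 99) // 100  # ceil(0.95 * count) using integer arithmetic
    return {
        "count": count,
        "min": ordered[0],
        "max": ordered[-1],
        "avg": sum(ordered) / count,
        "p95": ordered[rank - 1],
    }
```

Applying `@timed` to each entry point once, ideally inside the framework rather than by hand, gives every transaction a duration series that reports and threshold alarms can be built on.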
Facet 7: No Wolf, No Cry
“A well-instrumented system is tuned so that developers are alerted immediately when things really go wrong, and rarely when they do not”
Two great traps for systems:
- Never cries wolf – major production issues produce no alarms or alerts
- Always cries wolf – irrelevant issues, or non-issues, constantly raise alarms
In practice, these look the same to the engineers maintaining them. When systems constantly send out alarms, engineers become inured to them, either “silencing” alarms or routing alerts out of their main inbox. The result is that when things really go wrong, the team is not the first to know and is more likely to respond based on “ancillary alerts” (emails from actual users, out-of-whack diagnostics reports, data missing or looking odd in user reports, etc.).
At scale, this is an ongoing battle, but the key practice is that every alert raised is one of two things:
- A critical production issue which needs to be addressed immediately
- A major “false-alarm” issue that needs to be addressed expeditiously
There is no such thing as a “business-as-usual” alarm.
This is our collection of best practices. And as a Manifesto, it is also meant to challenge and be challenged. To that end, feedback is encouraged:
Please comment below or email me at firstname.lastname@example.org. Reach me via chat on Gitter.