The Evolution of Event Tracing

One of the exciting things about being a developer is how rapidly new technologies and tools are released, and the corresponding gains they bring to productivity and performance. While keeping up with all this new technology can be exhausting, it can also be exhilarating when a new API or other technology provides clear value.

In this post, I’d like to illustrate how these technologies have helped transform one of the key components of our collection agent, the SQL Server trace collector. By adopting new technologies like async/await and, more recently, channels, we were able to greatly simplify the object model for trace while reducing the cost of tracing.

SentryOne, being a monitoring platform, has needed to collect trace data from remote systems for some time. While this sounds like a straightforward affair, there are many ways to do the same thing, each with different trade-offs. In general, we always wish to observe the following core tenets:

  1. Low impact/overhead on the target machine
  2. Excellent throughput
  3. Low CPU usage on the machine performing the collection
  4. Low memory overhead on the machine performing the collection
  5. Judicious use of threads to prevent scalability problems from context-switching/stack memory overhead
  6. Simple APIs that other developers can later understand

Trace Basics

Trace consists of the following phases:

[Figure: Phases of Trace]

In addition, a SQL Server trace has the following characteristics:

  1. It spends most of the time waiting for IO
  2. It is not CPU bound, but rather IO bound
  3. It can impact the target if data is not read from the buffer fast enough
  4. Users expect/want near real-time visibility of trace data
  5. It has the potential to produce a lot of data
  6. The data produced must be stored somewhere, and that storage is IO bound as well

For the original version of SQL Server trace collection, we decided to create a component that could do everything, while also allowing for some level of modularity if trace definitions or architectures changed. This trace server exposed two methods, Start() and Stop(), as well as a Queue<TraceRow> for the buffered rows. In addition, there were some knobs for tuning the max queue size, as well as for controlling behavior when that size was reached. Calling Start() created a background thread that would read from the target and put the data into the Queue<TraceRow>. This API surface, sketched just after the list below, offered the following benefits:

  1. It was user-friendly and easy to understand
  2. It allowed fast reads to ensure that we never stalled the target and that we’d read data from the target’s buffer as quickly as possible
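
To make the original shape concrete, here is a rough sketch of that surface. The TraceRow stub and the knob names are illustrative stand-ins, not the actual implementation:

using System.Collections.Generic;

public sealed class TraceRow { /* event columns omitted */ }

public class TraceServer
{
    public int MaxQueueSize { get; set; } = 100_000;   // tuning knob: buffer cap (illustrative name)
    public bool BlockWhenFull { get; set; } = true;    // tuning knob: behavior at the cap (illustrative name)

    public Queue<TraceRow> Rows { get; } = new Queue<TraceRow>();  // buffered rows, drained by a separate reader

    public void Start() { /* spin up a background thread that reads from the target into Rows */ }
    public void Stop()  { /* signal the background thread and stop the trace on the target */ }
}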

However, it suffered from the following problems:

  1. Each trace server had a separate queue. This was fine when using a trace server for QuickTrace, or when monitoring a handful of targets, but suffered from scalability and complexity issues when the number of targets increased to 50.
  2. It required a separate thread for each trace server, as Start() essentially created a background thread. This created additional scalability challenges as the number of trace engines increased.
  3. It required at least one reader thread that in small environments could spend most of the time doing nothing.
  4. The design didn’t support just-in-time writes, i.e., writing rows as soon as they arrived, even when we wanted that behavior.
  5. It violated the single-responsibility principle from SOLID, insofar as it had nearly all the responsibilities: thread management, buffer management, and trace reading.

Initial Architectural Improvements

One of the first improvements we made to the architecture was to introduce the concept of a shared queue. We needed a place to put it, and a full rewrite wasn’t appropriate for that feature, so we simply added a shared buffer to the main TraceServer component. This solved the issue for the reader: even with 100 targets, all the data could go into one buffer and be read by a single reader with buffering.

For this buffer, we chose BlockingCollection<T>, which has the GetConsumingEnumerable() method, allowing us to do a blocking foreach over the shared buffer. This allowed us to do just-in-time writes as data came in. However, a consequence was that if there was a massive amount of trace data, having a single processor for the shared queue could lead to write stalls. In practice, this hasn’t really been an issue, but it was a weakness. Another side effect of adding the BlockingCollection<T> was that it made the trace server even more complicated and pushed it even further away from SOLID principles. Those problems were annoying but manageable.
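
As a simple illustration (not the actual production code), the consumer side of that shared buffer might look like this; WriteRow is a hypothetical persistence call:

using System.Collections.Concurrent;

void ConsumeSharedBuffer(BlockingCollection<TraceRow> sharedBuffer)
{
    // GetConsumingEnumerable() blocks between items and completes once
    // CompleteAdding() has been called on the collection.
    foreach (var traceRow in sharedBuffer.GetConsumingEnumerable())
    {
        WriteRow(traceRow);  // just-in-time write as each row arrives
    }
}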

The next thing we tackled was the thread management issue, namely that we needed N+1 threads for trace, where N is the number of watched targets, plus 1 for the processor thread. For the TraceServer threads reading from the target, the work was all IO bound and spent most of its time waiting on asynchronous IO, so it made sense to migrate it to async/await. This was possible for the legacy profiler API, because it used a DataReader to get the data, but it introduced complications for our implementation on top of XEvents. Namely, the XEvents streaming provider does not expose awaitable methods, so it cannot be used in an async block without running the risk of starving the thread pool. As XEvents gained popularity, we had to revert to non-async code. The TraceServer, as we knew it, was at a dead end from the current API standpoint.
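
For the legacy profiler path, the awaitable ADO.NET surface made the read loop straightforward. The following is a sketch under those assumptions; MapTraceRow is hypothetical, and error handling is omitted:

using System.Collections.Concurrent;
using System.Data.SqlClient;
using System.Threading;
using System.Threading.Tasks;

async Task ReadTraceAsync(SqlConnection connection, string traceQuery,
                          BlockingCollection<TraceRow> buffer, CancellationToken ct)
{
    using var command = new SqlCommand(traceQuery, connection);
    using var reader = await command.ExecuteReaderAsync(ct);

    // The thread returns to the pool while each read awaits IO.
    while (await reader.ReadAsync(ct))
    {
        buffer.Add(MapTraceRow(reader), ct);
    }
}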

Building a Better TraceServer

The good news is that technologies and tools change, so we set out to build a better TraceServer altogether. Recall the phases of trace:

[Figure: Phases of Trace (v2)]

Note what they all have in common: they are all asynchronous. With that in mind, we modeled them as a linear chain of awaited steps within a single lifetime. We then just needed to focus on the following key items:

  1. What is the connection string for the target?
  2. What events do we wish to collect?
  3. Run the trace, passing in a buffer.

For example:

var traceServer = new TraceServer(engineType, buffer); 
await traceServer.RunAsync(traceConfiguration, cancellationToken); 

where RunAsync essentially encapsulates: 

await InitializeAsync(cancellationToken); 
await StartAsync(cancellationToken); 
await ReadDataInLoop(buffer, cancellationToken); 
await StopAsync(cancellationToken); 
await RemoveTraceAsync(cancellationToken); 
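
One way to picture RunAsync is a try/finally around the read loop, so that teardown still runs when the token is cancelled. This is a sketch rather than the actual implementation; whether teardown reuses the caller's token is an implementation detail:

public async Task RunAsync(TraceConfiguration traceConfiguration, CancellationToken cancellationToken)
{
    // traceConfiguration drives InitializeAsync/StartAsync in the real code.
    await InitializeAsync(cancellationToken);
    await StartAsync(cancellationToken);
    try
    {
        await ReadDataInLoop(buffer, cancellationToken);
    }
    finally
    {
        // Best-effort teardown so the trace is stopped and removed from the
        // target even if reading was cancelled or faulted.
        await StopAsync(CancellationToken.None);
        await RemoveTraceAsync(CancellationToken.None);
    }
}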

We shift the cancellation responsibility to the CancellationToken, allowing us to cancel from any number of sources. The question then becomes, “What is the buffer?” Initially, we experimented with a delegate in the form of an interface contract called ITraceRowProcessor. We could then implement it multiple ways, including a BufferedTraceRowProcessor, whose implementation of ITraceRowProcessor.Process(TraceRow traceRow) just added the trace row to a ConcurrentQueue<T> or BlockingCollection<T>. Something still felt off, though.
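
The contract from that experiment looked roughly like the following sketch; only the names already mentioned above come from the real design:

using System.Collections.Concurrent;

public interface ITraceRowProcessor
{
    void Process(TraceRow traceRow);
}

// Buffers rows rather than handling them inline; the queue is drained elsewhere.
public class BufferedTraceRowProcessor : ITraceRowProcessor
{
    private readonly ConcurrentQueue<TraceRow> _rows = new ConcurrentQueue<TraceRow>();

    public void Process(TraceRow traceRow) => _rows.Enqueue(traceRow);
}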

Enter Channel<T> 

It turned out, however, that there was an even better way: Channel<T>. Channels have been covered many times by bloggers more capable than I, so I won’t rehash the fundamentals here.

By passing in a ChannelWriter<TraceRow> as the buffer, a few things happen:

  1. We can asynchronously write to channels, which means that as soon as we finish awaiting data from the trace source and get a trace row, we can await the write to the channel. This is exactly what we want: we stay on the CPU for as little time as possible, since we are dealing with task scheduler threads.
  2. Channels have built-in buffering behaviors, like a max size and what to do when that max size is reached, yet they expose a common abstraction for both bounded and unbounded channels, allowing for flexibility.
  3. They are heavily optimized for speed, faster than anything we had used before.
  4. They support multiple writers, so the same channel could serve more than one TraceServer engine for the shared buffer scenario. 
  5. They support multiple readers, so we don’t need to worry about IO stalls or processing backup. We could just add another reader from the same channel. 
  6. They support both wait- and peek-style operations, allowing you to know whether there are any more items and, if so, add them to a batch. This ensures we can do batch writing and just-in-time writing from the same consumer (see the sketch after this list).
  7. They support async/await from the reader side, ensuring that we don’t need a dedicated thread just waiting for data that may not arrive. 
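
To illustrate points 6 and 7, a consumer can await WaitToReadAsync and then drain whatever is already buffered into a batch. This is a sketch rather than the shipping code; WriteBatchAsync and the batch size are hypothetical:

using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

async Task ConsumeAsync(ChannelReader<TraceRow> reader, CancellationToken ct)
{
    var batch = new List<TraceRow>(capacity: 512);

    // Asynchronously wait until at least one row is available (or the channel completes).
    while (await reader.WaitToReadAsync(ct))
    {
        // Drain whatever has already arrived, up to a cap, without further awaits.
        while (batch.Count < 512 && reader.TryRead(out var row))
        {
            batch.Add(row);
        }

        await WriteBatchAsync(batch, ct);  // hypothetical batched persistence call
        batch.Clear();
    }
}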

Putting It All Together 

Once we migrated to Channel<T>, everything came together. Around the same time, we discovered a way to do asynchronous XEvents reading, which meant that we could use the async TraceServer with both the legacy SQL Server Trace API and the newer XEvents API. The trace engine is fast, scalable, uses few threads, is fully async all the way down, supports cancellation, and has extensible buffer behavior support. Expect the new engine in a later version of our SentryOne monitoring software. It’s also worth noting that these techniques are not specific to trace; they illustrate rapid data processing in general, and how the tools have evolved to support even better use cases with fewer lines of code.
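
To tie the pieces together, here is a rough sketch of the wiring, not the product code: several TraceServer instances share one bounded channel, and a single consumer (like the batching one above) drains it. The traceConfigurations collection, its EngineType property, and ConsumeAsync are assumptions for illustration:

using System.Linq;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// One bounded channel shared by every trace engine; Wait gives us back-pressure
// instead of unbounded memory growth when writers outpace the consumer.
var channel = Channel.CreateBounded<TraceRow>(new BoundedChannelOptions(10_000)
{
    FullMode = BoundedChannelFullMode.Wait
});

using var cts = new CancellationTokenSource();

// Multiple writers: one TraceServer per watched target, all writing to channel.Writer.
var producers = traceConfigurations
    .Select(config => new TraceServer(config.EngineType, channel.Writer).RunAsync(config, cts.Token))
    .ToList();

// Single reader draining the shared channel.
var consumer = ConsumeAsync(channel.Reader, cts.Token);

await Task.WhenAll(producers.Append(consumer));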
