Tracing Ceph with LTTng
This post demonstrates how to use LTTng-UST to collect execution traces from Ceph. As a driving example we’ll use the traces to identify all instances of lock ownership, and how long each lock is held. This type of analysis could be useful for things like identifying sources of latency. While specific to Ceph, the tracing techniques shown can be applied to any application as a powerful tool for performance analysis and debugging.
Parallel and real-time systems are notoriously difficult to debug, due in part to non-determinism and the inherent complexity of their execution behavior. Many tools exist for debugging and performance analysis, but no single tool is best for every situation. For instance, profiling tools such as OProfile can collect detailed performance data, but typically report results at an aggregate level that fails to capture low-level details, like thread interleavings, that are crucial to diagnosing concurrency bugs. Debugging tools such as GDB sit at the opposite end of the spectrum, providing fine-grained control and monitoring of execution, but they can significantly alter program behavior and mask concurrency issues. Tracing occupies a middle ground, collecting detailed execution history while having minimal impact on program behavior.
Next I’ll discuss existing tools used for tracing and profiling, as well as the tooling currently used to examine the behavior of Ceph. Then I’ll use a simple example built around Ceph’s concurrency control primitives to demonstrate the use of LTTng, building up to trace analysis that identifies costly critical sections.
Tracing and Profiling Tools #
A large number of tracing and profiling tools exist. This list of generic tools is far from complete, but contains some of the better-known projects.
In addition to generic tools, domain-specific tracing and profiling tools exist. For instance, blktrace is used to trace the block I/O layer in the Linux kernel, and a number of options for tracing and profiling MPI applications exist.
Omitted from the list above is the even larger number of home-grown and ad hoc solutions. Tracing is inherently a simple idea: collect metadata about an event as a side effect of execution. The problem is that designing a low-overhead tracing solution that works across platforms and has good tooling is quite challenging. I’ll refer the reader to the following paper, which has a lot of good information about the challenges of tracing for performance analysis, discussed in the context of the Tau project.
Sameer S. Shende and Allen D. Malony, The Tau Parallel Performance System.
Tracing and Profiling in Ceph #
There are several existing approaches to tracing and profiling Ceph. The first is to use the debug logs that Ceph already produces. These are heavyweight text logs whose syntax is largely ad hoc; they are great for debugging, but can have a large impact on performance. Aggregate performance information is also available via the performance counters sub-system, which lets users dump snapshots that include information such as average queue lengths and some types of latency. Finally, efforts to integrate SystemTap have been discussed on the mailing list.
Next I’ll show how to use LTTng to trace Ceph execution behavior. SystemTap is an excellent tool that I think must be supported, especially for debugging systems that cannot tolerate downtime. But the last time I investigated SystemTap, its overhead made it less attractive for long-term tracing, a task LTTng is well suited for. The trade-off is that LTTng requires static tracepoints to be inserted into the source code.
Finding High-latency Critical Sections in Ceph #
The primary locking primitive in Ceph is the type Mutex. Since limiting concurrency can impact latency, finding instances in which a mutex is held for a long period of time is helpful in understanding the performance of Ceph internals. Essentially we want to perform the following analysis for every mutex used in Ceph:
Mutex lock;
time_t start, duration;
lock.Lock();
start = timestamp();
// do stuff
duration = timestamp() - start;
lock.Unlock();
Rather than changing every call site, we can instead record every instance of
the Mutex::Lock
and Mutex::Unlock
events, and calculate the duration
between these events by post-processing the event stream. In order to drive
home the differences between profiling and tracing, I am going to start by
demonstrating how to use tracing to calculate mutex wait times, which happens
to be a task that profiling can handle. Then, reusing the functionality and
tools that we have built up, I will show how to find critical sections that
introduce large latencies, a much more difficult task for profiling that is
made simple through the use of tracing and analysis tools.
The userspace port of the LTTng kernel trace toolkit is called LTTng-UST. Its use is fairly straightforward, but integrating it can depend heavily on the application being instrumented, so a completely generic tutorial would be difficult to produce. For detailed information about the toolkit I’ll refer the reader to the LTTng website, which has good documentation.
Calculating Mutex Wait Time #
The following code snippet shows the locking routine for a mutex in Ceph (a
fancy wrapper around pthread_mutex_lock
that adds some additional
features). In order to examine the wait times introduced by the use
of a mutex we need to instrument this code to track how much time a thread
spends in this function. Note that the point of this exercise is to
demonstrate tracing, and yes, as stated, many profiling tools such as OProfile
are fully capable of handling this task. We’ll get to the harder stuff shortly.
void Mutex::Lock(bool no_lockdep) {
  if (lockdep && g_lockdep && !no_lockdep)
    _will_lock();
  if (TryLock())
    return;
  int r = pthread_mutex_lock(&_m);
  assert(r == 0);
  if (lockdep && g_lockdep) _locked();
  _post_lock();
}
As a short aside, the above routine is simplified. In fact, the full routine
does provide profiling support via Ceph performance counters for tracking
mutex wait time. Here is the unabridged version. Notice the addition of clock sampling and logging that hugs the call to pthread_mutex_lock:
void Mutex::Lock(bool no_lockdep) {
  if (lockdep && g_lockdep && !no_lockdep)
    _will_lock();
  if (TryLock()) {
    return;
  }
  utime_t start;
  if (logger && cct && cct->_conf->mutex_perf_counter)
    start = ceph_clock_now(cct);
  int r = pthread_mutex_lock(&_m);
  if (logger && cct && cct->_conf->mutex_perf_counter)
    logger->tinc(l_mutex_wait,
                 ceph_clock_now(cct) - start);
  assert(r == 0);
  if (lockdep && g_lockdep) _locked();
  _post_lock();
}
The problem is that performance counters (in their current form) toss away a lot of useful context. For example, the counters used above can tell us the total wait time over a fixed period of time, across all mutex instances, but they cannot tell us which threads were affected most, or which specific mutex instances were to blame for large wait times: very useful information for debugging and performance analysis.
Defining Tracepoints #
We can calculate the time a thread spends in Mutex::Lock
by logging events
upon entering and exiting the function. It is true that you could simply
calculate and log the duration with a single event, but that’s no fun when it
comes to demonstrating trace post-processing. We start by defining the two
events, lock_enter and lock_exit, and creating fields that will hold both
the address and the name of the lock. Note that timing
information and context such as thread identifiers are handled by the
LTTng-UST framework automatically. The important thing to notice in the
tracepoint definition is what information we are collecting. Specifically, we
are going to record the address and name of each Mutex
instance. Without
this information any trace post-processing tool would not be able to match a
lock_enter
event with the corresponding lock_exit
event.
TRACEPOINT_EVENT(mutex, lock_enter,
    TP_ARGS(
        const void *, addr,
        const char *, name),
    TP_FIELDS(
        ctf_integer_hex(unsigned long, addr, addr)
        ctf_string(name, name)
    )
)

TRACEPOINT_EVENT(mutex, lock_exit,
    TP_ARGS(
        const void *, addr,
        const char *, name),
    TP_FIELDS(
        ctf_integer_hex(unsigned long, addr, addr)
        ctf_string(name, name)
    )
)
After generating the tracepoint hooks using the lttng-gen-tp
tool we can add
instrumentation points to Mutex::Lock
. Notice we modify the routine slightly
to be friendly to different return paths.
 void Mutex::Lock(bool no_lockdep) {
+  utime_t start;
+  int r;
+
+  tracepoint(mutex, lock_enter, this, name);
+
   if (lockdep && g_lockdep && !no_lockdep) _will_lock();
   if (TryLock()) {
-    return;
+    goto out;
   }
-  utime_t start;
   if (logger && cct && cct->_conf->mutex_perf_counter)
     start = ceph_clock_now(cct);
-  int r = pthread_mutex_lock(&_m);
+  r = pthread_mutex_lock(&_m);
   if (logger && cct && cct->_conf->mutex_perf_counter)
     logger->tinc(l_mutex_wait,
                  ceph_clock_now(cct) - start);
   assert(r == 0);
   if (lockdep && g_lockdep) _locked();
   _post_lock();
+
+out:
+  tracepoint(mutex, lock_exit, this, name);
 }
Ok cool, now that that’s done, go recompile Ceph. Yes, this is a downside to LTTng-UST. But it has a lot of other benefits, like being very low-overhead and clearly capturing the semantics at the call site.
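If you haven’t integrated LTTng-UST into a project before, here is a rough sketch of the moving parts. The file names and build commands below are purely illustrative, not Ceph’s actual build wiring: the lttng-gen-tp tool expands the tracepoint definitions into a header and a source file, the source file is compiled and linked into the binary along with the LTTng-UST library, and any code that fires the events simply includes the generated header.
// Illustrative wiring only; file names here are assumptions, not Ceph's build rules.
//
//   $ lttng-gen-tp mutex_tp.tp      # generates mutex_tp.h and mutex_tp.c
//
// The generated mutex_tp.c contains the probe definitions; compile it and
// link it into the daemon together with -llttng-ust -ldl. Code that fires
// the events only needs the generated header:

#include "mutex_tp.h"   // assumed name of the generated header

void fire_example(const void *addr, const char *name) {
  // provider "mutex", events "lock_enter" / "lock_exit", as defined above
  tracepoint(mutex, lock_enter, addr, name);
  tracepoint(mutex, lock_exit, addr, name);
}
However the build is actually wired up, those are the pieces: a generated provider object plus the LTTng-UST library at link time.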
A Note About Starting Ceph #
Normally an LTTng-UST instrumented application can be started as usual, but there is one minor issue with processes that daemonize themselves. When starting a Ceph daemon (e.g. ceph-osd, ceph-mon, ceph-mds) you should LD_PRELOAD the LTTng-UST forking utility library. On my Ubuntu box I do this in conjunction with vstart.sh as follows:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/liblttng-ust-fork.so ./vstart
That’s it. Now we should be able to access any tracepoints that exist via the standard LTTng-UST tools.
Listing Available Tracepoints #
The set of tracepoints from any LTTng-UST enabled process can be listed using
the lttng list
command. For instance, I started two OSDs and one monitor and
then listed the available tracepoints:
nwatkins@kyoto:~$ lttng list -u
UST events:
-------------
PID: 13541 - Name: ./ceph-osd
ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_exit (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_enter (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
PID: 13271 - Name: ./ceph-osd
ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_exit (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_enter (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
PID: 13054 - Name: ./ceph-mon
ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_exit (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_enter (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
Awesome! Those are the tracepoints we added. Notice that since we added our tracepoints to the Mutex type, which is used throughout Ceph, they show up in both the OSD and monitor processes.
Capturing Traces #
The first thing that needs to be done is to start a tracing session.
nwatkins@kyoto:~$ lttng create
Session auto-20140530-165745 created.
Traces will be written in /home/nwatkins/lttng-traces/auto-20140530-165745
Next we’ll enable all userspace events within the current session. The same tool can also be used to interact with LTTng tracepoints in the kernel, as well as to selectively enable events based on different filtering patterns.
nwatkins@kyoto:~$ lttng enable-event -u -a
All UST events are enabled in channel channel0
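As an aside, we don’t have to enable every userspace event. The enable-event command accepts wildcards, so (given the mutex provider name used above) we could have enabled just our new events with something along these lines:
nwatkins@kyoto:~$ lttng enable-event -u 'mutex:*'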
Start logging in the current session.
nwatkins@kyoto:~$ lttng start
Tracing started for session auto-20140530-165745
At this point traces are being actively collected. We can generate some load on the cluster to make sure we hit a bunch of mutex instances.
nwatkins@kyoto:~$ rados -p data bench 10 write
Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
Object prefix: benchmark_data_kyoto_21530
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 17 1 3.99856 4 0.950886 0.950886
2 16 21 5 9.99701 16 1.51263 1.39381
3 16 25 9 11.9967 16 2.57365 1.79965
And finally we can suspend trace collection.
nwatkins@kyoto:~$ lttng stop
Waiting for data availability.
Tracing stopped for session auto-20140530-165745
Viewing the trace that we just collected is also very simple using the lttng view
command.
nwatkins@kyoto:~$ lttng view | head
[16:58:23.038521097] (+?.?????????) kyoto mutex:lock_enter: { cpu_id = 0 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[16:58:23.038530591] (+0.000009494) kyoto mutex:lock_exit: { cpu_id = 0 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[16:58:23.038531411] (+0.000000820) kyoto mutex:lock_enter: { cpu_id = 0 }, { addr = 0x205ABE8, name = "OSD::heartbeat_update_lock" }
[16:58:23.038532042] (+0.000000631) kyoto mutex:lock_exit: { cpu_id = 0 }, { addr = 0x205ABE8, name = "OSD::heartbeat_update_lock" }
[16:58:23.038532967] (+0.000000925) kyoto mutex:lock_enter: { cpu_id = 0 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[16:58:23.038533479] (+0.000000512) kyoto mutex:lock_exit: { cpu_id = 0 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[16:58:23.038535265] (+0.000001786) kyoto mutex:lock_enter: { cpu_id = 0 }, { addr = 0x1FC2408, name = "SimpleMessenger::DispatchQeueu::lock" }
[16:58:23.038535846] (+0.000000581) kyoto mutex:lock_exit: { cpu_id = 0 }, { addr = 0x1FC2408, name = "SimpleMessenger::DispatchQeueu::lock" }
[16:58:23.038538253] (+0.000002407) kyoto mutex:lock_enter: { cpu_id = 0 }, { addr = 0x205A610, name = "OSD::recovery_tp::lock" }
[16:58:23.038538829] (+0.000000576) kyoto mutex:lock_exit: { cpu_id = 0 }, { addr = 0x205A610, name = "OSD::recovery_tp::lock" }
The snapshot of the trace above shows timestamped events corresponding to the tracepoints we inserted into Mutex::Lock. Notice that the address and name of each mutex are also available.
Trace Sizes #
Using the lttng view
utility we can convert the trace into a readable
format. However, traces are collected in an efficient binary format. For
instance, the size of the converted trace for a short run is roughly 46
megabytes:
nwatkins@kyoto:~/ceph/ceph$ lttng view > full.trace
nwatkins@kyoto:~/ceph/ceph$ ls -lh full.trace
-rw-rw-r-- 1 nwatkins nwatkins 46M May 31 12:06 full.trace
While the directory containing the raw traces is significantly smaller:
nwatkins@kyoto:~/ceph/ceph$ du -sh ~/lttng-traces/auto-20140531-120016/ust/uid/1000/64-bit/
12M /home/nwatkins/lttng-traces/auto-20140531-120016/ust/uid/1000/64-bit/
This works out to about 44 bytes per event in the raw format vs 161 bytes per event in the converted textual format. Quite a difference.
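For the curious, the per-event figures are simply the trace sizes divided by the number of events in the trace. One way to get that count is with the same babeltrace Python bindings used later in this post; here is a quick sketch:
import sys
from babeltrace import *

# count the events in a raw CTF trace directory given on the command line
traces = TraceCollection()
ret = traces.add_trace(sys.argv[1], "ctf")
print sum(1 for _ in traces.events)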
Adding Context To Traces #
Additional context can be added to traces without changing the tracepoint,
recompiling, or restarting the application. For instance, here is
how we can log the pthread_id
of the execution context that
triggered an event:
nwatkins@kyoto:~/ceph/ceph$ lttng add-context -u -t pthread_id
UST context pthread_id added to all channels
After adding the context and collecting a new trace the additional context information will be available in the trace:
nwatkins@kyoto:~/ceph/ceph$ lttng view | head
[12:01:16.770519287] (+?.?????????) kyoto mutex:lock_enter: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[12:01:16.770532490] (+0.000013203) kyoto mutex:lock_exit: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[12:01:16.770533490] (+0.000001000) kyoto mutex:lock_enter: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205ABE8, name = "OSD::heartbeat_update_lock" }
[12:01:16.770534174] (+0.000000684) kyoto mutex:lock_exit: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205ABE8, name = "OSD::heartbeat_update_lock" }
[12:01:16.770535188] (+0.000001014) kyoto mutex:lock_enter: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[12:01:16.770535800] (+0.000000612) kyoto mutex:lock_exit: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[12:01:16.770537653] (+0.000001853) kyoto mutex:lock_enter: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x1FC2408, name = "SimpleMessenger::DispatchQeueu::lock" }
[12:01:16.770538299] (+0.000000646) kyoto mutex:lock_exit: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x1FC2408, name = "SimpleMessenger::DispatchQeueu::lock" }
[12:01:16.770540621] (+0.000002322) kyoto mutex:lock_enter: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205A610, name = "OSD::recovery_tp::lock" }
[12:01:16.770541233] (+0.000000612) kyoto mutex:lock_exit: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205A610, name = "OSD::recovery_tp::lock" }
Notice the new { pthread_id = 140109313545984 }
field.
Trace Post-processing #
Now that we have a trace with the data we need, namely the lock_enter
and
lock_exit
events, we can calculate how long any thread waited on a mutex.
The babeltrace
tool that ships with lttng-ust
is useful for interacting
with traces, and also has Python bindings that we will make use of. Note
that there are many tools for trace analysis, including several GUIs, but a
full discussion of these tools is beyond the scope of this post. The
following script performs all the post-processing that we need:
import sys
from babeltrace import *

traces = TraceCollection()
ret = traces.add_trace(sys.argv[1], "ctf")

lockers = {}
for event in traces.events:
    # metadata about mutex and thread context
    addr, name = event["addr"], event["name"]
    thread_id = event["pthread_id"]

    # identifies a thread interacting with a mutex
    locker = (thread_id, addr)

    # record a thread entering Mutex::Lock
    if event.name == "mutex:lock_enter":
        lockers[locker] = event.timestamp

    # calculate time spent in Mutex::Lock
    elif event.name == "mutex:lock_exit":
        try:
            duration = event.timestamp - lockers[locker]
            del lockers[locker]
            print duration, name
        except KeyError:
            continue
This simple script monitors for the mutex:lock_enter event, and prints the elapsed time when it encounters the corresponding mutex:lock_exit event. It tracks ownership by identifying a mutex by the tuple (thread_id, addr). Notice the try block used when processing the mutex:lock_exit event. This is necessary to catch the edge case in which tracing began after a thread had executed past mutex:lock_enter but prior to reaching the mutex:lock_exit event.
This script can be run as follows to find instances of the longest mutex wait times in Ceph:
nwatkins@kyoto:~$ python mutex.py /path/to/trace | sort -nr | head
3727778496 Monitor::lock
3505961112 Monitor::lock
476723751 Monitor::lock
372843475 FileStore::OpSequencer::apply_lock
197103478 PG::_lock
196684510 fdcache_lock
51744346 FileStore::OpSequencer::apply_lock
50814795 FileStore::OpSequencer::apply_lock
35316968 fdcache_lock
25695414 PG::_lock
21339899 FileStore::OpSequencer::apply_lock
Fantastic. Now we have the wait cost of each instance of locking a mutex. Further analysis can be done at this point, such as plotting a histogram, or picking a lock and trying to reduce costs. In the next section we are going to reuse the mutex:lock_exit event to find the longest critical sections in Ceph. For this we need to know when a mutex is unlocked.
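Before moving on, here is one small sketch of that kind of follow-up analysis. It is my own illustration rather than part of the tooling above: it reads the “duration name” lines printed by the script from a file and sums the wait time per lock name, so the worst offenders stand out.
import sys
from collections import defaultdict

# sum wait time (nanoseconds) and event count per lock name
totals = defaultdict(int)
counts = defaultdict(int)

for line in open(sys.argv[1]):
    parts = line.split(None, 1)
    if len(parts) != 2:
        continue
    duration, name = int(parts[0]), parts[1].strip()
    totals[name] += duration
    counts[name] += 1

# report the ten locks with the largest total wait time
for name in sorted(totals, key=totals.get, reverse=True)[:10]:
    print totals[name], counts[name], name
For example (script and file names are just placeholders): python mutex.py /path/to/trace > waits.txt, then python waits_by_lock.py waits.txt.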
Instrumenting Mutex::Unlock #
Now that we know how to collect a trace with events corresponding to
Mutex::Lock
, we can instrument Mutex::Unlock
and perform post-processing
to calculate the duration of critical sections. The first step is to add the
new tracepoint. Since the previous section covered this process in detail,
I’ll only show the important bits here. Here is the new tracepoint we will be
adding:
TRACEPOINT_EVENT(mutex, unlock,
    TP_ARGS(
        const void *, addr,
        const char *, name),
    TP_FIELDS(
        ctf_integer_hex(unsigned long, addr, addr)
        ctf_string(name, name)
    )
)
Next add the tracepoint to Mutex::Unlock:
 void Mutex::Unlock() {
   _pre_unlock();
   if (lockdep && g_lockdep) _will_unlock();
   int r = pthread_mutex_unlock(&_m);
   assert(r == 0);
+  tracepoint(mutex, unlock, this, name);
 }
Now go ahead and spin up Ceph. We can see our old tracepoints along with the new one:
nwatkins@kyoto:~$ lttng list -u
UST events:
-------------
PID: 27747 - Name: ./ceph-mon
ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:unlock (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_exit (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_enter (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
PID: 28234 - Name: ./ceph-osd
ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:unlock (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_exit (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_enter (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
PID: 27964 - Name: ./ceph-osd
ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:unlock (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_exit (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_enter (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
Once we have collected a new set of traces, we are all set for post-processing. Here is the updated Python script for our new analysis task:
import sys
from babeltrace import *

traces = TraceCollection()
ret = traces.add_trace(sys.argv[1], "ctf")

lockers = {}
for event in traces.events:
    # metadata about mutex and thread context
    addr, name = event["addr"], event["name"]
    thread_id = event["pthread_id"]

    # identifies a thread interacting with a mutex
    locker = (thread_id, addr)

    # track when we finally acquire the lock
    if event.name == "mutex:lock_exit":
        lockers[locker] = event.timestamp

    # calc time from Mutex::Lock to Mutex::Unlock
    elif event.name == "mutex:unlock":
        try:
            duration = event.timestamp - lockers[locker]
            del lockers[locker]
            print duration, name
        except KeyError:
            continue
The script is very similar to the previous one used to calculate mutex wait times. However, here we are calculating the duration between exiting Mutex::Lock and exiting Mutex::Unlock. Performing the analysis and sorting on duration, we get the following results:
nwatkins@kyoto:~$ python mutex.py /path/to/trace | sort -nr | head
36261010629 SimpleMessenger::lock
36260955023 SimpleMessenger::DispatchQueue::local_delivery_lock
36260066867 Finisher::finisher_lock
36259848130 MonClient::monc_lock
36256039073 librados::RadosClient::lock
36005823838 ObjBencher::lock
10000237408 SimpleMessenger::Pipe::pipe_lock
9999982398 SimpleMessenger::Pipe::pipe_lock
9413737346 WBThrottle::lock
8157151233 SimpleMessenger::Pipe::pipe_lock
7785652523 SimpleMessenger::DispatchQeueu::lock
7341281593 WBThrottle::lock
6756233583 SimpleMessenger::Pipe::pipe_lock
5881035551 Finisher::finisher_lock
5880718046 Finisher::finisher_lock
And that’s all there is to it. A remarkable range of questions about the behavior of Ceph (or any other application) can be answered with this style of trace collection and analysis.
What’s Next #
There are a lot of ways to use LTTng. For instance, one can perform remote tracing over the network, combine kernel and user-space traces for a holistic system view, and also instrument clients for end-to-end tracing analysis.