Tracing Ceph with LTTng
This post demonstrates how to use LTTng-UST to collect execution traces from Ceph. As a driving example we’ll use the traces to identify all instances of lock ownership, and how long each lock is held. This type of analysis could be useful for things like identifying sources of latency. While specific to Ceph, the tracing techniques shown can be applied to any application as a powerful tool for performance analysis and debugging.
Parallel and real-time systems are notoriously difficult to debug, due in part to non-determinism and the inherent complexity of their execution behavior. Many tools exist for debugging and performance analysis, but no single tool is best for every situation. For instance, profiling tools such as OProfile can collect detailed performance data, but typically report results at an aggregate level that fails to capture low-level details, like thread interleavings, that are crucial to diagnosing concurrency bugs. Debugging tools such as GDB sit at the opposite end of the spectrum, providing fine-grained control and monitoring of execution, but they can significantly alter program behavior and mask concurrency issues. Tracing occupies a middle ground, collecting detailed execution history while having minimal impact on program behavior.
Next I’ll discuss existing tools used for tracing and profiling, as well as the tooling currently used to examine the behavior of Ceph. Then I’ll use a simple example built around Ceph’s concurrency control primitives to demonstrate the use of LTTng, building up to trace analysis that identifies costly critical sections.
Tracing and Profiling Tools #
A large number of tracing and profiling tools exist. This list of generic tools is far from complete, but contains some of the better-known projects.
In addition to generic tools, domain-specific tracing and profiling tools exist. For instance, blktrace is used to trace the block I/O layer in the Linux kernel, and a number of options for tracing and profiling MPI applications exist.
Omitted from the list above is the even larger number of home-grown and ad hoc solutions. Tracing is inherently a simple idea: collect metadata about an event as a side effect of execution. The problem is that designing a low-overhead tracing solution that works across platforms and has good tooling is quite challenging. I’ll refer the reader to the following paper, which has a lot of good information about the challenges of tracing for performance analysis, discussed in the context of the Tau project.
Sameer S. Shende and Allen D. Malony, The Tau Parallel Performance System.
Tracing and Profiling in Ceph #
There are several existing approaches to tracing and profiling Ceph. The first is to use the debug logs that Ceph already produces. These are heavyweight text logs whose syntax is largely ad hoc; they are great for debugging, but can have a large impact on performance. Aggregate performance information is also available via the performance counters sub-system, which lets users dump snapshots that include information such as average queue lengths and some types of latency. Finally, efforts to integrate SystemTap have been discussed on the mailing list.
Next I’ll show how to use LTTng to trace Ceph execution behavior. SystemTap is an excellent tool that I think must be supported, especially for debugging systems that cannot tolerate downtime. But the last time I investigated SystemTap, its overhead made it less attractive for long-term tracing, a task LTTng is well suited for. The trade-off is that LTTng requires static tracepoints to be inserted into the source code.
Finding High-latency Critical Sections in Ceph #
The primary locking primitive in Ceph is the type Mutex. Since limiting concurrency can impact latency, finding instances in which a mutex is held for a long period of time is helpful in understanding the performance of Ceph internals. Essentially we want to perform the following analysis for every mutex used in Ceph:
Mutex lock;
time_t start, duration;
lock.Lock();
start = timestamp();
// do stuff
duration = timestamp() - start;
lock.Unlock();
Rather than changing every call site, we can instead record every instance of
the Mutex::Lock
and Mutex::Unlock
events, and calculate the duration
between these events by post-processing the event stream. In order to drive
home the differences between profiling and tracing, I am going to start by
demonstrating how to use tracing to calculate mutex wait times, which happens
to be a task that profiling can handle. Then, reusing the functionality and
tools that we have built up, I will show how to find critical sections that
introduce large latencies, a much more difficult task for profiling that is
made simple through the use of tracing and analysis tools.
The userspace port of the LTTng kernel trace toolkit is called LTTng-UST. Its use is fairly straightforward, but integrating it can depend heavily on the application being instrumented, so a completely generic tutorial would be difficult to produce. For detailed information about the toolkit I’ll refer the reader to the LTTng website, which has good documentation.
Calculating Mutex Wait Time #
The following code snippet shows the locking routine for a mutex in Ceph (a
fancy wrapper around pthread_mutex_lock
that adds some additional
features). In order to examine the wait times introduced by the use
of a mutex we need to instrument this code to track how much time a thread
spends in this function. Note that the point of this exercise is to
demonstrate tracing, and yes, as stated, many profiling tools such as OProfile
are fully capable of handling this task. We’ll get to the harder stuff shortly.
void Mutex::Lock(bool no_lockdep) {
  if (lockdep && g_lockdep && !no_lockdep)
    _will_lock();
  if (TryLock())
    return;
  int r = pthread_mutex_lock(&_m);
  assert(r == 0);
  if (lockdep && g_lockdep) _locked();
  _post_lock();
}
As a short aside, the above routine is simplified. In fact, the full routine
does provide profiling support via Ceph performance counters for tracking
mutex wait time. Here is the unabridged version. Notice the addition of clock sampling and logging that hugs the call to pthread_mutex_lock:
void Mutex::Lock(bool no_lockdep) {
  if (lockdep && g_lockdep && !no_lockdep)
    _will_lock();
  if (TryLock()) {
    return;
  }
  utime_t start;
  if (logger && cct && cct->_conf->mutex_perf_counter)
    start = ceph_clock_now(cct);
  int r = pthread_mutex_lock(&_m);
  if (logger && cct && cct->_conf->mutex_perf_counter)
    logger->tinc(l_mutex_wait,
                 ceph_clock_now(cct) - start);
  assert(r == 0);
  if (lockdep && g_lockdep) _locked();
  _post_lock();
}
The problem is that performance counters (in their current form) toss away a lot of useful context. For example, the counters used above can tell us the total wait time over a fixed period of time, across all mutex instances, but they cannot tell us which threads were affected most, or which specific mutex instances were to blame for large wait times: very useful information for debugging and performance analysis.
Defining Tracepoints #
We can calculate the time a thread spends in Mutex::Lock
by logging events
upon entering and exiting the function. It is true that you could simply
calculate and log the duration with a single event, but that’s no fun when it
comes to demonstrating trace post-processing. We start by defining the two
events, lock_enter and lock_exit, and creating fields that will hold both
the address and the name of the lock. Note that timing
information and context such as thread identifiers are handled by the
LTTng-UST framework automatically. The important thing to notice in the
tracepoint definition is what information we are collecting. Specifically, we
are going to record the address and name of each Mutex
instance. Without
this information any trace post-processing tool would not be able to match a
lock_enter
event with the corresponding lock_exit
event.
TRACEPOINT_EVENT(mutex, lock_enter,
    TP_ARGS(
        const void *, addr,
        const char *, name),
    TP_FIELDS(
        ctf_integer_hex(unsigned long, addr, addr)
        ctf_string(name, name)
    )
)

TRACEPOINT_EVENT(mutex, lock_exit,
    TP_ARGS(
        const void *, addr,
        const char *, name),
    TP_FIELDS(
        ctf_integer_hex(unsigned long, addr, addr)
        ctf_string(name, name)
    )
)
After generating the tracepoint hooks using the lttng-gen-tp
tool we can add
instrumentation points to Mutex::Lock
. Notice we modify the routine slightly
to be friendly to different return paths.
 void Mutex::Lock(bool no_lockdep) {
+  utime_t start;
+  int r;
+
+  tracepoint(mutex, lock_enter, this, name);
+
   if (lockdep && g_lockdep && !no_lockdep) _will_lock();
   if (TryLock()) {
-    return;
+    goto out;
   }
-  utime_t start;
   if (logger && cct && cct->_conf->mutex_perf_counter)
     start = ceph_clock_now(cct);
-  int r = pthread_mutex_lock(&_m);
+  r = pthread_mutex_lock(&_m);
   if (logger && cct && cct->_conf->mutex_perf_counter)
     logger->tinc(l_mutex_wait,
                  ceph_clock_now(cct) - start);
   assert(r == 0);
   if (lockdep && g_lockdep) _locked();
   _post_lock();
+
+out:
+  tracepoint(mutex, lock_exit, this, name);
 }
Ok cool, now that that’s done, go recompile Ceph. Yes, this is a downside to LTTng-UST. But it has a lot of other benefits, like being very low-overhead and clearly capturing the semantics at the call site.
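If you haven’t integrated LTTng-UST into a project before, here is a rough sketch of the moving parts. The file names and build commands below are purely illustrative, not Ceph’s actual build wiring: the lttng-gen-tp tool expands the tracepoint definitions into a header and a source file, the source file is compiled and linked into the binary along with the LTTng-UST library, and any code that fires the events simply includes the generated header.
// Illustrative wiring only; file names here are assumptions, not Ceph's build rules.
//
//   $ lttng-gen-tp mutex_tp.tp      # generates mutex_tp.h and mutex_tp.c
//
// The generated mutex_tp.c contains the probe definitions; compile it and
// link it into the daemon together with -llttng-ust -ldl. Code that fires
// the events only needs the generated header:

#include "mutex_tp.h"   // assumed name of the generated header

void fire_example(const void *addr, const char *name) {
  // provider "mutex", events "lock_enter" / "lock_exit", as defined above
  tracepoint(mutex, lock_enter, addr, name);
  tracepoint(mutex, lock_exit, addr, name);
}
However the build is actually wired up, those are the pieces: a generated provider object plus the LTTng-UST library at link time.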
A Note About Starting Ceph #
Normally an LTTng-UST instrumented application can be started as usual, but there is one minor issue with processes that daemonize themselves. When starting a Ceph daemon (e.g. ceph-osd, ceph-mon, ceph-mds) you should LD_PRELOAD the LTTng-UST forking utility library. On my Ubuntu box I do this in conjunction with vstart.sh as follows:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/liblttng-ust-fork.so ./vstart
That’s it. Now we should be able to access any tracepoints that exist via the standard LTTng-UST tools.
Listing Available Tracepoints #
The set of tracepoints from any LTTng-UST enabled process can be listed using
the lttng list
command. For instance, I started two OSDs and one monitor and
then listed the available tracepoints:
nwatkins@kyoto:~$ lttng list -u
UST events:
-------------
PID: 13541 - Name: ./ceph-osd
ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_exit (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_enter (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
PID: 13271 - Name: ./ceph-osd
ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_exit (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_enter (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
PID: 13054 - Name: ./ceph-mon
ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_exit (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_enter (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
Awesome! Those are the tracepoints we added. Notice that since we added our tracepoints to the Mutex type, which is used throughout Ceph, they show up in both the OSD and monitor processes.
Capturing Traces #
The first thing that needs to be done is to start a tracing session.
nwatkins@kyoto:~$ lttng create
Session auto-20140530-165745 created.
Traces will be written in /home/nwatkins/lttng-traces/auto-20140530-165745
Next we’ll enable all userspace events within the current session. The same tool can also be used to interact with LTTng tracepoints in the kernel, as well as to selectively enable events based on different filtering patterns.
nwatkins@kyoto:~$ lttng enable-event -u -a
All UST events are enabled in channel channel0
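As an aside, we don’t have to enable every userspace event. The enable-event command accepts wildcards, so (given the mutex provider name used above) we could have enabled just our new events with something along these lines:
nwatkins@kyoto:~$ lttng enable-event -u 'mutex:*'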
Start logging in the current session.
nwatkins@kyoto:~$ lttng start
Tracing started for session auto-20140530-165745
At this point traces are being actively collected. We can generate some load on the cluster to make sure we hit a bunch of mutex instances.
nwatkins@kyoto:~$ rados -p data bench 10 write
Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
Object prefix: benchmark_data_kyoto_21530
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 17 1 3.99856 4 0.950886 0.950886
2 16 21 5 9.99701 16 1.51263 1.39381
3 16 25 9 11.9967 16 2.57365 1.79965
And finally we can suspend trace collection.
nwatkins@kyoto:~$ lttng stop
Waiting for data availability.
Tracing stopped for session auto-20140530-165745
Viewing the trace that we just collected is also very simple using the lttng view
command.
nwatkins@kyoto:~$ lttng view | head
[16:58:23.038521097] (+?.?????????) kyoto mutex:lock_enter: { cpu_id = 0 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[16:58:23.038530591] (+0.000009494) kyoto mutex:lock_exit: { cpu_id = 0 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[16:58:23.038531411] (+0.000000820) kyoto mutex:lock_enter: { cpu_id = 0 }, { addr = 0x205ABE8, name = "OSD::heartbeat_update_lock" }
[16:58:23.038532042] (+0.000000631) kyoto mutex:lock_exit: { cpu_id = 0 }, { addr = 0x205ABE8, name = "OSD::heartbeat_update_lock" }
[16:58:23.038532967] (+0.000000925) kyoto mutex:lock_enter: { cpu_id = 0 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[16:58:23.038533479] (+0.000000512) kyoto mutex:lock_exit: { cpu_id = 0 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[16:58:23.038535265] (+0.000001786) kyoto mutex:lock_enter: { cpu_id = 0 }, { addr = 0x1FC2408, name = "SimpleMessenger::DispatchQeueu::lock" }
[16:58:23.038535846] (+0.000000581) kyoto mutex:lock_exit: { cpu_id = 0 }, { addr = 0x1FC2408, name = "SimpleMessenger::DispatchQeueu::lock" }
[16:58:23.038538253] (+0.000002407) kyoto mutex:lock_enter: { cpu_id = 0 }, { addr = 0x205A610, name = "OSD::recovery_tp::lock" }
[16:58:23.038538829] (+0.000000576) kyoto mutex:lock_exit: { cpu_id = 0 }, { addr = 0x205A610, name = "OSD::recovery_tp::lock" }
The snapshot of the trace above shows timestamped events corresponding to the tracepoints we inserted into Mutex::Lock. Notice that the address and name of each mutex are also available.
Trace Sizes #
Using the lttng view
utility we can convert the trace into a readable
format. However, traces are collected in an efficient binary format. For
instance, the size of the converted trace for a short run is roughly 46
megabytes:
nwatkins@kyoto:~/ceph/ceph$ lttng view > full.trace
nwatkins@kyoto:~/ceph/ceph$ ls -lh full.trace
-rw-rw-r-- 1 nwatkins nwatkins 46M May 31 12:06 full.trace
While the directory containing the raw traces is significantly smaller:
nwatkins@kyoto:~/ceph/ceph$ du -sh ~/lttng-traces/auto-20140531-120016/ust/uid/1000/64-bit/
12M /home/nwatkins/lttng-traces/auto-20140531-120016/ust/uid/1000/64-bit/
This works out to about 44 bytes per event in the raw format vs 161 bytes per event in the converted textual format. Quite a difference.
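For the curious, the per-event figures are simply the trace sizes divided by the number of events in the trace. One way to get that count is with the same babeltrace Python bindings used later in this post; here is a quick sketch:
import sys
from babeltrace import *

# count the events in a raw CTF trace directory given on the command line
traces = TraceCollection()
ret = traces.add_trace(sys.argv[1], "ctf")
print sum(1 for _ in traces.events)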
Adding Context To Traces #
Additional context can be added to traces without changing the tracepoint,
recompiling, or restarting the application. For instance, here is
how we can log the pthread_id
of the execution context that
triggered an event:
nwatkins@kyoto:~/ceph/ceph$ lttng add-context -u -t pthread_id
UST context pthread_id added to all channels
After adding the context and collecting a new trace the additional context information will be available in the trace:
nwatkins@kyoto:~/ceph/ceph$ lttng view | head
[12:01:16.770519287] (+?.?????????) kyoto mutex:lock_enter: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[12:01:16.770532490] (+0.000013203) kyoto mutex:lock_exit: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[12:01:16.770533490] (+0.000001000) kyoto mutex:lock_enter: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205ABE8, name = "OSD::heartbeat_update_lock" }
[12:01:16.770534174] (+0.000000684) kyoto mutex:lock_exit: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205ABE8, name = "OSD::heartbeat_update_lock" }
[12:01:16.770535188] (+0.000001014) kyoto mutex:lock_enter: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[12:01:16.770535800] (+0.000000612) kyoto mutex:lock_exit: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205AB18, name = "OSD::heartbeat_lock" }
[12:01:16.770537653] (+0.000001853) kyoto mutex:lock_enter: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x1FC2408, name = "SimpleMessenger::DispatchQeueu::lock" }
[12:01:16.770538299] (+0.000000646) kyoto mutex:lock_exit: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x1FC2408, name = "SimpleMessenger::DispatchQeueu::lock" }
[12:01:16.770540621] (+0.000002322) kyoto mutex:lock_enter: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205A610, name = "OSD::recovery_tp::lock" }
[12:01:16.770541233] (+0.000000612) kyoto mutex:lock_exit: { cpu_id = 2 }, { pthread_id = 140109313545984 }, { addr = 0x205A610, name = "OSD::recovery_tp::lock" }
Notice the new { pthread_id = 140109313545984 }
field.
Trace Post-processing #
Now that we have a trace with the data we need, namely the lock_enter
and
lock_exit
events, we can calculate how long any thread waited on a mutex.
The babeltrace
tool that ships with lttng-ust
is useful for interacting
with traces, and also has Python bindings that we will make use of. Note
that there are many tools for trace analysis, including several GUIs, but a
full discussion of these tools is beyond the scope of this post. The
following script performs all the post-processing that we need:
import sys
from babeltrace import *

traces = TraceCollection()
ret = traces.add_trace(sys.argv[1], "ctf")

lockers = {}
for event in traces.events:
    # metadata about mutex and thread context
    addr, name = event["addr"], event["name"]
    thread_id = event["pthread_id"]

    # identifies a thread interacting with a mutex
    locker = (thread_id, addr)

    # record a thread entering Mutex::Lock
    if event.name == "mutex:lock_enter":
        lockers[locker] = event.timestamp

    # calculate time spent in Mutex::Lock
    elif event.name == "mutex:lock_exit":
        try:
            duration = event.timestamp - lockers[locker]
            del lockers[locker]
            print duration, name
        except KeyError:
            continue
This simple script monitors for the mutex:lock_enter event, and prints the elapsed time when it encounters the corresponding mutex:lock_exit event. It tracks ownership by identifying a mutex by the tuple (thread_id, addr). Notice the try block used when processing the mutex:lock_exit event. This is necessary to catch the edge case in which tracing began after a thread had executed past mutex:lock_enter but prior to reaching the mutex:lock_exit event.
This script can be run as follows to find instances of the longest mutex wait times in Ceph:
nwatkins@kyoto:~$ python mutex.py /path/to/trace | sort -nr | head
3727778496 Monitor::lock
3505961112 Monitor::lock
476723751 Monitor::lock
372843475 FileStore::OpSequencer::apply_lock
197103478 PG::_lock
196684510 fdcache_lock
51744346 FileStore::OpSequencer::apply_lock
50814795 FileStore::OpSequencer::apply_lock
35316968 fdcache_lock
25695414 PG::_lock
21339899 FileStore::OpSequencer::apply_lock
Fantastic. Now we have the wait cost of each instance of locking a mutex. Further analysis can be done at this point, such as plotting a histogram, or picking a lock and trying to reduce costs. In the next section we are going to reuse the mutex:lock_exit event to find the longest critical sections in Ceph. For this we need to know when a mutex is unlocked.
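Before moving on, here is one small sketch of that kind of follow-up analysis. It is my own illustration rather than part of the tooling above: it reads the “duration name” lines printed by the script from a file and sums the wait time per lock name, so the worst offenders stand out.
import sys
from collections import defaultdict

# sum wait time (nanoseconds) and event count per lock name
totals = defaultdict(int)
counts = defaultdict(int)

for line in open(sys.argv[1]):
    parts = line.split(None, 1)
    if len(parts) != 2:
        continue
    duration, name = int(parts[0]), parts[1].strip()
    totals[name] += duration
    counts[name] += 1

# report the ten locks with the largest total wait time
for name in sorted(totals, key=totals.get, reverse=True)[:10]:
    print totals[name], counts[name], name
For example (script and file names are just placeholders): python mutex.py /path/to/trace > waits.txt, then python waits_by_lock.py waits.txt.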
Instrumenting Mutex::Unlock #
Now that we know how to collect a trace with events corresponding to
Mutex::Lock
, we can instrument Mutex::Unlock
and perform post-processing
to calculate the duration of critical sections. The first step is to add the
new tracepoint. Since the previous section covered this process in detail,
I’ll only show the important bits here. Here is the new tracepoint we will be
adding:
TRACEPOINT_EVENT(mutex, unlock,
    TP_ARGS(
        const void *, addr,
        const char *, name),
    TP_FIELDS(
        ctf_integer_hex(unsigned long, addr, addr)
        ctf_string(name, name)
    )
)
Next add the tracepoint to Mutex::Unlock:
 void Mutex::Unlock() {
   _pre_unlock();
   if (lockdep && g_lockdep) _will_unlock();
   int r = pthread_mutex_unlock(&_m);
   assert(r == 0);
+  tracepoint(mutex, unlock, this, name);
 }
Now go ahead and spin up Ceph. We can see our old tracepoints along with the new one:
nwatkins@kyoto:~$ lttng list -u
UST events:
-------------
PID: 27747 - Name: ./ceph-mon
ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:unlock (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_exit (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_enter (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
PID: 28234 - Name: ./ceph-osd
ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:unlock (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_exit (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_enter (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
PID: 27964 - Name: ./ceph-osd
ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:unlock (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_exit (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
mutex:lock_enter (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
Once we have collected a new set of traces, we are all set for post-processing. Here is the updated Python script for our new analysis task:
import sys
from babeltrace import *

traces = TraceCollection()
ret = traces.add_trace(sys.argv[1], "ctf")

lockers = {}
for event in traces.events:
    # metadata about mutex and thread context
    addr, name = event["addr"], event["name"]
    thread_id = event["pthread_id"]

    # identifies a thread interacting with a mutex
    locker = (thread_id, addr)

    # track when we finally acquire the lock
    if event.name == "mutex:lock_exit":
        lockers[locker] = event.timestamp

    # calc time from Mutex::Lock to Mutex::Unlock
    elif event.name == "mutex:unlock":
        try:
            duration = event.timestamp - lockers[locker]
            del lockers[locker]
            print duration, name
        except KeyError:
            continue
The script is very similar to the previous one used to calculate mutex wait times. However, here we are calculating the duration between exiting Mutex::Lock and exiting Mutex::Unlock. Performing the analysis and sorting on duration, we get the following results:
nwatkins@kyoto:~$ python mutex.py /path/to/trace | sort -nr | head
36261010629 SimpleMessenger::lock
36260955023 SimpleMessenger::DispatchQueue::local_delivery_lock
36260066867 Finisher::finisher_lock
36259848130 MonClient::monc_lock
36256039073 librados::RadosClient::lock
36005823838 ObjBencher::lock
10000237408 SimpleMessenger::Pipe::pipe_lock
9999982398 SimpleMessenger::Pipe::pipe_lock
9413737346 WBThrottle::lock
8157151233 SimpleMessenger::Pipe::pipe_lock
7785652523 SimpleMessenger::DispatchQeueu::lock
7341281593 WBThrottle::lock
6756233583 SimpleMessenger::Pipe::pipe_lock
5881035551 Finisher::finisher_lock
5880718046 Finisher::finisher_lock
And that’s all there is to it. A remarkable range of questions about the behavior of Ceph (or any other application) can be answered with this style of trace collection and analysis.
What’s Next #
There are a lot of ways to use LTTng. For instance, one can perform remote tracing over the network, combine kernel and user-space traces for a holistic system view, and also instrument clients for end-to-end tracing analysis.