# Ceph OSD request processing latency
How fast can RADOS process a request? The answer depends on many factors such as network and I/O performance, operation type, and all sorts of contention that limit concurrency. Today we’ll focus on the latency added by request processing inside an OSD. We are going to do our performance analysis by post-processing execution traces collected using LTTng-UST. Check out Tracing Ceph With LTTng for more information on instrumenting Ceph.
Let’s start by profiling the performance of the RADOS `create` operation.
First we need to create a workload generator. It couldn’t be simpler: call `ioctx.create` with a unique object name in a loop and record the latency of each operation. Here is the kernel of the workload tool:
```cpp
for (int i = 0;; i++) {
  std::stringstream oid;
  oid << "obj." << i;                        // unique object name per iteration
  uint64_t start = get_time();
  ioctx.create(oid.str(), true);             // exclusive create of a new object
  uint64_t duration = get_time() - start;
  // log (start, duration)
}
```
In the snippet above, `log (start, duration)` will be replaced by an LTTng trace point so that we can access the latency measurements after the experiment completes.
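For illustration, here is roughly what such a trace point could look like as an LTTng-UST provider header. The `workload` provider and `create_latency` event names are invented for this sketch and are not part of Ceph’s own tracing providers:

```cpp
// workload_tp.h -- hypothetical LTTng-UST tracepoint provider for the workload generator
#undef TRACEPOINT_PROVIDER
#define TRACEPOINT_PROVIDER workload

#undef TRACEPOINT_INCLUDE
#define TRACEPOINT_INCLUDE "./workload_tp.h"

#if !defined(WORKLOAD_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
#define WORKLOAD_TP_H

#include <stdint.h>
#include <lttng/tracepoint.h>

TRACEPOINT_EVENT(
    workload,                 // provider name
    create_latency,           // event name
    TP_ARGS(uint64_t, start, uint64_t, duration),
    TP_FIELDS(
        ctf_integer(uint64_t, start, start)
        ctf_integer(uint64_t, duration, duration)
    )
)

#endif /* WORKLOAD_TP_H */

#include <lttng/tracepoint-event.h>
```

One translation unit then defines the provider by declaring `TRACEPOINT_DEFINE` before including this header, and the `// log (start, duration)` comment in the loop becomes `tracepoint(workload, create_latency, start, duration);`. After the run, the recorded events can be read back out of the trace with babeltrace.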
## Client Observed Latency
I ran this workload for five minutes against a single OSD using the in-memory object store. The experiment collected approximately 250,000 operations. Here is the distribution of latencies as observed by the client. With the exception of some outliers, the expected latency is about 1.1 milliseconds. That’s a pretty long time, considering we are going over the loopback and never touching an I/O device!
Client Measurement | Latency (ms) |
---|---|
mean | 1.145 |
std | 0.102 |
min | 0.597 |
25% | 1.103 |
50% | 1.144 |
75% | 1.187 |
max | 12.700 |
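As an aside, the summary above comes from post-processing the collected trace. Here is a minimal sketch of that step, assuming the per-operation durations (in milliseconds) have already been pulled out of the trace (e.g. via babeltrace plus a little text munging) into one number per line on stdin:

```cpp
// summarize.cpp -- hypothetical post-processing: summarize per-op latencies (ms) from stdin
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <iostream>
#include <vector>

static double percentile(const std::vector<double>& sorted, double p) {
  // nearest-rank percentile on an already-sorted vector is fine for a quick summary
  size_t idx = static_cast<size_t>(p * (sorted.size() - 1));
  return sorted[idx];
}

int main() {
  std::vector<double> lat;
  double ms;
  while (std::cin >> ms)
    lat.push_back(ms);
  if (lat.empty())
    return 1;

  std::sort(lat.begin(), lat.end());
  double sum = 0.0, sq = 0.0;
  for (double x : lat) { sum += x; sq += x * x; }
  double mean = sum / lat.size();
  double stddev = std::sqrt(sq / lat.size() - mean * mean);

  std::printf("mean %.3f  std %.3f  min %.3f  25%% %.3f  50%% %.3f  75%% %.3f  max %.3f\n",
              mean, stddev, lat.front(), percentile(lat, 0.25),
              percentile(lat, 0.50), percentile(lat, 0.75), lat.back());
  return 0;
}
```

Nothing fancy: sort, take nearest-rank percentiles, and compute the mean and standard deviation.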
These latencies reflect everything: client library overhead, the network round trip, and all of the request processing that occurs in the OSD. What contributes to the high latency? Next we’ll look at a subset of the processing that occurs in the OSD.
## OSD Operation Handling
The following diagram shows the typical execution flow that an OSD follows while processing a client request. Starting at the top left, the Messenger dispatches a message to the OSD; this dispatch phase finishes by placing the request on the `OpWQ` workqueue. One can think of the dispatch phase as being analogous to the top half of an IRQ handler.
A queued operation is plucked off of `OpWQ` by a worker thread, and this is where the actual work associated with an operation occurs. Continuing with our IRQ analogy, this is like the bottom half handler. Within this phase we further break out the portion that executes the transaction associated with the request. We’ve instrumented the OSD and client using LTTng-UST, collected traces, and broken down request processing latency by phase.
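Concretely, each phase is just the difference between two tracepoint timestamps correlated per request. A minimal sketch of that bookkeeping, using hypothetical timestamp names (the events in Ceph’s actual providers are named differently), looks like this:

```cpp
// Hypothetical per-request timestamps (microseconds), correlated from the trace by
// request id. Ceph's real tracepoints use different names; this only illustrates how
// the phase columns in the tables below are derived.
#include <cstdint>
#include <cstdio>

struct OpTimestamps {
  uint64_t msg_received;  // Messenger hands the message to the OSD
  uint64_t enqueued;      // request placed on the OpWQ
  uint64_t dequeued;      // worker thread picks the request up
  uint64_t tx_begin;      // transaction for the op starts
  uint64_t tx_end;        // transaction for the op completes
  uint64_t done;          // worker finishes handling the request
};

int main() {
  // made-up numbers chosen to mirror the mean values reported below
  OpTimestamps t{1000, 1094, 1136, 1200, 1221, 1361};

  uint64_t dispatch = t.enqueued - t.msg_received;  // "Dispatch" column
  uint64_t queue    = t.dequeued - t.enqueued;      // "OpWQ" column
  uint64_t handling = t.done     - t.dequeued;      // "Req Handling" column
  uint64_t tx       = t.tx_end   - t.tx_begin;      // "Tx Handling" (a subset of handling)

  std::printf("dispatch=%llu us  queue=%llu us  handling=%llu us  tx=%llu us\n",
              (unsigned long long)dispatch, (unsigned long long)queue,
              (unsigned long long)handling, (unsigned long long)tx);
  return 0;
}
```

The dispatch, queue, and handling phases are disjoint and add together, while the transaction time is contained within the handling phase; the tables below follow that layout.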
## Message Dispatch
The message dispatch latency represents the amount of time taken by the OSD, following message receipt, to perform any initial processing and place the request on the workqueue.
(ms) | Client View | Dispatch Phase |
---|---|---|
mean | 1.145 | 0.094 |
std | 0.102 | 0.005 |
min | 0.597 | 0.053 |
25% | 1.103 | 0.093 |
50% | 1.144 | 0.093 |
75% | 1.187 | 0.094 |
max | 12.700 | 1.138 |
So this is pretty fast at around 94 microseconds on average. The max time taken was about 1.1 milliseconds, so dispatch isn’t what caused our huge 12 millisecond latency.
## Queue Latency
Once a message has been placed on the workqueue it sits idle until a worker thread wakes up and handles it.
(ms) | Client | Dispatch | OpWQ |
---|---|---|---|
mean | 1.145 | 0.094 | 0.042 |
std | 0.102 | 0.005 | 0.011 |
min | 0.597 | 0.053 | 0.000 |
25% | 1.103 | 0.093 | 0.038 |
50% | 1.144 | 0.093 | 0.041 |
75% | 1.187 | 0.094 | 0.045 |
max | 12.700 | 1.138 | 0.668 |
Alright, so we can add on about 40 more microseconds that the request spends sitting in the queue. Still, we don’t know what caused the max client latency.
## Request Handling
The request handling latency added to the table below shows the time taken by a worker thread to handle a request. Check out the max: we’ve narrowed down where that big latency came from. It doesn’t tell us exactly what happened, but it does tell us, for instance, that the spike probably wasn’t due to the network or queuing delay (we need more instrumentation points to really get to the bottom of things).
(ms) | Client | Dispatch | OpWQ | Req Handling |
---|---|---|---|---|
mean | 1.145 | 0.094 | 0.042 | 0.225 |
std | 0.102 | 0.005 | 0.011 | 0.078 |
min | 0.597 | 0.053 | 0.000 | 0.122 |
25% | 1.103 | 0.093 | 0.038 | 0.215 |
50% | 1.144 | 0.093 | 0.041 | 0.221 |
75% | 1.187 | 0.094 | 0.045 | 0.228 |
max | 12.700 | 1.138 | 0.668 | 11.637 |
Now things are starting to add up: the 11.6 millisecond maximum in this phase accounts for most of the 12.7 millisecond maximum observed by the client, so that big spike probably came from here (although this isn’t a definitive test).
## Transaction Handling
Finally, we show latencies for the portion of request handling that corresponds to the actual operation-specific transaction. Note that the phases described in the previous sections are non-overlapping, so their latencies add together; transaction handling, however, is a subset of the request handling phase.
(ms) | Client | Dispatch | OpWQ | Req Handling | Tx Handling |
---|---|---|---|---|---|
mean | 1.145 | 0.094 | 0.042 | 0.225 | 0.021 |
std | 0.102 | 0.005 | 0.011 | 0.078 | 0.002 |
min | 0.597 | 0.053 | 0.000 | 0.122 | 0.011 |
25% | 1.103 | 0.093 | 0.038 | 0.215 | 0.020 |
50% | 1.144 | 0.093 | 0.041 | 0.221 | 0.020 |
75% | 1.187 | 0.094 | 0.045 | 0.228 | 0.020 |
max | 12.700 | 1.138 | 0.668 | 11.637 | 0.161 |
Check out the max row again. Transaction handling never exceeded 0.161 milliseconds, so even though it is a subset of the request handling phase, we can completely rule it out as the cause of the large latency. The spike is happening somewhere else within request handling.
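As a back-of-the-envelope check using the mean and max values from the tables above (a rough sketch, not additional measurement):

```cpp
// Back-of-the-envelope using the mean and max values from the tables above (all in ms).
#include <cstdio>

int main() {
  // mean latencies
  const double dispatch = 0.094, queue = 0.042, handling = 0.225, client = 1.145;
  const double inside_osd = dispatch + queue + handling;   // ~0.36 ms measured in the OSD
  const double elsewhere  = client - inside_osd;           // ~0.78 ms: client library, loopback
                                                           // round trip, reply path, ...
  // worst case
  const double client_max = 12.700, handling_max = 11.637;

  std::printf("mean:  %.3f ms inside the OSD, %.3f ms elsewhere\n", inside_osd, elsewhere);
  std::printf("worst: %.3f of the %.3f ms spike is request handling\n",
              handling_max, client_max);
  return 0;
}
```

In other words, the phases measured inside the OSD account for roughly a third of the mean client latency, with the rest presumably spent in the client library, the loopback round trip, and the uninstrumented parts of the reply path; and nearly all of the 12.7 ms worst case lives in request handling.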
## What’s Next
To really dig in further we’ll need more instrumentation points. And running on top of real disks or over the network may shift the bottlenecks, requiring instrumentation in completely different parts of the request flow.