Ceph OSD request processing latency

8 June 2014

How fast can RADOS process a request? The answer depends on a lot of factors such as network and I/O performance, operation type, and all sorts of flavors of contention that limit concurrency. Today we’ll focus on the latency added due to request processing inside an OSD. We are going to do our performance analysis by post-processing execution traces collected using LTTng-UST. Check out Tracing Ceph With LTTng for more information on instrumenting Ceph.

Let’s start by profiling the performance of the RADOS create operation. First we need to create a workload generator. It couldn’t be simpler: call ioctx.create with a unique object name in a loop and record the latency of each operation. Here is the kernel of the workload tool:

// Issue create operations back-to-back, timing each one.
for (int i = 0;; i++) {
    std::stringstream oid;
    oid << "obj." << i;                     // unique object name
    uint64_t start = get_time();
    ioctx.create(oid.str(), true);          // exclusive create via librados
    uint64_t duration = get_time() - start;
    // log (start, duration)
}

In the snippet above, log (start, duration) will be replaced by an LTTng trace point so that we can access the latency measurements after the experiment completes.
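For concreteness, here is a minimal sketch of what such a tracepoint could look like with LTTng-UST. The provider name (latency_wl) and event name (create_op) are invented for this example; they are not the names used in the actual Ceph or workload-tool instrumentation.

/* latency_wl_tp.h -- hypothetical tracepoint provider for the workload tool */
#undef TRACEPOINT_PROVIDER
#define TRACEPOINT_PROVIDER latency_wl

#undef TRACEPOINT_INCLUDE
#define TRACEPOINT_INCLUDE "./latency_wl_tp.h"

#if !defined(LATENCY_WL_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
#define LATENCY_WL_TP_H

#include <lttng/tracepoint.h>

TRACEPOINT_EVENT(
    latency_wl,
    create_op,
    TP_ARGS(uint64_t, start, uint64_t, duration),
    TP_FIELDS(
        ctf_integer(uint64_t, start, start)
        ctf_integer(uint64_t, duration, duration)
    )
)

#endif /* LATENCY_WL_TP_H */

#include <lttng/tracepoint-event.h>

One source file in the tool defines TRACEPOINT_DEFINE and TRACEPOINT_CREATE_PROBES before including this header so the probes get compiled in, and the log (start, duration) line in the loop becomes:

tracepoint(latency_wl, create_op, start, duration);

After the run, the events can be read back out of the trace (for example with babeltrace) for post-processing.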

Client Observed Latency

I ran this workload for five minutes against a single OSD using the in-memory object store. The experiment collected approximately 250,000 operations. Here is the distribution of latencies as observed by the client. With the exception of some outliers, the expected latency is about 1.1 milliseconds. That’s a pretty long time, considering we are going over the loopback and never touching an I/O device!

Client Measurement    Latency (ms)
mean                  1.145
std                   0.102
min                   0.597
25%                   1.103
50%                   1.144
75%                   1.187
max                   12.700

These latencies reflect everything: client library overhead, the network round-trip, and all of the request processing that occurs in the OSD. What contributes to the high latency? Next we’ll look at a subset of the processing that occurs in the OSD.
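Before moving on, a quick note on the numbers themselves: summary statistics like the ones in the table above are easy to compute from the logged (start, duration) samples. Here is a rough sketch, assuming the durations have already been converted to milliseconds; it’s illustrative post-processing, not the tooling actually used to produce these tables.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Nearest-rank percentile over an already-sorted, non-empty sample.
static double percentile(const std::vector<double>& sorted, double p) {
    std::size_t idx = static_cast<std::size_t>(p * (sorted.size() - 1));
    return sorted[idx];
}

// Summarize request latencies (in ms) the way the tables in this post do.
void summarize(std::vector<double> lat_ms) {
    std::sort(lat_ms.begin(), lat_ms.end());
    double sum = 0.0, sq = 0.0;
    for (double v : lat_ms) { sum += v; sq += v * v; }
    const double mean = sum / lat_ms.size();
    const double var = sq / lat_ms.size() - mean * mean;
    std::printf("mean %.3f std %.3f min %.3f 25%% %.3f 50%% %.3f 75%% %.3f max %.3f\n",
                mean, std::sqrt(var), lat_ms.front(),
                percentile(lat_ms, 0.25), percentile(lat_ms, 0.50),
                percentile(lat_ms, 0.75), lat_ms.back());
}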

OSD Operation Handling

The following diagram shows the typical execution flow that an OSD follows while processing a client request. Starting at the top left, the Messenger dispatches a message to the OSD; this phase finishes by placing the request on the OpWQ workqueue. One can think of the dispatch phase as being analogous to the top half of an IRQ handler.

[Diagram: OSD request processing flow, from Messenger dispatch through the OpWQ workqueue to handling by a worker thread]

A queued operation is plucked off of OpWQ by a worker thread, and this is where the actual work associated with an operation occurs. Continuing with our IRQ analogy, this is like the bottom half handler. Within this phase we also measure, separately, the portion that executes the transaction associated with the request. We’ve instrumented the OSD and client using LTTng-UST, collected traces, and broken down request processing latency by these phases.
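To make the phase boundaries concrete, here is a rough sketch of how the per-request phase latencies reported below can be derived once trace events have been matched up by request. The struct and field names are invented for illustration; they don’t correspond to the actual tracepoint names in the Ceph instrumentation.

#include <cstdint>

// Hypothetical per-request timestamps (ns), one per instrumentation point,
// recovered from the trace and matched up by a per-request identifier.
struct ReqTimestamps {
    uint64_t msg_received;   // Messenger hands the message to the OSD
    uint64_t opwq_enqueued;  // request placed on the OpWQ workqueue
    uint64_t opwq_dequeued;  // worker thread picks the request up
    uint64_t handling_done;  // request handling completes
};

// The non-overlapping phases reported in the tables below.
struct PhaseLatencies {
    uint64_t dispatch;  // receipt -> enqueue (message dispatch)
    uint64_t queue;     // enqueue -> dequeue (time spent in OpWQ)
    uint64_t handling;  // dequeue -> done    (request handling)
};

PhaseLatencies phases(const ReqTimestamps& t) {
    return PhaseLatencies{
        t.opwq_enqueued - t.msg_received,
        t.opwq_dequeued - t.opwq_enqueued,
        t.handling_done - t.opwq_dequeued,
    };
}

Transaction handling, covered at the end of the post, is measured the same way but as a sub-interval inside the handling phase.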

Message Dispatch

The message dispatch latency represents the amount of time taken by the OSD, following message receipt, to perform any initial processing and place the request on the workqueue.

(ms)    Client View    Dispatch Phase
mean    1.145          0.094
std     0.102          0.005
min     0.597          0.053
25%     1.103          0.093
50%     1.144          0.093
75%     1.187          0.094
max     12.700         1.138

So this is pretty fast, at around 94 microseconds on average. The max time taken was about 1.1 milliseconds, so dispatch isn’t what caused our huge 12 millisecond latency.

Queue Latency

Once a message has been placed on the workqueue, it sits idle until a worker thread wakes up and handles it.

(ms)    Client    Dispatch    OpWQ
mean    1.145     0.094       0.042
std     0.102     0.005       0.011
min     0.597     0.053       0.000
25%     1.103     0.093       0.038
50%     1.144     0.093       0.041
75%     1.187     0.094       0.045
max     12.700    1.138       0.668

Alright, so we can add on roughly 42 more microseconds that the request spends sitting in the queue. Still, we don’t know what caused the max client latency.

Request Handling

The request handling latency added to the table below shows the time taken by a worker thread to handle a request. Check out the max: we’ve narrowed down where that big latency came from. It doesn’t tell us exactly what happened, but it does tell us useful things, for instance that the culprit probably wasn’t the network or queuing delay (we need more instrumentation points to really get to the bottom of it).

(ms)    Client    Dispatch    OpWQ     Req Handling
mean    1.145     0.094       0.042    0.225
std     0.102     0.005       0.011    0.078
min     0.597     0.053       0.000    0.122
25%     1.103     0.093       0.038    0.215
50%     1.144     0.093       0.041    0.221
75%     1.187     0.094       0.045    0.228
max     12.700    1.138       0.668    11.637

Now things are starting to add up: the mean dispatch, queue, and handling latencies account for roughly 0.36 ms of the 1.14 ms mean client latency, with the remainder spent in the network and the client library. More importantly, this phase is where we finally find a large maximum latency. It looks like that big client spike probably came from here (although this isn’t a definitive test).

Transaction Handling

Finally, we show latencies for the portion of request handling that corresponds to the actual operation-specific transaction. Note that the phases described in the previous sections are non-overlapping, so their latencies add together. Transaction handling, in contrast, is a subset of the request handling phase rather than an additional component.

(ms)    Client    Dispatch    OpWQ     Req Handling    Tx Handling
mean    1.145     0.094       0.042    0.225           0.021
std     0.102     0.005       0.011    0.078           0.002
min     0.597     0.053       0.000    0.122           0.011
25%     1.103     0.093       0.038    0.215           0.020
50%     1.144     0.093       0.041    0.221           0.020
75%     1.187     0.094       0.045    0.228           0.020
max     12.700    1.138       0.668    11.637          0.161

Check out the max row again. Transaction handling tops out at about 0.16 ms, and since it is a subset of the request handling phase, we can completely rule it out as the cause of the large latency. The large latency is happening somewhere else in the request handling phase.

What’s Next

To really dig in further we’ll need more instrumentation points, and running on top of real disks or over a real network may shift the bottlenecks, requiring instrumentation points in completely different parts of the request flow.