OpRequest flow in RADOS OSD server

5 May 2014

This post is a quick tour of the life cycle of an OpRequest in the Ceph/RADOS storage server. We’ll follow the request from the moment the generic message arrives off the network to the point where the resulting object operation reaches the low-level object store layer as a transaction.

The Messenger handles connections and generic messages. A message is dispatched to the registered dispatchers via the ms_dispatch virtual method of the Dispatcher interface, which the OSD class implements. The flow is described below as two high-level traces that run asynchronously with respect to each other. The first covers receiving, preparing, and queueing a request. The second follows the separate worker threads that dequeue requests and process them.
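To make the dispatch step concrete, here is a minimal sketch of the pattern. It is not Ceph's code: Message, ToyMessenger, ToyOSD, and kToyOpType are hypothetical stand-ins, and the rule that delivery stops at the first dispatcher returning true is an assumption of the sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical message type id used only by this sketch.
constexpr int kToyOpType = 42;

// Simplified stand-in for a generic network message (not Ceph's Message class).
struct Message {
    int type;
    std::string payload;
};

// Models the Dispatcher interface: return true if the message was handled.
struct Dispatcher {
    virtual ~Dispatcher() = default;
    virtual bool ms_dispatch(Message *m) = 0;
};

// Hypothetical messenger: offers each message to the dispatchers in
// registration order until one of them claims it (an assumption here).
class ToyMessenger {
public:
    void add_dispatcher(Dispatcher *d) {
        assert(d != nullptr);
        dispatchers_.push_back(d);
    }

    bool deliver(Message *m) {
        for (Dispatcher *d : dispatchers_) {
            if (d->ms_dispatch(m)) return true;
        }
        return false;  // nobody wanted this message
    }

private:
    std::vector<Dispatcher *> dispatchers_;
};

// Hypothetical OSD-like dispatcher that accepts a single message type.
struct ToyOSD : Dispatcher {
    int handled = 0;

    bool ms_dispatch(Message *m) override {
        if (m->type != kToyOpType) return false;
        ++handled;
        return true;
    }
};
```

Registering a ToyOSD with a ToyMessenger and calling deliver with a message of type kToyOpType reaches ToyOSD::ms_dispatch. A message of any other type falls through every dispatcher and deliver returns false.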

Message Dispatch and Request Queuing

The trace begins when a message is dispatched to the OSD:

  • bool OSD::ms_dispatch(Message *m)
    • src/osd/OSD.cc:4720

There are two paths that can be taken, both of which will arrive at OSD::dispatch_op.

  • void OSD::_dispatch(Message *m)
    • Construct a new OpRequest
    • src/osd/OSD.cc:4937
  • void OSD::do_waiters()
    • Grab an existing OpRequest
    • src/osd/OSD.cc:4840

Both _dispatch and do_waiters will then process a request:

  • void OSD::dispatch_op(OpRequestRef op)
    • src/osd/OSD.cc:4857
  • void OSD::handle_op(OpRequestRef op)
    • src/osd/OSD.cc:7352
  • void OSD::enqueue_op(PG *pg, OpRequestRef op)
    • src/osd/OSD.cc:7546
  • void PG::queue_op(OpRequestRef op)
    • src/osd/PG.cc:1707

The request now sits on a queue, waiting to be picked up by a worker.
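The way the two entry paths converge can be sketched as follows. This is a toy model under loud assumptions: MiniOSD, dispatch_message, park, and the single pg_queue_ are hypothetical, and the checks done by the real handle_op are skipped entirely.

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <string>
#include <utility>

// Toy stand-ins; the real Message and OpRequest carry far more state.
struct Message { std::string payload; };

struct OpRequest {
    explicit OpRequest(Message m) : msg(std::move(m)) {}
    Message msg;
};
using OpRequestRef = std::shared_ptr<OpRequest>;

class MiniOSD {
public:
    // Path 1 (like _dispatch): wrap a fresh message in a new OpRequest.
    void dispatch_message(Message m) {
        dispatch_op(std::make_shared<OpRequest>(std::move(m)));
    }

    // Hypothetical helper: park a request that cannot be handled yet.
    void park(OpRequestRef op) { waiters_.push_back(std::move(op)); }

    // Path 2 (like do_waiters): re-dispatch requests that were parked earlier.
    void do_waiters() {
        while (!waiters_.empty()) {
            OpRequestRef op = std::move(waiters_.front());
            waiters_.pop_front();
            dispatch_op(std::move(op));
        }
    }

    const std::deque<OpRequestRef> &queued() const { return pg_queue_; }

private:
    // Both paths converge here. The sketch appends straight to one queue
    // that stands in for the per-PG queue behind PG::queue_op.
    void dispatch_op(OpRequestRef op) {
        assert(op != nullptr);
        pg_queue_.push_back(std::move(op));
    }

    std::deque<OpRequestRef> waiters_;
    std::deque<OpRequestRef> pg_queue_;
};
```

A parked request keeps its original OpRequest object when it is re-dispatched, while a fresh message gets a new one. Either way the request ends up on the same queue.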

Request Processing

The rough flow:

  • struct OpWQ: public ThreadPool::WorkQueueVal<pair<PGRef, OpRequestRef>, PGRef >
    • src/osd/OSD.h:1101
  • void OSD::OpWQ::_process(PGRef pg, ThreadPool::TPHandle &handle)
    • src/osd/OSD.cc:7604
  • void OSD::dequeue_op(PGRef pg, OpRequestRef op, ThreadPool::TPHandle &handle)
    • src/osd/OSD.cc:7643
  • void ReplicatedPG::do_request(OpRequestRef op, ThreadPool::TPHandle &handle)
    • src/osd/ReplicatedPG.cc:1080
  • void ReplicatedPG::do_op(OpRequestRef op)
    • src/osd/ReplicatedPG.cc:1191
  • void ReplicatedPG::execute_ctx(OpContext *ctx)
    • src/osd/ReplicatedPG.cc:1706
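The worker side of the hand-off might look roughly like the sketch below. ToyOpQueue and process_one are hypothetical names, and the PG and OpRequest types are toys. The sketch shows only one iteration of a worker's loop and leaves out the thread pool, timeouts, and the TPHandle argument.

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

struct OpRequest { std::string name; };
using OpRequestRef = std::shared_ptr<OpRequest>;

// Toy placement group: do_request just records what it was asked to run.
struct PG {
    std::vector<std::string> executed;
    void do_request(const OpRequestRef &op) { executed.push_back(op->name); }
};
using PGRef = std::shared_ptr<PG>;

// Hypothetical queue of (PG, request) pairs, loosely modeled on OpWQ.
class ToyOpQueue {
public:
    void enqueue(PGRef pg, OpRequestRef op) {
        assert(pg != nullptr && op != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        items_.emplace_back(std::move(pg), std::move(op));
    }

    // One iteration of a worker's loop: take the oldest item under the lock,
    // then run it outside the lock so other threads can keep enqueueing.
    // Returns false when there was nothing to do.
    bool process_one() {
        std::pair<PGRef, OpRequestRef> item;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (items_.empty()) return false;
            item = std::move(items_.front());
            items_.pop_front();
        }
        item.first->do_request(item.second);
        return true;
    }

private:
    std::mutex mutex_;
    std::deque<std::pair<PGRef, OpRequestRef>> items_;
};
```

In a real server each worker thread would call something like process_one in a loop and block when the queue is empty. The sketch returns false instead so that it can be driven from a single thread.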

The following sub-trace shows the path to the actual logic behind a RADOS client write operation. All other client operations are found down this path as well: CEPH_OSD_OP_WRITE is one case among all the other client operations in a large switch statement in do_osd_ops.

  • int ReplicatedPG::prepare_transaction(OpContext *ctx)
    • src/osd/ReplicatedPG.cc:5055
  • int ReplicatedPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)
    • src/osd/ReplicatedPG.cc:2921
    • src/osd/ReplicatedPG.cc:3650
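The shape of that switch statement can be illustrated with a toy version. Everything here is an assumption made for illustration: the OpCode enum replaces the real CEPH_OSD_OP_* constants, Transaction is just a list of strings, and the fields of OSDOp and OpContext are invented. The point is only that each case appends to a transaction held by the context instead of touching storage directly.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical opcodes; the real ones are the CEPH_OSD_OP_* constants.
enum class OpCode { Write, Truncate, Unknown };

struct OSDOp {
    OpCode op;
    uint64_t offset;
    std::string data;
};

// Toy transaction: an ordered list of pending mutations, nothing applied yet.
struct Transaction {
    std::vector<std::string> steps;
};

struct OpContext {
    Transaction txn;
};

// Walk the ops in order and record each mutation in the context's transaction.
// Returns 0 on success, or a negative errno for an op the sketch does not know.
int do_osd_ops(OpContext *ctx, const std::vector<OSDOp> &ops) {
    assert(ctx != nullptr);
    for (const OSDOp &op : ops) {
        switch (op.op) {
        case OpCode::Write:
            ctx->txn.steps.push_back("write " + std::to_string(op.data.size()) +
                                     " bytes at " + std::to_string(op.offset));
            break;
        case OpCode::Truncate:
            ctx->txn.steps.push_back("truncate to " + std::to_string(op.offset));
            break;
        default:
            return -EOPNOTSUPP;
        }
    }
    return 0;
}
```

Because a single request can carry several ops, the loop accumulates all of them into one transaction, which is what later gets submitted as a unit.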

The accumulated transaction is submitted in issue_repop, which calls submit_transaction on the configured PGBackend (e.g. replication or erasure coding). The backend communicates with the replicas and also runs the transaction against the local object store.

  • void ReplicatedPG::issue_repop(RepGather *repop, utime_t now)
    • src/osd/ReplicatedPG.cc:6660
  • virtual void submit_transaction(
    • src/osd/PGBackend.h:490

The local object store (e.g. FileStore or BlueStore) is what manages the underlying storage hardware.
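The split between the PG and its backend can be sketched as a small virtual interface. This is a toy model, not the real PGBackend API: the one-argument submit_transaction, ToyReplicatedBackend, and the in-process ObjectStore objects standing in for replicas are all assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy transaction and object store; real stores persist to disk.
struct Transaction { std::vector<std::string> steps; };

struct ObjectStore {
    std::vector<std::string> applied;
    void apply(const Transaction &t) {
        applied.insert(applied.end(), t.steps.begin(), t.steps.end());
    }
};

// Backend interface: the PG hands over a transaction and does not care
// which redundancy strategy carries it out.
struct PGBackend {
    virtual ~PGBackend() = default;
    virtual void submit_transaction(const Transaction &t) = 0;
};

// Hypothetical replicated backend: the same transaction is applied to every
// replica and to the local store. Replicas are in-process stores here,
// not remote OSDs reached over the network.
class ToyReplicatedBackend : public PGBackend {
public:
    ToyReplicatedBackend(ObjectStore *local, std::vector<ObjectStore *> replicas)
        : local_(local), replicas_(std::move(replicas)) {
        assert(local_ != nullptr);
    }

    void submit_transaction(const Transaction &t) override {
        for (ObjectStore *r : replicas_) r->apply(t);
        local_->apply(t);
    }

private:
    ObjectStore *local_;
    std::vector<ObjectStore *> replicas_;
};

// Stand-in for issue_repop: submit through the interface only.
void issue_repop(PGBackend &backend, const Transaction &t) {
    backend.submit_transaction(t);
}
```

Because issue_repop only sees the PGBackend interface, an erasure-coded backend could be swapped in without changing the caller. That is the design choice the trace above points at.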