Adding a new placement group operation in Ceph

In this post we are going to create a librados operation in Ceph that operates at the level of the placement group (most RADOS operations act upon objects). As a demonstration we’ll build an interface that computes the checksum of all object data in a placement group. This probably isn’t useful to anyone, but it exercises a lot of interesting internal machinery. The overall approach is adapted from the code paths used to list objects in a pool. We will simplify the problem by ignoring various error scenarios related to pool state changing during the operation.

The intended usage of the new interface is similar to:

int pg = 0;
for (;;) {
  int checksum;
  int ret = ioctx.pg_checksum(pg, &checksum);
  if (ret < 0) // -ERANGE: no more pgs
    break;
  std::cout << "pg " << pg << ": " << checksum << std::endl;
  pg++;
}

Build the operation #

There is some basic scaffolding that needs to be added on the client side to expose an operation through the librados interface and it includes editing librados.hpp and adding hooks into the IoCtx implementation. Here are the basics:

Add a public interface to librados:

diff --git a/src/include/rados/librados.hpp b/src/include/rados/librados.hpp
index 640e7cf..ea25fca 100644
--- a/src/include/rados/librados.hpp
+++ b/src/include/rados/librados.hpp
@@ -857,6 +857,8 @@ namespace librados
         ObjectCursor *split_start,
         ObjectCursor *split_finish);
 
+    int pg_checksum(int pg, int *checksum);

diff --git a/src/librados/librados.cc b/src/librados/librados.cc
index 55f0a5a..6d2a6f3 100644
--- a/src/librados/librados.cc
+++ b/src/librados/librados.cc
@@ -1773,6 +1773,11 @@ const librados::ObjectIterator& librados::IoCtx::objects_end() const
   return ObjectIterator::__EndObjectIterator;
 }
 
+int librados::IoCtx::pg_checksum(int pg, int *checksum)
+{
+  return io_ctx_impl->pg_checksum(pg, checksum);
+}

The public interface punches through to the implementation:

diff --git a/src/librados/IoCtxImpl.h b/src/librados/IoCtxImpl.h
index be77d8e..e58a57e 100644
--- a/src/librados/IoCtxImpl.h
+++ b/src/librados/IoCtxImpl.h
@@ -222,6 +222,8 @@ struct librados::IoCtxImpl {
   int pool_change_auid(unsigned long long auid);
   int pool_change_auid_async(unsigned long long auid, PoolAsyncCompletionImpl *c);
 
+  int pg_checksum(int pg, int *checksum);

diff --git a/src/librados/IoCtxImpl.cc b/src/librados/IoCtxImpl.cc
index 75f4dee..5f72f0b 100644
--- a/src/librados/IoCtxImpl.cc
+++ b/src/librados/IoCtxImpl.cc
@@ -1098,6 +1098,11 @@ int librados::IoCtxImpl::aio_cancel(AioCompletionImpl *c)
   return objecter->op_cancel(c->tid, -ECANCELED);
 }
 
+int librados::IoCtxImpl::pg_checksum(int pg, int *checksum)
+{
+  std::cout << "call objecter here..." << std::endl << std::flush;
+  return 0;
+}

This last method librados::IoCtxImpl::pg_checksum is where we will add more interesting bits. The Objecter talks to the OSDs and the IoCtx contains the context information for the client connection to a pool. We’ll yank out the necessary information from the IoCtx (like the target pool) and pass it to the Objecter where most of the work will happen.

The new version of librados::IoCtxImpl::pg_checksum passes a condition variable into Objecter::pg_checksum (which we’ll show next) along with any state information, and then waits for the condition variable to signal that the work is complete.

diff --git a/src/librados/IoCtxImpl.cc b/src/librados/IoCtxImpl.cc
index 75f4dee..7228bb6 100644
--- a/src/librados/IoCtxImpl.cc
+++ b/src/librados/IoCtxImpl.cc
@@ -1098,6 +1098,23 @@ int librados::IoCtxImpl::aio_cancel(AioCompletionImpl *c)
   return objecter->op_cancel(c->tid, -ECANCELED);
 }
 
+int librados::IoCtxImpl::pg_checksum(int pg, int *checksum)
+{
+  Cond cond;
+  bool done;
+  int r = 0;
+  Mutex mylock("IoCtxImpl::pg_checksums::mylock");
+
+  objecter->pg_checksum(poolid, snap_seq, oloc.nspace,
+      pg, checksum, new C_SafeCond(&mylock, &cond, &done, &r));
+
+  mylock.Lock();
+  while (!done)
+    cond.Wait(mylock);
+  mylock.Unlock();
+
+  return r;
+}

Before we build the operation to send off to the OSD we should make sure that the target placement group is valid. We do this by querying the OSDMap and comparing the number of placement groups to the target placement group specified by the client:

diff --git a/src/osdc/Objecter.cc b/src/osdc/Objecter.cc
index 8560a26..c0ac883 100644
--- a/src/osdc/Objecter.cc
+++ b/src/osdc/Objecter.cc
@@ -3427,6 +3427,34 @@ uint32_t Objecter::list_nobjects_seek(NListContext *list_context,
   return list_context->current_pg;
 }
 
+void Objecter::pg_checksum(int64_t poolid, int pool_snap_seq, string nspace,
+    int pg, int *checksum, Context *onfinish)
+{
+  // get pool state
+  shared_lock rl(rwlock);
+  const pg_pool_t *pool = osdmap->get_pg_pool(poolid);
+  if (!pool) {
+    rl.unlock();
+    onfinish->complete(-ENOENT);
+    return;
+  }
+  int pg_num = pool->get_pg_num();
+  rl.unlock();
+
+  // sanity check requested pg
+  if (pg >= pg_num) {
+    onfinish->complete(-ERANGE);
+    return;
+  }

Now we construct the operation. The call to ObjectOperation::pg_checksum builds a context for the operation. The C_PGChecksum object is a callback object responsible for handling the response. Finally, Objecter::pg_read is used to submit the operation. It returns immediately, and the calling client continues to block on the condition variable in IoCtxImpl::pg_checksum (see above). The condition is signaled by C_PGChecksum after the operation reply arrives from the target OSD.

+  // build and submit op
+  ObjectOperation op;
+  op.pg_checksum();

+  C_PGChecksum *onack = new C_PGChecksum(onfinish, checksum);

+  object_locator_t oloc(poolid, nspace);
+  pg_read(pg, oloc, op, &onack->bl, 0, onack, &onack->epoch, NULL);
+}
+
 void Objecter::list_nobjects(NListContext *list_context, Context *onfinish)
 {
   ldout(cct, 10) << "list_objects" << dendl;

Here is ObjectOperation::pg_checksum. It adds a new operation and tags it with the ID of the new checksum operation.

diff --git a/src/osdc/Objecter.h b/src/osdc/Objecter.h
index 984fb89..98748fa 100644
--- a/src/osdc/Objecter.h
+++ b/src/osdc/Objecter.h
@@ -215,6 +215,12 @@ struct ObjectOperation {
     flags |= CEPH_OSD_FLAG_PGOP;
   }
 
+  void pg_checksum() {
+    OSDOp& osd_op = add_op(CEPH_OSD_OP_PG_CHECKSUM);
+    (void)osd_op;
+    flags |= CEPH_OSD_FLAG_PGOP;
+  }
+
   void scrub_ls(const librados::object_id_t& start_after,
                uint64_t max_to_get,

This is the callback object for handling the OSD reply. The finish method is called automatically, and if there is no error the return payload is decoded. The final_finish context refers to the context that the calling client is waiting on (back in IoCtxImpl::pg_checksum), and checksum_out is a pointer to integer where the client wants the result stored.

@@ -1427,6 +1433,32 @@ public:
     }
   };
 
+  struct C_PGChecksum : public Context {
+    Context *final_finish;
+    epoch_t epoch;
+    int *checksum_out;
+    bufferlist bl;
+
+    C_PGChecksum(Context *finish, int *checksum) :
+      final_finish(finish), epoch(0), checksum_out(checksum)
+    {}
+
+    void finish(int r) {
+      if (r >= 0) {
+        try {
+          int checksum;
+          bufferlist::iterator it = bl.begin();
+          ::decode(checksum, it);
+          *checksum_out = checksum;
+          r = 0;
+        } catch (buffer::error& e) {
+          r = -EIO;
+        }
+      }
+      final_finish->complete(r);
+    }
+  };
+
   struct C_NList : public Context {
     NListContext *list_context;
     Context *final_finish;

To complete the story on the client side, here is where CEPH_OSD_OP_PG_CHECKSUM is defined:

diff --git a/src/include/rados.h b/src/include/rados.h
index 71d5f27..8c5c3b4 100644
--- a/src/include/rados.h
+++ b/src/include/rados.h
@@ -299,7 +299,8 @@ extern const char *ceph_osd_state_name(int s);
        f(PG_HITSET_GET, __CEPH_OSD_OP(RD, PG, 4),      "pg-hitset-get")    \
        f(PGNLS,        __CEPH_OSD_OP(RD, PG, 5),       "pgnls")            \
        f(PGNLS_FILTER, __CEPH_OSD_OP(RD, PG, 6),       "pgnls-filter")     \
-       f(SCRUBLS, __CEPH_OSD_OP(RD, PG, 7), "scrubls")
+       f(SCRUBLS, __CEPH_OSD_OP(RD, PG, 7), "scrubls")                 \
+       f(PG_CHECKSUM,  __CEPH_OSD_OP(RD, PG, 8),       "pg-checksum")     \
 
 enum {
 #define GENERATE_ENUM_ENTRY(op, opcode, str)   CEPH_OSD_OP_##op = (opcode),

If a client calls IoCtx::pg_checksum at this point it will receive an error from the OSD indicating that an unknown operation has been submitted. Next we’ll fix that.

Handling the operation in an OSD #

Before we dive into the weeds, let’s return a fixed value to make sure everything is wired up correctly. Below we add a handler for the checksum operation and return a fixed value of 543. If everything is working correctly your client will receive this value when calling IoCtx::pg_checksum on any valid placement group.

diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index cce8898..496a7bc 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -1470,6 +1470,14 @@ void ReplicatedPG::do_pg_op(OpRequestRef op)
       result = do_scrub_ls(m, &osd_op);
       break;
 
+   case CEPH_OSD_OP_PG_CHECKSUM:
+      {
+        int checksum = 543;
+        ::encode(checksum, osd_op.outdata);
+        result = 0;
+      }
+      break;
+
     default:
       result = -EINVAL;
       break;

Now let’s get to work implementing the server-side handler for computing the checksum. The first step is to enumerate the set of objects in the placement group. This is accomplished using the PGBackend::objects_list_partial method. This method takes an object as an enumeration starting point, and returns some number of objects that follow the starting point. The checksum integer will be the result of the operation, and osr->flush(); blocks until any queued transactions have completed.

Note that 2nd and 3rd arguments to PGBackend::objects_list_partial specify the min and max number of objects to return. In practice these should be larger than 1, but we keep them this way to exercise the iteration logic.

diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index 496a7bc..d093f67 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -1472,9 +1472,112 @@ void ReplicatedPG::do_pg_op(OpRequestRef op)
 
    case CEPH_OSD_OP_PG_CHECKSUM:
       {
+        int checksum = 0;
+
+        osr->flush();
+        hobject_t current = collection_list_handle_t();
+        for (;;) {
+          hobject_t next;
+          vector<hobject_t> sentries;
+          int r = pgbackend->objects_list_partial(
+              current, 1, 1, &sentries, &next);
+          if (r) {
+            result = -EINVAL;
+            break;
+          }

The way I interpret this next assertion (taken from the standard librados code path for enumerating objects) is that if we are enumerating objects within a snapshot (i.e. read-only) context, then there should be no missing objects. I believe missing objects can show up when the number of placement groups are changed, when there is a failure, and probably some other situations. This topic deserves more research, but for now we can act as though the union of the missing objects and the set of objects returned by PGBackend::objects_list_partial are the set of objects in the placement group.

+          assert(snapid == CEPH_NOSNAP ||
+              pg_log.get_missing().get_items().empty());
+
+          pg_log.resort_missing(get_sort_bitwise());

Here we obtain iterators to the list of objects from PGBackend and the missing set.

+          map<hobject_t, pg_missing_item, hobject_t::ComparatorWithDefault>::const_iterator missing_iter =
+            pg_log.get_missing().get_items().lower_bound(current);
+
+          vector<hobject_t>::iterator ls_iter = sentries.begin();
+          hobject_t _max = hobject_t::get_max();
+

If no objects were returned from the backend and there are no missing objects then we are done.

+
+          if (sentries.empty() && missing_iter == pg_log.get_missing().get_items().end()) {
+            result = 0;
+            break;
+          }
+

This loop processes one object at a time. The first step is to combine the set of existing objects with the set of missing objects. The process here is effectively the last part of a merge sort. The resulting variable candidate holds a hobject_t from one of the two sets.

+          for (;;) {
+            const hobject_t &mcand =
+              missing_iter == pg_log.get_missing().get_items().end() ?
+              _max :
+              missing_iter->first;
+
+            const hobject_t &lcand =
+              ls_iter == sentries.end() ?
+              _max :
+              *ls_iter;
+
+            hobject_t candidate;
+            if (mcand == lcand) {
+              candidate = mcand;
+              if (!mcand.is_max()) {
+                ++ls_iter;
+                ++missing_iter;
+              }
+            } else if (cmp(mcand, lcand, get_sort_bitwise()) < 0) {
+              candidate = mcand;
+              assert(!mcand.is_max());
+              ++missing_iter;
+            } else {
+              candidate = lcand;
+              assert(!lcand.is_max());
+              ++ls_iter;
+            }

The comparison here ends iteration if we go past the end point returned from PGBackend::objects_list_partial, which is probably possible because an arbitrary set of missing objects could sort into the same range.

+
+            if (cmp(candidate, next, get_sort_bitwise()) >= 0) {
+              break;
+            }
+

Objects internal to RADOS like snapdir objects and objects used for metadata are skipped as they are not user-visible.

+            // skip snapdir objects
+            if (candidate.snap == CEPH_SNAPDIR)
+              continue;
+
+            if (candidate.snap < snapid)
+              continue;
+
+            assert(snapid == CEPH_NOSNAP);
+
+            // skip internal namespace
+            if (candidate.get_namespace() == cct->_conf->osd_hit_set_namespace)
+              continue;
+
+            // skip wrong namespace
+            if (m->get_object_locator().nspace != librados::all_nspaces &&
+                candidate.get_namespace() != m->get_object_locator().nspace)
+              continue;
+

Finally we read the object data by calling PGBackend::objects_read_sync.

+            bufferlist bl;
+            int ret = pgbackend->objects_read_sync(
+                candidate, 0, 0, 0, &bl);
+            if (ret < 0) {
+              derr << "checksum read error " << ret << dendl;
+              result = ret;
+              break;
+            }
+

And the checksum is updated by adding together all of the object bytes cast to a numeric value (this is just for testing…). Iteration resumes with current = next, and after iteration is complete the checksum is encoded and returned to the client.

+            // update checksum
+            const size_t length = bl.length();
+            const uint8_t *data = (uint8_t*)bl.c_str();
+            for (size_t offset = 0; offset < length; offset++)
+              checksum += data[offset];
+          }
+          current = next;
+        }
+        ::encode(checksum, osd_op.outdata);
+      }
+      break;

And here is a sample output:

nwatkins@pl1:~/ceph/build$ bin/checksum-test
pg 0 ret 0 cs 126195159
pg 1 ret 0 cs 378485417
pg 2 ret 0 cs 378442111
pg 3 ret 0 cs 252377238
pg 4 ret 0 cs 378579803
pg 5 ret 0 cs 126198934
pg 6 ret 0 cs 126232920
pg 7 ret 0 cs 378636258