Adding a new placement group operation in Ceph
In this post we are going to create a librados operation in Ceph that operates at the level of the placement group (most RADOS operations act upon objects). As a demonstration we’ll build an interface that computes the checksum of all object data in a placement group. This probably isn’t useful to anyone, but it exercises a lot of interesting internal machinery. The overall approach is adapted from the code paths used to list objects in a pool. We will simplify the problem by ignoring various error scenarios related to the pool state changing during the operation.

The intended usage of the new interface is similar to:
int pg = 0;
for (;;) {
  int checksum;
  int ret = ioctx.pg_checksum(pg, &checksum);
  if (ret < 0) // -ERANGE: no more pgs
    break;
  std::cout << "pg " << pg << ": " << checksum << std::endl;
  pg++;
}
Build the operation #
Some basic scaffolding needs to be added on the client side to expose an operation through the librados interface. This includes editing librados.hpp and adding hooks into the IoCtx implementation. Here are the basics.

Add a public interface to librados:
diff --git a/src/include/rados/librados.hpp b/src/include/rados/librados.hpp
index 640e7cf..ea25fca 100644
--- a/src/include/rados/librados.hpp
+++ b/src/include/rados/librados.hpp
@@ -857,6 +857,8 @@ namespace librados
ObjectCursor *split_start,
ObjectCursor *split_finish);
+ int pg_checksum(int pg, int *checksum);
diff --git a/src/librados/librados.cc b/src/librados/librados.cc
index 55f0a5a..6d2a6f3 100644
--- a/src/librados/librados.cc
+++ b/src/librados/librados.cc
@@ -1773,6 +1773,11 @@ const librados::ObjectIterator& librados::IoCtx::objects_end() const
return ObjectIterator::__EndObjectIterator;
}
+int librados::IoCtx::pg_checksum(int pg, int *checksum)
+{
+ return io_ctx_impl->pg_checksum(pg, checksum);
+}
The public interface punches through to the implementation:
diff --git a/src/librados/IoCtxImpl.h b/src/librados/IoCtxImpl.h
index be77d8e..e58a57e 100644
--- a/src/librados/IoCtxImpl.h
+++ b/src/librados/IoCtxImpl.h
@@ -222,6 +222,8 @@ struct librados::IoCtxImpl {
int pool_change_auid(unsigned long long auid);
int pool_change_auid_async(unsigned long long auid, PoolAsyncCompletionImpl *c);
+ int pg_checksum(int pg, int *checksum);
diff --git a/src/librados/IoCtxImpl.cc b/src/librados/IoCtxImpl.cc
index 75f4dee..5f72f0b 100644
--- a/src/librados/IoCtxImpl.cc
+++ b/src/librados/IoCtxImpl.cc
@@ -1098,6 +1098,11 @@ int librados::IoCtxImpl::aio_cancel(AioCompletionImpl *c)
return objecter->op_cancel(c->tid, -ECANCELED);
}
+int librados::IoCtxImpl::pg_checksum(int pg, int *checksum)
+{
+ std::cout << "call objecter here..." << std::endl << std::flush;
+ return 0;
+}
This last method, librados::IoCtxImpl::pg_checksum, is where we will add the more interesting bits. The Objecter talks to the OSDs, and the IoCtx contains the context information for the client connection to a pool. We’ll yank the necessary information out of the IoCtx (like the target pool) and pass it to the Objecter, where most of the work will happen.

The new version of librados::IoCtxImpl::pg_checksum passes a condition variable into Objecter::pg_checksum (which we’ll show next) along with any state information, and then waits for the condition variable to signal that the work is complete.
diff --git a/src/librados/IoCtxImpl.cc b/src/librados/IoCtxImpl.cc
index 75f4dee..7228bb6 100644
--- a/src/librados/IoCtxImpl.cc
+++ b/src/librados/IoCtxImpl.cc
@@ -1098,6 +1098,23 @@ int librados::IoCtxImpl::aio_cancel(AioCompletionImpl *c)
return objecter->op_cancel(c->tid, -ECANCELED);
}
+int librados::IoCtxImpl::pg_checksum(int pg, int *checksum)
+{
+ Cond cond;
+ bool done = false;
+ int r = 0;
+ Mutex mylock("IoCtxImpl::pg_checksum::mylock");
+
+ objecter->pg_checksum(poolid, snap_seq, oloc.nspace,
+ pg, checksum, new C_SafeCond(&mylock, &cond, &done, &r));
+
+ mylock.Lock();
+ while (!done)
+ cond.Wait(mylock);
+ mylock.Unlock();
+
+ return r;
+}
Before we build the operation to send off to the OSD, we should make sure that the target placement group is valid. We do this by querying the OSDMap and comparing the placement group specified by the client against the pool’s placement group count:
diff --git a/src/osdc/Objecter.cc b/src/osdc/Objecter.cc
index 8560a26..c0ac883 100644
--- a/src/osdc/Objecter.cc
+++ b/src/osdc/Objecter.cc
@@ -3427,6 +3427,34 @@ uint32_t Objecter::list_nobjects_seek(NListContext *list_context,
return list_context->current_pg;
}
+void Objecter::pg_checksum(int64_t poolid, int pool_snap_seq, string nspace,
+ int pg, int *checksum, Context *onfinish)
+{
+ // get pool state
+ shared_lock rl(rwlock);
+ const pg_pool_t *pool = osdmap->get_pg_pool(poolid);
+ if (!pool) {
+ rl.unlock();
+ onfinish->complete(-ENOENT);
+ return;
+ }
+ int pg_num = pool->get_pg_num();
+ rl.unlock();
+
+ // sanity check requested pg
+ if (pg >= pg_num) {
+ onfinish->complete(-ERANGE);
+ return;
+ }
Now we construct and submit the operation. The call to ObjectOperation::pg_checksum adds our new op to the compound operation. The C_PGChecksum object is a callback responsible for handling the response. Finally, Objecter::pg_read is used to submit the operation. It returns immediately, and the calling client continues to block on the condition variable in IoCtxImpl::pg_checksum (see above). The condition is signaled by C_PGChecksum after the operation reply arrives from the target OSD.
+ // build and submit op
+ ObjectOperation op;
+ op.pg_checksum();
+ C_PGChecksum *onack = new C_PGChecksum(onfinish, checksum);
+ object_locator_t oloc(poolid, nspace);
+ pg_read(pg, oloc, op, &onack->bl, 0, onack, &onack->epoch, NULL);
+}
+
void Objecter::list_nobjects(NListContext *list_context, Context *onfinish)
{
ldout(cct, 10) << "list_objects" << dendl;
Here is ObjectOperation::pg_checksum. It adds a new op tagged with the opcode of the new checksum operation.
diff --git a/src/osdc/Objecter.h b/src/osdc/Objecter.h
index 984fb89..98748fa 100644
--- a/src/osdc/Objecter.h
+++ b/src/osdc/Objecter.h
@@ -215,6 +215,12 @@ struct ObjectOperation {
flags |= CEPH_OSD_FLAG_PGOP;
}
+ void pg_checksum() {
+ OSDOp& osd_op = add_op(CEPH_OSD_OP_PG_CHECKSUM);
+ (void)osd_op;
+ flags |= CEPH_OSD_FLAG_PGOP;
+ }
+
void scrub_ls(const librados::object_id_t& start_after,
uint64_t max_to_get,
This is the callback object for handling the OSD reply. The finish method is called automatically, and if there is no error the return payload is decoded. The final_finish context is the context that the calling client is waiting on (back in IoCtxImpl::pg_checksum), and checksum_out is a pointer to the integer where the client wants the result stored.
@@ -1427,6 +1433,32 @@ public:
}
};
+ struct C_PGChecksum : public Context {
+ Context *final_finish;
+ epoch_t epoch;
+ int *checksum_out;
+ bufferlist bl;
+
+ C_PGChecksum(Context *finish, int *checksum) :
+ final_finish(finish), epoch(0), checksum_out(checksum)
+ {}
+
+ void finish(int r) {
+ if (r >= 0) {
+ try {
+ int checksum;
+ bufferlist::iterator it = bl.begin();
+ ::decode(checksum, it);
+ *checksum_out = checksum;
+ r = 0;
+ } catch (buffer::error& e) {
+ r = -EIO;
+ }
+ }
+ final_finish->complete(r);
+ }
+ };
+
struct C_NList : public Context {
NListContext *list_context;
Context *final_finish;
To complete the story on the client side, here is where CEPH_OSD_OP_PG_CHECKSUM is defined:
diff --git a/src/include/rados.h b/src/include/rados.h
index 71d5f27..8c5c3b4 100644
--- a/src/include/rados.h
+++ b/src/include/rados.h
@@ -299,7 +299,8 @@ extern const char *ceph_osd_state_name(int s);
f(PG_HITSET_GET, __CEPH_OSD_OP(RD, PG, 4), "pg-hitset-get") \
f(PGNLS, __CEPH_OSD_OP(RD, PG, 5), "pgnls") \
f(PGNLS_FILTER, __CEPH_OSD_OP(RD, PG, 6), "pgnls-filter") \
- f(SCRUBLS, __CEPH_OSD_OP(RD, PG, 7), "scrubls")
+ f(SCRUBLS, __CEPH_OSD_OP(RD, PG, 7), "scrubls") \
+ f(PG_CHECKSUM, __CEPH_OSD_OP(RD, PG, 8), "pg-checksum") \
enum {
#define GENERATE_ENUM_ENTRY(op, opcode, str) CEPH_OSD_OP_##op = (opcode),
If a client calls IoCtx::pg_checksum at this point, it will receive an error from the OSD indicating that an unknown operation was submitted. Next we’ll fix that.
Handling the operation in an OSD #
Before we dive into the weeds, let’s return a fixed value to make sure everything is wired up correctly. Below we add a handler for the checksum operation that returns a fixed value of 543. If everything is working correctly, your client will receive this value when calling IoCtx::pg_checksum on any valid placement group.
diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index cce8898..496a7bc 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -1470,6 +1470,14 @@ void ReplicatedPG::do_pg_op(OpRequestRef op)
result = do_scrub_ls(m, &osd_op);
break;
+ case CEPH_OSD_OP_PG_CHECKSUM:
+ {
+ int checksum = 543;
+ ::encode(checksum, osd_op.outdata);
+ result = 0;
+ }
+ break;
+
default:
result = -EINVAL;
break;
Now let’s get to work implementing the server-side handler that computes the checksum. The first step is to enumerate the set of objects in the placement group. This is accomplished using the PGBackend::objects_list_partial method, which takes an object as an enumeration starting point and returns some number of objects that follow it. The checksum integer will hold the result of the operation, and osr->flush() blocks until any queued transactions have completed.

Note that the 2nd and 3rd arguments to PGBackend::objects_list_partial specify the minimum and maximum number of objects to return. In practice these should be larger than 1, but we keep them at 1 to exercise the iteration logic.
diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index 496a7bc..d093f67 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -1472,9 +1472,112 @@ void ReplicatedPG::do_pg_op(OpRequestRef op)
case CEPH_OSD_OP_PG_CHECKSUM:
{
+ int checksum = 0;
+
+ osr->flush();
+ hobject_t current = collection_list_handle_t();
+ for (;;) {
+ hobject_t next;
+ vector<hobject_t> sentries;
+ int r = pgbackend->objects_list_partial(
+ current, 1, 1, &sentries, &next);
+ if (r) {
+ result = -EINVAL;
+ break;
+ }
The way I interpret the next assertion (taken from the standard librados code path for enumerating objects) is that if we are enumerating objects within a snapshot (i.e. read-only) context, then there should be no missing objects. I believe missing objects can show up when the number of placement groups is changed, when there is a failure, and probably in some other situations. This topic deserves more research, but for now we can act as though the union of the missing objects and the set of objects returned by PGBackend::objects_list_partial is the set of objects in the placement group.
+ assert(snapid == CEPH_NOSNAP ||
+ pg_log.get_missing().get_items().empty());
+
+ pg_log.resort_missing(get_sort_bitwise());
Here we obtain iterators to the list of objects from PGBackend and to the missing set.
+ map<hobject_t, pg_missing_item, hobject_t::ComparatorWithDefault>::const_iterator missing_iter =
+ pg_log.get_missing().get_items().lower_bound(current);
+
+ vector<hobject_t>::iterator ls_iter = sentries.begin();
+ hobject_t _max = hobject_t::get_max();
+
If no objects were returned from the backend and there are no missing objects, then we are done.
+
+ if (sentries.empty() && missing_iter == pg_log.get_missing().get_items().end()) {
+ result = 0;
+ break;
+ }
+
This loop processes one object at a time. The first step is to combine the set of existing objects with the set of missing objects; the process here is effectively the last step of a merge sort. The resulting variable candidate holds an hobject_t drawn from one of the two sets.
+ for (;;) {
+ const hobject_t &mcand =
+ missing_iter == pg_log.get_missing().get_items().end() ?
+ _max :
+ missing_iter->first;
+
+ const hobject_t &lcand =
+ ls_iter == sentries.end() ?
+ _max :
+ *ls_iter;
+
+ hobject_t candidate;
+ if (mcand == lcand) {
+ candidate = mcand;
+ if (!mcand.is_max()) {
+ ++ls_iter;
+ ++missing_iter;
+ }
+ } else if (cmp(mcand, lcand, get_sort_bitwise()) < 0) {
+ candidate = mcand;
+ assert(!mcand.is_max());
+ ++missing_iter;
+ } else {
+ candidate = lcand;
+ assert(!lcand.is_max());
+ ++ls_iter;
+ }
The comparison here ends iteration if we move past the end point returned by PGBackend::objects_list_partial, which is possible because an arbitrary set of missing objects could sort into the same range.
+
+ if (cmp(candidate, next, get_sort_bitwise()) >= 0) {
+ break;
+ }
+
Objects internal to RADOS, like snapdir objects and objects used for metadata, are skipped because they are not user-visible.
+ // skip snapdir objects
+ if (candidate.snap == CEPH_SNAPDIR)
+ continue;
+
+ if (candidate.snap < snapid)
+ continue;
+
+ assert(snapid == CEPH_NOSNAP);
+
+ // skip internal namespace
+ if (candidate.get_namespace() == cct->_conf->osd_hit_set_namespace)
+ continue;
+
+ // skip wrong namespace
+ if (m->get_object_locator().nspace != librados::all_nspaces &&
+ candidate.get_namespace() != m->get_object_locator().nspace)
+ continue;
+
Finally, we read the object data by calling PGBackend::objects_read_sync.
+ bufferlist bl;
+ int ret = pgbackend->objects_read_sync(
+ candidate, 0, 0, 0, &bl);
+ if (ret < 0) {
+ derr << "checksum read error " << ret << dendl;
+ result = ret;
+ break;
+ }
+
The checksum is updated by adding together all of the object bytes, each treated as a numeric value (this is just for testing…). Iteration resumes with current = next, and after iteration is complete the checksum is encoded and returned to the client.
+ // update checksum
+ const size_t length = bl.length();
+ const uint8_t *data = (uint8_t*)bl.c_str();
+ for (size_t offset = 0; offset < length; offset++)
+ checksum += data[offset];
+ }
+ current = next;
+ }
+ ::encode(checksum, osd_op.outdata);
+ }
+ break;
And here is some sample output:
nwatkins@pl1:~/ceph/build$ bin/checksum-test
pg 0 ret 0 cs 126195159
pg 1 ret 0 cs 378485417
pg 2 ret 0 cs 378442111
pg 3 ret 0 cs 252377238
pg 4 ret 0 cs 378579803
pg 5 ret 0 cs 126198934
pg 6 ret 0 cs 126232920
pg 7 ret 0 cs 378636258