Toward dynamic RADOS object class management

Standard object classes in RADOS are managed using a static versioning and distribution scheme, but this may be restrictive for dynamically defined interfaces. In this post we describe a proof-of-concept implementation for dynamically managing object interfaces.

The code described in this post is a work-in-progress and is maintained in a separate branch: https://github.com/dotnwat/ceph/tree/objclass-manager.

Introduction

In a previous post I described how dynamic Lua object classes can be managed in the file system, extending the existing workflow for object class development to encompass object classes written in Lua and installed on a local file system. While this method is familiar to developers, it isn’t the only way to manage dynamic interfaces. We envision a centralized system for interface management that handles persistence, distribution, and version management.

There are many ways such a manager could be constructed. For instance, object classes could be managed in a custom object class that would provide persistence. However, since object class definitions must be distributed to all OSD processes in the cluster, additional functionality for retrieving these definitions and keeping them consistent across the cluster would need to be introduced. Another approach would be to make use of a data structure that is already distributed across OSD processes, such as the OSDMap or the structure that encodes pool information (pg_pool_t). In this post we’ll explore a proof-of-concept interface manager service that uses the pg_pool_t structure to manage object class definitions. In particular we will focus on the mechanics of distributing classes. The semantics of versioning will be left for a future post.

Interface State Management

The pg_pool_t object holds metadata about a RADOS pool, including its replication or erasure coding settings, cache mode, and snapshots. Our prototype begins with the addition of a string that will hold a Lua script. Future versions will likely need more complex management features that may require an entirely new set of managed structures, but in the interest of simplicity a single Lua script is the only new state introduced for the prototype.

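The encode and decode methods serialize and deserialize the structure for saving to disk or transmitting across the network. Because the encoding gains a field, the patch bumps the struct version from 24 to 25 and decodes lua_script only when struct_v >= 25, so encodings written by older versions still decode.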

diff --git a/src/osd/osd_types.cc b/src/osd/osd_types.cc
index de80857..32b25f8 100644
--- a/src/osd/osd_types.cc
+++ b/src/osd/osd_types.cc
@@ -1429,7 +1429,7 @@ void pg_pool_t::encode(bufferlist& bl, uint64_t features) const
     return;
   }
 
-  ENCODE_START(24, 5, bl);
+  ENCODE_START(25, 5, bl);
   ::encode(type, bl);
   ::encode(size, bl);
   ::encode(crush_ruleset, bl);
@@ -1478,12 +1478,13 @@ void pg_pool_t::encode(bufferlist& bl, uint64_t features) const
   ::encode(hit_set_grade_decay_rate, bl);
   ::encode(hit_set_search_last_n, bl);
   ::encode(opts, bl);
+  ::encode(lua_script, bl);
   ENCODE_FINISH(bl);
 }
 
 void pg_pool_t::decode(bufferlist::iterator& bl)
 {
-  DECODE_START_LEGACY_COMPAT_LEN(24, 5, 5, bl);
+  DECODE_START_LEGACY_COMPAT_LEN(25, 5, 5, bl);
   ::decode(type, bl);
   ::decode(size, bl);
   ::decode(crush_ruleset, bl);
@@ -1625,6 +1626,9 @@ void pg_pool_t::decode(bufferlist::iterator& bl)
   if (struct_v >= 24) {
     ::decode(opts, bl);
   }
+  if (struct_v >= 25) {
+    ::decode(lua_script, bl);
+  }
   DECODE_FINISH(bl);
   calc_pg_masks();
   calc_grade_table();
diff --git a/src/osd/osd_types.h b/src/osd/osd_types.h
index cb71218..be435dc 100644
--- a/src/osd/osd_types.h
+++ b/src/osd/osd_types.h
@@ -1223,6 +1223,8 @@ struct pg_pool_t {
 
   pool_opts_t opts; ///< options
 
+  string lua_script;
+
 private:
   vector<uint32_t> grade_table;

This structure is distributed across the cluster on a per-pool basis and provides pool metadata to OSD processes. The addition of the Lua script to the structure extends that distribution mechanism to our additional state. Next we’ll see how to update that state from a client.
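The version-guarded decode in the patch above can be illustrated with a small stand-alone model. This is a minimal sketch in Python 3, not Ceph's actual wire format: the header layout (version byte, compat byte, payload length) and the helper names enc_str, dec_str, encode_pool, and decode_pool are assumptions made for illustration only.

```python
import struct

def enc_str(s):
    """Length-prefixed UTF-8 string (illustrative framing, not Ceph's exact format)."""
    data = s.encode("utf-8")
    return struct.pack("<I", len(data)) + data

def dec_str(buf, off):
    """Decode a length-prefixed string at off; return (string, new offset)."""
    (n,) = struct.unpack_from("<I", buf, off)
    off += 4
    return buf[off:off + n].decode("utf-8"), off + n

def encode_pool(opts, lua_script, struct_v=25, compat_v=5):
    """Encode a toy pool record; lua_script is appended only from version 25 on."""
    payload = enc_str(opts)
    if struct_v >= 25:
        payload += enc_str(lua_script)
    return struct.pack("<BBI", struct_v, compat_v, len(payload)) + payload

def decode_pool(buf, supported_v=25):
    """Decode with a decoder that understands versions up to supported_v."""
    struct_v, compat_v, _length = struct.unpack_from("<BBI", buf, 0)
    if compat_v > supported_v:
        raise ValueError("encoding is too new for this decoder")
    off = struct.calcsize("<BBI")
    pool = {"opts": "", "lua_script": ""}
    pool["opts"], off = dec_str(buf, off)
    # Only read the new field when both the encoding and the decoder know it.
    # An older decoder simply ignores the trailing bytes of the payload.
    if struct_v >= 25 and supported_v >= 25:
        pool["lua_script"], off = dec_str(buf, off)
    return pool

if __name__ == "__main__":
    blob = encode_pool("some opts", "function test() end")
    print(decode_pool(blob))                  # new decoder sees the script
    print(decode_pool(blob, supported_v=24))  # old decoder skips it
```

The point of the sketch is that appending the field at the end and guarding it by version lets old and new daemons exchange the structure during an upgrade.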

Installing an Interface

Now that pg_pool_t has the new state that will hold the Lua script, we need a way to update it. Since the monitor coordinates updates to this structure through the OSD service, we’ll piggyback on this infrastructure to make updates to the installed Lua script.

Below we add a new monitor command, ceph osd pool set-class <pool> <class> <script>, which updates the pg_pool_t structure for the given pool using a round of Paxos, and ensures the update is propagated to each OSD in the cluster.

First is some metadata describing the new command and its parameters:

diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h
index 8d09f91..2dff5ac 100644
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -676,6 +676,11 @@ COMMAND("osd pool get " \
        "name=pool,type=CephPoolname " \
        "name=var,type=CephChoices,strings=size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|nodelete|nopgchange|nosizechange|write_fadvise_dontneed|noscrub|nodeep-scrub|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|auid|target_max_objects|target_max_bytes|cache_target_dirty_ratio|cache_target_dirty_high_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|erasure_code_profile|min_read_recency_for_promote|all|min_write_recency_for_promote|fast_read|hit_set_grade_decay_rate|hit_set_search_last_n|scrub_min_interval|scrub_max_interval|deep_scrub_interval|recovery_priority|recovery_op_priority", \
        "get pool parameter <var>", "osd", "r", "cli,rest")
+COMMAND("osd pool set-class " \
+       "name=pool,type=CephPoolname " \
+       "name=class,type=CephString " \
+       "name=script,type=CephString", \
+       "set pool object class <class> to <script>", "osd", "rw", "cli,rest")
 COMMAND("osd pool set " \
        "name=pool,type=CephPoolname " \
        "name=var,type=CephChoices,strings=size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|nodelete|nopgchange|nosizechange|write_fadvise_dontneed|noscrub|nodeep-scrub|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|use_gmt_hitset|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_dirty_high_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid|min_read_recency_for_promote|min_write_recency_for_promote|fast_read|hit_set_grade_decay_rate|hit_set_search_last_n|scrub_min_interval|scrub_max_interval|deep_scrub_interval|recovery_priority|recovery_op_priority " \

Now for the implementation of the command. The first thing we do is extract and validate the three input parameters:

diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index ee5b51e..fce7567 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -7310,6 +7310,45 @@ bool OSDMonitor::prepare_command_impl(MonOpRequestRef op,
     wait_for_finished_proposal(op, new Monitor::C_Command(mon, op, 0, ss.str(),
                          get_last_committed() + 1));
     return true;
+  } else if (prefix == "osd pool set-class") {
+
+    string poolstr;
+    cmd_getval(g_ceph_context, cmdmap, "pool", poolstr);
+    int64_t pool_id = osdmap.lookup_pg_pool_name(poolstr);
+    if (pool_id < 0) {
+      ss << "unrecognized pool '" << poolstr << "'";
+      err = -ENOENT;
+      goto reply;
+    }
+
+    string clsname;
+    cmd_getval(g_ceph_context, cmdmap, "class", clsname);
+    if (clsname == "") {
+      ss << "invalid class name '" << clsname << "'";
+      err = -EINVAL;
+      goto reply;
+    }
+
+    string script;
+    cmd_getval(g_ceph_context, cmdmap, "script", script);
+    if (script == "") {
+      ss << "invalid script <<< clipped >>>";
+      err = -EINVAL;
+      goto reply;
+    }

Next we update the pg_pool_t structure and wait for the Paxos proposal carrying the update to complete:

+    pg_pool_t *p = pending_inc.get_new_pool(pool_id,
+        osdmap.get_pg_pool(pool_id));
+
+    p->lua_script = script;
+
+    ss << "set-class " << clsname << " (ignored) = <<< clipped >>> for pool " << poolstr;
+
+    rs = ss.str();
+    wait_for_finished_proposal(op, new Monitor::C_Command(mon, op, 0, rs,
+                         get_last_committed() + 1));
+    return true;
+
   } else if (prefix == "osd pool set-quota") {
     string poolstr;
     cmd_getval(g_ceph_context, cmdmap, "pool", poolstr);
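The validation order in the handler above can be restated compactly. The following is only a Python 3 model of the checks, assuming a hypothetical check_set_class helper and a plain dict standing in for the OSDMap pool lookup. It is not monitor code, and the success message is paraphrased.

```python
import errno

def check_set_class(pools, pool, clsname, script):
    """Mirror the handler's checks; return (err, message).

    pools is a hypothetical name -> id mapping standing in for the OSDMap.
    """
    if pool not in pools:
        return -errno.ENOENT, "unrecognized pool '%s'" % pool
    if clsname == "":
        return -errno.EINVAL, "invalid class name '%s'" % clsname
    if script == "":
        return -errno.EINVAL, "invalid script"
    # In the prototype the class name is accepted but not stored.
    return 0, "set-class %s (ignored) for pool %s" % (clsname, pool)

if __name__ == "__main__":
    pools = {"rbd": 0}
    print(check_set_class(pools, "rbd", "lua", "function test() end"))
    print(check_set_class(pools, "nope", "lua", "function test() end"))
```

Only when all three checks pass does the real handler stage the change in the pending incremental and wait for the proposal to commit.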

And that is it for introducing the new state. The system will automatically distribute the updated state across the cluster. We can give this a test from the Ceph CLI and provide a string containing a Lua script:

[nwatkins@kyoto src]$ ./ceph osd pool set rbd lua_script "function test() end"
set pool 0 lua_script to <<< Lua script clipped >>>

The monitor command infrastructure normally echoes the value being set, but we clip it because the Lua script may be large. The CLI is useful, but it is also nice to have a programmatic interface. We can construct the monitor request using the RADOS mon_command API, which takes a JSON-formatted string that encodes the monitor command. Here it is in Python:

import json

def set_lua_script(rados, pool, script, timeout=30):
    # rados is a connected rados.Rados handle; the command is sent as JSON
    cmd = {
      "prefix": "osd pool set",
      "pool":   pool,
      "var":    "lua_script",
      "val":    script,
    }
    return rados.mon_command(json.dumps(cmd), '', timeout)

We’ve wrapped it up in a simple script called ocm_set.py (ocm is short for object class manager), which gives us programmatic access from Python via rados.py. The mon_command call is also exposed through the C and C++ RADOS APIs, so the same access is available from those languages by constructing the required JSON. The tool also accepts - as the script parameter, in which case the script is read from standard input. Here is an example using the ocm_set.py tool:

[nwatkins@kyoto src]$ ./ocm_set.py rbd "function test() end"
(0, '', u'set pool 0 lua_script to <<< Lua script clipped >>>')
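The source of ocm_set.py is not shown in this post, so the following is only a guess at its argument handling, written as a Python 3 sketch that needs no cluster. The helper names load_script and build_command are hypothetical. The JSON shape is copied from the set_lua_script function above.

```python
import io
import json
import sys

def load_script(arg, stdin=sys.stdin):
    """Return the script text; '-' means read it from standard input."""
    return stdin.read() if arg == "-" else arg

def build_command(pool, script):
    """Build the JSON string that would be handed to mon_command."""
    return json.dumps({
        "prefix": "osd pool set",
        "pool": pool,
        "var": "lua_script",
        "val": script,
    })

if __name__ == "__main__":
    # Simulate `cat script.lua | ocm_set.py rbd -` with an in-memory stream.
    fake_stdin = io.StringIO("function test() end")
    print(build_command("rbd", load_script("-", fake_stdin)))
```

Keeping command construction separate from the mon_command call makes the JSON easy to inspect before it is sent to a monitor.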

Adding the Lua script to the pg_pool_t is sufficient for distributing the script to the servers in the cluster, but at this point the string representing the Lua script isn’t actually accessed. Next we need to wire it up to the Lua object class implementation cls_lua so we can use the ioctx::exec interface.

Access Installed Lua Script in OSD

When a client invokes the exec RADOS interface it generates a CEPH_OSD_OP_CALL operation to be executed within the OSD. The first relevant stop along the execution path is OSD::init_op_flags, which quickly examines the operation and extracts information used during the rest of its execution.

With respect to the CEPH_OSD_OP_CALL operation, the OSD::init_op_flags method does two things. First, it ensures that the shared library that implements the object class is loaded and checks to make sure that the method is available.

ClassHandler::ClassData *cls;
int r = class_handler->open_class(cname, &cls);
if (r) {
  derr << "class " << cname << " open got " << cpp_strerror(r) << dendl;
  if (r == -ENOENT)
    r = -EOPNOTSUPP;
  else
    r = -EIO;
  return r;
}

The second task is to mark the operation with read/write flags that are used by the OSD when executing the operation. Normal RADOS operations have these flags hard coded, but object class methods are loaded at runtime. The flags are normally extracted and saved before proceeding:

int flags = cls->get_method_flags(mname.c_str());
if (flags < 0) {
  if (flags == -ENOENT)
    r = -EOPNOTSUPP;
  else
    r = flags;
  return r;
}

is_read = flags & CLS_METHOD_RD;
is_write = flags & CLS_METHOD_WR;
bool is_promote = flags & CLS_METHOD_PROMOTE;
...

In this proof-of-concept we will be reusing the cls_lua object class as the target class when invoking the exec RADOS method, so the default behavior above is correct—if the Lua object class isn’t available an error is returned. However, there are issues to consider when ensuring that the target object class method exists.

As an aside, the way the current cls_lua interface works is that there is exactly one method, lua.execute, which multiplexes user requests. That is, the actual method to call and the script are packaged up as an input parameter. This works well, but it would also be nice to let users invoke scripts without a wrapper to handle input parameters, so that something such as lua.my_method_call could be called with the vanilla RADOS API. This means that when cls->get_method_flags(mname.c_str()) is called (above), the method may not exist on the statically defined and loaded cls_lua object class, but the target method may still be defined within the Lua script that exists in pg_pool_t.

Phew… OK the change is simple and I added a comment below to explain how it works.

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 7425cc7..0ed9bca 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -8740,6 +8740,23 @@ int OSD::init_op_flags(OpRequestRef& op)
    bp.copy(iter->op.cls.class_len, cname);
    bp.copy(iter->op.cls.method_len, mname);
 
+        /*
+         * Notes on handling Lua classes distributed via pg_pool_t:
+         *
+         *   Currently it is required that all scripts use the class name
+         *   "lua" which means that `open_class` below always works and we
+         *   virtualize on the method name. The ability to use Lua scripts
+         *   loaded from the file system introduced the notion of Lua backed
+         *   classes removing the requirement that "lua" class always be used.
+         *
+         *   The definition of Lua classes is held in pg_pool_t::lua_classes,
+         *   but it isn't clear how to safely reach into that structure at
+         *   this point in the code path.
+         *
+         *   TODO: find out what sort of synchronization issues arise when
+         *   reaching into pg_pool_t at this point. If safe then extract the
+         *   class name and script and patch into the ClassHandler.
+         */
    ClassHandler::ClassData *cls;
    int r = class_handler->open_class(cname, &cls);
    if (r) {
@@ -8751,13 +8768,35 @@ int OSD::init_op_flags(OpRequestRef& op)
      return r;
    }
    int flags = cls->get_method_flags(mname.c_str());
-   if (flags < 0) {
-     if (flags == -ENOENT)
-       r = -EOPNOTSUPP;
-     else
-       r = flags;
-     return r;
-   }
+        if (flags < 0) {
+          /*
+           * If the method isn't found and the Lua class is being invoked
+           * we'll attempt to perform late binding during execution of the Lua
+           * script. This means that the static methods in `cls_lua` become
+           * reserved:
+           *
+           *   - eval_msgpack
+           *   - eval_json
+           *   - eval_bufferlist
+           *
+           * TODO: there is currently not a method for extracting operation
+           * flags from dynamically defined interfaces so we patch the flags
+           * to be conservative and cover all our bases.
+           *
+           * TODO: since those methods are only referenced by a wrapper
+           * library they could be slightly obfuscated to make a name
+           * collision more unlikely.
+           */
+          if (flags == -ENOENT && cname == "lua")
+            flags = CLS_METHOD_RD | CLS_METHOD_WR;
+          else {
+            if (flags == -ENOENT)
+              r = -EOPNOTSUPP;
+            else
+              r = flags;
+            return r;
+          }
+        }
    is_read = flags & CLS_METHOD_RD;
    is_write = flags & CLS_METHOD_WR;
         bool is_promote = flags & CLS_METHOD_PROMOTE;

Note that it is possible to fully virtualize things so that Lua scripts don’t have to use the lua object class as a target for exec calls. This is a relatively simple fix, but some more investigation is needed to look at what consistency guarantees are made accessing pg_pool_t in OSD::init_op_flags. Anyway…
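The flag-resolution logic in the patch above boils down to a small decision table. Here is a Python 3 model of it, assuming a hypothetical resolve_flags helper and a dict of statically registered methods. The numeric flag values are illustrative stand-ins for the constants in the Ceph headers.

```python
import errno

# Illustrative values; the real constants live in the Ceph object class headers.
CLS_METHOD_RD = 0x1
CLS_METHOD_WR = 0x2

def resolve_flags(cname, mname, static_methods):
    """Return method flags, or a negative errno, following the patched logic.

    static_methods maps class name -> {method name: flags} and stands in
    for the ClassHandler registry.
    """
    methods = static_methods.get(cname)
    if methods is None:
        # open_class failed with ENOENT: the class itself is unknown.
        return -errno.EOPNOTSUPP
    if mname in methods:
        return methods[mname]
    if cname == "lua":
        # Unknown method on the lua class: defer to late binding and be
        # conservative by marking the operation as both read and write.
        return CLS_METHOD_RD | CLS_METHOD_WR
    return -errno.EOPNOTSUPP

if __name__ == "__main__":
    registry = {
        "hello": {"say_hello": CLS_METHOD_RD},
        "lua": {"eval_json": CLS_METHOD_RD | CLS_METHOD_WR},
    }
    print(resolve_flags("lua", "upper", registry))
    print(resolve_flags("hello", "missing", registry))
```

Marking every late-bound method as read and write is safe but pessimistic, which is why the patch carries a TODO about extracting flags from dynamically defined interfaces.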

Now we can move on to the salient portion of the execution of the CEPH_OSD_OP_CALL operation, performed in ReplicatedPG::do_osd_ops. The first step is to get a reference to the object class. Notice that it is a bug if the class isn’t found, since we just saw that the pre-processing of the operation in OSD::init_op_flags already performed this lookup.

ClassHandler::ClassData *cls;
result = osd->class_handler->open_class(cname, &cls);
assert(result == 0);

The next task is to grab a reference to the actual method being invoked. Notice below that in the unmodified OSD if the method doesn’t exist we return an error to the client.

ClassHandler::ClassMethod *method = cls->get_method(mname.c_str());
if (!method) {
  dout(10) << "call method " << cname << "." << mname << " does not exist" << dendl;
  result = -EOPNOTSUPP;
  break;
}

Previously when I showed how the operation flags were extracted we had to handle the case that the method didn’t exist. We need to do the same thing here. Here is the patch that supports late binding the method. Notice that the input is rewritten to conform to the structured input that eval_bufferlist expects. Cool!

diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index 27c7244..5703a8d 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -4231,11 +4231,29 @@ int ReplicatedPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)
    assert(result == 0);   // init_op_flags() already verified this works.
 
    ClassHandler::ClassMethod *method = cls->get_method(mname.c_str());
-   if (!method) {
-     dout(10) << "call method " << cname << "." << mname << " does not exist" << dendl;
-     result = -EOPNOTSUPP;
-     break;
-   }
+        if (!method) {
+          /*
+           * If the named method doesn't exist and the target object class is
+           * `cls_lua` then we patch this call with the Lua script stored in
+           * `pg_pool_t` and allow late binding of the referenced method with
+           * the script.
+           */
+          if (cname == "lua") {
+            method = cls->get_method("eval_bufferlist");
+            if (method) {
+              bufferlist tmp_indata;
+              ::encode(pool.info.lua_script, tmp_indata);
+              ::encode(mname, tmp_indata);
+              ::encode(indata, tmp_indata);
+              indata = tmp_indata;
+            }
+          }
+          if (!method) {
+            dout(10) << "call method " << cname << "." << mname << " does not exist" << dendl;
+            result = -EOPNOTSUPP;
+            break;
+          }
+        }
 
    int flags = method->get_flags();
    if (flags & CLS_METHOD_WR)
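To make the input rewriting concrete, here is a Python 3 model of packing the script, the method name, and the original input into one buffer and unpacking it again. It assumes each item is framed as a 32-bit little-endian length followed by the bytes, which is a simplification and may not match Ceph's encoding byte for byte. The helper names wrap_indata and unwrap_indata are hypothetical.

```python
import struct

def enc_bytes(data):
    """Frame a byte string as a little-endian u32 length plus the bytes."""
    return struct.pack("<I", len(data)) + data

def dec_bytes(buf, off):
    """Read one framed byte string at off; return (bytes, new offset)."""
    (n,) = struct.unpack_from("<I", buf, off)
    off += 4
    return buf[off:off + n], off + n

def wrap_indata(script, mname, indata):
    """Rewrite the client input as (script, method name, original input)."""
    return (enc_bytes(script.encode("utf-8"))
            + enc_bytes(mname.encode("utf-8"))
            + enc_bytes(indata))

def unwrap_indata(buf):
    """What the receiving evaluator would do before running the handler."""
    script, off = dec_bytes(buf, 0)
    mname, off = dec_bytes(buf, off)
    indata, off = dec_bytes(buf, off)
    return script.decode("utf-8"), mname.decode("utf-8"), indata

if __name__ == "__main__":
    packed = wrap_indata("function upper(i, o) end", "upper", b"some input")
    print(unwrap_indata(packed))
```

Because the rewrite happens inside the OSD, the client sends only its raw input and never needs to know that a script was attached on its behalf.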

Testing

First things first… let’s make sure existing stuff isn’t busted. We can test with the cls_hello demonstration object class:

ret, data = ioctx.execute('oid', 'hello', 'say_hello', 'Bernie')
print data[:ret]

Which runs successfully:

[nwatkins@kyoto src]$ python test.py
Hello, Bernie!

When we reference an object class that doesn’t exist an error is correctly reported. Below we try to call a method on the class-does-not-exist class:

[nwatkins@kyoto src]$ python test.py
rados.Error: Ioctx.exec(rbd): failed to exec class-does-not-exist:say_hello on oid: errno ENOTSUP

Let’s try to execute a Lua method using the built-in interface which requires the Lua script to be sent along with the request. Here we define a method in Lua that will return the input string in upper case:

cmd = {
  "script": """
      function upper(input, output)
        input_str = input:str()
        upper_str = string.upper(input_str)
        output:append(upper_str)
      end
      cls.register(upper)
  """,
  "handler": "upper",
  "input": "this string was in lower case",
}

ret, data = ioctx.execute('oid', 'lua', 'eval_json', json.dumps(cmd))
print data[:ret]

And when run, we see the output that we expect:

[nwatkins@kyoto src]$ python test.py
THIS STRING WAS IN LOWER CASE

So now on to the main attraction. What happens if we call a method that doesn’t exist on the lua class? We will call the upper method as if it were a first-class method on the class. We get the correct response, which can be interpreted as follows: the method upper does not exist as a static method on the lua class, and it was not found in the script provided from pg_pool_t (if a script existed).

[nwatkins@kyoto src]$ python test.py
rados.Error: Ioctx.exec(rbd): failed to exec lua:upper on oid: errno ENOTSUP

What we can do now is register the upper method with the cluster and try again. First we stash the method definition in a file called upper.lua:

[nwatkins@kyoto src]$ cat upper.lua 
function upper(input, output)
    input_str = input:str()
    upper_str = string.upper(input_str)
    output:append(upper_str)
end
cls.register(upper)

Next we use the ocm_set.py tool to register the Lua script:

[nwatkins@kyoto src]$ cat upper.lua | python ocm_set.py rbd -
(0, '', u'set pool 0 lua_script to <<< Lua script clipped >>>')

Now we can modify our invocation of exec and pass the input string directly:

ret, data = ioctx.execute('oid', 'lua', 'upper', "this string was in lower case")
print data[:ret]

Success:

[nwatkins@kyoto src]$ python test.py
THIS STRING WAS IN LOWER CASE

And we can then leave the invocation of exec the same and switch out the implementation transparently. Here is the updated script that we will register… it just returns the word upper:

[nwatkins@kyoto src]$ cat upper.lua
function upper(input, output)
    output:append("upper")
end
cls.register(upper)

If we repeat the test we see the expected output:

[nwatkins@kyoto src]$ python test.py
upper

It works :)

What’s Next

There is a lot left to do to tackle the challenges introduced by this technique. From a production deployment standpoint it will be beneficial to bake these new features into the object class infrastructure so that special cases aren’t spread around the OSD code base. The second, more interesting task is to create a methodology for updating interfaces. The current proof-of-concept simply replaces whatever interface is currently installed. Facilities for migrating interfaces and performing data transformations to support migration are needed.