Toward dynamic RADOS object class management
Standard object classes in RADOS are managed using a static versioning and distribution scheme, but this may be restrictive for dynamically defined interfaces. In this post we describe a proof-of-concept implementation for dynamically managing object interfaces.
The code described in this post is a work-in-progress and is maintained in a separate branch: https://github.com/dotnwat/ceph/tree/objclass-manager.
Introduction
In a previous post I described how dynamic Lua object classes can be managed in the file system, extending the existing workflow for object class development to encompass object classes written in Lua and installed on a local file system. While this method is familiar to developers, it isn’t the only way to manage dynamic interfaces. We envision a centralized system for interface management that handles persistence, distribution, and version management.
There are many ways such a manager could be constructed. For instance, object classes could be managed in a custom object class that would provide persistence. However, since object class definitions must be distributed to all OSD processes in the cluster, additional functionality for retrieving these definitions and keeping them consistent across the cluster would need to be introduced. Another approach would be to make use of a data structure that is already distributed across OSD processes, such as the OSDMap or the structure that encodes pool information (pg_pool_t). In this post we’ll explore a proof-of-concept interface manager service that uses the pg_pool_t structure to manage object class definitions. In particular we will focus on the mechanics of distributing classes. The semantics of versioning will be left for a future post.
Interface State Management
The pg_pool_t object contains a variety of information about a RADOS pool. Information managed by this structure includes replication or erasure coding settings, cache mode, snapshots, and so on. Our prototype begins with the addition of a string that will hold a Lua script. Note that future versions will have more complex management features that may require an entirely new set of managed structures, but in the interest of simplicity, a single Lua script will be the only new state introduced for the prototype.
The encode and decode methods are used to serialize and deserialize the structure for saving to disk or transmitting across the network. The patch below bumps the encoding version from 24 to 25, appends the script to the encoded structure, and decodes it only when the encoded version is at least 25:
diff --git a/src/osd/osd_types.cc b/src/osd/osd_types.cc
index de80857..32b25f8 100644
--- a/src/osd/osd_types.cc
+++ b/src/osd/osd_types.cc
@@ -1429,7 +1429,7 @@ void pg_pool_t::encode(bufferlist& bl, uint64_t features) const
return;
}
- ENCODE_START(24, 5, bl);
+ ENCODE_START(25, 5, bl);
::encode(type, bl);
::encode(size, bl);
::encode(crush_ruleset, bl);
@@ -1478,12 +1478,13 @@ void pg_pool_t::encode(bufferlist& bl, uint64_t features) const
::encode(hit_set_grade_decay_rate, bl);
::encode(hit_set_search_last_n, bl);
::encode(opts, bl);
+ ::encode(lua_script, bl);
ENCODE_FINISH(bl);
}
void pg_pool_t::decode(bufferlist::iterator& bl)
{
- DECODE_START_LEGACY_COMPAT_LEN(24, 5, 5, bl);
+ DECODE_START_LEGACY_COMPAT_LEN(25, 5, 5, bl);
::decode(type, bl);
::decode(size, bl);
::decode(crush_ruleset, bl);
@@ -1625,6 +1626,9 @@ void pg_pool_t::decode(bufferlist::iterator& bl)
if (struct_v >= 24) {
::decode(opts, bl);
}
+ if (struct_v >= 25) {
+ ::decode(lua_script, bl);
+ }
DECODE_FINISH(bl);
calc_pg_masks();
calc_grade_table();
diff --git a/src/osd/osd_types.h b/src/osd/osd_types.h
index cb71218..be435dc 100644
--- a/src/osd/osd_types.h
+++ b/src/osd/osd_types.h
@@ -1223,6 +1223,8 @@ struct pg_pool_t {
pool_opts_t opts; ///< options
+ string lua_script;
+
private:
vector<uint32_t> grade_table;
This structure is distributed across the cluster on a per-pool basis and provides pool metadata to OSD processes. The addition of the Lua script to the structure extends that distribution mechanism to our additional state. Next we’ll see how to update that state from a client.
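The version bump in the patch above follows a common pattern: new fields are appended at the end of the encoding, and the decoder reads them only if the encoded version is new enough. Here is a minimal Python sketch of that idea. It is a toy model and not Ceph's wire format: the single version byte, the helper names, and the reduction of pg_pool_t to two string fields are all assumptions made for illustration.

```python
import struct


def encode_str(s: str) -> bytes:
    # Length-prefixed string: 32-bit little-endian length, then the bytes.
    data = s.encode("utf-8")
    return struct.pack("<I", len(data)) + data


def decode_str(buf: bytes, off: int):
    # Returns the decoded string and the offset just past it.
    (n,) = struct.unpack_from("<I", buf, off)
    off += 4
    return buf[off:off + n].decode("utf-8"), off + n


def encode_pool(opts: str, lua_script: str, version: int = 25) -> bytes:
    # Hypothetical stand-in for pg_pool_t::encode: a version byte, the
    # pre-existing field, and (from version 25 on) the appended script.
    out = struct.pack("<B", version) + encode_str(opts)
    if version >= 25:
        out += encode_str(lua_script)
    return out


def decode_pool(buf: bytes) -> dict:
    # Hypothetical stand-in for pg_pool_t::decode: older encodings simply
    # leave lua_script at its default (empty) value.
    (struct_v,) = struct.unpack_from("<B", buf, 0)
    opts, off = decode_str(buf, 1)
    lua_script = ""
    if struct_v >= 25:
        lua_script, off = decode_str(buf, off)
    return {"struct_v": struct_v, "opts": opts, "lua_script": lua_script}
```

Because the new field is appended and guarded by the version check, a structure encoded at version 24 still decodes cleanly, with an empty script.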
Installing an Interface
Now that pg_pool_t has new state to hold the Lua script, we need a way to update it. Since the monitor coordinates updates to this structure through the OSD service, we’ll piggyback on this infrastructure to make updates to the installed Lua script.
Below we add a new monitor command, ceph osd pool set-class <pool> <class> <script>, which updates the pg_pool_t structure for the given pool using a round of Paxos, and ensures the update is propagated to each OSD in the cluster.
First is some metadata describing the new command and its parameters:
diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h
index 8d09f91..2dff5ac 100644
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -676,6 +676,11 @@ COMMAND("osd pool get " \
"name=pool,type=CephPoolname " \
"name=var,type=CephChoices,strings=size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|nodelete|nopgchange|nosizechange|write_fadvise_dontneed|noscrub|nodeep-scrub|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|auid|target_max_objects|target_max_bytes|cache_target_dirty_ratio|cache_target_dirty_high_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|erasure_code_profile|min_read_recency_for_promote|all|min_write_recency_for_promote|fast_read|hit_set_grade_decay_rate|hit_set_search_last_n|scrub_min_interval|scrub_max_interval|deep_scrub_interval|recovery_priority|recovery_op_priority", \
"get pool parameter <var>", "osd", "r", "cli,rest")
+COMMAND("osd pool set-class " \
+ "name=pool,type=CephPoolname " \
+ "name=class,type=CephString " \
+ "name=script,type=CephString", \
+ "set pool object class <class> to <script>", "osd", "rw", "cli,rest")
COMMAND("osd pool set " \
"name=pool,type=CephPoolname " \
"name=var,type=CephChoices,strings=size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|nodelete|nopgchange|nosizechange|write_fadvise_dontneed|noscrub|nodeep-scrub|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|use_gmt_hitset|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_dirty_high_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid|min_read_recency_for_promote|min_write_recency_for_promote|fast_read|hit_set_grade_decay_rate|hit_set_search_last_n|scrub_min_interval|scrub_max_interval|deep_scrub_interval|recovery_priority|recovery_op_priority " \
Now for the implementation of the command. The first thing we do is decode and validate the three input parameters:
diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index ee5b51e..fce7567 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -7310,6 +7310,45 @@ bool OSDMonitor::prepare_command_impl(MonOpRequestRef op,
wait_for_finished_proposal(op, new Monitor::C_Command(mon, op, 0, ss.str(),
get_last_committed() + 1));
return true;
+ } else if (prefix == "osd pool set-class") {
+
+ string poolstr;
+ cmd_getval(g_ceph_context, cmdmap, "pool", poolstr);
+ int64_t pool_id = osdmap.lookup_pg_pool_name(poolstr);
+ if (pool_id < 0) {
+ ss << "unrecognized pool '" << poolstr << "'";
+ err = -ENOENT;
+ goto reply;
+ }
+
+ string clsname;
+ cmd_getval(g_ceph_context, cmdmap, "class", clsname);
+ if (clsname == "") {
+ ss << "invalid class name '" << clsname << "'";
+ err = -EINVAL;
+ goto reply;
+ }
+
+ string script;
+ cmd_getval(g_ceph_context, cmdmap, "script", script);
+ if (script == "") {
+ ss << "invalid script <<< clipped >>>";
+ err = -EINVAL;
+ goto reply;
+ }
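Next we update the pg_pool_t structure and wait for the Paxos proposal carrying the update to complete. Note that the class name is validated but otherwise ignored in this prototype; only the script is stored: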
+ pg_pool_t *p = pending_inc.get_new_pool(pool_id,
+ osdmap.get_pg_pool(pool_id));
+
+ p->lua_script = script;
+
+ ss << "set-class " << clsname << " (ignored) = <<< clipped >>> for pool " << poolstr;
+
+ rs = ss.str();
+ wait_for_finished_proposal(op, new Monitor::C_Command(mon, op, 0, rs,
+ get_last_committed() + 1));
+ return true;
+
} else if (prefix == "osd pool set-quota") {
string poolstr;
cmd_getval(g_ceph_context, cmdmap, "pool", poolstr);
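To make the control flow of the handler easier to follow, here is a rough Python model of the same checks. It is only a sketch: the function name, the dictionary standing in for the OSDMap pool lookup, and the `pending` dictionary standing in for the pending incremental are all hypothetical, and the real handler additionally waits on the Paxos proposal before replying.

```python
import errno


def prepare_set_class(cmdmap: dict, pools: dict, pending: dict):
    """Model of the 'osd pool set-class' handler.

    cmdmap  -- decoded command arguments (pool, class, script)
    pools   -- hypothetical pool name -> pool id table
    pending -- hypothetical pool id -> script table of staged updates
    Returns (err, message), mirroring the err/ss pair in the C++ code.
    """
    poolstr = cmdmap.get("pool", "")
    if poolstr not in pools:
        return -errno.ENOENT, f"unrecognized pool '{poolstr}'"

    clsname = cmdmap.get("class", "")
    if clsname == "":
        return -errno.EINVAL, f"invalid class name '{clsname}'"

    script = cmdmap.get("script", "")
    if script == "":
        return -errno.EINVAL, "invalid script"

    # Stage the update; the class name is accepted but not stored.
    pending[pools[poolstr]] = script
    return 0, f"set-class {clsname} (ignored) for pool {poolstr}"
```

Each failed check returns early with an error code and a message, which corresponds to the `goto reply` branches in the patch.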
And that is it for introducing the new state. The system will automatically distribute the updated state across the cluster. We can test this from the Ceph CLI by providing a string containing a Lua script:
[nwatkins@kyoto src]$ ./ceph osd pool set rbd lua_script "function test() end"
set pool 0 lua_script to <<< Lua script clipped >>>
The monitor command infrastructure will echo the value being set, but we trim it because the Lua script may be large. The CLI interface is useful, but it is also nice to have a programmatic interface. We can construct the monitor request using the RADOS mon_command API, which takes a JSON-formatted string that encodes the monitor command. Here is an example in Python:
import json

def set_lua_script(rados, pool, script, timeout=30):
    # Build the JSON-encoded monitor command and submit it via mon_command.
    cmd = {
        "prefix": "osd pool set",
        "pool": pool,
        "var": "lua_script",
        "val": script,
    }
    return rados.mon_command(json.dumps(cmd), '', timeout)
We’ve wrapped it up in a simple script called ocm_set.py (ocm is short for object class manager), giving us programmatic access from Python via rados.py. The mon_command interface is also exposed through the C and C++ RADOS APIs, so all one would need to do is construct the required JSON to get access from those languages as well. Here is an example using the ocm_set.py tool. The tool also accepts - as the input script parameter, in which case the script is read from standard input.
[nwatkins@kyoto src]$ ./ocm_set.py rbd "function test() end"
(0, '', u'set pool 0 lua_script to <<< Lua script clipped >>>')
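For readers who want to build something similar, here is a small sketch of the two pieces such a tool needs: reading the script (with - meaning standard input) and constructing the JSON request. This is not the actual ocm_set.py. The helper names are hypothetical, and the command fields follow the set-class descriptor added to MonCommands.h earlier, which is an assumption about how that command would be invoked.

```python
import io
import json
import sys


def read_script(arg: str, stdin=sys.stdin) -> str:
    # A lone "-" means: read the whole script from standard input.
    return stdin.read() if arg == "-" else arg


def build_set_class_cmd(pool: str, clsname: str, script: str) -> str:
    # Field names mirror the "osd pool set-class" descriptor:
    # name=pool, name=class, name=script.
    return json.dumps({
        "prefix": "osd pool set-class",
        "pool": pool,
        "class": clsname,
        "script": script,
    })


if __name__ == "__main__":
    # Demo only: print the request instead of sending it to a monitor.
    demo = read_script("-", io.StringIO("function test() end"))
    print(build_set_class_cmd("rbd", "lua", demo))
```

The resulting string would be handed to mon_command in the same way as in set_lua_script above.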
Adding the Lua script to pg_pool_t is sufficient for distributing the script to the servers in the cluster, but at this point the string representing the Lua script isn’t actually accessed. Next we need to wire it up to the Lua object class implementation, cls_lua, so we can use the ioctx::exec interface.
Access Installed Lua Script in OSD
When a client invokes the exec RADOS interface it generates a CEPH_OSD_OP_CALL operation to be executed within the OSD. The first stop along the execution path that is relevant for us is OSD::init_op_flags, which performs a fast examination of the operation to extract information needed for the rest of its execution.
With respect to the CEPH_OSD_OP_CALL operation, the OSD::init_op_flags method does two things. First, it ensures that the shared library that implements the object class is loaded and checks that the method is available.
ClassHandler::ClassData *cls;
int r = class_handler->open_class(cname, &cls);
if (r) {
  derr << "class " << cname << " open got " << cpp_strerror(r) << dendl;
  if (r == -ENOENT)
    r = -EOPNOTSUPP;
  else
    r = -EIO;
  return r;
}
The second task is to mark the operation with read/write flags that are used by the OSD when executing the operation. Normal RADOS operations have these flags hard coded, but object class methods are loaded at runtime. The flags are normally extracted and saved before proceeding:
int flags = cls->get_method_flags(mname.c_str());
if (flags < 0) {
  if (flags == -ENOENT)
    r = -EOPNOTSUPP;
  else
    r = flags;
  return r;
}
is_read = flags & CLS_METHOD_RD;
is_write = flags & CLS_METHOD_WR;
bool is_promote = flags & CLS_METHOD_PROMOTE;
...
In this proof-of-concept we will be reusing the cls_lua object class as the target class when invoking the exec RADOS method, so the default behavior above is correct: if the Lua object class isn’t available an error is returned. However, there are issues to consider when ensuring that the target object class method exists.
As an aside, the way the current cls_lua interface works is that there is exactly one method, lua.execute, which multiplexes user requests. That is, the actual method called and the script are packaged up as an input parameter. This works well, but it would also be nice to allow users to invoke the scripts without having to use a wrapper to handle input parameters, allowing something such as lua.my_method_call to be called with the vanilla RADOS API. This means that when cls->get_method_flags(mname.c_str()) is called (above), the method may not exist on the statically defined and loaded cls_lua object class, but the target method may still be defined within the Lua script that exists in pg_pool_t.
Phew… OK the change is simple and I added a comment below to explain how it works.
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 7425cc7..0ed9bca 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -8740,6 +8740,23 @@ int OSD::init_op_flags(OpRequestRef& op)
bp.copy(iter->op.cls.class_len, cname);
bp.copy(iter->op.cls.method_len, mname);
+ /*
+ * Notes on handling Lua classes distributed via pg_pool_t:
+ *
+ * Currently it is required that all scripts use the class name
+ * "lua" which means that `open_class` below always works and we
+ * virtualize on the method name. The ability to use Lua scripts
+ * loaded from the file system introduced the notion of Lua backed
+ * classes removing the requirement that "lua" class always be used.
+ *
+ * The definition of Lua classes is held in pg_pool_t::lua_classes,
+ * but it isn't clear how to safely reach into that structure at
+ * this point in the code path.
+ *
+ * TODO: find out what sort of synchronization issues arise when
+ * reaching into pg_pool_t at this point. If safe then extract the
+ * class name and script and patch into the ClassHandler.
+ */
ClassHandler::ClassData *cls;
int r = class_handler->open_class(cname, &cls);
if (r) {
@@ -8751,13 +8768,35 @@ int OSD::init_op_flags(OpRequestRef& op)
return r;
}
int flags = cls->get_method_flags(mname.c_str());
- if (flags < 0) {
- if (flags == -ENOENT)
- r = -EOPNOTSUPP;
- else
- r = flags;
- return r;
- }
+ if (flags < 0) {
+ /*
+ * If the method isn't found and the Lua class is being invoked
+ * we'll attempt to perform late binding during execution of the Lua
+ * script. This means that the static methods in `cls_lua` become
+ * reserved:
+ *
+ * - eval_msgpack
+ * - eval_json
+ * - eval_bufferlist
+ *
+ * TODO: there is currently not a method for extracting operation
+ * flags from dynamically defined interfaces so we patch the flags
+ * to be conservative and cover all our bases.
+ *
+ * TODO: since those methods are only referenced by a wrapper
+ * library they could be slightly obfuscated to make a name
+ * collision more unlikely.
+ */
+ if (flags == -ENOENT && cname == "lua")
+ flags = CLS_METHOD_RD | CLS_METHOD_WR;
+ else {
+ if (flags == -ENOENT)
+ r = -EOPNOTSUPP;
+ else
+ r = flags;
+ return r;
+ }
+ }
is_read = flags & CLS_METHOD_RD;
is_write = flags & CLS_METHOD_WR;
bool is_promote = flags & CLS_METHOD_PROMOTE;
Note that it is possible to fully virtualize things so that Lua scripts don’t have to use the lua object class as a target for exec calls. This is a relatively simple fix, but more investigation is needed into what consistency guarantees hold when accessing pg_pool_t in OSD::init_op_flags.
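The flag handling in the patched init_op_flags boils down to a small decision table, which can be modelled in a few lines of Python. This is only a sketch: the function name and the nested dictionary of statically registered methods are hypothetical, the numeric flag values are illustrative, and the real code also distinguishes other error codes that the model collapses.

```python
import errno

# Illustrative flag values for the model; the real constants live in the
# Ceph object class headers.
CLS_METHOD_RD = 0x1
CLS_METHOD_WR = 0x2


def resolve_flags(cname: str, mname: str, static_methods: dict) -> int:
    """Return the op flags for cname.mname, or a negative errno.

    static_methods -- hypothetical table: class name -> {method: flags}
    """
    methods = static_methods.get(cname)
    if methods is None:
        # Class could not be opened.
        return -errno.EOPNOTSUPP

    flags = methods.get(mname)
    if flags is not None:
        # Statically registered method: use its declared flags.
        return flags

    if cname == "lua":
        # Unknown method on the Lua class: assume it may be defined in the
        # pool's script and be conservative about what it might do.
        return CLS_METHOD_RD | CLS_METHOD_WR

    return -errno.EOPNOTSUPP
```

Marking every late-bound method as both a read and a write is the conservative choice described in the patch comment, since the dynamically defined method has no way to declare its own flags yet.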
Anyway…
Now we can move on to the salient portion of the execution of the CEPH_OSD_OP_CALL operation performed in ReplicatedPG::do_osd_ops. The first thing that happens is to get a reference to the object class. Notice that it is a bug if the class isn’t found, since we just saw that the pre-processing of the operation already performed this lookup.
ClassHandler::ClassData *cls;
result = osd->class_handler->open_class(cname, &cls);
assert(result == 0);
The next task is to grab a reference to the actual method being invoked. Notice below that in the unmodified OSD if the method doesn’t exist we return an error to the client.
ClassHandler::ClassMethod *method = cls->get_method(mname.c_str());
if (!method) {
  dout(10) << "call method " << cname << "." << mname << " does not exist" << dendl;
  result = -EOPNOTSUPP;
  break;
}
Previously, when I showed how the operation flags were extracted, we had to handle the case where the method didn’t exist. We need to do the same thing here. Here is the patch that supports late binding of the method. Notice that the input is rewritten to conform to the structured input that eval_bufferlist expects. Cool!
diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index 27c7244..5703a8d 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -4231,11 +4231,29 @@ int ReplicatedPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)
assert(result == 0); // init_op_flags() already verified this works.
ClassHandler::ClassMethod *method = cls->get_method(mname.c_str());
- if (!method) {
- dout(10) << "call method " << cname << "." << mname << " does not exist" << dendl;
- result = -EOPNOTSUPP;
- break;
- }
+ if (!method) {
+ /*
+ * If the named method doesn't exist and the target object class is
+ * `cls_lua` then we patch this call with the Lua script stored in
+ * `pg_pool_t` and allow late binding of the referenced method with
+ * the script.
+ */
+ if (cname == "lua") {
+ method = cls->get_method("eval_bufferlist");
+ if (method) {
+ bufferlist tmp_indata;
+ ::encode(pool.info.lua_script, tmp_indata);
+ ::encode(mname, tmp_indata);
+ ::encode(indata, tmp_indata);
+ indata = tmp_indata;
+ }
+ }
+ if (!method) {
+ dout(10) << "call method " << cname << "." << mname << " does not exist" << dendl;
+ result = -EOPNOTSUPP;
+ break;
+ }
+ }
int flags = method->get_flags();
if (flags & CLS_METHOD_WR)
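The input rewrite in the patch packs three values back to back: the pool's script, the requested method name, and the caller's original input. The Python sketch below models that layout and its inverse. It assumes each value is encoded as a 32-bit little-endian length followed by the raw bytes, which is my understanding of how strings and bufferlists are encoded but should be treated as an assumption here. The function names are hypothetical.

```python
import struct


def encode_blob(data: bytes) -> bytes:
    # Assumed layout: 32-bit little-endian length, then the payload.
    return struct.pack("<I", len(data)) + data


def wrap_indata(lua_script: str, mname: str, indata: bytes) -> bytes:
    # Same order as the patch: script, method name, original input.
    return (encode_blob(lua_script.encode("utf-8"))
            + encode_blob(mname.encode("utf-8"))
            + encode_blob(indata))


def unwrap_indata(buf: bytes):
    # What a receiver would do: peel off three blobs in order.
    fields = []
    off = 0
    for _ in range(3):
        (n,) = struct.unpack_from("<I", buf, off)
        off += 4
        fields.append(buf[off:off + n])
        off += n
    return fields[0].decode("utf-8"), fields[1].decode("utf-8"), fields[2]
```

The point of the rewrite is that the client sends only its plain input, while the OSD supplies the script and the handler name on its behalf.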
Testing
First things first… let’s make sure existing stuff isn’t busted. We can test with the cls_hello demonstration object class:
ret, data = ioctx.execute('oid', 'hello', 'say_hello', 'Bernie')
print data[:ret]
Which runs successfully:
[nwatkins@kyoto src]$ python test.py
Hello, Bernie!
When we reference an object class that doesn’t exist an error is correctly reported. Below we try to call a method on the class-does-not-exist class:
[nwatkins@kyoto src]$ python test.py
rados.Error: Ioctx.exec(rbd): failed to exec class-does-not-exist:say_hello on oid: errno ENOTSUP
Let’s try to execute a Lua method using the built-in interface which requires the Lua script to be sent along with the request. Here we define a method in Lua that will return the input string in upper case:
cmd = {
"script": """
function upper(input, output)
input_str = input:str()
upper_str = string.upper(input_str)
output:append(upper_str)
end
cls.register(upper)
""",
"handler": "upper",
"input": "this string was in lower case",
}
ret, data = ioctx.execute('oid', 'lua', 'eval_json', json.dumps(cmd))
print data[:ret]
And when run, we see the output that we expect:
[nwatkins@kyoto src]$ python test.py
THIS STRING WAS IN LOWER CASE
So now on to the main attraction. What happens if we call a method on the lua class that doesn’t exist? What we will do is call the upper method as if it were a first-class method on the class. We get the correct response, which can be interpreted as follows: the method upper does not exist as a static method on the lua class, and it was not found in the script provided from pg_pool_t (if a script existed).
[nwatkins@kyoto src]$ python test.py
rados.Error: Ioctx.exec(rbd): failed to exec lua:upper on oid: errno ENOTSUP
What we can do now is register the upper method with the cluster and try again. First we stash the method definition in a file called upper.lua:
[nwatkins@kyoto src]$ cat upper.lua
function upper(input, output)
input_str = input:str()
upper_str = string.upper(input_str)
output:append(upper_str)
end
cls.register(upper)
Next we use the ocm_set.py tool to register the Lua script:
[nwatkins@kyoto src]$ cat upper.lua | python ocm_set.py rbd -
(0, '', u'set pool 0 lua_script to <<< Lua script clipped >>>')
Now we can modify our invocation of exec and pass the input string directly:
ret, data = ioctx.execute('oid', 'lua', 'upper', "this string was in lower case")
print data[:ret]
Success:
[nwatkins@kyoto src]$ python test.py
THIS STRING WAS IN LOWER CASE
And we can then leave the invocation of exec the same and switch out the implementation transparently. Here is the updated script that we will register… it just returns the word upper:
[nwatkins@kyoto src]$ cat upper.lua
function upper(input, output)
output:append("upper")
end
cls.register(upper)
If we repeat the test we see the expected output:
[nwatkins@kyoto src]$ python test.py
upper
It works :)
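The behavior exercised in these tests can be summarized with a tiny Python model. It is purely illustrative: the class name is made up, the installed script is represented as a dictionary of Python callables rather than Lua source, and the reserved static methods are not modelled beyond being rejected.

```python
import errno

# Static cls_lua methods named in the patch comment; reserved in the model.
RESERVED = {"eval_msgpack", "eval_json", "eval_bufferlist"}


class LuaClassModel:
    def __init__(self):
        # Stands in for the script stored in pg_pool_t: name -> handler.
        self.pool_script = {}

    def set_script(self, handlers: dict) -> None:
        # Installing a script replaces whatever was installed before.
        self.pool_script = dict(handlers)

    def call(self, mname: str, indata: str):
        if mname in RESERVED:
            raise NotImplementedError("static methods are not modelled")
        handler = self.pool_script.get(mname)
        if handler is None:
            # Not static and not in the pool script.
            return -errno.EOPNOTSUPP, ""
        return 0, handler(indata)


if __name__ == "__main__":
    m = LuaClassModel()
    print(m.call("upper", "this string was in lower case"))
    m.set_script({"upper": str.upper})
    print(m.call("upper", "this string was in lower case"))
    m.set_script({"upper": lambda _input: "upper"})
    print(m.call("upper", "this string was in lower case"))
```

The client-side call stays the same across all three steps; only the installed script changes.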
What’s Next
There are many things that need to be done to tackle the challenges introduced by this technique. From a production deployment standpoint it will be beneficial to bake these new features into the object class infrastructure so that special cases aren’t spread around the OSD code base. The second, more interesting task is to create a methodology for updating interfaces. The current proof-of-concept simply replaces whatever interface is currently installed. Facilities for migrating interfaces and performing data transformations to support migration are needed.