The latency of Ceph placement group splitting
When failure occurs in Ceph, or when more OSDs are added to a cluster, data moves around to re-replicate objects or to re-balance data placement. This movement is minimized by design, but sometimes it is necessary to scale the system in a way that causes a lot of data movement, which will have an impact on performance (though in practice this is a rare event for which scheduled downtime may be reasonable). This post examines, very briefly, that performance impact in a constrained set of microbenchmarks.
An object in RADOS maps to a placement group, and each placement group maps to one or more physical devices. The essence of the first mapping is captured by the function `pg = hash(oid) % pg_num`, which, given an object name `oid`, selects a target placement group from the range `0..pg_num`. The second mapping is more complex, based on CRUSH, but computes something analogous to `osd = hash(pg) % pgp_num`, where `pgp_num` is the number of placement group placements, or something like the number of OSDs being considered for placement (I think this definition may not be very precise). The important thing is that when OSDs are added, a minimum number of placement groups are moved (and by extension a minimum number of objects).
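To make the two-stage mapping concrete, here is a minimal sketch of the simplified model above. This is only an illustration: Ceph's real implementation uses its own hashing, a stable modulo, and CRUSH rather than the plain `std::hash`-and-modulo shown here.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Simplified model of the two mappings described above. Ceph actually uses
// its own hash functions, a "stable mod", and CRUSH, so treat this purely
// as an illustration of the idea.
uint32_t object_to_pg(const std::string &oid, uint32_t pg_num) {
  // pg = hash(oid) % pg_num
  return std::hash<std::string>{}(oid) % pg_num;
}

uint32_t pg_to_osd(uint32_t pg, uint32_t pgp_num) {
  // osd = hash(pg) % pgp_num -- in reality this step is CRUSH, with pgp_num
  // bounding the number of distinct placements considered.
  return std::hash<uint32_t>{}(pg) % pgp_num;
}
```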
A typical Ceph deployment will choose a large number of placement groups, on the order of hundreds per OSD, so when new hardware is added some placement groups can be re-mapped for balancing, and the amount of data moved is proportional to the expansion of the cluster. When the ratio of PGs to OSDs becomes small, adding new placement groups may be required. In effect, the addition of placement groups changes the mapping between objects and placement groups, and can incur significant data movement as PGs are effectively being split. In this post we are interested in the effect of this splitting on client operation latency.
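For reference, a split is triggered by raising `pg_num` on the pool and then raising `pgp_num` so that the new PGs are actually used for placement. Below is a rough sketch of doing this through librados' `mon_command()` interface; the `pool_set` and `split_pgs` helpers are hypothetical, and the JSON fields assume the usual `osd pool set <pool> <var> <val>` monitor command (the equivalent of `ceph osd pool set <pool> pg_num <n>` on the CLI).

```cpp
#include <rados/librados.hpp>
#include <string>

// Hypothetical helper that sets a single pool property via the monitors.
static int pool_set(librados::Rados &cluster, const std::string &pool,
                    const std::string &var, const std::string &val) {
  std::string cmd = "{\"prefix\": \"osd pool set\", \"pool\": \"" + pool +
                    "\", \"var\": \"" + var + "\", \"val\": \"" + val + "\"}";
  librados::bufferlist in, out;
  std::string status;
  return cluster.mon_command(cmd, in, &out, &status);
}

// Split the pool's single PG into two (PG++), then bump pgp_num (PGP++) so
// the new PG is considered for placement; data movement begins after that.
int split_pgs(librados::Rados &cluster, const std::string &pool) {
  int r = pool_set(cluster, pool, "pg_num", "2");
  if (r < 0)
    return r;
  return pool_set(cluster, pool, "pgp_num", "2");
}
```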
We’ll start with a microbenchmark that consists of a single client calling `rados::create()` on objects named `obj.0,1,2,...`. The client generates this workload on a pool with a single placement group. After some period of time we add an additional placement group to the pool and observe changes in latency as the system adjusts. The experiments are run on a beefy system with 384 GB RAM and 16 cores. On this system 2 OSDs are deployed, backed by the in-memory object store. The purpose of using the in-memory store is to remove as much noise as possible (sources of which include network and storage devices) to expose the system behavior.
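For concreteness, the client side of this workload might look roughly like the following librados loop. The pool name, config path, and output format are assumptions for illustration, not the exact harness used in these experiments.

```cpp
#include <rados/librados.hpp>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <string>

// Sketch of the client: connect to the cluster, then create zero-length
// objects named obj.0, obj.1, ... while recording per-operation latency.
int main() {
  librados::Rados cluster;
  if (cluster.init(nullptr) < 0 ||
      cluster.conf_read_file("/etc/ceph/ceph.conf") < 0 ||
      cluster.connect() < 0)
    return 1;

  librados::IoCtx ioctx;
  if (cluster.ioctx_create("split-bench", ioctx) < 0)  // hypothetical pool name
    return 1;

  // Runs until interrupted; the experiment driver decides when to split PGs.
  for (uint64_t i = 0; ; i++) {
    std::string oid = "obj." + std::to_string(i);
    auto start = std::chrono::steady_clock::now();
    int r = ioctx.create(oid, true /* exclusive */);   // zero-length object
    auto end = std::chrono::steady_clock::now();
    if (r < 0)
      return 1;
    double ms = std::chrono::duration<double, std::milli>(end - start).count();
    std::printf("%lu %f\n", (unsigned long)i, ms);     // object index, latency (ms)
  }
}
```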
Results #
The annotated graph below shows the phases of the experiment. The setup phase consists of the client running without any changes to the system. During this phase the single placement group in the pool is being populated with objects (of zero length). In the example here we perform this setup for sixty seconds (note that we clip the graph up to this point). The next phase starts when we instruct Ceph to add an additional placement group (labeled `PG++`). Notice a small interruption to client performance immediately following this. Once the system stabilizes we increase the number of PG placements (`pgp_num`). Following this is a longer client delay, then a period of increased latency, presumably while Ceph is rebalancing, and finally a short delay.
Each experiment follows the same basic structure, but we increase the amount of time before we split the placement group. The reasoning is that we would like to know how latency is affected by the size of the placement group at the time of the split. In the following experiment we double the setup time to two minutes. This corresponds to approximately 95,000 objects, in contrast to the previous example, where a 60 second setup time produced around 47,000 objects.
In this second example, although we have doubled the setup time, the rebalance phase takes approximately three times as long, at 30 seconds; note, however, that the client delay immediately preceding the rebalance is approximately 10 seconds in both cases. In the next experiment we increase the setup time again to 300 seconds, or approximately 250,000 objects.
And we can keep doing this all day, but I’ll just summarize the results in this nice table, as these graphs quickly become very crowded by the rebalance phase and obscure the other phases.
| phase / setup time | 120s | 300s | 600s | 1200s |
|---|---|---|---|---|
| num objs | 95K | 250K | 512K | 975K |
| pg delay | 1s | 1s | 2s | 4s |
| pgp delay | 12s | 14s | 11s | 12s |
| shuffle | 27s | 86s | 195s | 380s |
| post shuffle delay | 1s | 16s | 1s | 1s |
Next Steps #
The next thing we’ll be looking at is other operations, such as reading and writing, during PG splitting. We’ll focus first on these microbenchmarks and then start to expand into full-scale benchmarks. It would also be interesting to look at the performance of concurrent workloads, because operations are unblocked as their data finishes migrating, and migration is an iterative process that takes time.