
The latency of Ceph placement group splitting


When failure occurs in Ceph, or when more OSDs are added to a cluster, data moves around to re-replicate objects or to re-balance data placement. This movement is minimized by design, but sometimes it is necessary to scale the system in a way that causes a lot of data movement, which will have an impact on performance (though in practice this is a rare event for which scheduled downtime may be reasonable). This post briefly examines that performance impact in a constrained set of microbenchmarks.

An object in RADOS maps to a placement group, and each placement group maps to one or more physical devices. The essence of the first mapping is captured by the function pg = hash(oid) % pg_num, which, given an object name oid, selects a target placement group from the range 0..pg_num-1. The second mapping is more complex, based on CRUSH, but it computes something analogous to osd = hash(pg) % pgp_num, where pgp_num is the number of placement group placements, or roughly the number of OSDs being considered for placement (I think this definition may not be very precise). The important thing is that when OSDs are added, a minimum number of placement groups are moved (and by extension a minimum number of objects).
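To make these two mappings concrete, here is a minimal sketch that uses std::hash and a plain modulo as stand-ins for Ceph's object hash and for CRUSH; the pool parameters are made up for illustration, and this is not the actual Ceph implementation.

```cpp
// Conceptual sketch of the two-level mapping: object -> PG -> OSD.
// std::hash and plain modulo stand in for Ceph's hashing and CRUSH.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

// pg = hash(oid) % pg_num
uint32_t object_to_pg(const std::string& oid, uint32_t pg_num) {
  return std::hash<std::string>{}(oid) % pg_num;
}

// osd = hash(pg) % pgp_num (a crude stand-in for CRUSH)
uint32_t pg_to_osd(uint32_t pg, uint32_t pgp_num) {
  return std::hash<uint32_t>{}(pg) % pgp_num;
}

int main() {
  const uint32_t pg_num = 8, pgp_num = 2;  // hypothetical pool settings
  for (int i = 0; i < 4; i++) {
    std::string oid = "obj." + std::to_string(i);
    uint32_t pg = object_to_pg(oid, pg_num);
    std::cout << oid << " -> pg " << pg
              << " -> osd " << pg_to_osd(pg, pgp_num) << "\n";
  }
  return 0;
}
```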

A typical Ceph deployment will choose a large number of placement groups, on the order of hundreds per OSD, so when new hardware is added some placement groups can be re-mapped for balancing, with the amount of data moved proportional to the expansion of the cluster. When the ratio of PGs to OSDs becomes small, adding new placement groups may be required. In effect, the addition of placement groups changes the mapping between objects and placement groups, and can incur significant data movement as PGs are effectively being split. In this post we are interested in the effect of this splitting on client operation latency.
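As a rough way to see why a split moves data, the toy model above can be used to count how many existing objects land in a different placement group when pg_num is increased. Ceph's real split uses a stable modulo so that only objects in the PGs being split are affected, but the takeaway is the same: roughly half of each split PG's objects must move.

```cpp
// Toy estimate of remapped objects when pg_num goes from 1 to 2,
// mirroring the experiment below. Not Ceph's actual split logic.
#include <functional>
#include <iostream>
#include <string>

int main() {
  const int num_objects = 100000;                  // assumed object count
  const unsigned old_pg_num = 1, new_pg_num = 2;   // assumed pool settings
  std::hash<std::string> h;
  int moved = 0;
  for (int i = 0; i < num_objects; i++) {
    std::string oid = "obj." + std::to_string(i);
    if (h(oid) % old_pg_num != h(oid) % new_pg_num)
      moved++;
  }
  std::cout << moved << " of " << num_objects
            << " objects map to a different PG after the split\n";
  return 0;
}
```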

We’ll start with a microbenchmark that consists of a single client calling rados::create() on objects named obj.0, obj.1, obj.2, and so on. The client generates this workload on a pool with a single placement group. After some period of time we add an additional placement group to the pool and observe changes in latency as the system adjusts. The experiments are run on a beefy system with 384 GB RAM and 16 cores. On this system two OSDs are deployed, backed by the in-memory object store. The purpose of using the in-memory store is to remove as much noise as possible (sources of which include network and storage devices) and expose the behavior of the system itself.
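The benchmark client is not shown in the post, but a minimal librados sketch of this workload might look like the following; the pool name, config path, and error handling here are assumptions, not the code actually used.

```cpp
// Sketch of the client workload: create zero-length objects obj.0,
// obj.1, ... in a loop and record the latency of each create call.
#include <chrono>
#include <iostream>
#include <string>
#include <rados/librados.hpp>

int main() {
  librados::Rados cluster;
  cluster.init(nullptr);                          // default client id
  cluster.conf_read_file("/etc/ceph/ceph.conf");  // assumed config path
  if (cluster.connect() < 0) return 1;

  librados::IoCtx ioctx;
  if (cluster.ioctx_create("split-test", ioctx) < 0)  // assumed pool name
    return 1;

  for (uint64_t i = 0; ; i++) {
    std::string oid = "obj." + std::to_string(i);
    auto start = std::chrono::steady_clock::now();
    int ret = ioctx.create(oid, true);            // exclusive, zero-length
    auto end = std::chrono::steady_clock::now();
    if (ret < 0) {
      std::cerr << "create " << oid << " failed: " << ret << "\n";
      break;
    }
    std::chrono::duration<double, std::milli> latency = end - start;
    std::cout << oid << " " << latency.count() << " ms\n";
  }

  ioctx.close();
  cluster.shutdown();
  return 0;
}
```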

Results

The annotated graph below shows the phases of the experiment. The setup phase consists of the client running without any changes to the system. During this phase the single placement group in the pool is being populated with (zero-length) objects. In the example here we perform this setup for sixty seconds (note that the graph is clipped up to this point). The next phase starts when we instruct Ceph to add an additional placement group (labeled PG++). Notice a small interruption to client performance immediately following this. Once the system stabilizes we increase the number of PG placements. Following this is a longer client delay and then a period of increased latency, presumably while Ceph is rebalancing. This period is followed by a final, short delay.

[Figure: client operation latency over time with a 60-second setup phase, annotated with the experiment phases]
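The post doesn't show how the PG++ step and the subsequent increase in PG placements were triggered; one plausible way, equivalent to running ceph osd pool set with pg_num and then pgp_num, is to issue monitor commands through librados. The pool name and target values here are assumptions based on the experiment description.

```cpp
// Sketch of the PG++ and PGP++ steps via the monitor command interface.
#include <iostream>
#include <string>
#include <rados/librados.hpp>

static int pool_set(librados::Rados& cluster, const std::string& pool,
                    const std::string& var, const std::string& val) {
  std::string cmd =
      "{\"prefix\": \"osd pool set\", \"pool\": \"" + pool +
      "\", \"var\": \"" + var + "\", \"val\": \"" + val + "\"}";
  librados::bufferlist inbl, outbl;
  std::string outs;
  int ret = cluster.mon_command(cmd, inbl, &outbl, &outs);
  std::cerr << "set " << var << "=" << val << ": " << outs << "\n";
  return ret;
}

int main() {
  librados::Rados cluster;
  cluster.init(nullptr);
  cluster.conf_read_file("/etc/ceph/ceph.conf");  // assumed config path
  if (cluster.connect() < 0) return 1;

  // PG++: split the pool's single placement group into two.
  pool_set(cluster, "split-test", "pg_num", "2");   // assumed pool name
  // ...after the system stabilizes, increase the PG placements as well.
  pool_set(cluster, "split-test", "pgp_num", "2");

  cluster.shutdown();
  return 0;
}
```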

Each experiment follows the same basic structure, but we increase the amount of time before we split the placement group. The reasoning is that we would like to know how latency is affected by the size of the placement group at the time of the split. In the following experiment we double the setup time to two minutes. Note that this corresponds to approximately 95,000 objects, in contrast to the previous example, whose 60-second setup time corresponds to around 47,000 objects.

[Figure: client operation latency over time with a 120-second setup phase]

In the second example, although we have doubled the setup time, the rebalance phase is approximately three times as long at 30 seconds. Note, however, that the client delay immediately preceding the rebalance is approximately 10 seconds in each case. In the next experiment we increase the setup time again to 300 seconds, or approximately 250,000 objects.

[Figure: client operation latency over time with a 300-second setup phase]

We could keep doing this all day, but since the graphs quickly become dominated by the rebalance phase, which obscures the other phases, I’ll just summarize the results in the table below.

phase / setup time    120s   300s   600s   1200s
num objs              95K    250K   512K   975K
pg delay              1s     1s     2s     4s
pgp delay             12s    14s    11s    12s
shuffle               27s    86s    195s   380s
post-shuffle delay    1s     16s    1s     1s

Next Steps

The next thing we’ll look at is other operations, such as reads and writes, during PG splitting. We’ll focus first on these microbenchmarks and then start to expand into full-scale benchmarks. It would also be interesting to look at the performance of concurrent workloads, because operations are unblocked only as data is fully migrated, and this is an iterative process that takes time.