Slow placement group read operations in Ceph

In "Adding a new placement group operation in Ceph" I demonstrated how to add a new RADOS operation that works at the placement group level, allowing a single operation to touch multiple objects. Recently I’ve been experimenting more with operations at the placement group level, and found interesting performance behavior when reading multiple objects within a single PG operation.

The two graphs below show the results of four experiments that each read 1000 small objects from a pool with eight PGs. In the first pair of experiments (cls_coldcache and cls_hotcache) each of the 1000 objects is read sequentially by a client, making 1000 network round trips. The test is run once with a hot cache and once with a cold cache (Linux drop_caches plus an OSD restart). In the second pair of experiments (pg_coldcache and pg_hotcache) a client makes one call to each placement group, which in turn reads all of its objects and returns a blob containing the concatenation of all the objects in that placement group.
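To make the two access patterns concrete, here is a minimal sketch that models them against an in-memory stand-in for the pool and counts round trips. Everything in it is an assumption for illustration: the names `pg_for`, `build_pool`, `read_per_object`, and `read_per_pg` are hypothetical, the CRC32-based placement is not how Ceph maps objects to placement groups, and no real RADOS calls are made.

```python
import zlib
from collections import defaultdict

NUM_PGS = 8          # eight placement groups, as in the experiments
NUM_OBJECTS = 1000   # 1000 small objects, as in the experiments


def pg_for(name: str) -> int:
    # Stand-in placement function. This is NOT Ceph's real object-to-PG
    # mapping; it only spreads names across NUM_PGS buckets.
    return zlib.crc32(name.encode()) % NUM_PGS


def build_pool():
    # Model the pool as {pg_id: {object_name: payload}}.
    pool = defaultdict(dict)
    for i in range(NUM_OBJECTS):
        name = f"obj.{i}"
        pool[pg_for(name)][name] = f"payload-{i}".encode()
    return pool


def read_per_object(pool):
    # cls_* pattern: one request (one round trip) per object.
    round_trips = 0
    chunks = []
    for i in range(NUM_OBJECTS):
        name = f"obj.{i}"
        chunks.append(pool[pg_for(name)][name])
        round_trips += 1
    return b"".join(chunks), round_trips


def read_per_pg(pool):
    # pg_* pattern: one request per placement group; the "server side"
    # concatenates every object in that PG into a single blob.
    round_trips = 0
    blobs = []
    for pg in sorted(pool):
        blobs.append(b"".join(pool[pg].values()))
        round_trips += 1
    return b"".join(blobs), round_trips


if __name__ == "__main__":
    pool = build_pool()
    _, per_object_trips = read_per_object(pool)
    _, per_pg_trips = read_per_pg(pool)
    print(f"per-object round trips: {per_object_trips}")
    print(f"per-PG round trips:     {per_pg_trips}")
```

Both functions return the same total number of bytes; only the number of requests differs, which is the variable the experiments isolate.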

Note that these latency measurements are taken at the lowest level in Ceph, right before pread jumps into the kernel, so they do not include the other overheads of OSD code paths or network effects.
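As an illustration of what measuring "right around pread" means, the following sketch times a single `os.pread` call in user space. It is not the instrumentation used inside the OSD (which is C++); the helper names `timed_pread` and `measure` are hypothetical, and the temporary file merely stands in for an object's backing file.

```python
import os
import tempfile
import time


def timed_pread(fd: int, length: int, offset: int):
    # Start the clock immediately before the system call and stop it
    # immediately after, so only the pread itself is measured.
    start = time.perf_counter()
    data = os.pread(fd, length, offset)
    elapsed = time.perf_counter() - start
    return data, elapsed


def measure(path: str, length: int):
    # Opening and closing the file are deliberately outside the timed region.
    fd = os.open(path, os.O_RDONLY)
    try:
        return timed_pread(fd, length, 0)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical 4 KiB "object" written to a temporary file.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * 4096)
        path = f.name
    try:
        data, elapsed = measure(path, 4096)
        print(f"read {len(data)} bytes in {elapsed * 1e6:.1f} us")
    finally:
        os.unlink(path)
```

A file that was just written will almost certainly be served from the page cache, so this corresponds to the hot cache case; reproducing the cold cache case would require dropping the page cache first.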

The first graph shows the hot cache case. Each read in either version is fairly cheap, but the placement group version wins, likely because it cuts the number of network round trips from 1000 to 8 (one per placement group), a reduction of roughly 125x. This result is not really surprising.

The second graph shows the cold cache case, and here the result is surprising. Even with 1000 separate read operations dispatched from the client, the normal read path is much faster (2 seconds vs 6 seconds). The latency of each individual read is significantly higher when it is issued from the do_pg_op context.

The only explanation I can come up with is some sort of pre-fetching that benefits the normal read path. However, I have not yet been able to find any place in the OSD where this pre-fetching occurs.

The search continues…