Sharding the LRU node cache in CruzDB
In our previous post on CruzDB performance we examined initial performance results for the YCSB benchmark. The results showed that a larger cache significantly improved performance (no surprise), but we also observed that, even for read-only workloads, throughput did not scale with the number of threads. In this post we address this issue and present new scalability results.
CruzDB is structured as a copy-on-write red-black tree where every new version is stored in an underlying shared-log. Transactions in CruzDB execute against a snapshot of the database, and when a portion of the tree is not resident in memory it can be reconstructed by reading the missing data from the underlying log in persistent storage.
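To make this concrete, here is a minimal sketch of how a node pointer might carry both a log address and an optional in-memory pointer, so that a missing node can be rebuilt lazily from the log. The names and layout (`Node`, `NodePointer`, `Log::Read`) are illustrative assumptions, not the actual CruzDB types.

```cpp
#include <cstdint>
#include <memory>

struct Node;

// Hypothetical log interface: reads back a serialized node stored at a
// given log address.
struct Log {
  std::shared_ptr<Node> Read(uint64_t address);
};

// A child pointer that can be either an in-memory node or just the address
// of the node in the shared log.
struct NodePointer {
  uint64_t log_address = 0;        // where the node lives in the log
  std::shared_ptr<Node> cached;    // in-memory copy; null if evicted

  // Use the in-memory copy if present, otherwise rebuild the node by
  // reading it back from the log.
  std::shared_ptr<Node> Resolve(Log& log) {
    if (!cached)
      cached = log.Read(log_address);
    return cached;
  }
};

// A node of the copy-on-write red-black tree. New versions of the tree
// share unchanged sub-trees by pointing at the same log addresses.
struct Node {
  int64_t key = 0;
  int64_t value = 0;
  bool red = false;                // red-black tree color
  NodePointer left;
  NodePointer right;
};

// Placeholder body so the sketch stands alone; a real implementation would
// deserialize the node found at `address` in the shared log.
std::shared_ptr<Node> Log::Read(uint64_t address) {
  (void)address;
  return std::make_shared<Node>();
}
```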
When memory pressure is high, a portion of the tree can be removed. However, since there may be many references to a single node in the tree, we cannot atomically remove the node and all of its references. Thus, after removing a link in the tree, a racing thread may attempt to reconstruct the missing portion even though it is already in memory. To prevent duplicate copies of a sub-tree from being created, we use a node cache on top of the log. The cache is organized as an LRU cache, so it also serves as an indicator of which parts of the tree should be removed.
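Below is a minimal sketch of such a node cache, assuming a single hash map and LRU list protected by one mutex; the names (`NodeCache`, `ResolveOrInsert`, `Evict`) are hypothetical and the real CruzDB cache differs in detail. The point is that every reconstruction funnels through one lookup, so racing threads always end up sharing the same in-memory copy, and eviction trims from the cold end of the LRU list.

```cpp
#include <cstdint>
#include <list>
#include <memory>
#include <mutex>
#include <unordered_map>

struct Node;  // tree node, as in the sketch above

class NodeCache {
 public:
  // Return the cached node for this log address, or install the freshly
  // reconstructed one. Either way the caller gets the single shared copy.
  std::shared_ptr<Node> ResolveOrInsert(uint64_t address,
                                        std::shared_ptr<Node> rebuilt) {
    std::lock_guard<std::mutex> lk(lock_);
    auto it = entries_.find(address);
    if (it != entries_.end()) {
      // Cache hit: move to the front of the LRU list and discard `rebuilt`.
      lru_.splice(lru_.begin(), lru_, it->second.lru_pos);
      return it->second.node;
    }
    // Cache miss: install the reconstructed node as the canonical copy.
    lru_.push_front(address);
    entries_[address] = {rebuilt, lru_.begin()};
    return rebuilt;
  }

  // Under memory pressure, drop the least recently used nodes; their
  // parents keep only the log address and can rebuild them later.
  void Evict(size_t target_size) {
    std::lock_guard<std::mutex> lk(lock_);
    while (entries_.size() > target_size && !lru_.empty()) {
      entries_.erase(lru_.back());
      lru_.pop_back();
    }
  }

 private:
  struct Entry {
    std::shared_ptr<Node> node;
    std::list<uint64_t>::iterator lru_pos;
  };
  std::mutex lock_;                              // protects the map and LRU list
  std::list<uint64_t> lru_;                      // most recently used at the front
  std::unordered_map<uint64_t, Entry> entries_;
};
```

Note that every lookup and eviction in this sketch takes the same mutex, which is exactly the kind of lock that shows up as a contention point below.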
The read scalability issue shown in the previous post was found to be caused by a contended lock in the node cache. Through a basic sharding strategy we were able to reduce contention and achieve better read scalability. The following graph compares the new sharded LRU optimization (labeled LRU) to the previous optimization (labeled Snap). The graph uses a log scale because sharding the LRU cache gave an extra order of magnitude increase in throughput for the read-only YCSB workload c. Other mixed workloads also fare better.
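A sketch of the sharded variant follows, again with hypothetical names (`ShardedNodeCache`, `kNumShards`) and a simple hash of the log address to pick a shard; the actual CruzDB code may partition differently. Each shard owns its own mutex, map, and LRU list, so readers resolving nodes that hash to different shards no longer serialize on a single lock.

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <list>
#include <memory>
#include <mutex>
#include <unordered_map>

struct Node;  // tree node, as before

class ShardedNodeCache {
 public:
  std::shared_ptr<Node> ResolveOrInsert(uint64_t address,
                                        std::shared_ptr<Node> rebuilt) {
    Shard& shard = shards_[ShardIndex(address)];
    std::lock_guard<std::mutex> lk(shard.lock);
    auto it = shard.entries.find(address);
    if (it != shard.entries.end()) {
      // Hit: refresh LRU position within this shard only.
      shard.lru.splice(shard.lru.begin(), shard.lru, it->second.lru_pos);
      return it->second.node;
    }
    shard.lru.push_front(address);
    shard.entries[address] = {rebuilt, shard.lru.begin()};
    return rebuilt;
  }

  // Eviction is also per shard, so trimming one shard does not stall
  // readers working in the others.
  void Evict(size_t target_per_shard) {
    for (Shard& shard : shards_) {
      std::lock_guard<std::mutex> lk(shard.lock);
      while (shard.entries.size() > target_per_shard && !shard.lru.empty()) {
        shard.entries.erase(shard.lru.back());
        shard.lru.pop_back();
      }
    }
  }

 private:
  static constexpr size_t kNumShards = 16;  // power of two for a cheap mask

  struct Entry {
    std::shared_ptr<Node> node;
    std::list<uint64_t>::iterator lru_pos;
  };

  struct Shard {
    std::mutex lock;
    std::list<uint64_t> lru;
    std::unordered_map<uint64_t, Entry> entries;
  };

  static size_t ShardIndex(uint64_t address) {
    return std::hash<uint64_t>{}(address) & (kNumShards - 1);
  }

  std::array<Shard, kNumShards> shards_;
};
```

The trade-off is that LRU ordering is now only approximate across the whole cache (each shard tracks recency independently), which is an acceptable price for removing the single global lock from the read path.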
There are many more optimization strategies we could introduce to increase read throughput. But instead of diving deeper into read performance, next we'll be adding a group commit feature that will increase update transaction throughput across all workloads (except c, which is read-only), especially a and f.
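As a rough illustration of what group commit means here (a generic sketch, not the design we will actually implement), writer threads queue their serialized transaction intentions and a single committer appends the whole queue to the log at once, amortizing the append cost across transactions. All names below are hypothetical.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

class GroupCommitter {
 public:
  GroupCommitter() : committer_([this] { Run(); }) {}

  ~GroupCommitter() {
    {
      std::lock_guard<std::mutex> lk(lock_);
      stop_ = true;
    }
    batch_ready_.notify_one();
    committer_.join();
  }

  // Called by a transaction thread; returns once the batch containing this
  // intention has been appended to the log.
  void Commit(std::string intention) {
    std::unique_lock<std::mutex> lk(lock_);
    const uint64_t my_seq = ++last_enqueued_;
    pending_.emplace_back(std::move(intention));
    batch_ready_.notify_one();
    durable_.wait(lk, [&] { return last_appended_ >= my_seq; });
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> lk(lock_);
    while (true) {
      batch_ready_.wait(lk, [&] { return stop_ || !pending_.empty(); });
      if (stop_ && pending_.empty())
        return;
      // Take the whole queue as one batch.
      std::vector<std::string> batch;
      batch.swap(pending_);
      const uint64_t batch_end = last_enqueued_;
      lk.unlock();
      AppendToLog(batch);  // single log append for every queued transaction
      lk.lock();
      last_appended_ = batch_end;
      durable_.notify_all();
    }
  }

  void AppendToLog(const std::vector<std::string>& batch) {
    // Placeholder for one shared-log append containing all intentions.
    (void)batch;
  }

  std::mutex lock_;
  std::condition_variable batch_ready_;
  std::condition_variable durable_;
  std::vector<std::string> pending_;
  uint64_t last_enqueued_ = 0;
  uint64_t last_appended_ = 0;    // highest sequence durably appended
  bool stop_ = false;
  std::thread committer_;         // started last, after all state above
};
```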