Debuting the performance of CruzDB
The CruzDB database system is a log-structured key-value store that provides serialized transactions over a distributed high-performance log. We have been working on this system for a little over a year and we just finished bindings for the YCSB workload generator. Today we’ll be previewing some of our initial performance results. We’ll cover some of the basics of the system, but will save a discussion of many of the technical details for another day.
The CruzDB database system is structured as a copy-on-write red-black tree where each new version of the tree is appended to a distributed log. New versions are created by executing a transaction over the latest version of the tree, and versions are stored compactly as an after image (i.e. delta) of the tree produced by the transaction. The interface is similar to the transactional interface of RocksDB. An LRU cache is used to store recently accessed parts of the database.
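To make the interface concrete, here is a rough sketch of what a transaction might look like. The header, class, and method names (cruzdb/db.h, DB, BeginTransaction, Get, Put, Commit) are assumptions modeled on a RocksDB-like transactional API, not the actual CruzDB signatures.

```cpp
// Illustrative sketch only: the header and API names below are assumed,
// modeled on a RocksDB-like transactional interface.
#include <string>
#include "cruzdb/db.h"  // assumed header

void update_record(cruzdb::DB *db) {
  // Begin a transaction against the latest version of the tree.
  auto *txn = db->BeginTransaction();

  std::string value;
  txn->Get("key-123", &value);   // read through the copy-on-write tree
  txn->Put("key-123", "value");  // buffered update

  // Committing produces an after image (delta) of the tree, which is
  // appended to the distributed log as the next database version.
  txn->Commit();
  delete txn;
}
```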
Log entries are immutable, so long running read queries do not affect writers and any number of read-only queries may run in parallel. The current implementation serializes all database updates, which limits transaction throughput. Batching and optimistic concurrency control are some of the next optimizations that we’ll be considering.
The first set of results shows the overall throughput of the 6 default workloads in YCSB for 1, 2, and 4 client threads. There are two results that immediately stand out. First, the performance is relatively poor: we would expect far more throughput for a workload like workload c, which is a read-only workload. The second observation is that adding more threads resulted in lower throughput.
We’ll tackle the overall low throughput issue first. There are a number of reasons why throughput can be low, and we won’t look into all of them in this post, but a major factor is the cache miss rate. When the tree is being traversed, any node that is not in memory must be retrieved from log storage. Presumably, if we increase the size of the LRU node cache, we should see better performance.
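If the node cache size is configurable at open time, the change is a one-liner. The Options struct, cache_size field, and Open signature below are purely illustrative assumptions, not the actual CruzDB configuration API.

```cpp
// Purely illustrative: the Options struct and cache_size field are assumed.
cruzdb::Options options;
options.cache_size = 8ULL << 30;  // e.g. grow the LRU node cache to 8 GB

cruzdb::DB *db;
cruzdb::DB::Open(options, log, &db);  // `log` is the backing distributed log
```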
And indeed, a larger node cache results in quite a bit better performance. For the read-only workload c, the throughput is nearly 6x better, as can be seen in the following graph. We plotted the results for the small cache for comparison (shown in grey with the label small).
The second issue with these results is the effect of additional client threads on performance. While transactions that perform writes are serialized, we would expect workloads like workload c to scale throughput with additional threads, since reads execute against a fixed snapshot and are not serialized. However, the effect of an additional client thread on the throughput of workload c is quite detrimental.
After some investigation I found that the CruzDB bindings to YCSB were running each Get operation in a transaction, and CruzDB doesn’t currently have an optimized read-only transaction. We implemented a new version of the Get binding that runs against a snapshot, rather than in a serialized transaction. The results of the optimization are shown in the graph below. The runs labeled Snap are for the optimization, and the runs labeled SerTxn are the original serialized transaction approach.
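For clarity, the difference between the two bindings looks roughly like the following. The method names (BeginTransaction, GetSnapshot, ReleaseSnapshot, Get) are assumptions based on the RocksDB-like interface described earlier, not the exact CruzDB signatures.

```cpp
// Sketch of the two Get bindings being compared; API names are assumed.
#include <string>
#include "cruzdb/db.h"  // assumed header

// Original binding ("SerTxn"): every read runs inside a full transaction,
// so even read-only operations pass through the serialized commit path.
bool get_sertxn(cruzdb::DB *db, const std::string &key, std::string *value) {
  auto *txn = db->BeginTransaction();
  bool found = txn->Get(key, value);
  txn->Commit();  // committing a read-only transaction still serializes
  delete txn;
  return found;
}

// New binding ("Snap"): reads execute directly against a fixed snapshot
// of the tree and never enter the transaction commit path.
bool get_snap(cruzdb::DB *db, const std::string &key, std::string *value) {
  auto *snapshot = db->GetSnapshot();
  bool found = db->Get(snapshot, key, value);
  db->ReleaseSnapshot(snapshot);
  return found;
}
```

The important distinction is that the Snap path never touches the serialized commit machinery, so read-only operations issued from multiple threads should not contend on it.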
Looking at workload c, the performance of a single thread is slightly better with the snapshot-based Get operation, and the performance hit taken with additional threads is not as severe, which is the trend we were expecting. That said, the improvement doesn’t seem to be present with 4 threads, nor is it a dramatic one. Given this result, I expect the source of the performance degradation lies elsewhere.
The remaining issues are related to the multi-threaded cases. First, for a read-only workload like workload c, why is there any performance hit with more than one thread? The two threads are running in parallel with snapshot-based Get operations, and assuming that there are no cache misses with an 8GB node cache, I expect the source of this overhead to be related both to locking in the database (e.g. the node cache) and to the hit we are taking from our heavy use of smart pointers and the overheads they introduce.
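As a concrete illustration of the smart pointer cost (generic C++, not CruzDB code): copying a std::shared_ptr at every step of a tree traversal performs atomic reference-count updates, which can become visible when many threads traverse a tree that is already fully cached.

```cpp
// Generic illustration of shared_ptr overhead in a tree traversal.
#include <memory>

struct Node {
  std::shared_ptr<Node> left, right;
  int key;
};

// Copying the pointer at each step: atomic ref-count ops per level.
bool contains_copy(std::shared_ptr<Node> n, int key) {
  while (n) {
    if (key == n->key) return true;
    n = key < n->key ? n->left : n->right;  // shared_ptr copy (atomic)
  }
  return false;
}

// Traversing with a raw pointer avoids the per-level atomics; the caller
// must keep the root alive for the duration of the read.
bool contains_raw(const Node *n, int key) {
  while (n) {
    if (key == n->key) return true;
    n = key < n->key ? n->left.get() : n->right.get();
  }
  return false;
}
```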
Another issue is that for write-heavy workloads, the performance hit from multiple threads is curiously high, even considering that transactions are serialized. Multiple threads synchronize with each other during transaction processing, and this cost, multiplied across a large number of transactions, may show up in the form of thread wake-up latency.
And finally, the performance of workload e, in particular, is very surprising. This is a workload that is 95% scans (which run in parallel against a snapshot) and 5% inserts (which are serialized). I would have expected the performance here to be much better given the high percentage of operations that may run in parallel. This may be explained by the fact that a single scan does a lot of work but counts as a single operation.
Stay tuned for more. We’ll be looking in depth at these queries, and in the near future we’ll be prototyping a group commit feature that should significantly increase the write throughput.