The CruzDB log-structured database

The cruzdb project began as a completely open-ended exploration of building an MVCC key-value store on top of a distributed shared log called zlog, and became one of the most interesting and challenging projects I’ve worked on. It scales out across nodes and supports transactions.

Here is a high-level view of the architecture. Shown at the top is CruzDB which is effectively a copy-on-write tree where the delta formed by each mutation of the tree is stored in the underlying log. The cruzdb system is largely agnostic to the log implementation, but we have prototyped the system with zlog running on top of Ceph.

Image
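
To make the "copy-on-write tree whose deltas live in a log" idea concrete, here is a minimal sketch in Python. It is not the cruzdb implementation: the names `Node`, `put`, `get`, and `ToyDB` are made up for illustration, the tree is a plain unbalanced binary search tree, and the "log" is just an in-memory list standing in for zlog. Each write copies only the nodes on the path from the root to the changed key, and that set of new nodes is the delta appended to the log. Old roots stay valid, which is what makes reading older versions cheap.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    # Immutable node: an update creates new nodes instead of mutating old ones.
    key: str
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def put(root: Optional[Node], key: str, value: str, delta: List[Node]) -> Node:
    """Return a new root; every node created on the way is recorded in delta."""
    if root is None:
        node = Node(key, value)
    elif key < root.key:
        node = Node(root.key, root.value, put(root.left, key, value, delta), root.right)
    elif key > root.key:
        node = Node(root.key, root.value, root.left, put(root.right, key, value, delta))
    else:
        # Same key: replace the value, share both existing subtrees.
        node = Node(key, value, root.left, root.right)
    delta.append(node)
    return node


def get(root: Optional[Node], key: str) -> Optional[str]:
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


class ToyDB:
    """Hypothetical stand-in: a Python list plays the role of the shared log."""

    def __init__(self) -> None:
        self.log: List[List[Node]] = []          # one delta per log entry
        self.roots: List[Optional[Node]] = [None]  # roots[i] = tree after i entries

    def put(self, key: str, value: str) -> int:
        delta: List[Node] = []
        new_root = put(self.roots[-1], key, value, delta)
        self.log.append(delta)        # persist only the nodes that changed
        self.roots.append(new_root)
        return len(self.log) - 1      # position of this delta in the log

    def get(self, key: str, version: int = -1) -> Optional[str]:
        # version indexes into roots; -1 means the latest tree.
        return get(self.roots[version], key)


db = ToyDB()
db.put("b", "1")
db.put("a", "2")
db.put("b", "3")
```

In this toy version, `db.get("b")` reads the latest tree while `db.get("b", version=1)` reads the tree as it was after the first write. A real system also has to serialize node pointers as log positions, keep the tree balanced, and handle concurrent writers, all of which this sketch leaves out.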

Wanna get involved?

The project is idle at the moment as I’ve been extremely busy at Vectorized building a new storage engine replacement for Kafka. However, please feel free to reach out if you are interested in working on either of these projects. I’m especially interested in testing zlog at scale, and developing research topics and authoring publications related to the cruzdb architecture.

System architecture

Below is a series of posts outlining the architecture and functionality of cruzdb.

The zlog shared-log

I’ve written some on this site about zlog, which is a distributed shared log designed to run on top of Ceph. I’ve primarily used zlog as the storage layer for cruzdb. The best resource at the moment for up-to-date information on zlog is the GitHub page and associated documentation.

Also, see here for a talk on zlog I gave at Cephalocon 2018.