Systems research I'm watching for in 2018
This post outlines systems technology that I hope to see form the basis for research projects in the upcoming year. Most of the technologies covered can stand alone as a key enabler for individual research directions when paired with an interesting problem. But I’m most excited about the possibilities when these technologies are combined in unique ways. I’ll cover big memory systems, languages and run-times, and tools and frameworks. While I don’t have suggestions for specific research topics, what I hope to do in this post is get everyone excited about the technologies I haven’t been able to stop thinking about for the past year.
Memory-centric systems #
If you’re like many people, myself included, processing data at scale immediately brings to mind systems such as Hadoop, Spark, Kafka, Impala, or any number of other horizontally scalable systems designed for dealing with data in its many forms. But another approach is to scale up an application or processing pipeline on a single fat node. Nodes like the X1 EC2 instance are NUMA machines with 128 vCPUs and nearly 2 TB of RAM. When working sets fit into memory and applications can take advantage of the hardware parallelism available, these nodes can help scale processing tasks with little to no change to applications. But what happens when you continue to scale beyond what can be offered in a single NUMA machine like the X1? Instead of migrating to a scale-out system that will likely require adapting to a new programming model, there are distributed NUMA systems that provide scale-out, cache-coherent shared-memory solutions.
Unlike single NUMA machines such as the X1, massive shared-memory SMP systems can be built by wiring together a cluster of smaller compute nodes using a cache-coherent memory interconnect. Have some software you can’t scale even on the X1? Go buy a system like the SGI UV 300, which uses a custom hardware interconnect, and suddenly you’ll be scaling your application up to 512 cores with 24 TB of addressable memory. In 2016 HPE acquired SGI and its technology for building these types of systems, adding to its HPC and data analytics portfolio.
These types of systems, sometimes called single system images (SSI), are available from several vendors. The company ScaleMP also sells a software-defined memory and compute system called vSMP, and raised a new $10 million funding round in November 2017. The most interesting technology right now is TidalScale, which has built a custom hypervisor that scales horizontally. You can literally spin up virtual machines that address terabytes of physical memory drawn from an underlying cluster. And get this: virtual CPUs migrate transparently across the hardware nodes, providing dynamic compute load-balancing, all while running unmodified OS images and applications.
The point I’m trying to drive home here is that there is a lot of value in being able to work with large, cache-coherent memory systems. And it goes beyond being able to avoid re-writing code for a Spark-oriented world. It can be a lot simpler to write a shared-memory program that periodically performs a checkpoint than to handle the complexities of a distributed system that must cope with all manner of failure scenarios. The following tweet posted this summer shows a recently installed system with 2.3 TB of RAM and 144 cores from the vendor Numascale. These systems can help domain experts in areas such as the life sciences scale their solutions without becoming experts in the latest and greatest data analysis frameworks.
@NumascaleAS system installed by @AIA_science at ICCS in Athens used for @ACTiCLOUD: 2,3 TB Memory, 144 cores, 5,7 TB storage, one system pic.twitter.com/rycWWdPIkQ
— NumascaleAS (@NumascaleAS) July 26, 2017
It shouldn’t come as a surprise that these scale-out shared-memory systems represent a huge amount of engineering effort, especially those using custom hardware for cache coherency. Consider just one aspect of the complexity found in these systems: offering a distributed, cache-coherent address space requires bringing together low-latency network technologies, coherence protocols, low-level changes to virtual memory systems, and an immense amount of testing and tuning. But all of the solutions mentioned here are offered only as proprietary products. Which brings us to the next topic: userfaultfd.
The userfaultfd system call #
When it comes to memory-centric systems, the technology I’m most excited about is userfaultfd. The userfaultfd feature is a relatively new system call in the Linux kernel that allows page faults to be handled in user-space. It lets an application control precisely how a missing page is filled, as well as how pages are released to deal with memory pressure. It sounds like a simple feature, but it may have far-reaching applications. It’s been in development since at least 2014, and as it has evolved and gained new capabilities, we’ve recently seen many new use cases emerge to showcase the technology. Some notable finds include:
- checkpoint restart in user space
- post-copy live VM migration
- generational garbage collection (2016)
- user-space mmap (2017)
- disaggregated memory (2017)
Take the last case as an example. Disaggregation describes an approach to organizing and accessing a cluster of compute resources in which, instead of limiting applications to local node resources, each node can tap into the resources of other nodes. Disaggregated memory, then, is the ability of a node to use the memory of other nodes: an application is given an address space backed by the aggregate physical memory of many nodes. Think of it as application-controlled swap that operates independently from kernel swap. With userfaultfd this feature is straightforward to build: map parts of the address space onto physical nodes across the network, and handle page faults by fetching data from the remote node. Since everything is in user-space, it’s easy to use any networking technology you want. Executing an application using a run-time that can predict access patterns (e.g. a large list scan)? Let the run-time dispatch pre-fetch hints to userfaultfd.
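To make the mechanism concrete, here is a minimal sketch of the userfaultfd pattern, loosely following the userfaultfd(2) man page example: create the userfaultfd object, register a region, and serve missing-page faults by copying in a page. In a disaggregated-memory system the page would be fetched from a remote node; this sketch just fills it with a pattern, and error handling is omitted for brevity.

```cpp
#include <cstring>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main() {
    const size_t len = 16 * 4096;

    // Create the userfaultfd object and handshake on the API version.
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    struct uffdio_api api = {};
    api.api = UFFD_API;
    ioctl(uffd, UFFDIO_API, &api);

    // Register an anonymous mapping so that missing-page faults are
    // delivered to us instead of being resolved by the kernel.
    void *region = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {};
    reg.range.start = (unsigned long)region;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    // Fault-handling loop (normally run in a dedicated thread).
    static char page[4096];
    for (;;) {
        struct pollfd pfd = { uffd, POLLIN, 0 };
        poll(&pfd, 1, -1);

        struct uffd_msg msg;
        if (read(uffd, &msg, sizeof(msg)) <= 0 ||
            msg.event != UFFD_EVENT_PAGEFAULT)
            continue;

        // A disaggregated-memory system would fetch this page from a
        // remote node here; we just fill it with a recognizable pattern.
        memset(page, 0x5a, sizeof(page));

        struct uffdio_copy copy = {};
        copy.dst = msg.arg.pagefault.address & ~(4096UL - 1);
        copy.src = (unsigned long)page;
        copy.len = 4096;
        ioctl(uffd, UFFDIO_COPY, &copy);
    }
}
```

In practice the fault-handling loop runs in a dedicated thread while application threads touch the registered region and trigger the faults it serves.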
I expect to see userfaultfd enabling many new and clever systems. One area that seems poised to produce some interesting results is user-space distributed shared memory, especially in the form of new programming models. A quick search on Google will reveal several recent proof-of-concept systems that implement some form of distributed shared memory in user-space, typically using some combination of mprotect and interception of system calls like clone to handle thread migration. Solutions built on mprotect can be awkward, and userfaultfd may make this space a lot more approachable for researchers.
It’s also feasible to integrate userfaultfd into existing applications to provide new capabilities. For example, LMDB is an MVCC key-value database designed for shared-memory systems that stores all its data in a memory-mapped file. With relatively few changes to LMDB, it should be possible to provide a scalable solution by mapping the address space onto a resource like Ceph RBD storage. Of course, one could do this today by memory-mapping an RBD device, but there are many storage resources that are not supported as virtual block devices by the Linux kernel. What’s interesting about the use of userfaultfd is that no matter the backing resource, LMDB can take a direct role in managing memory, allowing a co-designed solution between an application, LMDB, and the underlying storage provider. One could imagine a similar integration into RocksDB or PostgreSQL to scale out the size of the in-memory cache.
It’s not clear whether we will see any open-source single system images emerge using userfaultfd, but I expect it to at least form the basis for exploring new algorithms and protocols in this area.
Let’s move all the way up the stack now and talk about languages.
Language implementations #
As someone who cannot at all claim to be a languages person, I’m oddly drawn to every new post on Hacker News that talks about some new feature in Rust, Pony, or C++. Don’t even get me started on the posts that demonstrate how to use JITs to build toy language implementations. Those are the best. The thing about languages is just how powerful they can be, especially domain-specific languages whose semantics can be exploited to enable important optimizations. But it is also abundantly clear from watching this area that language implementation is a massive undertaking that runs the gamut of topics in computer science and architecture. It’s this overhead that has made me excited about two projects that are trying to make the process of building high-performance language implementations much, much easier.
Graal and Truffle #
The first set of technologies is Graal and Truffle. These two projects are part of a big ecosystem for building language implementations that run on the JVM. In short, Truffle provides a framework for building language implementations as abstract syntax tree (AST) interpreters. Language implementations that use an AST interpreter are relatively easy to build: they look and feel like a direct evaluation of the language itself, as opposed to an approach such as building a bytecode interpreter. Unfortunately, AST interpreters can be difficult to optimize. When combined with Graal, annotations on language interpreters built with Truffle can be used to produce virtual machines with automatic just-in-time compilation. What this means in practice is that it is now easier than ever to build high-performance language implementations.
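To give a feel for the AST-interpreter style, here is a toy sketch in C++. This is not Truffle code (Truffle is a Java framework); it is only meant to illustrate why these interpreters read like a direct evaluation of the program:

```cpp
#include <iostream>
#include <memory>

// Each AST node knows how to evaluate itself, so the interpreter reads
// like a direct evaluation of the language, rather than a dispatch loop
// over bytecodes.
struct Node {
    virtual ~Node() = default;
    virtual long eval() const = 0;
};

struct Literal : Node {
    long value;
    explicit Literal(long v) : value(v) {}
    long eval() const override { return value; }
};

struct Add : Node {
    std::unique_ptr<Node> lhs, rhs;
    Add(std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    long eval() const override { return lhs->eval() + rhs->eval(); }
};

int main() {
    // The AST for (1 + 2) + 40.
    auto tree = std::make_unique<Add>(
        std::make_unique<Add>(std::make_unique<Literal>(1),
                              std::make_unique<Literal>(2)),
        std::make_unique<Literal>(40));
    std::cout << tree->eval() << "\n";  // prints 43
}
```

Truffle interpreters are written in this same style in Java, and Graal partially evaluates the interpreter against a hot tree to produce compiled code automatically.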
The Graal and Truffle community has built some impressive demonstrations of language implementations including R, Ruby, JavaScript, Python, and Lua. But one of the most ambitious and interesting projects is Sulong, an interpreter for LLVM bitcode which enables interesting capabilities such as running C or C++ code on the JVM, potentially eliminating some of the costly overhead of making calls into native libraries via interfaces such as JNI. And this should generalize to other language implementations running on the JVM that want to interoperate efficiently. So perhaps for some projects the overhead of using JNI will disappear.
Honestly I’m a little surprised that I haven’t seen Sulong used as a basis for running an emulated Linux kernel on the JVM. I mean, we’ve already seen the Linux kernel compiled to JavaScript and running in the browser. What happens when the operating system is running directly on the JVM, and “user-space” is also running Java applications? Context switch overhead may simply disappear, or better yet, end-to-end code paths that cross protection boundaries may be optimized like any other JIT code path. A related project in 2015 explored using just-in-time compilation to fuse together an application and its use of an embedded database engine to avoid the costs of crossing programming language boundaries.
Eclipse OMR #
The second project in the space of language implementations is Eclipse OMR, which spun out of IBM. The story goes something like this: after building a high-performance implementation of the JVM called J9, IBM recognized that many of the components used to build J9, such as threading, garbage collection, and just-in-time compilation, could be reused in other language implementations. But the motivation is more about building efficient cloud stacks. As the cloud becomes more of a polyglot soup of language virtual machines, there is a lot of benefit to making these virtual machines as efficient as possible. The Eclipse OMR project is a framework of components and tools for building new language virtual machines, as well as for enhancing existing virtual machines by adding state-of-the-art implementations of components such as garbage collection or JIT compilation.
It’s early days for Eclipse OMR, but much of the project consists of rock-solid internal IBM technology that is being open-sourced. Unlike Graal and Truffle, OMR is a C++ project. It’s much lower level, which brings many advantages and allows it to fill a certain niche, such as integration into existing C and C++ language run-time implementations. I’m watching OMR carefully, especially its use in new and existing language implementations.
Distributed run-time systems #
While technologies such as Graal, Truffle, and Eclipse OMR are exciting because they are reducing the costs of building new language implementations, I expect to see language virtual machines escape the walled gardens of their single node hosts sooner rather than later.
TidalScale built a hypervisor that presents the abstraction of a single physical machine but which is mapped onto a cluster of hardware. But nothing prevents a similar approach from being taken inside language virtual machines running in user-space. In fact, this has been done in the JVM in a project that even included thread migration, and Google Scholar shows several related projects. But most of the projects I’ve found are from nearly 15 years ago, and to say that things have changed a lot in the systems space since then would be an understatement.
Engineering challenges such as thread migration are likely to be significantly easier inside language virtual machines, and networking technologies like RDMA are becoming far more common. What about garbage collection for JVM instances using significant amounts of memory? The company Azul Systems provides a JVM optimized for very large heap sizes. While not distributed, the challenges that appear for large heaps on a single node are likely to also be seen when scaling out. What about scaling the address space? Perhaps userfaultfd would be useful, and language virtual machines often explicitly represent language-level objects, allowing objects to be tracked as their location changes. To be honest, I’m not completely up to date in this space, and there may very well be a good reason for it being relatively quiet. Still, it’s intriguing.
Exascale #
I’ve been following the exascale initiative since its kick-off meeting in 2011. Seven years later, and a lot of progress has been made (see here and here). Whatever topic you are interested in, be it languages, compilers, simulation, storage, memory, processors, GPUs, networking, or power management, it will be a critical component in the exascale project simply because everything is different at this scale.
One observation made early on was that a change was desperately needed in how software was written for these machines. This was not only because it was already becoming difficult to scale existing software and I/O, but because unless a new approach was taken, every order-of-magnitude increase in scale would require rethinking application construction. Instead of building software in low-level languages with rigid execution paths, domain-specific languages allow high-level computations to be expressed, while low-level run-times take care of mapping execution onto a target architecture.
I don’t normally recommend that people read 256-page PDFs, but Michael Bauer’s thesis does a superb job of articulating the challenges of building HPC software. Honestly, reading only the first couple of chapters is enough to get the idea. The remainder of the thesis introduces Legion.
Legion is a distributed run-time designed to run on massively distributed, highly heterogeneous computing systems. Legion takes over a system and maps computation and communication onto it using application- and system-specific optimizations. Applications are written in a way that minimizes and compartmentalizes the task of migrating software between systems. Programmers can write software directly against the Legion API, or get serious and use domain-specific languages that use Legion as a language run-time system.
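For a sense of what writing directly against the Legion API looks like, here is a minimal task-registration sketch loosely based on Legion’s public “hello world” tutorial; registration calls have changed across Legion versions, so treat the exact signatures as illustrative rather than definitive:

```cpp
#include <cstdio>
#include <vector>
#include "legion.h"

using namespace Legion;

enum TaskIDs { TOP_LEVEL_TASK_ID };

// The top-level task is the entry point the Legion runtime invokes; real
// applications launch sub-tasks and partition data from here, and the
// runtime maps those tasks onto the machine.
void top_level_task(const Task *task,
                    const std::vector<PhysicalRegion> &regions,
                    Context ctx, Runtime *runtime) {
  printf("Hello from Legion!\n");
}

int main(int argc, char **argv) {
  Runtime::set_top_level_task_id(TOP_LEVEL_TASK_ID);
  {
    // Register a CPU variant of the top-level task before starting.
    TaskVariantRegistrar registrar(TOP_LEVEL_TASK_ID, "top_level");
    registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
    Runtime::preregister_task_variant<top_level_task>(registrar, "top_level");
  }
  return Runtime::start(argc, argv);
}
```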
With new pre-exascale systems coming online, it is time for HPC application developers to step up and demonstrate a new way of scaling science. I hope to see Legion take a central role here. Additionally, I fully expect to see Legion escape the HPC corner it currently occupies: as a run-time for data analytics outside of traditional HPC, it stands to serve as no less than a unifying abstraction for future frameworks. There are other systems in this space, such as Charm++, but Legion is a unique and fascinating project that I encourage everyone to become familiar with.
Tools and frameworks #
This last section has little to do with research and everything to do with making progress: there are several technologies I’m excited to see used to build new systems. It’s probably just the echo chamber I’m in, but modern C++ is getting a lot of love. Given its huge mind share among developers and the advancements to the language and standard library (e.g. C++17), it’s no surprise that there is a lot of new development and excitement around the language.
Seastar is a C++ framework for building networked servers. The framework provides a futures-based programming model, encourages a shared-nothing server design that avoids threads, locking, and context switching, and even includes a DPDK-based TCP/IP stack implemented in user-space to further avoid the overheads of crossing into kernel-space. These features combine into a framework for building extremely low-latency network applications. Seastar powers ScyllaDB, a high-performance distributed database similar to Cassandra, with some incredible performance numbers.
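Here is a small sketch of Seastar’s futures-based style, adapted from memory of the public Seastar tutorial; header paths and exact signatures vary across Seastar releases, so treat it as illustrative:

```cpp
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>

// The reactor runs one thread per core; instead of blocking, work is
// expressed as futures that are chained with .then() continuations.
int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        using namespace std::chrono_literals;
        return seastar::sleep(1s).then([] {
            std::cout << "hello from a continuation\n";
            return seastar::make_ready_future<>();
        });
    });
}
```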
Next up is the smf project. It is branded as a set of building blocks commonly found in distributed systems (e.g. consensus and write-ahead logging). It’s interesting because it assumes the use of Seastar, meaning that all of its components are optimized for the framework. A particularly interesting component in smf is its RPC framework, which provides single-digit microsecond latency.
And finally, OpenFabrics and OpenUCX are new networking software frameworks. These frameworks seek to provide APIs that abstract over the large array of different networking technologies. This means you may be able to write software that uses remote atomics, test it on your laptop using a standard TCP/IP stack, and then deploy to a cluster with InfiniBand to take advantage of hardware acceleration and kernel bypass. I’d recommend reading up on OpenUCX, or watching a presentation about the technology on YouTube. It might just be the perfect mechanism for shuffling memory around in a distributed shared-memory implementation of a language virtual machine :)
I’d like to thank @ivotron and Michael Sevilla for reviewing this post.