Time for a Cluster OS?

The other week I attended The International Conference for High Performance Computing, Networking, Storage, and Analysis (commonly known as Supercomputing or SC). It was a great conference: I had a great time and learned a ton. But I also noticed an interesting theme that I didn’t see anybody else mention, so I thought I’d at least write it down.

One of the main events of the conference was Jack Dongarra receiving the Turing Award. He is a co-founder and long-time maintainer of the TOP500 list (the list of the world’s fastest supercomputers). He talked about the linear algebra library he wrote (LINPACK) and the benchmark derived from it that is used to measure the performance of a supercomputer and thus defines the TOP500 ranking. He noted that many of the supercomputers achieve close to their theoretical peak performance on that benchmark, and that it’s even possible to rank them by energy efficiency as well. But then he pointed out that most modern problems don’t look like this benchmark, and put up a slide showing the performance of the same machines on a more “realistic” type of problem. At best they were achieving less than 2% of their theoretical peak. So it seems we’ve been measuring and optimising for the wrong type of problem. But why and how is that?

The Linpack benchmark that’s been used for the last several decades boils down to solving a dense system of linear equations, with inner loops dominated by regular matrix-matrix multiplication. The memory access pattern is highly regular and well defined, and thus very easy to split across machines for parallel execution. But most “real world” problems aren’t like this. For many problems the memory access pattern is irregular, or worse, the amount and type of calculation to be performed isn’t even known until earlier calculations have finished. This makes it very difficult to split the data and calculations efficiently across machines. Unfortunately, our supercomputers have been built on the assumption, drawn from the very benchmark used to measure them, that calculation speed is the bottleneck, when for most problems the movement of data around the machine is the actual bottleneck.
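
To make the contrast concrete, here’s a toy sketch in C (made-up sizes and values, not anything from the conference): a dense kernel of the kind Linpack rewards walks memory in predictable, unit-stride order, while a sparse kernel in CSR form gathers its inputs through an index array, so where the data lives can’t be known until run time.

```c
/* Toy contrast between a regular, predictable kernel and an irregular one.
 * The dense loop streams memory with unit stride; the CSR sparse loop
 * gathers x[] through an index array, so the access pattern is only known
 * at run time. All sizes and data are made up purely for illustration. */
#include <stdio.h>

#define N 4

/* Regular: dense matrix-vector product, y = A*x.
 * Every access is a predictable, unit-stride walk over A. */
static void dense_mv(double A[N][N], const double x[N], double y[N]) {
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += A[i][j] * x[j];
        y[i] = sum;
    }
}

/* Irregular: sparse matrix-vector product in CSR form.
 * x[col[k]] is an indirect load; nothing in the loop structure says
 * which memory it will touch next. */
static void csr_mv(int n, const int *rowptr, const int *col,
                   const double *val, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}

int main(void) {
    double A[N][N] = {{2,0,0,1},{0,3,0,0},{0,0,4,0},{1,0,0,5}};
    double x[N] = {1,1,1,1}, y[N];

    dense_mv(A, x, y);
    printf("dense : %g %g %g %g\n", y[0], y[1], y[2], y[3]);

    /* The same matrix, stored sparsely (only its 6 non-zeros). */
    int rowptr[N + 1] = {0, 2, 3, 4, 6};
    int col[6] = {0, 3, 1, 2, 0, 3};
    double val[6] = {2, 1, 3, 4, 1, 5};
    csr_mv(N, rowptr, col, val, x, y);
    printf("sparse: %g %g %g %g\n", y[0], y[1], y[2], y[3]);
    return 0;
}
```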

This idea was supported by a major theme of many of the talks I attended. They were all about data transfer libraries (things like MPI), alternatives to using those libraries directly, task scheduling libraries and frameworks, design patterns to keep data in certain places longer, faster networking hardware, and more of these same kinds of things. All of these technologies and advancements trying to solve the fundamental problem, “How do we coordinate data storage and movement so that it’s efficiently accessible to the calculations being performed”?
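
For readers who haven’t used these libraries, this is roughly what the explicit, application-managed style of data movement looks like. It’s a minimal, made-up halo-exchange sketch using the standard MPI API, not a pattern taken from any particular talk.

```c
/* Minimal sketch of explicit, application-managed data movement:
 * a 1-D halo exchange with MPI. Every byte that crosses a node boundary
 * moves because the programmer said so, here and everywhere else the
 * algorithm needs it. (Sizes and initial data are placeholders.) */
#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 1024              /* interior points owned by this rank */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* u[0] and u[LOCAL_N+1] are ghost cells holding neighbours' data. */
    double u[LOCAL_N + 2];
    for (int i = 0; i <= LOCAL_N + 1; i++)
        u[i] = (double)rank;      /* placeholder initial data */

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    /* Send my first interior point left, receive my right ghost cell;
     * then the mirror image. The programmer, not the OS, decides what
     * moves, when, and where it lands. */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[LOCAL_N],     1, MPI_DOUBLE, right, 1,
                 &u[0],           1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("rank 0 right ghost now holds %g\n", u[LOCAL_N + 1]);

    MPI_Finalize();
    return 0;
}
```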

Now I’m not suggesting this isn’t a hard problem, especially to solve efficiently for a large class of problems, but it looks like the kind of problem that has already been solved pretty effectively at a smaller scale. The problem is simple to state: it’s inefficient for data to be stored far from the place where calculations will be performed on it, so put the data closer when it’s needed and move it away when it’s not, to make room for the next set of data and operations. The hard part is knowing which data will be needed when and where. But that’s exactly the problem that modern operating systems, together with the hardware they run on, seem to have solved for the various levels of cache on a multi-core processor versus main system memory. Our modern desktop machines are, in a way, simpler, smaller-scale versions of large supercomputers: each core has some of its own cache, there is some cache shared between cores, and there is system memory that’s slow to access directly. It’s not perfect, but the system has figured out how to predict what data should be moved where, and when, pretty effectively. There are software design patterns that tend to make those predictions more accurate, and occasionally failing to follow them leads to noticeable performance differences, but most software developers don’t seem to have to think about it anymore.
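
The classic demonstration of how much the memory hierarchy does for us, without the application managing any of it, is two loops that do identical arithmetic but traverse memory in different orders. A small sketch (the array size is just a guess meant to be bigger than most caches):

```c
/* The same arithmetic, two traversal orders. The row-order loop walks
 * memory the way it is laid out, so the cache hierarchy (managed for us
 * by the hardware and OS, not by this code) keeps the data close; the
 * column-order loop fights it. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

static double sum_row_order(const double *a) {
    double s = 0.0;
    for (int i = 0; i < N; i++)          /* unit stride: cache lines reused */
        for (int j = 0; j < N; j++)
            s += a[(size_t)i * N + j];
    return s;
}

static double sum_col_order(const double *a) {
    double s = 0.0;
    for (int j = 0; j < N; j++)          /* stride of N doubles: poor reuse */
        for (int i = 0; i < N; i++)
            s += a[(size_t)i * N + j];
    return s;
}

int main(void) {
    double *a = malloc((size_t)N * N * sizeof *a);
    if (!a) return 1;
    for (size_t k = 0; k < (size_t)N * N; k++)
        a[k] = 1.0;

    clock_t t0 = clock();
    double s1 = sum_row_order(a);
    clock_t t1 = clock();
    double s2 = sum_col_order(a);
    clock_t t2 = clock();

    printf("row order: %.3fs (sum %g)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, s1);
    printf("col order: %.3fs (sum %g)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, s2);
    free(a);
    return 0;
}
```

On most machines the second loop is noticeably slower, purely because of how the data moves through the hierarchy, and yet neither loop contains a single line of explicit data-movement code.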

I’ll admit that a supercomputer adds another layer of complexity onto this. There are more pathways for data movement, a more complex measure of “distance” between processor and memory, and these days a larger variety of processors in the system, but it’s fundamentally the same problem that OSes have already solved. So why doesn’t a supercomputer take advantage of this existing solution? Because a supercomputer isn’t running one operating system! Each node in a supercomputer runs its own operating system, and thus has no view into what the other nodes hold in terms of data, memory and processors. But why not? It seems to me that if there were a single unified model and view of the memory, processors and pathways in a supercomputer, i.e. a single operating system, it would be possible to take advantage of the existing solutions to memory and data management.
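
For what it’s worth, probably the closest widely deployed thing to that “single unified view of memory” today is a global address space layered on top of per-node operating systems, for example MPI’s one-sided communication (or PGAS languages such as UPC and Chapel). A minimal, illustrative sketch with made-up values:

```c
/* A small taste of a "single view of memory" without a cluster OS:
 * MPI one-sided communication. Each rank exposes a window of its memory,
 * and any rank can read from any other rank's window without the owner
 * participating in that particular transfer. (Window size and values are
 * illustrative only.) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank contributes one double to a globally addressable window. */
    double *local;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)sizeof(double), (int)sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &local, &win);
    *local = 100.0 + rank;        /* this rank's piece of the "global" array */

    MPI_Win_fence(0, win);        /* open an access epoch */

    /* Every rank reads its right neighbour's value directly. */
    double neighbour = -1.0;
    int target = (rank + 1) % nprocs;
    MPI_Get(&neighbour, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);

    MPI_Win_fence(0, win);        /* close the epoch; transfers complete */

    printf("rank %d read %g from rank %d\n", rank, neighbour, target);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

It’s still the application, not an operating system, deciding what lives where, but it at least lets any rank address any other rank’s memory as if it were one big pool.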

What do you all think? Is there some fundamental issue with this idea? Has it been tried and failed? Are we already doing this to a degree and I just didn’t know? What would be required to explore this idea? Would it mean a radical change in the architecture of supercomputer systems?

2 thoughts on “Time for a Cluster OS?”

  1. Very interesting analogy for approaching the problem. For Hyper-Converged Infrastructure (basically focused on hosting a lot of virtual servers), we store multiple copies of data across the cluster and try to align compute tasks with data access in a way that isn’t likely to change except on the order of days. Then, if a node does need to access data it doesn’t have a copy of, we use network acceleration techniques like RDMA, which let the data access be offloaded directly to the network card and bypass the CPU. The bottleneck in these applications is rarely network bandwidth and is more likely to be the actual storage or even the CPU. In this scenario you have to determine whether it’s the cache or the bus that is the bottleneck. Depending on the size of your dataset, you might consider making multiple copies of the data to make it accessible to more nodes.

    1. I think commercial cloud computing infrastructure and supercomputing infrastructure have a lot to learn from each other here. The multiple-copies solution works great if the data is read-only, but sometimes the algorithms need to synchronise updates to the data. Making immutability the default in common HPC programming languages (i.e. requiring the places where synchronised updates happen to be explicit) could help make the multiple-copies approach more frequently usable, though.
