Some Scattered Thoughts on #DevOps and #HPC

Broadly defined, DevOps is a movement that advocates a close integration between the areas of software development, system administration and (to a lesser extent) quality assurance. It’s attracted a lot of attention lately in organizations which do software as a service and other webapp-y things, as that’s where all the hot startups are these days. Applied to specific people, the “DevOps movement” seems to advocate either giving developers responsibilities traditionally allocated to sysadmins, or at least training the sysadmins to do software development (and, sometimes, unleashing them on the codebase).

As far as I can tell the movement has developed, at least in part, in response to the growing usage of “Infrastructure as a Service” clouds for Internet startups and other Web properties. IaaS doesn’t remove sysadmin responsibilities completely, but it negates the hardware bits and makes the idea of “programmable infrastructure” much more attractive. A lot of great tools have come out of DevOps or associated companies: in particular, the Chef and Puppet configuration management systems are incredibly interesting and potentially very useful in HPC.

It’s easy to see why I’d be interested in the whole DevOps thing: while my current responsibilities at R-HPC are much more on the sysadmin side of things, the majority of my career in research labs leaves me closer to development, and I definitely try to keep one foot in the door on the development side. So as the whole DevOps movement has developed, I’ve been trying to think about how it applies to traditional high-performance computing environments, and also what it means for the growing use of “cloud computing” for HPC.

On the one hand, it’s easy to say that the whole DevOps philosophy of integrating the sysadmin and development teams doesn’t apply very well to traditional sorts of HPC environments. Most of the biggest and most exciting HPC environments out there–the sort that appear near the top of the Top 500 list–are academic environments in which the majority of the code is developed by scientists and engineers doing domain-specific research. Good luck doing DevOps there: most of the scientists have very little interest in the infrastructure they’re running on, making these environments much closer to “Platform as a Service” clouds like Heroku or AppEngine. It’s “MPI as a Service”!

On the other hand, there’s a strong argument that HPC has already embraced some of the core features of the DevOps movement. While there’s still a pretty clear organizational line between sysadmin and scientist at most HPC sites, there’s a lot more interaction than you typically associate with ops and dev teams in an enterprise. Most HPC sysadmins worth their salt can do some programming, and know more about the sorts of optimizations available on their systems than most of their users. The ops team at an HPC site spends a lot of time interacting with users and even helping to debug code and improve performance. (I know that this is certainly true at R-HPC: while a lot of people do show up with commercial applications, we spend a lot of time on performance optimizations and helping our customers with ancillary code and scripts.)

On the gripping hand… “HPC in the Cloud” is not going away any time soon. (Which, to give a disclaimer, I’m obviously counting on at R-HPC.) If I recall correctly, at HPC 360 Addison Snell gave a talk which put the average size of a parallel job on an HPC cluster at about 32 cores. At that scale, it’s going to get easier and easier to run typical HPC jobs on “cloud providers” such as R-HPC or Amazon’s EC2. Most people don’t need a Top 500 system to do their work. And if you’re working in the cloud, you’ll likely be using the tools developed for it, such as Chef. So HPC will probably have to engage with the DevOps movement and its tools, and everyone is likely to learn something in that conversation.

I don’t have any good conclusions, but it continues to be an exciting time for anyone interested in parallel computing, HPC, infrastructure automation, or the Cloud in general. There’s still a lot to be done, and it’ll be a lot of fun figuring out how it’s all going to work.

Questions, comments, interesting anecdotes? Tweet to me at @ajdecon, or send me an email at ajdecon@ajdecon.org.