Learning Chef: Compute Cluster with SLURM

layout: post
published: true
Date: 2012-01-21
Tags:
- chef
- hpc
- learning
- slurm
posterous_url: http://blog.ajdecon.org/learning-chef-compute-cluster-with-slurm
posterous_slug: learning-chef-compute-cluster-with-slurm

I've recently started playing with Chef, a configuration management system which provides a lot of nice automation for setting up servers quickly and easily. As a science/HPC guy, my default question for any system like this is, "How easily can this be used to set up a cluster?"  

This blog post is a quick walkthrough of a simple cookbook I put together for a SLURM-managed compute cluster, targeted at running MPI applications.  I put it together simultaneously with learning a lot of the basics of Chef, so this is by no means a fully general SLURM cookbook, just a demonstration of concepts. In particular, I'm only targeting Ubuntu installations with this cookbook rather than generalizing to RHEL-ish distros as well, and I'm only testing on EC2 at the moment.  Nevertheless, I think it's a useful learning exercise.

The cookbook I walk through below can be found in my hpc-chef repo on Github, along with other HPC-related cookbooks. The version used in this blog post is at this commit.


Opscode provides a fast-start tutorial for getting up and running with their Hosted Chef option, and I followed the Ubuntu server tutorial to set up the Chef server I used while developing. Once I had a Chef server and a development repository to work with, I created a new cookbook for my compute cluster using the knife command-line tool:

[ajdecon@exp chef-repo]$ knife cookbook create slurm-mpi-cluster

This created a cookbook, with a lot of the directory structure and README files pre-populated, under cookbooks/slurm-mpi-cluster. 
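
On my machine this produces a skeleton roughly like the following; the exact set of files depends on your knife version, so treat this as illustrative rather than exact:

cookbooks/slurm-mpi-cluster/
    attributes/
    definitions/
    files/default/
    libraries/
    providers/
    recipes/
        default.rb
    resources/
    templates/default/
    metadata.rb
    README.md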

SLURM

SLURM is a flexible and scalable cluster resource manager commonly used at a lot of national labs, and I've been noticing more and more use of it in private-sector projects as well. It doesn't seem as popular yet as Grid Engine or PBS/Torque, but I've had at least one production project require it for R Systems, and it seems to be gaining ground, especially in CFD-heavy groups. I chose to use it for this project because

  • I've been using it at work recently
  • It can be configured using a single file, slurm.conf
  • It's included in the default Ubuntu repositories

The first step is to begin a Chef recipe for SLURM: basically, a little Ruby program in the Chef DSL which tells Chef what resources to create on the server we're configuring. This file will be in cookbooks/slurm-mpi-cluster/recipes/slurm.rb. (The final version is in this Github gist.)

First, I installed SLURM-related packages from the Ubuntu repositories:

%w{munge slurm-llnl slurm-llnl-torque slurm-llnl-basic-plugins slurm-llnl-basic-plugins-dev}.each do |pkg|
    package pkg do
        action [:install]
    end
end

This is just a little loop over a Ruby list of package names. For each one, there is a package directive with a single action line instructing Chef to install that package.

SLURM typically uses the munge package for authentication, and munge requires a little set-up: basically, make sure its configuration directory is present, that /etc has the right permissions, and that we have a key-file. This key-file is shared with all the nodes in the cluster.

# Make sure /etc/munge directory exists
directory "/etc/munge" do
    action :create
end

# Make /etc have suitable permissions
directory "/etc" do
    mode "0755"
end

# Make sure the munge user exists
user("munge")

# Create the munge key from template
template "/etc/munge/munge.key" do
    source "munge.key.erb"
    owner "munge"
end

Stepping through this, we first create the /etc/munge directory if it doesn't already exist; then we make sure that the /etc directory has the correct permissions and that the munge user exists. Finally, we create the /etc/munge/munge.key file from a template file we haven't written yet.

It's useful to note that in general, these operations are idempotent: that is, you can apply all of these operations repeatedly by running sudo chef-client more than once, and the final state will be the same. This isn't guaranteed with all Chef operations, but it's a good goal, so that Chef enforces a known-good state on the server.

Next I set up SLURM itself. Conveniently, SLURM decides whether to run the head-node daemon (slurmctld) or the compute-node daemon (slurmd) based on whether the hostname of the server it's running on is listed as a head node or a compute node in slurm.conf. This means I don't need separate recipes for the two node types, or any other way to distinguish them here; I just need to put the same config template and service on every node.

# Create the slurm user based on settings
user(node.slurm['user'])

# Make sure the config directory exists
directory "/etc/slurm-llnl" do 
    owner "root"
    group "root"
    mode "0755"
    action :create
end

# Build slurm.conf based on the template
template "/etc/slurm-llnl/slurm.conf" do
    source "slurm.conf.erb"
    owner "slurm"
    mode "0755"
end

# Enable and start the slurm service
service "slurm-llnl" do
    action [:enable,:start]
end

Of course, a lot of the configuration logic therefore ends up in the slurm.conf.erb template.

slurm.conf and attributes

The first step in configuring SLURM is to set some basic properties for the cluster.  For this we use Chef's attributes. Default values can be defined in the cookbook, in this case using the attributes/default.rb file (gist), and you can override them later. For SLURM, I used the following default attributes:

default['slurm']['master']         = "slmaster"
default['slurm']['master_addr']    = "10.0.0.1"
default['slurm']['computes']       = [ "compute1", "compute2" ]
default['slurm']['compute_addrs']  = [ "10.0.1.1", "10.0.1.2" ]
default['slurm']['part_name']      = "production"
default['slurm']['user']           = "slurm"
default['slurm']['cpus']           = "1"

These define a pretty generic cluster of 1-core compute nodes. To generate the slurm.conf that would become my template, I used the SLURM Configurator, then replaced the relevant values with ERB template code that reads the attributes defined above. The result lives in templates/default/slurm.conf.erb (gist):

ControlMachine=<%= node.slurm['master'] %>
ControlAddr=<%= node.slurm['master_addr'] %>
...
SlurmUser=<%= node.slurm['user'] %>
...
NodeName=<%= node.slurm['computes'].join(',') %> NodeAddr=<%= node.slurm['compute_addrs'].join(',') %> Procs=<%= node.slurm['cpus'] %> State=UNKNOWN
PartitionName=<%= node.slurm['part_name'] %> Nodes=<%= node.slurm['computes'].join(',') %> Default=YES MaxTime=INFINITE State=UP
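
With the default attributes above, those template lines render to something like the following (using the placeholder hostnames and addresses from attributes/default.rb):

ControlMachine=slmaster
ControlAddr=10.0.0.1
...
SlurmUser=slurm
...
NodeName=compute1,compute2 NodeAddr=10.0.1.1,10.0.1.2 Procs=1 State=UNKNOWN
PartitionName=production Nodes=compute1,compute2 Default=YES MaxTime=INFINITE State=UP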

We also need a munge.key file we can share across the cluster for authentication. Typically this is generated by dd'ing from /dev/random or similar, but here, I just used a text-based key I can override with an attribute later:

default['munge']['key']    = "RandomKeyGoesHereRandomKeyGoesHereRandomKeyGoesHereRandomKeyGoesHere"

And munge.key.erb (gist):

<%= node.munge['key'] %>
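
For anything beyond a throwaway test cluster you'd want a genuinely random key rather than this placeholder string. One quick way to produce a printable one is a line of plain Ruby (just a sketch; run it once anywhere, then paste the output into the attribute or an environment override so every node ends up with the same key):

require 'securerandom'

# Print a 128-character hex string to use as the shared munge key
puts SecureRandom.hex(64)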

NFS share

Most compute clusters include an NFS share of either /home or a scratch space from the head node to the compute nodes. To set this up, I wrote two pretty basic recipes.

nfs_headnode.rb (gist):

# Install NFS packages
package("nfs-common")
package("nfs-kernel-server")

# Make sure the directories to be exported exist
node.nfs['shared_dirs'].each do |d|
    directory d do
        mode "0777"
        action :create
    end
end

# Create the exports file and refresh the NFS exports
template "/etc/exports" do
    source "exports.erb"
    owner "root"
    group "root"
    mode "0644"
end

# Start the NFS server
service "nfs-kernel-server" do
    action [:enable,:start,:restart]
end

execute "exportfs" do
    command "exportfs -a"
    action :run
end

Note that I included an "execute" resource here, which runs an arbitrary shell command. This type of operation is not necessarily idempotent, though in this case it is (exportfs -a should always produce the same result for a given exports file), and in general it's a little less predictable than most Chef resources.
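
If you wanted to be stricter about only re-exporting when something actually changes, one option (not what this cookbook does) is to have the template notify the execute resource, which otherwise does nothing. A sketch of that pattern:

# Only run exportfs when /etc/exports is actually updated
template "/etc/exports" do
    source "exports.erb"
    owner "root"
    group "root"
    mode "0644"
    notifies :run, "execute[exportfs]", :immediately
end

execute "exportfs" do
    command "exportfs -a"
    action :nothing
end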

exports.erb (gist):

<% node.nfs['shared_dirs'].each do |dir| -%>
<%= dir %>  <% node.nfs['clients'].each do |client| -%><%= client %>(rw) <% end -%>
<% end -%>
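
With the default attributes, that template renders to a one-line exports file along the lines of:

/scratch  10.0.1.1(rw) 10.0.1.2(rw)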

nfs_computenode.rb (gist):

package("nfs-common")

# Make sure the directories to be mounted exist
node.nfs['shared_dirs'].each do |dir|
    directory dir do
        mode "0777"
        action :create
    end
end


file "/etc/fstab" do

    sourceip = node.nfs['headnode_addr']
    dirs = node.nfs['shared_dirs']

    # Generate the new fstab lines
    new_lines = ""
    dirs.each do |d| 
        new_lines = new_lines + "#{sourceip}:#{d}  #{d}  nfs  defaults 0 0\n"
    end

    print "*** Mount line: #{new_lines}\n"

    # Get current content, check for duplication
    only_if do
        current_content = File.read('/etc/fstab')
        current_content.index(new_lines).nil?
    end

    print "*** Passed the conditional for current content\n"

    # Set up the file and content
    owner "root"
    group "root"
    mode  "0644"
    current_content = File.read('/etc/fstab')
    new_content = current_content + new_lines
    content new_content

end

execute "mount" do
    command "mount -a"
    action :run
end

In the compute node recipe, note that rather than using a template, I'm directly modifying a system file, in this case /etc/fstab. This is more dangerous in some ways, but it also means I don't care what the previous content of the file was (so other mounts can happily coexist there). The recipe works in three steps:

  1. Generate new fstab lines for the NFS mounts
  2. Check to see if these lines are present in the fstab, and stop if they are
  3. Otherwise, append the new lines to the existing content
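
For what it's worth, Chef also ships a built-in mount resource that manages both the fstab entry and the mount itself, which would be the more idiomatic way to do this. A sketch of what that might look like with the same attributes (I haven't tested this variant here):

# Let Chef's mount resource manage both /etc/fstab and the mount
node.nfs['shared_dirs'].each do |d|
    mount d do
        device "#{node.nfs['headnode_addr']}:#{d}"
        fstype "nfs"
        options ["defaults"]
        action [:mount, :enable]
    end
end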

And with attributes...

default['nfs']['headnode_addr']    = default['slurm']['master_addr']
default['nfs']['shared_dirs']      = ["/scratch"]
default['nfs']['clients']          = default['slurm']['compute_addrs']

Other software

Finally, we have a couple of other small recipes to add. First, all the nodes should include the OpenMPI binaries and libraries (openmpi.rb):

%w{openmpi-common openmpi-bin openmpi-checkpoint libopenmpi-dev openmpi-doc}.each do |pkg|
    package pkg do
        action [:install]
    end
end

And we should install the pdsh "parallel ssh" command on the head node, for management (pdsh.rb):

package("pdsh")

Roles

Of course, some of these recipes should only be applied to the head node or only to the compute nodes; our two NFS recipes, for example. Thankfully, Chef has roles that let you group related recipes:

slurm-headnode.rb:

name "slurm-headnode"
description "SLURM cluster headnode"
run_list(
    "recipe[slurm-mpi-cluster::slurm]",
    "recipe[slurm-mpi-cluster::nfs_headnode]",
    "recipe[slurm-mpi-cluster::openmpi]",
    "recipe[slurm-mpi-cluster::pdsh]"
)

slurm-computenode.rb:

name "slurm-computenode"
description "SLURM cluster compute node"
run_list(
    "recipe[slurm-mpi-cluster::slurm]",
    "recipe[slurm-mpi-cluster::nfs_computenode]",
    "recipe[slurm-mpi-cluster::openmpi]"
)

Overriding attributes

Finally, when I actually deploy this cookbook, I want to override the default attributes so that hostnames, IP addresses, and so on are correct. Chef provides a number of ways to do this, but one of the easiest is using environments. These are, more-or-less, just a list of attribute overrides which can be applied to nodes using the Chef server. An example of an environment I used when testing on EC2 (gist):

name "ec2"
description "An environment for testing the slurm cookbook on ec2"
override_attributes "slurm" => { 'master' => 'domU-12-31-39-00-A4-87', 'master_addr' => '10.254.171.117',
'computes' => ['domU-12-31-39-16-C9-DB','domU-12-31-39-13-D5-27'],
'compute_addrs' => ['10.96.202.41', '10.201.214.213'],
    'cpus' => '1' }, 
"nfs" => { 'headnode_addr' => '10.254.171.117', 'shared_dirs' => ['/scratch'],
    'clients' => ['10.96.202.41','10.201.214.213'] }