Why was RAM made a consumable resource on the cluster?

Until recently, SGE on the JHPCE cluster was configured NOT to treat RAM as a “consumable” resource. A recent change to the SGE configuration makes RAM “consumable”, so that it is reserved for a job when the job requests it.  What does this mean?

As an example, for simplicity's sake, let’s say you have a cluster with just 1 node, and that node has 20GB of RAM.  If you run a job that requests 8GB of RAM (mem_free=8G,h_vmem=9G), it will start to run on the node immediately. Now, this job takes a few minutes before all 8GB is actually in use by the program. Let’s say it consumes an additional 2GB each minute, so after 4 minutes all 8GB will be in use.  One minute after the first job starts, it is using 2GB of RAM, and a second job comes along and requests 8GB of RAM.  SGE will see that there is still 18GB of RAM free on the node and start the second job.  A minute after that, a third job comes along, also requesting 8GB.  The first job is using 4GB, the second job is using 2GB, and the node has 14GB free, so SGE, seeing that 8GB is available, starts the third job.  So now you have 3 jobs running that will eventually need 24GB of RAM in total, and there is only 20GB on the system, so at some point the node becomes RAM starved and the Linux oom-killer gets invoked to kill a process.  (For extra credit: at what time does the node run out of RAM? 🙂)
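If you would like to check the arithmetic in the example above, here is a small Python sketch of it. It is only a toy model, not how SGE works internally: it assumes each job's memory use grows in a straight line at 2GB per minute until it reaches 8GB, and that the three jobs start at minutes 0, 1 and 2. All of the names in it (job_usage, free_ram, and so on) are made up for this illustration.

```python
# Toy model of the single-node example: RAM is NOT consumable, so the
# scheduler only looks at how much memory is free at the moment a job arrives.
NODE_RAM_GB = 20
RAMP_GB_PER_MIN = 2           # assumed linear growth of each job's memory use
JOB_PEAK_GB = 8               # each job eventually uses 8GB
START_TIMES_MIN = [0, 1, 2]   # the three jobs start one minute apart


def job_usage(start, t):
    """RAM (GB) in use at minute t by a job that started at minute `start`."""
    if t < start:
        return 0.0
    return min(JOB_PEAK_GB, RAMP_GB_PER_MIN * (t - start))


def total_usage(t):
    """RAM (GB) in use on the node at minute t, summed over all three jobs."""
    return sum(job_usage(start, t) for start in START_TIMES_MIN)


def free_ram(t, running_starts):
    """Free RAM (GB) seen at minute t, counting only jobs already running."""
    return NODE_RAM_GB - sum(job_usage(start, t) for start in running_starts)


def first_time_full(limit_gb, step_min=0.25, horizon_min=10.0):
    """Step forward in time; return the first minute at which usage reaches the limit."""
    for i in range(int(horizon_min / step_min) + 1):
        t = i * step_min
        if total_usage(t) >= limit_gb:
            return t
    return None


if __name__ == "__main__":
    print("free RAM when job 2 arrives:", free_ram(1, [0]), "GB")
    print("free RAM when job 3 arrives:", free_ram(2, [0, 1]), "GB")
    print("node is full at minute:", first_time_full(NODE_RAM_GB))
```

Running the script prints the free memory seen when the second and third jobs arrive, which should match the 18GB and 14GB figures above, and then the answer to the extra credit question.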

The change made to the cluster alters the behavior of SGE so that RAM is “consumable”: when you request 8GB, SGE marks that 8GB as reserved.  In the above example, the first 2 jobs would have run, and SGE would have marked 16GB of RAM as “consumed”, leaving only 4GB unreserved, so the third job would not have run until one of the other jobs finished.  The biggest downside to this approach is that if people request much more RAM than their jobs need, then jobs will have to wait longer to run, and resources may go unused.  If, in the above example, the first job had requested 15GB of RAM “to be safe”, that would have prevented the second job from starting until the first completed, even though the 2 jobs could have run concurrently.
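The bookkeeping behind a “consumable” resource can be sketched in a few lines of Python. This is a simplified illustration under the same one-node, 20GB assumption, not SGE's actual scheduler code, and the function name schedule is hypothetical. A job starts only if its request fits in the RAM that has not yet been reserved, regardless of how much memory the running jobs are really using.

```python
# Toy model of consumable RAM on a single 20GB node: each request is
# subtracted from the pool when the job starts, whether or not it is used yet.
NODE_RAM_GB = 20


def schedule(requests_gb, capacity_gb=NODE_RAM_GB):
    """Return (started, waiting) lists of job numbers, in arrival order."""
    reserved = 0
    started, waiting = [], []
    for job, request in enumerate(requests_gb, start=1):
        if reserved + request <= capacity_gb:
            reserved += request      # mark the requested RAM as consumed
            started.append(job)
        else:
            waiting.append(job)      # must wait for a running job to finish
    return started, waiting


if __name__ == "__main__":
    # Three 8GB requests: the third does not fit in the remaining 4GB.
    print(schedule([8, 8, 8]))
    # A 15GB "to be safe" request leaves only 5GB, so an 8GB job has to wait.
    print(schedule([15, 8]))
```

The second call shows the downside described above: the 15GB reservation blocks the 8GB job even if the first job never uses more than 8GB.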
