Dear JHPCE Community,
With the ever-increasing usage of the JHPCE cluster, we have seen, during very busy times, a number of cases where individuals have been unable to access cluster resources. In investigating these incidents, we found that the limiting factor was the amount of memory being used by individual users on the cluster.
To address this issue, the JHPCE management team will be modifying the cluster configuration to set a 1TB memory limit per user for jobs running on the shared queue. This limit should not be noticed by the vast majority of users on the cluster, and will only affect those who run a high number of jobs that use a large amount of memory. Please note that jobs running on dedicated queues will not be subject to this limit; only jobs on the shared queue will be affected.
This 1TB memory limit will be managed in a similar manner to the 200-core limit that is currently in place. As with the core limit, we can temporarily increase the 1TB memory limit for individuals who will be submitting jobs that need more than 1TB of memory, but only at times when the cluster is less heavily loaded.
The 1TB limit will be based on the “mem_free” setting for running jobs submitted with qsub and for interactive sessions started with qrsh. For every running job and every qrsh session, the “mem_free” value will be deducted from the 1TB limit. As an example, if you have 20 jobs running on the cluster, each with “mem_free” set to 10GB, then a total of 200GB will be counted against your 1TB memory limit, leaving 800GB available for additional jobs. As your jobs complete, the memory counted against your limit will be released.
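To make the accounting concrete, here is a small sketch of the arithmetic in the example above (the job count and per-job “mem_free” values are hypothetical and simply mirror the numbers in this message):

```shell
# Hypothetical accounting example: 20 running jobs, each with mem_free=10G
jobs=20
mem_free_gb=10
limit_gb=1000                            # the 1TB per-user limit, in GB

used_gb=$((jobs * mem_free_gb))          # total counted against the limit
remaining_gb=$((limit_gb - used_gb))     # budget left for additional jobs

echo "Counted against limit: ${used_gb}GB; remaining: ${remaining_gb}GB"
# Counted against limit: 200GB; remaining: 800GB
```

The deduction is based purely on the requested “mem_free”, not on what the job actually consumes, which is why accurate requests matter.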
This change will be made on Tuesday, August 4th, from 6:00 PM – 7:00 PM. We do not anticipate any downtime on the cluster, or any impact to jobs currently running. The way that you submit jobs will not change; you will simply need to be more cognizant of memory usage.
As a general reminder, please bear in mind the following guidelines for memory when submitting jobs to the cluster:
– When submitting jobs, please strive to set your “mem_free” to be as close as possible to the actual anticipated memory usage of your program. Using a “mem_free” value that is overly large will a) potentially cause your job to be delayed in running as it awaits a node with sufficient memory, b) limit the number of jobs you can run concurrently by deducting more memory than needed from the 1TB limit, and c) prevent others from accessing memory resources on the cluster that your job is unnecessarily holding.
– There is, sadly, no hard-and-fast rule for estimating how much memory a program will need. A good place to start is the size of the data files that will be loaded into the program. You can also use the “qacct -j <jobnumber>” command to review the memory usage of similar past jobs when estimating requirements for future ones.
– Please set your “h_vmem” setting to be equal to, or, at most, 1GB more than your “mem_free” setting. Setting a higher “h_vmem” can cause oversubscription of memory on the compute nodes, which can cause other users’ jobs to be unceremoniously killed by the Linux oom-killer (Out-Of-Memory Killer).
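As a sketch of the guidelines above, a batch job script could carry its memory requests as embedded qsub directives; the script name, working-directory option, and R command here are hypothetical placeholders for your own workload:

```shell
#!/bin/bash
# Request 10GB via mem_free, with h_vmem capped 1GB above it,
# per the guidelines in this message (values are illustrative).
#$ -l mem_free=10G,h_vmem=11G
#$ -cwd

# run-analysis.R stands in for your actual program
Rscript run-analysis.R
```

The same requests can be made on the command line (for example, “qsub -l mem_free=10G,h_vmem=11G myjob.sh”) or when starting an interactive session with qrsh.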
Please feel free to email bitsupport if you have any questions.