One commonly used feature on the JHPCE cluster is the “send me an email when my job completes” option in SGE. This option can be enabled by adding the “-m e -M EMAIL@ADDRESS.COM” options to your qsub command.
$ qsub -cwd -m e -M firstname.lastname@example.org script1.sh
This option is very convenient for longer-running jobs on the cluster. It allows you to submit your jobs and let them run without having to continually log in to the cluster to check their status.
This option, though, can cause problems when running thousands of jobs or tasks on the cluster. Most email servers employ heuristics to detect spam email, and the sudden appearance of thousands of email messages over a short period of time can trigger those spam defenses. This may result in one’s email or domain account getting locked, or a nasty note from the email support team. When this happens, it may take time to unravel and restore access to one’s email account.
We do go over this during the JHPCE orientation, so this is a gentle reminder to use caution when using the email option in qsub. The slides for the JHPCE orientation can be downloaded from the Orientation page https://jhpce.jhu.edu/register/orientation/ Please email us at bitsupport if you have any questions.
Dear JHPCE community,
Update: 2022-06-22 13:00 – The cooling issue has been resolved, and the cluster is once again available.
There is currently an issue with the cooling system at the MARCC colocation facility. A number of compute nodes on the JHPCE cluster, as well as a couple of storage arrays, have overheated and crashed. At this point, as a precautionary measure, we are planning to shut down as much as we can until the cooling issue is resolved. Please consider the JHPCE cluster unavailable at this point. We will update you as the issue progresses.
The JHPCE cluster will be unavailable from April 11th – April 15th in order to accommodate scheduled preventative maintenance to be done on the HVAC system at the MARCC datacenter. We are planning to take the JHPCE cluster down beginning at 6:00 AM on Monday April 11th. We are expecting that the cluster will be available by Friday, April 15th at 5:00 PM.
If temperatures allow, we may be able to bring some storage resources and the transfer node online, but at this point, please plan for all cluster resources to be unavailable for the duration of the maintenance. Please let us know if you have any questions about this upcoming downtime.
Thank you for your understanding, and we apologize for any inconvenience.
The JHPCE cluster is currently down due to cooling issues at the Bayview/MARCC datacenter. We will keep you advised as the status changes.
As of June 17th, 2021, we are imposing a limit of 10,000 submitted jobs per user. Previously, there was no limit, and this caused issues where the cluster scheduler was overloaded by hundreds of thousands of jobs in the queue. Going forward, if you try to submit more than 10,000 jobs, you will receive the following error:
Unable to run job: job rejected: only 10000 jobs are allowed per user (current job count: 10000)
You will need to either submit your jobs in smaller batches or, preferably, use an array job to submit your jobs. Array jobs can be used to submit multiple instances of the same script, where different arguments or data are used for each instance. Please see https://jhpce.jhu.edu/question/how-do-i-run-array-jobs-on-the-jhpce-cluster for more details and examples.
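As a rough illustration of the array job approach, the sketch below shows a job script in which each task picks its own input from a list. The file name inputs.txt, the helper function input_for_task, and the task range 1-1000 are all assumptions made for this example, not part of any JHPCE-provided tooling; the page linked above remains the authoritative reference.

```shell
#!/bin/bash
#$ -cwd
#$ -t 1-1000
# The "#$ -t 1-1000" line above requests one array job with tasks 1 through 1000.
# SGE sets SGE_TASK_ID to the task number in each task's environment.

# Hypothetical helper: print line number $1 of the file $2.
input_for_task() {
    sed -n "${1}p" "$2"
}

# Hypothetical list with one input file name per line (assumption for this sketch).
INPUT_LIST="inputs.txt"

# Only do the work when SGE has set a task ID and the list file exists.
if [ -n "${SGE_TASK_ID:-}" ] && [ -f "$INPUT_LIST" ]; then
    input=$(input_for_task "$SGE_TASK_ID" "$INPUT_LIST")
    echo "Task ${SGE_TASK_ID}: processing ${input}"
fi
```

Saved under a name such as array_job.sh (again hypothetical), the script would be submitted with a single "qsub array_job.sh" command rather than with 1000 separate qsub calls.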
The JHPCE cluster is currently unavailable due to cooling issues at the Bayview/MARCC datacenter where the JHPCE cluster is located. We apologize for any inconvenience, and we will keep you up to date as we are made aware of any changes in the situation.
Dear JHPCE community,
We will be rebooting one of the DCL01 storage servers this Friday morning in order to resolve an issue with one of the filesystems on that server. The following directories will be unavailable this Friday, April 2, between 8:00AM and 9:00 AM.
Please try to limit your access to these directories during this maintenance window. Typically, jobs that are accessing these directories will simply pause while the server is being rebooted and then continue once the server comes online again. However, it is best to minimize activity against the affected directories.
Thank you for your understanding. Please let us know if you have any questions.
The JHPCE cluster is currently down due to problems with the cooling and power facilities at the Bayview/MARCC Colocation facility. We are being told that the issue will not be repaired until Monday at the earliest, but may take longer. We will provide updates as soon as we receive them. We apologize for any inconvenience.
Dear JHPCE community,
In the wake of the recent JHU announcements regarding COVID-19, we would like to address concerns raised regarding the JHPCE cluster.
- Operationally there will be no changes for the JHPCE cluster. The JHPCE management and administrative personnel will continue to provide support for the cluster remotely in the event of a JHSPH building closure, but will retain access to the facility if an on-site presence is needed.
- There is a very small chance that the Bayview Colocation Facility will need to be powered down, and the cluster will need to also be shut down. In that event, we will notify the JHPCE community and gracefully shut down the cluster.
- Functionally, access to the cluster will still be done via Secure SHell (ssh). You may want to use ssh keys on your remote system. Please see https://jhpce.jhu.edu/knowledge-base/authentication/ssh-key-setup/ for information on setting up keys.
- You may want to look at utilizing MobaXterm via the SAFE desktop if you are using graphical programs (rstudio/SAS) as the protocols used for the SAFE desktop (Citrix) are more forgiving over slower network connections than straight-up X11 forwarding through ssh.
- As always, please contact us at email@example.com if you have any questions.
On Friday, Dec. 27th from 7:00 AM to 5:00 PM we will be taking the JHPCE cluster offline for maintenance. This downtime is necessary for the following reasons:
- Power work will be performed in the Bayview datacenter, which could cause disruption to the JHPCE cluster.
- We will be replacing failed RAM in 2 DCL01 storage nodes.
- We will be performing failover testing of the SGE head nodes.
Thank you for your understanding, and please let us know if you have any questions.