Dear JHPCE community,
Update: 2022-06-22 13:00 – The cooling issue has been resolved, and the cluster is once again available.
There is currently an issue with the cooling system at the MARCC colocation facility. A number of compute nodes on the JHPCE cluster, as well as a couple of storage arrays, overheated and crashed. As a precautionary measure, we are shutting down as much of the cluster as we can until the cooling issue is resolved. Please consider the JHPCE cluster unavailable at this point. We will update you as the situation progresses.
The JHPCE cluster will be unavailable from April 11th – April 15th in order to accommodate scheduled preventative maintenance to be done on the HVAC system at the MARCC datacenter. We are planning to take the JHPCE cluster down beginning at 6:00 AM on Monday April 11th. We are expecting that the cluster will be available by Friday, April 15th at 5:00 PM.
If temperatures allow, we may be able to bring some storage resources and the transfer node online, but at this point, please plan for all cluster resources to be unavailable for the duration of the maintenance. Please let us know if you have any questions about this upcoming downtime.
Thank you for your understanding, and we apologize for any inconvenience.
The JHPCE cluster is currently down due to cooling issues at the Bayview/MARCC datacenter. We will keep you advised as the status changes.
As of June 17th, 2021, we are imposing a limit of 10,000 submitted jobs per user. Previously, there had been no limit, which caused issues in the past where the cluster scheduler was overloaded when there were hundreds of thousands of jobs in the queue. Going forward, if you try to submit more than 10,000 jobs, you will receive the following error:
Unable to run job: job rejected: only 10000 jobs are allowed per user (current job count: 10000)
You will need to either submit your jobs in smaller batches or, preferably, use an array job. Array jobs can be used to submit multiple instances of the same script, where different arguments or data are used for each instance. Please see https://jhpce.jhu.edu/question/how-do-i-run-array-jobs-on-the-jhpce-cluster for more details and examples.
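As a rough sketch, an SGE array-job script might look like the following. The file name, input-file naming scheme, and task range are illustrative assumptions, not JHPCE-specific settings; see the link above for site-specific examples.

```shell
# Hypothetical array-job script, e.g. saved as process_array.sh and
# submitted with:   qsub -t 1-500 process_array.sh
# SGE launches one task per index in the -t range and sets $SGE_TASK_ID
# in each task's environment. The fallback below is only so the script
# can also be run by hand outside the scheduler.
: "${SGE_TASK_ID:=1}"
INPUT="input-${SGE_TASK_ID}.txt"   # illustrative per-task input file
echo "Task ${SGE_TASK_ID} processing ${INPUT}"
```

Each task sees a different `SGE_TASK_ID`, so one script and one `qsub` command replace hundreds of individual submissions.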
The JHPCE cluster is currently unavailable due to cooling issues at the Bayview/MARCC datacenter where the JHPCE cluster is located. We apologize for any inconvenience, and we will keep you up to date as we are made aware of any changes in the situation.
Dear JHPCE community,
We will be rebooting one of the DCL01 storage servers this Friday morning in order to resolve an issue with one of the filesystems on that server. The following directories will be unavailable this Friday, April 2, between 8:00 AM and 9:00 AM.
Please try to limit your access to these directories during this maintenance window. Typically, jobs that are accessing these directories will simply pause while the server is being rebooted and then continue once the server comes back online; however, it is best to minimize activity against the affected directories.
Thank you for your understanding. Please let us know if you have any questions.
The JHPCE cluster is currently down due to problems with the cooling and power facilities at the Bayview/MARCC Colocation facility. We are being told that the issue will not be repaired until Monday at the earliest, but may take longer. We will provide updates as soon as we receive them. We apologize for any inconvenience.
Dear JHPCE community,
In the wake of the recent JHU announcements regarding COVID-19, we would like to address concerns raised regarding the JHPCE cluster.
- Operationally there will be no changes for the JHPCE cluster. The JHPCE management and administrative personnel will continue to provide support for the cluster remotely in the event of a JHSPH building closure, but will retain access to the facility if an on-site presence is needed.
- There is a very small chance that the Bayview Colocation Facility will need to be powered down, and the cluster will need to also be shut down. In that event, we will notify the JHPCE community and gracefully shut down the cluster.
- Functionally, access to the cluster will still be via Secure Shell (SSH). You may want to set up SSH keys on your remote system. Please see https://jhpce.jhu.edu/knowledge-base/authentication/ssh-key-setup/ for information on setting up keys.
- You may want to look at utilizing MobaXterm via the SAFE desktop if you are using graphical programs (e.g., RStudio or SAS), as the protocol used for the SAFE desktop (Citrix) is more forgiving over slower network connections than straight X11 forwarding through SSH.
- As always, please contact us at firstname.lastname@example.org if you have any questions.
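A minimal sketch of the SSH key setup mentioned above, assuming the standard OpenSSH client tools. The key path is illustrative (normally you would use `~/.ssh`), and the username and login hostname in the comments are placeholders; in practice, protect the key with a passphrase rather than `-N ""`.

```shell
# Generate an ed25519 key pair (the temp directory stands in for ~/.ssh
# in this sketch; -N "" skips the passphrase for illustration only).
keydir=$(mktemp -d)
ssh-keygen -t ed25519 -f "$keydir/id_jhpce" -N "" -q
# One-time step to install the public key on the cluster
# (USER and the login host are placeholders):
#   ssh-copy-id -i "$keydir/id_jhpce.pub" USER@jhpce01.jhsph.edu
# Subsequent logins then use the key instead of a typed password:
#   ssh -i "$keydir/id_jhpce" USER@jhpce01.jhsph.edu
```

See the knowledge-base link above for the cluster's actual key-setup procedure.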
On Friday, Dec. 27th from 7:00 AM to 5:00 PM we will be taking the JHPCE cluster offline for maintenance. This downtime is necessary for the following reasons:
- Power work will be performed in the Bayview datacenter, which could cause disruption to the JHPCE cluster.
- We will be replacing failed RAM in 2 DCL01 storage nodes.
- We will be performing failover testing of the SGE head nodes.
Thank you for your understanding, and please let us know if you have any questions.
Dear JHPCE Community,
We are planning on taking the JHPCE cluster offline from August 26th – 28th in order to upgrade the nodes from their current Red Hat 6.9 environment to CentOS 7.6. As part of the upgrade process, we have set up a small test cluster that is running CentOS 7.6. If you have built software packages for the current cluster, you may want to test that software on the new cluster before the cutover happens at the end of August. While most software compiled under Red Hat 6.9 should run on CentOS 7.6, you may still want to recompile any programs you have built.
To connect to the test cluster, you will need to log in to the “jhpce03.jhsph.edu” login node. We have 2 compute nodes (compute-051 and compute-052) that you can access by running “qsub/qrsh” from jhpce03, just as on the current cluster. Please keep in mind that this is a test system and we are still performing work on the nodes, so there may be times when we need to reboot them. We will be sure to notify any active users prior to doing anything intrusive.
At this time we only have the /users, /jhpce/shared, and /fastscratch partitions available on the new cluster, but we will be making the /dcs01, /dcl01, /dcl02, and /legacy partitions available shortly. Please feel free to email bitsupport if you have any questions.