2020-07-05 – 7:00 PM – JHPCE Cluster currently down due to cooling/power problems at datacenter

The JHPCE cluster is currently down due to problems with the cooling and power facilities at the Bayview/MARCC colocation facility. We have been told that the issue will not be resolved until Monday at the earliest, and the repair may take longer. We will provide updates as soon as we receive them. We apologize for any inconvenience.


COVID-19 Information for the JHPCE cluster

Dear JHPCE community,

In the wake of the recent JHU announcements regarding COVID-19, we would like to address concerns that have been raised regarding the JHPCE cluster.

  • Operationally there will be no changes for the JHPCE cluster. The JHPCE management and administrative personnel will continue to provide support for the cluster remotely in the event of a JHSPH building closure, but will retain access to the facility if an on-site presence is needed.
  • There is a very small chance that the Bayview Colocation Facility will need to be powered down, and the cluster will need to also be shut down. In that event, we will notify the JHPCE community and gracefully shut down the cluster.
  • Functionally, access to the cluster will still be via Secure Shell (SSH). You may want to set up SSH keys on your remote system. Please see https://jhpce.jhu.edu/knowledge-base/authentication/ssh-key-setup/ for information on setting up keys, and the short sketch after this list.
  • If you are using graphical programs (RStudio/SAS), you may want to use MobaXterm via the SAFE desktop, as the protocol used by the SAFE desktop (Citrix) is more forgiving over slower network connections than straight X11 forwarding through SSH.
  • As always, please contact us at bitsupport@lists.jhu.edu if you have any questions.
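
For reference, the standard OpenSSH key workflow looks roughly like the sketch below. This is only an illustrative outline (the knowledge-base page linked above is authoritative, and the login-node host name shown here is an example):

ssh-keygen -t ed25519                   # generate a key pair on your local machine
ssh-copy-id USERID@jhpce01.jhsph.edu    # copy the public key to your cluster account
ssh USERID@jhpce01.jhsph.edu            # later logins can then use the key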


JHPCE cluster maintenance – Friday, Dec 27th from 7:00 AM to 5:00 PM

On Friday, Dec. 27th from 7:00 AM to 5:00 PM we will be taking the JHPCE cluster offline for maintenance.  This downtime is necessary for the following reasons:

  • Power work will be performed in the Bayview datacenter, which could cause disruption to the JHPCE cluster. 
  • We will be replacing failed RAM in 2 DCL01 storage nodes.
  • We will be performing failover testing of the SGE head nodes.

Thank you for your understanding, and please let us know if you have any questions.


JHPCE Cluster Upgrade – August 26 – 28

Dear JHPCE Community,

We are planning on taking the JHPCE cluster offline from August 26th – 28th in order to upgrade the nodes from their current Red Hat 6.9 environment to CentOS 7.6. As part of the upgrade process, we have set up a small test cluster that is running CentOS 7.6. If you have built software packages for the current cluster, you may want to test that software on the new cluster before the cutover happens at the end of August. While most software compiled under Red Hat 6.9 should run on CentOS 7.6, you may still want to recompile any programs you have built.

To connect to the test cluster, you will need to login to the “jhpce03.jhsph.edu” login node:

ssh USERID@jhpce03.jhsph.edu

We have two compute nodes (compute-051 and compute-052) that you can access by running “qsub/qrsh” from jhpce03, just as on the current cluster. Please keep in mind that this is a test system and we are still performing work on the nodes, so there may be times when we need to reboot them. We will be sure to notify any active users before doing anything intrusive.
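
For example, an interactive test session on the new cluster might look like the following (the memory values are only illustrative):

ssh USERID@jhpce03.jhsph.edu        # log in to the test login node
qrsh -l mem_free=2G,h_vmem=2G       # request an interactive session on compute-051/052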

At this time we only have the /users, /jhpce/shared, and /fastscratch partitions available on the new cluster, but we will be making the /dcs01, /dcl01, /dcl02, and /legacy partitions available shortly. Please feel free to email bitsupport if you have any questions.


JHPCE Website Maintenance

The JHPCE website (https://jhpce.jhu.edu) will be offline for a complete site migration starting at 6:00 PM on Tuesday January 15th.  Estimated downtime will be two hours.  We apologize for any inconvenience this may cause.


Upcoming JHPCE cluster downtime – Aug 24 – Aug 29 – for power work at Bayview datacenter

The JHPCE cluster will be unavailable beginning at 5:00 PM on Friday, August 24th, and will be available again at 5:00 PM on Wednesday, August 29th. The downtime is needed for the installation of a new backup generator at the MARCC/Bayview datacenter. Power will need to be turned off at the site during the installation and testing of this new generator. We will also be performing some updates to the JHPCE cluster compute nodes once power has been restored.

Thank you for your understanding, and please let us know if you have any questions or concerns.


JHPCE cluster unavailable from 6:00 AM on Monday, January 22nd until 5:00 PM on Wednesday, January 24th for power work

Dear JHPCE Community,

Please be advised that the JHPCE cluster will be unavailable starting at 6:00 AM on Monday, January 22nd and will not be available until 5:00 PM on Wednesday, January 24th for power work and network maintenance scheduled to be performed at the Bayview/MARCC site.  Please plan your jobs accordingly.

We will also be performing a number of upgrades during this time. Most notably, we will be setting the default version of R on the JHPCE cluster to version 3.4.2. This is a significant change, as this version of R will run within a Conda environment. Conda allows a more consistent, contained environment to be created for applications, and should eliminate the issues we have been seeing recently with library incompatibilities in some newer R packages. Thanks to Dr. Kasper Hansen for all of his work in setting up R under the Conda environment.

Please note that you will have to reinstall any locally installed packages that you have under the current R environment.  If you wish to install your packages prior to January 22nd, you can run:

module load conda_R

and then run “R” and reinstall your packages for the new R version.
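
A minimal sketch of that workflow (the package name below is just an example; reinstall whichever packages you actually use):

module load conda_R
R
# at the R prompt, reinstall your packages, e.g.:
#   install.packages("data.table")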


Upcoming patching on the JHPCE cluster – August 2017

During the early part of August, we will be installing system and security patches across all of the nodes on the JHPCE cluster. The upgrades will be done in a rolling manner so as to minimize the impact on cluster utilization. During this time the cluster will still be operating at around 75% capacity; however, wait times for queued jobs may be longer than usual.

Login/Transfer Nodes:

The main login node, jhpce01, will be updated on Monday, August 14th at 8:00 AM, and should be back online by 9:00 AM. Any qrsh or screen sessions running on jhpce01 at that time will be terminated. If you need to log in to the cluster during the upgrade, you can use the secondary jhpce02 login node, which will be upgraded the week prior, on August 7th, between 8:00 AM and 9:00 AM. The transfer-01 node will be upgraded on Tuesday, August 8th between 8:00 AM and 9:00 AM.

Compute Nodes:

The compute nodes will be updated in groups throughout the first weeks of August. Several days prior to upgrading a group of compute nodes, access will be suspended on those nodes to prevent new jobs from starting and to allow running jobs to complete. On the day of the upgrade, any jobs still running on the nodes to be upgraded will be terminated. If you have a critical, long-running job on a node that is scheduled to be updated, please email bitsupport@lists.jhu.edu and we will try to accommodate you.

If you are a stakeholder who owns compute nodes and you need uninterrupted access to your nodes in early August, we can postpone the upgrade of your nodes to a later date. Please email bitsupport@lists.jhu.edu if you have any questions.

The schedule for upgrading the compute nodes, and the queues affected, is as follows:

Aug 1:
compute-043 – gwas
compute-045 – jabba
compute-049 – ozone
compute-051 – chaklab
compute-053 – stanley
compute-055 – mathias
compute-057 – bader
compute-061 – hongkai
compute-063 – cegs2
compute-064 – cegs2
compute-071 – cegs
compute-072 – cegs
compute-077 – beer
compute-078 – beer
compute-085 – leek
compute-086 – leek
compute-093 – bluejay
compute-094 – bluejay
compute-101 – sas
compute-103 – chatterjee
compute-104 – chatterjee

Aug 3:
compute-047 – jabba
compute-059 – gwas
compute-067 – cegs2
compute-068 – cegs2
compute-075 – cegs
compute-081 – beer
compute-082 – beer
compute-089 – leek
compute-090 – leek
compute-097 – bluejay
compute-098 – bluejay
compute-107 – chatterjee
compute-108 – chatterjee
compute-112 – hpm

Aug 8:
compute-044 – gwas
compute-046 – jabba
compute-050 – ozone
compute-052 – chaklab
compute-054 – stanley
compute-056 – mathias
compute-058 – bader
compute-060 – ozone
compute-062 – hongkai
compute-065 – cegs2
compute-066 – cegs2
compute-073 – cegs
compute-074 – cegs
compute-079 – beer
compute-080 – beer
compute-087 – leek
compute-088 – leek
compute-095 – bluejay
compute-096 – bluejay
compute-102 – sas
compute-105 – chatterjee
compute-106 – chatterjee

Aug 10:
compute-048 – jabba
compute-069 – cegs2
compute-070 – cegs2
compute-076 – cegs
compute-083 – beer
compute-084 – beer
compute-091 – leek
compute-092 – leek
compute-099 – bluejay
compute-100 – bluejay
compute-109 – chatterjee
compute-110 – chatterjee
compute-111 – chatterjee


Memory usage on the JHPCE cluster, and availability of the “gedit” editor

Dear JHPCE community,

As part of our continuing efforts to more efficiently utilize RAM on the JHPCE cluster, we will be making a couple of changes to the current cluster configuration. Our analysis of jobs run over the last year has shown that jobs often do not make efficient use of RAM, and this can unfairly impact overall utilization of the cluster. The changes that we will be making are intended to help you get the most out of using the cluster, and to make the cluster more fairly available to all users. Please see our recent “Memory Usage and Good Citizenship” blog post at https://jhpce.jhu.edu/2017/05/17/memory_usage_analysis/ for more details.

1) First, we will be lowering the per-user RAM limit for jobs on the shared queue to 512 GB. Our previous limit was 1 TB per user; however, this higher limit at times led to cluster utilization being constrained by RAM. We expect that the reduced limit will better balance core and RAM utilization on the cluster. This change will be made on Monday, June 26th at 5:00 PM. There will not be any downtime for this change, and running jobs will be completely unaffected; however, jobs or tasks that start running after the change on Monday will be governed by the new limit.

What this means for you is that if you submit many jobs that each use a lot of RAM, it may take longer for your jobs to complete. For example, if your jobs each use 10 GB, you will now only be able to have 50 jobs running at once, whereas under the old limit you could have 100 jobs running simultaneously. Our analysis showed, though, that the vast majority of users will not be impacted by this change.

We can of course increase your RAM limit for a short period if, say, you have a deadline to meet and the cluster is not too busy. In those cases, please email bitsupport to have your RAM limit temporarily increased.

2) In the past we had recommended that jobs be submitted with “h_vmem” set slightly higher than “mem_free” (either 1 GB higher or 10% higher). Going forward, you should set “h_vmem” to the same value as “mem_free”. This will help ensure that RAM on the compute nodes does not get oversubscribed.
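
For example, a job expected to use about 10 GB of RAM would be submitted along these lines (the script name and memory value are only illustrative):

qsub -l mem_free=10G,h_vmem=10G myscript.sh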

3) Lastly, we will be performing a weekly analysis of jobs on the JHPCE cluster to identify those jobs where RAM is not being utilized efficiently.  Each week we will be emailing those people that have run jobs where either a) their actual RAM usage is far less than requested, or b) where actual RAM usage was much more than requested.  It is hoped that these email messages can be a tool to help everyone tune their jobs to run more efficiently.

Finally, on a non-RAM-related note, we recently installed a graphical text editor called “gedit” on the JHPCE cluster. The “gedit” editor has an interface similar to “Notepad” on Windows or “TextEdit” on macOS. Note that you will need a working X11 environment on your laptop/desktop to use “gedit”.
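
As a rough example, assuming you have an X11 server (such as XQuartz or MobaXterm) running locally, something like the following should bring up the editor (the host and file names are only illustrative):

ssh -X USERID@jhpce01.jhsph.edu
gedit myscript.R &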

Please let us know if you have any questions about these changes.

Mark


Memory usage and good citizenship

On a shared machine such as the JHPCE cluster, it is important for everyone to use the resources wisely to avoid congestion and to maximize resources for everyone. Please be a good citizen and use memory wisely. Here are three recommendations.

Recommendations

1) Set your mem_free as close as possible to your expected actual usage (your actual usage divided by the number of slots should be slightly lower than your mem_free). Avoid “playing it safe” by asking for substantially more memory than you actually need.
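
For example, a 4-slot job expected to use about 20 GB in total would request roughly 5 GB per slot, along these lines (the parallel environment name, script name, and values are only illustrative):

qsub -pe local 4 -l mem_free=5G,h_vmem=5G myscript.sh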

2) Use the available monitoring commands (see the examples below) to observe memory usage of running jobs, and to check memory usage of completed or aborted jobs. We understand that it is not always possible to know how much memory a job will use, and in those cases some over-requesting is certainly reasonable, but submitting a large number of such jobs is not good citizenship.
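
For instance, with the standard Grid Engine tools (the job IDs below are placeholders):

qstat -j 1234567 | grep usage      # current vmem/maxvmem of a running job
qacct -j 1234567 | grep maxvmem    # peak memory of a completed job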

We are asking users to be more careful with their memory requests because an analysis of jobs run during the three months starting in January revealed that, on average, the memory allocation rate is 81% of capacity (a good number), but the ACTUAL memory usage hovers around 23% (terrible!). This means that too many users are requesting far more memory than they need. This causes the grid engine to think that the cluster is busier than it actually is, which in turn causes it to unnecessarily restrict the number of jobs that it starts on the cluster.

We also determined that some users are using much more memory than they request. This can actually lead to node crashes. To prevent this, we recommend that you:

3) Set your high-water mark (h_vmem) so it matches your mem_free. Using more memory than requested is bad citizenship because it can lead to a situation where the scheduler sends too many jobs to a node, which, in the worst case, results in a crash. This used to be a common occurrence before we made memory a consumable resource.

Please be good citizens. Big Brother is watching …

Results of analysis

The table below shows the amount of memory (in TB-days) wasted between Jan–Mar 2017 by users who requested more memory than they actually used. The results are ordered by the amount wasted, and the user IDs have been anonymized.

+-----------+----------------+---------+
| anonymized| wasted usage   | N(jobs) |
|  userid   |  (TB-days)     |         |
+-----------+----------------+---------+
| 329055714 | 54             | 234     |
| 329082844 | 28             | 880     |
| 329113510 | 21             | 23      |
| 329062952 | 19             | 1949    |
| 329055899 | 19             | 12496   |
| 329055645 | 17             | 96184   |
| 329056431 | 14             | 201     |
| 330857730 | 13             | 114     |
| 329055782 | 7              | 159919  |
| 329141862 | 7              | 255     |
| 329056101 | 7              | 2659    |
| 329055698 | 6              | 50425   |
| 329075706 | 6              | 225     |
| 329055637 | 6              | 521889  |
| 329056935 | 5              | 6298    |
| 329055992 | 4              | 22619   |
| 329063154 | 4              | 1403    |
| 330887458 | 4              | 11      |
| 329070410 | 4              | 405     |
| 329056652 | 4              | 2245    |
+-----------+----------------+---------+

Below are users who set their high-water mark so high that their jobs consumed much more memory than they requested. The scheduler does not know when a job is using more memory than requested, and so it can unknowingly overbook the memory on a host. This can negatively impact performance and, in the worst case, cause a host to crash. In effect these users are stealing memory by using more than they requested. Between Jan–Mar 2017 there were only 6 users who systematically did this. The amount of stolen memory is shown in the table below.

+-----------+----------------+---------+
| uid       | stolen TB-days | N(jobs) |
+-----------+----------------+---------+
| 329060039 | 8.192          | 308     |
| 329055678 | 0.434          | 257045  |
| 329056939 | 0.396          | 709     |
| 329351072 | 0.124          | 11      |
| 329082616 | 0.074          | 15      |
| 329058516 | 0.008          | 64      |
+-----------+----------------+---------+

 
