This page is for experienced users of the enigma cluster who would like a quickstart on the new JHPCE cluster. Please let us know if you discover missing libraries, header files or applications.
Accessing the JHPCE cluster
Access the new JHPCE cluster via ssh through one of two new login nodes
Use the same credentials that you used on the enigma cluster. If you previously configured key-pair authentication or two-factor authentication on enigma, these will continue to function just as before.
New queues and queue policies
We have reorganized the queues and queue policies to optimize cluster utilization. The policies were tested over the last few months on the enigma cluster and were found to substantially increase the availability of compute cores to all users, while eliminating much of the contention for cores that stakeholders were experiencing on their own machines. The dedicated stakeholder queues will continue to exist.
Open queues (anyone can use these queues)
shared.q # the default queue with 1860 slots available
Dedicated queues (users need permission of stakeholders to submit jobs to these nodes)
chacklab.q # Chakravarti lab machines
bader.q # Bader lab
cegs.q # Center of Excellence in Epigenetic Science
cegs2.q # Center of Excellence in Epigenetic Science
gwas.q # Genetic Epidemiology
hongkai.q # Hongkai Ji group
jabba.q # Rafa lab machines
mathias.q # Mathias lab hosts
mcmc.q # Oncology Biostatistics
ozone.q # Environmental Epidemiology & Biostatistics
Scharpf.q # Oncology Biostatistics
stanley.q # Stanley Division of Developmental Neurovirology
Most special purpose queues will continue to exist
sas.q # SAS
math.q # Mathematica
rnet.q # down/up-load data over the 10G research network
download.q # down/up-load data over a 1Gbps connection
We have done away with the express queues in favor of a single shared queue that spans the entire cluster. The shared queue is the default queue. Jobs submitted without a queue specification could run on nearly any host in the cluster. However, if a job runs on a host in a dedicated queue, it will run at a lower priority than the jobs submitted through the stakeholder queues. We have also done away with the interactive queue. There is likely no need for it (see the qrsh discussion below).
qsub & qrsh
qsub
On the enigma cluster, it was commonplace to specify a queue when submitting a batch job. On the JHPCE cluster, it will be advantageous for most users to simply submit a job without specifying a queue. This allows SGE to look at all the hosts in the cluster, including those in dedicated queues, and select the host that best fits your request. The default queue is the shared queue. Since the shared queue spans the entire cluster, it provides a larger base of possible hosts than you would get if you specified a particular queue.
qrsh
On the enigma cluster, qrsh jobs were submitted through the standard queue. When the standard queue was full, users could submit to a small dedicated interactive queue. On the JHPCE cluster, qrsh jobs are submitted, by default, through the shared queue. No “-l interactive” option is needed. Because the shared queue spans the entire cluster (nearly 2000 cores), it is unlikely that users will fail to be allocated a slot for their qrsh session. On the other hand, on a very busy cluster, users may experience slower performance because the shared queue has a low priority. Of course, stakeholders can always run qrsh on their own queues.
Environment modules, or modulefiles, are a new feature of the JHPCE cluster. Modulefiles are used to configure your shell environment. They will save you a lot of work. Please refer to the modulefile documentation before trying to run a job on the new JHPCE cluster. This may save you some grief!
- This site is searchable by entering queries in the search box in the upper right hand corner of every page
- email the community for help at email@example.com
- email the system administration team at firstname.lastname@example.org
Please be aware that we are still working on some application software.
- Python 2.6 is installed. Versions 2.7 and 3.0 will be ready by tomorrow (Friday Mar. 27).
- The Short read tools maintained by Kasper Hansen are still being ported.
Why does running “firefox” from a compute node crash?
It appears that firefox allocates RAM based on the stack size setting for your qrsh session. To run firefox, you can do one of the following:
- Request 40GB of RAM (qrsh -l mem_free=40G,h_vmem=40G)
- Reduce the stack size from the default 128M to 8M (qrsh -l h_stack=8M)
- Use qrsh as normal, then reduce the soft stack size limit by running “ulimit” inside the session (qrsh, then ulimit -S -s 8192)
How do I get the RStudio program to work on the cluster?
The RStudio program (https://www.rstudio.com/products/rstudio/) is a graphical development interface to the R statistical package. Some people find the graphical RStudio program helpful in organizing R projects, and writing and debugging R programs.
To run rstudio on the JHPCE cluster, you can use the following steps:
- First, make sure you have an X11 server installed on your laptop/desktop (either XQuartz for macOS, or MobaXterm for PCs).
- Next, ssh into the JHPCE cluster, making sure that the X11 forwarding option is used for SSH. To enable X11 forwarding from macOS, add the “-X” option to your ssh command. For MobaXterm on Windows, X11 forwarding is enabled by default.
- Once you are on the cluster, use “qrsh” to connect to a compute node.
- Load the “rstudio” module by running “module load rstudio”.
- Start the rstudio program by running “rstudio”.
- Within a couple of seconds, the RStudio interface should be displayed.
An example session for user “bob” would look something like:
BobsMac$ ssh -X email@example.com
Last login: Mon Dec 26 08:24:01 2016 from 10.11.12.13
---
Use of this system constitutes agreement to adhere to all applicable JHU and JHSPH network and computer use policies.
---
[jhpce01 /users/bob ]$ qrsh
[compute-072 /users/bob]$ module load rstudio
[compute-072 /users/bob]$ rstudio
Please be aware that rstudio is a very graphics-heavy program and uses a fair amount of network bandwidth. You will need to make sure that you have a fairly fast network connection in order to use rstudio effectively.
Why do I get memory errors when running Java?
You may see errors such as the following when you try to run Java:
$ java
Error occurred during initialization of VM
Could not reserve enough space for object heap
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
This happens because the default maximum heap size in Java is 32GB, while the JHPCE cluster defaults for mem_free and h_vmem are 2GB and 3GB respectively. The default qrsh settings are therefore too small to accommodate the memory that Java tries to reserve. You have 3 options to get this to work.
1) If you think you will really need 32GB of memory for your java program, you can increase your mem_free and h_vmem settings of your qrsh command:
jhpce01: qrsh -l mem_free=40G,h_vmem=40G
compute-085: java
2) More likely, you do not need 32GB for your Java program, so you can direct Java to use less memory by using the “-Xmx” and “-Xms” options to Java. For instance, if you want to set the initial heap size to 1GB and the maximum heap size to 2GB you could use:
jhpce01: qrsh
compute-085: java -Xms1g -Xmx2g
3) An alternative way to set the Java memory settings is to use the “_JAVA_OPTIONS” environment variable. This is useful if the call to run java is embedded within a script that cannot be altered. For instance, if you want to set the initial heap size to 1GB and the maximum heap size to 2GB you could use:
jhpce01: qrsh
compute-085: export _JAVA_OPTIONS="-Xms1g -Xmx2g"
compute-085: java
Why was RAM made a consumable resource on the cluster?
SGE on the JHPCE cluster was previously configured NOT to treat RAM as a “consumable” resource. The recent change to the SGE configuration makes RAM “consumable”, so it is reserved for a job when it is requested. What does this mean?
As an example, for simplicity’s sake, let’s say you have a cluster with just 1 node, and that node has 20GB of RAM. If you run a job that requests 8GB of RAM (mem_free=8G,h_vmem=9G), it will start to run on the node immediately. Now, this job takes a few minutes for all 8GB to actually be used by the program; let’s say it consumes 2GB/minute, so after 4 minutes all 8GB will be in use. A minute after it starts, the running job is using 2GB of RAM, and now a second job comes along and requests 8GB of RAM. SGE will see that there is still 18GB of RAM free on the node and start the second job. Now, a minute later, a third job comes along, also requesting 8GB. The first job is using 4GB, the second job is using 2GB, and the node has 14GB free, so SGE, seeing that 8GB is available, starts the third job. So now you have 3 jobs running that will eventually need 24GB of RAM in total, and there is only 20GB on the system, so at some point the node becomes RAM starved and the Linux oom-killer gets invoked to kill a process. (For extra credit: at what time does the node run out of RAM? 🙂 )
The change made to the cluster alters the behavior of SGE so that RAM is “consumable”: when you request 8GB, SGE marks that 8GB as reserved. In the above example, the first 2 jobs would have run, and SGE would have marked 16GB of RAM as “consumed”, so the third job would not have run until one of the other jobs finished. The biggest downside to this approach is that if people request much more RAM than their jobs need, then jobs will have to wait longer to run, and resources may go unused. If, in the above example, the first job had requested 15GB of RAM “to be safe”, that would have prevented the second job from starting until the first completed, even though the 2 jobs could have run concurrently.
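The non-consumable scenario above can be replayed with a few lines of shell arithmetic. This is only a sketch of the worked example, not of how SGE schedules jobs: it takes the figures from the text (a 20GB node, three 8GB jobs growing at 2GB/minute, started one minute apart) and steps through time in half-minute ticks. The function name first_exhausted_tick is hypothetical.

```shell
#!/bin/bash
# Replay the example in half-minute ticks: each job grows by 1GB per tick
# (2GB/minute) until it reaches 8GB. Jobs start at minutes 0, 1 and 2,
# which is ticks 0, 2 and 4.

# Print the first tick at which the jobs together use all of the node's RAM.
first_exhausted_tick() (
    node_gb=20
    job_gb=8
    tick=0
    while [ "$tick" -le 40 ]; do
        total=0
        for start in 0 2 4; do
            used=$((tick - start))          # GB used since this job started
            if [ "$used" -lt 0 ]; then used=0; fi
            if [ "$used" -gt "$job_gb" ]; then used=$job_gb; fi
            total=$((total + used))
        done
        if [ "$total" -ge "$node_gb" ]; then
            echo "$tick"
            exit 0
        fi
        tick=$((tick + 1))
    done
    exit 1
)

ticks=$(first_exhausted_tick)
echo "RAM exhausted after $((ticks / 2)).$((ticks % 2 * 5)) minutes"
```

Under these assumptions the script reports 4.5 minutes after the first job starts: the first job has reached its full 8GB, the second is using 7GB and the third 5GB.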
My X11 forwarding stops working after 20 minutes.
X11 forwarding can be enabled in your ssh session using the -X option to the ssh command:
$ ssh -X firstname.lastname@example.org
This will allow you to run X-based programs on the JHPCE cluster and have them display on the X server running on your desktop (such as XQuartz on Mac computers). On some Mac computers, X11 forwarding will work for a while but may eventually time out with the error message:
Xt error: Can't open display: localhost:15.0
This error comes from the “ForwardX11Timeout” variable, which is set by default to 20 minutes. To avoid this issue, a larger timeout, say 336 hours (2 weeks), can be supplied on the command line:
$ ssh -X email@example.com -o ForwardX11Timeout=336h
or it can be changed on your desktop by adding the line:
to the end of the system-wide ssh_config file (/etc/ssh/ssh_config on most systems), or your own ~/.ssh/config file. Note: a value higher than 596h may cause the X window server on your desktop to fail, as it is greater than 2^31 milliseconds and will exceed the signed 32-bit size of the “ForwardX11Timeout” variable.
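The 596h ceiling follows directly from the 32-bit limit, and the arithmetic can be checked in the shell. A minimal sketch, assuming the timeout is held as a signed 32-bit count of milliseconds as described above; the function names max_timeout_hours and timeout_fits are hypothetical helpers, not part of ssh.

```shell
#!/bin/bash
# Largest whole number of hours that fits in a signed 32-bit count of milliseconds.
max_timeout_hours() (
    max_ms=2147483647                  # 2^31 - 1
    ms_per_hour=$((60 * 60 * 1000))
    echo $((max_ms / ms_per_hour))
)

# Succeeds if the given number of hours stays within that limit.
timeout_fits() {
    [ "$1" -le "$(max_timeout_hours)" ]
}

echo "largest safe value: $(max_timeout_hours)h"
if timeout_fits 336; then echo "336h fits"; fi
if ! timeout_fits 600; then echo "600h does not fit"; fi
```

The two-week value of 336h is well inside the limit, while anything from 597h upward overflows.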
How do I copy a large directory structure from one place to another?
As an example, to copy a directory tree from /home/bst/bob/src to /dcs01/bob/dst, first create a cluster script, let’s call it “copy-job”, that contains the line:
rsync -avzh /home/bst/bob/src/ /dcs01/bob/dst/
Next, submit a batch job to the cluster
qsub -cwd -m e -M firstname.lastname@example.org copy-job
This will submit the “copy-job” script to the cluster, which will run the job on one of the compute nodes and send an email when it finishes.
My app is complaining that it can’t find a shared library, e.g. libgfortran.so.1. Could you please install it?
We would guess that 9 times out of 10, the allegedly missing library is there. The problem is that your application is looking for the version of the library that is compatible with the old system software. It will not help to point your application to the new libraries. They are more than likely to be incompatible with the new system and we won’t help you debug any problems if you try to do this. The correct solution is to reinstall your software. If the problem persists after the reinstallation, then please contact us and we will install standard libraries that are actually missing.
My app claims it’s out of disk space, but I see there is plenty of space, what gives?
By default, every user should have a .sge_request file in their home directory. This file contains a line like this:
This limits the size of all created files to 10GB. If you plan on creating larger files you should increase this limit, either in the .sge_request file before you start your qrsh session, or in the batch script you submit via qsub. From the command line, you would start a qrsh session as follows:
qrsh -l h_fsize=300G
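For a batch job, the same request can be embedded in the script itself. The following is a sketch only: it assumes the cluster uses SGE's default directive prefix “#$”, and that h_fsize is enforced as the shell's file size limit, so that ulimit -f reflects it inside the job. The function name report_fsize_limit is hypothetical.

```shell
#!/bin/bash
#$ -cwd
#$ -l h_fsize=300G

# Print the file size limit in effect for this shell, so the job's output
# shows which limit the job actually ran with.
report_fsize_limit() {
    echo "file size limit: $(ulimit -f)"
}

report_fsize_limit
```

Run outside of SGE, the “#$” lines are ordinary comments and the script simply prints the limit of the current shell.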
ssh gave a scary warning: REMOTE HOST IDENTIFICATION HAS CHANGED!
Go into the ~/.ssh directory of your laptop/desktop and edit the known_hosts file.
Search for the line that starts with the host that you ssh’d to. Delete that line (it is probably a long line that wraps). Then try again.
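The manual edit can also be scripted. The sketch below is a hypothetical helper (forget_host is not a standard command) that drops every line of a known_hosts file whose host field lists the given name. It assumes plain, unhashed entries and does not handle marker lines or “[host]:port” entries. OpenSSH's own “ssh-keygen -R hostname” does the same job and also copes with hashed entries.

```shell
#!/bin/bash
# Remove every known_hosts entry whose host field names the given host.
# Usage: forget_host HOST [KNOWN_HOSTS_FILE]
forget_host() (
    host=$1
    file=${2:-$HOME/.ssh/known_hosts}
    tmp=$(mktemp) || exit 1
    # Field 1 is a comma-separated list of names and addresses.
    # Keep only the lines that do not list $host.
    awk -v host="$host" '
        {
            n = split($1, names, ",")
            for (i = 1; i <= n; i++) if (names[i] == host) next
            print
        }' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    exit "$status"
)

# Example (the host name is a placeholder):
#   forget_host login.example.org
```

Writing the filtered copy back with cat rather than mv keeps the permissions of the original file unchanged.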
Why aren’t SGE commands, or R, or matlab, or… available to my cron job?
cron jobs are not launched from a login shell, but the module commands and the JHPCE default environment are initialized automatically only when you log in. Consequently, in a cron job, you have to do the initialization yourself. Do this by wrapping your cron job in a bash script that initializes the module command and then loads the default SGE modules. Your bash shell script should start with the following lines:
#!/bin/bash
# Source the global bashrc
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
module load JHPCE_DEFAULT_ENV
This should allow your cron jobs to run within SGE.