FAQs

Please note that we are in the process of updating our web site to reference the recent upgrade to SLURM.  Some articles may still contain information about the old SGE environment.

Why does bash report that it can’t find the module command?

The error you see is:

bash: module: command not found

The “module” command is a shell function that is declared in /etc/bashrc.
It is always a good idea for /etc/bashrc to be sourced at the start of your ~/.bashrc.

Edit your .bashrc file so that the first thing it does is to execute the system
bashrc file, i.e. your .bashrc file should start with the following lines:

# Source the global bashrc
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi

Why does my R program use more RAM when run via “srun” than it does when run within an “sbatch” session?

Some R programs, such as “randomForestSRC”, or programs that use “libgomp”, rely on OpenMP, which may try to use multiple cores on the node it is running on and will consequently require more RAM to run.

To restrict your program to a single core, set mc.cores and rf.cores to “1”.

options(rf.cores=1, mc.cores=1)
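
If you are running your R code via an sbatch script, you can also cap OpenMP at the shell level with the standard OMP_NUM_THREADS environment variable. A minimal sketch (the analysis.R file name is illustrative):

#!/bin/bash
# Limit OpenMP-based libraries such as libgomp to a single thread
export OMP_NUM_THREADS=1

# Run the R program (analysis.R is a hypothetical script name)
Rscript analysis.R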

Thanks to Kasper Hansen and Jacob Fiksel for their work on identifying the issue and finding a fix for it.

I’m trying to use “qrsh”, but I get a message that says “Your “qrsh” request could not be scheduled, try again later.”

UPDATE – 2024-02-01 – The JHPCE Cluster has been recently migrated from SGE to SLURM. 

The SLURM equivalent of this issue occurs when one runs “srun” and receives a message like:

srun: job 1083790 queued and waiting for resources

This typically means that the JHPCE cluster is currently very busy.  You can gauge how busy the cluster is by noting the banner during your ssh login.

The SLURM cluster is currently at 47% core occupancy and 48% RAM occupancy

You can also run the “slurmpic” script to see the availability of cores and RAM on the various nodes.  This is particularly useful if you are asking for a large amount of RAM (more than, say, 200GB) or many cores, since it shows how resources are distributed across the compute nodes.
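
If you do not have “slurmpic” handy, the standard SLURM “sinfo” command gives a similar per-node view. A minimal example (the format string is just one reasonable choice; the columns shown are node name, state, CPU count, and memory in MB):

# Show each node with its state, CPU count, and memory (MB)
sinfo -N -o "%N %t %c %m"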

The information below is no longer relevant in JHPCE, but is being retained for historical purposes. 

The “Your “qrsh” request could not be scheduled, try again later.” message can happen when you have either requested resources that are not available or when the cluster is busy. Some possible causes and suggested steps are:

– Add the “-now n” option to your qrsh request. This will cause your qrsh to wait until resources become available on a compute node rather than timing out after 5 seconds.
– Make sure you are not making resource requests that cannot be met. For example, our largest compute nodes have 512GB of RAM, and those nodes are typically heavily used. So, if you are requesting more than 300GB of RAM, your request will likely never be satisfiable.
– Similarly, keep in mind that RAM is a per-core resource, so if you are requesting multiple cores along with RAM, you will need to divide your total RAM request by the number of cores requested. For example, if your job needs 160GB of RAM and 8 cores, you would need to add “-pe local 8 -l mem_free=20G,h_vmem=20G” to your qrsh request. If you were to mistakenly use “-pe local 8 -l mem_free=160G,h_vmem=160G”, you would in effect be requesting 1TB of RAM so your request would never be satisfied.

How do I run array jobs on the JHPCE Cluster?

Array jobs allow multiple instances of a program to be run via a single sbatch command.  This can often be more convenient than submitting numerous repetitive sbatch jobs for the same program. The different instances of the job that get run are known as “tasks”.  These task values are numeric, and are specified by using the "-a START-END" option to sbatch. The specific task is referenced within the sbatch script via the $SLURM_ARRAY_TASK_ID environment variable.

As an example, suppose you have 3 data files you want to run your program against:

$ ls data*
data1    data2    data3

In this simple example, the SLURM script simply "cat"s each file.

$ more script1.sh
#!/bin/bash

FILENAME="data$SLURM_ARRAY_TASK_ID"
cat $FILENAME

exit 0

When the job is submitted, the "-a" option is used to specify the range of tasks to be run, so in our example the command to submit 3 tasks, numbered 1, 2, and 3, would be "sbatch -a 1-3 script1.sh". Within the script, the $SLURM_ARRAY_TASK_ID variable will be set to 1, 2, and 3 for the 3 instances of the script that get run.

$ ls
data1  data2  data3  script1.sh
$ sbatch -a 1-3 script1.sh
Submitted batch job 1911045
$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         1911045_1    shared script1. mmill116  R       0:03      1 compute-105
         1911045_2    shared script1. mmill116  R       0:03      1 compute-153
         1911045_3    shared script1. mmill116  R       0:03      1 compute-153
$ ls
data1  data2  data3  script1.sh  slurm-1911045_1.out  slurm-1911045_2.out  slurm-1911045_3.out

The result of running this sbatch would be 3 output files, where each output file has the task ID appended to it.

Now consider a more complicated scenario where the file names are not neatly numbered. One way to handle this situation is to create a file that contains a list of the files, and then use the $SLURM_ARRAY_TASK_ID value as the line number to look up in that file to get the file name. For this example, let’s say we have 3 files:

$ ls
first   second   third      

We could create a file list using the “ls” command…

$ ls > file-list
$ cat file-list
first
second
third

We can now create a SLURM script that uses the awk command to pull out the line of file-list whose line number matches the value of $SLURM_ARRAY_TASK_ID (there are of course numerous other Unix tools that could be used instead of awk).

$ cat script2.sh
#!/bin/bash
#SBATCH -a 1-3

FILENAME=`awk -v line=$SLURM_ARRAY_TASK_ID 'NR==line {print $1}' file-list`
cat $FILENAME

exit 0

$ sbatch script2.sh

By submitting this array job, 3 instances of the script2.sh script would get run, and each instance would read its file name from the line of file-list whose line number matches its value of $SLURM_ARRAY_TASK_ID. As in our previous example, 3 output files would get created by the 3 tasks, and each output file would contain the contents of the respective input file.
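
As an aside, if an array has many tasks, SLURM’s “%” throttle syntax limits how many run at the same time. For example, to run tasks 1 through 100 with at most 10 running concurrently (script1.sh is used here just as an illustration):

# Run tasks 1-100, at most 10 at a time
sbatch -a 1-100%10 script1.sh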

Why does running “firefox” from a compute node crash?

UPDATE – 2024-02-01 – The JHPCE Cluster has been recently migrated from SGE to SLURM.  We do, however, recommend that users use “chromium-browser” rather than Firefox on the new JHPCE cluster.

This page is no longer relevant, but is being retained for historical purposes.  

It appears that firefox allocates RAM based on the stack size setting for your qrsh session. To run firefox, you can do one of the following:

  • Request 40GB of RAM (qrsh -l mem_free=40G,h_vmem=40G)
  • Reduce the stack size from the default 128M to 8M (qrsh -l h_stack=8M)
  • Use qrsh as normal, then reduce the stack size by running “ulimit” (qrsh ; ulimit -S -s 8192)

I’m getting X11 errors when using rstudio with PuTTY and Xming

We’ve had issues reported by users of PuTTY with Xming where the rstudio program fails to launch with errors such as:

$ rstudio
qt.qpa.xcb: X server does not support XInput 2
failed to get the current screen resources
libGL error: unable to load driver: swrast_dri.so
libGL error: failed to load driver: swrast
The X11 connection broke: I/O error (code 1)
XIO: fatal IO error 2 (No such file or directory) on X server "localhost:10.0"
after 391 requests (382 known processed) with 0 events remaining.

One solution we’ve found is to use vcxsrv (from https://sourceforge.net/projects/vcxsrv/) instead of the older Xming.

How can I add packages into emacs on the cluster?

From Brian Caffo (our resident emacs expert):

Here are the steps if someone wants a package installed in emacs (for themselves):

srun --pty --x11 bash
module load emacs (or whatever)

Add these lines to your .emacs (this increases the set of packages that can be accessed):

(require 'package)
(add-to-list 'package-archives '("melpa" . "http://melpa.org/packages/"))

Then restart emacs. Then do

M-x package-list-packages

then find the package that you want, hit enter, then install. It
installs it locally.

How do I delete saved passwords in MobaXterm?

When using MobaXterm, you should not save your password when prompted to do so. MobaXterm will save your password and then inadvisedly try to use it as your “Verification Code:”, which means that when you first connect to the cluster in MobaXterm, you are prompted for “Password:”, at which you will need to press Enter to be prompted for “Verification Code:”. If you accidentally saved your password, you can remove the saved password with the following steps.

1) In MobaXterm, go to “Settings->Configuration”

2) On the next screen select “MobaXterm Password Management”

3) This will display a list of saved passwords, and you should delete all of the entries that reference “jhpce”.

Once these entries are deleted, you should be prompted for “Verification Code:” when you connect to the cluster via MobaXterm.

My script is giving odd error messages about “\r” or “^M”. What is wrong?

Windows and Unix use different characters to indicate a “newline” or “end of line”.  If you have uploaded your script from a Windows machine, it may have the Windows “newline” characters.  These need to be replaced by the Unix “newline” characters.  To do this, you can run the “dos2unix” command on your script:

dos2unix myscript.sh

This will strip out all of the Windows newlines and replace them with the Unix newlines.
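
If the “dos2unix” command is not available, a sed one-liner can accomplish the same thing (this assumes GNU sed for the in-place "-i" option):

# Strip the carriage-return (\r) characters in place
sed -i 's/\r$//' myscript.sh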

https://www.computerhope.com/unix/dos2unix.htm

Xauth error messages from MacOS Sierra when using X11 forwarding in SSH

With the recent upgrade to MacOS Sierra, the “-X” option to ssh to enable X11 forwarding may not work.  If  you receive the message:

"untrusted X11 forwarding setup failed: xauth key data not generated"

after a recent MacOS upgrade, and X11 forwarding does not work, you can resolve the issue by adding the line “ForwardX11Trusted yes” to your ~/.ssh/config file on your Mac.  This should allow X11 forwarding to work from the JHPCE cluster.  You may still see the warning:

"Warning: No xauth data; using fake authentication data for X11 forwarding"

To eliminate this warning, add the line “XAuthLocation /usr/X11/bin/xauth” to your ~/.ssh/config file on your Mac.
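
For example, both settings could be placed in a per-host stanza of your ~/.ssh/config file (the host name shown is the JHPCE login node; adjust as needed):

# ~/.ssh/config on your Mac
Host jhpce01.jhsph.edu
    ForwardX11Trusted yes
    XAuthLocation /usr/X11/bin/xauth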

When running SAS, an error dialog pops up about Remote Browser

UPDATE – 2024-02-01 – The JHPCE Cluster has been recently migrated from SGE to SLURM.  The recommended way to run SAS so that a browser can be used to display help pages and graphics is:

$ srun --pty --x11  bash 
$ module load sas
$ sas -helpbrowser SAS -xrm "SAS.webBrowser:'/usr/bin/chromium-browser'" -xrm "SAS.helpBrowser:'/usr/bin/chromium-browser'"

The information below is no longer relevant, but is being retained for historical purposes.

In SAS, results are sometimes presented in HTML format, and a web browser is needed in order to view them. When SAS is used on a desktop, the local desktop browser is used; however, when SAS is run on a remote system, such as the JHPCE cluster, SAS doesn’t know how to connect to a browser and needs to use the SAS “Remote Browser” utility to locate one.

To use the “Remote Browser” when using SAS on the JHPCE cluster, we have found that the following works:

1) Log in to the cluster as you normally would.
2) On the jhpce01 login node, start up the Firefox browser in the background:

$ firefox &

3) Still on jhpce01,

$ rbrowser &

4) Now connect to the SAS queue, and start up the SAS program.

 $ qrsh -l sas
 $ sas

This should allow you to see HTML formatted data in the browser running on jhpce01. One last note: you will be prompted to allow pop-ups from the SAS node when you first try to display something back to the Firefox browser on jhpce01 – you should allow the pop-ups.

How do I get the Rstudio program to work on the cluster?

The RStudio program (https://www.rstudio.com/products/rstudio/) is a graphical development interface to the R statistical package.  Some people find the graphical RStudio program helpful in organizing R projects, and writing and debugging R programs.

To run rstudio on the JHPCE cluster, you can use the following steps:

  • First, make sure you have an X11 server installed on your laptop/desktop (either Xquartz for MacOS, or MobaXterm for PCs).
  • Next, ssh into the JHPCE cluster, making sure that the X11 forwarding option is used for SSH.  To enable X11 forwarding from MacOS, add the “-X” option to your ssh command.  For MobaXterm on Windows, X11 forwarding is enabled by default.
  • Once you are on the cluster, use “srun --pty --x11 bash” to connect to a compute node
  • Load the “rstudio” module by running “module load rstudio”
  • Start the rstudio program by running “rstudio”
  • Within a couple of seconds, the RStudio interface should be displayed.

An example session for user “bob” would look something like:

BobsMac$ ssh -X bob@jhpce01.jhsph.edu
Last login: Mon Dec 26 08:24:01 2016 from 10.11.12.13
---
Use of this system constitutes agreement to adhere to all applicable 
JHU and JHSPH network and computer use policies.
---
[jhpce01 /users/bob ]$ srun --pty --x11 bash
[compute-072 /users/bob]$ module load rstudio
[compute-072 /users/bob]$ rstudio

Please be aware that rstudio is a very graphics-heavy program and uses a fair amount of network bandwidth.  You will need to make sure that you have a fairly fast network connection in order to use rstudio effectively.

Why aren’t SLURM commands, or R, or matlab, or… available to my cron job?

cron jobs are not launched from a login shell, but the module command and the JHPCE default environment are initialized automatically only when you log in. Consequently, in a cron job, you have to do the initialization yourself. Do this by wrapping your cron job in a bash script that initializes the module command and then loads the default JHPCE environment module. Your bash shell script should start with the following lines:

#!/bin/bash

# Source the global bashrc
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
module load JHPCE_ROCKY9_DEFAULT_ENV

This should allow your cron jobs to run within SLURM.
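
For reference, here is a hypothetical crontab entry (edited with “crontab -e”) that runs such a wrapper script nightly; the script and log file paths are illustrative:

# Run the wrapper script every night at 2:00 AM and append its output to a log
0 2 * * * /users/bob/my-cron-wrapper.sh >> /users/bob/cron.log 2>&1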

ssh gave a scary warning: REMOTE HOST IDENTIFICATION HAS CHANGED!

Go into the ~/.ssh directory of your laptop/desktop and edit the known_hosts file.
Search for the line that starts with the host that you ssh’d to. Delete that line (it is probably a long line that wraps). Then try again.
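
Alternatively, “ssh-keygen -R” removes all saved keys for a given host from your known_hosts file. For example, using the JHPCE login node name:

# Remove the saved host key for the JHPCE login node
ssh-keygen -R jhpce01.jhsph.edu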

My app claims it’s out of disk space, but I see there is plenty of space, what gives?

UPDATE – 2024-02-01 – The JHPCE Cluster has been recently migrated from SGE to SLURM. SLURM does not have a file size limit, so this page is no longer relevant, but it is being retained for historical purposes.  If you do get an “out of space” message, you can use the Unix “df” command to look at the disk usage of your current filesystem (df -h .)

By default, every user should have a .sge_request file in their home directory.  This file contains a line like this:

-l h_fsize=10G

This limits the size of all created files to 10GB.  If you plan on creating larger files you should increase this limit, either in the .sge_request file before you start your qrsh session, or in the batch script you submit via qsub. From the command line,  you would start a qrsh session as follows:

qrsh  -l h_fsize=300G

My app is complaining that it can’t find a shared library, e.g. libgfortran.so.1. Could you please install it?

We would guess that 9 times out of 10, the allegedly missing library is there. The problem is that your application is looking for the version of the library that is compatible with the old system software. It will not help to point your application to the old libraries. They are more than likely to be incompatible with the new system, and we won’t help you debug any problems if you try to do this. The correct solution is to reinstall your software. If the problem persists after the reinstallation, then please contact us and we will install standard libraries that are actually missing.
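
To see which shared libraries your program is trying to load, and which of them are unresolved, the “ldd” command can help (the program path below is illustrative):

# List the program's shared library dependencies and show the unresolved ones
ldd /path/to/your/program | grep "not found"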

How do I copy a large directory structure from one place to another?

As an example, to copy a directory tree from /home/bst/bob/src to /dcs01/bob/dst, first create a cluster script, let’s call it “copy-job”, that contains the line:

rsync -avzh /home/bst/bob/src/ /dcs01/bob/dst/

Next, submit it as a batch job to the cluster:

sbatch --mail-type=FAIL,END --mail-user=bob@jhu.edu copy-job

This will submit the “copy-job” script to the cluster, which will run the job on one of the compute nodes, and send an email when it finishes.
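
For reference, here is a minimal sketch of what the complete “copy-job” script might look like (assuming bash; the #SBATCH lines are an optional alternative to passing the mail options on the sbatch command line as shown above):

#!/bin/bash
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bob@jhu.edu

# Copy the directory tree; the trailing slashes copy the contents of src into dst
rsync -avzh /home/bst/bob/src/ /dcs01/bob/dst/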

My X11 forwarding stops working after 20 minutes.

X11 forwarding can be enabled in your ssh session using the -X option to the ssh command:

$ ssh -X username@jhpce01.jhsph.edu

This will allow you to run X-based programs from the JHPCE cluster back to the X server running on your desktop (such as XQuartz on Mac computers).  On some Mac computers X11 forwarding will work for a while but may eventually time out, with the error message:

Xt error: Can't open display: localhost:15.0

This error comes from the “ForwardX11Timeout” setting, which defaults to 20 minutes.  To avoid this issue, a larger timeout, say 336 hours (2 weeks), can be supplied on the command line:

$ ssh -X username@jhpce01.jhsph.edu -o ForwardX11Timeout=336h

or it can be changed on your desktop by adding the line:

ForwardX11Timeout 336h

to the end of the system-wide /etc/ssh/ssh_config file, or to your own ~/.ssh/config file.  Note: a value higher than 596h may cause the X window server on your desktop to fail, as it is greater than 2^31 milliseconds and will exceed the signed 32-bit size of the “ForwardX11Timeout” variable.
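
For example, the timeout could be set in a per-host stanza of your ~/.ssh/config file (the host name shown is the JHPCE login node; adjust as needed):

# ~/.ssh/config on your desktop
Host jhpce01.jhsph.edu
    ForwardX11 yes
    ForwardX11Timeout 336h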

Why was RAM made a consumable resource on the cluster?

UPDATE – 2024-02-01 – The JHPCE Cluster has been recently migrated from SGE to SLURM. RAM is still considered a consumable resource in SLURM, and you can read the historical rationale for that below.

The way that SGE on the JHPCE cluster had been configured was to NOT treat RAM as a “consumable” resource. The recent change to the SGE configuration changed this so that RAM is now “consumable” and gets reserved for a job when it is requested.  What does this mean?

As an example, for simplicity’s sake, let’s say you have a cluster with just 1 node, and that node has 20GB of RAM.  If you run a job that requests 8GB of RAM (mem_free=8GB, h_vmem=9GB) it will start to run on the node immediately. Now, this job takes a few minutes for all 8GB to actually be used by the program – let’s say it consumes 2GB/minute, so after 4 minutes all 8GB will be in use.  A minute after the first job starts, when it is using 2GB of RAM, a second job comes along and requests 8GB of RAM.  SGE will see that there is still 18GB of RAM free on the node and start the second job.  A minute later, a third job comes along, also requesting 8GB.  The first job is using 4GB, the second job is using 2GB, the node has 14GB free, so SGE, seeing that 8GB is available, starts the third job.  So now you have 3 jobs running that will eventually need 24GB of RAM in total, and there is only 20GB on the system, so at some point the node becomes RAM starved and the Linux oom-killer gets invoked to kill a process.  (For extra credit – at what time does the node run out of RAM? 🙂 )

The change made to the cluster alters the behavior of SGE so that RAM is “consumable”: when you request 8GB, SGE marks that 8GB as reserved.  In the above example, the first 2 jobs would have run, and SGE would have marked 16GB of RAM as “consumed”, so the third job would not have run until one of the other jobs finished.  The biggest downside to this approach is that if people request much more RAM than their jobs need, then jobs will have to wait longer to run, and resources may go unused.  If, in the above example, the first job had requested 15GB of RAM “to be safe”, then that would have prevented the second job from starting until the first completed, even though the 2 jobs could have run concurrently.

Why do I get memory errors when running Java?

UPDATE – 2024-02-01 – The JHPCE Cluster has been recently migrated from SGE to SLURM, and this does not appear to be an issue under SLURM.

[user@login31 ~]$ srun --pty --x11 bash
[user@compute-127 ~]$ java -version
openjdk version "1.8.0_372"
OpenJDK Runtime Environment (build 1.8.0_372-b07)
OpenJDK 64-Bit Server VM (build 25.372-b07, mixed mode)
[user@compute-127 ~]$ module avail java
----------------------- /jhpce/shared/jhpce/modulefiles ------------------------
   java/19 (D)
------------------------ /jhpce/shared/libd/modulefiles ------------------------
   java/17    java/18
  Where:
   D:  Default Module
[user@compute-127 ~]$ module load java/19
[user@compute-127 ~]$ java -version
java version "19.0.1" 2022-10-18
Java(TM) SE Runtime Environment (build 19.0.1+10-21)
Java HotSpot(TM) 64-Bit Server VM (build 19.0.1+10-21, mixed mode, sharing)

The information below is no longer relevant in JHPCE, but is being retained for historical purposes. 

You may see errors such as the following when you try to run Java:

$ java
Error occurred during initialization of VM
Could not reserve enough space for object heap
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

This is due to the default maximum heap size in Java being 32GB, while the JHPCE cluster defaults for mem_free and h_vmem are set to 2GB and 3GB respectively.  The default settings for qrsh are therefore too small to accommodate the memory required by Java.  You have 3 options to get this to work:

1) If you think you will really need 32GB of memory for your java program, you can increase your mem_free and h_vmem settings of your qrsh command:

jhpce01: qrsh -l mem_free=40G,h_vmem=40G
compute-085: java

2) More likely, you do not need 32GB for your Java program, so you can direct Java to use less memory by using the “-Xmx” and “-Xms” options to Java.  For instance, if you want to set the initial heap size to 1GB and the maximum heap size to 2GB you could use:

jhpce01: qrsh 
compute-085: java -Xms1g -Xmx2g

3) An alternative way to set the Java memory settings is to use the “_JAVA_OPTIONS” environment variable.  This is useful if the call to run java is embedded within a  script that cannot be altered.  For instance, if you want to set the initial heap size to 1GB and the maximum heap size to 2GB you could use:

jhpce01: qrsh
compute-085: export _JAVA_OPTIONS="-Xms1g -Xmx2g" 
compute-085: java

I’m on a Mac, and the ~C command to interrupt an ssh session isn’t working. It used to, but I upgraded MacOS and now it does not work.

Some versions of MacOS have disabled by default the ability to send an SSH escape with “~C”.  To re-enable this on your Mac, you need to set the “EnableEscapeCommandline” option.  You can do this either by running “ssh -o EnableEscapeCommandline=yes . . .” or by editing your ~/.ssh/config file and adding the following line at the top of that file:

EnableEscapeCommandline=yes

This should now let you use “~C” to interrupt an ssh session.