Disk storage space on the JHPCE cluster

There are several types of storage on the JHPCE cluster. Some areas are intended for long-term storage of files, while others are meant for short-term storage of data.

For long-term storage of files, most users make use of the 100GB of space in their home directory. All users have a unique home directory, /users/USERNAME, which by default is only visible to them. There are ways to share data in your home directory with others, using Unix groups or Access Control Lists (more info at https://jhpce.jhu.edu/knowledge-base/granting-permissions-using-acls), but by default only the owner of a home directory will be able to access it. Home directories do get backed up, whereas other storage spaces on the cluster may not.
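The knowledge-base page linked above has the authoritative instructions, but as a rough sketch, ACLs are granted with the standard setfacl command. The username "collabuser" and the directory name "shared_project" below are placeholders:

    # Allow collabuser to traverse your home directory (without listing its contents)
    setfacl -m u:collabuser:x "$HOME"

    # Give collabuser read access to one project directory and its current contents
    setfacl -R -m u:collabuser:rX "$HOME/shared_project"

    # Have the same access apply to files created in that directory in the future
    setfacl -d -m u:collabuser:rX "$HOME/shared_project"

    # Review the ACLs that are now in place
    getfacl "$HOME/shared_project"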

For those groups needing more storage space than their 100GB home directory, we have large storage arrays (over 10,000 TB of space in total), and we sell allocations on these arrays. We build a new storage array about every 18 months, so if you are interested in purchasing an allocation on our next storage build, please email us at bitsupport@lists.jhu.edu. We have additional information, including current storage charges, at https://jhpce.jhu.edu/policies/current-storage-offerings/

In addition to these long-term storage offerings, there are a couple of “scratch” space areas for short-term data storage. Scratch space tends to be faster than the long-term project space mentioned above, so you may see a reduction in the run time of your programs by using it. You will also avoid taking up precious space in your home directory or project storage space. Some common use cases for scratch space are:

  • Temporary or intermediary files. Programs often generate intermediary files that are only needed while the program is running and can be discarded once it completes.
  • Data downloaded from an external source. If you download data from another institution or from a website and don’t need to keep it, download it to scratch space to avoid taking up space elsewhere.
  • Files that are read multiple times. If your program reads the same data file multiple times, you may see a speedup by first copying that file to scratch space and then having your program read it from there (see the sketch below).
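As a minimal sketch of that last use case, a job might first stage its input file into scratch and then read the copy. The file and program names below are placeholders, and the example uses the per-job $TMPDIR scratch directory described later in this section (a fastscratch directory would work the same way):

    # Stage the input file into scratch, then read the local copy repeatedly
    cp /path/to/project/mydata.csv "$TMPDIR/"
    my_analysis --input "$TMPDIR/mydata.csv" --output results.out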

There are two areas used for scratch storage. The preferred area is the “fastscratch” array on the cluster. The fastscratch array provides 22TB of space built on SSDs, which are faster than the traditional hard drives used for project space and home directories. All users have a 1TB quota on fastscratch, and data older than 30 days is purged. More details on using fastscratch can be found at https://jhpce.jhu.edu/knowledge-base/fastscratch-space-on-jhpce/
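As a rough sketch of a fastscratch workflow (the mount point and directory layout below are assumptions; the knowledge-base page linked above has the actual path and usage details):

    # Create a working area on fastscratch, use it, and clean up when finished
    mkdir -p /fastscratch/myscratch/$USER/bigjob     # path shown is an assumption
    cp large_input.dat /fastscratch/myscratch/$USER/bigjob/
    # ... run your analysis against the copy on fastscratch ...
    rm -rf /fastscratch/myscratch/$USER/bigjob       # data is purged after 30 days regardless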

The second type of scratch space is the SGE scratch space. The SGE scheduler creates a unique $TMPDIR directory under /scratch/temp for every job/task that runs on the cluster. This directory is created when the job starts on a compute node and is removed when the job completes. The $TMPDIR space is created on the compute node’s local internal disk drives, so the amount of scratch space varies from node to node, but will be between 100GB and 4TB. The SGE scratch space will typically be faster than other storage space, so it is useful for small intermediary files which do not need to be stored long term, or for jobs which perform multiple reads/writes of the same files.
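A minimal sketch of a job script that keeps its intermediary files in the per-job $TMPDIR might look like the following (the program and file names are placeholders, and the resource request is illustrative):

    #!/bin/bash
    #$ -cwd
    #$ -l mem_free=4G,h_vmem=4G

    # Write the intermediary file to the per-job scratch directory provided by SGE
    step1 --input input.dat --output "$TMPDIR/intermediate.dat"
    step2 --input "$TMPDIR/intermediate.dat" --output results.dat

    # $TMPDIR and its contents are removed automatically when the job completes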

Traditionally in Unix/Linux, the /tmp (or /var/tmp) directories are used for storing temporary files. On the JHPCE cluster, the use of /tmp is strongly discouraged. The /tmp directory on the compute nodes is smaller than the /scratch directories and can fill up easily. If your application uses /tmp for temporary files, please use an option for your application that points it at fastscratch (preferably) or the SGE scratch space, as in the sketch below.
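Many Unix programs honor the TMPDIR environment variable, and some have their own flag for the temporary directory (GNU sort's -T option, for example). A sketch, with the fastscratch path again an assumption:

    # Point programs that honor TMPDIR at fastscratch instead of /tmp
    export TMPDIR=/fastscratch/myscratch/$USER       # path shown is an assumption

    # Some tools take an explicit flag instead, e.g. GNU sort
    sort -T "$TMPDIR" hugefile.txt > hugefile.sorted.txt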

When running R, the tempdir() setting dictates where temporary files are stored. When you use qsub to submit a job, tempdir() is set to the SGE scratch directory. This should be fine in most cases, but if you might be generating tens of GB of temporary files, you may want to use “fastscratch”. When you qrsh into a compute node and then run R, tempdir() is set to /tmp, so you will likely want to change the R temp directory to use “fastscratch”.
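R picks up its temporary directory from the TMPDIR environment variable at startup, so one way to do this in a qrsh session is to export TMPDIR before launching R (the fastscratch path below is an assumption):

    # In a qrsh session, before starting R
    export TMPDIR=/fastscratch/myscratch/$USER       # path shown is an assumption
    R                                                # tempdir() now points under TMPDIR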

In SAS, the default WORK directory will be located under your “fastscratch” directory.
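If you need to place WORK somewhere else for a particular run, the -work command-line option can be used; a sketch (the program name is a placeholder):

    # Run a SAS program with WORK in the per-job SGE scratch directory
    sas -work "$TMPDIR" myprogram.sas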

In Stata, the default “tempfile” location is under /tmp. This can be changed by setting the STATATMP environment variable.
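A sketch of setting STATATMP before a batch Stata run (the executable name varies by Stata flavor, and the do-file name is a placeholder):

    # Point Stata's temporary files at the per-job SGE scratch directory
    export STATATMP="$TMPDIR"
    stata -b do myanalysis.do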