GPUs on the JHPCE cluster

We have a number of GPU nodes on the JHPCE cluster that are available for general use. Below is the process for accessing a GPU node and running GPU code on it.

1. First, log in to the JHPCE cluster.
2. From the login node, run “qpic -q gpu” to see whether any GPUs are available on the gpu queue.
$ qpic -q gpu
		gpu	shared	| Jobs   - Cores-   Load    |  Used - Tot RAM - mem_free
compute-117  :	2/3		|  2	-   48  -   2.43   |   32G -  376G   -   344G 
compute-123  :	0/1	2/24	|  2	-   40  -   4.21   |  108G -  754G   -   645G 
Totals:		2/4	2/24 |     3 /   88    3%       |  140G /1130 G   12% 989G
		gpu	shared	

3. Connect to a GPU node by running “qrsh -l gpu”. Be sure to request sufficient RAM for your job with the mem_free and h_vmem options; typically 100GB will be required.
4. Identify which GPUs are available by running “nvidia-smi”. In the example below, GPUs 0 and 1 are in use (each shows memory usage, high utilization, and a running python3 process), so GPU 2 is available:

[jhpce01 /users/mmill116]$ qrsh -l gpu -l mem_free=100G,h_vmem=100G
Last login: Mon Apr 25 16:30:39 2022 from jhpce01.cm.cluster
[compute-117 /users/mmill116]$ nvidia-smi 
Mon May 23 11:24:13 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:1B:00.0 Off |                    0 |
| N/A   67C    P0   237W / 250W |   5853MiB / 32510MiB |     72%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:88:00.0 Off |                    0 |
| N/A   64C    P0   255W / 250W |   5853MiB / 32510MiB |     72%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN V      Off  | 00000000:B2:00.0 Off |                  N/A |
| 18%   36C    P0    35W / 250W |      0MiB / 12066MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    154362      C   ...da3/envs/V100/bin/python3     5849MiB |
|    1   N/A  N/A    154363      C   ...da3/envs/V100/bin/python3     5849MiB |
+-----------------------------------------------------------------------------+

5. To select a GPU, set the CUDA_VISIBLE_DEVICES environment variable to the index of an available GPU. In the above example, you would run:

[compute-117 /users/mmill116]$ export CUDA_VISIBLE_DEVICES=2
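
Steps 4 and 5 can also be done from inside Python. The following is a minimal sketch, not a JHPCE-provided tool: the function names and the idle thresholds (100 MiB of memory, 10% utilization) are assumptions chosen for illustration. It relies on the standard nvidia-smi options --query-gpu and --format=csv,noheader,nounits, and it assumes every field comes back as a plain integer. The sample text is copied from the nvidia-smi table above.

```python
import csv
import io
import os
import subprocess

# Standard nvidia-smi query: one CSV row per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def idle_gpus(csv_text, max_mem_mib=100, max_util_pct=10):
    """Return the indices of GPUs at or below both (assumed) thresholds."""
    idle = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        index, mem_used, util = (field.strip() for field in row)
        if int(mem_used) <= max_mem_mib and int(util) <= max_util_pct:
            idle.append(int(index))
    return idle


def select_first_idle_gpu():
    """Run nvidia-smi on the current node and export the first idle GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    gpus = idle_gpus(out)
    if not gpus:
        raise RuntimeError("no idle GPU found on this node")
    # Must be set before TensorFlow (or any other CUDA library) is imported.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpus[0])
    return gpus[0]


# Index, memory used (MiB), and utilization (%) from the table above.
sample = "0, 5853, 72\n1, 5853, 72\n2, 0, 2\n"
print(idle_gpus(sample))  # [2]
```

On a GPU node you would call select_first_idle_gpu() at the top of your script, before importing TensorFlow. Another user could still claim the same GPU between the check and the start of your job, so it is worth re-running nvidia-smi if a job fails to allocate memory.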

6. At this point you can start running your GPU-specific code. As an example, to run the MNIST TensorFlow example, you could use the following code. We have a conda environment set up with TensorFlow and the required CUDA libraries.

[jhpce01 /users/mmill116]$ qrsh -l gpu -l mem_free=100G,h_vmem=100G
Last login: Mon May 23 11:43:53 2022 from jhpce01.cm.cluster
[compute-117 /users/mmill116]$ module load conda
[compute-117 /users/mmill116]$ source activate /jhpce/shared/jhpce/core/conda/miniconda3-4.6.14/envs/tensorflow-gpu-2.2
(tensorflow-gpu-2.2) [compute-117 /users/mmill116]$ which python
/jhpce/shared/jhpce/core/conda/miniconda3-4.6.14/envs/tensorflow-gpu-2.2/bin/python
(tensorflow-gpu-2.2) [compute-117 /users/mmill116]$ export CUDA_VISIBLE_DEVICES=2
(tensorflow-gpu-2.2) [compute-117 /users/mmill116]$ python
Python 3.8.12 (default, Oct 12 2021, 13:49:34) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> mnist = tf.keras.datasets.mnist
>>> (x_train, y_train),(x_test, y_test) = mnist.load_data()
>>> x_train, x_test = x_train / 255.0, x_test / 255.0
>>> model = tf.keras.models.Sequential([
...   tf.keras.layers.Flatten(input_shape=(28, 28)),
...   tf.keras.layers.Dense(512, activation=tf.nn.relu),
...   tf.keras.layers.Dropout(0.2),
...   tf.keras.layers.Dense(10, activation=tf.nn.softmax)
... ])
2022-05-23 11:57:15.100130: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2022-05-23 11:57:16.302334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:b2:00.0 name: NVIDIA TITAN V computeCapability: 7.0
coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 11.78GiB deviceMemoryBandwidth: 607.97GiB/s
2022-05-23 11:57:16.405242: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2022-05-23 11:57:17.446421: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2022-05-23 11:57:17.995540: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2022-05-23 11:57:18.546447: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2022-05-23 11:57:19.260189: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2022-05-23 11:57:19.656225: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2022-05-23 11:57:20.837449: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-05-23 11:57:20.852655: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2022-05-23 11:57:20.853406: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2022-05-23 11:57:20.886903: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2100000000 Hz
2022-05-23 11:57:20.887830: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55aaea148100 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-05-23 11:57:20.887871: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2022-05-23 11:57:21.073669: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55aaea158db0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-05-23 11:57:21.073701: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA TITAN V, Compute Capability 7.0
2022-05-23 11:57:21.074813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:b2:00.0 name: NVIDIA TITAN V computeCapability: 7.0
coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 11.78GiB deviceMemoryBandwidth: 607.97GiB/s
2022-05-23 11:57:21.074860: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2022-05-23 11:57:21.074874: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2022-05-23 11:57:21.074887: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2022-05-23 11:57:21.074899: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2022-05-23 11:57:21.074911: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2022-05-23 11:57:21.074923: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2022-05-23 11:57:21.074936: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-05-23 11:57:21.084690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2022-05-23 11:57:21.084761: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2022-05-23 11:57:21.086334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-05-23 11:57:21.086348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 
2022-05-23 11:57:21.086360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N 
2022-05-23 11:57:21.088197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11054 MB memory) -> physical GPU (device: 0, name: NVIDIA TITAN V, pci bus id: 0000:b2:00.0, compute capability: 7.0)
>>> model.compile(optimizer='adam',
...               loss='sparse_categorical_crossentropy',
...               metrics=['accuracy'])
>>> model.fit(x_train, y_train, epochs=5)
Epoch 1/5
2022-05-23 11:58:15.454941: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.2217 - accuracy: 0.9338
Epoch 2/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0975 - accuracy: 0.9700
Epoch 3/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0684 - accuracy: 0.9784
Epoch 4/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0535 - accuracy: 0.9831
Epoch 5/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0450 - accuracy: 0.9857
<tensorflow.python.keras.callbacks.History object at 0x7f17c1ea0f10>
>>> model.evaluate(x_test, y_test)
313/313 [==============================] - 0s 1ms/step - loss: 0.0707 - accuracy: 0.9795
[0.07074514031410217, 0.9794999957084656]
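
Note that the log above reports the TITAN V as "device 0" and "GPU:0", even though nvidia-smi lists it as GPU 2. When CUDA_VISIBLE_DEVICES is set, the visible GPUs are renumbered starting from 0 in the order they are listed. The sketch below illustrates that mapping; the helper name is hypothetical, and it assumes the variable holds a comma-separated list of integer indices (it can also hold GPU UUIDs, which this sketch does not handle).

```python
def logical_to_physical(visible_devices):
    """Map the device numbers a program sees to the indices that were exported.

    Assumes a comma-separated list of integer indices, e.g. "2" or "2,0".
    """
    physical = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    return dict(enumerate(physical))


print(logical_to_physical("2"))    # {0: 2}
print(logical_to_physical("2,0"))  # {0: 2, 1: 0}
```

To confirm that the selected GPU is the one you intended, compare the pciBusID in the TensorFlow log (0000:b2:00.0 above) with the Bus-Id column of the nvidia-smi output (00000000:B2:00.0 for GPU 2).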