<aside> 💡 News for December 2023:

</aside>

<aside> 💡 New users, please start here: Jobs launched at the same priority level are served on a first-come, first-served basis. You can block yourself if you submit too many identical jobs at the same priority and the first one does not have enough resources to run. Always check sinfo and squeue, and cancel stalled job requests before launching new ones (see the example after this note).

</aside>
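As a quick sketch of that check (standard Slurm commands; the job ID below is a placeholder):

sinfo                  # node availability per partition
squeue -u $USER        # your own queued and running jobs
scancel 12345          # cancel a stalled job by its job ID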

<aside> 💡 GPU and CPU clusters differ: the nodes in the GPU cluster are static (meaning they are already running and starting a job takes seconds), whereas CPU cluster nodes are dynamic, meaning the job needs to create them on demand from AWS capacity and then initialise them, a process that can take up to 10 minutes. During this time the job will appear in the CF state (CF = configuring) and there will be no prompt available if you use an interactive session. Once the node is ready, the prompt will appear.

</aside>
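To watch a job move from CF to running, you can poll its state; a minimal sketch, assuming a placeholder job ID of 12345:

watch -n 10 squeue -u $USER          # refresh your job list every 10 seconds
squeue -j 12345 -o "%.10i %.12T"     # show a single job's ID and state (e.g. CONFIGURING, RUNNING)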

Changelog

Migration finished before the end of 2023

Migration has ended. Please check the ‣ and the supplement HPC Cluster Migration Dec 26

Data migration guide

<aside> 💡 Warning: data sent to S3 should be sent to a bucket in us-west-2, because the GPU nodes can only access buckets from that region. Here is how to check a bucket location: aws s3api get-bucket-location --bucket my-bucket

LocationConstraint will be null for buckets in us-east-1.

</aside>

aws s3api get-bucket-location --bucket datasets-west
{
    "LocationConstraint": "us-west-2"
}
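For comparison, a bucket in us-east-1 (the bucket name here is hypothetical) returns a null LocationConstraint:

aws s3api get-bucket-location --bucket datasets-east
{
    "LocationConstraint": null
}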

Below are some suggestions for coping with the limited shared storage space in the west region:

<aside> 💡 Please note that CPU clusters do not have persistent shared storage volumes, except for those that share the same volumes with the corresponding GPU clusters. That means they are not reliable for holding data, especially during maintenance. Currently, the ingresseast and ingress clusters have small scratch-like volumes, as they are intended to work with data going to and from S3 directly.

The intcpu and int* clusters share persistent shared storage volumes (coming from the GPU cluster). Likewise, the ext1 and extcpu clusters share the external shared storage volumes.

A scratch storage volume could, in theory, crash when a disk breaks, and its data would be lost. By contrast, the GPU clusters' shared storage volumes have fault tolerance built in and will heal in the event of a disk failure.

All data must be saved to S3 regularly; this is the default backup strategy (see the example after this note)!

</aside>
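A minimal sketch of that backup, assuming a hypothetical project directory and a bucket already created in us-west-2:

aws s3 sync /shared/my-project s3://my-backup-bucket/my-project    # mirror the directory to S3 (paths are placeholders)
aws s3api get-bucket-location --bucket my-backup-bucket            # confirm the bucket lives in us-west-2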

Introduction & Overview

<aside> 💡 Configuring your cluster user is described right after the overview paragraphs.

</aside>

In addition to this document, which contains more detailed information, please see the following resources.