<aside> 💡 If your SSH public keys no longer work, you cannot log in to wandb, or your password expires, here is the procedure to recover access should you need it:

<aside> 💡 The ext1, ext2 and extcpu login nodes are up. We ran chown commands to update the uid/gid for all home files.

</aside>

<aside> 💡 You will need to update all scripts to use /weka instead of /fsx

</aside>

We want to inform you that we have scheduled maintenance for our HPC cluster on December 26, beginning at 8am CET. The maintenance is expected to last up to 8 hours, with service resuming before 4pm CET. The cluster will be inaccessible during this time. This maintenance is essential to ensure our systems' continued performance, reliability, and security. We apologise for any inconvenience this may cause and appreciate your understanding. Please mark your calendar and plan accordingly for this maintenance window. If you have any questions or need assistance, don't hesitate to contact our support team via our Support Page.

This maintenance will introduce changes that your team should review so you can resume your workflows smoothly:

  1. New login node designations (grantwest -> ext1, welcomewest -> ext2, grantwestcpu -> extcpu)

  2. Partition name change (g40x -> a40x)

  3. The CPU count per GPU changes from 12 to 10. When running several tasks per node (e.g. with --ntasks-per-node), distribute the available CPUs among the tasks by also setting --cpus-per-task to the correct number. For instance, if you want eight tasks per node, request 10 CPUs for each task (see the sbatch sketch after this list). This should help avoid CPU allocation issues. When these options are omitted, the cluster will allocate the correct number of CPUs per GPU by default.

  4. Authentication will be done via a new directory, and uid/gid values will change. While we will keep a backup of the original data in FSx, we highly recommend that every user back up their FSx data to S3 (and curate it in the process) for easy restoration on the new cluster; see the backup sketch after this list. The new cluster will employ a larger and more performant file system, the converged Weka file system, in place of the current /fsx mount point (see item 6).

  5. /scratch space will shrink to about half its current size, with the other half consumed by wekafs.

  6. The /fsx mount will be replaced with the /weka mount. It will feature about 160TB of SSD space, extended by 600TB of S3 storage. The file system keeps the most recently used files on SSD, so you should not treat it as infinite fast storage. Nobody should keep massive datasets on this file system (please do not store more than 5TB; use a pure S3 solution instead).

    <aside> 💡 To move data from the old login nodes (which expose /fsx) to the new login nodes (which expose /weka), an rclone transfer is also an excellent option; see the rclone sketch after this list.

    </aside>

  7. S5F (Stable S3 Security Framework) will be deployed. S3 buckets will no longer be accessible from the login nodes; an active Slurm job is required for S3 access (see the srun sketch after this list). Project leaders must request explicit access to S3 buckets for their teams by specifying the project name, S3 bucket name, and access level (list, read, write). This applies even if Stability does not own the bucket.

  8. Stablessh will also be deployed. SSH access will only be allowed into the space of your running Slurm jobs, and unrestricted SSH access to arbitrary nodes will be retired (see the SSH sketch after this list).
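
For item 3, here is a minimal sbatch sketch of the layout described above. It assumes 8 GPUs per node (matching the eight-tasks example) and one task per GPU; the script name is a placeholder:

```bash
#!/bin/bash
# One task per GPU on the a40x partition: 8 tasks x 10 CPUs = 80 CPUs per node.
#SBATCH --partition=a40x
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=10

srun python train.py
```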
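
For item 4, a sketch of backing up (and curating) a project directory from FSx to S3 while the old login nodes are still up; the bucket, paths, and exclude patterns are placeholders:

```bash
# Run on an old login node while /fsx is still mounted.
# Exclude caches and other clutter so the backup is already curated.
aws s3 sync /fsx/myproject s3://my-team-bucket/fsx-backup/myproject \
    --exclude "__pycache__/*" --exclude "wandb/*" --exclude "*.tmp"

# Check what actually landed in the bucket.
aws s3 ls --recursive --summarize s3://my-team-bucket/fsx-backup/myproject/ | tail -n 3
```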
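
For the rclone route mentioned under item 6, a sketch that pulls data from an old login node over SFTP onto /weka; the remote name, hostname, user, and paths are placeholders:

```bash
# One-time setup on a new login node: an SFTP remote pointing at the old login node.
rclone config create oldcluster sftp \
    host grantwest.cluster.example user myuser key_file ~/.ssh/id_ed25519

# Copy a project from the old /fsx to the new /weka with parallel transfers.
rclone copy oldcluster:/fsx/myproject /weka/myproject --progress --transfers 8
```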
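
For item 7, once S5F is live, S3 transfers have to run inside a Slurm allocation rather than on a login node. A sketch, with the partition, bucket, and destination path as placeholders:

```bash
# Wrap the S3 transfer in a short Slurm job so S5F grants access.
srun --partition=a40x --cpus-per-task=4 --time=01:00:00 \
    aws s3 cp s3://my-team-bucket/datasets/shard-000.tar /weka/myproject/data/
```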
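
For item 8, stablessh's exact interface isn't covered here, but the general pattern with job-scoped SSH is to locate the node(s) your job occupies and SSH there; the session is then confined to your job's space. A sketch using only standard Slurm commands (the node name below is whatever squeue reports):

```bash
# List your running jobs and the nodes they occupy.
squeue --me --states=RUNNING --format="%.10i %.20j %N"

# SSH into a node reported above; access is scoped to your running job.
ssh gpu-node-042
```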

The old login nodes and FSx volumes will remain available until January 10th, but we encourage you to start backing up data to S3 now. The new cluster will also start with blank home folders, so your environments must be recreated; you may find it helpful to back up your home folders as well.
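
A sketch for the home-folder backup mentioned above: archive the parts of $HOME you want to keep and push the archive to S3 while the old login nodes are still up. The bucket name and exclude list are placeholders:

```bash
# Archive your home folder, skipping caches, and copy it to S3
# before January 10th so it can be restored on the new cluster.
tar czf /tmp/home-$USER.tar.gz -C "$HOME" --exclude=".cache" --exclude=".conda/pkgs" .
aws s3 cp /tmp/home-$USER.tar.gz s3://my-team-bucket/home-backups/
```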

Thank you for your cooperation,

Stability AI

PS. Please read the updated HPC Cluster User Guide at https://stabilityai.notion.site/Stability-HPC-Cluster-User-Guide-226c46436df94d24b682239472e36843?pvs=4