Maui XC50 Slurm controller failover
Scheduled Maintenance Report for NeSI Status
Completed
This maintenance did not achieve the outcome we had hoped for. We are closing this maintenance in favour of a complete outage/reboot.
Posted Dec 19, 2023 - 09:55 NZDT
Verifying
Verification is currently underway for the maintenance items.
Posted Dec 18, 2023 - 19:52 NZDT
Update
Due to continuing issues, we are going to fail over to our primary controller. There may be some job submission and Slurm command timeouts during the failover process.
Posted Dec 18, 2023 - 10:41 NZDT
In progress
We continue to see issues with the controllers on Maui XC50 and continue to investigate and escalate to vendors.
Posted Dec 14, 2023 - 15:19 NZDT
Update
We are continuing to verify the maintenance items.
Posted Dec 14, 2023 - 15:17 NZDT
Verifying
We have failed over to the primary controller and are verifying the stability of the system.
Posted Dec 14, 2023 - 13:08 NZDT
In progress
Due to continuing issues, we are going to fail over to our primary controller at 13:00 NZDT. There may be some job submission and Slurm command timeouts during the failover process.
Posted Dec 14, 2023 - 12:38 NZDT
Update
The failover is complete; however, intermittent issues remain.
Posted Dec 13, 2023 - 19:30 NZDT
Verifying
Verification is currently underway for the maintenance items.
Posted Dec 13, 2023 - 18:12 NZDT
Update
Failover of the Slurm controller did not happen cleanly. We have extended the maintenance until 18:15 NZDT.
Posted Dec 13, 2023 - 17:47 NZDT
In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted Dec 13, 2023 - 16:45 NZDT
Update
We will be undergoing scheduled maintenance during this time.
Posted Dec 13, 2023 - 16:43 NZDT
Scheduled
We have located an issue on the primary Maui XC50 Slurm controller. To mitigate it, we need to add a new controller to the configuration and fail over to that node. Because we need to restart the slurmd daemons on all nodes, this will cause disruption to currently running jobs.
Posted Dec 13, 2023 - 16:41 NZDT
This scheduled maintenance affected: Submit new HPC Jobs and NeSI HPC Compute Infrastructure (HPC Compute nodes - Māui).