Maui interruption
Incident Report for NeSI Status
Postmortem

From Tony Racho:

The crash was caused by an issue with the blade where the controller node resides. The controller node shares a blade with 3 dvs nodes. While one of those dvs nodes (dvs10) was being rebooted, it affected the controller node and brought it down with it, so the blade needed to be reset. Warmswapping the blade out and xtbouncing it usually resets the blade and resolves the issue.

The whole blade was therefore bounced and warmswapped in, and after that everything went fine. All 3 dvs nodes and the controller node are back online and operational.

Posted Jun 30, 2022 - 15:46 NZST

Resolved
This incident has been resolved.
Posted Jun 30, 2022 - 15:38 NZST
Update
We are continuing to monitor for any further issues.
Posted Jun 30, 2022 - 15:16 NZST
Update
We are continuing to monitor for any further issues.
Posted Jun 30, 2022 - 14:25 NZST
Update
We are continuing to monitor for any further issues.
Posted Jun 30, 2022 - 14:12 NZST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jun 30, 2022 - 14:09 NZST
Identified
The issue has been identified and a fix is being implemented.
Posted Jun 30, 2022 - 14:08 NZST
Investigating
The primary Slurm controller on Maui has crashed. The majority of running jobs are affected, and the team is now investigating the cause and the extent of the damage. We will notify users as soon as Maui is back online. We apologise for the inconvenience.
Posted Jun 30, 2022 - 13:55 NZST
This incident affected: NeSI HPC Compute Infrastructure (HPC Compute nodes - Māui) and Jobs running on HPC.