Mahuika compute nodes draining or rebooting
Incident Report for NeSI Status
Resolved
The underlying filesystem issue has been resolved and the Mahuika cluster is once again stable.
Posted Jan 21, 2021 - 09:43 NZDT
Monitoring
The underlying filesystem issue has been resolved. All compute nodes are now available. We will continue monitoring the cluster for stability.
Posted Jan 20, 2021 - 14:06 NZDT
Investigating
Yesterday afternoon, numerous compute nodes on the Mahuika cluster failed with job-related errors. Some of these nodes had to be rebooted while user jobs were still running. We apologise for the inconvenience, but we had no alternative.
These symptoms recurred overnight, and we will be taking similar action to recover the compute nodes. Meanwhile, we are working to identify the underlying issue.
Posted Jan 20, 2021 - 09:06 NZDT
This incident affected: NeSI HPC Compute Infrastructure (HPC Compute nodes - Mahuika).