Mahuika Slurm Controller - not functioning

Incident Report for NeSI Status

Monitoring

Good news, everyone! The Slurm controller services are back online as of yesterday, and we're keeping a close eye on things. Unfortunately, a few hundred jobs decided to take an unexpected holiday break due to the outage. 😅

In a classic case of split-brain, the controllers couldn't agree on what was happening, so the recorded status of some failed jobs might be a bit... unreliable. We recommend you double-check any job marked as failed before restarting it, because some of them may actually have completed. Apologies for the hiccup, and may your holidays be as glitch-free as possible! 🎄✨
Posted Dec 24, 2024 - 12:57 NZDT
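
For anyone who wants to script the double-check suggested above, here is a minimal sketch in Python. It assumes the standard Slurm `sacct` command is available on the login node, and that a window of 21 to 24 December covers the outage; both the date window and the list of "suspect" states are illustrative assumptions, not values confirmed by NeSI. The helper names (`parse_sacct`, `jobs_to_review`, `query_sacct`) are hypothetical.

```python
import subprocess

# Assumption: these are the states worth a second look after the outage.
SUSPECT_STATES = {"FAILED", "NODE_FAIL", "CANCELLED", "TIMEOUT", "BOOT_FAIL"}

# Accounting fields requested from sacct, in output order.
SACCT_FIELDS = ["JobID", "JobName", "State", "ExitCode", "End"]


def parse_sacct(text):
    """Turn pipe-delimited `sacct --parsable2 --noheader` output into dicts."""
    rows = []
    for line in text.splitlines():
        values = line.split("|")
        if len(values) != len(SACCT_FIELDS):
            continue  # skip blank or unexpected lines
        rows.append(dict(zip(SACCT_FIELDS, values)))
    return rows


def jobs_to_review(rows):
    """Return the jobs whose recorded state suggests they were cut short."""
    flagged = []
    for row in rows:
        words = row["State"].split()
        # sacct can report e.g. "CANCELLED by 1234"; compare the first word only.
        if words and words[0] in SUSPECT_STATES:
            flagged.append(row)
    return flagged


def query_sacct(start, end):
    """Ask sacct for the current user's job allocations in the given window."""
    cmd = [
        "sacct", "--allocations", "--noheader", "--parsable2",
        "--starttime", start, "--endtime", end,
        "--format", ",".join(SACCT_FIELDS),
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    # Assumed window around the outage; adjust to suit your own jobs.
    output = query_sacct("2024-12-21", "2024-12-24")
    for job in jobs_to_review(parse_sacct(output)):
        print(job["JobID"], job["JobName"], job["State"], job["ExitCode"], job["End"])
```

Because the recorded state may itself be unreliable, treat the list as a starting point only: check each flagged job's output files before deciding whether it needs to be resubmitted.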

Identified

We've identified an error with the Slurm controller that was preventing it from starting, and we are now working to get it back online. The root cause is not yet known but may be a filesystem issue. It is likely that many jobs have been killed or have finished in a failure state. We will provide an update on follow-up actions for users once the service is stable.
Posted Dec 23, 2024 - 10:34 NZDT

Investigating

The Mahuika Slurm Controller is not functioning. This affects the ability to submit new jobs and to launch Jupyter sessions.
Our apologies: because of reduced staffing over the holiday period, we are unable to give an estimate of when this will be fixed.
Posted Dec 22, 2024 - 10:06 NZDT
This incident affects: Submit new HPC Jobs, Jupyter on NeSI (beta) and NeSI HPC Compute Infrastructure (HPC Compute nodes - Mahuika, Mahuika Extension nodes - Mahuika).