Identified - We have a data imbalance issue within the distributed storage that underpins the FlexiHPC object storage service and "ceph-hdd" type volumes. This has caused the object store to report intermittent errors to clients over the weekend when attempting to write new data. Cloud instances using the ceph-hdd volume type may have experienced intermittent IO errors and/or high latency. We expect most FlexiHPC users to be unaffected as ceph-ssd is the default and is not impacted.

Because of the amount of data involved, resolving this issue and rebalancing the cluster may take some time. Please stay tuned for updates. If your deployment has been impacted by these issues and you need assistance recovering, please reach out to support.

Feb 17, 2025 - 10:20 NZDT
Monitoring - Good news, everyone! The Slurm controller services are back online as of yesterday, and we're keeping a close eye on things. Unfortunately, a few hundred jobs decided to take an unexpected holiday break due to the outage. 😅

In a classic case of split-brain, the controllers couldn't agree on what was happening, so the status of some failed jobs might be a bit... unreliable. We recommend you double-check those failed jobs to see if they need restarting, as some of them may have completed. Apologies for the hiccup, and may your holidays be as glitch-free as possible! 🎄✨

Dec 24, 2024 - 12:57 NZDT
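One way to carry out the double-check recommended in the update above is to pull your job records for the outage window from Slurm accounting and list the ones that did not end normally. The sketch below is illustrative only and is not an official NeSI tool. It assumes `sacct` is available on the login node. The date window (2024-12-22 to 2024-12-24) is a guess based on the timestamps in this incident, and the helper names `parse_sacct`, `jobs_to_review` and `fetch_sacct` are hypothetical. Because the controllers were in a split-brain state, the recorded state may itself be wrong, so treat the result as a list of jobs to inspect (for example by checking their output files), not as a verdict.

```python
import subprocess

# Standard Slurm job states that suggest a job did not finish normally.
SUSPECT_STATES = {"FAILED", "NODE_FAIL", "CANCELLED", "TIMEOUT", "BOOT_FAIL"}


def parse_sacct(text):
    """Parse `sacct --parsable2 --noheader` output with the fields
    JobID|JobName|State|ExitCode into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.split("|")
        if len(parts) < 4:
            continue  # skip lines that do not have the expected fields
        # A job name may itself contain "|", so take the outer fields first.
        job_id, state, exit_code = parts[0], parts[-2], parts[-1]
        name = "|".join(parts[1:-2])
        # sacct can report e.g. "CANCELLED by 1234"; keep only the first word.
        words = state.split()
        jobs.append({
            "id": job_id,
            "name": name,
            "state": words[0] if words else "",
            "exit_code": exit_code,
        })
    return jobs


def jobs_to_review(jobs):
    """Return the jobs whose recorded state is in SUSPECT_STATES."""
    return [job for job in jobs if job["state"] in SUSPECT_STATES]


def fetch_sacct(start="2024-12-22", end="2024-12-24"):
    """Run sacct for the current user over an assumed outage window.
    The default dates are an assumption; adjust them to your own jobs."""
    cmd = [
        "sacct", "-X", "--parsable2", "--noheader",
        "--starttime", start, "--endtime", end,
        "--format", "JobID,JobName,State,ExitCode",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

On a login node, `jobs_to_review(parse_sacct(fetch_sacct()))` would give the list of jobs to inspect. A job reported as failed may still have produced complete output, so compare against the job's own log and result files before resubmitting.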
Identified - We've identified an error with the Slurm controller that was preventing it from starting and are now working to get it back online. The root cause is so far unknown but may be down to a filesystem issue. It is likely that many jobs have been killed or finished in a failure state. We will provide an update on follow-up actions for users once the service is stable.
Dec 23, 2024 - 10:34 NZDT
Investigating - The Mahuika Slurm Controller is not functioning. This affects the ability to submit new jobs and to launch Jupyter sessions.
Our apologies: because staffing levels are reduced over the holiday period, we are unable to say when this will be fixed.

Dec 22, 2024 - 10:06 NZDT
Monitoring - A fix has been implemented and we are monitoring the results.
Dec 10, 2024 - 08:37 NZDT
Investigating - We are experiencing some issues with the backend storage supporting the Maui_ancil Slurm controllers. To fix this we will be shutting down the Maui_ancil controllers for a short period, immediately. New job submissions will not be possible whilst this occurs, but existing jobs should be unaffected.
Dec 09, 2024 - 16:09 NZDT

About This Site

New Zealand eScience Infrastructure High Performance Compute and Storage Service Status

Apply for Access Operational
Data Transfer Operational
Submit new HPC Jobs Operational
Jobs running on HPC Operational
Jupyter on NeSI (beta) Operational
NeSI OnDemand Operational
Uptime over the past 90 days: 99.61 %
HPC Storage Operational
Long-term Storage (Early Access) Operational
User Support System Operational
Flexible High Performance Cloud Degraded Performance
NeSI HPC Compute Infrastructure Operational
HPC Lander node Operational
HPC Login nodes - Māui Operational
HPC Login nodes - Mahuika Operational
HPC Compute nodes - Māui Operational
HPC Compute nodes - Mahuika Operational
Mahuika Extension nodes - Mahuika Operational
Māui Ancillary nodes Operational
Mahuika Ancillary nodes Operational
NeSI Storage Infrastructure Operational
HPC Shared Storage system Operational
Online storage Operational
Nearline storage Operational
Scratch storage Operational
NeSI Data Transfer Infrastructure Operational
NeSI HPC Facility (Greta Point, Wellington) DTN Operational
Flexible High Performance Cloud Services Degraded Performance
Uptime over the past 90 days: 99.98 %
Virtual Compute Service Operational
Bare Metal Compute Service Operational
FlexiHPC Dashboard (web interface) Operational
Uptime over the past 90 days: 99.98 %
FlexiHPC CLI interface Operational
Uptime over the past 90 days: 100.0 %
Public API of the FlexiHPC Service Degraded Performance
Uptime over the past 90 days: 99.98 %
Scheduled Maintenance
my.nesi.org.nz system update Feb 18, 2025 15:30-16:30 NZDT
We will be undergoing scheduled maintenance during this time to update the system.
Posted on Feb 14, 2025 - 15:05 NZDT
Past Incidents
Feb 18, 2025

No incidents reported today.

Feb 17, 2025

Unresolved incident: Object storage and ceph-hdd volume issue.

Feb 16, 2025

No incidents reported.

Feb 15, 2025

No incidents reported.

Feb 14, 2025

No incidents reported.

Feb 13, 2025

No incidents reported.

Feb 12, 2025

No incidents reported.

Feb 11, 2025

No incidents reported.

Feb 10, 2025

No incidents reported.

Feb 9, 2025

No incidents reported.

Feb 8, 2025

No incidents reported.

Feb 7, 2025

No incidents reported.

Feb 6, 2025

No incidents reported.

Feb 5, 2025

No incidents reported.

Feb 4, 2025
Completed - The scheduled maintenance has been completed.
Feb 4, 16:32 NZDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Feb 4, 16:00 NZDT
Update - We will be undergoing scheduled maintenance during this time.
Jan 30, 12:52 NZDT
Update - We will be undergoing scheduled maintenance during this time.
Feb 4, 16:00 NZDT
Scheduled - We will be undergoing scheduled maintenance during this time.
Feb 4, 16:00 NZDT