Overnight problems

Incident Report for NeSI

Monitoring

Kia ora all, here is a belated update on this incident.

From approximately 22:00 on Thursday 4 September, tenant network and associated floating IP connectivity to all FlexiHPC VM instances started going offline. A subset of instances also went into the SHUTOFF state. This was the result of a config and automation regression in our OpenStack infrastructure. A config fix was rolled out at approximately 04:00 on Friday 5 September, which resolved the network connectivity issues for impacted instances.

However, a subset of the instances that were put into the SHUTOFF state by the original issue were additionally impacted by a serious corner case in the OpenStack deployment tooling that we use. This resulted in duplicate VM instances being launched. As a result, some instance root drives and attached volumes were inadvertently multi-attached, which can cause instance availability problems and potential data corruption. Our team have since worked tirelessly with impacted instance owners to address the follow-on issues and recover services. If you are still experiencing any issues, please contact support. We apologise for the disruption to service and are adjusting our processes to minimise the possibility of similar problems in future.
Posted Sep 11, 2025 - 22:27 NZST

Identified

All the impacted production services that we are currently aware of have been restored. On the order of 100 impacted instances have duplicate virtual machine processes, though many of these are dev/test machines. These require operator intervention to attempt restoration of service. If you have an active instance that you can no longer log in to, or an instance that is in the SHUTOFF state and will not start up, please open a support ticket.
Posted Sep 05, 2025 - 15:24 NZST

Investigating

We had some serious issues from about 10pm last night that interrupted the lander node, OnDemand, and numerous OpenStack instances. If you have VMs that were shut down, please try restarting them. If this is not successful, please log a support ticket.
We are currently assessing what services may still be down. Slurm is available and jobs are running. OnDemand is available.
Posted Sep 05, 2025 - 08:46 NZST
This incident affects: Flexible High Performance Cloud.