Storage performance issue impacting virtual instances
Incident Report for NeSI Status
Resolved
There have been no further issues observed since our corrective actions a week ago.
Posted Apr 22, 2024 - 11:09 NZST
Update
We unfortunately encountered an additional bug yesterday that threatened the stability of the underlying volume storage cluster (Ceph). As a result there was a short period of additional storage disruption overnight (at about 21:45 NZ time) as we worked with support partners to stabilise the underlying Ceph cluster. This may have resulted in storage errors or hung task warnings within VMs at that time. We're now monitoring. Please let us know if you need support with any related issues.
Posted Apr 16, 2024 - 10:01 NZST
Monitoring
We've identified a specific node that has been dragging storage performance down and have now taken it out of the Ceph cluster. IO performance has recovered considerably since this change. We're continuing to monitor.
Posted Apr 15, 2024 - 17:02 NZST
Investigating
Unfortunately, we've seen further indications of storage performance issues over the weekend (affecting Ceph volumes accessed through the Cinder block storage service). The impact seems to have been less pronounced since Friday, though we expect VMs/users to be experiencing intermittent latency spikes and lower average storage performance. We're continuing to investigate and will be escalating with our support partners today.

Please let us know if you experience any problems with your VMs.
Posted Apr 15, 2024 - 09:28 NZST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 12, 2024 - 17:06 NZST
Update
We are continuing to work on this and are planning a switch restart.
Posted Apr 12, 2024 - 15:10 NZST
Identified
The underlying network issue has recurred. Users may still experience some performance degradation, especially with storage, or other intermittent errors. We're continuing to investigate.
Posted Apr 12, 2024 - 13:14 NZST
Monitoring
We've mitigated the underlying storage problem, which appears to be network-related but limited to specific hosts. Investigation of the network issue is ongoing.
Posted Apr 12, 2024 - 12:29 NZST
Identified
We're working to resolve a storage performance issue in the underlying Ceph storage cluster that provides all block storage for Flexible HPC Cloud VM instances. This may be noticeably impacting IO performance from within cloud instances, which could have intermittent knock-on impacts on services within tenant environments. Examples include slow or incomplete login interactions and database or web timeouts.
Posted Apr 12, 2024 - 11:30 NZST
This incident affected: Flexible High Performance Cloud Services (Virtual Compute Service) and Flexible High Performance Cloud.