Additional SSD storage was added to HPC3/Mahuika's WEKA filesystems earlier in February. This has almost completely eliminated capacity-related back-pressure events, which had been adversely affecting read and write performance. The additional SSD capacity also raises the peak performance of our filesystems, nearly doubling their total throughput to 376 GB/s for reads and 71 GB/s for writes, giving us plenty of headroom for more data-intensive workloads, particularly AI workloads.
We are still tuning performance for tiered reads from the underlying object storage, but the overall situation should already be much improved. If you continue to experience any I/O-related issues or performance inconsistencies, please reach out to support so we can investigate.
Posted Mar 04, 2026 - 23:10 NZDT
Monitoring
With the implementation of a new autocleaning process on the scratch/nobackup filesystem last month, we have now resolved the storage capacity issues related to this incident.
Unfortunately, some workloads may still experience reduced read performance depending on other activity on the system - if you notice highly variable or poor I/O performance, please report it to support, as we may be able to assist with workarounds.
Several pieces of work are in progress to improve I/O performance on the new platform, based on lessons from the first six months of operations, including an expansion of SSD/NVMe capacity. We will make an announcement soon with additional details, and will also update our documentation to reflect important considerations for users.
Posted Jan 15, 2026 - 11:57 NZDT
Update
There was a period of I/O stalls this morning while we dealt with some storage hardware failures. That issue is now resolved; however, space reclamation continues in the backend and is having a detrimental impact on read performance. We are working with WEKA support on mitigation options. Apologies for the performance impact - if your jobs are affected and need a runtime limit extension, please reach out to support.
Posted Nov 24, 2025 - 12:44 NZDT
Identified
Our storage system is currently very full, and this is forcing the backend object storage to undertake urgent administration in the form of defragmentation. The increased load is having a detrimental effect on I/O performance, especially read I/O, and this is likely to continue for some days. We are urgently looking at ways to mitigate this. In the short term, researchers can help by cleaning up any unwanted files and data as soon as possible.
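If you are unsure what to clean up, a quick sketch like the following can surface your largest and least recently accessed files. The target path is a placeholder assumption - point it at your own project directory on the affected filesystem:

```shell
#!/bin/sh
# Sketch: find candidates for cleanup under a directory you own.
# TARGET is a placeholder; set it to your project directory,
# e.g. TARGET=/path/to/your/nobackup/project ./cleanup-scan.sh
TARGET="${TARGET:-.}"

# Largest subdirectories first (GNU sort -h understands "K/M/G" suffixes)
du -sh "$TARGET"/* 2>/dev/null | sort -rh | head -20

# Files not accessed in the last 90 days, largest first (size in bytes)
find "$TARGET" -type f -atime +90 -printf '%s\t%p\n' 2>/dev/null | sort -rn | head -20
```

Please double-check anything the scan flags before deleting it, and remember that scratch/nobackup data is not backed up.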
Posted Nov 21, 2025 - 12:29 NZDT
This incident affected: Data Transfer and HPC Storage.