Working with the Platforms team, our support vendor has analysed last week's storage disruption. While we don't have a definitive root cause, we now have a very good understanding of what happened and of the circumstances that led to instability in filesystem communications, which in turn caused a portion of the storage to become unavailable. The affected filesystem houses user home directories, NeSI projects, and various shared data for the compute environment, such as the NeSI software stack. As a result, the event caused a major disruption to platform access and to running jobs/workloads.
Posted Nov 04, 2021 - 17:21 NZDT
Identified
HPC Storage services have been restored for the moment and login nodes are active.
Our support vendor (IBM) has assisted with restoring service and is reviewing diagnostics. At this stage the root cause is not known, but we have evidence that an intermittent network issue may be to blame. We have taken the opportunity of the disruption to upgrade GPFS client code on many nodes to assist troubleshooting and remove known bugs as a variable. The issue may recur and could again disrupt active jobs, so users are encouraged to implement checkpointing before starting any long-running jobs while this incident remains active (please contact NeSI Support if you need assistance with this).
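For users unsure what checkpointing looks like in practice, the following is a minimal sketch in Python, not a NeSI-provided tool. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON state layout, and the placeholder "work" are all assumptions made for illustration. The idea is to save progress periodically and to write each checkpoint to a temporary file before renaming it into place, so that an interruption mid-write is unlikely to leave a corrupt checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file in the same directory, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())  # ask the OS to push the data to storage
        os.replace(tmp_path, path)  # replaces the old checkpoint in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as fh:
        return json.load(fh)


def run(path, n_steps, checkpoint_every=100):
    """Hypothetical long-running loop that resumes from the last checkpoint."""
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        state["total"] += step  # placeholder for the real computation
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # always save the final state
    return state
```

If a job using this pattern is killed and resubmitted with the same checkpoint path, it continues from the last saved step rather than from the beginning. How often to checkpoint is a trade-off between I/O overhead and the amount of work that would be lost.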
This afternoon's storage hiccup disrupted jobs running on a few Mahuika compute nodes (wbn017, wbn077, wbn060). These jobs have likely aborted due to the I/O issues, and NeSI Support will be in touch with impacted users directly. Other running jobs appear to be continuing unaffected.
We apologise for this frustrating and ongoing period of service disruption.
Posted Oct 28, 2021 - 22:10 NZDT
Investigating
We are again experiencing filesystem issues affecting the Māui and Mahuika clusters. We are working urgently with IBM to resolve the issue.
Posted Oct 28, 2021 - 16:42 NZDT
Update
We are continuing to monitor for any further issues.
Posted Oct 26, 2021 - 17:18 NZDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 26, 2021 - 17:18 NZDT
Update
We are continuing to investigate this issue.
Posted Oct 26, 2021 - 15:58 NZDT
Update
We are continuing to investigate this issue.
Posted Oct 26, 2021 - 15:57 NZDT
Investigating
Due to a filesystem problem, some users may experience problems logging in, and some jobs running on the cluster may also be affected. We are investigating the issue.
Posted Oct 26, 2021 - 14:13 NZDT
This incident affected: NeSI Storage Infrastructure (HPC Shared Storage system), NeSI HPC Compute Infrastructure (HPC Login nodes - Māui, HPC Login nodes - Mahuika), and Submit new HPC Jobs, Jobs running on HPC, HPC Storage.