Resolved -
We have made various changes to improve the performance of the WEKA filesystem, so we consider this incident resolved. However, should you experience any I/O slowdowns, please let us know via support@nesi.org.nz.
Oct 8, 16:27 NZDT
Update -
We are continuing to tune the filesystem and monitor performance. We have had occasional reports of slow interactive metadata performance (e.g. when extracting many files from a bundle/archive or pulling code from a remote git repository). These issues appear to be limited to specific nodes/clients and we have recently made changes on login03 which have improved performance on that primary login node. However, if you notice anything out of the ordinary please report it to Support.
Oct 1, 17:36 NZDT
Update -
The filesystem has been stable today; however, several users have reported a degraded interactive I/O experience. We expect this is caused by ongoing heavy metadata load from the continuing background integrity check. Based on current progress, we unfortunately expect this to continue into next week.
There has been no major impact on jobs, though some workloads paused when trying to write to the filesystem during the incident, and as a result a few jobs have timed out. If you see this and need help resolving it, please contact Support.
Aug 28, 16:08 NZST
Update -
We are continuing to monitor for any further issues.
Aug 28, 00:16 NZST
Monitoring -
Full filesystem functionality was restored at approximately 11pm NZST. The issue appears to have been triggered by a brief backend network disruption. WEKA support are investigating why the filesystem didn't recover automatically. Ongoing data integrity checks may impact I/O performance for a while longer.
Thankfully, there seem to be no widespread job impacts; however, we will check this more thoroughly in the morning and contact any users whose work may have been affected. Apologies again for the disruption (and goodnight)!
Aug 28, 00:16 NZST
Investigating -
We have identified an ongoing issue with our high performance filesystems. This is impacting scratch/nobackup, project and home, and is likely impacting any new logins to the HPC and OnDemand services. At present, existing jobs are continuing to run and complete; however, we anticipate there may be job failures as a result of this problem. We are currently awaiting urgent vendor support. Apologies for the inconvenience and disruption. We'll update as soon as we know more.
Aug 27, 21:19 NZST