Mahuika cluster jobs fail to complete cleanly

Incident Report for NeSI Status

Postmortem

We switched the BCM head node master from 01 to 02, and this is clearing all of our alerts.

Posted Sep 14, 2021 - 15:54 NZST

Resolved

This incident has been resolved.
Posted Sep 14, 2021 - 15:53 NZST

Monitoring

The service has now been restored and jobs are running in an orderly fashion. We will continue monitoring the situation.
Posted Sep 14, 2021 - 13:48 NZST

Investigating

Due to a network issue, jobs on Mahuika compute nodes may fail to leave the completing state and remain stuck in the queue with state CG. We are investigating the issue now.
Posted Sep 14, 2021 - 12:53 NZST
This incident affected: NeSI HPC Compute Infrastructure - Legacy (HPC Compute nodes - Mahuika).
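
To check whether your own jobs show the CG state described in the Investigating update, something along the following lines may help. It is only a minimal sketch, assuming the standard Slurm `squeue` command is on your path. The function names `completing_jobs` and `query_squeue` are illustrative and not part of any NeSI tooling.

```python
import subprocess
from typing import List


def completing_jobs(squeue_output: str) -> List[str]:
    """Return the job IDs whose compact state column is CG (completing)."""
    job_ids = []
    for line in squeue_output.splitlines():
        fields = line.split()
        # Each line is expected to hold two columns: job ID and compact state.
        if len(fields) == 2 and fields[1] == "CG":
            job_ids.append(fields[0])
    return job_ids


def query_squeue(user: str) -> str:
    """Run squeue for one user and return its raw output (hypothetical helper)."""
    # -h drops the header line; -o "%i %t" prints job ID and compact state.
    result = subprocess.run(
        ["squeue", "-h", "-u", user, "-o", "%i %t"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `completing_jobs(query_squeue("your_username"))` on a login node would list any of your jobs currently in CG. A job passing briefly through CG is normal; only jobs that stay in that state for a long time match the symptom in this report.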