
Monday 1 Aug 2011 – SAN Hardware Component Failure

Monday 1 August 2011

One of Mach’s Data Centre sites (Cooroy) suffered an unplanned outage of just over one hour affecting one SAN hardware subsystem server, as a result of a hardware component failure. The outage was isolated and affected only a small number of servers and applications.

Background

This afternoon, one core storage subsystem server went offline unexpectedly, and “Priority 1” SLA restoration efforts were enacted immediately. Services were restored well within SLA timeframes, and a SAN hardware component failure was identified as the root cause. No planned maintenance activities were under way at the time.

Issue

At 12.02pm, our automated monitoring systems detected that one high-grade SAN (Storage Area Network) subsystem server in our storage farm at the Cooroy DC was offline. This was an isolated issue that affected a small number of our hosted services, including some Virtual Machine and Online Storage Vault subscriptions that had this unit as their Active node. Customers who have procured highly available redundant services from Mach were not affected by this issue.

Resolution Plan

Mach engineers were automatically notified by our 24/7 monitoring system and worked rapidly to identify the root cause and to determine that normal services could be restored (rather than failing over to alternative infrastructure services).

Status Updates

Mach Technology apologises for the inconvenience caused and will ensure the restoration of services as soon as possible. We will publish further updates below as new information comes to hand.

  • Update 12.04pm: network operations centre engineers have identified the offline subsystem and are working on the suspected root cause
  • Update 12.20pm: “Priority 1” incident process underway and initial notice published to website
  • Update 12.57pm: core subsystem operation has been restored, and Mach staff have determined that service restoration should be possible without recourse to alternative infrastructure services
  • Update 1.02pm: the SAN systems are operating again and all linked services are already restored or in the process of being restored; root cause identified as a failed 3ware RAID hardware component within the SAN unit (which will be replaced in due course during planned maintenance)
  • Update 1.15pm: services are restored and all automated monitoring system tests are passing; further user testing of key services will proceed as an added quality check
  • Update 1.26pm: all services online per normal operations
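
For readers who want to check the elapsed times, the sketch below works out how many minutes after the 12.02pm detection each update above was posted. It is only an illustration: the function name, the 24-hour time strings and the shortened milestone labels are assumptions made for this example, and the notice does not say which milestone the "just over 1hr" figure is measured to.

```python
from datetime import datetime

# Detection time and update times taken from the notice, written in 24-hour form.
# The labels are shortened paraphrases of the updates (an assumption for readability).
DETECTED = "12:02"
MILESTONES = {
    "12:04": "offline subsystem identified",
    "12:20": "Priority 1 process underway, notice published",
    "12:57": "core subsystem operation restored",
    "13:02": "SAN systems operating again",
    "13:15": "services restored, monitoring tests passing",
    "13:26": "all services online",
}


def minutes_since_detection(hhmm: str) -> int:
    """Return whole minutes between the detection time and the given HH:MM time."""
    fmt = "%H:%M"
    delta = datetime.strptime(hhmm, fmt) - datetime.strptime(DETECTED, fmt)
    return int(delta.total_seconds() // 60)


for time_of_day, label in MILESTONES.items():
    print(f"{time_of_day}  +{minutes_since_detection(time_of_day):>3} min  {label}")
```

On these figures, the SAN systems were operating again 60 minutes after detection, and all services were confirmed online 84 minutes after detection.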