
Saturday 4 Sep 2010 – Storage Hardware

Saturday 4 September 2010

A hardware failure in a low-level storage system affected some isolated, non-HA services for the roughly 3.5 hours it took to install a replacement part. Over the following 24 hours, into Sunday, the team migrated services to alternative platforms/hardware as a precaution (individual non-HA services were intermittently offline during the process).

Background

This afternoon, one server suffered an isolated hardware-level fault affecting its storage subsystem and its access to network storage services. The fault required replacement of the defective hardware part and a reboot. No planned maintenance activities were under way at the time.

Issue

From 2.34pm, one high-grade Sun server in our server farm at the Cooroy DC had a hardware fault that affected shared website hosting and related subscriptions for a small percentage of customers. It was an isolated issue that had no adverse effect on any other server or service. Customers who have procured highly available, redundant services from Mach were not affected by this issue at all.

Resolution Plan

Mach engineers were automatically notified by our 24/7 monitoring system. They diagnosed the root cause and determined that replacing the hardware component was the fastest way to restore services on Saturday. During the night and into Sunday, the team then migrated services to alternative platforms/hardware as a precaution (individual non-HA services were intermittently offline during this process).

Status Updates

Mach Technology apologises for the inconvenience caused and will ensure that services are restored as soon as possible. We will publish further updates below as new information comes to hand.

  • Update 3.05pm: network operations centre engineers have identified the root cause and are resolving it
  • Update 3.30pm: services restoration attempted
  • Update 4.10pm: services restoration will require hardware component replacement onsite
  • Update 4.20pm: staff dispatched to data centre having obtained required hardware components
  • Update 5.10pm: replacement hardware installed onsite, commencing re-activation of associated services
  • Update 6.00pm: all services online per normal operations
  • Update, following 24 hours: all services migrated to alternative platforms/hardware (non-HA subscriptions experienced intermittent outages during the process)
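
The "3.5 hours" figure in the summary can be checked against the timestamps in this report. The sketch below is only an illustration: the event labels are our own shorthand, and it assumes each update time marks when the described step occurred, which the report does not state explicitly.

```python
from datetime import datetime, timedelta

INCIDENT_DATE = "2010-09-04"  # Saturday 4 September 2010

# Times copied from the Issue section and the status updates.
# The labels are hypothetical shorthand, not official event names.
TIMELINE = [
    ("fault begins", "14:34"),
    ("root cause identified", "15:05"),
    ("restoration attempted", "15:30"),
    ("hardware replacement required", "16:10"),
    ("staff dispatched", "16:20"),
    ("replacement installed", "17:10"),
    ("all services online", "18:00"),
]


def at(hhmm: str) -> datetime:
    """Turn an HH:MM string into a datetime on the incident date."""
    return datetime.strptime(f"{INCIDENT_DATE} {hhmm}", "%Y-%m-%d %H:%M")


def elapsed(start_label: str, end_label: str) -> timedelta:
    """Time between two labelled events in TIMELINE."""
    times = dict(TIMELINE)
    return at(times[end_label]) - at(times[start_label])


if __name__ == "__main__":
    outage = elapsed("fault begins", "all services online")
    print(outage)  # 3:26:00
```

From 2.34pm to 6.00pm is 3 hours 26 minutes, which is consistent with the approximate 3.5 hours quoted in the summary.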

 
