
Thursday 20 Oct 2011 – Cluster Hardware Migration

Thursday 20 October 2011

A high-end Sun server in a hosting cluster suffered an isolated hardware-level fault affecting its storage subsystem and its access to network storage services. Because of the nature of the failure, services could not be live migrated or restored immediately. A P1 incident was declared, and Mach restored services on alternative cluster hardware through the morning, with the last restoration completed at 11:40am.

Background

Early this morning, one high-end Sun server in a hosting cluster suffered an isolated hardware-level fault affecting its storage subsystem and its access to network storage services. Because of the nature of the failure, services could not be live migrated or restored immediately. A routine planned software maintenance activity was under way at the time on a different physical server in this cluster, but the failure was caused by hardware and was unrelated to that work.

Issue

From 4.13am, one high-grade Sun server in a shared hosting cluster server farm at the Cooroy DC had a hardware fault that affected a small percentage of customers, specifically some Virtuozzo containers, shared website hosting and related subscriptions. It was an isolated issue that had no adverse effect on any other server or service. Customers who have procured highly available redundant services from Mach were not affected by this issue at all.

Resolution Plan

Mach engineers were automatically notified by our 24/7 monitoring system. Within minutes staff diagnosed the root cause and developed a number of options to restore services as fast as possible while minimising the need to restore from the last full backups. As it transpired, their “Plan A” worked, which significantly shortened the outage window and avoided the need to restore from a previous backup point: a fabulous result our staff and customers should be proud of.

Status Updates

Mach Technology apologises for the inconvenience caused and will ensure services are restored as soon as possible. We will publish further updates below as news comes to hand.

  • Update 4.20am: network operations centre engineers have identified and are resolving the root cause; staff were on the case immediately, as they were performing routine planned maintenance on a different server at the time
  • Update 5.28am: lead engineer performing low level storage checks and repair routines
  • Update 7.00am: Mach commencing outbound phone calls to advise affected customers proactively
  • Update 8.25am: all services on the Sun server are being migrated to new/different hardware; some services are “up” to reduce impact of outage, but in a read-only state whilst copy/migration is performed
  • Update 10.30am: the non-automated migration routine is working successfully and the great majority of services have already been restored within the cluster on new and different hardware
  • Resolved 11:40am: all services restored (within estimated completion window of 11.30am – 12.30pm)