Wednesday 7 September 2011
A backup software issue extended the backup window and increased load on one Hosted Exchange node early this morning, degrading “sync” performance for some users for 1hr 49mins (P2).
Nature of Incident
This morning the daily backup routine for one of the nodes within the Hosted Exchange clustered platform took much longer than normal to complete (it generally completes late overnight, well before business hours commence the next day). The impact was felt during the window detailed above, when the combined load of the extended backup operation and normal morning user activity on the server resulted in markedly degraded server performance.
Some clients of the Hosted Exchange platform (approximately 20%) were affected during the above time window, experiencing slow or timed-out connections when trying to “sync” their mailboxes. Incoming mail was not affected and was available for sync to users’ email clients (Exchange, iPhone, etc.) once the server load reduced following completion of the backup process. Outgoing mail was similarly not affected, and any messages cached on a user’s device were sent immediately once sync completed.
Mach was made aware of this situation immediately via our 24/7 Monitoring Platform and promptly declared a P2 (major but not widespread) issue. Engineers swiftly confirmed the root cause (a vendor software issue causing the backup software to run in a degraded, slower mode) and determined that the fastest way to reduce server load was to allow the backup routine to complete. Customers should note that the response and recovery times bettered our highest-level P1 undertaking (respond within 15mins and resolve within 2hrs), even though this was a P2 incident. Moreover, Mach is proud that it was alerted to the issue via its 24/7 Monitoring Platform, responded immediately and restored services to normal very rapidly.
Root Cause Information & Permanent Correction Actions (PCA)
The root cause is a vendor software issue that caused the backup software to run in a degraded, slower mode. New technology has been procured and will be fast-tracked (rollout commencing tonight) onto the affected node in the Exchange Platform to prevent recurrence.
Mach apologises for the inconvenience caused to those customer users who were affected this morning. Whilst we never want technology to fail, on occasion it does, and our guarantee is to set the benchmark in how we detect, respond to and resolve such issues.