Network issue earlier this morning

At approximately 16:18 on 9th December 2009 we experienced a partial power failure in our private datacentre suite in Bluesquare II, Maidenhead. This was caused by a short in one of the datacentre PDUs. As we understand it, this was a problem with the original installation of the component and is unlikely to recur.

Bluesquare provide generator and UPS backup to keep services online in the event of a power failure. This is why your websites don’t go offline every time there is a power cut! However, the fault occurred ‘upstream’ of the UPS system and therefore the UPSs offered no protection.

Approximately 65 servers were rebooted, including a number of DirectAdmin servers and some clustered hosting servers. Whilst the rack containing our clustered storage and core networking equipment did lose power, we have an additional UPS in this rack which powers the secondary inputs to the NASs and approximately 50% of the networking equipment. Thanks to this, the network outage lasted only approximately 45 seconds whilst the routing tables re-established themselves, and there was no outage at all to any of the storage nodes, meaning no loss of data and no chance of corruption.

The majority of services (cPanel, dedicated, VPS, clustered e-mail, database, etc.), which are in other racks, remained online throughout. Almost all DirectAdmin, cPanel, managed dedicated and VPS services in the affected racks were restored within a few minutes; the exception was four servers, where downtime was around two hours (monthly statistics regarding the SLA will be available at the end of the month). However, many of the web servers for the clustered hosting platform were rebooted and/or lost their NFS mounts. This resulted in approximately 20 minutes of downtime and unacceptably poor performance whilst the cluster was re-established.

Some users with NAT devices on their home/office internet connection also experienced residual issues with the FTP service, which took some time to resolve.

All services are currently stable and we anticipate no further problems. We sincerely apologise for any inconvenience caused and we thank you for your patience and understanding during this time.

December 10, 2009 · admin · Comments Closed
Posted in: Network issues