Web Servers Unavailable.
Postmortem
Introduction

Cloud web servers (application servers) sit behind a load balancer to handle variations in traffic. At 2017-05-24 03:48:22 GMT, two application servers lost their connection to the data network and were then unable to reconnect. This restricted the available resources to the static application server, causing it to disconnect automatically from the load balancer. Additional application servers were spun up automatically to handle the extra load; however, they were unable to connect to the data servers, so they failed to build successfully.

The Issue

The data servers had scheduled maintenance carried out by the data center, which included updating services that restart the file distribution element of the servers. Although the data servers successfully reconnected to other servers, two application servers failed to reconnect. An automatic safety step then prevented those servers from accessing data in an unstable manner, in order to protect the data.
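As an illustration only, the kind of safety step described above can be pictured as a health check that reports a server as unavailable whenever it cannot reliably read and write its data store. The sketch below is not our production check; the function names, the probe file name and the use of HTTP-style status codes are all assumptions made for the example.

```python
import os
import uuid
from typing import Tuple


def data_store_healthy(data_dir: str) -> bool:
    """Return True only if a probe file can be written to data_dir and read back.

    data_dir is a hypothetical mount point for the shared data store.
    """
    if not os.path.isdir(data_dir):
        return False
    # Unique probe name so concurrent checks do not collide (illustrative naming).
    probe = os.path.join(data_dir, ".healthcheck-" + uuid.uuid4().hex)
    token = uuid.uuid4().hex
    try:
        with open(probe, "w", encoding="utf-8") as fh:
            fh.write(token)
        with open(probe, "r", encoding="utf-8") as fh:
            return fh.read() == token
    except OSError:
        # Any I/O error is treated as "data store not safely usable".
        return False
    finally:
        # Best-effort cleanup; the probe may not exist if the write failed.
        try:
            os.remove(probe)
        except OSError:
            pass


def health_response(data_dir: str) -> Tuple[int, str]:
    """Map the data store check to a status a load balancer could poll (assumed convention)."""
    if data_store_healthy(data_dir):
        return 200, "OK"
    return 503, "data store unavailable"
```

A load balancer polling such a check would stop routing traffic to any server that returns 503. A real check would also need a timeout, since a stalled network mount can block rather than fail.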

Moving Forward

In discussion with our provider, we have moved scheduled updates to occur at a more appropriate time and have additionally created a failsafe process to ensure data is served. At the time of writing, we have removed all single points of failure (SPOF) other than the data network link. We have been advised to use cloud storage for handling the data; unfortunately, this is not something Moodle is able to utilise at this time. However, additional avenues are being explored as a way to prevent this from happening again.

Summary

Annual scheduled maintenance carried out by the data center caused a failure to access data services. To protect the data's integrity, we run health checks against these connections. We have finished our investigations and will look into additional prevention methods to ensure this does not happen again.

Posted May 26, 2017 - 09:27 BST

Resolved
Full service has resumed. At 03:50:01 there was an incident which caused part of the data network to stop responding and the web servers to report as unavailable. We will continue to monitor the network closely over the next few hours and have begun investigating the root cause of the issue to prevent it from recurring.
Posted May 24, 2017 - 09:48 BST
Monitoring
At 03:50:01 there was an incident which caused part of the data network to stop responding and the web servers to report as unavailable. We have restored service and are now investigating what caused this issue.
Posted May 24, 2017 - 09:26 BST
Investigating
We are currently investigating this issue.
Posted May 24, 2017 - 09:07 BST