Cloud web servers (application servers) sit behind a load balancer to handle variation in traffic. At 2017-05-24 03:48:22 GMT, two application servers lost their connection to the data network and were unable to reconnect. This restricted the available resources to the static application server, which caused it to disconnect automatically from the load balancer. Additional application servers were spun up automatically to handle the extra load; however, they were unable to connect to the data servers, so they failed to build successfully.
The data servers had scheduled maintenance carried out by the data center, which included updating services; these updates restart the file distribution element of the servers. Although the data servers successfully reconnected to the other servers, two application servers failed to reconnect. This automatic safety step prevented the servers from accessing data in an unstable manner, in order to protect the data.
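As an illustration of the kind of safety step described above, the following is a minimal sketch in Python of a health check that reports a server as unhealthy when it cannot reliably read and write the shared data store. It is not the check running on our platform: the function names, the probe-file approach and the 200/503 status codes are assumptions made for the example. A load balancer polling such a check would stop sending traffic to a server that returns 503.

```python
import os
import uuid
from typing import Tuple


def data_store_is_healthy(data_dir: str) -> bool:
    """Return True only if a probe file can be written to data_dir and read back.

    data_dir is a hypothetical path to the shared data mount.
    """
    probe_path = os.path.join(data_dir, ".healthcheck-" + uuid.uuid4().hex)
    token = uuid.uuid4().hex
    try:
        # Write a unique token, then read it back to confirm the mount works.
        with open(probe_path, "w", encoding="utf-8") as fh:
            fh.write(token)
        with open(probe_path, "r", encoding="utf-8") as fh:
            return fh.read() == token
    except OSError:
        # Missing directory, stale mount, permission error and similar.
        return False
    finally:
        # Best-effort cleanup; the probe file may never have been created.
        try:
            os.remove(probe_path)
        except OSError:
            pass


def health_response(data_dir: str) -> Tuple[int, str]:
    """Map the data store check to an HTTP-style status for a load balancer."""
    if data_store_is_healthy(data_dir):
        return 200, "OK"
    return 503, "data store unavailable"
```

Note that on a network file system a stalled mount can block the file operations themselves, so a real check would also need a timeout around the probe; that is left out of this sketch.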
Whilst discussing this with our provider, we have moved scheduled updates to occur at a more appropriate time and have additionally created a failsafe process to ensure data is served. At the time of writing we have removed every single point of failure (SPOF) other than the data network link. We have been advised to use cloud storage for handling the data. Unfortunately, this is not something Moodle is able to utilise at this time; however, additional avenues are being explored to prevent this from happening again.
Annual scheduled maintenance carried out by the data center caused a failure to access the data services. To protect the data's integrity, we run health checks against these connections. We have finished our investigation and will look into additional prevention methods to ensure this does not happen again.