On Tuesday, 17 May 2016, we suffered a partial outage that left some users unable to access PPO or experiencing severely degraded response times. The outage started at 12:04 PM (UTC+0) and was fully resolved at 12:18 PM (UTC+0). Below is a detailed explanation of the events surrounding the incident, our investigation into its root cause, and the actions we plan to take to prevent a similar occurrence in future.
At 14:07 (times in this paragraph are local time, UTC+2) our monitoring system alerted us that one of our application servers was experiencing sustained high CPU utilisation. Other application servers were unaffected. Since we were unable to log into the server to troubleshoot, we initiated a restart of the server at 14:14. The application server restarted successfully and resumed serving requests normally at 14:18.
Subsequent investigation showed that the incident was triggered by a routine automatic application pool recycle, which occurs every 29 hours. It appears that a race condition during the normal overlapped recycling process drove the server to 100% CPU utilisation. Further research indicates that this is a known but extremely rare failure mode. This is consistent with our own experience: we had never encountered it before, despite thousands of application pool recycles.
Based on our investigation, it appears that there is very little we could have done to prevent this incident and that it is unlikely to recur. We are nevertheless investigating options to automatically terminate an application pool that is consuming high CPU, although it is not clear whether this would have resolved the problem in this particular case. Longer term, we plan to make architectural changes that will allow us to dynamically move traffic from an unhealthy application server to a healthy one. We also identified room for improvement in our incident response and communication process and will be making improvements in this regard.
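To illustrate the idea of automatically terminating an application pool that is consuming high CPU, here is a minimal sketch of the decision logic only. It is not our implementation: the class name `CpuWatchdog`, the helper `first_trigger_index`, the 95% threshold and the six-sample window are all hypothetical values chosen for illustration, and the code does not call any web server or operating system API. It simply shows how a watchdog could wait for CPU to stay high across several consecutive samples before acting, so that a short spike during a normal recycle does not trigger a termination.

```python
from collections import deque
from typing import Deque, Iterable


class CpuWatchdog:
    """Hypothetical watchdog: triggers only on sustained high CPU."""

    def __init__(self, threshold_pct: float = 95.0, required_samples: int = 6) -> None:
        # Both defaults are illustrative assumptions, not tuned values.
        if required_samples < 1:
            raise ValueError("required_samples must be at least 1")
        self.threshold_pct = threshold_pct
        self.required_samples = required_samples
        # Sliding window holding only the most recent samples.
        self._recent: Deque[float] = deque(maxlen=required_samples)

    def observe(self, cpu_pct: float) -> bool:
        """Record one CPU sample (0-100).

        Returns True once the window is full and every sample in it
        is at or above the threshold.
        """
        self._recent.append(cpu_pct)
        return (
            len(self._recent) == self.required_samples
            and all(sample >= self.threshold_pct for sample in self._recent)
        )


def first_trigger_index(samples: Iterable[float], watchdog: CpuWatchdog) -> int:
    """Return the index of the first sample at which the watchdog triggers, or -1."""
    for index, sample in enumerate(samples):
        if watchdog.observe(sample):
            return index
    return -1


if __name__ == "__main__":
    # A brief dip resets the run, so the trigger comes only after
    # three consecutive high samples.
    samples = [100.0, 100.0, 20.0, 100.0, 100.0, 100.0]
    print(first_trigger_index(samples, CpuWatchdog(required_samples=3)))
```

In a real deployment the samples would come from the monitoring system and the trigger would start whatever termination step is chosen. As noted above, it is not clear that such a mechanism would have helped in this incident, since the server was too loaded to accept a login.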
We would like to apologise for any inconvenience that resulted from this incident.