Details on Last Week’s Ning Platform Instability
We have always prided ourselves on our uptime record. In March, we had 99.999% uptime. However, Ning Networks experienced slowness last week and brief downtime on Thursday evening. I wanted to take a few moments to walk you through what happened and what we’re working on to handle situations like this better going forward.
Starting Tuesday morning, we saw sporadic instability in our servers, resulting in slowness that many of you experienced. Our operations and engineering teams investigated a variety of potential causes, including a back-end release that was pushed out on Monday. We rolled the release back as part of our investigation and continued to look for the root cause.
On Thursday morning, we identified and addressed what we believe was the primary cause of the instability: on Monday, we had started testing a prototype to deliver real-time information to Ning Networks. We stopped the test and continued to monitor the platform. At around 8pm that day, a cluster of cache servers that had become slightly unstable during the investigation started failing. At that point, we began unplanned maintenance to restart them. It took about 70 minutes to complete the restarts and bring the platform back online.
Whenever there’s an issue on the Ning Platform, we follow this protocol: First, our on-call engineers begin investigating what might be causing the issue. Then, for larger issues like last week’s, we escalate to all hands on deck. We also strive to get information out to you as fast as possible through several channels, including the Ning Status Blog, the Creators Ning Network, and the Ning Status Twitter account. Longer-term, we are working on a landing page for Network Creators (NCs) with an announcement bar for important messages and a separate, lightweight “Report an Issue” link, which will allow NCs to quickly and simply report an issue they may be experiencing.
Following this incident, I wanted to highlight two specific areas we have identified for improvement:
- It took longer than we’d like to identify and fix the problem: To address this, we are continuing our work on a series of projects that will simplify the platform and improve our ability to identify problems quickly when they happen. I will share more details with you early next week.
- We were not as effective as we could have been in communicating both the state of the platform and what we were working on: To address this, we are reviewing our internal and external communication processes to ensure that we can give you the most accurate and timely information on what is happening.
Keeping your Ning Network online and speedy is our top priority. Thanks again for all of your patience. As always, we appreciate your feedback, so please let us know if you experience any problems on your Ning Network or have ideas for how we can improve the service.