Today we experienced some turbulence on the platform, and as a result we were down on two separate occasions for about 20 minutes each time. We have had similar issues twice before in the last two weeks, for a total of about 60 minutes of unplanned downtime, including today’s 40 minutes. Today’s outages took a little longer to recover from because we spent extra time gathering information to debug the problem.
In complex software systems, some issues arise only from a combination of factors: load, which servers are involved in a given operation, and real-time traffic patterns. This makes such problems, unfortunately, very hard to replicate and debug, which is why it’s taking us longer than usual to identify and fix the problem.
What we do know is that these incidents are all related to the same distributed caching system (although it causes issues in different configurations), and for the last two weeks we’ve been hard at work tracking down exactly what causes the problem and how to fix it. We have also put measures in place to improve the situation and reduce the likelihood of future downtime if this specific issue happens again.
What to expect from here
We take this problem extremely seriously. We have people dedicated to addressing it, and we believe we’re making progress. The fact that we’re still working on it doesn’t mean we will have to take downtime to solve it; whenever possible, we deploy fixes to components live. That said, if there is another outage due to this problem, we have a set of actions planned that should let us recover fairly quickly.
As before, this problem affects the runtime, not storage, so the data on your Ning Network remains safe.
We will post an update here on the Ning Blog on this topic tomorrow night or as soon as we have more information. In the meantime, you can get in touch with us via the Help Center or watch for more real-time updates on Network Creators and the Ning Status Blog.
Thanks for your patience as we hunt down and resolve this issue!