Today we had a database bug that we introduced last night while finalizing the release of Ning Version 2. We’ve seen it twice today, and both times it slowed the networks on Ning down to a crawl. On Ning, it shows up as “hanging” pages.
The short-term fix required us to take Ning offline for an hour each time, once this morning and once tonight.
The irony of this bug is that it has absolutely nothing to do with scalability or the load we’ve seen today.
Yes, as you can imagine, it made us want to laugh and cry too 🙂
For the technical or curious among you, here’s what’s happening: we’re seeing a fairly simple deadlock that is the result of our interconnected databases and storage systems.
This particular deadlock jams up an important component in Ning, and it then propagates through the system as people intermix operations that depend on the lock with others that don’t, creating further locks.
As anyone who’s done multithreading knows, deadlocks can be hard to reproduce, which is why we missed the window to fix it before it had an impact on you.
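For the curious, the general shape of a deadlock like this can be sketched in a few lines of Python. This is a toy illustration, not Ning's actual code: the lock names, worker names, and timeout are all made up for the example. Two workers each grab one lock and then ask for the other's; a timeout is used so the demo reports the stall instead of hanging forever. The second function shows one common remedy, which is to have everyone acquire the locks in the same order.

```python
import threading


def opposite_order_demo(timeout=0.2):
    """Two workers take the same two locks in opposite order and stall."""
    lock_a, lock_b = threading.Lock(), threading.Lock()
    both_hold_first = threading.Barrier(2)    # both workers now hold one lock
    both_tried_second = threading.Barrier(2)  # both attempts have finished
    results = {}

    def worker(name, first, second):
        with first:
            both_hold_first.wait()
            # A plain acquire() here would block forever; the timeout lets
            # the demo observe the stall and move on.
            got = second.acquire(timeout=timeout)
            results[name] = got
            both_tried_second.wait()
            if got:
                second.release()

    # Hypothetical worker names, chosen only for illustration.
    threads = [
        threading.Thread(target=worker, args=("db_writer", lock_a, lock_b)),
        threading.Thread(target=worker, args=("storage_writer", lock_b, lock_a)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def consistent_order_demo():
    """Both workers take the locks in the same order, so neither stalls."""
    lock_a, lock_b = threading.Lock(), threading.Lock()
    completed = []

    def worker(name):
        with lock_a:      # always lock_a first...
            with lock_b:  # ...then lock_b
                completed.append(name)

    threads = [
        threading.Thread(target=worker, args=(name,))
        for name in ("db_writer", "storage_writer")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return completed


if __name__ == "__main__":
    print(opposite_order_demo())
    print(sorted(consistent_order_demo()))
```

In the first function neither worker ever gets its second lock, which is the stall; in the second, both finish. In a real system the two acquisitions are rarely this neatly timed, which is part of why such bugs are hard to reproduce.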
The good news is that we now know what triggers this issue, so we think we can catch it before it affects you again. We have also just rolled out a fix, which we’ll monitor throughout the night.
We’d like to make it up to anyone who couldn’t give Ning a try today. Drop me a note at ceo(at)company(dot)com and we’ll give you a free month of running your own ads on your social network.