The Story of a Release
From where you’re sitting as a Network Creator on Ning, platform releases aren’t very transparent. The Ning Platform and your social networks go down for a couple of hours, then come back up, and not much has changed.
With nothing visible to you, the natural question is: why did we take everything down at all? And, on rare occasions like this past weekend, the release is followed by another release right after the first one, and you’re left wondering what is happening.
So, here’s a little on what went on behind the scenes of Saturday night’s release, and the update that followed on Sunday night…
We started with prep work an hour before the release, at 8 pm Saturday (other minor items get completed during the day). Prep work involves a number of things: priming the repositories that contain the code we will deploy to the servers, units we at Ning call “cores” (in all, over 20 gigabytes of binaries get propagated across the Ning Platform to update the cores to a new release); disabling some alerting systems that would otherwise go haywire during the maintenance; and setting up standby systems that would have to be switched on during the maintenance.
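To make the shape of that prep work concrete, here is a minimal sketch of a checklist runner for the steps described above. Every name in it (prime_core_repos, silence_alerts, the core and alert-system names) is hypothetical and illustrative, not Ning’s actual tooling.

```python
# Hypothetical sketch of a pre-release prep checklist.
# Function and system names are illustrative, not Ning's real tools.

def prime_core_repos(cores):
    """Stage release binaries for each core in its deploy repository."""
    return [f"{core}: binaries primed" for core in cores]

def silence_alerts(systems):
    """Disable alerting that would fire spuriously during maintenance."""
    return [f"{s}: alerting disabled" for s in systems]

def run_prep(cores, alert_systems):
    """Run the prep steps in order and report what was done."""
    steps = []
    steps.extend(prime_core_repos(cores))
    steps.extend(silence_alerts(alert_systems))
    steps.append("standby systems: set up")
    return steps

print(run_prep(["messaging", "search"], ["pager", "dashboard"]))
```

The point of scripting even simple prep like this is repeatability: the same steps run in the same order every release, and the output doubles as a log of what was done.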
The maintenance started on time at 9 pm Saturday, and we were off to the races. For platform releases, we have a plan that parallelizes steps as much as possible, but also requires a lot of coordination. On Saturday, the full release process took just over an hour: Ning and your social networks were running again behind “closed doors” by 10:05 pm PDT.
However, as with every platform release, we do a full Quality Control check, largely automated, which we clocked at 35 minutes to verify all major functionality across the platform and networks. Needless to say, we do a lot of Quality checks in staging environments that replicate what’s in the Ning Platform, but there’s always a chance that small, non-obvious differences between the environments will create subtle problems live in production. So before going live again, we double-check everything we can.
In this case, by 10:40 pm on Saturday we had completed the QA cycle successfully except for one item in the “core” that handles messaging in the platform. We spent the next 25 minutes fixing the small issue behind it, and Ning came back online at 11:05 pm PDT.
But for us that’s only the beginning. After a release is done and the site is live, the team always keeps “watch” over the system for up to two hours, regardless of what time we completed the release, to check and recheck items that may have been missed. This “platform watch” involves verifying various system health checks, tracking memory consumption, and comparing platform behavior under load with previously established parameters, as well as the more prosaic manual testing. In this case, our paranoia paid off when we discovered that one of the panels in the Ningbar (“My Networks on Ning”) was not updating properly, while showing no errors.
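The “compare against previously established parameters” part of platform watch can be sketched as a simple baseline check. The metric names, baseline values, and tolerance below are made-up examples, not Ning’s real monitoring configuration.

```python
# Illustrative baseline check for a post-release "platform watch":
# flag any live metric that drifts well above its established baseline.
# Metric names, baselines, and tolerance are invented for this example.

BASELINE = {"heap_mb": 2048, "req_latency_ms": 120, "error_rate": 0.001}
TOLERANCE = 0.25  # flag anything more than 25% above baseline

def check_metrics(current):
    """Return a list of human-readable alerts for out-of-range metrics."""
    alerts = []
    for name, baseline in BASELINE.items():
        value = current.get(name)
        if value is not None and value > baseline * (1 + TOLERANCE):
            alerts.append(f"{name}: {value} exceeds baseline {baseline}")
    return alerts

# 3100 MB of heap is more than 25% over the 2048 MB baseline, so it is flagged;
# latency and error rate are within tolerance and pass silently.
print(check_metrics({"heap_mb": 3100, "req_latency_ms": 118, "error_rate": 0.0008}))
```

A check like this catches the quiet failures: a panel that renders but serves stale data throws no errors, so only a comparison against expected behavior, automated or manual, will surface it.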
It took over an hour of head-scratching and debugging before we realized that we had erroneously disabled one of the internal indexing systems involved in generating that data. Within another 30 minutes we had checked and double-checked the fix, and by around 1 am PDT everything was back to normal. We continued the system watch until 2 am and called it a night.
Sunday at 8 am PDT we got alerts on an internal system that was about to run out of capacity to serve requests. This was not visible to anyone using the Ning Platform – we have redundancy built in at many levels to prevent that. We also have very sensitive alerting in place to identify issues, as much as possible, before they become visible to you.
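The idea of alerting before users notice usually comes down to a soft threshold set well below real capacity. Here is a minimal sketch of that pattern; the limits and numbers are illustrative assumptions, not Ning’s actual thresholds.

```python
# Sketch of "alert before it becomes visible": warn when a system crosses a
# soft utilization threshold, well below the hard limit where users would
# notice. The 80% soft limit and request counts are invented for illustration.

SOFT_LIMIT = 0.80   # alert here, while redundancy still absorbs the load
HARD_LIMIT = 1.00   # actual capacity; reaching this would be user-visible

def capacity_alert(current_requests, max_requests):
    """Return a warning string if utilization crosses the soft limit."""
    utilization = current_requests / max_requests
    if utilization >= SOFT_LIMIT:
        return f"WARN: at {utilization:.0%} of capacity, investigate now"
    return None  # healthy: no alert

print(capacity_alert(850, 1000))  # fires: 85% is above the 80% soft limit
print(capacity_alert(500, 1000))  # quiet: 50% is comfortably below it
```

The gap between the soft and hard limits is the time budget for a fix: the alert at 8 am bought the team hours to investigate before the system would actually have run out of headroom.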
This first occurrence seemed like an isolated event, so we left it alone but continued watching carefully. At 9:30 am PDT the alert sounded again. This time, the team got online to get to the root cause of the problem. In the process, we discovered other minor issues (again, nothing that would impact you directly) that we wanted to fix before they became a real problem: for example, sub-optimal settings on some servers for how they connected to databases.
By 6 pm PDT, we had the updated release ready and QA started. Our initial plan was to do the release without going into maintenance, but by 7:20 pm it became clear that this wasn’t going to be possible, and we had to take a brief downtime. We activated our response systems, but thanks to a typing mistake by yours truly, not everyone got notified in time. We are now reviewing how to prevent this from happening again.
We finally started the release process around 7:30 pm PDT, and by 10:00 pm PDT we had completed all of the release steps and post-launch checks, with the actual downtime falling in between.
All in all, it wasn’t a great release, from your perspective, or ours. You can be sure that we will look at what went right and what went wrong and work to get it right the next time. In the meantime, we thank you for your patience, and for using Ning!