Sridatta Viswanath
The Ning Infrastructure Roadmap

The Ning infrastructure runs all Ning Networks. It consists of thousands of machines running software that scales to millions of users. We have consistently delivered high uptime with our platform, but as usage characteristics change, the behavior of our infrastructure changes as well. So we are always working on projects designed to evolve our infrastructure and make the platform even more stable. As I promised last week, I wanted to provide details of the infrastructure projects we are currently working on. None of them are visible to Network Creators, and they are all low-level “plumbing” work, but they will enable higher uptime for all Ning Networks.

The first major project is Project Bacon (note: I am a vegetarian). Project Bacon will be the next-generation “Content Store” and will give us much better fault isolation and stability for all the content on Ning Networks, such as comments, forum posts, and blog posts. Let me go over some details so you have context. The Content Store is the most heavily taxed component in our platform. It holds photos, blog posts, comments, and other information for your social network. We currently have two different back-ends for the Content Store, called Rocky and Coco. Rocky is optimized for our largest networks and Coco for our smaller networks. While running both has been good in terms of optimizing the solutions for different problems, it has not been so good in that it means double the work for adding features and a much larger surface area for bugs to creep into. Taking what we have learned from building these (they are the third and fourth generations of the Content Store), we are re-unifying the back-end with Project Bacon. While most of Bacon is an evolutionary step in terms of organizing and accessing data, a key component is much better fail-over and fault isolation in the cases where we do see issues in its constituent components.
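
To make the fail-over and fault-isolation idea concrete, here is a minimal sketch of a read path that falls back to other copies of the data when one backend fails. This is purely illustrative; the store interface and names are assumptions, not Bacon’s actual design.

```python
# Illustrative sketch only: a content-store read that isolates a failing
# backend by falling back to other copies of the data. The interfaces and
# names here are hypothetical, not Ning's actual implementation.

class StoreUnavailable(Exception):
    """Raised by a backend when it cannot serve a request."""

def read_content(content_id, primary, replicas):
    """Try the primary store first, then each replica in turn.

    A failure in one backend stays isolated: the caller still gets an
    answer as long as any copy of the data is reachable.
    """
    for store in [primary, *replicas]:
        try:
            return store.get(content_id)
        except StoreUnavailable:
            continue  # isolate the failure and move on to the next copy
    raise StoreUnavailable("no backend could serve content %r" % content_id)
```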

A second big project we have underway is revisiting how we store user-to-user messages and invitations. For efficiency reasons, we have kept all of these in one big storage pool, but the fragility of this approach has started to outweigh its efficiency benefits. If you think of the single large pool as a window with a single pane of glass, a rock hitting that pane can shatter the whole thing. If, on the other hand, you have a window with many small panes of glass, a rock hitting one pane is not going to hurt the rest of the window.
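
One common way to go from one big pane to many small ones is to partition (shard) the message store by user. The sketch below illustrates that idea only; the pool count and routing scheme are assumptions, not a description of our actual message store.

```python
import hashlib

# Illustrative sketch: route each user's messages to one of many small
# storage pools rather than a single shared pool. The pool count and the
# hashing scheme are assumptions for illustration only.

NUM_POOLS = 64  # many small "panes" instead of one big one

def pool_for_user(user_id: str) -> int:
    """Deterministically map a user to one of the small message pools."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_POOLS

# A failure in pool 17 only affects the users who hash to pool 17;
# everyone else's messages and invitations stay available.
print(pool_for_user("user-12345"))
```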

Another major undertaking is supporting much better automation in our system, from development through to production. We’ve just started this but expect to see incremental benefits from it very quickly. Jim Gray discovered 25 years ago that the largest causes of failure in a system (about 42%) are operator error, misconfiguration, and system maintenance. Automating away the human-intensive (and therefore error-prone) steps here has outsized returns, so we are revisiting our environment build-out, deployment, and configuration systems to remove room for errors.
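
As a small illustration of what automating away an error-prone step can look like (this is a hypothetical sketch, not our actual tooling), a pre-deploy check that refuses to proceed when the configuration does not match what is expected removes one class of operator error:

```python
# Hypothetical sketch of a pre-deploy configuration check. The keys and
# expected values below are made up for illustration; the point is that
# the machine, not a human, decides whether the config is safe to ship.

REQUIRED_CONFIG = {
    "cache_backend": "memcached",
    "message_pool_count": 64,
}

def validate_config(config: dict) -> list:
    """Return a list of problems; an empty list means the config can be deployed."""
    problems = []
    for key, expected in REQUIRED_CONFIG.items():
        actual = config.get(key)
        if actual != expected:
            problems.append("%s: expected %r, found %r" % (key, expected, actual))
    return problems

if __name__ == "__main__":
    candidate = {"cache_backend": "memcached"}  # missing message_pool_count
    issues = validate_config(candidate)
    if issues:
        raise SystemExit("refusing to deploy:\n" + "\n".join(issues))
    print("configuration looks good; proceeding with deploy")
```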

Finally, we are finishing the transition of our caching infrastructure from a more sophisticated, cluster-aware solution to memcached. While memcached offers fewer features, it is also inherently more stable and predictable. It has been a long process to see all the way through, as it meant reworking some components that relied on features not available in memcached. On the other hand, memcached’s simpler architecture leads to better predictability and performance at large scale.
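
For readers who have not used memcached, the standard cache-aside pattern it supports looks roughly like the sketch below. It uses the pymemcache Python client; the key scheme and the load_from_content_store() helper are placeholders for illustration, not our actual code.

```python
from pymemcache.client.base import Client

# Minimal cache-aside sketch against a local memcached instance.
# The key scheme and load_from_content_store() are placeholders.

cache = Client(("localhost", 11211))

def load_from_content_store(post_id):
    """Placeholder for the slower, authoritative content-store read."""
    return b"blog post body for " + post_id.encode("utf-8")

def get_post(post_id):
    key = "post:" + post_id
    body = cache.get(key)                  # fast path: straight from memcached
    if body is None:                       # cache miss: go to the real store
        body = load_from_content_store(post_id)
        cache.set(key, body, expire=300)   # keep it cached for five minutes
    return body

print(get_post("12345"))
```

The simplicity of that model is a big part of the appeal: there is very little machinery sitting between a request and the cached bytes, which is what makes the behavior easy to predict at scale.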

Overall, in these projects and others, our goals are:

• Simplifying our internal systems
• Removing components that add more problems than value
• Isolating failures to as few networks or users as possible
• Keeping any failure as short-lived as possible