Author Archive

xnperfstat: wrapping NetApp’s PerfStat Tool

Posted by gerir on November 3, 2011 – 1:46 AM

Perfstat is a diagnostic data collection tool for NetApp filers. When filers experience performance issues, NetApp Support will likely ask for perfstat to be run against the ailing ones. This is (hopefully) not something that is done often, so the details of how to run it may get rusty, which is problematic in the middle of an availability storm.

We wrote xnperfstat in late 2008 to provide a cleaner and more straightforward method to run perfstat: it performs perfstat housekeeping chores for us, with options geared directly towards situations where support cases are open. It can also be run on a “continuous” basis from cron (for cases where the data collection has to be done over a period of days), storing and rotating output files.

See the README for details.

The Nagios Shell (ngsh)

Posted by gerir on October 26, 2011 – 1:24 AM

Another tool we’ve gotten a fair amount of mileage out of is what we internally refer to as the Nagios Shell (ngsh). We have used Nagios since our early days (circa 2005), and it has served us very well in keeping an eye on our infrastructure. Over time, we started writing tools to poke and probe Nagios in one way or another. The end result of this process was a hodgepodge of tools that parsed status.dat and did other things they really shouldn’t. We lacked consistency across the toolset; some of the tools took forever to run (we have a decently large environment), and others failed in mysterious ways.

About a year and a half ago we decided to stop the madness, and were lucky enough to run across Mathias Kettner’s fantastic MK Livestatus module. We consolidated eight different tools into a single one, added richer functionality in terms of querying Nagios, and put away the mysterious failures we had grown accustomed to living with, knowing status.dat parsing was biting us. We christened the new tool the Nagios Shell, since it was intended to run on the CLI, and that opened an entirely new set of functionality and correctness in managing our environment.

The current incarnation consists of two scripts, ngsh (a shell script) and ngsq (a Python script), and requires that you build MK Livestatus into your Nagios instance. A new generation is in the works, one which replaces this mixture with a toolkit written entirely in Ruby and provides far more flexibility than the current one, including a RESTish interface so that Nagios can be controlled over HTTP (more on that soon). The README has some brief examples of usage, and soon the wiki will contain a roadmap of improvements.
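
To give a sense of why Livestatus is so much saner than parsing status.dat, here is a minimal Python sketch of a query sent over the Livestatus Unix socket. This is not code from ngsq: the socket path and the function names build_query and run_query are assumptions made for illustration, and the real path depends on how the module is loaded in your Nagios configuration.

```python
import json
import socket

# Assumed socket path; the real one is whatever you pass to Livestatus
# when loading it as a broker module.
LIVESTATUS_SOCKET = "/var/lib/nagios/rw/live"


def build_query(table, columns, filters=()):
    """Build a Livestatus query (LQL) that asks for a JSON response."""
    lines = ["GET " + table, "Columns: " + " ".join(columns)]
    lines.extend("Filter: " + f for f in filters)
    lines.append("OutputFormat: json")
    # A blank line terminates the query.
    return "\n".join(lines) + "\n\n"


def run_query(query, path=LIVESTATUS_SOCKET):
    """Send one query over the Unix socket and decode the JSON reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # tell Livestatus the request is complete
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))
```

On a Nagios host with Livestatus loaded, something like run_query(build_query("services", ["host_name", "description", "state"], ["state = 2"])) should return one row per service in CRITICAL state, with no status.dat parsing involved.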

Zettabee and Theia

Posted by gerir on October 21, 2011 – 2:13 PM

It’s hard to believe it has been almost a year since we started the process of open sourcing tools, but it has indeed been that long, and it picked up steam a few weeks ago, when we pushed out nddtune, which is admittedly a very simple tool. Today we’re continuing that effort with a couple of more significant tools: Zettabee and Theia.

A Little History

About four years ago, we had a very real need to have fairly detailed performance metrics for NetApp filers. At the time, the available solutions relied on SNMP (NetApp’s SNMP support has historically been weak) or were NetApp’s own, which, aside from being expensive, were hard to integrate with the rest of our monitoring infrastructure (which consists of Nagios and Zenoss). As such, we set out to write a tool that would both perform detailed filer monitoring (for faults and performance) and be able to interface with those systems. Theia was born.

In more recent times, as we were looking at beefing up our DR strategy, we found ourselves needing a good ZFS-based replication tool, and set out to write Zettabee, which gave us an opportunity to dive deeper into ZFS capabilities.

Let the Games Begin

Today we’re very excited to be releasing those two tools into the open. Theia has been in production for the last four years, dutifully keeping an eye on our filers, while Zettabee has been pushing bits long-distance for well over nine months. We are working on putting together a roadmap for future work, but are happy to have them out in the open for further collaboration. Tim has written a good post on some of the work he has done to make this happen, and I am grateful for his help on this endeavor.

Operations Toolkit

Posted by gerir on September 13, 2011 – 11:20 AM

A few months ago (has it been that long already?!) we started the process of pushing some of our internal Operations toolkit out in the open (and you can rightly argue that we barely dipped our toes in the water). We are picking up where we left off, and are working towards releasing several other tools over the next few weeks, some of them trivial (let’s call them utilities), others far more significant (true tools).

We are starting today on the utility end of the spectrum with nddtune, an SMF manifest/method combo you can use to ensure ndd tweaks that are not configurable via /etc/system (in Solaris) stay tweaked when a system reboots. See the README file for details. This is based on Dr. Hung-Sheng Tsao’s SMF and tcp tuning 2008 blog post, primarily adding a configuration file.
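
The idea of driving ndd from a configuration file can be sketched in a few lines of Python. This is only an illustration, not how nddtune itself is written (see its README for the real thing): the one-setting-per-line "driver parameter value" format and the function name parse_ndd_config are assumptions made for the example.

```python
def parse_ndd_config(text):
    """Turn 'driver parameter value' lines into `ndd -set` argument lists.

    The file format is hypothetical: one setting per line, with '#'
    starting a comment and blank lines ignored.
    """
    commands = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(
                "line %d: expected 'driver parameter value'" % lineno
            )
        driver, parameter, value = fields
        commands.append(["ndd", "-set", driver, parameter, value])
    return commands
```

Each returned list can be handed to subprocess.run on a Solaris host, which is roughly what a start method would do at boot to reapply the tweaks.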

I chose a trivial utility to start with because the aim of this post is really to provide a preview of some of the tools we are working on publishing. Part of the work involves fixing some of the known bugs we have identified, as well as removing or improving code that is not generic and assumes the tool is running in Ning’s environment. Specific blog posts will follow as we make releases.

A sample of the upcoming tools, and in no particular order:

  • The Nagios Shell sits halfway in between utility and tool. We have traditionally not used any of the Nagios graphical interfaces but still need a way to interact with Nagios in a sane fashion. After a couple of years of writing a hodge-podge of utilities, we decided to collapse the functionality into a single tool, which uses Mathias Kettner’s incredible MK Livestatus module to provide far more functionality than we had before. The current incarnation of the Nagios Shell is actually a shell script coupled with a Python script, and we’re in the middle of a from-the-ground-up rewrite in Ruby to add further functionality while removing complexity.
  • Zettabee is definitely in tool territory. We use ZFS storage for a variety of purposes, and the bulk of the data we store on ZFS has to be replicated to other facilities. Zettabee encapsulates and manages zfs send and receive functionality to provide incremental, block-level, asynchronous replication of remote ZFS file systems (no support for synchronous operations is available), and it’s tightly (but optionally) integrated with Nagios as well.
  • Theia provides NetApp filer performance monitoring and alerting through its integration with Nagios and Zenoss. One of the oldest tools in our toolkit, it has been in production for well over four years, constantly prodding and poking our filers to extract performance data and alert as necessary. It used to be that monitoring NetApp filers was a) painful and/or b) expensive (DFM anyone?). Theia takes care of that for us.
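
For readers unfamiliar with the zfs send and receive mechanics that Zettabee wraps, the following Python sketch builds the two halves of an incremental replication pipeline. It is not taken from Zettabee: the function names, the dataset and snapshot names, the push-over-ssh direction, and the host backup.example.com are all assumptions for illustration. Only the zfs send -i and zfs receive -F invocations are standard ZFS usage.

```python
import shlex


def replication_pipeline(dataset, prev_snap, new_snap, remote_host, remote_dataset):
    """Return (send, receive) argument lists for one incremental transfer.

    `zfs send -i` emits only the blocks that changed between the two
    snapshots; `zfs receive -F` rolls the target back to its most recent
    snapshot before applying the stream.
    """
    send = ["zfs", "send", "-i",
            "%s@%s" % (dataset, prev_snap),
            "%s@%s" % (dataset, new_snap)]
    receive = ["ssh", remote_host, "zfs", "receive", "-F", remote_dataset]
    return send, receive


def as_shell(send, receive):
    """Render the pair as a single shell pipeline, quoting where needed."""
    return " | ".join(shlex.join(cmd) for cmd in (send, receive))
```

Calling as_shell(*replication_pipeline("tank/data", "s1", "s2", "backup.example.com", "backup/data")) yields the pipeline zfs send -i tank/data@s1 tank/data@s2 | ssh backup.example.com zfs receive -F backup/data. A real tool also has to create and prune snapshots, track the last snapshot replicated successfully, and report status, which is where most of the work lies.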

There are other tools in the pipeline, but these will have to do for now. It is our sincere hope that they are useful outside of our environment, and that other bright coders out there can crank out additional fixes and functionality that we have not implemented. And don’t forget to check out the rest of Ning’s open source projects on GitHub!
