Why OpenNMS?
OpenNMS checked a lot of these boxes. It was (mostly) Java, so we could run it on our Sun hardware. It was already running in environments an order of magnitude larger than ours. It had many of the enterprise-level features absent from other Open Source products. Documents available on the Internet [1],[2] pointed to its extensibility. It was built on familiar components (Tomcat, PostgreSQL, RRDtool). Finally, in Open Source terms, it was a relatively mature product.
We took a cautious approach to deploying OpenNMS.
Simplest to replace, and therefore first to go, were the existing network monitoring products. We decommissioned them only after a month of running them in parallel with OpenNMS.
Second to go was the diverse collection of emails sent by applications and batch jobs. We replaced the destination email addresses with mailboxes that delivered the notifications directly into OpenNMS. This turned out to be a bigger win than we'd expected. With a central point where application alerts could be received and processed, we uncovered application issues that had gone unnoticed for weeks or months.
This was painful at first. The respective teams were often uncomfortable having their problems aired to the world. Once we began to address these problems, however, and the frequency of alerts dropped, we saw real benefits. The operations team had a single console for monitoring applications, and we could reduce the number of application support staff on call.
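The mailbox step described above, turning an application's email into a structured event, can be sketched roughly as follows. This is a minimal illustration only, not the setup we actually ran: the UEI, the "mail-gateway" source name, the function names and the exact XML layout are all assumptions made for the example.

```python
from email import message_from_string
from xml.etree import ElementTree as ET

# Hypothetical UEI for illustration; a real deployment would define its own.
ALERT_UEI = "uei.example.org/application/emailAlert"


def first_text_part(msg):
    """Return the payload of the first text/plain part, or '' if none exists."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            return part.get_payload()
    return ""


def email_to_event_xml(raw_message):
    """Convert a raw RFC 822 message into a small event XML document.

    The element names are an assumed, simplified layout for this sketch.
    """
    msg = message_from_string(raw_message)

    event = ET.Element("event")
    ET.SubElement(event, "uei").text = ALERT_UEI
    ET.SubElement(event, "source").text = "mail-gateway"  # assumed source name

    # Carry the interesting parts of the email along as event parameters.
    parms = ET.SubElement(event, "parms")
    fields = (
        ("sender", msg.get("From", "")),
        ("subject", msg.get("Subject", "")),
        ("body", first_text_part(msg).strip()),
    )
    for name, value in fields:
        parm = ET.SubElement(parms, "parm")
        ET.SubElement(parm, "parmName").text = name
        ET.SubElement(parm, "value").text = value

    return ET.tostring(event, encoding="unicode")


if __name__ == "__main__":
    sample = "From: batch@example.org\nSubject: Nightly job failed\n\nExit code 3\n"
    print(email_to_event_xml(sample))
```

Keeping the sender, subject and body as separate parameters is one way to let notification rules match on the originating application rather than on free text.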
The next target was the system performance data collected by our existing tools. Data that could readily be moved into OpenNMS went quickly. Platform-specific data collectors (such as those that collected from Microsoft hosts using WMI) had their important alerts channeled into OpenNMS.
Our current focus, now that we believe our OpenNMS installation is mature, is back in the application space. We are extending the end-to-end monitoring capabilities of OpenNMS to our web services providers. We are also starting to use it to retrieve instrumentation data directly from the applications themselves, as well as from their hosts.
Did We Meet Our Requirements?
Here's how things shook out:
Platform independence: Yes. OpenNMS can run on spare hardware, but it's not a good idea. A year after our first rollout of OpenNMS, we moved from a shared Sun UltraSPARC 2 machine to a dedicated dual Xeon machine running Red Hat Advanced Server.
Performance: Yes. We are comfortable that there will always be users pushing the scalability of OpenNMS much harder than we are.
Enterprise Level Features: A cautious yes. OpenNMS met our initial requirements, but also quickly highlighted new ones. Some customers are never satisfied.
Rationalize Support Roles: Yes. OpenNMS is now the single point for the distribution of all actionable network, server and application events. This does need to be constantly policed, to ensure that non-standard notification paths do not creep in again.
Reduce Tasks: A cautious yes. In general, the operators' load has lessened, if only because OpenNMS has reduced the number of open windows on their desktops.
Extensibility: Yes. OpenNMS has proved to be highly extensible.
Low cost of entry: We deployed OpenNMS with minimal capital outlay. We believe that the subsequent people-based operational costs have been roughly equivalent to those of a commercial solution.
Longevity: We seem to have backed a product with "legs." The mailing lists [3] are as busy as ever and new features are being added to OpenNMS faster than we can make use of them.
The "sweet spot" for OpenNMS seems to be about as wide as that of any Open Source solution, and it is getting bigger by the month. We look forward to enhancements to the web user interface, a new JMX-based data collector and support for event correlation in the near future.