A lot has to be working for a DC-X system to do its job. Even though DC-X is very stable, things can still go wrong (a hard disk fills up, hardware fails, a process crashes), and you’ll want a monitoring system to tell you about it before users start complaining.
Nagios is the monitoring system we’re concentrating on, as it is free, popular, mature and well-documented. (But other monitoring software will probably work fine as well.)
The standard Nagios plug-ins let you check general server health as well as database, web server and fulltext search availability. They are also used to verify that all DC-X command line processes are running. In addition, DC-X comes with its own plug-in (a command line tool) that can report data specific to DC-X: the number of failed import jobs, the number of new documents in the last hour, free disk space across a pool of storage devices and much more.
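If you haven’t written a Nagios plug-in before, the convention is simple: print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The following is a minimal sketch of a generic free-disk-space check in Python. It is not the DC-X plug-in; the function names and the 20%/10% thresholds are assumptions made up for this illustration.

```python
import shutil

# Standard Nagios plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_free_space(free_percent, warn_below=20.0, crit_below=10.0):
    """Map a free-space percentage to a Nagios exit code and label.

    The thresholds are illustrative defaults, not DC-X recommendations.
    """
    if free_percent < crit_below:
        return CRITICAL, "CRITICAL"
    if free_percent < warn_below:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path="/"):
    """Return (exit_code, status_line) for the filesystem containing path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    free_percent = 100.0 * usage.free / usage.total
    code, label = classify_free_space(free_percent)
    # Text after the "|" is performance data that Nagios can pass on to graphing tools.
    line = (f"DISK {label} - {free_percent:.1f}% free on {path}"
            f" | free_pct={free_percent:.1f}%")
    return code, line


def main(argv):
    """Print the status line and return the exit code Nagios expects."""
    path = argv[1] if len(argv) > 1 else "/"
    code, line = check_disk(path)
    print(line)
    return code
```

To run this as an actual plug-in, end the script with `import sys` and `sys.exit(main(sys.argv))` so that Nagios sees the exit code.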
We’re using Supervisor to start and stop DC-X command line processes (and automatically restart them after a crash). It’s a great tool that also reports the process status nicely.
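Supervisor can also report process status over its XML-RPC interface, which makes it easy to feed into a monitoring check. The sketch below assumes that Supervisor’s HTTP server is enabled and reachable at `http://localhost:9001/RPC2`; the URL, the function names and the process names in the usage note are assumptions for illustration, not part of DC-X.

```python
import xmlrpc.client


def summarize(processes):
    """Turn Supervisor process-info dicts into a Nagios-style (exit_code, message)."""
    if not processes:
        return 3, "UNKNOWN - Supervisor reports no processes"
    # Anything that is not RUNNING is flagged, including short-lived states
    # such as STARTING; a real check might want to tolerate those.
    down = sorted(p["name"] for p in processes if p["statename"] != "RUNNING")
    if down:
        return 2, "CRITICAL - not running: " + ", ".join(down)
    return 0, f"OK - all {len(processes)} process(es) running"


def fetch_processes(url="http://localhost:9001/RPC2"):
    """Ask Supervisor for the state of every process it manages.

    The URL is an assumption; adjust it to your Supervisor configuration.
    """
    server = xmlrpc.client.ServerProxy(url)
    return server.supervisor.getAllProcessInfo()
```

With a running Supervisor, `summarize(fetch_processes())` yields the status. For a hypothetical pair of processes where `dcx_import` is `RUNNING` and `dcx_index` is `FATAL`, the result is a CRITICAL status naming `dcx_index`.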
You’ll want to see trends and statistics as well – how fast is disk space filling up, and how is the number of DC-X documents or workflow jobs evolving? For generic server information, we recommend something like collectd. DC-X trends are captured and graphed on our demo server using custom scripts that call rrdtool – this functionality is going to be packaged into the standard DC-X distribution so that every DC-X installation can benefit.
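To give an idea of what a trend question like “how fast is the disk filling up?” boils down to, here is a small sketch that estimates the fill rate from recorded samples by simple linear extrapolation. It is not how our demo server scripts work (those use rrdtool); the function names and the sample figures are invented for this example.

```python
from datetime import datetime, timedelta


def fill_rate(samples):
    """Average growth in bytes per day between the first and the last sample.

    samples: list of (datetime, used_bytes) tuples in chronological order.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    days = (t1 - t0).total_seconds() / 86400
    if days <= 0:
        raise ValueError("samples must span a positive time range")
    return (used1 - used0) / days


def days_until_full(samples, capacity_bytes):
    """Linear estimate of the days left; None if usage is flat or shrinking."""
    rate = fill_rate(samples)
    if rate <= 0:
        return None
    return (capacity_bytes - samples[-1][1]) / rate


# Invented example: a 1 TB volume that grew from 400 GB to 460 GB in 30 days.
start = datetime(2024, 1, 1)
history = [(start, 400e9), (start + timedelta(days=30), 460e9)]
print(days_until_full(history, capacity_bytes=1000e9))
```

A straight-line estimate like this ignores seasonal patterns and bursts, so treat it as a rough early warning rather than a forecast.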