November 12, 2014 Marie H.

Writing Custom Nagios Checks in Python

Photo by Daniil Komov on Unsplash

I've been writing Nagios checks for a while now, and the thing I see trip people up most often is not understanding the contract between Nagios and a check script. Once that clicks, everything else is just plumbing.

The contract: exit codes are everything

Nagios does not care what your check prints (much). What it cares about is the exit code your script returns when it finishes. That's it. The mapping is:

  • 0 = OK
  • 1 = WARNING
  • 2 = CRITICAL
  • 3 = UNKNOWN

Get that wrong and your entire alert logic is broken. I've seen setups where a check was exiting 1 for a healthy state and nobody noticed for weeks because the graph looked "kind of normal." Don't do that.
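One way to make the mapping hard to get wrong is to name the codes once and route every exit through a single helper. This is a minimal sketch, not part of any Nagios library; the names Status, format_status_line, and nagios_exit are hypothetical and chosen only for illustration.

```python
import sys
from enum import IntEnum


class Status(IntEnum):
    """Nagios plugin exit codes, exactly as listed above."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def format_status_line(status, message):
    """Build the one-line status text, e.g. 'WARNING: disk at 91%'."""
    return f"{status.name}: {message}"


def nagios_exit(status, message):
    """Print the status line and exit with the matching code."""
    print(format_status_line(status, message))
    sys.exit(int(status))
```

With this in place a check ends with something like nagios_exit(Status.CRITICAL, "value is 97"), and the printed prefix can never disagree with the exit code.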

The check should also print a one-line status summary to stdout. That line is what shows up in the Nagios UI next to the service status. Nagios treats only the first line as the status text; everything after the first newline is extended ("long") output and shows up in a separate detail view. I'll come back to why multi-line output is a footgun.

Basic check structure

Every check I write follows this skeleton:

#!/usr/bin/env python3

import argparse
import sys

def parse_args():
    parser = argparse.ArgumentParser(description='Check something useful')
    parser.add_argument('--warning', '-w', type=float, required=True,
                        help='Warning threshold')
    parser.add_argument('--critical', '-c', type=float, required=True,
                        help='Critical threshold')
    return parser.parse_args()

def get_value():
    """Placeholder: replace the body with whatever you're measuring."""
    raise NotImplementedError("get_value() is not implemented yet")

def main():
    args = parse_args()

    try:
        value = get_value()  # whatever you're measuring
    except Exception as e:
        print(f"UNKNOWN: could not retrieve value: {e}")
        sys.exit(3)

    if value >= args.critical:
        print(f"CRITICAL: value is {value}")
        sys.exit(2)
    elif value >= args.warning:
        print(f"WARNING: value is {value}")
        sys.exit(1)
    else:
        print(f"OK: value is {value}")
        sys.exit(0)

if __name__ == '__main__':
    main()

The UNKNOWN exit is important. If your check crashes trying to connect to something or import a library, you want Nagios to know the check failed, not that the service is OK. Swallowing exceptions and exiting 0 is a classic way to create false confidence.

Performance data

If you want to graph your check values in PNP4Nagios or ship them to Graphite, you need to include performance data in your output. The format is a | character after your status message, followed by one or more space-separated metrics, each of the form:

label=value[UOM];warn;crit;min;max

A real example output line:

OK: memory usage is 512MB | rss_mb=512;800;1024;0;2048

The UOM (unit of measure) field is optional but useful. Common values are s (seconds), %, B, KB, MB, GB, and c (counter). The warn, crit, min, and max fields can be left empty if you don't have them; keep the semicolons so the remaining fields stay in position: rss_mb=512;;;0;.

PNP4Nagios picks this up automatically and starts graphing. Graphite integrations typically read it from the perfdata files Nagios can be configured to write. Either way, you get the data with almost no extra work if you add that | section from the start.
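Building that string by hand in every check invites misplaced semicolons. The helper below is a small sketch of the label=value[UOM];warn;crit;min;max format described above; format_perfdata and its parameter names are hypothetical, not part of Nagios or any library.

```python
def format_perfdata(label, value, uom="", warn=None, crit=None,
                    minimum=None, maximum=None):
    """Return one perfdata item: label=value[UOM];warn;crit;min;max."""
    def field(x):
        # Missing fields stay empty so the semicolons keep their positions.
        return "" if x is None else str(x)

    # A label containing spaces has to be wrapped in single quotes.
    if " " in label:
        label = f"'{label}'"
    parts = [f"{label}={value}{uom}", field(warn), field(crit),
             field(minimum), field(maximum)]
    return ";".join(parts)


print(format_perfdata("rss_mb", 512, warn=800, crit=1024,
                      minimum=0, maximum=2048))
# prints: rss_mb=512;800;1024;0;2048
print(format_perfdata("rss_mb", 512, minimum=0))
# prints: rss_mb=512;;;0;
```

Several metrics can be emitted by joining the results of multiple calls with a single space.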

Real example: checking process memory usage

Here's a complete, working check that uses psutil to monitor RSS memory for a named process. This is something I use to keep an eye on Java processes that have a habit of growing without bound.

#!/usr/bin/env python3
"""
check_process_memory.py - Nagios check for process RSS memory usage
Usage: check_process_memory.py -p <process_name> -w <warn_mb> -c <crit_mb>
"""

import argparse
import sys

try:
    import psutil
except ImportError:
    print("UNKNOWN: psutil not installed. Run: pip install psutil")
    sys.exit(3)


def parse_args():
    parser = argparse.ArgumentParser(description='Check process memory usage')
    parser.add_argument('--process', '-p', required=True,
                        help='Process name to check (e.g. java, nginx)')
    parser.add_argument('--warning', '-w', type=float, required=True,
                        help='Warning threshold in MB')
    parser.add_argument('--critical', '-c', type=float, required=True,
                        help='Critical threshold in MB')
    return parser.parse_args()


def get_process_rss_mb(name):
    """Return total RSS in MB for all processes matching name."""
    total_rss = 0
    found = False
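    # With an attrs list, process_iter() stores None for any field it was
    # denied access to (instead of raising), so check for None explicitly.
    for proc in psutil.process_iter(['name', 'memory_info']):
        mem = proc.info['memory_info']
        if proc.info['name'] == name and mem is not None:
            total_rss += mem.rss
            found = True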
    if not found:
        raise ValueError(f"No process named '{name}' found")
    return total_rss / (1024 * 1024)


def main():
    args = parse_args()

    try:
        rss_mb = get_process_rss_mb(args.process)
    except ValueError as e:
        print(f"UNKNOWN: {e}")
        sys.exit(3)
    except Exception as e:
        print(f"UNKNOWN: unexpected error: {e}")
        sys.exit(3)

    perf = (f"rss_mb={rss_mb:.1f};{args.warning:.1f};"
            f"{args.critical:.1f};0;")

    if rss_mb >= args.critical:
        print(f"CRITICAL: {args.process} RSS is {rss_mb:.1f}MB | {perf}")
        sys.exit(2)
    elif rss_mb >= args.warning:
        print(f"WARNING: {args.process} RSS is {rss_mb:.1f}MB | {perf}")
        sys.exit(1)
    else:
        print(f"OK: {args.process} RSS is {rss_mb:.1f}MB | {perf}")
        sys.exit(0)


if __name__ == '__main__':
    main()

Install psutil first: pip install psutil. Then test it manually before wiring it into Nagios:

python3 check_process_memory.py -p java -w 800 -c 1024
echo $?

echo $? immediately after running the check tells you the exit code. That's how you verify your check is behaving correctly before you trust it to wake you up at 3am.
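The same manual test can be scripted, which is handy once you have more than a couple of checks. The sketch below runs a command, captures its exit code, and splits the first output line at the | the way the status text and performance data are separated. run_check is a hypothetical helper, and fake_check is a stand-in command used only so the example runs anywhere; in practice you would pass the real check's command line.

```python
import subprocess
import sys


def run_check(argv, timeout=30):
    """Run a check command; return (exit_code, status_text, perfdata)."""
    proc = subprocess.run(argv, capture_output=True, text=True,
                          timeout=timeout)
    lines = proc.stdout.splitlines()
    first_line = lines[0] if lines else ""
    # Everything after the first '|' on line one is performance data.
    status_text, _, perfdata = first_line.partition("|")
    return proc.returncode, status_text.strip(), perfdata.strip()


# Stand-in for a real check: prints a WARNING line and exits 1.
fake_check = [
    sys.executable, "-c",
    "import sys; print('WARNING: value is 85 | value=85;80;90'); sys.exit(1)",
]

print(run_check(fake_check))
# prints: (1, 'WARNING: value is 85', 'value=85;80;90')
```

Asserting on the returned tuple gives you a regression test for the exit code and the output format in one place.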

Second example: HTTP endpoint check

This one checks both that an endpoint returns the expected HTTP status code and that it responds within a time threshold. Both are worth monitoring separately — a 200 that takes 30 seconds is not OK.

#!/usr/bin/env python3
"""
check_http_endpoint.py - Check HTTP status code and response time
"""

import argparse
import sys
import time

try:
    import requests
except ImportError:
    print("UNKNOWN: requests not installed. Run: pip install requests")
    sys.exit(3)


def parse_args():
    parser = argparse.ArgumentParser(description='Check HTTP endpoint')
    parser.add_argument('--url', '-u', required=True, help='URL to check')
    parser.add_argument('--warning', '-w', type=float, default=1.0,
                        help='Warning threshold in seconds (default: 1.0)')
    parser.add_argument('--critical', '-c', type=float, default=3.0,
                        help='Critical threshold in seconds (default: 3.0)')
    parser.add_argument('--expected-status', '-e', type=int, default=200,
                        help='Expected HTTP status code (default: 200)')
    return parser.parse_args()


def main():
    args = parse_args()

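    try:
        start = time.monotonic()
        # Note: requests applies this timeout to the connect and to each
        # read separately, so it is not a hard cap on total elapsed time.
        resp = requests.get(args.url, timeout=args.critical + 1)
        elapsed = time.monotonic() - start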
    except requests.exceptions.Timeout:
        print(f"CRITICAL: connection to {args.url} timed out")
        sys.exit(2)
    except requests.exceptions.ConnectionError as e:
        print(f"CRITICAL: could not connect to {args.url}: {e}")
        sys.exit(2)
    except Exception as e:
        print(f"UNKNOWN: {e}")
        sys.exit(3)

    perf = f"response_time={elapsed:.3f}s;{args.warning};{args.critical};0;"

    if resp.status_code != args.expected_status:
        print(f"CRITICAL: got HTTP {resp.status_code}, expected "
              f"{args.expected_status} ({elapsed:.3f}s) | {perf}")
        sys.exit(2)

    if elapsed >= args.critical:
        print(f"CRITICAL: response time {elapsed:.3f}s >= {args.critical}s "
              f"(HTTP {resp.status_code}) | {perf}")
        sys.exit(2)
    elif elapsed >= args.warning:
        print(f"WARNING: response time {elapsed:.3f}s >= {args.warning}s "
              f"(HTTP {resp.status_code}) | {perf}")
        sys.exit(1)
    else:
        print(f"OK: HTTP {resp.status_code} in {elapsed:.3f}s | {perf}")
        sys.exit(0)


if __name__ == '__main__':
    main()

Common mistakes I've seen (and made)

Forgetting to flush stdout before exit. If you use print() followed by sys.exit() in Python 3, this is normally fine, because the interpreter flushes stdout during a normal shutdown. If you're ever mixing sys.stdout.write() with other exit paths, add a sys.stdout.flush() before the exit call. A hard exit such as os._exit(), or the check being killed by a signal, skips that flush; Nagios then receives empty output, which it interprets as a problem with the check itself.

Multi-line output. Nagios displays only the first line of your check output in the service list. Everything after the first \n is "long output" and only appears in the detail view. I've seen people write checks that print a nice summary and then a table of details — and then wonder why the service list shows a truncated mess. Put your most important information on line one.
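If you do want detail lines, one approach is to assemble the output in a single place so the summary and performance data always land on line one. This is only a sketch; build_output is a hypothetical helper and the backend names in the example are made up.

```python
def build_output(summary, perfdata="", details=()):
    """Return plugin output: summary (plus perfdata) first, details after."""
    # Line one must stand alone, so collapse any newlines in the summary.
    first = " ".join(summary.split())
    if perfdata:
        first = f"{first} | {perfdata}"
    # Detail lines follow and only appear in the long-output view.
    return "\n".join([first, *details])


print(build_output("OK: 2 of 2 backends healthy",
                   perfdata="healthy=2;;;0;2",
                   details=["backend-a: up", "backend-b: up"]))
```

The service list then shows "OK: 2 of 2 backends healthy", and the per-backend lines stay in the detail view where they belong.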

Not handling the UNKNOWN case. If your check can't run — the library isn't installed, the thing you're measuring isn't accessible, you got an unexpected exception — exit with code 3. Don't let the exception bubble up and produce a Python traceback as the "output," and definitely don't silently exit 0.
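A decorator is one way to guarantee that no traceback ever reaches Nagios. The sketch below wraps a check's main function so any unexpected exception becomes a one-line UNKNOWN and exit code 3; unknown_on_error is a hypothetical name, not something provided by Nagios.

```python
import functools
import sys


def unknown_on_error(func):
    """Turn any unhandled exception in func into UNKNOWN / exit code 3."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        # sys.exit() raises SystemExit, which is not an Exception subclass,
        # so deliberate exits inside func pass through untouched.
        except Exception as exc:
            # Keep only the first line so the status text stays one line.
            message = (str(exc).splitlines() or [""])[0]
            print(f"UNKNOWN: {type(exc).__name__}: {message}")
            sys.exit(3)
    return wrapper
```

Applying @unknown_on_error to main() means explicit sys.exit(0), sys.exit(1), and sys.exit(2) calls behave as before, while anything you forgot to handle is reported as UNKNOWN.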

Threshold direction. For some metrics, lower is worse (free disk space, available connections). Make sure your warning/critical comparisons are going the right direction. It's embarrassing to alert on "too much free disk space."
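Making the direction an explicit argument keeps the comparison from being flipped by accident. This is a minimal sketch; evaluate and its lower_is_worse flag are hypothetical names used only for illustration.

```python
def evaluate(value, warning, critical, lower_is_worse=False):
    """Return the Nagios exit code (0, 1 or 2) for value."""
    if lower_is_worse:
        # e.g. free disk space: alert when the value drops to the threshold.
        if value <= critical:
            return 2
        if value <= warning:
            return 1
        return 0
    # e.g. memory usage: alert when the value rises to the threshold.
    if value >= critical:
        return 2
    if value >= warning:
        return 1
    return 0


# Free disk space in percent: warn at 20% free, critical at 10% free.
print(evaluate(5, warning=20, critical=10, lower_is_worse=True))   # prints: 2
print(evaluate(50, warning=20, critical=10, lower_is_worse=True))  # prints: 0
```

Note that for a lower-is-worse metric the warning threshold is the larger number, which is itself a useful sanity check on the arguments.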

The nagiosplugin library

If you find yourself writing a lot of checks, look at the nagiosplugin library (pip install nagiosplugin). It handles argument parsing, threshold comparison, performance data formatting, and exit codes for you. The tradeoff is a steeper learning curve upfront — you write a class rather than a script. For a one-off check I usually just roll my own. For something I'm going to maintain long-term or hand off to someone else, nagiosplugin is worth it.

Making the check executable and placing it

chmod 755 check_process_memory.py
sudo cp check_process_memory.py /usr/lib/nagios/plugins/
sudo chown root:root /usr/lib/nagios/plugins/check_process_memory.py

Then in your Nagios config:

define command {
    command_name    check_process_memory
    command_line    /usr/lib/nagios/plugins/check_process_memory.py -p $ARG1$ -w $ARG2$ -c $ARG3$
}

And in your service definition:

define service {
    use                 generic-service
    host_name           myserver
    service_description Java Memory
    check_command       check_process_memory!java!800!1024
}

Always test the check manually as the nagios user before relying on it:

sudo -u nagios /usr/lib/nagios/plugins/check_process_memory.py -p java -w 800 -c 1024
echo $?

Running it as the nagios user matters because the nagios user has a stripped-down environment. A check that works as root or your own user can easily fail as nagios because of missing PATH entries, missing environment variables, or permissions on files the check needs to read. Test it as nagios before you trust it.