February 18, 2015 Marie H.

Running Remote Nagios Checks with NRPE

Photo by Stephen Dawson on Unsplash

Nagios runs on your monitoring server. That's great for network-level checks such as ping, HTTP, and port availability. But a lot of what you actually care about is invisible from the outside: disk usage, memory consumption, process counts, whether a specific log file is accumulating errors. For that you need something running on the monitored host itself.

NRPE — Nagios Remote Plugin Executor — is the standard solution. It's a small daemon that runs on each monitored host, listens for commands from the Nagios server, executes local checks, and sends the results back. Simple concept, a few gotchas in practice.

The architecture

Nagios server
  └── check_nrpe plugin
        └── (TCP port 5666)
              └── nrpe daemon on monitored host
                    └── local check script
                          └── result (exit code + output) back to Nagios

The Nagios server runs check_nrpe just like any other check plugin. check_nrpe opens a TCP connection to the NRPE daemon on the target host, tells it which check to run, and waits for the response. The NRPE daemon runs the local check script and ships the exit code and output back. From Nagios's perspective it looks like any other plugin — it gets an exit code and a line of output.
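To make the "exit code plus a line of output" contract concrete, here is a minimal sketch of what a local check script has to produce. The exit codes 0 to 3 are the standard Nagios plugin states; the MYAPP label, the measured value of 150, and the thresholds are made up for illustration.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a Nagios state (higher values are worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_result(label, value, warn, crit):
    """Return (exit_code, one_line_output) the way a plugin reports it."""
    state = evaluate(value, warn, crit)
    # Text before the "|" is for humans; text after it is performance data.
    line = "%s %s - value is %s | %s=%s;%s;%s" % (
        label, STATE_NAMES[state], value, label.lower(), value, warn, crit)
    return state, line


def main():
    # 150 is a hard-coded stand-in for a real measurement (hypothetical).
    code, line = format_result("MYAPP", 150, warn=100, crit=200)
    print(line)
    sys.exit(code)
```

Calling main() prints one line and exits with status 1 (WARNING). NRPE relays exactly those two things back to check_nrpe.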

Installing NRPE on the monitored host

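On Amazon Linux or CentOS (on CentOS the nrpe and nagios-plugins-all packages come from the EPEL repository, so enable that first):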

sudo yum install nrpe nagios-plugins-all
sudo systemctl enable nrpe
sudo systemctl start nrpe

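nagios-plugins-all gives you the standard set of check scripts (check_disk, check_load, check_procs, etc.) that NRPE will be calling. They need to be on the monitored host, because that's where NRPE executes them. On hosts without systemd, such as CentOS 6, use chkconfig nrpe on and service nrpe start instead of the systemctl commands.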

Configuring nrpe.cfg

The main config file is /etc/nagios/nrpe.cfg. The key settings:

# Which hosts are allowed to connect to this NRPE daemon.
# Only your Nagios server should be here.
allowed_hosts=127.0.0.1,10.0.1.50

# This controls whether the Nagios server can pass arguments to checks.
# Keep this at 0 unless you specifically need it (see the section below).
dont_blame_nrpe=0

# Define the checks this host exposes.
command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
command[check_mem]=/usr/lib/nagios/plugins/check_mem.pl -u -w 80 -c 90
command[check_zombie_procs]=/usr/lib64/nagios/plugins/check_procs -w 5 -c 10 -s Z
command[check_myapp]=/usr/local/lib/nagios/plugins/check_myapp.py -w 100 -c 200

The command[name]=... lines define what checks are available. The name in brackets is what the Nagios server uses to request the check. With dont_blame_nrpe=0 the daemon does not accept arguments from the Nagios server, so the thresholds have to be hardcoded here on the monitored host. That is the safe default.
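If you want a quick inventory of which checks a host exposes, the command[name]=... syntax is easy to parse. The following is a rough sketch, not an NRPE tool: it only understands the one-line command definitions shown above, and the sample text is an excerpt of that config.

```python
import re

# Matches lines of the form: command[name]=command line
COMMAND_RE = re.compile(r"^\s*command\[([^\]]+)\]\s*=\s*(.+?)\s*$")


def parse_commands(config_text):
    """Return {check_name: command_line} for each command[...] definition."""
    commands = {}
    for line in config_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip commented-out lines
        match = COMMAND_RE.match(line)
        if match:
            commands[match.group(1)] = match.group(2)
    return commands


SAMPLE = """\
allowed_hosts=127.0.0.1,10.0.1.50
dont_blame_nrpe=0
# command[disabled]=/bin/false
command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
"""

for name, command_line in sorted(parse_commands(SAMPLE).items()):
    print(name, "->", command_line)
```

On a real host you would read /etc/nagios/nrpe.cfg instead of the SAMPLE string.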

After editing the config, restart NRPE:

sudo systemctl restart nrpe

Firewall

NRPE listens on TCP port 5666. The monitored host needs to allow inbound connections on that port from the Nagios server, and only the Nagios server. If you're on AWS, that means a security group rule. If you're using iptables directly:

sudo iptables -A INPUT -p tcp --dport 5666 -s 10.0.1.50 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 5666 -j DROP

Don't leave port 5666 open to the world. NRPE doesn't have authentication beyond the allowed_hosts IP whitelist, and you don't want random people querying your monitoring checks.
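A simple way to confirm the firewall rule does what you intended is to attempt a plain TCP connection to port 5666 from different machines. This sketch only tests TCP reachability; it does not speak the NRPE protocol, and the 10.0.1.100 address in the comment is just the example host used elsewhere in this post.

```python
import socket


def port_is_open(host, port=5666, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (a DROP rule), and DNS failures.
        return False


# Example (not executed here):
#   port_is_open("10.0.1.100")  # expect True from the Nagios server,
#                               # False from any other machine
```

Run it from the Nagios server and from a host that should be blocked. If both report True, the rule is too permissive.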

On the Nagios server side

Install the check_nrpe plugin on the Nagios server:

sudo yum install nagios-plugins-nrpe

Define the NRPE command in your Nagios config:

define command {
    command_name    check_nrpe
    command_line    /usr/lib64/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

Then use it in service definitions:

define service {
    use                 generic-service
    host_name           myserver
    service_description Disk Usage
    check_command       check_nrpe!check_disk
}

define service {
    use                 generic-service
    host_name           myserver
    service_description Load Average
    check_command       check_nrpe!check_load
}

Everything after the ! is passed to the command as $ARG1$, so check_disk here is the check name. It has to match exactly what you defined in nrpe.cfg on the monitored host.
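Because a mistyped check name only shows up at check time, it can be worth cross-checking the two configs. Here is a rough sketch under simplifying assumptions: services call the command as check_nrpe!name exactly as above, and both files use the one-line syntax shown in this post. The check_laod typo in the sample is deliberate.

```python
import re

NRPE_COMMAND_RE = re.compile(r"^\s*command\[([^\]]+)\]\s*=", re.MULTILINE)
SERVICE_CHECK_RE = re.compile(r"^\s*check_command\s+check_nrpe!([^!\s]+)", re.MULTILINE)


def missing_checks(nagios_cfg_text, nrpe_cfg_text):
    """Return check names requested by Nagios but not defined in nrpe.cfg."""
    defined = set(NRPE_COMMAND_RE.findall(nrpe_cfg_text))
    requested = set(SERVICE_CHECK_RE.findall(nagios_cfg_text))
    return sorted(requested - defined)


NAGIOS_SAMPLE = """\
define service {
    host_name           myserver
    check_command       check_nrpe!check_disk
}
define service {
    host_name           myserver
    check_command       check_nrpe!check_laod
}
"""

NRPE_SAMPLE = """\
command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
"""

print(missing_checks(NAGIOS_SAMPLE, NRPE_SAMPLE))
```

The sketch treats all hosts as sharing one nrpe.cfg. With per-host configs you would compare each host's services against that host's file.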

Verifying from the command line

Before wiring things into Nagios, test from the Nagios server directly:

/usr/lib64/nagios/plugins/check_nrpe -H 10.0.1.100 -c check_disk

You should see output like:

DISK OK - free space: / 42567 MB (68% inode=99%): | /=19661MB;52428;58982;0;65536

If you get "Connection refused", check that NRPE is running and the firewall rule is right. If you get "CHECK_NRPE: Error - Could not complete SSL handshake", first check that the Nagios server's IP is listed in allowed_hosts: NRPE drops connections from unlisted hosts, and check_nrpe reports that as a failed handshake. If the IP is listed, you may have an SSL mismatch between the check_nrpe client and the NRPE daemon, which is common when the Nagios server is on a different distro version than the monitored host. That is usually resolved by running the same NRPE version on both sides.
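The DISK OK line shown earlier has two parts: human-readable status text before the | and performance data after it, in the form label=value[unit];warn;crit;min;max. A small parser makes the structure clearer. This is a sketch that assumes space-separated metrics and does not handle quoted labels containing spaces.

```python
import re

VALUE_RE = re.compile(r"^(-?\d+(?:\.\d+)?)(.*)$")


def parse_plugin_output(line):
    """Split one line of plugin output into status text and perfdata metrics."""
    text, _, perfdata = line.partition("|")
    metrics = {}
    for item in perfdata.split():
        label, sep, data = item.partition("=")
        if not sep:
            continue  # not a label=value pair
        fields = data.split(";")
        match = VALUE_RE.match(fields[0])
        if not match:
            continue
        # Pad so that warn, crit, min and max always exist.
        warn, crit, low, high = (fields[1:5] + [""] * 4)[:4]
        metrics[label] = {
            "value": float(match.group(1)),
            "unit": match.group(2),
            "warn": warn, "crit": crit, "min": low, "max": high,
        }
    return text.strip(), metrics


SAMPLE = "DISK OK - free space: / 42567 MB (68% inode=99%): | /=19661MB;52428;58982;0;65536"
status_text, perf = parse_plugin_output(SAMPLE)
print(status_text)
print(perf["/"])
```

For the sample line, the single metric is labelled /, with a value of 19661 in MB and thresholds of 52428 (warning) and 58982 (critical).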

Passing arguments from Nagios to NRPE

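By default, checks in nrpe.cfg have their arguments hardcoded. If you want to pass thresholds from the Nagios server at check time, set dont_blame_nrpe=1 in nrpe.cfg and use $ARG1$, $ARG2$, and so on as placeholders in the command definitions. The daemon also has to be built with command-argument support (the --enable-command-args configure option); distro packages vary on this, so check yours.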

# nrpe.cfg on the monitored host
dont_blame_nrpe=1
command[check_disk_args]=/usr/lib64/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$

Then on the Nagios server:

define command {
    command_name    check_nrpe_args
    command_line    /usr/lib64/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$ $ARG3$ $ARG4$
}

define service {
    ...
    check_command   check_nrpe_args!check_disk_args!20%!10%!/
}

The security tradeoff with dont_blame_nrpe=1 is real: the arguments end up on a command line that NRPE hands to a shell, so anyone who can send requests to the daemon can try to inject shell commands into your check invocations. NRPE rejects arguments containing common shell metacharacters and the allowed_hosts whitelist limits who can connect, but defense in depth means keeping dont_blame_nrpe=0 when you can and hardcoding the thresholds. I only enable argument passing when I have a legitimate operational reason, like wanting to control thresholds from the Nagios server without touching every nrpe.cfg file.
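To see why argument passing is risky, it helps to picture the substitution step. The sketch below is an illustration of the idea, not NRPE's actual code: it fills $ARGn$ placeholders the way the config syntax suggests and refuses arguments containing characters from an illustrative blacklist. The exact character set NRPE rejects differs between versions.

```python
import re

# Characters treated as dangerous in this sketch (an illustrative set,
# not necessarily the exact list any NRPE version uses).
NASTY_CHARS = set("|`&><'\"\\[]{};!\n")


def expand_command(template, args):
    """Replace $ARG1$, $ARG2$, ... in template with the supplied arguments."""
    for arg in args:
        if NASTY_CHARS & set(arg):
            raise ValueError("argument contains shell metacharacters: %r" % arg)

    def replace(match):
        index = int(match.group(1)) - 1
        # A macro with no matching argument expands to an empty string.
        return args[index] if 0 <= index < len(args) else ""

    return re.sub(r"\$ARG(\d+)\$", replace, template)


TEMPLATE = "/usr/lib64/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$"
print(expand_command(TEMPLATE, ["20%", "10%", "/"]))
```

An argument such as "/; id" raises ValueError here. Without a filter like this, it would have become a second shell command, which is exactly the injection the warning above is about.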

A real nrpe.cfg with useful checks

Here's what a reasonably complete nrpe.cfg command section looks like for a typical Linux server:

# Disk usage - warn at 20% free, critical at 10% free
command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /

# Memory usage (requires check_mem.pl from nagios-plugins-contrib)
command[check_mem]=/usr/lib/nagios/plugins/check_mem.pl -f -w 20 -c 10

# System load - warn/crit thresholds for 1m, 5m, 15m averages
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 4,3,2 -c 8,6,4

# Zombie processes
command[check_zombie_procs]=/usr/lib64/nagios/plugins/check_procs -w 5 -c 10 -s Z

# Custom app check - our Python check from the previous post
command[check_myapp_memory]=/usr/local/lib/nagios/plugins/check_process_memory.py -p myapp -w 512 -c 1024

Debugging: the nagios user's environment problem

This is the gotcha that burns everyone at least once. You SSH to the monitored host, run the check manually as yourself, and it works fine. Then Nagios reports UNKNOWN or the check fails. The reason is almost always the nagios user's environment.

The nagios user has a minimal shell environment. It may not have the same PATH as your user. It may not have environment variables your check depends on. It may not have permission to read files your check needs.
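You can reproduce the environment half of this problem without switching users by running a check with almost nothing in its environment. The sketch below is a rough approximation: the PATH value is an assumption about what a daemon typically gets, and it says nothing about file permissions, which still require testing as the real user.

```python
import subprocess


def run_with_minimal_env(command, path="/usr/bin:/bin"):
    """Run command (a list of arguments) with only PATH set in its environment."""
    result = subprocess.run(
        command,
        env={"PATH": path},          # everything else from your shell is dropped
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,    # plugins sometimes report errors on stderr
        universal_newlines=True,
    )
    return result.returncode, result.stdout.strip()


# Example (hypothetical check path, not executed here):
#   code, output = run_with_minimal_env(
#       ["/usr/local/lib/nagios/plugins/check_myapp.py", "-w", "100", "-c", "200"])
#   print(code, output)
```

If a check passes in your shell but fails through this function, it depends on something in your environment, usually PATH or an exported variable.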

Always test checks as the nagios user (or whichever account nrpe_user names in nrpe.cfg; some packages run the daemon as nrpe instead):

sudo -u nagios /usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /
echo $?

If that fails and running as yourself works, you've found your problem. Common fixes:

  • Use full absolute paths in your check scripts instead of relying on PATH
  • Set required environment variables explicitly in the check script
  • Use sudoers to grant the nagios user permission to run specific commands as root

For that last one, if your check needs elevated permissions — reading from /proc in certain ways, running netstat, checking hardware sensors — add a sudoers entry:

# /etc/sudoers.d/nagios
nagios ALL=(root) NOPASSWD: /usr/lib64/nagios/plugins/check_disk
nagios ALL=(root) NOPASSWD: /usr/local/lib/nagios/plugins/check_process_memory.py

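And call the check in nrpe.cfg through sudo, using sudo's absolute path. If your sudoers sets Defaults requiretty, as older CentOS releases do, also add Defaults:nagios !requiretty to the sudoers file, because NRPE has no terminal:

command[check_disk_root]=/usr/bin/sudo /usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /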

Don't give nagios a blanket NOPASSWD: ALL — that's a privilege escalation waiting to happen. Be specific about which commands it's allowed to run.
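If you manage many hosts, a crude lint can catch the blanket rule before it ships. This is a sketch only, not a sudoers parser: it looks at simple one-line user rules and ignores aliases, groups, and line continuations.

```python
def overly_broad_rules(sudoers_text, user="nagios"):
    """Return sudoers lines that let user run every command without a password."""
    flagged = []
    for line in sudoers_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment
        if stripped.split()[0] != user:
            continue  # rule for someone else (or a Defaults line)
        _, found, commands = stripped.partition("NOPASSWD:")
        if found and "ALL" in [c.strip() for c in commands.split(",")]:
            flagged.append(stripped)
    return flagged


SAMPLE = """\
# /etc/sudoers.d/nagios
nagios ALL=(root) NOPASSWD: /usr/lib64/nagios/plugins/check_disk
nagios ALL=(root) NOPASSWD: ALL
"""

print(overly_broad_rules(SAMPLE))
```

For the sample text, only the second rule is reported.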

Summary

NRPE is straightforward once you understand what's happening:

  1. Install NRPE and the Nagios plugins on each monitored host
  2. Configure allowed_hosts and define your commands in nrpe.cfg
  3. Open port 5666 from the Nagios server only
  4. Wire up check_nrpe commands on the Nagios server side
  5. Test from the command line with check_nrpe -H host -c check_name before you trust the config
  6. Debug as the nagios user, not as yourself

The nagios user environment thing is where most NRPE debugging time goes. Check that first when something works manually but fails through NRPE.