Managing Nagios Configuration with Ansible
I wrote about automating Nagios with Puppet's exported resources pattern back in 2015. That approach works well if you're already running a Puppet environment with PuppetDB, but the setup cost is real. By mid-2016 most new projects I'm seeing are going with Ansible, and for good reason: no agent, no central database, and Jinja2 templates for config generation are intuitive enough that you can hand them to someone who's never used Ansible before and they'll figure out what's happening within a few minutes.
This post covers the Ansible approach to keeping Nagios configurations in sync. It's a different set of tradeoffs from the Puppet approach, and I'll be honest about where you give something up.
Why Ansible Works Well Here
The core of Nagios config management is: given a list of hosts and what we want to check on them, generate the right config files and reload Nagios when they change. That's a straightforward data-to-template problem, and Ansible's Jinja2 templating handles it cleanly.
There's also a nagios Ansible module for managing Nagios objects at runtime — scheduling downtime, acknowledging alerts — which is useful during deploys. I'll cover that separately below.
No PuppetDB dependency means less infrastructure. You're relying on your Ansible inventory, which you already maintain, as the source of truth for what hosts exist and what groups they're in. That's the right place for that information anyway.
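For reference, this is the shape of inventory the rest of the post assumes. The hostnames and addresses are placeholders, but the group names are the ones the templates below loop over:

```ini
# inventory/hosts
[nagios_server]
nagios.internal

[webservers]
web1.internal ansible_host=10.0.1.11
web2.internal ansible_host=10.0.1.12

[dbservers]
db1.internal ansible_host=10.0.2.21
```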
Directory Structure
Inside a nagios role:
roles/nagios/
  tasks/
    main.yml
  templates/
    hosts.cfg.j2
    services.cfg.j2
    nrpe.cfg.j2
  handlers/
    main.yml
  vars/
    main.yml
Host-specific monitoring thresholds live in group_vars/:
group_vars/
  webservers.yml   # HTTP check intervals, SSL expiry warning threshold
  all.yml          # NRPE disk and memory warning/critical levels
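Concretely, these are the variables the templates below read, with made-up values. Most of them have a default() fallback in the templates, so the files can stay sparse; nagios_server_ip is the exception and has to be set:

```yaml
# group_vars/webservers.yml
ssl_cert_warning_days: 21
ssl_cert_critical_days: 7

# group_vars/all.yml
nagios_server_ip: 10.0.0.5       # address NRPE should accept checks from
nrpe_disk_warning: '15%'
nrpe_disk_critical: '5%'
nrpe_mem_warning: 85
nrpe_mem_critical: 95
```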
The Jinja2 Templates
This is the part that does the real work. The hosts.cfg.j2 template loops over a group from Ansible inventory:
# Managed by Ansible — do not edit by hand
{% for host in groups['webservers'] %}
define host {
    host_name              {{ host }}
    alias                  {{ hostvars[host]['ansible_hostname'] | default(host) }}
    address                {{ hostvars[host]['ansible_host'] | default(host) }}
    hostgroups             webservers,linux-servers
    use                    linux-server
    max_check_attempts     3
    check_interval         5
    retry_interval         1
    check_period           24x7
    notification_period    workhours
}
{% endfor %}

{% for host in groups['dbservers'] | default([]) %}
define host {
    host_name              {{ host }}
    alias                  {{ hostvars[host]['ansible_hostname'] | default(host) }}
    address                {{ hostvars[host]['ansible_host'] | default(host) }}
    hostgroups             dbservers,linux-servers
    use                    linux-server
    max_check_attempts     3
    check_interval         5
    retry_interval         1
    check_period           24x7
    notification_period    workhours
}
{% endfor %}
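The two loops above are identical apart from the group name. If that duplication bothers you, a nested loop collapses them. This is a sketch, assuming every inventory group maps to a Nagios hostgroup of the same name:

```jinja
{% for group in ['webservers', 'dbservers'] %}
{% for host in groups[group] | default([]) %}
define host {
    host_name              {{ host }}
    alias                  {{ hostvars[host]['ansible_hostname'] | default(host) }}
    address                {{ hostvars[host]['ansible_host'] | default(host) }}
    hostgroups             {{ group }},linux-servers
    use                    linux-server
    max_check_attempts     3
    check_interval         5
    retry_interval         1
    check_period           24x7
    notification_period    workhours
}
{% endfor %}
{% endfor %}
```

Whether the indirection is worth it depends on how many groups you monitor; with two, the explicit version is easier to scan.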
The services.cfg.j2 generates service checks by group:
{% for host in groups['webservers'] %}
define service {
    host_name              {{ host }}
    service_description    HTTP
    check_command          check_http
    use                    generic-service
    check_interval         5
    max_check_attempts     3
}

define service {
    host_name              {{ host }}
    service_description    SSL Certificate Expiry
    check_command          check_http!-S --sni -C {{ ssl_cert_warning_days | default(30) }},{{ ssl_cert_critical_days | default(14) }}
    use                    generic-service
    check_interval         1440
    max_check_attempts     1
}

define service {
    host_name              {{ host }}
    service_description    Disk Usage
    check_command          check_nrpe!check_disk
    use                    generic-service
    check_interval         10
    max_check_attempts     3
}

define service {
    host_name              {{ host }}
    service_description    Memory Usage
    check_command          check_nrpe!check_mem
    use                    generic-service
    check_interval         10
    max_check_attempts     3
}
{% endfor %}
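A variant worth considering once the service list grows: drive the template from data instead of hand-writing each block. This is a sketch, not part of the role above; nagios_services is a variable I'm inventing here, a list of dicts with description, command, and optional interval/attempts keys that you'd define in vars/main.yml:

```jinja
{# services.cfg.j2, data-driven variant. Assumes a nagios_services list in vars/main.yml #}
{% for host in groups['webservers'] %}
{% for svc in nagios_services %}
define service {
    host_name              {{ host }}
    service_description    {{ svc.description }}
    check_command          {{ svc.command }}
    use                    generic-service
    check_interval         {{ svc.interval | default(5) }}
    max_check_attempts     {{ svc.attempts | default(3) }}
}
{% endfor %}
{% endfor %}
```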
The NRPE config template (nrpe.cfg.j2) gets deployed to the monitored hosts. One thing to watch: NRPE's config parser reads one line at a time and doesn't support backslash continuations, so each command[...] definition has to stay on a single line, however long it gets:
# Managed by Ansible — do not edit by hand
server_port=5666
allowed_hosts=127.0.0.1,{{ nagios_server_ip }}
dont_blame_nrpe=0
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w {{ nrpe_disk_warning | default('20%') }} -c {{ nrpe_disk_critical | default('10%') }} -p /
command[check_mem]=/usr/lib/nagios/plugins/check_mem -w {{ nrpe_mem_warning | default(80) }} -c {{ nrpe_mem_critical | default(90) }}
The Playbook
- name: Configure Nagios server
  hosts: nagios_server
  roles:
    - nagios

# roles/nagios/tasks/main.yml
- name: Install Nagios packages
  apt:
    name:
      - nagios3
      - nagios-nrpe-plugin
    state: present

- name: Deploy Nagios hosts config
  template:
    src: hosts.cfg.j2
    dest: /etc/nagios3/conf.d/ansible-hosts.cfg
    owner: nagios
    group: nagios
    mode: '0644'
  notify: validate and reload nagios

- name: Deploy Nagios services config
  template:
    src: services.cfg.j2
    dest: /etc/nagios3/conf.d/ansible-services.cfg
    owner: nagios
    group: nagios
    mode: '0644'
  notify: validate and reload nagios
The handler is where I enforce the validate-before-reload pattern:
# roles/nagios/handlers/main.yml
- name: validate and reload nagios
  block:
    - name: Validate Nagios configuration
      command: nagios3 -v /etc/nagios3/nagios.cfg
      register: nagios_verify
      changed_when: false
      failed_when: false   # let the when: checks below decide what a non-zero exit means

    - name: Reload Nagios
      service:
        name: nagios3
        state: reloaded
      when: nagios_verify.rc == 0

    - name: Fail if config is invalid
      fail:
        msg: "Nagios config validation failed. Review the template output."
      when: nagios_verify.rc != 0
This pattern matters. If you let Ansible call service: state=reloaded directly after a template change and the template has a syntax error, Nagios reloads against a broken config. Validation first means a bad template deploy fails loud and early, and your existing monitoring keeps running.
The Nagios Module for Runtime Management
The nagios Ansible module is for managing Nagios objects at runtime — not config files, but live Nagios state. It works by writing to Nagios's external command file, so the task has to execute on the Nagios server itself; from a play targeting other hosts, that means delegate_to. The most useful application during a rolling deploy is acknowledging or scheduling downtime so you don't get paged for expected alerts:
- name: Schedule downtime for web server during deploy
  nagios:
    action: downtime
    host: "{{ inventory_hostname }}"
    services: HTTP
    minutes: 15
    author: ansible-deploy
    comment: "Rolling deploy in progress"
  delegate_to: "{{ groups['nagios_server'][0] }}"
This runs against each web server as it's cycled through the rolling deploy, so you're not flooded with HTTP check failures during restarts. The downtime window is short — 15 minutes — and Nagios resumes normal monitoring automatically once it expires.
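In context, the downtime task sits in the pre_tasks of a serial rolling-deploy play. A sketch, where the webapp role is a stand-in for whatever actually performs the deploy:

```yaml
- name: Rolling deploy with Nagios downtime
  hosts: webservers
  serial: 1              # cycle one host at a time
  pre_tasks:
    - name: Schedule downtime for this host's HTTP check
      nagios:
        action: downtime
        host: "{{ inventory_hostname }}"
        services: HTTP
        minutes: 15
        author: ansible-deploy
        comment: "Rolling deploy in progress"
      delegate_to: "{{ groups['nagios_server'][0] }}"
  roles:
    - webapp             # hypothetical deploy role, not shown in this post
```

If a deploy finishes early, the module's delete_downtime action can cancel the window in post_tasks rather than waiting for it to expire.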
Also Deploy NRPE to Monitored Hosts
Don't forget the other half. The Nagios server config is useless if NRPE isn't configured correctly on the monitored hosts. A separate play handles that:
- name: Configure NRPE on monitored hosts
  hosts: webservers:dbservers
  tasks:
    - name: Install NRPE daemon
      apt:
        name: nagios-nrpe-server
        state: present

    - name: Deploy NRPE config
      template:
        src: nrpe.cfg.j2
        dest: /etc/nagios/nrpe.cfg
        owner: root
        group: nagios
        mode: '0640'
      notify: restart nrpe

  handlers:
    - name: restart nrpe
      service:
        name: nagios-nrpe-server
        state: restarted
The Gotcha: Don't Use lineinfile for Nagios Configs
I've seen people try to manage Nagios configs by patching specific lines with lineinfile. Don't. Nagios config files have implicit ordering requirements for some directives, and define blocks need to be self-contained. If you use lineinfile to add a service definition, you risk half-written blocks, duplicate stanzas on repeated runs, and configs that pass nagios -v validation but behave unexpectedly at runtime.
Template the whole file. If you need per-host customization, put it in host_vars/ and reference it in the template. This keeps the config generation path simple and predictable.
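As a concrete example of that host_vars path, tightening the NRPE disk thresholds on a single box takes two lines. The hostname here is a placeholder, and the default() fallbacks in nrpe.cfg.j2 cover every other host:

```yaml
# host_vars/db1.internal.yml  (placeholder hostname)
nrpe_disk_warning: '30%'
nrpe_disk_critical: '15%'
```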
Honest Comparison with the Puppet Approach
The Puppet exported resources approach has one significant advantage: nodes register themselves. Add a new server to your Puppet environment with the right class, and it shows up in Nagios on the next puppet run. No one has to update the Ansible inventory.
With Ansible, you explicitly add the new host to inventory before running the playbook. That's an extra manual step.
I've come to prefer the Ansible model anyway, for two reasons. First, the Ansible inventory is already the authoritative list of infrastructure in most shops I work with — adding a host to inventory is something that happens as part of provisioning regardless of monitoring. Second, the "magic" of self-registration is also a footgun: a misconfigured puppet run on a new host can create Nagios definitions before the host is actually ready to be monitored, and then you're chasing phantom alerts. Explicit is better here.
Less magic, more control. That's the tradeoff and I'll take it.