October 19, 2016 Marie H.

Managing Nagios Configuration with Ansible

Photo by Kvistholt Photography on Unsplash

I wrote about automating Nagios with Puppet's exported resources pattern back in 2015. That approach works well if you're already running a Puppet environment with PuppetDB, but the setup cost is real. By mid-2016 most new projects I'm seeing are going with Ansible, and for good reason: no agent, no central database, and Jinja2 templates for config generation are intuitive enough that you can hand them to someone who's never used Ansible before and they'll figure out what's happening within a few minutes.

This post covers the Ansible approach to keeping Nagios configurations in sync. It's a different set of tradeoffs from the Puppet approach, and I'll be honest about where you give something up.

Why Ansible Works Well Here

The core of Nagios config management is: given a list of hosts and what we want to check on them, generate the right config files and reload Nagios when they change. That's a straightforward data-to-template problem, and Ansible's Jinja2 templating handles it cleanly.

There's also a nagios Ansible module for managing Nagios objects at runtime — scheduling downtime, silencing notifications — which is useful during deploys. I'll cover that separately below.

No PuppetDB dependency means less infrastructure. You're relying on your Ansible inventory, which you already maintain, as the source of truth for what hosts exist and what groups they're in. That's the right place for that information anyway.
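
To make that concrete, here's a sketch of the kind of inventory the templates below loop over. Hostnames and addresses are made up; the group names (webservers, dbservers, nagios_server) are the ones the rest of this post assumes:

```ini
# inventory/hosts — illustrative example
[webservers]
web1.example.com  ansible_host=10.0.1.11
web2.example.com  ansible_host=10.0.1.12

[dbservers]
db1.example.com   ansible_host=10.0.2.21

[nagios_server]
nagios.example.com
```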

Directory Structure

Inside a nagios role:

roles/nagios/
  tasks/
    main.yml
  templates/
    hosts.cfg.j2
    services.cfg.j2
    nrpe.cfg.j2
  handlers/
    main.yml
  vars/
    main.yml

Host-specific monitoring thresholds live in group_vars/:

group_vars/
  webservers.yml   # HTTP check intervals, SSL expiry warning threshold
  all.yml          # NRPE disk and memory warning/critical levels
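
As a sketch, those files carry the variables the templates below reference — the values here are illustrative defaults, not recommendations:

```yaml
# group_vars/webservers.yml — illustrative values
ssl_cert_warning_days: 30
ssl_cert_critical_days: 14

# group_vars/all.yml — illustrative values
nagios_server_ip: 10.0.0.5
nrpe_disk_warning: "20%"
nrpe_disk_critical: "10%"
nrpe_mem_warning: 80
nrpe_mem_critical: 90
```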

The Jinja2 Templates

This is the part that does the real work. The hosts.cfg.j2 template loops over a group from Ansible inventory:

# Managed by Ansible — do not edit by hand

{% for host in groups['webservers'] %}
define host {
    host_name               {{ host }}
    alias                   {{ hostvars[host]['ansible_hostname'] | default(host) }}
    address                 {{ hostvars[host]['ansible_host'] | default(host) }}
    hostgroups              webservers,linux-servers
    use                     linux-server
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_period     workhours
}
{% endfor %}

{% for host in groups['dbservers'] | default([]) %}
define host {
    host_name               {{ host }}
    alias                   {{ hostvars[host]['ansible_hostname'] | default(host) }}
    address                 {{ hostvars[host]['ansible_host'] | default(host) }}
    hostgroups              dbservers,linux-servers
    use                     linux-server
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_period     workhours
}
{% endfor %}
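
Those two loops are identical except for the group name. If the duplication bothers you, a nested loop over the group names produces the same output with one block to maintain:

```jinja
{% for group in ['webservers', 'dbservers'] %}
{% for host in groups[group] | default([]) %}
define host {
    host_name               {{ host }}
    alias                   {{ hostvars[host]['ansible_hostname'] | default(host) }}
    address                 {{ hostvars[host]['ansible_host'] | default(host) }}
    hostgroups              {{ group }},linux-servers
    use                     linux-server
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_period     workhours
}
{% endfor %}
{% endfor %}
```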

The services.cfg.j2 generates service checks by group:

{% for host in groups['webservers'] %}
define service {
    host_name               {{ host }}
    service_description     HTTP
    check_command           check_http
    use                     generic-service
    check_interval          5
    max_check_attempts      3
}

# note: the SSL check passes arguments via $ARG1$ — your check_http
# command definition must forward them (the stock one may not)
define service {
    host_name               {{ host }}
    service_description     SSL Certificate Expiry
    check_command           check_http!-S --sni -C {{ ssl_cert_warning_days | default(30) }},{{ ssl_cert_critical_days | default(14) }}
    use                     generic-service
    check_interval          1440
    max_check_attempts      1
}

define service {
    host_name               {{ host }}
    service_description     Disk Usage
    check_command           check_nrpe!check_disk
    use                     generic-service
    check_interval          10
    max_check_attempts      3
}

define service {
    host_name               {{ host }}
    service_description     Memory Usage
    check_command           check_nrpe!check_mem
    use                     generic-service
    check_interval          10
    max_check_attempts      3
}
{% endfor %}
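
Worth noting: Nagios can also attach a service to every member of a hostgroup via hostgroup_name, which shrinks this template to one definition per check with no loop at all — assuming a webservers hostgroup object is defined (the host template above already assigns hosts to it). The tradeoff is that you can no longer vary check parameters per host:

```
define service {
    hostgroup_name          webservers
    service_description     HTTP
    check_command           check_http
    use                     generic-service
    check_interval          5
    max_check_attempts      3
}
```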

The NRPE config template (nrpe.cfg.j2) gets deployed to the monitored hosts:

# Managed by Ansible — do not edit by hand
server_port=5666
allowed_hosts=127.0.0.1,{{ nagios_server_ip }}
dont_blame_nrpe=0

command[check_disk]=/usr/lib/nagios/plugins/check_disk \
    -w {{ nrpe_disk_warning | default('20%') }} \
    -c {{ nrpe_disk_critical | default('10%') }} \
    -p /

# check_mem is not part of the stock nagios-plugins package —
# adjust the path to wherever your third-party plugin is installed
command[check_mem]=/usr/lib/nagios/plugins/check_mem \
    -w {{ nrpe_mem_warning | default(80) }} \
    -c {{ nrpe_mem_critical | default(90) }}

The Playbook

- name: Configure Nagios server
  hosts: nagios_server
  roles:
    - nagios

# roles/nagios/tasks/main.yml
- name: Install Nagios packages
  apt:
    name:
      - nagios3
      - nagios-nrpe-plugin
    state: present

- name: Deploy Nagios hosts config
  template:
    src: hosts.cfg.j2
    dest: /etc/nagios3/conf.d/ansible-hosts.cfg
    owner: nagios
    group: nagios
    mode: '0644'
  notify: validate and reload nagios

- name: Deploy Nagios services config
  template:
    src: services.cfg.j2
    dest: /etc/nagios3/conf.d/ansible-services.cfg
    owner: nagios
    group: nagios
    mode: '0644'
  notify: validate and reload nagios

The handler is where I enforce the validate-before-reload pattern:

# roles/nagios/handlers/main.yml
- name: validate and reload nagios
  block:
    - name: Validate Nagios configuration
      command: nagios3 -v /etc/nagios3/nagios.cfg
      register: nagios_verify
      changed_when: false
      failed_when: false

    - name: Fail if config is invalid
      fail:
        msg: "Nagios config validation failed: {{ nagios_verify.stdout }}"
      when: nagios_verify.rc != 0

    - name: Reload Nagios
      service:
        name: nagios3
        state: reloaded

This pattern matters. If you let Ansible call service: state=reloaded directly after a template change and the template has a syntax error, Nagios reloads against a broken config. Validation first means a bad template deploy fails loud and early, and your existing monitoring keeps running.

The Nagios Module for Runtime Management

The nagios Ansible module is for managing Nagios objects at runtime — not config files, but live Nagios state. It works by writing to Nagios's external command file, so the task has to run on the Nagios server itself; from a play targeting other hosts, that means a delegate_to. The most useful application during a rolling deploy is scheduling downtime so you don't get paged for expected alerts:

- name: Schedule downtime for web server during deploy
  nagios:
    action: downtime
    host: "{{ inventory_hostname }}"
    services: HTTP
    minutes: 15
    author: ansible-deploy
    comment: "Rolling deploy in progress"
  delegate_to: "{{ groups['nagios_server'][0] }}"

This runs against each web server as it's cycled through the rolling deploy, so you're not flooded with HTTP check failures during restarts. The downtime window is short — 15 minutes — and Nagios resumes normal monitoring automatically once it expires.
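
To make the rolling-deploy context concrete, here's a sketch of where that task sits. The play layout and task names are illustrative, not part of the role above; the nagios module writes to the external command file, hence the delegate_to:

```yaml
- name: Rolling deploy of the web tier
  hosts: webservers
  serial: 1                      # take one host out of rotation at a time
  pre_tasks:
    - name: Schedule Nagios downtime before touching this host
      nagios:
        action: downtime
        host: "{{ inventory_hostname }}"
        services: HTTP
        minutes: 15
        author: ansible-deploy
        comment: "Rolling deploy in progress"
      delegate_to: "{{ groups['nagios_server'][0] }}"
  tasks:
    - name: Deploy the application
      debug:
        msg: "deploy steps go here"
```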

Also Deploy NRPE to Monitored Hosts

Don't forget the other half. The Nagios server config is useless if NRPE isn't configured correctly on the monitored hosts. A separate play handles that:

- name: Configure NRPE on monitored hosts
  hosts: webservers:dbservers
  tasks:
    - name: Install NRPE daemon
      apt:
        name: nagios-nrpe-server
        state: present

    - name: Deploy NRPE config
      template:
        src: nrpe.cfg.j2
        dest: /etc/nagios/nrpe.cfg
        owner: root
        group: nagios
        mode: '0640'
      notify: restart nrpe

  handlers:
    - name: restart nrpe
      service:
        name: nagios-nrpe-server
        state: restarted

The Gotcha: Don't Use lineinfile for Nagios Configs

I've seen people try to manage Nagios configs by patching specific lines with lineinfile. Don't. Nagios config files have implicit ordering requirements for some directives, and define blocks need to be self-contained. If you use lineinfile to add a service definition, you risk half-written blocks, duplicate stanzas on repeated runs, and configs that pass nagios -v validation but behave unexpectedly at runtime.

Template the whole file. If you need per-host customization, put it in host_vars/ and reference it in the template. This keeps the config generation path simple and predictable.
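
If you want a belt-and-braces check on top of nagios -v, a few lines of Python can catch exactly the duplicate-stanza failure mode described above in a generated hosts file. This is my own sketch, not part of the role:

```python
# A quick lint for generated Nagios host files: flag duplicate
# host_name entries — the classic lineinfile failure mode.
import re
from collections import Counter

def duplicate_hosts(cfg_text):
    """Return host_name values defined more than once in cfg_text."""
    names = re.findall(r'^\s*host_name\s+(\S+)', cfg_text, re.MULTILINE)
    return sorted(name for name, count in Counter(names).items() if count > 1)

# Example: two stanzas for web1 slipped into the file
sample = """\
define host {
    host_name    web1
}
define host {
    host_name    web1
}
define host {
    host_name    db1
}
"""
print(duplicate_hosts(sample))  # -> ['web1']
```

Wire it into the deploy as a command task between templating and the reload handler if you want it enforced automatically.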

Honest Comparison with the Puppet Approach

The Puppet exported resources approach has one significant advantage: nodes register themselves. Add a new server to your Puppet environment with the right class, and it shows up in Nagios on the next puppet run. No one has to update the Ansible inventory.

With Ansible, you explicitly add the new host to inventory before running the playbook. That's an extra manual step.

I've come to prefer the Ansible model anyway, for two reasons. First, the Ansible inventory is already the authoritative list of infrastructure in most shops I work with — adding a host to inventory is something that happens as part of provisioning regardless of monitoring. Second, the "magic" of self-registration is also a footgun: a misconfigured puppet run on a new host can create Nagios definitions before the host is actually ready to be monitored, and then you're chasing phantom alerts. Explicit is better here.

Less magic, more control. That's the tradeoff and I'll take it.