September 14, 2016 · Marie H.

Custom Health Monitors on Citrix NetScaler with Ansible


Load balancers are only as smart as their health checks. NetScaler ships with a solid set of built-in monitors — HTTP, TCP, SSL, DNS, and more — but they all share a fundamental limitation: they check connectivity and response codes, not application-level health. For many services, that is not enough.

This post walks through deploying custom health monitors on Citrix NetScaler using Ansible and the NITRO API. The end result is a fully automated, idempotent playbook that uploads a custom Perl probe script, configures a monitor, creates a backend service, and binds the monitor to that service.

Why Custom Monitors

The built-in HTTP monitor is the workhorse of most setups. You point it at a URL, give it an expected status code, and optionally match on response content. That works well for REST services that return a clean 200 with a meaningful body when healthy.

But real infrastructure is messier. SOAP services have different conventions entirely. Internal health endpoints often return HTTP 200 regardless of actual state, encoding the real status inside an XML body. Legacy services may not have a health endpoint at all and need a synthetic check. Some services are "up" in the sense that they accept connections but "degraded" in a way that matters to callers — a background job queue backed up, a database connection pool exhausted, a downstream dependency unreachable.

For these cases, NetScaler supports custom USER monitors. You write a Perl script, deploy it to the appliance, and NetScaler's monitoring engine calls it periodically. The script does whatever it needs to — make an HTTP request, parse XML, check a custom condition — and returns 0 for healthy or 1 for unhealthy. NetScaler handles the rest: tracking state, counting retries, marking services up or down.

The KAS Framework

NetScaler's custom monitor system is built around the KAS (Keep-Alive Service) framework. Your Perl script imports Netscaler::KAS and calls probe() with a reference to your check subroutine. The monitoring engine calls your subroutine on each probe interval, passing the backend IP, port, and any script arguments configured on the monitor.

Your subroutine returns a two-element list: a status code (0 for online, 1 for offline) and a message string that shows up in the monitor logs.

Scripts live on the appliance at /nsconfig/monitors/ and must be executable.

Here is a complete example — a monitor that POSTs a SOAP request to a health check endpoint and inspects the XML response:

#!/usr/bin/perl
use warnings;
use strict;
use Netscaler::KAS;
use LWP::UserAgent;
use HTTP::Request;
use IO::Socket::SSL qw( SSL_VERIFY_NONE );
use Socket;

my $soap_body = '<?xml version="1.0" encoding="utf-8"?>
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
  <soap12:Body>
    <GetServiceStatus xmlns="http://ws.example.com/API" />
  </soap12:Body>
</soap12:Envelope>';

sub check_service_health {
    my @args = @_;
    my $ip   = inet_aton($args[0]);
    my $port = $args[1];
    my $host = gethostbyaddr($ip, AF_INET) || $args[0];  # fall back to the IP if reverse DNS fails
    my $script_args = $args[2];

    $script_args =~ /path=(\S+)/ or return (1, "Invalid argument format");
    my $path = $1;

    my $endpoint = "https://$host:$port/$path/API.asmx";

    my $ua = LWP::UserAgent->new(timeout => 5);  # keep the probe well under the monitor interval
    $ua->ssl_opts(verify_hostname => 0, SSL_verify_mode => SSL_VERIFY_NONE);
    my $req = HTTP::Request->new(POST => $endpoint);
    $req->content($soap_body);
    $req->content_type("text/xml; charset=utf-8");
    my $resp = $ua->request($req);
    return (1, "HTTP error: " . $resp->status_line) unless $resp->is_success;
    my $xml  = $resp->decoded_content;

    if ($xml =~ /OnlineConnected/) {
        return (0, "Online");
    } else {
        return (1, "Offline");
    }
}

probe(\&check_service_health);

A few things worth noting. The script args are passed as a raw string — in this case path=some/api/path — so you need to parse them yourself. The gethostbyaddr call is there because the backend service uses virtual hosting and the correct Host header matters; the SOAP endpoint cares what hostname it sees. The SSL verification is disabled here because internal services often use self-signed certs — adjust that for your environment.

The probe(\&check_service_health) call at the bottom is what registers your subroutine with the KAS framework. Without it the script does nothing.

Deploying with Ansible

Here is where it gets interesting. Deploying a custom monitor involves five distinct steps:

  1. Upload the Perl script to the appliance
  2. Make it executable
  3. Configure the monitor resource
  4. Create the backend service
  5. Bind the monitor to the service

The NITRO API handles file uploads through the systemfile resource. File content is passed as a base64-encoded string. Ansible's b64encode filter and the lookup('file', ...) plugin make this straightforward — you encode the script inline without needing a separate step or a temporary file on the control node.
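To make the encoding step concrete, here is a shell sketch of the payload that upload task produces. The stand-in script content, NSIP, and credentials are placeholders, and the commented curl call is an assumption about the raw NITRO request shape — the Ansible module handles all of this for you:

```shell
# Build the NITRO systemfile payload by hand, mirroring what the Ansible
# task sends. A one-line stand-in script keeps the sketch self-contained.
printf '#!/usr/bin/perl\n' > /tmp/check_service_health.pl
CONTENT=$(base64 < /tmp/check_service_health.pl | tr -d '\n')
cat > /tmp/payload.json <<EOF
{
  "systemfile": {
    "filename": "check_service_health.pl",
    "filelocation": "/nsconfig/monitors",
    "filecontent": "$CONTENT",
    "fileencoding": "BASE64"
  }
}
EOF
# Roughly what goes over the wire (placeholder NSIP and credentials):
# curl -k -H "X-NITRO-USER: nsroot" -H "X-NITRO-PASS: changeme" \
#      -H "Content-Type: application/json" \
#      -d @/tmp/payload.json \
#      https://192.168.1.100/nitro/v1/config/systemfile
cat /tmp/payload.json
```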

- name: Deploy custom service health monitor
  hosts: localhost
  gather_facts: false
  vars_files:
    - vars/netscaler.yml

  tasks:
    - name: Upload monitor script to NetScaler
      netscaler_nitro_request:
        nsip: "{{ nsip }}"
        nitro_user: "{{ nitro_user }}"
        nitro_pass: "{{ nitro_pass }}"
        operation: add
        resource: systemfile
        attributes:
          filename: check_service_health.pl
          filelocation: /nsconfig/monitors
          filecontent: "{{ lookup('file', 'files/check_service_health.pl') | b64encode }}"
          fileencoding: BASE64

    - name: Make monitor script executable
      command: >
        ssh {{ nitro_user }}@{{ nsip }}
        "shell chmod +x /nsconfig/monitors/check_service_health.pl"

    - name: Configure LB monitor
      netscaler_lb_monitor:
        nsip: "{{ nsip }}"
        nitro_user: "{{ nitro_user }}"
        nitro_pass: "{{ nitro_pass }}"
        monitorname: "{{ monitor_name }}"
        type: USER
        scriptname: check_service_health.pl
        scriptargs: "path={{ service_api_path }}"
        interval: "{{ monitor_interval | default(10) }}"
        alertretries: "{{ monitor_alert_retries | default(3) }}"

    - name: Create backend service
      netscaler_service:
        nsip: "{{ nsip }}"
        nitro_user: "{{ nitro_user }}"
        nitro_pass: "{{ nitro_pass }}"
        name: "{{ service_name }}"
        ip: "{{ service_ip }}"
        port: "{{ service_port }}"
        servicetype: TCP

    - name: Bind monitor to service
      netscaler_nitro_request:
        nsip: "{{ nsip }}"
        nitro_user: "{{ nitro_user }}"
        nitro_pass: "{{ nitro_pass }}"
        operation: add
        resource: lbmonitor_service_binding
        attributes:
          monitorname: "{{ monitor_name }}"
          servicename: "{{ service_name }}"

The chmod step via SSH is the ugliest part of this playbook. There is no NITRO API endpoint for file permissions on NetScaler — you have to SSH to the management interface and run the command through the NetScaler shell. It works, but it sticks out against otherwise API-driven automation.

The Vars File

The vars/netscaler.yml file holds the connection parameters and service-specific values:

nsip: "192.168.1.100"
nitro_user: "nsroot"
nitro_pass: "changeme"

monitor_name: "mon_example_service"
monitor_interval: 10
monitor_alert_retries: 3

service_name: "svc_example_backend"
service_ip: "10.0.1.50"
service_port: 443
service_api_path: "ExampleService"

Do not put credentials in a plaintext vars file in production. Use Ansible Vault. The pattern is straightforward — encrypt the vars file with ansible-vault encrypt vars/netscaler.yml and pass --ask-vault-pass or configure a vault password file when running the playbook. The nsroot account has full administrative access to the appliance; treat those credentials accordingly.

Idempotency Gotchas

The netscaler_lb_monitor and netscaler_service modules handle idempotency correctly. Run the playbook twice and they check whether the resource exists, compare the desired state, and only make changes when needed. They behave like proper Ansible modules.

netscaler_nitro_request does not. It is a thin wrapper around the NITRO API and does exactly what you tell it to do. If you run operation: add on a resource that already exists, NITRO returns an error and the task fails. This affects two tasks in the playbook above: the file upload and the monitor-to-service binding.

There are a few ways to handle this. The cleanest is to use operation: update instead of add for resources you expect to already exist after the first run, but that fails on the first run when the resource does not exist yet. Some people reach for ignore_errors: true on the add tasks, which works but is sloppy — you will silently swallow real errors.

A more defensible approach is to add a check task first using operation: get and register the result, then conditionally run the add only when the resource is absent. That adds verbosity but makes the behavior explicit. For the file upload specifically, you may also want to track whether the script content has changed and use operation: update when it has.
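Sketched as tasks, that check-then-add pattern might look like the following. Two assumptions to verify against your module version: that operation: get accepts a name argument for this resource, and that the registered result exposes a nitro_errorcode field:

```yaml
- name: Check whether the monitor binding already exists
  netscaler_nitro_request:
    nsip: "{{ nsip }}"
    nitro_user: "{{ nitro_user }}"
    nitro_pass: "{{ nitro_pass }}"
    operation: get
    resource: lbmonitor_service_binding
    name: "{{ monitor_name }}"
  register: binding_check
  failed_when: false    # a missing binding is expected on the first run

- name: Bind monitor to service
  netscaler_nitro_request:
    nsip: "{{ nsip }}"
    nitro_user: "{{ nitro_user }}"
    nitro_pass: "{{ nitro_pass }}"
    operation: add
    resource: lbmonitor_service_binding
    attributes:
      monitorname: "{{ monitor_name }}"
      servicename: "{{ service_name }}"
  when: binding_check.nitro_errorcode | default(0) != 0
```

The same shape works for the systemfile upload; there the get result also tells you whether to switch to operation: update when the script content has changed.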

For a one-time deployment playbook this is manageable. For something you run frequently against a live appliance, the idempotency gaps in netscaler_nitro_request are worth thinking through carefully.

Testing the Monitor

Before binding a custom monitor to a production service, test it in isolation. From the NetScaler shell (shell from the CLI prompt):

nsconmsg -d current -g mon_probe_failed

This shows currently failing probes with details about what went wrong. If your probe script has a syntax error or is failing to connect, this is where you will see it.

To check the current state of your monitors:

stat lb monitor

This gives you the state (UP/DOWN/UNKNOWN), the last response, and probe statistics. Run it after binding the monitor and give it a few probe intervals to see results.

Testing manually before binding saves significant debugging time. A monitor in UNKNOWN state does not affect service availability, but binding a broken monitor to a service that was previously healthy starts the failed-probe countdown, and once the retry limit is exhausted the service is marked DOWN. Know that your script works before you connect it to anything real.

Honest Take

The NITRO API is genuinely powerful. The data model is comprehensive, the REST interface is consistent, and it covers the vast majority of NetScaler configuration. When everything works, Ansible-driven NetScaler automation is a significant improvement over manual CLI work or brittle expect scripts.

But the Ansible module coverage is incomplete. The community modules handle the most common operations — load balancing, SSL, services, virtual servers — and they handle them well. When you need something less common, you are back to netscaler_nitro_request and digging through the NITRO API documentation, which is extensive but not always clear about which operations are actually supported on your firmware version.

The chmod via SSH is the most obvious rough edge. It would not be hard to add a NITRO API endpoint for file permissions; it just apparently was not a priority. So you have an otherwise API-driven automation playbook with an SSH command sitting in the middle of it. It works, and I have not found a cleaner alternative.

Once the pattern is established and the idempotency gaps are handled, the automation runs reliably. The playbook above has been in use for several months across a handful of appliances with different service configurations, and it behaves consistently. The initial investment in understanding where the modules stop and raw API calls begin is worth it.