Sensu Go: Modern Infrastructure Monitoring
During the Rails-to-GCP migration at Privia Health, we were also running Sensu monitoring checks across both the old Rackspace infrastructure and the new GCP environment. We were on Sensu Classic (0.x) when I arrived and were in the process of migrating to Sensu Go (5.x). The architectural differences between the two are significant enough that it's worth explaining both.
What Sensu Is and How It Differs from Nagios
Nagios uses a centralized polling model: the Nagios server connects out to monitored hosts to run checks (via NRPE or SSH), or hosts submit passive check results back on their own schedule. Everything is defined in flat config files on the Nagios server. The host/service model is rigid — a service belongs to a host, and that's the primary organizational unit.
Sensu flips this. Agents run on monitored nodes and execute checks locally, then publish results back to the backend over a persistent WebSocket connection. The backend doesn't reach into hosts. This works much better in cloud environments where IPs are ephemeral and you're not maintaining a static list of hosts.
Sensu also uses an entity model instead of host/service. An entity represents anything being monitored — a server, a network device, a cloud function, a proxy check target. Entities aren't pre-registered; they appear when an agent first checks in. This is a better fit for autoscaling infrastructure.
Architecture
Sensu Go has three main components:
sensu-backend: The control plane. Handles check scheduling, result processing, event filtering, and handler dispatch. Uses an embedded etcd cluster for state storage. The API and web UI run here. In production you run an odd number of backends (3 or 5) for etcd quorum.
sensu-agent: Runs on each monitored node. Subscribes to check topics, executes check commands, publishes results. Communicates with the backend over WebSocket. Also exposes a local StatsD listener for metrics.
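The agent's StatsD listener accepts standard StatsD datagrams over UDP (port 8125 by default) and forwards them through the Sensu pipeline. A minimal sketch of emitting a counter and a timer from application code — the metric names here are illustrative, not from the original setup:

```python
#!/usr/bin/env python3
"""Sketch: send StatsD metrics to the local sensu-agent listener.
Metric names are hypothetical; 8125 is the agent's default StatsD UDP port."""
import socket

AGENT_ADDR = ("127.0.0.1", 8125)  # sensu-agent's local StatsD listener

def statsd_line(name, value, mtype):
    """Format a single StatsD datagram: <name>:<value>|<type>."""
    return f"{name}:{value}|{mtype}"

def send_metric(name, value, mtype="c"):
    """Build and send one metric; UDP is fire-and-forget."""
    line = statsd_line(name, value, mtype)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode(), AGENT_ADDR)
    return line

if __name__ == "__main__":
    print(send_metric("rails.requests.total", 1, "c"))       # counter
    print(send_metric("rails.request.duration", 42, "ms"))   # timer
```

Because it's plain StatsD, existing instrumentation libraries can point at the agent without code changes.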
etcd: Embedded in the backend. No Redis, no RabbitMQ, no external message queue — this was one of the pain points in Sensu Classic.
Writing a Custom Check
Sensu checks follow the same exit code contract as Nagios plugins:
0 — OK
1 — WARNING
2 — CRITICAL
3 — UNKNOWN
Here's a Python check that verifies a Rails Puma process is running:
#!/usr/bin/env python3
"""
check_puma_process.py — verifies Puma master process is running
"""
import subprocess
import sys


def check_puma():
    try:
        result = subprocess.run(
            ['pgrep', '-f', 'puma.*master'],
            capture_output=True,
            text=True,
            timeout=5
        )
        if result.returncode == 0:
            pids = result.stdout.strip().split('\n')
            print(f"OK: Puma master running, PID(s): {', '.join(pids)}")
            sys.exit(0)
        else:
            print("CRITICAL: Puma master process not found")
            sys.exit(2)
    except subprocess.TimeoutExpired:
        print("UNKNOWN: pgrep timed out")
        sys.exit(3)
    except Exception as e:
        print(f"UNKNOWN: {e}")
        sys.exit(3)


if __name__ == '__main__':
    check_puma()
The check script lives on the agent nodes (or is distributed via Sensu assets). The output written to stdout becomes the event message in Sensu.
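An asset wraps the script in a tarball the backend tells agents to download and verify before running the check. A sketch of what that definition looks like — the URL and checksum below are placeholders, not real values:

```yaml
# Hypothetical asset definition; url and sha512 are placeholders.
type: Asset
api_version: core/v2
metadata:
  name: privia-checks
  namespace: default
spec:
  url: https://assets.example.com/privia-checks.tar.gz
  sha512: "<sha512 of the tarball>"
  filters:
    - entity.system.os == 'linux'
```

A check then references the asset by name in its runtime_assets list, and agents fetch it on demand instead of you baking scripts into machine images.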
Check Definition (sensu-agent resource)
# check_puma.yaml
type: CheckConfig
api_version: core/v2
metadata:
  name: check-puma-process
  namespace: default
  labels:
    app: rails
    environment: production
spec:
  command: /usr/local/bin/check_puma_process.py
  interval: 30
  timeout: 10
  subscriptions:
    - rails-app
  handlers:
    - pagerduty
    - slack
  publish: true
  round_robin: false
Apply it with sensuctl:
sensuctl create -f check_puma.yaml
Agents that have the rails-app subscription in their agent.yml will execute this check every 30 seconds. The subscription system replaces Nagios's host groups and service templates — you tag agents with subscription labels and assign checks to those labels.
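On the agent side, the subscription is declared in the agent's own config. A minimal agent.yml sketch — the backend hostname and subscription names are illustrative:

```yaml
# /etc/sensu/agent.yml (minimal sketch; values are illustrative)
backend-url:
  - "ws://sensu-backend.internal:8081"
subscriptions:
  - rails-app
  - linux
namespace: default
```

Adding a new check to every Rails node is then a single `sensuctl create` — no per-host config edits, which is the practical payoff over Nagios host groups.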
Handlers
A pipe handler sends event JSON to a command's stdin:
type: Handler
api_version: core/v2
metadata:
  name: pagerduty
  namespace: default
spec:
  type: pipe
  command: sensu-pagerduty-handler --token $PAGERDUTY_TOKEN
  env_vars:
    - PAGERDUTY_TOKEN=$PAGERDUTY_TOKEN
  timeout: 10
  filters:
    - is_incident
    - not_silenced
A set handler fans out to multiple handlers:
type: Handler
api_version: core/v2
metadata:
  name: alerts
  namespace: default
spec:
  type: set
  handlers:
    - pagerduty
    - slack
The is_incident filter passes events with a non-zero status, plus resolution events (a failing check returning to OK); routine OK results never reach the handler. It does not deduplicate a recurring CRITICAL on its own, so pair it with an occurrences-based filter or you'll re-page PagerDuty with the same alert every 30 seconds.
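To throttle how often a recurring failure re-alerts, the usual pattern is an event filter with an occurrences expression. A sketch, tuned as an assumption for a 30-second check interval (filter expressions are JavaScript evaluated against the event):

```yaml
# Sketch of a fatigue-style filter: pass the first failure, then
# roughly one repeat per hour for a check running every 30 seconds.
type: EventFilter
api_version: core/v2
metadata:
  name: first-then-hourly
  namespace: default
spec:
  action: allow
  expressions:
    - event.check.occurrences == 1 || event.check.occurrences % 120 == 0
```

Add it to the handler's filters list alongside is_incident and not_silenced.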
Silencing During Maintenance
# Silence a specific check on a specific entity for 2 hours
sensuctl silenced create \
  --subscription entity:privia-app-01 \
  --check check-puma-process \
  --expire 7200 \
  --reason "Scheduled maintenance - upgrading Puma"
# List active silences
sensuctl silenced list
# Delete a silence
sensuctl silenced delete entity:privia-app-01:check-puma-process
sensuctl for Day-to-Day Operations
# List all entities
sensuctl entity list
# Show recent events
sensuctl event list
# Show events in a non-OK state
sensuctl event list --fieldSelector 'event.check.state == "failing"'
# Force a check to run immediately
sensuctl check execute check-puma-process
# View a specific check config
sensuctl check info check-puma-process --format yaml
Sensu Classic vs. Sensu Go
Sensu Classic (0.x) was Ruby-based, used Redis for the event store, and RabbitMQ as the transport. Configuration lived in JSON files. The check result pipeline was: agent → RabbitMQ → Ruby server process → Redis.
Sensu Go (5.x+) is a complete rewrite in Go. The backend is a single binary with embedded etcd. The agent is also a single binary. The configuration format is YAML/JSON resources managed via sensuctl or the API. No Redis, no RabbitMQ.
The migration from Classic to Go is not an in-place upgrade — you're essentially re-implementing your checks and handlers in the new format. We did this incrementally, running both systems in parallel for about six weeks until all checks were migrated and validated.
If you're migrating from Nagios, the mental model shift is: replace hosts/services with entities/subscriptions, replace NRPE with the Sensu agent, and replace Nagios config files with sensuctl-managed YAML resources. Your existing check scripts (anything that outputs text and exits with 0/1/2/3) work without modification.
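That compatibility extends to metrics: a plugin that appends Nagios perfdata after a pipe can have those metrics extracted by Sensu Go when the check sets output_metric_format: nagios_perfdata. A sketch of emitting perfdata from Python — the metric name and thresholds are illustrative:

```python
#!/usr/bin/env python3
"""Sketch: Nagios-style plugin output with perfdata, which Sensu Go can
parse when the check sets output_metric_format: nagios_perfdata."""
import os

def perfdata_output(label, value, warn, crit):
    """Nagios perfdata format: label=value;warn;crit appended after '|'."""
    return f"OK: {label} is {value} | {label}={value};{warn};{crit}"

if __name__ == "__main__":
    # getloadavg is Unix-only; fall back to 0.0 elsewhere for the demo.
    load1 = round(os.getloadavg()[0], 2) if hasattr(os, "getloadavg") else 0.0
    print(perfdata_output("load1", load1, 4, 8))
    # A real plugin would then exit 0/1/2/3 based on the thresholds.
```

With metric extraction enabled, the same event can alert through PagerDuty and ship the measurement to a time-series handler.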