January 15, 2017 Marie H.

SaltStack in Production: Lessons from Two Years

I ran SaltStack in production from mid-2015 through the end of 2016, managing somewhere between 80 and 250 minions depending on the time of year. These are the things I wish I had known at the start, written down before I forget them.


1. The Master Is a Single Point of Failure

This sounds obvious, but it takes a production outage to make you really appreciate it. When your salt-master goes down, your minions keep running. Whatever state they last applied stays applied. Nothing breaks immediately. But you cannot push new jobs, you cannot apply new states, and you cannot remediate anything that goes wrong on a minion while the master is offline.

We ran a single master for the first year. It was fine until it wasn't.

The proper fix is multi-master with syndics, or at minimum a hot standby master. We went the standby route. Both masters share a gitfs backend (more on that below), so they both have access to the same state tree and pillar data. Failover is manual — we update DNS and restart minions — but it covers the outage scenarios that actually happen in practice (master host needs a reboot, disk fills up, you accidentally rm'd something).
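A minimal sketch of the minion side of that arrangement — hostnames here are illustrative, not from our setup. Point minions at a DNS name you control, or use Salt's native multi-master support:

```yaml
# /etc/salt/minion — hostnames are hypothetical
# Our approach: one DNS name, flipped manually on failover.
master: salt.internal

# The built-in alternative: list both masters and let the minion
# fail over on its own.
#master:
#  - salt-master-1.internal
#  - salt-master-2.internal
#master_type: failover
```

The DNS approach keeps the minion config trivial at the cost of a manual failover step; the multi-master config automates failover but adds key-management overhead on the second master.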

If you're multi-datacenter with latency concerns, syndics are worth the complexity. Each datacenter gets a syndic that communicates with the master on behalf of its local minions. We didn't go that far, but we designed for it.

The thing to internalize: minions are resilient by design. The master is not. Treat it accordingly.


2. gitfs vs Local Fileserver

We started with the default setup: state files in /srv/salt on the master, edited in place, applied immediately. This worked fine for the first dozen minions. It stopped working fine when two people were editing states at the same time and someone applied a half-written state to a production box.

Switching to gitfs backed by a private Git repository was the right call and I'd do it immediately on any new deployment.

The benefits are straightforward: every change is version-controlled, you can require pull requests before merging to the branch that maps to production, and you have a full audit trail of who changed what and when. When something breaks after a state run, you look at the git log. This has saved significant debugging time.

The gotchas are real, though:

gitfs caches aggressively. By default, Salt polls the git remote every 60 seconds. If you push a fix and immediately run state.apply, you may get the old state. You can force a cache clear with salt-run fileserver.update followed by salt-run cache.clear_all, but you need to know this is the problem first. We wasted time debugging "why isn't my change doing anything" before we understood this.
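For reference, the cache-refresh sequence described above, run on the master:

```shell
# Fetch from the git remote now instead of waiting for the poll interval
salt-run fileserver.update
# Clear the master-side cache so the next run renders from the fresh tree
salt-run cache.clear_all
```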

The branch-to-environment convention maps git branches to Salt environments. By default the base environment maps to the master branch (configurable via gitfs_base), and every other branch becomes an environment of the same name: your prod environment maps to the prod branch. This is elegant once you understand it and confusing until you do. Document it for your team.
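A minimal master-side gitfs sketch (the repo URL is hypothetical):

```yaml
# /etc/salt/master
fileserver_backend:
  - git

gitfs_remotes:
  - git@git.internal:ops/salt-states.git

# base maps to the master branch by default; every other branch
# (staging, prod, ...) becomes an environment of the same name.
gitfs_base: master
```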

Lock down push access to the states repo. Granting someone push rights to prod is equivalent to giving them unrestricted root on every managed host. We required two approvals on PRs to the prod branch. This is not paranoia, it is appropriate access control.


3. Targeting at Scale

Salt gives you several ways to target minions: glob matching on minion IDs, grains, pillars, compound matchers, nodegroups, and CIDR ranges. Using them well matters when you have 200+ minions.

Grains are metadata stored on the minion and sent to the master at connection time. Targeting by grain is fast because the master has them cached. We used grains for role (role:webserver), datacenter (datacenter:us-east), and OS. Compound matchers let you combine these: salt -C 'G@role:webserver and G@datacenter:us-east'.
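The targeting styles above, as CLI invocations (minion IDs and grain values are illustrative):

```shell
# Glob on minion ID
salt 'web*' test.ping

# Single grain
salt -G 'role:webserver' test.ping

# Compound matcher: grain AND grain
salt -C 'G@role:webserver and G@datacenter:us-east' state.apply

# CIDR range
salt -S '10.0.2.0/24' test.ping
```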

Pillars are authoritative data stored on the master and pushed to minions. Targeting by pillar requires the master to query pillar data for each minion, which is slower. Use pillars for sensitive configuration values, not for targeting.

The grain manipulation problem is real and worth understanding before you rely on grains for security-relevant targeting. Minions can write their own grains. A minion can set its role grain to whatever it wants. If you're using grain-based targeting to decide which minions receive secrets via pillar, a compromised minion could potentially set itself to receive pillar data it shouldn't. We used pillars for access control decisions and grains for operational targeting. Know the difference and be intentional about it.

Nodegroups are worth setting up for any target combination you run frequently. We had nodegroups for prod-web, prod-db, staging-all, and a few others. Define them in the master config. It's a small thing that eliminates a lot of typos on compound matchers at 2am.
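The nodegroup names below are the ones from this post; the matcher expressions are illustrative:

```yaml
# /etc/salt/master
nodegroups:
  prod-web: 'G@role:webserver and G@env:prod'
  prod-db: 'G@role:database and G@env:prod'
  staging-all: 'G@env:staging'
```

After a master restart they're available as targets: `salt -N prod-web state.apply test=True`.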


4. The Mine System

salt-mine collects function output from minions and stores it centrally on the master, where other minions and master-side tooling can query it with mine.get. It sounds niche. It turned out to be one of the most useful features we used.

Three concrete uses:

First, inventory. We configured minions to publish the version of the application running on each host via the mine. A single salt-run call gives you a current map of what version is deployed where. This replaced a fragile custom script we'd been maintaining.

Second, dynamic HAProxy configuration. We templated our HAProxy config using Jinja, with the backend server list populated from mine.get. When a new web server came up, it registered itself in the mine, and the next HAProxy state run pulled the updated list. This is the pattern for service discovery without a dedicated service registry.
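A sketch of that template. It assumes each web minion publishes its addresses via a mine_functions entry for network.ip_addrs; the grain name, backend name, and port are illustrative:

```jinja
{#- haproxy.cfg.jinja — backend list populated from the mine -#}
backend app
{%- for server, addrs in salt['mine.get']('role:webserver', 'network.ip_addrs', expr_form='grain') | dictsort %}
    server {{ server }} {{ addrs[0] }}:8080 check
{%- endfor %}
```

A file.managed state renders this template on the HAProxy host, so each state run picks up whatever the mine currently knows about.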

Third, monitoring. We pulled mine data into our monitoring system to cross-reference what Salt thought was deployed against what was actually running.

The important caveat: mine data is cached on the master and refreshed on a schedule (default is every 60 minutes, configurable per function). It can be stale. For anything time-sensitive, know your mine refresh interval and factor it in. For inventory-style queries where a few minutes of lag is fine, it works well.


5. Performance at 200+ Minions

The default ZeroMQ settings are tuned for smaller deployments. At 200+ minions we started seeing intermittent timeouts and minions not responding to jobs even when they were healthy.

The settings that helped most:

tcp_keepalive: 1, tcp_keepalive_idle: 180, tcp_keepalive_cnt: 5, tcp_keepalive_intvl: 10 — these go in the minion config and tell the OS to send keepalive probes on the ZeroMQ connections. Without them, connections through network equipment that aggressively times out idle TCP connections silently drop, and the minion appears offline until it reconnects.
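As a config fragment, those are:

```yaml
# /etc/salt/minion — keepalive probes on the ZeroMQ connections
tcp_keepalive: 1
tcp_keepalive_idle: 180
tcp_keepalive_cnt: 5
tcp_keepalive_intvl: 10
```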

On the command side, --timeout controls how long the salt CLI waits for the master to return results, and --gather-job-timeout controls how long it waits for a follow-up check on minions that didn't respond. Increasing both catches slow minions that would otherwise appear as failures. We settled on --timeout 60 --gather-job-timeout 30 as defaults for interactive runs.

presence_events: True in the master config makes the master periodically scan its connections and fire events on the event bus listing which minions are present. This lets you track which minions are actually online versus which ones the key system thinks should exist. We piped presence events to a simple monitoring check. Knowing that 198 of 200 expected minions are online before you run a broad state.apply is better than finding out mid-run.
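The master-side setting is one line:

```yaml
# /etc/salt/master
presence_events: True
```

You can watch the resulting salt/presence events interactively with `salt-run state.event pretty=True`, or consume them programmatically the way our monitoring check did.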


6. State Organization

We started with a single top.sls that grew to several hundred lines and became impossible to reason about. Refactoring it was a weekend project we kept putting off until it became a sprint.

What we settled on:

Environments map to git branches: base, staging, prod. Most state logic lives in base. Environment-specific overrides live in their respective environments and use the include: directive to pull in base states.

Within each environment, we organize by role. Each role is a directory with an init.sls that includes the components that role needs. A webserver role includes the nginx component, the app component, the logging component. Components are reusable building blocks with no knowledge of which roles include them.

The top.sls assigns roles to minions based on grains. It's short because the logic is pushed down into the role definitions.

The explicit structure that works: roles/ at the top level, each role as a subdirectory, components/ for shared building blocks, environment-specific pillar overrides in the pillar tree mirroring the same structure. When someone asks "what does a database server get?", you look at roles/database/init.sls and read down the include chain.
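A condensed sketch of that layout. The role and component names come from this post; the exact tree is illustrative:

```yaml
# Layout:
#   top.sls
#   roles/webserver/init.sls    -> include: components.nginx, .app, .logging
#   roles/database/init.sls
#   components/nginx/init.sls   -> reusable, no knowledge of roles
#
# top.sls stays short: it only maps grains to roles.
base:
  'role:webserver':
    - match: grain
    - roles.webserver
  'role:database':
    - match: grain
    - roles.database
```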


7. Pillars for Secrets: The Problem

Pillar data is encrypted in transit between master and minion. This is good. Pillar data is stored plaintext on the master filesystem in /srv/pillar. This is a problem.

We hit this in a security audit. The auditor was correct. If the master host is compromised, all secrets stored in pillars are exposed. For a lot of teams this is an acceptable risk given everything else required to compromise a Salt master. For us it wasn't acceptable for certain credential classes.

We addressed it in two stages.

First, the GPG renderer. Salt has a built-in GPG renderer for pillars that lets you store GPG-encrypted values in pillar files. The master decrypts them at render time using a key stored on the master host. This moves the problem — the key is still on the master — but it means the secrets aren't sitting in plaintext in a file that might end up in a backup, a log, or an accidental cat in a recorded terminal session. It also gives you an audit trail of who encrypted what.
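A GPG-rendered pillar file looks like this sketch — the shebang line selects the renderer pipeline, and the ciphertext (elided here) is produced by encrypting the value against the master's key:

```yaml
#!yaml|gpg
# pillar/secrets.sls — key name hypothetical; a value is produced with
# something like:  echo -n 'the-secret' | gpg --armor --encrypt -r <key-id>
db_password: |
  -----BEGIN PGP MESSAGE-----
  ...
  -----END PGP MESSAGE-----
```

Minions never see the ciphertext; the master decrypts at pillar render time and ships the plaintext over the encrypted transport.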

Second, we started moving credentials toward HashiCorp Vault. Salt has a Vault runner and returner. Minions authenticate to Vault via their Salt identity and retrieve secrets directly, so the secrets never touch the Salt master. This is more operationally complex but is the right model for credentials that need strong access control and audit logging. I wrote more about the Vault integration separately.

For most teams: start with the GPG renderer. It's low friction and addresses the obvious audit finding. Graduate to Vault if your security requirements justify it.


8. Testing States

state.apply test=True is the single most important operational habit to build. It renders and evaluates the state tree against the target minions and tells you what would change, without making any changes. Run it before every production apply. Always.
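In practice the habit looks like this (the prod-web nodegroup is the one defined earlier in this post):

```shell
# Dry run: render the states and report what would change
salt -N prod-web state.apply test=True

# Review the output, then apply for real
salt -N prod-web state.apply
```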

It has limits. It cannot fully simulate requisites in all cases, and some states report as "would change" when they would actually be no-ops, or vice versa. But it catches the obvious failures — syntax errors, missing files, template rendering errors — before they hit production.

For local development of new states, we used kitchen-salt to test states against local VMs before committing. The feedback loop is much faster than deploying to a staging minion, and you can iterate on a state without cluttering your salt-master's job history.

Our CI workflow: any push to the salt-states repository triggers a pipeline that spins up a minimal Docker container, installs the salt-minion, and runs salt-call state.apply test=True against the changed states. This catches broken states before they can be merged to a branch that any real minion is tracking. It does not catch logic errors in your states — it only catches errors that prevent the state from running at all — but it's a meaningful gate.


9. When Things Go Wrong

state.apply runs states in order, and — contrary to a common assumption — a failed state does not halt the run by default. Salt continues with the remaining states unless you set failhard (globally in the config, or per state), though anything that requires the failed state is skipped. Either way, after a failed run some states have been applied and some haven't. This is usually fine if your states are idempotent — re-running the apply will skip the already-applied states and retry the failed ones.

The problem is non-idempotent operations. We had database migrations wrapped in cmd.run. A migration would run, make a partial change, fail, and leave the database in an undefined state that the next state apply couldn't automatically recover from. This caused incidents.

The fixes: use unless or onlyif guards on cmd.run states to make them conditionally idempotent. Log what ran. And use the onfail requisite to trigger cleanup steps when something fails. We added onfail handlers to our migration states that would log the failure and notify the on-call channel before halting the run. Getting an alert saying "migration failed at step 3, manual intervention required" is much better than silently leaving the system in a broken state.
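A sketch of that pattern — every path and command here is hypothetical:

```yaml
# migration.sls
# The unless guard skips the migration when it has already been applied,
# making the cmd.run conditionally idempotent.
run-migration:
  cmd.run:
    - name: /opt/app/bin/migrate --to-latest
    - unless: /opt/app/bin/migrate --check-current

# Fires only if run-migration fails: alert before the run halts.
notify-oncall:
  cmd.run:
    - name: /usr/local/bin/page-oncall 'migration failed, manual intervention required'
    - onfail:
      - cmd: run-migration
```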

The broader lesson is that Salt's requisite system — require, watch, onfail, onchanges — is what separates a well-designed state tree from a brittle one. Learn it. Use it. The states that caused us problems were invariably the ones where someone had strung together a sequence of cmd.run calls without requisites.


10. Why We Eventually Moved Away

By late 2016, Kubernetes was managing our containerized workloads. We were actively reducing the number of long-lived VMs that needed configuration management. The use case for Salt was shrinking, not because Salt was failing us but because the operational model was changing.

For the VMs we kept — database hosts, some legacy services, build infrastructure — Salt remained the right tool and we kept using it. We didn't rip it out. We just stopped expanding it.

If you're running traditional VM-based infrastructure, SaltStack is still, as of early 2017, an excellent choice. It's more powerful than Ansible, more accessible than raw Puppet, and the Python internals are readable when you need to debug something. The reactor and event systems give you capabilities that most config management tools don't have at all.

The tool was not the problem. The operational model was changing.


Honest Assessment

SaltStack is powerful and significantly underrated. The documentation is uneven — some parts are excellent, some parts are out of date or missing — and the learning curve is real. But teams that commit to learning it properly get substantial value: fast parallel execution, a flexible targeting system, reactive automation via the event bus, a mine system for distributed data collection, and Python all the way down when you need to extend it.

Teams that treat it as a YAML config applier, don't understand the ZeroMQ transport, don't learn requisites, and don't invest time in state organization hit walls and blame the tool. That's a fair observation about the tool's documentation and onboarding experience, but it's not an inherent flaw.

If I were starting a new VM-heavy deployment today I'd use it again. I'd also read everything I could about requisites and the pillar system before writing a single state, which is the opposite of how most people (including me, initially) approach it.