November 12, 2019 Marie H.

Using Ansible for Disaster Recovery Preparation on Windows

Photo by Martin Sanchez on Unsplash

Before you can replicate a server, you need to know exactly what's on it. Disk layout, active network interfaces, which services are running and set to auto-start, which firewall rules are in place. For a single server you do this by hand. For 50+ Windows servers spread across US manufacturing sites and international locations, you don't.

This post covers the Ansible playbooks I wrote for the discovery and agent installation phases of a DR engagement at a large manufacturing company. All targets were Windows. The goal was twofold: document the pre-migration state thoroughly enough to build a DR runbook, and get the replication agent installed and verified on every machine.

The Discovery Problem

The DR runbook needs to answer questions like: when this server comes up in AWS after a recovery event, what should be running? What disk letters need to be there? What IP does it normally sit on? What firewall rules did it have that need to be represented as security group rules?

Manual collection doesn't scale and produces inconsistent results — someone forgets to check one tab in Server Manager, or copies the wrong column from the disk management view. Automated discovery via Ansible produces a consistent JSON file per server that you can diff, process, and use as authoritative input for runbook generation.

Inventory Organization

I organized inventory by site. Each site had its own inventory file:

inventory/
  us-midwest.yml
  us-southeast.yml
  apac-site.yml

Each file followed the same structure:

all:
  vars:
    ansible_connection: winrm
    ansible_winrm_transport: ntlm
    ansible_winrm_server_cert_validation: ignore
    ansible_winrm_operation_timeout_sec: 60
    ansible_winrm_read_timeout_sec: 70
  children:
    us_midwest:
      hosts:
        server-01:
          ansible_host: 10.10.1.11
        server-02:
          ansible_host: 10.10.1.12

Site-specific variables — the AWS region for that site's recovery target, the agent download URL — lived in group_vars/ keyed to the site group name. This meant the agent installation playbook could run against all sites with a single command and automatically use the right region per machine.
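To make that concrete, here is a sketch of what a site's group_vars file might contain. The variable names agent_url, aws_region, aws_access_id, and aws_secret_key match the ones used in the playbooks below; the group name us_midwest and the values shown are illustrative assumptions, not taken from the engagement:

```yaml
# group_vars/us_midwest.yml -- illustrative values only
aws_region: us-east-2   # recovery target region for this site
agent_url: "https://example-bucket.s3.us-east-2.amazonaws.com/AwsReplicationWindowsInstaller.exe"

# Credentials resolve from Ansible Vault, never plaintext in group_vars
aws_access_id: "{{ vault_aws_access_id }}"
aws_secret_key: "{{ vault_aws_secret_key }}"
```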

get_facts.yml

The baseline playbook: turn on gather_facts and write the full ansible_facts dict to a JSON file per host on the control node.

---
- name: Collect system facts
  hosts: all
  gather_facts: true
  tasks:
    - name: Write facts to file
      copy:
        content: "{{ ansible_facts | to_nice_json }}"
        dest: "facts/{{ inventory_hostname }}.json"
      delegate_to: localhost

delegate_to: localhost is the key line. Without it, copy would try to write the file on the remote host. With it, the file lands on your control node. After running this against all sites, you have one JSON file per server with OS version, IP configuration, hostname, memory, and processor info — everything gather_facts collects.
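One caveat: copy will not create missing parent directories, so the play fails on its first run if facts/ does not already exist on the control node. A task like the following (my addition, not part of the original playbook) placed before the write step takes care of it:

```yaml
    - name: Ensure the local facts directory exists
      ansible.builtin.file:
        path: facts
        state: directory
      delegate_to: localhost
      run_once: true
```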

get_win_disk_info.yml

Ansible facts include basic disk info but not the full layout. For the DR runbook you need drive letters, sizes in GiB, and partition structure. community.windows.win_disk_facts gets it:

---
- name: Collect disk information
  hosts: all
  gather_facts: false
  tasks:
    - name: Get disk facts
      community.windows.win_disk_facts:

    - name: Write disk facts to file
      copy:
        content: "{{ ansible_facts.ansible_disks | to_nice_json }}"
        dest: "facts/{{ inventory_hostname }}_disks.json"
      delegate_to: localhost

The module returns its results under ansible_facts.ansible_disks, which gives you physical disk info including size in bytes. In the post-processing Python script, I converted these to GiB for the runbook spreadsheet. You want to document this before replication starts — the replication agent replicates what's there at enrollment time, and if a disk layout question comes up during recovery you want a timestamped record.
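The GiB conversion in that post-processing step is trivial but worth pinning down, since mixing GB and GiB in a runbook causes real confusion during recovery. A minimal sketch, assuming each disk entry in the JSON carries number and size (in bytes) keys:

```python
def bytes_to_gib(size_bytes):
    """Convert a raw byte count to GiB, rounded to two decimal places."""
    return round(size_bytes / (1024 ** 3), 2)


def summarize_disks(disk_facts):
    """Flatten disk fact entries into (disk_number, size_gib) pairs."""
    return [(d.get("number"), bytes_to_gib(d.get("size", 0))) for d in disk_facts]
```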

get_services.yml

Documents which services are running and set to auto-start. These are the services that need to be running post-recovery for the system to be considered healthy:

---
- name: Collect running auto-start services
  hosts: all
  gather_facts: false
  tasks:
    - name: Get service facts
      ansible.windows.win_service_facts:

    - name: Filter to auto-start running services
      set_fact:
        auto_services: >-
          {{ ansible_facts.services.values()
             | selectattr('start_mode', 'equalto', 'auto')
             | selectattr('state', 'equalto', 'running')
             | list }}

    - name: Write service list to file
      copy:
        content: "{{ auto_services | to_nice_json }}"
        dest: "facts/{{ inventory_hostname }}_services.json"
      delegate_to: localhost

This produces a list of services that are auto-start and currently running. Post-recovery, this list becomes your validation checklist: if these services aren't running, something is wrong.
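As a sketch of how that checklist could be consumed after a recovery event (the name key matches what win_service_facts reports; the helper itself is hypothetical, not from the engagement):

```python
import json


def missing_services(expected_path, running_names):
    """Return expected auto-start services absent from the currently
    running set. An empty list means the checklist passes."""
    with open(expected_path) as f:
        expected = json.load(f)
    running = {name.lower() for name in running_names}
    return [svc["name"] for svc in expected if svc["name"].lower() not in running]
```

On the recovered instance you would feed it the service names reported by Get-Service; any name it returns is a service that should be running but is not.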

get_fw_rules.yml

Firewall rules are needed for two purposes: documentation, and building the AWS security groups that the recovery instances will live behind. I enumerated them with win_shell rather than a dedicated module since win_firewall_rule doesn't have a facts-gathering mode:

---
- name: Collect firewall rules
  hosts: all
  gather_facts: false
  tasks:
    - name: Get enabled inbound firewall rules
      ansible.windows.win_shell: |
        Get-NetFirewallRule | Where-Object { $_.Enabled -eq 'True' -and $_.Direction -eq 'Inbound' } | ForEach-Object {
          $rule = $_
          $portFilter = $rule | Get-NetFirewallPortFilter
          [PSCustomObject]@{
            Name      = $rule.DisplayName
            Action    = $rule.Action.ToString()
            Protocol  = $portFilter.Protocol
            LocalPort = $portFilter.LocalPort
          }
        } | ConvertTo-Json -Depth 3
      register: fw_output

    - name: Write firewall rules to file
      copy:
        content: "{{ fw_output.stdout }}"
        dest: "facts/{{ inventory_hostname }}_firewall.json"
      delegate_to: localhost

The output from ConvertTo-Json lands in fw_output.stdout as a JSON string. Writing it directly without re-serializing through Ansible avoids any escaping issues.

Agent Installation

With discovery done, the next phase was installing the AWS replication agent. The playbook downloads the installer, runs it silently with credentials from Ansible Vault, and verifies the service is running:

---
- name: Install AWS replication agent
  hosts: all
  gather_facts: false
  tasks:
    - name: Download agent installer
      ansible.windows.win_powershell:
        script: |
          $ErrorActionPreference = 'Stop'
          New-Item -ItemType Directory -Path 'C:\Temp' -Force | Out-Null
          $url = "{{ agent_url }}"
          Invoke-WebRequest -Uri $url -OutFile "C:\Temp\AwsReplicationWindowsInstaller.exe"

    - name: Install agent
      ansible.windows.win_powershell:
        script: |
          $ErrorActionPreference = 'Stop'
          & "C:\Temp\AwsReplicationWindowsInstaller.exe" `
            --region {{ aws_region }} `
            --aws-access-key-id "{{ aws_access_id }}" `
            --aws-secret-access-key "{{ aws_secret_key }}" `
            --no-prompt
      register: install_output
      no_log: true

    - name: Verify agent service is running
      ansible.windows.win_service:
        name: "aws-replication-windows-agent"
        state: started

no_log: true on the installation task is not optional. The task parameters include the AWS secret key. Without no_log, Ansible prints the full task details including those values to stdout and to any configured logging.

The win_service task at the end does double duty: it verifies the service is present and running, and it fails the play if it isn't. If installation silently failed — wrong installer URL, permissions issue, installer returned nonzero but win_powershell didn't catch it — this step will catch it.

Checking Agent Logs

After installation, I ran a quick playbook to tail the agent log on each machine and confirm replication was actually starting:

---
- name: Check replication agent logs
  hosts: all
  gather_facts: false
  tasks:
    - name: Check agent log tail
      ansible.windows.win_shell: >
        Get-Content 'C:\Program Files\AWS Replication Agent\log\agent.log' -Tail 20
      register: agent_log

    - name: Print log output
      debug:
        msg: "{{ agent_log.stdout_lines }}"

You're looking for lines indicating the agent has connected and begun initial sync. Errors here usually mean network connectivity issues to the AWS replication endpoints, or a problem with the credentials.

WinRM Timeouts at Scale

Running playbooks across sites with varied network latency — some US domestic, some in Asia-Pacific — exposed WinRM timeout issues that don't appear in a single-site environment. The default ansible_winrm_operation_timeout_sec of 20 seconds isn't enough for international connections under load.

I settled on these values for international-site inventory:

ansible_winrm_operation_timeout_sec: 60
ansible_winrm_read_timeout_sec: 70

The read timeout should always be larger than the operation timeout. If they're equal or reversed, you get confusing timeout errors that look like connection resets rather than operation timeouts. The 10-second buffer isn't magic, but it was reliable across the Asia-Pacific site throughout the engagement.

On parallelism: Ansible runs tasks across hosts concurrently, up to the forks limit (five by default). For the discovery playbooks this was fine. For the agent installation, I ran it with --forks 5 so that no more than five machines pulled the installer at once. The international-site machines shared a slow pipe; letting every machine download simultaneously would have hit rate limits.

Processing the Output

After running all discovery playbooks, the facts/ directory had four JSON files per server. A short Python script processed these into a spreadsheet for the DR runbook: disk layout, drive letters, IP addresses, OS version, and auto-start service list per machine. Each row in the spreadsheet was one server. Each column was a fact category.
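The script itself was straightforward. A condensed sketch (the distribution and ip_addresses fact keys are what gather_facts reports on Windows; the filenames follow the naming scheme from the playbooks above, and the column choices are illustrative):

```python
import csv
import json
from pathlib import Path


def build_runbook_rows(facts_dir):
    """Merge the per-server discovery JSON files into one dict per server."""
    facts_dir = Path(facts_dir)
    rows = []
    for base in sorted(facts_dir.glob("*.json")):
        # Skip the suffixed files; they are joined in below by hostname.
        if base.stem.endswith(("_disks", "_services", "_firewall")):
            continue
        host = base.stem
        facts = json.loads(base.read_text())
        services = json.loads((facts_dir / f"{host}_services.json").read_text())
        rows.append({
            "host": host,
            "os": facts.get("distribution", ""),
            "ip_addresses": ";".join(facts.get("ip_addresses", [])),
            "auto_services": ";".join(s["name"] for s in services),
        })
    return rows


def write_runbook_csv(rows, out_path):
    """Write one row per server; columns are the fact categories."""
    fields = ["host", "os", "ip_addresses", "auto_services"]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```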

That spreadsheet became the authoritative input for the recovery runbook and for the boto3 script that built AWS security groups — each server's firewall rules mapped to ingress rules on its dedicated security group in AWS.
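The firewall-to-security-group translation is the one step with real mapping logic, so here is a hedged sketch of it. The Protocol and LocalPort fields match the firewall export above; the default source CIDR and the skip-non-numeric-port policy are my assumptions. The output list is shaped for boto3's authorize_security_group_ingress:

```python
def fw_rules_to_ip_permissions(rules, source_cidr="10.0.0.0/8"):
    """Translate exported Windows firewall rules into EC2 IpPermissions.

    Rules without a single numeric port (e.g. a LocalPort of 'Any' or a
    range) are skipped in this sketch.
    """
    perms = []
    for rule in rules:
        proto = str(rule.get("Protocol", "")).lower()
        port = str(rule.get("LocalPort", ""))
        if proto not in ("tcp", "udp") or not port.isdigit():
            continue
        perms.append({
            "IpProtocol": proto,
            "FromPort": int(port),
            "ToPort": int(port),
            "IpRanges": [{"CidrIp": source_cidr,
                          "Description": str(rule.get("Name", ""))[:255]}],
        })
    return perms
```

Each server's list would then be applied with ec2.authorize_security_group_ingress(GroupId=sg_id, IpPermissions=perms) against its dedicated security group.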

The whole discovery phase — running all four playbooks across all sites — took about 45 minutes due to WinRM connection setup overhead on the international sites. Doing the equivalent manually would have taken days.