August 19, 2019 Marie H.

Disaster Recovery with AWS CloudEndure


Updated March 2026: CloudEndure has been fully integrated into AWS as Elastic Disaster Recovery (DRS). The underlying concepts described here — continuous replication, staging areas, cutover testing — are identical in DRS. The agent installation and console experience have changed but the architecture is the same. If you're starting fresh today, go directly to AWS DRS.

At Penn Engineering I was brought in to build disaster recovery coverage for a footprint that had grown organically over years without it: three US sites and five international sites, a mix of on-premises VMware, Windows and Linux servers, and applications ranging from research computing infrastructure to administrative systems. RTO targets varied by system but the business-critical ones needed to be back up within four hours. RPO was one hour for critical systems. CloudEndure was the answer.

What CloudEndure Does

The core mechanism is continuous block-level replication. You install the CloudEndure agent on each source machine — physical or virtual, Windows or Linux, on-prem or cloud. The agent establishes an outbound connection to your staging area subnet in AWS and begins replicating changed disk blocks in near-real-time. This is not a snapshot-based approach. It's continuous. The RPO lag is typically measured in seconds to low minutes for a healthy replication session.

The key architectural distinction is between the staging area and the target environment. During normal operation, CloudEndure maintains lightweight staging instances in AWS — small instances whose sole job is to receive and apply incoming replication data. These are not your recovery machines. They're cheap, they run continuously, and they're invisible to your application. When you initiate a recovery (test or actual), CloudEndure provisions your actual target instances at the right size, boots them from the replicated disks, and you get your application back.

This separation means you're not paying for full-size recovery infrastructure at idle. You pay for the staging area (cheap) until you need to recover, at which point you pay for the full target environment only as long as you need it.
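The economics are easy to sanity-check with back-of-envelope arithmetic. The instance prices and machine count below are illustrative assumptions, not quotes from the Penn Engineering deployment:

```python
# Rough monthly cost sketch for the staging-vs-target model.
# Hourly prices are illustrative assumptions; check current AWS pricing.
HOURS_PER_MONTH = 730

staging_hourly = 0.0104   # assumed small staging instance
target_hourly = 0.40      # assumed full-size recovery instance
machines = 50

# Normal operation: only the small staging instances run continuously.
idle_monthly = machines * staging_hourly * HOURS_PER_MONTH

# A 4-hour drill: full-size targets exist only for the drill window.
drill_cost = machines * target_hourly * 4

# Alternative warm-standby model: full-size instances running all month.
always_on_monthly = machines * target_hourly * HOURS_PER_MONTH

print(f"staging only: ${idle_monthly:,.0f}/month")
print(f"one 4h drill: ${drill_cost:,.0f}")
print(f"always-on DR: ${always_on_monthly:,.0f}/month")
```

Even with generous assumptions, the staging-area model is an order of magnitude cheaper at idle than keeping full-size standby infrastructure running.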

Agent Installation

The agent installation across eight sites was the first task I automated. CloudEndure provides a replication_settings.json that encodes the connection details:

{
  "token": "YOUR_PROJECT_TOKEN",
  "replicationServerHostname": "0.0.0.0",
  "usePrivateIP": false,
  "proxyHostname": "",
  "proxyPort": ""
}

The token is per-CloudEndure project. On Linux, installation is:

wget -O ./installer_linux.py https://console.cloudendure.com/installer_linux.py
sudo python ./installer_linux.py -t YOUR_PROJECT_TOKEN

On Windows it's an MSI. At eight sites with dozens of machines each, doing this by hand was not an option.

Ansible Playbook for Agent Deployment

I wrote a single playbook that handled both Windows and Linux targets. The inventory was grouped by OS type, populated from a CMDB export.

---
- name: Install CloudEndure agent - Linux
  hosts: linux_servers
  become: true
  vars:
    cloudendure_token: "{{ vault_cloudendure_token }}"
  tasks:
    - name: Download CloudEndure Linux installer
      get_url:
        url: https://console.cloudendure.com/installer_linux.py
        dest: /tmp/installer_linux.py
        mode: '0755'

    - name: Run CloudEndure installer
      command: >
        python /tmp/installer_linux.py -t {{ cloudendure_token }}
      args:
        creates: /etc/cloudendure/agent.env

    - name: Ensure CloudEndure service is running
      service:
        name: cloudendure
        state: started
        enabled: true

- name: Install CloudEndure agent - Windows
  hosts: windows_servers
  vars:
    cloudendure_token: "{{ vault_cloudendure_token }}"
  tasks:
    - name: Download CloudEndure Windows installer
      win_get_url:
        url: https://console.cloudendure.com/WIN32/installer_win.exe
        dest: C:\Temp\installer_win.exe

    - name: Run CloudEndure installer silently
      win_command: >
        C:\Temp\installer_win.exe -t {{ cloudendure_token }} --no-prompt
      args:
        creates: C:\Program Files (x86)\CloudEndure\cloudendure.exe

    - name: Ensure CloudEndure service is running
      win_service:
        name: CloudEndure Agent
        state: started
        start_mode: auto

The token was stored in Ansible Vault. Running this against all eight sites' inventories brought every machine under replication within a day.
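The CMDB-to-inventory step is worth scripting too. This is a sketch of how such an export might be turned into the `linux_servers` / `windows_servers` groups the playbook expects; the CSV columns (`hostname`, `os`) are an assumed export format, not anything CloudEndure-specific:

```python
# Sketch: convert a CMDB CSV export into an INI-style Ansible inventory.
# Assumes columns named "hostname" and "os" in the export.
import csv
from collections import defaultdict

def build_inventory(csv_path):
    """Group hostnames into the OS-based groups the playbook targets."""
    groups = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            key = ("windows_servers" if "windows" in row["os"].lower()
                   else "linux_servers")
            groups[key].append(row["hostname"])
    return groups

def to_ini(groups):
    """Render the groups as INI-format inventory text."""
    lines = []
    for group, hosts in sorted(groups.items()):
        lines.append(f"[{group}]")
        lines.extend(sorted(hosts))
        lines.append("")
    return "\n".join(lines)
```

Regenerating the inventory from the CMDB before each run keeps the replication fleet in sync with what actually exists at each site.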

Monitoring Replication Lag

Once replication is running, the thing to watch is lag. CloudEndure's console shows a replication lag indicator per machine — how far behind the staging area is relative to the source. For a healthy replication session on a machine with normal I/O, this sits in the seconds-to-minutes range.

Lag spikes happen: network interruptions, high I/O on the source, AWS staging-area throttling. The CloudEndure API lets you pull this programmatically:

import requests

API_TOKEN = 'YOUR_API_TOKEN'     # generated in the CloudEndure console
project_id = 'YOUR_PROJECT_ID'

# Log in with the API token; CloudEndure returns an XSRF-TOKEN cookie
# that must be echoed back as a header on every subsequent request.
session = requests.Session()
session.post('https://console.cloudendure.com/api/latest/login',
             json={'userApiToken': API_TOKEN})
session.headers['X-XSRF-TOKEN'] = session.cookies.get('XSRF-TOKEN', '').strip('"')

machines = session.get(
    f'https://console.cloudendure.com/api/latest/projects/{project_id}/machines'
).json()

for machine in machines['items']:
    info = machine.get('replicationInfo', {})
    backlog = info.get('backloggedStorageBytes')        # bytes waiting to replicate
    last_consistency = info.get('lastConsistencyDateTime')
    print(f"{machine['sourceProperties']['name']}: "
          f"backlog {backlog} bytes, last consistency {last_consistency}")

We wired this into a monitoring script that ran every 15 minutes and alerted if any machine's last consistency timestamp was more than 30 minutes old. That was our early warning for replication health before it became an RTO problem.
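The staleness test at the heart of that monitor can be sketched as a small pure function (the dict shape mirrors the API response above; the ISO 8601 timestamp format is what the API returns):

```python
# Sketch of the 30-minute staleness check: flag any machine whose
# lastConsistencyDateTime is older than the threshold, or missing entirely.
from datetime import datetime, timedelta, timezone

THRESHOLD = timedelta(minutes=30)

def stale_machines(machines, now=None):
    """Return names of machines whose last consistency point is too old."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for m in machines:
        ts = m.get('replicationInfo', {}).get('lastConsistencyDateTime')
        if ts is None:
            # Never reached consistency: treat as stale and alert.
            stale.append(m['sourceProperties']['name'])
            continue
        last = datetime.fromisoformat(ts.replace('Z', '+00:00'))
        if now - last > THRESHOLD:
            stale.append(m['sourceProperties']['name'])
    return stale
```

Keeping the check pure (time injected as a parameter) makes it trivial to unit-test before trusting it with 2 AM pages.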

The Cutover Process

Actual recovery — whether a real event or a drill — follows the same steps. In the CloudEndure console you select the machines you want to recover and choose the recovery point (you can go back to any point in the last few days, not just the latest). Then you launch target machines.

CloudEndure provisions the target instances, boots them, and they come up running. At this point you have two things to do: verify application health and update DNS. The DNS step is the one that bites you if you haven't planned it. Applications that were reached via hostname need their DNS records updated to point at the new IPs. We pre-documented every application's DNS dependencies in the runbook and assigned an owner to each update.
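Where the records live in Route 53, the DNS step itself is scriptable. This is a sketch using the standard `change_resource_record_sets` UPSERT call; the zone ID and record name are placeholders, and on-prem DNS would need the equivalent update against your internal resolver:

```python
# Sketch of a scripted DNS cutover step against Route 53.
# Pass a boto3 route53 client; zone_id and record_name are placeholders.
def point_record_at(client, zone_id, record_name, new_ip, ttl=60):
    """UPSERT an A record so the application hostname resolves to the
    recovered instance's IP. Low TTL keeps the cutover fast to amend."""
    return client.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Comment": "DR cutover",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": new_ip}],
                },
            }],
        },
    )

# Usage (boto3 assumed):
#   point_record_at(boto3.client("route53"),
#                   "ZONEID", "app.example.edu.", "10.0.5.12")
```

Even when the network team owns the actual update, having the exact change spelled out in the runbook removes ambiguity during an event.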

Health verification was application-specific. For web applications, a simple HTTP check against the expected URL was enough. For database servers, we had validation queries. These were all scripted and included in the runbook so the person doing the recovery didn't have to remember them under pressure.
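The shape of those scripted checks was simple. A hedged sketch, with placeholder URLs and a placeholder validation query standing in for the application-specific ones:

```python
# Sketch of post-cutover health checks: an HTTP probe for web apps and a
# validation query for databases. The actual URLs and SQL are per-app.
import urllib.request

def check_http(url, expect=200, timeout=10):
    """Return True if the URL answers with the expected status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == expect
    except Exception:
        return False

def check_db(run_query):
    """run_query executes SQL against the recovered database and returns
    rows; 'SELECT 1' is a placeholder for the app's validation query."""
    rows = run_query("SELECT 1")
    return bool(rows)
```

The point is not sophistication; it's that the recovery operator runs a named script and reads pass/fail instead of improvising checks under pressure.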

Test vs. Recovery Mode

This is important: CloudEndure has an explicit distinction between test launches and recovery launches. In test mode, it spins up the target instances but leaves the source machines running and replicating. The test instances are isolated: they come up with modified network settings so they don't conflict with production. You can validate everything without touching production.

We ran quarterly DR drills using test mode. The drill covered: launch target machines, verify application health, document any issues, terminate test machines. Each drill took about two hours for the critical systems. The quarterly cadence was what our business continuity documentation required, and CloudEndure made it practical to actually do it rather than just declare it done on paper.

Runbook Format

The business continuity docs I wrote for Penn Engineering followed a structure I've reused since: each covered system had a one-page runbook with five sections. First, system overview: what it is, what depends on it, the RTO/RPO target. Second, replication health check: where to look and what healthy looks like. Third, cutover procedure: step by step, with the exact commands and console actions, named owner for each step. Fourth, post-cutover validation: the specific checks that confirm the application is functioning. Fifth, rollback procedure: if the recovery doesn't work, how do you get back to source.

The named owner section mattered. DR events happen at 2 AM. The runbook needs to say "DNS updates: contact network team, on-call pager X" rather than "update DNS records."

CloudEndure made the technical side of disaster recovery much more tractable than the alternatives we evaluated. The business continuity planning around it is where most of the work actually lives.