May 10, 2016 Marie H.

ChatOps from Scratch: Building Chatastic at DevOpsDays Austin



I built Chatastic in a few hours at the DevOpsDays Austin 2016 hackathon. It won. Here's what it does and why it exists — and why I'd just spent two weeks doing exactly the kind of work it was designed to prevent.

The Problem: Your Chat Provider Just Changed

A few months before DevOpsDays, my employer switched from HipChat to Slack. Straightforward business decision. Terrible week for whoever was responsible for integrations — me.

We had Nagios firing to HipChat. PagerDuty to HipChat. Deploy notifications to HipChat. Uptime monitors to HipChat. Custom alert scripts to HipChat. Every one of those had to be found, updated, tested, and redeployed. Some were in Chef cookbooks. Some were env vars in Elastic Beanstalk. Some were hardcoded webhook URLs buried in cron jobs. It took longer than it should have, and I missed two that kept firing into a dead channel for another week before someone noticed.

When it was done, I had a working Slack setup and a strong opinion: this should never require touching more than one thing.

The N×M Problem

The pattern that creates this pain is direct integration. You've got N monitoring tools and M chat/alerting destinations, so you're managing N×M integrations. When a destination changes — provider migration, new channel structure, new webhook URL — you touch N things. When a source changes format, you touch M things.

The fix is a hub. Wire each source to the hub once. Wire each destination to the hub once. Now N+M integrations instead of N×M. A provider migration means updating one destination config in one place, and every source keeps working without changes.

That's Chatastic.
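The hub pattern itself fits in a few lines. Here's a minimal sketch (names and structure are mine, not from the Chatastic repo): every source calls a single `notify()` entry point, and the hub owns a registry of destinations. A provider migration becomes one edit to the registry; no source changes.

```python
# Minimal hub sketch -- illustrative only, not code from the Chatastic repo.
# Sources call notify(); destinations are wired to the hub exactly once.

DESTINATIONS = {}  # destination name -> callable(message)

def register_destination(name, send_fn):
    """Wire a destination to the hub once (N+M, not N*M)."""
    DESTINATIONS[name] = send_fn

def notify(message):
    """Every source calls this; the hub fans out to all destinations."""
    for send in DESTINATIONS.values():
        send(message)

# A HipChat -> Slack migration is one line: re-register the "chat" slot.
received = []
register_destination("chat", lambda msg: received.append(("slack", msg)))
notify("deploy finished")
```

Swapping the callable behind `"chat"` is the whole migration; the sources calling `notify()` never know the provider changed.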

The Hackathon

DevOpsDays Austin runs a full-day build session as part of the conference. You show up with an idea (or you don't, and you find one), you build it, and you demo it. The judging is informal — what did you actually ship, and does it solve a real problem?

I walked in with the HipChat→Slack migration still fresh. I had a Flask app skeleton in my head before I sat down. By mid-afternoon I had something running. By demo time it was wired up with a working AngularJS frontend and a live SQS queue.

How It Works

The architecture is straightforward:

  1. Flask API running on localhost:5000 — one endpoint per provider (/slack, /pagerduty, /hipchat, /victorops, /notification). Each endpoint supports GET (retrieve current config), POST (connect the provider), and PUT (send a notification).

  2. SQLAlchemy models — Account, Provider, ProviderSettings — store the connection config for each provider.

  3. AWS SQS as the message queue backbone. Incoming notifications get dropped onto the queue.

  4. sqs_listener.py — a daemon that polls the SQS queue and spawns worker threads to deliver messages to whichever output providers are configured.

  5. AngularJS frontend served by nginx — this is the UI where you connect your services.

Here's the Flask provider endpoint, roughly as it lived in the repo:

from flask import Blueprint, request, jsonify
from app.models import Provider, ProviderSettings, db
import boto3
import json

slack_bp = Blueprint('slack', __name__)

@slack_bp.route('/slack', methods=['GET'])
def get_slack():
    provider = Provider.query.filter_by(name='slack').first()
    if not provider:
        return jsonify({'error': 'Slack not configured'}), 404
    settings = ProviderSettings.query.filter_by(provider_id=provider.id).all()
    return jsonify({'settings': [s.to_dict() for s in settings]})

@slack_bp.route('/slack', methods=['POST'])
def connect_slack():
    data = request.get_json()
    provider = Provider.query.filter_by(name='slack').first()
    if not provider:
        provider = Provider(name='slack')
        db.session.add(provider)
        db.session.flush()
    settings = ProviderSettings(
        provider_id=provider.id,
        webhook_url=data['webhook_url'],
        channel=data.get('channel', '#general')
    )
    db.session.add(settings)
    db.session.commit()
    return jsonify({'status': 'connected'}), 201

@slack_bp.route('/slack', methods=['PUT'])
def send_slack():
    data = request.get_json()
    sqs = boto3.client('sqs', region_name='us-east-1')
    message = {
        'provider': 'slack',
        'payload': data
    }
    sqs.send_message(
        QueueUrl=data['queue_url'],
        MessageBody=json.dumps(message)
    )
    return jsonify({'status': 'queued'}), 202

And here's the SQS listener daemon that actually delivers the messages:

import boto3
import json
import threading
import requests
import time

QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789/chatastic-queue'

def deliver_message(message):
    body = json.loads(message['Body'])
    provider = body['provider']
    payload = body['payload']

    if provider == 'slack':
        webhook_url = payload['webhook_url']
        requests.post(webhook_url, json={
            'text': payload.get('text', ''),
            'channel': payload.get('channel', '#general')
        })
    elif provider == 'pagerduty':
        # PagerDuty Events API v1
        requests.post('https://events.pagerduty.com/generic/2010-04-15/create_event.json',
                      json=payload)
    # ... hipchat, victorops follow same pattern

def poll_queue():
    sqs = boto3.client('sqs', region_name='us-east-1')
    while True:
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20
        )
        messages = response.get('Messages', [])
        threads = []
        for msg in messages:
            t = threading.Thread(target=deliver_message, args=(msg,))
            t.start()
            threads.append((t, msg))
        for t, msg in threads:
            t.join()
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg['ReceiptHandle']
            )
        if not messages:
            time.sleep(1)

if __name__ == '__main__':
    poll_queue()

The SQS long-polling (WaitTimeSeconds=20) keeps the daemon efficient: instead of hammering the queue, it sits and waits for messages to arrive.

What Won the Judges

The pitch was simple: I just spent two weeks updating integrations because my employer switched chat providers. Chatastic means that migration is one config change instead of twenty.

Everyone in the room had a version of this story. HipChat was already losing ground to Slack in 2016 — half the audience had either done this migration or knew it was coming. The concept landed immediately because it was solving a problem they'd personally felt, not a hypothetical one.

The AngularJS frontend helped too. It made it feel like a real product, not just a script. You could demo adding a Slack webhook, adding a PagerDuty integration, and showing that a notification sent to the hub came out both sides. Live, in a few clicks.

The underlying point — that an abstraction layer between your notification sources and destinations is worth building — was the thing that resonated. The judges weren't evaluating Flask code quality. They were evaluating whether the idea solved a real problem. It did, and I had a concrete example of the pain it would have prevented.

Honest Take

This is a hackathon project. The code is rough. There's no authentication on the API, which is a glaring hole. The worker thread model is naive — if a delivery fails, the message gets acked and disappears. There's no retry logic, no dead-letter queue, no metrics on delivery success.

The SQS queue URL is hardcoded in the listener. The provider settings are stored in SQLite. The AngularJS frontend has approximately zero error handling.

None of that is surprising for something built in a day. The concept is sound. The implementation is a proof of concept, and it knows it.

If I were building this for production, I'd swap the threading model for something like Celery, add proper authentication to the API, implement dead-letter queue handling in SQS, and build retry logic into each provider client. The abstraction layer would stay exactly the same.
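The delete-on-success piece alone is a small change. A hedged sketch of the shape it could take (the retry count and backoff are arbitrary choices of mine; the caller would only call `sqs.delete_message` when this returns True, leaving failed messages on the queue for SQS's redrive policy to move to a dead-letter queue):

```python
import time

def deliver_with_retry(deliver, message, max_attempts=3, base_delay=1.0):
    """Attempt delivery with exponential backoff; return True only on success.

    The listener would delete the SQS message only when this returns True.
    On False, the message stays on the queue, and a configured redrive
    policy eventually moves it to a dead-letter queue.
    """
    for attempt in range(max_attempts):
        try:
            deliver(message)
            return True
        except Exception:
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return False

# Illustrative: a flaky destination that fails twice, then succeeds.
calls = {"n": 0}
def flaky_destination(msg):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("destination down")

ok = deliver_with_retry(flaky_destination, {"text": "hi"}, base_delay=0)
```

That one change closes the "failed delivery gets acked anyway" hole without touching the abstraction layer at all.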

The code is on GitHub: https://github.com/morissette/chatastic

If you're drowning in direct integrations between monitoring tools and chat systems, the pattern here is worth stealing even if you don't use the code directly. One hub, many spokes. It's not a novel idea — it's just one that ops teams consistently fail to apply to their alerting infrastructure.