How to Build a Self-Healing Container Infrastructure with Docker and Python

Auto-restart failing containers, monitor with Prometheus, and get Slack alerts — no Kubernetes needed.

Final-year EXTC engineer at SFIT Mumbai. I write about DevOps, backend engineering, and developer tooling. Open source contributor at githedgehog and opentofu

Modern production systems fail. A container crashes, a service hangs, a health check starts returning 503. The question isn't whether your infrastructure will break — it's whether it can recover on its own before anyone notices.

In this guide, you'll learn how to build a self-healing container infrastructure — a system that monitors its own health, detects failures, and automatically restarts unhealthy containers. We'll use Docker Compose for orchestration, Python for the health monitor logic, and Prometheus + Grafana for observability. By the end, you'll have a working system that heals itself — and alerts you on Slack when it does.


What You'll Build

A Docker Compose setup where:

  • A Flask sample app simulates a real service with a /health endpoint

  • A Python health monitor continuously polls every container's health endpoint

  • When a container fails 3 consecutive checks, the monitor automatically restarts it

  • Prometheus scrapes metrics from all containers

  • Slack alerts notify your team when a container is restarted

  • A GitHub Actions CI pipeline runs tests and builds images on every push

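Here's the high-level architecture: the health monitor polls each labelled container's /health endpoint and, through the mounted Docker socket, restarts any container that fails three checks in a row, then posts to Slack. Prometheus scrapes the sample app's /metrics endpoint, and Grafana runs alongside it for dashboards.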


Prerequisites

  • Docker Desktop installed (includes Docker Compose v2)

  • Python 3.11+ installed locally

  • A Slack workspace where you can create a webhook (optional but recommended)

  • Basic familiarity with containers and HTTP


Step 1 — Understand What Self-Healing Actually Means

Docker already restarts crashed containers if you set restart: unless-stopped in your Compose file. But that default behaviour has real limits:

| Scenario | Docker's default restart | Our Python monitor |
|---|---|---|
| Container process crashes | ✅ Restarts | ✅ Restarts |
| Container running but returning 500 errors | ❌ Does nothing | ✅ Detects & restarts |
| Container running but hanging (no response) | ❌ Does nothing | ✅ Times out & restarts |
| Custom backoff / alerting logic | ❌ Not possible | ✅ Fully configurable |
| Slack/PagerDuty alert on recovery | ❌ Not possible | ✅ Built-in |

The gap between "container is alive" and "container is healthy" is where most production incidents live. A process can be running — consuming CPU, holding ports — while returning garbage to every real request. Docker has no idea. Our monitor does.

How the healing loop works

Every 15 seconds:
  for each container labelled monitored=true:
    GET /health
    if response == 200:
      reset failure counter
    else:
      increment failure counter
      if failure counter >= 3:
        docker restart container
        send Slack alert
        reset failure counter

Three consecutive failures before restarting is intentional — transient network blips shouldn't trigger a restart. You can tune this threshold via environment variables.
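
The loop above boils down to a small per-container state machine, which is easy to try in isolation. The sketch below is only an illustration, not the monitor's actual code: `FailureTracker` and `record` are hypothetical names, and the probe result is passed in as a boolean instead of coming from a real HTTP call.

```python
from collections import defaultdict


class FailureTracker:
    """Counts consecutive failed health checks per container (hypothetical helper)."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.counts: dict[str, int] = defaultdict(int)

    def record(self, name: str, healthy: bool) -> bool:
        """Record one check result; return True when a restart is due."""
        if healthy:
            self.counts[name] = 0  # any success breaks the streak
            return False
        self.counts[name] += 1
        if self.counts[name] >= self.threshold:
            self.counts[name] = 0  # start a fresh streak after the restart
            return True
        return False


tracker = FailureTracker(threshold=3)
# fail, fail, success, fail, fail, fail: only the final streak reaches the threshold
sequence = (False, False, True, False, False, False)
results = [tracker.record("sample-app", ok) for ok in sequence]
print(results)
```

The single success in the middle wipes out the first two failures, which is exactly why a transient blip does not trigger a restart.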


Step 2 — Project Structure

Self-Healing-Container-Infrastructure/
├── healer/
│   ├── health_monitor.py       ← Core monitor logic
│   ├── slack_alert.py          ← Slack webhook helper
│   ├── requirements.txt
│   └── Dockerfile
├── sample-app/
│   ├── app.py                  ← Flask app with /break and /fix
│   ├── requirements.txt
│   └── Dockerfile
├── prometheus/
│   └── prometheus.yml          ← Scrape config
├── scripts/
│   └── simulate_failure.sh     ← Demo script
├── .github/
│   └── workflows/
│       └── ci.yml              ← GitHub Actions pipeline
├── docker-compose.yml
└── README.md

Step 3 — Write the Python Health Monitor

This is the core of the project. The monitor uses the Docker Python SDK to find all containers labelled monitored=true, polls their /health endpoints, tracks consecutive failures, and restarts them after hitting the threshold.

Install dependencies

pip install docker==7.1.0 requests==2.32.3

healer/requirements.txt

docker==7.1.0
requests==2.32.3

healer/health_monitor.py

import os
import time
import logging
import requests
import docker
from slack_alert import send_slack_alert

# ── Configuration (tunable via environment variables) ─────────────────────────
CHECK_INTERVAL    = int(os.getenv("CHECK_INTERVAL",    15))  # seconds between polls
FAILURE_THRESHOLD = int(os.getenv("FAILURE_THRESHOLD",  3))  # failures before restart
REQUEST_TIMEOUT   = int(os.getenv("REQUEST_TIMEOUT",    5))  # seconds before timeout
HEALTH_PATH       = os.getenv("HEALTH_PATH", "/health")

# ── Logging setup ─────────────────────────────────────────────────────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s  %(levelname)-8s  %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
log = logging.getLogger(__name__)

# ── Docker client ─────────────────────────────────────────────────────────────
# docker.from_env() reads DOCKER_HOST from the environment.
# When running inside a container, mount /var/run/docker.sock as a volume.
client = docker.from_env()

# In-memory failure counter: { container_name: int }
failure_counts: dict[str, int] = {}


def get_monitored_containers() -> list:
    """Return all running containers that have the label monitored=true."""
    return client.containers.list(
        filters={"label": "monitored=true", "status": "running"}
    )


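def get_container_url(container) -> str | None:
    """
    Build the health check URL for a container.

    Inside the monitor container, "localhost" is the monitor itself, so by
    default the target is addressed by container name and container port over
    the shared Compose network. HEALTH_HOST is an optional override added
    here: set HEALTH_HOST=localhost when running the monitor on the host.
    Returns None if no port bindings are found (skip the check).
    """
    host_override = os.getenv("HEALTH_HOST")
    for port_proto, bindings in container.ports.items():
        if not bindings:
            continue
        if host_override:
            return f"http://{host_override}:{bindings[0]['HostPort']}{HEALTH_PATH}"
        container_port = port_proto.split("/")[0]  # key looks like "8080/tcp"
        return f"http://{container.name}:{container_port}{HEALTH_PATH}"
    return None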


def check_health(container) -> bool:
    """
    Poll the container's /health endpoint.
    Returns True if healthy (200 OK), False for any other status or exception.
    """
    url = get_container_url(container)
    if not url:
        log.debug("No port binding for %s — skipping health check", container.name)
        return True  # don't penalise containers with no exposed port

    try:
        response = requests.get(url, timeout=REQUEST_TIMEOUT)
        if response.status_code == 200:
            return True
        log.warning(
            "%s returned HTTP %d", container.name, response.status_code
        )
        return False
    except requests.exceptions.ConnectionError:
        log.warning("%s — connection refused", container.name)
        return False
    except requests.exceptions.Timeout:
        log.warning("%s — timed out after %ds", container.name, REQUEST_TIMEOUT)
        return False
    except requests.exceptions.RequestException as exc:
        log.warning("%s — unexpected error: %s", container.name, exc)
        return False


def restart_container(container) -> None:
    """
    Restart a container that has exceeded the failure threshold.
    Resets its failure counter and sends a Slack alert.
    """
    name = container.name
    log.warning(
        "RESTARTING %s — failed %d consecutive health checks", name, FAILURE_THRESHOLD
    )
    try:
        container.restart(timeout=10)
        failure_counts[name] = 0
        log.info("Successfully restarted: %s", name)
        send_slack_alert(
            event="recovered",
            container_name=name,
            message=(
                f"⚠ Container *{name}* failed {FAILURE_THRESHOLD} "
                f"consecutive health checks and was automatically restarted."
            ),
        )
    except docker.errors.APIError as exc:
        log.error("Failed to restart %s: %s", name, exc)


def run_monitor_loop() -> None:
    """Main loop — polls all monitored containers on a fixed interval."""
    log.info(
        "Health monitor started — interval=%ds, threshold=%d failures",
        CHECK_INTERVAL,
        FAILURE_THRESHOLD,
    )
    while True:
        containers = get_monitored_containers()
        log.info("Checking %d monitored container(s)…", len(containers))

        for container in containers:
            name = container.name

            if check_health(container):
                if failure_counts.get(name, 0) > 0:
                    log.info("%s — recovered (resetting counter)", name)
                failure_counts[name] = 0
            else:
                failure_counts[name] = failure_counts.get(name, 0) + 1
                count = failure_counts[name]
                log.warning(
                    "%s — unhealthy (%d/%d)", name, count, FAILURE_THRESHOLD
                )
                if count >= FAILURE_THRESHOLD:
                    restart_container(container)

        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    run_monitor_loop()

Key design decisions worth noting:

  • docker.from_env() reads DOCKER_HOST from the environment. Inside Docker, mount /var/run/docker.sock so the monitor can control the host daemon.

  • failure_counts is in-memory — it resets if the monitor itself restarts. For production, swap this with Redis or a persistent store.

  • Separate exceptions for ConnectionError vs Timeout — they mean different things. A timeout often means the app is overloaded; a connection refusal means it's completely down.

  • container.restart(timeout=10) gives the container 10 seconds to shut down gracefully before a SIGKILL.
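
To make the second point concrete, here is a minimal sketch of counters that survive a monitor restart. It swaps in a JSON file (for example on a mounted volume) where the text suggests Redis, purely to stay dependency-free; `FileBackedCounts` and the file name are hypothetical and not part of the project.

```python
import json
import tempfile
from pathlib import Path


class FileBackedCounts:
    """Failure counters that survive a monitor restart (hypothetical helper)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Load previous counts if the state file exists, else start empty.
        self.counts: dict[str, int] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def increment(self, name: str) -> int:
        self.counts[name] = self.counts.get(name, 0) + 1
        self._save()
        return self.counts[name]

    def reset(self, name: str) -> None:
        self.counts[name] = 0
        self._save()

    def _save(self) -> None:
        # Write to a temp file, then rename, so a crash cannot leave a half-written file.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.counts))
        tmp.replace(self.path)


state_file = Path(tempfile.mkdtemp()) / "failure_counts.json"
counts = FileBackedCounts(state_file)
counts.increment("sample-app")
counts.increment("sample-app")

# Simulate the monitor restarting: a new object reads the same file.
reloaded = FileBackedCounts(state_file)
print(reloaded.counts)
```

A file works for a single monitor instance. If several monitors ever share state, a store with atomic increments such as Redis is the safer choice.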


Step 4 — Add Slack Alerts

When the monitor restarts a container, you want to know about it — especially at 3 AM when you're not watching logs.

Create a Slack Incoming Webhook

  1. Go to api.slack.com/apps → Create New App → From scratch

  2. Under Features, click Incoming Webhooks → toggle On

  3. Click Add New Webhook to Workspace → pick a channel → copy the webhook URL

  4. Set it as an environment variable: SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...

healer/slack_alert.py

import os
import logging
import requests

log = logging.getLogger(__name__)

SLACK_WEBHOOK_URL = os.getenv("SLACK_WEBHOOK_URL", "")


def send_slack_alert(event: str, container_name: str, message: str) -> None:
    """
    Send a formatted Slack message via an Incoming Webhook.
    Silently no-ops if SLACK_WEBHOOK_URL is not set.
    """
    if not SLACK_WEBHOOK_URL:
        log.debug("SLACK_WEBHOOK_URL not set — skipping Slack alert")
        return

    # Colour-code by event type
    colour_map = {
        "recovered":  "#36a64f",   # green
        "degraded":   "#ff9900",   # orange
        "critical":   "#ff0000",   # red
    }
    colour = colour_map.get(event, "#cccccc")

    payload = {
        "attachments": [
            {
                "color": colour,
                "title": f"🐳 Self-Healing Infrastructure — {event.upper()}",
                "text": message,
                "footer": "health-monitor",
                "ts": int(__import__("time").time()),
            }
        ]
    }

    try:
        resp = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=5)
        resp.raise_for_status()
        log.info("Slack alert sent for container: %s", container_name)
    except requests.exceptions.RequestException as exc:
        log.error("Failed to send Slack alert: %s", exc)

If SLACK_WEBHOOK_URL isn't set, the function silently no-ops — no crashes, no noise. This makes it easy to run the project locally without Slack configured.


Step 5 — Build the Sample Flask App

The sample app simulates a real microservice. It exposes four endpoints:

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Returns 200 if healthy, 503 if broken |
| /break | POST | Simulates a failure (flips health to unhealthy) |
| /fix | POST | Restores health (resets to healthy) |
| /metrics | GET | Prometheus metrics (via prometheus_flask_exporter) |

sample-app/app.py

import os
import logging
from flask import Flask, jsonify
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)
metrics = PrometheusMetrics(app)

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

# Mutable state — in a real app this would be a proper health-check result
_healthy = True


@app.get("/health")
def health():
    """Health endpoint polled by the monitor every 15 seconds."""
    if _healthy:
        return jsonify({"status": "ok"}), 200
    return jsonify({"status": "degraded", "reason": "manually broken"}), 503


@app.post("/break")
def break_app():
    """Simulate a failure — the next health check will return 503."""
    global _healthy
    _healthy = False
    log.warning("Health set to DEGRADED (manually triggered)")
    return jsonify({"message": "App is now unhealthy"}), 200


@app.post("/fix")
def fix_app():
    """Restore health — normally you wouldn't need this; restart does it."""
    global _healthy
    _healthy = True
    log.info("Health restored")
    return jsonify({"message": "App is now healthy"}), 200


@app.get("/")
def index():
    return jsonify({"service": "sample-app", "healthy": _healthy})


if __name__ == "__main__":
    port = int(os.getenv("PORT", 8080))
    app.run(host="0.0.0.0", port=port)

sample-app/requirements.txt

flask==3.0.3
prometheus-flask-exporter==0.23.1

sample-app/Dockerfile

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

EXPOSE 8080
CMD ["python", "app.py"]
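
The sample app's `_healthy` flag stands in for what its own comment calls "a proper health-check result". As a rough sketch of what that could look like, the function below runs a set of named dependency checks and turns them into a response body and status code. `aggregate_health` and the check names are hypothetical; real checks would ping whatever your service actually depends on.

```python
from typing import Callable


def aggregate_health(checks: dict[str, Callable[[], bool]]) -> tuple[dict, int]:
    """Run each named check; return a JSON-ready body and an HTTP status code."""
    results: dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a check that raises counts as failed
    healthy = all(results.values())
    body = {"status": "ok" if healthy else "degraded", "checks": results}
    return body, 200 if healthy else 503


# Stand-in checks: the second one raises ZeroDivisionError on purpose.
body, status = aggregate_health({
    "database": lambda: True,
    "cache": lambda: 1 / 0 > 0,
})
print(status, body)
```

Inside the Flask handler this would become `return jsonify(body), status`, so the monitor still sees a plain 200 or 503.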

Step 6 — Docker Images for the Monitor

healer/Dockerfile

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY health_monitor.py slack_alert.py ./

# The monitor needs access to the Docker daemon.
# Mount /var/run/docker.sock at runtime (see docker-compose.yml).
CMD ["python", "health_monitor.py"]

Step 7 — Wire It Together with Docker Compose

# docker-compose.yml
# (no top-level "version" key: Compose v2 ignores it and warns that it is obsolete)

services:

  sample-app:
    build: ./sample-app
    container_name: sample-app
    ports:
      - "8080:8080"
    labels:
      - "monitored=true"           # ← this is what the monitor looks for
    restart: unless-stopped
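    healthcheck:
      # python:3.11-slim does not ship curl, so probe with the interpreter instead;
      # urlopen raises on a 503, which makes the check exit non-zero.
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=3)"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 5s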

  health-monitor:
    build: ./healer
    container_name: health-monitor
    depends_on:
      - sample-app
    environment:
      - CHECK_INTERVAL=15
      - FAILURE_THRESHOLD=3
      - REQUEST_TIMEOUT=5
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL:-}
    volumes:
      # Give the monitor access to the Docker daemon
      - /var/run/docker.sock:/var/run/docker.sock
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.2
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus
    restart: unless-stopped

Important: Mounting /var/run/docker.sock inside the monitor container gives it full access to the Docker daemon on the host — it can start, stop, and inspect any container. In production, consider using Docker's TCP socket with TLS or scoping access with Authz plugins.


Step 8 — Configure Prometheus

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "sample-app"
    static_configs:
      - targets: ["sample-app:8080"]
    metrics_path: "/metrics"

  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

Prometheus will now scrape the Flask app's /metrics endpoint every 15 seconds. The prometheus_flask_exporter library automatically exposes request counts, latencies, and status code breakdowns — no extra instrumentation required.


Step 9 — Run It

Start the full stack

docker compose up --build

You should see all four containers start up:

✔ Container sample-app       Started
✔ Container health-monitor   Started
✔ Container prometheus        Started
✔ Container grafana           Started

Verify everything is healthy

curl http://localhost:8080/health
# → {"status": "ok"}

curl -s http://localhost:9090/api/v1/targets
# → JSON in which the sample-app target reports "health":"up"
# (or open http://localhost:9090/targets in a browser)
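
If you would rather script the Prometheus check, the sketch below reads the targets API and reduces it to one health value per job. It assumes Prometheus is reachable on localhost:9090 as in this guide's Compose file; `summarise_targets` and `fetch_targets` are hypothetical helper names, and the sample payload is a trimmed illustration of the response shape, not captured output.

```python
import json
import urllib.request


def summarise_targets(payload: dict) -> dict[str, str]:
    """Map each scrape job to its reported health ("up", "down", or "unknown")."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return {t["labels"]["job"]: t["health"] for t in targets}


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Call the Prometheus targets API; needs the stack from this guide running."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Trimmed example of the response shape, so the parser can be tried offline.
sample = {
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "sample-app"}, "health": "up"},
        {"labels": {"job": "prometheus"}, "health": "up"},
    ]},
}
print(summarise_targets(sample))
```

With the stack running, `summarise_targets(fetch_targets())` gives the live view. A target may report "unknown" until its first scrape completes.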

Trigger a failure

Open a second terminal:

curl -X POST http://localhost:8080/break
# → {"message": "App is now unhealthy"}

Switch back to the first terminal and watch the monitor logs:

2024-05-10 14:32:15  WARNING  sample-app — unhealthy (1/3)
2024-05-10 14:32:30  WARNING  sample-app — unhealthy (2/3)
2024-05-10 14:32:45  WARNING  sample-app — unhealthy (3/3)
2024-05-10 14:32:45  WARNING  RESTARTING sample-app — failed 3 consecutive health checks
2024-05-10 14:32:47  INFO     Successfully restarted: sample-app

Within 45 seconds (3 checks × 15 seconds), the monitor detects the failure and restarts the container. After restart, the Flask app comes back healthy and the counter resets.
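
The 45-second figure is the upper end of a range, and it stretches when checks time out instead of failing fast. The sketch below works through that arithmetic under two simplifying assumptions: a single monitored container, and every failing check taking the same time. `detection_window` is a hypothetical helper.

```python
def detection_window(
    interval: float, threshold: int, check_duration: float = 0.0
) -> tuple[float, float]:
    """
    Best and worst case seconds between a failure starting and the restart call.

    check_duration is how long one failing check takes: near 0 for a fast 503,
    up to REQUEST_TIMEOUT for a hung app. The loop sleeps after each pass, so
    the effective period between checks is interval + check_duration.
    """
    period = interval + check_duration
    # Time from the start of the first failing check to the end of the last one.
    remaining = (threshold - 1) * period + check_duration
    # Best case: the failure starts just before a poll. Worst case: just after one.
    return remaining, interval + remaining


print(detection_window(15, 3))      # app answers 503 immediately
print(detection_window(15, 3, 5))   # app hangs until the 5 s timeout
```

With the defaults, a fast 503 is caught 30 to 45 seconds after it starts, while a hanging app takes 45 to 60 seconds, which is why the demo script later in this guide waits up to 60 seconds.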

If you configured a Slack webhook, you'll also see a notification in your channel:

⚠ Container sample-app failed 3 consecutive health checks and was automatically restarted.


Step 10 — Add a CI/CD Pipeline with GitHub Actions

Every push should validate that the images build and the monitor starts cleanly.

.github/workflows/ci.yml

name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install Python dependencies
        run: |
          pip install docker==7.1.0 requests==2.32.3 pytest

      - name: Run unit tests
        run: pytest tests/ -v
        # Unit tests mock the Docker SDK — no daemon required

      - name: Build Docker images
        run: docker compose build

      - name: Start stack and smoke test
        run: |
          docker compose up -d
          sleep 10
          curl --fail http://localhost:8080/health
          docker compose down

This pipeline:

  1. Runs Python unit tests (using mocked Docker SDK calls)

  2. Builds all Docker images from scratch

  3. Spins up the full stack and smoke tests the health endpoint

  4. Tears everything down cleanly
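
The article does not show the contents of tests/, so here is a rough sketch of what a mock-based test might look like. To stay self-contained it tests `handle_result`, a simplified stand-in for one iteration of the monitor loop body. Both that function and the test names are hypothetical; real tests would import health_monitor and patch its Docker client.

```python
from unittest.mock import Mock


def handle_result(container, healthy: bool, counts: dict, threshold: int = 3) -> None:
    """Simplified stand-in for one iteration of the monitor loop body."""
    if healthy:
        counts[container.name] = 0
        return
    counts[container.name] = counts.get(container.name, 0) + 1
    if counts[container.name] >= threshold:
        container.restart(timeout=10)
        counts[container.name] = 0


def make_container(name: str = "sample-app") -> Mock:
    container = Mock()
    container.name = name  # set after creation; Mock(name=...) means something else
    return container


def test_restart_after_three_failures() -> None:
    container, counts = make_container(), {}
    for _ in range(2):
        handle_result(container, healthy=False, counts=counts)
    container.restart.assert_not_called()
    handle_result(container, healthy=False, counts=counts)
    container.restart.assert_called_once_with(timeout=10)


def test_success_resets_streak() -> None:
    container, counts = make_container(), {}
    for healthy in (False, False, True, False, False):
        handle_result(container, healthy=healthy, counts=counts)
    container.restart.assert_not_called()
    assert counts["sample-app"] == 2


if __name__ == "__main__":
    test_restart_after_three_failures()
    test_success_resets_streak()
    print("ok")
```

Because the container is a Mock, no Docker daemon is needed, which matches the comment in the workflow above.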


Step 11 — Simulate a Failure with the Script

For demos or manual testing, use the provided shell script:

scripts/simulate_failure.sh

#!/usr/bin/env bash
set -euo pipefail

APP_URL="http://localhost:8080"

echo "=== Self-Healing Demo ==="
echo ""

echo "1. Checking initial health..."
curl -s "$APP_URL/health" | python3 -m json.tool
echo ""

echo "2. Triggering failure..."
curl -s -X POST "$APP_URL/break" | python3 -m json.tool
echo ""

echo "3. Waiting for monitor to detect failure (up to 60s)..."
for i in $(seq 1 12); do
  sleep 5
  # curl exits non-zero while the container is restarting; "|| true" keeps set -e from aborting
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$APP_URL/health" || true)
  echo "   Check $i: HTTP $STATUS"
  if [ "$STATUS" = "200" ]; then
    echo ""
    echo "✅ Container was automatically restarted and is healthy again!"
    exit 0
  fi
done

echo "❌ Container did not recover in time — check monitor logs"
exit 1

Run it with:

chmod +x scripts/simulate_failure.sh
./scripts/simulate_failure.sh

Security Considerations

Before you ship this to production, there are a few things worth thinking about:

Docker socket access — Mounting /var/run/docker.sock gives the monitor container root-level access to your host. Mitigate this by:

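  • Running the monitor process as a non-root user that is only in the docker group (this limits damage inside the container, but socket access itself remains root-equivalent on the host)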

  • Using Docker Socket Proxy to whitelist only the API calls you need

  • In Kubernetes, use the Kubernetes API instead of Docker directly

Restart loops — If a container's startup logic itself crashes, the monitor could restart it repeatedly. Add exponential backoff:

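def get_backoff_seconds(failure_count: int) -> int:
    """Exponential backoff: 15s, 30s, 60s, 120s, 240s, then capped at 300s."""
    # Clamp the exponent so counts below the threshold still wait one full interval.
    exponent = max(failure_count - FAILURE_THRESHOLD, 0)
    return min(CHECK_INTERVAL * (2 ** exponent), 300)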
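
One wrinkle: the monitor resets failure_counts to 0 after every restart, so a function driven by the failure count alone never gets past its first delay. The sketch below tracks consecutive restarts separately and answers the question "may this container be restarted yet?". `RestartBackoff` is a hypothetical helper, and the injectable clock exists only so the behaviour can be demonstrated without waiting.

```python
import time


class RestartBackoff:
    """Tracks when each container may next be restarted (hypothetical helper)."""

    def __init__(self, base: float = 15, cap: float = 300, clock=time.monotonic) -> None:
        self.base, self.cap, self.clock = base, cap, clock
        self.restarts: dict[str, int] = {}
        self.not_before: dict[str, float] = {}

    def may_restart(self, name: str) -> bool:
        return self.clock() >= self.not_before.get(name, float("-inf"))

    def record_restart(self, name: str) -> float:
        """Note a restart and return the delay before the next one is allowed."""
        n = self.restarts.get(name, 0)
        delay = min(self.base * (2 ** n), self.cap)
        self.restarts[name] = n + 1
        self.not_before[name] = self.clock() + delay
        return delay

    def record_healthy(self, name: str) -> None:
        """A healthy check ends the streak, so the next backoff starts from base."""
        self.restarts.pop(name, None)
        self.not_before.pop(name, None)


now = [0.0]  # fake clock so the example runs instantly
backoff = RestartBackoff(clock=lambda: now[0])
delays = [backoff.record_restart("sample-app") for _ in range(6)]
print(delays)
```

In the monitor loop, the restart branch would call `may_restart` first and skip the restart while it returns False. Resetting on the first healthy check is the simplest policy; requiring several healthy checks in a row would be stricter.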

Alert fatigue — If you send a Slack message on every restart, a flapping service will flood your channel. Add a cooldown window — only alert if the container hasn't been restarted in the last N minutes.
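
A cooldown like that needs little more than a timestamp per container. The following is a minimal sketch: `AlertCooldown`, its 10-minute default window, and the fake clock used in the demo are all assumptions for illustration.

```python
import time


class AlertCooldown:
    """Suppresses repeat alerts for the same container inside a time window."""

    def __init__(self, window_seconds: float = 600, clock=time.monotonic) -> None:
        self.window = window_seconds
        self.clock = clock
        self.last_sent: dict[str, float] = {}

    def should_alert(self, name: str) -> bool:
        """Return True (and start a new window) if an alert may be sent now."""
        now = self.clock()
        last = self.last_sent.get(name)
        if last is not None and now - last < self.window:
            return False  # still inside the cooldown window
        self.last_sent[name] = now
        return True


now = [0.0]  # fake clock so the example runs instantly
cooldown = AlertCooldown(window_seconds=600, clock=lambda: now[0])

decisions = []
for t in (0, 45, 90, 700):  # seconds at which a restart happens
    now[0] = float(t)
    decisions.append(cooldown.should_alert("sample-app"))
print(decisions)
```

In restart_container, the call to send_slack_alert would be wrapped in `if cooldown.should_alert(name):`. The restart itself still happens every time; only the notification is throttled.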


What to Build Next

  • Prometheus alerting rules — trigger recovery based on metric thresholds (e.g., error rate > 5%) instead of just HTTP status

  • Grafana dashboard — visualise container restarts, health check latency, and uptime over time

  • Multi-container support — the monitor already supports multiple containers via the monitored=true label — just add more services to your Compose file

  • Kubernetes migration — replace Docker SDK calls with the Kubernetes Python client; use Deployment rolling restarts instead of container.restart()

  • PagerDuty integration — for on-call rotations, swap Slack for PagerDuty's Events API


Key Takeaways

Docker's built-in restart policy is a safety net, not a health system. The gap between "a process is running" and "a service is healthy" is where real incidents live.

A small Python health monitor with configurable thresholds fills that gap: no Kubernetes, no service mesh, no complex tooling. You get genuine self-healing at the cost of a single extra container.

The full project is on GitHub: github.com/gajjuu/Self-Healing-Container-Infrastructure


Found this useful? Drop a ❤ on Hashnode and share it with your team. Questions or improvements? Open an issue on GitHub.
