How to Build a Self-Healing Container Infrastructure

Modern production systems fail. A container crashes, a service hangs, a health check starts returning 503. The question isn't whether your infrastructure will break — it's whether it can recover on its own before anyone notices.

In this guide, you'll learn how to build a self-healing container infrastructure — a system that monitors its own health, detects failures, and automatically restarts unhealthy containers. We'll use Docker Compose for orchestration, Python for the health monitor logic, and Prometheus + Grafana for observability. By the end, you'll have a working system that heals itself — and alerts you on Slack when it does.

What You'll Build

A Docker Compose setup where:

A Flask sample app simulates a real service with a /health endpoint
A Python health monitor continuously polls every container's health endpoint
When a container fails 3 consecutive checks, the monitor automatically restarts it
Prometheus scrapes metrics from the sample app
Slack alerts notify your team when a container is restarted
A GitHub Actions CI pipeline runs tests and builds images on every push

Here's the high-level architecture:

Prerequisites

Docker Desktop installed (includes Docker Compose v2)
Python 3.11+ installed locally
A Slack workspace where you can create a webhook (optional but recommended)
Basic familiarity with containers and HTTP

Step 1 — Understand What Self-Healing Actually Means

Docker already restarts crashed containers if you set restart: unless-stopped in your Compose file. But that default behaviour has real limits:

Scenario	Docker's default restart	Our Python monitor
Container process crashes	✅ Restarts	✅ Restarts
Container running but returning 500 errors	❌ Does nothing	✅ Detects & restarts
Container running but hanging (no response)	❌ Does nothing	✅ Times out & restarts
Custom backoff / alerting logic	❌ Not possible	✅ Fully configurable
Slack/PagerDuty alert on recovery	❌ Not possible	✅ Built-in

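The gap between "container is alive" and "container is healthy" is where most production incidents live. A process can be running (consuming CPU, holding ports) while returning garbage to every real request. Docker's restart policy only reacts when the process exits, and even with a HEALTHCHECK defined, standalone Docker just marks the container unhealthy without restarting it. Our monitor closes that gap.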

How the healing loop works

Every 15 seconds:
  for each container labelled monitored=true:
    GET /health
    if response == 200:
      reset failure counter
    else:
      increment failure counter
      if failure counter >= 3:
        docker restart container
        send Slack alert
        reset failure counter

Three consecutive failures before restarting is intentional — transient network blips shouldn't trigger a restart. You can tune this threshold via environment variables.
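As a small illustration of tuning through environment variables, the sketch below reads a positive integer setting and falls back to the default when the value is missing or invalid. The helper name read_positive_int is hypothetical and not part of the project; only the variable names CHECK_INTERVAL and FAILURE_THRESHOLD come from this guide.

```python
import os


def read_positive_int(name: str, default: int) -> int:
    """Read a positive integer from the environment, else return default."""
    raw = os.getenv(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    # Zero or negative intervals and thresholds make no sense for a poll loop.
    return value if value > 0 else default


# Hypothetical usage mirroring the monitor's settings.
CHECK_INTERVAL = read_positive_int("CHECK_INTERVAL", 15)
FAILURE_THRESHOLD = read_positive_int("FAILURE_THRESHOLD", 3)
```

Falling back silently is one possible choice; failing fast at startup on a malformed value would be equally defensible.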

Step 2 — Project Structure

Self-Healing-Container-Infrastructure/
├── healer/
│   ├── health_monitor.py       ← Core monitor logic
│   ├── slack_alert.py          ← Slack webhook helper
│   ├── requirements.txt
│   └── Dockerfile
├── sample-app/
│   ├── app.py                  ← Flask app with /break and /fix
│   ├── requirements.txt
│   └── Dockerfile
├── prometheus/
│   └── prometheus.yml          ← Scrape config
├── scripts/
│   └── simulate_failure.sh     ← Demo script
├── .github/
│   └── workflows/
│       └── ci.yml              ← GitHub Actions pipeline
├── docker-compose.yml
└── README.md

Step 3 — Write the Python Health Monitor

This is the core of the project. The monitor uses the Docker Python SDK to find all containers labelled monitored=true, polls their /health endpoints, tracks consecutive failures, and restarts them after hitting the threshold.

Install dependencies

pip install docker==7.1.0 requests==2.32.3

healer/requirements.txt

docker==7.1.0
requests==2.32.3

healer/health_monitor.py

import os
import time
import logging
import requests
import docker
from slack_alert import send_slack_alert

# ── Configuration (tunable via environment variables) ─────────────────────────
CHECK_INTERVAL    = int(os.getenv("CHECK_INTERVAL",    15))  # seconds between polls
FAILURE_THRESHOLD = int(os.getenv("FAILURE_THRESHOLD",  3))  # failures before restart
REQUEST_TIMEOUT   = int(os.getenv("REQUEST_TIMEOUT",    5))  # seconds before timeout
HEALTH_PATH       = os.getenv("HEALTH_PATH", "/health")

# ── Logging setup ─────────────────────────────────────────────────────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s  %(levelname)-8s  %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
log = logging.getLogger(__name__)

# ── Docker client ─────────────────────────────────────────────────────────────
# docker.from_env() reads DOCKER_HOST from the environment.
# When running inside a container, mount /var/run/docker.sock as a volume.
client = docker.from_env()

# In-memory failure counter: { container_name: int }
failure_counts: dict[str, int] = {}


def get_monitored_containers() -> list:
    """Return all running containers that have the label monitored=true."""
    return client.containers.list(
        filters={"label": "monitored=true", "status": "running"}
    )


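def get_container_url(container) -> str | None:
    """
    Build the health check URL from the container's published port bindings.
    Returns None if no port bindings are found (skip the check).

    Inside a container, localhost is the monitor itself, so the target is
    addressed by container name and container port over the shared Docker
    network. On the host, the published port on localhost is used.
    """
    # /.dockerenv exists inside Docker containers; used here as a simple probe
    in_container = os.path.exists("/.dockerenv")
    for port_spec, bindings in container.ports.items():
        if bindings:
            if in_container:
                container_port = port_spec.split("/")[0]  # "8080/tcp" -> "8080"
                return f"http://{container.name}:{container_port}{HEALTH_PATH}"
            host_port = bindings[0]["HostPort"]
            return f"http://localhost:{host_port}{HEALTH_PATH}"
    return None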


def check_health(container) -> bool:
    """
    Poll the container's /health endpoint.
    Returns True if healthy (200 OK), False for any other status or exception.
    """
    url = get_container_url(container)
    if not url:
        log.debug("No port binding for %s — skipping health check", container.name)
        return True  # don't penalise containers with no exposed port

    try:
        response = requests.get(url, timeout=REQUEST_TIMEOUT)
        if response.status_code == 200:
            return True
        log.warning(
            "%s returned HTTP %d", container.name, response.status_code
        )
        return False
    except requests.exceptions.ConnectionError:
        log.warning("%s — connection refused", container.name)
        return False
    except requests.exceptions.Timeout:
        log.warning("%s — timed out after %ds", container.name, REQUEST_TIMEOUT)
        return False
    except requests.exceptions.RequestException as exc:
        log.warning("%s — unexpected error: %s", container.name, exc)
        return False


def restart_container(container) -> None:
    """
    Restart a container that has exceeded the failure threshold.
    Resets its failure counter and sends a Slack alert.
    """
    name = container.name
    log.warning(
        "RESTARTING %s — failed %d consecutive health checks", name, FAILURE_THRESHOLD
    )
    try:
        container.restart(timeout=10)
        failure_counts[name] = 0
        log.info("Successfully restarted: %s", name)
        send_slack_alert(
            event="recovered",
            container_name=name,
            message=(
                f"⚠ Container *{name}* failed {FAILURE_THRESHOLD} "
                f"consecutive health checks and was automatically restarted."
            ),
        )
    except docker.errors.APIError as exc:
        log.error("Failed to restart %s: %s", name, exc)


def run_monitor_loop() -> None:
    """Main loop — polls all monitored containers on a fixed interval."""
    log.info(
        "Health monitor started — interval=%ds, threshold=%d failures",
        CHECK_INTERVAL,
        FAILURE_THRESHOLD,
    )
    while True:
        containers = get_monitored_containers()
        log.info("Checking %d monitored container(s)…", len(containers))

        for container in containers:
            name = container.name

            if check_health(container):
                if failure_counts.get(name, 0) > 0:
                    log.info("%s — recovered (resetting counter)", name)
                failure_counts[name] = 0
            else:
                failure_counts[name] = failure_counts.get(name, 0) + 1
                count = failure_counts[name]
                log.warning(
                    "%s — unhealthy (%d/%d)", name, count, FAILURE_THRESHOLD
                )
                if count >= FAILURE_THRESHOLD:
                    restart_container(container)

        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    run_monitor_loop()

Key design decisions worth noting:

docker.from_env() reads DOCKER_HOST from the environment. Inside Docker, mount /var/run/docker.sock so the monitor can control the host daemon.
failure_counts is in-memory — it resets if the monitor itself restarts. For production, swap this with Redis or a persistent store.
Separate exceptions for ConnectionError vs Timeout — they mean different things. A timeout often means the app is overloaded; a connection refusal means it's completely down.
container.restart(timeout=10) gives the container 10 seconds to shut down gracefully before a SIGKILL.
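To illustrate the point about persistence, here is a minimal sketch of a file-backed replacement for the in-memory failure_counts dict, using only the standard library. The class name FileBackedCounts and the state file path are assumptions made for this example; the project itself does not include them, and a shared store such as Redis would be the more likely choice once several monitors run at the same time.

```python
import json
from pathlib import Path


class FileBackedCounts:
    """Failure counters persisted to a JSON file (hypothetical helper)."""

    def __init__(self, path: str) -> None:
        self._path = Path(path)
        self._counts: dict[str, int] = {}
        if self._path.exists():
            self._counts = json.loads(self._path.read_text())

    def get(self, name: str) -> int:
        return self._counts.get(name, 0)

    def increment(self, name: str) -> int:
        self._counts[name] = self.get(name) + 1
        self._save()
        return self._counts[name]

    def reset(self, name: str) -> None:
        self._counts[name] = 0
        self._save()

    def _save(self) -> None:
        # Write to a temp file, then rename, so a crash cannot leave
        # a half-written state file behind.
        tmp = self._path.with_name(self._path.name + ".tmp")
        tmp.write_text(json.dumps(self._counts))
        tmp.replace(self._path)
```

In the monitor loop, failure_counts.get(name, 0) would map to get(name), the increment to increment(name), and the assignment to zero to reset(name). The state file would need to live on a mounted volume to survive a restart of the monitor container.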

Step 4 — Add Slack Alerts

When the monitor restarts a container, you want to know about it — especially at 3 AM when you're not watching logs.

Create a Slack Incoming Webhook

Go to api.slack.com/apps → Create New App → From scratch
Under Features, click Incoming Webhooks → toggle On
Click Add New Webhook to Workspace → pick a channel → copy the webhook URL
Set it as an environment variable: SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...

healer/slack_alert.py

import os
import time
import logging
import requests

log = logging.getLogger(__name__)

SLACK_WEBHOOK_URL = os.getenv("SLACK_WEBHOOK_URL", "")


def send_slack_alert(event: str, container_name: str, message: str) -> None:
    """
    Send a formatted Slack message via an Incoming Webhook.
    Silently no-ops if SLACK_WEBHOOK_URL is not set.
    """
    if not SLACK_WEBHOOK_URL:
        log.debug("SLACK_WEBHOOK_URL not set — skipping Slack alert")
        return

    # Colour-code by event type
    colour_map = {
        "recovered":  "#36a64f",   # green
        "degraded":   "#ff9900",   # orange
        "critical":   "#ff0000",   # red
    }
    colour = colour_map.get(event, "#cccccc")

    payload = {
        "attachments": [
            {
                "color": colour,
                "title": f"🐳 Self-Healing Infrastructure — {event.upper()}",
                "text": message,
                "footer": "health-monitor",
                "ts": int(time.time()),
            }
        ]
    }

    try:
        resp = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=5)
        resp.raise_for_status()
        log.info("Slack alert sent for container: %s", container_name)
    except requests.exceptions.RequestException as exc:
        log.error("Failed to send Slack alert: %s", exc)

If SLACK_WEBHOOK_URL isn't set, the function silently no-ops — no crashes, no noise. This makes it easy to run the project locally without Slack configured.

Step 5 — Build the Sample Flask App

The sample app simulates a real microservice. It exposes four endpoints:

Endpoint	Method	Description
`/health`	GET	Returns 200 if healthy, 503 if broken
`/break`	POST	Simulates a failure (flips health to unhealthy)
`/fix`	POST	Restores health (resets to healthy)
`/metrics`	GET	Prometheus metrics (via `prometheus_flask_exporter`)

sample-app/app.py

import os
import logging
from flask import Flask, jsonify
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)
metrics = PrometheusMetrics(app)

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

# Mutable state — in a real app this would be a proper health-check result
_healthy = True


@app.get("/health")
def health():
    """Health endpoint polled by the monitor every 15 seconds."""
    if _healthy:
        return jsonify({"status": "ok"}), 200
    return jsonify({"status": "degraded", "reason": "manually broken"}), 503


@app.post("/break")
def break_app():
    """Simulate a failure — the next health check will return 503."""
    global _healthy
    _healthy = False
    log.warning("Health set to DEGRADED (manually triggered)")
    return jsonify({"message": "App is now unhealthy"}), 200


@app.post("/fix")
def fix_app():
    """Restore health — normally you wouldn't need this; restart does it."""
    global _healthy
    _healthy = True
    log.info("Health restored")
    return jsonify({"message": "App is now healthy"}), 200


@app.get("/")
def index():
    return jsonify({"service": "sample-app", "healthy": _healthy})


if __name__ == "__main__":
    port = int(os.getenv("PORT", 8080))
    app.run(host="0.0.0.0", port=port)

sample-app/requirements.txt

flask==3.0.3
prometheus-flask-exporter==0.23.1

sample-app/Dockerfile

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

EXPOSE 8080
CMD ["python", "app.py"]

Step 6 — Docker Images for the Monitor

healer/Dockerfile

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY health_monitor.py slack_alert.py ./

# The monitor needs access to the Docker daemon.
# Mount /var/run/docker.sock at runtime (see docker-compose.yml).
CMD ["python", "health_monitor.py"]

Step 7 — Wire It Together with Docker Compose

# docker-compose.yml
version: "3.9"

services:

  sample-app:
    build: ./sample-app
    container_name: sample-app
    ports:
      - "8080:8080"
    labels:
      - "monitored=true"           # ← this is what the monitor looks for
    restart: unless-stopped
    healthcheck:
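      # python:3.11-slim ships without curl, so probe with the interpreter instead
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=3)"]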
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 5s

  health-monitor:
    build: ./healer
    container_name: health-monitor
    depends_on:
      - sample-app
    environment:
      - CHECK_INTERVAL=15
      - FAILURE_THRESHOLD=3
      - REQUEST_TIMEOUT=5
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL:-}
    volumes:
      # Give the monitor access to the Docker daemon
      - /var/run/docker.sock:/var/run/docker.sock
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.2
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus
    restart: unless-stopped

Important: Mounting /var/run/docker.sock inside the monitor container gives it full access to the Docker daemon on the host — it can start, stop, and inspect any container. In production, consider using Docker's TCP socket with TLS or scoping access with Authz plugins.

Step 8 — Configure Prometheus

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "sample-app"
    static_configs:
      - targets: ["sample-app:8080"]
    metrics_path: "/metrics"

  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

Prometheus will now scrape the Flask app's /metrics endpoint every 15 seconds. The prometheus_flask_exporter library automatically exposes request counts, latencies, and status code breakdowns — no extra instrumentation required.

Step 9 — Run It

Start the full stack

docker compose up --build

You should see all four containers start up:

✔ Container sample-app       Started
✔ Container health-monitor   Started
✔ Container prometheus        Started
✔ Container grafana           Started

Verify everything is healthy

curl http://localhost:8080/health
# → {"status": "ok"}

curl http://localhost:9090/api/v1/targets
# → JSON in which the sample-app target reports "health":"up"
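Prometheus also reports target state as JSON through its HTTP API at /api/v1/targets, which is easier to check from a script than the web UI. The sketch below is one way to read it with the standard library; the function names target_health and fetch_targets are made up for this example, and the base URL assumes the port mapping from the Compose file.

```python
import json
import urllib.request


def target_health(payload: dict, job: str) -> list[str]:
    """Return the health value of every active target belonging to job."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [t["health"] for t in targets if t.get("labels", {}).get("job") == job]


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the targets document from a running Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

With the stack running, target_health(fetch_targets(), "sample-app") should return a list containing "up" once the first scrape has succeeded.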

Trigger a failure

Open a second terminal:

curl -X POST http://localhost:8080/break
# → {"message": "App is now unhealthy"}

Switch back to the first terminal and watch the monitor logs:

2024-05-10 14:32:15  WARNING  sample-app — unhealthy (1/3)
2024-05-10 14:32:30  WARNING  sample-app — unhealthy (2/3)
2024-05-10 14:32:45  WARNING  sample-app — unhealthy (3/3)
2024-05-10 14:32:45  WARNING  RESTARTING sample-app — failed 3 consecutive health checks
2024-05-10 14:32:47  INFO     Successfully restarted: sample-app

Within 45 seconds (3 checks × 15 seconds), the monitor detects the failure and restarts the container. After restart, the Flask app comes back healthy and the counter resets.
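The 45-second figure can be generalised into a rough estimate. The sketch below is a back-of-the-envelope calculation, not something the monitor computes: it assumes a single monitored container, ignores the time the restart itself takes, and the function name detection_window is invented for this example.

```python
def detection_window(interval: int, threshold: int, timeout: int) -> tuple[int, int]:
    """
    Estimate (best, worst) seconds from failure to restart for one container.

    Best case: the app breaks just before a poll and fails fast, so only
    the sleeps between the threshold checks are paid.
    Worst case: the app breaks just after a poll and every check hangs
    until the request timeout.
    """
    best = (threshold - 1) * interval
    worst = threshold * (interval + timeout)
    return best, worst


best, worst = detection_window(interval=15, threshold=3, timeout=5)
print(f"restart expected between {best}s and {worst}s after the failure")
```

With the defaults this gives 30 to 60 seconds. For a fast failure such as the 503 returned after /break, the timeout term is effectively zero and the upper bound drops to the 45 seconds quoted above.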

If you configured a Slack webhook, you'll also see a notification in your channel:

⚠ Container sample-app failed 3 consecutive health checks and was automatically restarted.

Step 10 — Add a CI/CD Pipeline with GitHub Actions

Every push should validate that the images build and the monitor starts cleanly.

.github/workflows/ci.yml

name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install Python dependencies
        run: |
          pip install docker==7.1.0 requests==2.32.3 pytest

      - name: Run unit tests
        run: pytest tests/ -v
        # Unit tests mock the Docker SDK, so no daemon is required.
        # This step assumes a tests/ directory, which the project structure in Step 2 does not list.

      - name: Build Docker images
        run: docker compose build

      - name: Start stack and smoke test
        run: |
          docker compose up -d
          sleep 10
          curl --fail http://localhost:8080/health
          docker compose down

This pipeline:

Runs Python unit tests (using mocked Docker SDK calls)
Builds all Docker images from scratch
Spins up the full stack and smoke tests the health endpoint
Tears everything down cleanly
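The test files themselves are not shown in this guide. As a rough idea of what a mocked test could look like, the sketch below exercises the counting logic against a MagicMock that stands in for a Docker container. The function handle_result is a hypothetical, condensed stand-in for the loop body in health_monitor.py, written here so the example is self-contained; it is not the project's code.

```python
from unittest.mock import MagicMock


def handle_result(counts: dict, container, healthy: bool, threshold: int = 3) -> bool:
    """Update the failure counter; restart and return True at the threshold."""
    name = container.name
    if healthy:
        counts[name] = 0
        return False
    counts[name] = counts.get(name, 0) + 1
    if counts[name] >= threshold:
        container.restart(timeout=10)
        counts[name] = 0
        return True
    return False


def test_restarts_after_three_consecutive_failures():
    container = MagicMock()
    container.name = "sample-app"
    counts = {}
    assert handle_result(counts, container, healthy=False) is False
    assert handle_result(counts, container, healthy=False) is False
    assert handle_result(counts, container, healthy=False) is True
    container.restart.assert_called_once_with(timeout=10)


def test_success_resets_the_counter():
    container = MagicMock()
    container.name = "sample-app"
    counts = {"sample-app": 2}
    handle_result(counts, container, healthy=True)
    assert counts["sample-app"] == 0
    container.restart.assert_not_called()
```

Keeping the decision logic in a function that takes the container as an argument is what makes it testable without a running daemon.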

Step 11 — Simulate a Failure with the Script

For demos or manual testing, use the provided shell script:

scripts/simulate_failure.sh

#!/usr/bin/env bash
set -euo pipefail

APP_URL="http://localhost:8080"

echo "=== Self-Healing Demo ==="
echo ""

echo "1. Checking initial health..."
curl -s "$APP_URL/health" | python3 -m json.tool
echo ""

echo "2. Triggering failure..."
curl -s -X POST "$APP_URL/break" | python3 -m json.tool
echo ""

echo "3. Waiting for monitor to detect failure (up to 60s)..."
for i in $(seq 1 12); do
  sleep 5
  # "|| true" keeps set -e from aborting while the container is restarting
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$APP_URL/health" || true)
  echo "   Check $i: HTTP $STATUS"
  if [ "$STATUS" = "200" ]; then
    echo ""
    echo "✅ Container was automatically restarted and is healthy again!"
    exit 0
  fi
done

echo "❌ Container did not recover in time — check monitor logs"
exit 1

Run it with:

chmod +x scripts/simulate_failure.sh
./scripts/simulate_failure.sh

Security Considerations

Before you ship this to production, there are a few things worth thinking about:

Docker socket access — Mounting /var/run/docker.sock gives the monitor container root-level access to your host. Mitigate this by:

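Running the monitor as a non-root user that belongs only to the docker group (this limits damage inside the container, but docker group membership still grants effective control of the daemon)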
Using Docker Socket Proxy to whitelist only the API calls you need
In Kubernetes, use the Kubernetes API instead of Docker directly

Restart loops — If a container's startup logic itself crashes, the monitor could restart it repeatedly. Add exponential backoff:

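def get_backoff_seconds(failure_count: int) -> float:
    """Exponential backoff: 15s, 30s, 60s, 120s, 240s, capped at 300s."""
    # Clamp the exponent so counts below the threshold never shrink the delay
    exponent = max(failure_count - FAILURE_THRESHOLD, 0)
    return min(CHECK_INTERVAL * (2 ** exponent), 300)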

Alert fatigue — If you send a Slack message on every restart, a flapping service will flood your channel. Add a cooldown window — only alert if the container hasn't been restarted in the last N minutes.
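One way to implement such a cooldown is sketched below. It is an illustration under assumptions, not part of the project: the class name AlertCooldown, the 10-minute default window, and the injectable clock are all choices made for this example.

```python
import time


class AlertCooldown:
    """Allow at most one alert per container within a cooldown window."""

    def __init__(self, window_seconds: float = 600, clock=time.monotonic) -> None:
        self._window = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._last_alert: dict[str, float] = {}

    def should_alert(self, name: str) -> bool:
        now = self._clock()
        last = self._last_alert.get(name)
        if last is not None and now - last < self._window:
            return False  # still inside the cooldown window
        self._last_alert[name] = now
        return True
```

In restart_container, the call to send_slack_alert could then be wrapped in a check such as "if cooldown.should_alert(name):", so the restart still happens on every threshold breach while the channel sees at most one message per window.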

What to Build Next

Prometheus alerting rules — trigger recovery based on metric thresholds (e.g., error rate > 5%) instead of just HTTP status
Grafana dashboard — visualise container restarts, health check latency, and uptime over time
Multi-container support — the monitor already supports multiple containers via the monitored=true label — just add more services to your Compose file
Kubernetes migration — replace Docker SDK calls with the Kubernetes Python client; use Deployment rolling restarts instead of container.restart()
PagerDuty integration — for on-call rotations, swap Slack for PagerDuty's Events API
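For the PagerDuty item, a sketch of what the swap might look like with the Events API v2 is shown below, using only the standard library. The function names are hypothetical, the routing key is a placeholder for your own integration key, and the dedup_key format and "warning" severity are assumptions chosen for this example.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_pagerduty_event(routing_key: str, container_name: str, summary: str) -> dict:
    """Build an Events API v2 trigger event for a restarted container."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Same key per container, so repeated restarts group into one incident
        "dedup_key": f"self-healing-{container_name}",
        "payload": {
            "summary": summary,
            "source": container_name,
            "severity": "warning",
        },
    }


def send_pagerduty_event(event: dict) -> None:
    """POST the event; raises urllib.error.URLError on network failure."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()
```

As with the Slack helper, the caller would probably want to catch the network error and log it, so a failed alert never stops the monitor loop.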

Key Takeaways

Docker's built-in restart policy is a safety net, not a health system. The gap between "a process is running" and "a service is healthy" is where real incidents live.

A Python health monitor of roughly 130 lines, with configurable thresholds, fills that gap without Kubernetes, a service mesh, or complex tooling. You get genuine self-healing at the cost of a single extra container.

The full project is on GitHub: github.com/gajjuu/Self-Healing-Container-Infrastructure

