How to Build a Self-Healing Container Infrastructure with Docker and Python
Auto-restart failing containers, monitor with Prometheus, and get Slack alerts — no Kubernetes needed.
Modern production systems fail. A container crashes, a service hangs, a health check starts returning 503. The question isn't whether your infrastructure will break — it's whether it can recover on its own before anyone notices.
In this guide, you'll learn how to build a self-healing container infrastructure — a system that monitors its own health, detects failures, and automatically restarts unhealthy containers. We'll use Docker Compose for orchestration, Python for the health monitor logic, and Prometheus + Grafana for observability. By the end, you'll have a working system that heals itself — and alerts you on Slack when it does.
What You'll Build
A Docker Compose setup where:
A Flask sample app simulates a real service with a /health endpoint
A Python health monitor continuously polls every container's health endpoint
When a container fails 3 consecutive checks, the monitor automatically restarts it
Prometheus scrapes metrics from all containers
Slack alerts notify your team when a container is restarted
A GitHub Actions CI pipeline runs tests and builds images on every push
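Here's the high-level architecture: the health monitor polls the sample app's /health endpoint and restarts it through the Docker socket when it fails, Prometheus scrapes the app's /metrics endpoint, Grafana reads from Prometheus, and Slack receives an alert on every restart.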
Prerequisites
Docker Desktop installed (includes Docker Compose v2)
Python 3.11+ installed locally
A Slack workspace where you can create a webhook (optional but recommended)
Basic familiarity with containers and HTTP
Step 1 — Understand What Self-Healing Actually Means
Docker already restarts crashed containers if you set restart: unless-stopped in your Compose file. But that default behaviour has real limits:
| Scenario | Docker's default restart | Our Python monitor |
|---|---|---|
| Container process crashes | ✅ Restarts | ✅ Restarts |
| Container running but returning 500 errors | ❌ Does nothing | ✅ Detects & restarts |
| Container running but hanging (no response) | ❌ Does nothing | ✅ Times out & restarts |
| Custom backoff / alerting logic | ❌ Not possible | ✅ Fully configurable |
| Slack/PagerDuty alert on recovery | ❌ Not possible | ✅ Built-in |
The gap between "container is alive" and "container is healthy" is where most production incidents live. A process can be running — consuming CPU, holding ports — while returning garbage to every real request. Docker has no idea. Our monitor does.
How the healing loop works
Every 15 seconds:
    for each container labelled monitored=true:
        GET /health
        if response == 200:
            reset failure counter
        else:
            increment failure counter
            if failure counter >= 3:
                docker restart container
                send Slack alert
                reset failure counter
Three consecutive failures before restarting is intentional — transient network blips shouldn't trigger a restart. You can tune this threshold via environment variables.
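The counting rule in the loop above can be sketched in isolation. This is a minimal illustration, not the monitor itself: the FailureTracker class and its record method are hypothetical names introduced here, and no Docker or HTTP calls are involved.

```python
from collections import defaultdict


class FailureTracker:
    """Counts consecutive failed checks per container (hypothetical helper)."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.counts: dict[str, int] = defaultdict(int)

    def record(self, name: str, healthy: bool) -> bool:
        """Record one check result; return True when a restart is due."""
        if healthy:
            self.counts[name] = 0  # any success breaks the streak
            return False
        self.counts[name] += 1
        if self.counts[name] >= self.threshold:
            self.counts[name] = 0  # start counting afresh after the restart
            return True
        return False


tracker = FailureTracker(threshold=3)
# Two failures, one success (a transient blip), then three failures in a row.
results = [False, False, True, False, False, False]
decisions = [tracker.record("sample-app", ok) for ok in results]
print(decisions)  # [False, False, False, False, False, True]
```

Only the final check triggers a restart: the single success in the middle resets the streak, which is the behaviour that keeps transient blips from causing restarts.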
Step 2 — Project Structure
Self-Healing-Container-Infrastructure/
├── healer/
│   ├── health_monitor.py      ← Core monitor logic
│   ├── slack_alert.py         ← Slack webhook helper
│   ├── requirements.txt
│   └── Dockerfile
├── sample-app/
│   ├── app.py                 ← Flask app with /break and /fix
│   ├── requirements.txt
│   └── Dockerfile
├── prometheus/
│   └── prometheus.yml         ← Scrape config
├── scripts/
│   └── simulate_failure.sh    ← Demo script
├── .github/
│   └── workflows/
│       └── ci.yml             ← GitHub Actions pipeline
├── docker-compose.yml
└── README.md
Step 3 — Write the Python Health Monitor
This is the core of the project. The monitor uses the Docker Python SDK to find all containers labelled monitored=true, polls their /health endpoints, tracks consecutive failures, and restarts them after hitting the threshold.
Install dependencies
pip install docker==7.1.0 requests==2.32.3
healer/requirements.txt
docker==7.1.0
requests==2.32.3
healer/health_monitor.py
import os
import time
import logging
import requests
import docker
from slack_alert import send_slack_alert
# ── Configuration (tunable via environment variables) ─────────────────────────
CHECK_INTERVAL = int(os.getenv("CHECK_INTERVAL", 15)) # seconds between polls
FAILURE_THRESHOLD = int(os.getenv("FAILURE_THRESHOLD", 3)) # failures before restart
REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", 5)) # seconds before timeout
HEALTH_PATH = os.getenv("HEALTH_PATH", "/health")
# ── Logging setup ─────────────────────────────────────────────────────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)-8s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
log = logging.getLogger(__name__)
# ── Docker client ─────────────────────────────────────────────────────────────
# docker.from_env() reads DOCKER_HOST from the environment.
# When running inside a container, mount /var/run/docker.sock as a volume.
client = docker.from_env()
# In-memory failure counter: { container_name: int }
failure_counts: dict[str, int] = {}
def get_monitored_containers() -> list:
    """Return all running containers that have the label monitored=true."""
    return client.containers.list(
        filters={"label": "monitored=true", "status": "running"}
    )
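def get_container_url(container) -> str | None:
    """
    Build the health check URL from the container's published port bindings.
    Returns None if no port bindings are found (skip the check).

    The monitor runs in its own container, where localhost is the monitor
    itself, so the URL targets the container by name on the shared Compose
    network and uses the container-side port (e.g. "8080/tcp" -> 8080).
    """
    for port_proto, bindings in container.ports.items():
        if bindings:
            container_port = port_proto.split("/")[0]
            return f"http://{container.name}:{container_port}{HEALTH_PATH}"
    return None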
def check_health(container) -> bool:
    """
    Poll the container's /health endpoint.
    Returns True if healthy (200 OK), False for any other status or exception.
    """
    url = get_container_url(container)
    if not url:
        log.debug("No port binding for %s — skipping health check", container.name)
        return True  # don't penalise containers with no exposed port
    try:
        response = requests.get(url, timeout=REQUEST_TIMEOUT)
        if response.status_code == 200:
            return True
        log.warning(
            "%s returned HTTP %d", container.name, response.status_code
        )
        return False
    except requests.exceptions.ConnectionError:
        log.warning("%s — connection refused", container.name)
        return False
    except requests.exceptions.Timeout:
        log.warning("%s — timed out after %ds", container.name, REQUEST_TIMEOUT)
        return False
    except requests.exceptions.RequestException as exc:
        log.warning("%s — unexpected error: %s", container.name, exc)
        return False
def restart_container(container) -> None:
    """
    Restart a container that has exceeded the failure threshold.
    Resets its failure counter and sends a Slack alert.
    """
    name = container.name
    log.warning(
        "RESTARTING %s — failed %d consecutive health checks", name, FAILURE_THRESHOLD
    )
    try:
        container.restart(timeout=10)
        failure_counts[name] = 0
        log.info("Successfully restarted: %s", name)
        send_slack_alert(
            event="recovered",
            container_name=name,
            message=(
                f"⚠ Container *{name}* failed {FAILURE_THRESHOLD} "
                f"consecutive health checks and was automatically restarted."
            ),
        )
    except docker.errors.APIError as exc:
        log.error("Failed to restart %s: %s", name, exc)
def run_monitor_loop() -> None:
    """Main loop — polls all monitored containers on a fixed interval."""
    log.info(
        "Health monitor started — interval=%ds, threshold=%d failures",
        CHECK_INTERVAL,
        FAILURE_THRESHOLD,
    )
    while True:
        containers = get_monitored_containers()
        log.info("Checking %d monitored container(s)…", len(containers))
        for container in containers:
            name = container.name
            if check_health(container):
                if failure_counts.get(name, 0) > 0:
                    log.info("%s — recovered (resetting counter)", name)
                failure_counts[name] = 0
            else:
                failure_counts[name] = failure_counts.get(name, 0) + 1
                count = failure_counts[name]
                log.warning(
                    "%s — unhealthy (%d/%d)", name, count, FAILURE_THRESHOLD
                )
                if count >= FAILURE_THRESHOLD:
                    restart_container(container)
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    run_monitor_loop()
Key design decisions worth noting:
docker.from_env() reads DOCKER_HOST from the environment. Inside Docker, mount /var/run/docker.sock so the monitor can control the host daemon.
failure_counts is in-memory, so it resets if the monitor itself restarts. For production, swap this with Redis or a persistent store.
Separate exceptions for ConnectionError vs Timeout: they mean different things. A timeout often means the app is overloaded; a connection refusal means it's completely down.
container.restart(timeout=10) gives the container 10 seconds to shut down gracefully before a SIGKILL.
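As a rough illustration of the "persistent store" idea, here is a minimal sketch that keeps the counters in a JSON file rather than Redis. The FileBackedCounts class and the failure_counts.json file name are assumptions made for this example; in the Compose setup the file would also need to live on a mounted volume to survive a restart of the monitor container.

```python
import json
import tempfile
from pathlib import Path


class FileBackedCounts:
    """Failure counters persisted to a JSON file (hypothetical helper)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Load previous counts if the file exists; otherwise start empty.
        self.counts: dict[str, int] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def get(self, name: str) -> int:
        return self.counts.get(name, 0)

    def set(self, name: str, value: int) -> None:
        self.counts[name] = value
        # Write to a temp file, then rename, so a crash cannot leave half a file.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.counts))
        tmp.replace(self.path)


state_file = Path(tempfile.mkdtemp()) / "failure_counts.json"
store = FileBackedCounts(state_file)
store.set("sample-app", 2)

# A fresh instance, as after a monitor restart, sees the saved count.
restored = FileBackedCounts(state_file)
print(restored.get("sample-app"))  # 2
```

A file is the simplest option for a single monitor; a shared store such as Redis becomes worthwhile once more than one monitor instance needs to see the same counters.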
Step 4 — Add Slack Alerts
When the monitor restarts a container, you want to know about it — especially at 3 AM when you're not watching logs.
Create a Slack Incoming Webhook
Go to api.slack.com/apps → Create New App → From scratch
Under Features, click Incoming Webhooks → toggle On
Click Add New Webhook to Workspace → pick a channel → copy the webhook URL
Set it as an environment variable:
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
healer/slack_alert.py
import os
import time
import logging
import requests

log = logging.getLogger(__name__)
SLACK_WEBHOOK_URL = os.getenv("SLACK_WEBHOOK_URL", "")


def send_slack_alert(event: str, container_name: str, message: str) -> None:
    """
    Send a formatted Slack message via an Incoming Webhook.
    Silently no-ops if SLACK_WEBHOOK_URL is not set.
    """
    if not SLACK_WEBHOOK_URL:
        log.debug("SLACK_WEBHOOK_URL not set — skipping Slack alert")
        return
    # Colour-code by event type
    colour_map = {
        "recovered": "#36a64f",  # green
        "degraded": "#ff9900",   # orange
        "critical": "#ff0000",   # red
    }
    colour = colour_map.get(event, "#cccccc")
    payload = {
        "attachments": [
            {
                "color": colour,
                "title": f"🐳 Self-Healing Infrastructure — {event.upper()}",
                "text": message,
                "footer": "health-monitor",
                "ts": int(time.time()),  # Unix timestamp shown in the footer
            }
        ]
    }
    try:
        resp = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=5)
        resp.raise_for_status()
        log.info("Slack alert sent for container: %s", container_name)
    except requests.exceptions.RequestException as exc:
        log.error("Failed to send Slack alert: %s", exc)
If SLACK_WEBHOOK_URL isn't set, the function silently no-ops — no crashes, no noise. This makes it easy to run the project locally without Slack configured.
Step 5 — Build the Sample Flask App
The sample app simulates a real microservice. It exposes four endpoints:
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Returns 200 if healthy, 503 if broken |
| /break | POST | Simulates a failure (flips health to unhealthy) |
| /fix | POST | Restores health (resets to healthy) |
| /metrics | GET | Prometheus metrics (via prometheus_flask_exporter) |
sample-app/app.py
import os
import logging
from flask import Flask, jsonify
from prometheus_flask_exporter import PrometheusMetrics
app = Flask(__name__)
metrics = PrometheusMetrics(app)
logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)
# Mutable state — in a real app this would be a proper health-check result
_healthy = True
@app.get("/health")
def health():
    """Health endpoint polled by the monitor every 15 seconds."""
    if _healthy:
        return jsonify({"status": "ok"}), 200
    return jsonify({"status": "degraded", "reason": "manually broken"}), 503


@app.post("/break")
def break_app():
    """Simulate a failure — the next health check will return 503."""
    global _healthy
    _healthy = False
    log.warning("Health set to DEGRADED (manually triggered)")
    return jsonify({"message": "App is now unhealthy"}), 200


@app.post("/fix")
def fix_app():
    """Restore health — normally you wouldn't need this; restart does it."""
    global _healthy
    _healthy = True
    log.info("Health restored")
    return jsonify({"message": "App is now healthy"}), 200


@app.get("/")
def index():
    return jsonify({"service": "sample-app", "healthy": _healthy})


if __name__ == "__main__":
    port = int(os.getenv("PORT", 8080))
    app.run(host="0.0.0.0", port=port)
sample-app/requirements.txt
flask==3.0.3
prometheus-flask-exporter==0.23.1
sample-app/Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE 8080
CMD ["python", "app.py"]
Step 6 — Docker Image for the Monitor
healer/Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY health_monitor.py slack_alert.py ./
# The monitor needs access to the Docker daemon.
# Mount /var/run/docker.sock at runtime (see docker-compose.yml).
CMD ["python", "health_monitor.py"]
Step 7 — Wire It Together with Docker Compose
# docker-compose.yml
version: "3.9"
services:
  sample-app:
    build: ./sample-app
    container_name: sample-app
    ports:
      - "8080:8080"
    labels:
      - "monitored=true"   # ← this is what the monitor looks for
    restart: unless-stopped
    healthcheck:
      # python:3.11-slim ships without curl, so probe with Python instead;
      # urlopen raises on a 503, which makes the check exit non-zero
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=3)"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 5s
  health-monitor:
    build: ./healer
    container_name: health-monitor
    depends_on:
      - sample-app
    environment:
      - CHECK_INTERVAL=15
      - FAILURE_THRESHOLD=3
      - REQUEST_TIMEOUT=5
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL:-}
    volumes:
      # Give the monitor access to the Docker daemon
      - /var/run/docker.sock:/var/run/docker.sock
    restart: unless-stopped
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    restart: unless-stopped
  grafana:
    image: grafana/grafana:10.4.2
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus
    restart: unless-stopped
Important: Mounting /var/run/docker.sock inside the monitor container gives it full access to the Docker daemon on the host — it can start, stop, and inspect any container. In production, consider using Docker's TCP socket with TLS or scoping access with Authz plugins.
Step 8 — Configure Prometheus
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: "sample-app"
    static_configs:
      - targets: ["sample-app:8080"]
    metrics_path: "/metrics"
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
Prometheus will now scrape the Flask app's /metrics endpoint every 15 seconds. The prometheus_flask_exporter library automatically exposes request counts, latencies, and status code breakdowns — no extra instrumentation required.
Step 9 — Run It
Start the full stack
docker compose up --build
You should see all four containers start up:
✔ Container sample-app Started
✔ Container health-monitor Started
✔ Container prometheus Started
✔ Container grafana Started
Verify everything is healthy
curl http://localhost:8080/health
# → {"status": "ok"}
curl http://localhost:9090/api/v1/targets
# → JSON in which the sample-app target reports "health": "up"
Trigger a failure
Open a second terminal:
curl -X POST http://localhost:8080/break
# → {"message": "App is now unhealthy"}
Switch back to the first terminal and watch the monitor logs:
2024-05-10 14:32:15 WARNING sample-app — unhealthy (1/3)
2024-05-10 14:32:30 WARNING sample-app — unhealthy (2/3)
2024-05-10 14:32:45 WARNING sample-app — unhealthy (3/3)
2024-05-10 14:32:45 WARNING RESTARTING sample-app — failed 3 consecutive health checks
2024-05-10 14:32:47 INFO Successfully restarted: sample-app
Within 45 seconds (3 checks × 15 seconds), the monitor detects the failure and restarts the container. After restart, the Flask app comes back healthy and the counter resets.
If you configured a Slack webhook, you'll also see a notification in your channel:
⚠ Container sample-app failed 3 consecutive health checks and was automatically restarted.
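The 45-second figure follows from the interval and threshold, and the arithmetic can be checked with a short calculation. This is a back-of-the-envelope sketch: detection_window is a hypothetical helper, and it assumes a single monitored container, a loop that sleeps for the full interval after each round, and failed checks that take up to the request timeout when the service hangs.

```python
def detection_window(interval: int, threshold: int, timeout: int = 0) -> tuple[int, int]:
    """Return (best_case, worst_case) seconds from failure to restart.

    best case:  the failure begins just before a poll, so only
                (threshold - 1) sleeps separate the failing checks.
    worst case: the failure begins just after a poll, so a full
                interval passes before the first failing check.
    Each of the `threshold` failing checks may take up to `timeout` seconds.
    """
    check_time = threshold * timeout
    best = (threshold - 1) * interval + check_time
    worst = threshold * interval + check_time
    return best, worst


# Fast 503 responses, as produced by /break: check time is negligible.
print(detection_window(interval=15, threshold=3))             # (30, 45)
# A hanging service: every check waits for the 5-second timeout.
print(detection_window(interval=15, threshold=3, timeout=5))  # (45, 60)
```

With the defaults, a restart after /break should land between roughly 30 and 45 seconds; a service that hangs instead of answering can take up to about a minute.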
Step 10 — Add a CI/CD Pipeline with GitHub Actions
Every push should validate that the images build and the monitor starts cleanly.
.github/workflows/ci.yml
name: CI
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install Python dependencies
        run: |
          pip install docker==7.1.0 requests==2.32.3 pytest
      - name: Run unit tests
        run: pytest tests/ -v
        # Unit tests mock the Docker SDK — no daemon required
      - name: Build Docker images
        run: docker compose build
      - name: Start stack and smoke test
        run: |
          docker compose up -d
          sleep 10
          curl --fail http://localhost:8080/health
          docker compose down
This pipeline:
Runs Python unit tests (using mocked Docker SDK calls)
Builds all Docker images from scratch
Spins up the full stack and smoke tests the health endpoint
Tears everything down cleanly
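To give a sense of what "mocked Docker SDK calls" can look like, here is a minimal sketch of such a test. It does not import the real health_monitor module: handle_check is a simplified, hypothetical stand-in for one iteration of the monitor loop, and the container is a MagicMock, so no Docker daemon is needed.

```python
from unittest.mock import MagicMock


def handle_check(container, healthy: bool, counts: dict, threshold: int = 3) -> None:
    """Simplified stand-in for one iteration of the monitor loop."""
    if healthy:
        counts[container.name] = 0
        return
    counts[container.name] = counts.get(container.name, 0) + 1
    if counts[container.name] >= threshold:
        container.restart(timeout=10)
        counts[container.name] = 0


def test_restarts_after_three_failures() -> None:
    # MagicMock stands in for a docker SDK container object.
    container = MagicMock()
    container.name = "sample-app"
    counts: dict[str, int] = {}

    # Two failures are below the threshold: no restart yet.
    for _ in range(2):
        handle_check(container, healthy=False, counts=counts)
    container.restart.assert_not_called()

    # The third consecutive failure triggers exactly one restart.
    handle_check(container, healthy=False, counts=counts)
    container.restart.assert_called_once_with(timeout=10)
    assert counts["sample-app"] == 0


test_restarts_after_three_failures()
```

Under pytest the final call is unnecessary, since functions named test_* are collected automatically; it is included so the snippet also runs as a plain script.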
Step 11 — Simulate a Failure with the Script
For demos or manual testing, use the provided shell script:
scripts/simulate_failure.sh
#!/usr/bin/env bash
set -euo pipefail
APP_URL="http://localhost:8080"
echo "=== Self-Healing Demo ==="
echo ""
echo "1. Checking initial health..."
curl -s "$APP_URL/health" | python3 -m json.tool
echo ""
echo "2. Triggering failure..."
curl -s -X POST "$APP_URL/break" | python3 -m json.tool
echo ""
echo "3. Waiting for monitor to detect failure (up to 60s)..."
for i in $(seq 1 12); do
  sleep 5
  # curl exits non-zero while the container is restarting;
  # "|| true" stops set -e from aborting the script in that window
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$APP_URL/health" || true)
  echo "   Check $i: HTTP $STATUS"
  if [ "$STATUS" = "200" ]; then
    echo ""
    echo "✅ Container was automatically restarted and is healthy again!"
    exit 0
  fi
done
echo "❌ Container did not recover in time — check monitor logs"
exit 1
Run it with:
chmod +x scripts/simulate_failure.sh
./scripts/simulate_failure.sh
Security Considerations
Before you ship this to production, there are a few things worth thinking about:
Docker socket access — Mounting /var/run/docker.sock gives the monitor container root-level access to your host. Mitigate this by:
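Running the monitor process as a non-root user inside its container (note that membership of the Docker group still grants full control of the daemon, so this limits in-container privileges only)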
Using Docker Socket Proxy to whitelist only the API calls you need
In Kubernetes, use the Kubernetes API instead of Docker directly
Restart loops — If a container's startup logic itself crashes, the monitor could restart it repeatedly. Add exponential backoff:
def get_backoff_seconds(failure_count: int) -> float:
    """Exponential backoff: 15s, 30s, 60s, 120s, 240s, capped at 300s.

    Assumes failure_count keeps growing across restarts (not reset to 0).
    """
    exponent = max(0, failure_count - FAILURE_THRESHOLD)
    return min(CHECK_INTERVAL * (2 ** exponent), 300)
Alert fatigue — If you send a Slack message on every restart, a flapping service will flood your channel. Add a cooldown window — only alert if the container hasn't been restarted in the last N minutes.
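One way to sketch such a cooldown window is shown below. The AlertCooldown class and its should_alert method are hypothetical names, not part of the project; the clock is injectable so the behaviour can be demonstrated without waiting, and the 10-minute window is an arbitrary choice for the example.

```python
import time
from typing import Callable


class AlertCooldown:
    """Suppress repeat alerts for the same container (hypothetical helper)."""

    def __init__(
        self, window_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self.window_seconds = window_seconds
        self.clock = clock
        self.last_alert: dict[str, float] = {}

    def should_alert(self, name: str) -> bool:
        """Return True, and record the alert, if the window has elapsed."""
        now = self.clock()
        last = self.last_alert.get(name)
        if last is not None and now - last < self.window_seconds:
            return False  # still inside the cooldown window
        self.last_alert[name] = now
        return True


# Demo with a fake clock so the behaviour is visible without waiting.
fake_now = [0.0]
cooldown = AlertCooldown(window_seconds=600, clock=lambda: fake_now[0])

first = cooldown.should_alert("sample-app")   # first restart
fake_now[0] = 120.0
second = cooldown.should_alert("sample-app")  # 2 minutes later
fake_now[0] = 900.0
third = cooldown.should_alert("sample-app")   # 15 minutes after the first
print(first, second, third)  # True False True
```

In the monitor, the check would wrap the send_slack_alert call, so the restart still happens every time and only the notification is suppressed.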
What to Build Next
Prometheus alerting rules — trigger recovery based on metric thresholds (e.g., error rate > 5%) instead of just HTTP status
Grafana dashboard — visualise container restarts, health check latency, and uptime over time
Multi-container support: the monitor already supports multiple containers via the monitored=true label, so just add more services to your Compose file
Kubernetes migration: replace Docker SDK calls with the Kubernetes Python client; use Deployment rolling restarts instead of container.restart()
PagerDuty integration: for on-call rotations, swap Slack for PagerDuty's Events API
Key Takeaways
Docker's built-in restart policy is a safety net, not a health system. The gap between "a process is running" and "a service is healthy" is where real incidents live.
A small Python health monitor with configurable thresholds fills that gap, with no Kubernetes, service mesh, or complex tooling. You get genuine self-healing at the cost of a single extra container.
The full project is on GitHub: github.com/gajjuu/Self-Healing-Container-Infrastructure
Found this useful? Drop a ❤ on Hashnode and share it with your team. Questions or improvements? Open an issue on GitHub.

