Incident Response Bot
Detect incidents from monitoring, alert the team, run playbooks, and manage the response timeline
🧩 Ingredients
🔌 APIs
- infrastructure_monitoring_metrics_alerts_and_health_checks
- incident_alerting_team_coordination_and_status_updates
- check_recent_deployments_that_may_have_caused_the_incident
- update_public_status_page_during_incidents
📝 Step-by-Step Build Guide
Every step below follows the same execution pattern: validate that required inputs are available, execute the operation, verify the result matches the expected output format, handle errors gracefully (retry transient failures; log and alert on persistent ones), and return structured output with status and any relevant data. If required data is missing, request it from the user before proceeding.

1. Connect to your monitoring platform (Datadog, PagerDuty) and subscribe to alert webhooks
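For step 1, the webhook handler mainly needs to parse and validate incoming alert payloads. A minimal sketch follows; the field names (`alert_id`, `service`, `metric`, `value`, `threshold`) are assumptions for illustration, not the real Datadog or PagerDuty webhook schema, which you should map from the platform's documentation.

```python
import json

# Assumed payload fields; replace with your platform's actual webhook schema.
REQUIRED_FIELDS = ("alert_id", "service", "metric", "value", "threshold")

def parse_alert_webhook(raw_body: str) -> dict:
    """Parse an incoming alert webhook body and verify required fields are present."""
    payload = json.loads(raw_body)
    missing = [f for f in REQUIRED_FIELDS if f not in payload]
    if missing:
        # Per the execution pattern: fail loudly when required inputs are absent.
        raise ValueError(f"alert payload missing fields: {missing}")
    return payload
```

Rejecting malformed payloads at the edge keeps every downstream step free of defensive null checks.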
2. When an alert fires, enrich it with context: what service, what metric, current vs threshold, recent trend
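The enrichment in step 2 can be sketched as a pure function over the parsed alert plus a window of recent metric readings. The output keys and the crude first-vs-last trend heuristic are suggestions, not a prescribed format.

```python
def enrich_alert(alert: dict, recent_values: list[float]) -> dict:
    """Build step-2 context: service, metric, current vs threshold, recent trend."""
    current = alert["value"]
    threshold = alert["threshold"]
    # Crude trend heuristic: compare the current reading with the start of the window.
    trend = "rising" if recent_values and current > recent_values[0] else "flat-or-falling"
    return {
        "service": alert["service"],
        "metric": alert["metric"],
        "current": current,
        "threshold": threshold,
        "breach_pct": round((current - threshold) / threshold * 100, 1),
        "trend": trend,
    }
```

Keeping enrichment pure (no I/O) makes it trivial to unit-test and reuse in the Slack message and post-mortem steps.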
3. Check GitHub for recent deployments to the affected service (potential root cause)
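For step 3, once deployments are fetched from GitHub's `GET /repos/{owner}/{repo}/deployments` endpoint, they can be narrowed to the incident window. The `created_at` and `environment` fields exist in the real API response; matching the service name against `environment` is an assumption about your naming convention.

```python
from datetime import datetime, timedelta, timezone

def recent_deployments(deployments: list[dict], service: str, window_hours: int = 6) -> list[dict]:
    """Filter already-fetched GitHub deployments to the affected service
    within the incident window (default: last 6 hours)."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=window_hours)
    hits = []
    for d in deployments:
        # GitHub timestamps may end in "Z"; normalize for fromisoformat.
        created = datetime.fromisoformat(d["created_at"].replace("Z", "+00:00"))
        if created >= cutoff and service in d.get("environment", ""):
            hits.append(d)
    return hits
```

Filtering locally keeps the fetch simple and lets you widen the window during triage without another API call.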
4. Create a Slack incident channel: #incident-{date}-{service} with initial context message
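The `#incident-{date}-{service}` name in step 4 needs normalizing, since Slack channel names must be lowercase, at most 80 characters, and limited to letters, numbers, hyphens, and underscores. A small helper, with the exact slug rules as a sketch:

```python
import re
from datetime import date

def incident_channel_name(service: str, on: date) -> str:
    """Build an incident-{date}-{service} channel name that satisfies
    Slack's naming rules (lowercase, <= 80 chars, [a-z0-9_-] only)."""
    slug = re.sub(r"[^a-z0-9_-]+", "-", service.lower()).strip("-")
    return f"incident-{on.isoformat()}-{slug}"[:80]
```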
Post a message to Slack using the Web API.
POST https://slack.com/api/chat.postMessage
Headers: Authorization: Bearer {SLACK_BOT_TOKEN}, Content-Type: application/json
Body: {
"channel": "{channel_id}",
"text": "{fallback_text}",
"blocks": [{ "type": "section", "text": { "type": "mrkdwn", "text": "{formatted_message}" }}]
}
Use Slack mrkdwn formatting: *bold*, _italic_, `code`, > blockquote.
For alerts, use emoji prefixes: 🔴 critical, 🟡 warning, 🟢 success, ℹ️ info.
Keep messages scannable: use bullet points for lists.
Expected response: { "ok": true, "ts": "..." }. If ok is false, check the "error" field.Tag the on
5. Tag the on-call engineer and post the alert details, recent deployments, and suggested first steps
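The step-5 kickoff message can be formatted from the enriched context. Slack renders `<@USERID>` as a mention; the layout and wording here are just one suggested convention.

```python
def initial_incident_message(on_call_user_id: str, ctx: dict, deploys: list[str]) -> str:
    """Format the kickoff message: alert details, recent deployments, first steps."""
    lines = [
        f"🔴 *{ctx['service']}* - {ctx['metric']} at {ctx['current']} (threshold {ctx['threshold']})",
        f"<@{on_call_user_id}> you're on call for this incident.",
        "*Recent deployments:*",
    ]
    lines += [f"• {d}" for d in deploys] or ["• none in the window"]
    lines.append("*Suggested first steps:* check dependent services, review error logs.")
    return "\n".join(lines)
```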
6. Run diagnostic playbook: check dependent services, review error logs, verify database connectivity
Persist the data to the configured storage.
Data structure:
- Include a timestamp (ISO 8601) with every record
- Use consistent field names across entries
- Store raw values (not formatted) for future analysis
- Add a source/origin field for the audit trail
Storage operation:
- Validate the data before writing
- Check for duplicates (by timestamp + unique key)
- Append to existing records; never overwrite
- Verify the write succeeded
- Return the stored record ID/reference
7. Maintain a timeline in the channel: when detected, who's responding, actions taken, status updates
8. When resolved: generate a post-mortem template with timeline, root cause, impact, and action items
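The step-8 post-mortem template can be generated directly from the accumulated timeline. The section headings below are a common convention, not a prescribed format; adjust to your team's template.

```python
def postmortem_template(service: str, timeline: list[str]) -> str:
    """Generate a post-mortem skeleton: timeline, root cause, impact, action items."""
    body = "\n".join(timeline) if timeline else "(reconstruct from the incident channel)"
    return (
        f"# Post-mortem: {service}\n\n"
        "## Timeline\n" + body + "\n\n"
        "## Root cause\nTBD\n\n"
        "## Impact\nTBD\n\n"
        "## Action items\n- [ ] TBD\n"
    )
```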
🤖 Example Agent Prompt
Connect to your monitoring platform (Datadog, PagerDuty) and subscribe to alert webhooks.
Steps:
1. Validate all required inputs are available
2. Execute the operation described above
3. Verify the result meets the expected output format
4. Handle errors gracefully: retry transient failures; log and alert on persistent ones
5. Return structured output with status and any relevant data
If any required data is missing, request it from the user before proceeding.
Copy this prompt into your agent to get started.