Incident Response Bot
Detect incidents from monitoring, alert the team, run playbooks, and manage the response timeline
🧩 Ingredients
🔌 APIs
- infrastructure_monitoring_metrics_alerts_and_health_checks
- incident_alerting_team_coordination_and_status_updates
- check_recent_deployments_that_may_have_caused_the_incident
- update_public_status_page_during_incidents
📝 Step-by-Step Build Guide
Every step below follows the same execution pattern: validate that required inputs are available, execute the operation, verify the result matches the expected output format, handle errors gracefully (retry transient failures; log and alert on persistent ones), and return structured output with status and any relevant data. If required data is missing, request it from the user before proceeding.

1. Connect to your monitoring platform (Datadog, PagerDuty) and subscribe to alert webhooks
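For step 1, the webhook handler mainly needs to parse and validate incoming alert payloads. A minimal sketch follows; the field names (`alert_id`, `service`, `metric`, `value`, `threshold`) are assumptions for illustration, not the real Datadog or PagerDuty webhook schema, which you should map from the platform's documentation.

```python
import json

# Assumed payload fields; replace with your platform's actual webhook schema.
REQUIRED_FIELDS = ("alert_id", "service", "metric", "value", "threshold")

def parse_alert_webhook(raw_body: str) -> dict:
    """Parse an incoming alert webhook body and verify required fields are present."""
    payload = json.loads(raw_body)
    missing = [f for f in REQUIRED_FIELDS if f not in payload]
    if missing:
        # Per the execution pattern: fail loudly when required inputs are absent.
        raise ValueError(f"alert payload missing fields: {missing}")
    return payload
```

Rejecting malformed payloads at the edge keeps every downstream step free of defensive null checks.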
2. When an alert fires, enrich it with context: what service, what metric, current vs threshold, recent trend
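The enrichment in step 2 can be sketched as a pure function over the parsed alert plus a window of recent metric readings. The output keys and the crude first-vs-last trend heuristic are suggestions, not a prescribed format.

```python
def enrich_alert(alert: dict, recent_values: list[float]) -> dict:
    """Build step-2 context: service, metric, current vs threshold, recent trend."""
    current = alert["value"]
    threshold = alert["threshold"]
    # Crude trend heuristic: compare the current reading with the start of the window.
    trend = "rising" if recent_values and current > recent_values[0] else "flat-or-falling"
    return {
        "service": alert["service"],
        "metric": alert["metric"],
        "current": current,
        "threshold": threshold,
        "breach_pct": round((current - threshold) / threshold * 100, 1),
        "trend": trend,
    }
```

Keeping enrichment pure (no I/O) makes it trivial to unit-test and reuse in the Slack message and post-mortem steps.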
3. Check GitHub for recent deployments to the affected service (potential root cause)
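For step 3, once deployments are fetched from GitHub's `GET /repos/{owner}/{repo}/deployments` endpoint, they can be narrowed to the incident window. The `created_at` and `environment` fields exist in the real API response; matching the service name against `environment` is an assumption about your naming convention.

```python
from datetime import datetime, timedelta, timezone

def recent_deployments(deployments: list[dict], service: str, window_hours: int = 6) -> list[dict]:
    """Filter already-fetched GitHub deployments to the affected service
    within the incident window (default: last 6 hours)."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=window_hours)
    hits = []
    for d in deployments:
        # GitHub timestamps may end in "Z"; normalize for fromisoformat.
        created = datetime.fromisoformat(d["created_at"].replace("Z", "+00:00"))
        if created >= cutoff and service in d.get("environment", ""):
            hits.append(d)
    return hits
```

Filtering locally keeps the fetch simple and lets you widen the window during triage without another API call.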
4. Create a Slack incident channel: #incident-{date}-{service} with initial context message
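The `#incident-{date}-{service}` name in step 4 needs normalizing, since Slack channel names must be lowercase, at most 80 characters, and limited to letters, numbers, hyphens, and underscores. A small helper, with the exact slug rules as a sketch:

```python
import re
from datetime import date

def incident_channel_name(service: str, on: date) -> str:
    """Build an incident-{date}-{service} channel name that satisfies
    Slack's naming rules (lowercase, <= 80 chars, [a-z0-9_-] only)."""
    slug = re.sub(r"[^a-z0-9_-]+", "-", service.lower()).strip("-")
    return f"incident-{on.isoformat()}-{slug}"[:80]
```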
Post a message to Slack using the Web API.
POST https://slack.com/api/chat.postMessage
Headers: Authorization: Bearer {SLACK_BOT_TOKEN}, Content-Type: application/json
Body: {
"channel": "{channel_id}",
"text": "{fallback_text}",
"blocks": [{ "type": "section", "text": { "type": "mrkdwn", "text": "{formatted_message}" }}]
}
Use Slack mrkdwn formatting: *bold*, _italic_, `code`, > blockquote.
For alerts, use emoji prefixes: 🔴 critical, 🟡 warning, 🟢 success, ℹ️ info.
Keep messages scannable: use bullet points for lists.
Expected response: { "ok": true, "ts": "..." }. If ok is false, check the "error" field.Tag the on
5. Tag the on-call engineer and post the alert details, recent deployments, and suggested first steps
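The step-5 kickoff message can be formatted from the enriched context. Slack renders `<@USERID>` as a mention; the layout and wording here are just one suggested convention.

```python
def initial_incident_message(on_call_user_id: str, ctx: dict, deploys: list[str]) -> str:
    """Format the kickoff message: alert details, recent deployments, first steps."""
    lines = [
        f"🔴 *{ctx['service']}* - {ctx['metric']} at {ctx['current']} (threshold {ctx['threshold']})",
        f"<@{on_call_user_id}> you're on call for this incident.",
        "*Recent deployments:*",
    ]
    lines += [f"• {d}" for d in deploys] or ["• none in the window"]
    lines.append("*Suggested first steps:* check dependent services, review error logs.")
    return "\n".join(lines)
```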
6. Run diagnostic playbook: check dependent services, review error logs, verify database connectivity
Persist the data to the configured storage.
Data structure:
- Include a timestamp (ISO 8601) with every record
- Use consistent field names across entries
- Store raw values (not formatted) for future analysis
- Add a source/origin field for the audit trail
Storage operation:
- Validate the data before writing
- Check for duplicates (by timestamp + unique key)
- Append to existing records; never overwrite
- Verify the write succeeded
- Return the stored record ID/reference
7. Maintain a timeline in the channel: when detected, who's responding, actions taken, status updates
8. When resolved: generate a post-mortem template with timeline, root cause, impact, and action items
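The step-8 post-mortem template can be generated directly from the accumulated timeline. The section headings below are a common convention, not a prescribed format; adjust to your team's template.

```python
def postmortem_template(service: str, timeline: list[str]) -> str:
    """Generate a post-mortem skeleton: timeline, root cause, impact, action items."""
    body = "\n".join(timeline) if timeline else "(reconstruct from the incident channel)"
    return (
        f"# Post-mortem: {service}\n\n"
        "## Timeline\n" + body + "\n\n"
        "## Root cause\nTBD\n\n"
        "## Impact\nTBD\n\n"
        "## Action items\n- [ ] TBD\n"
    )
```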
🤖 Example Agent Prompt
Connect to your monitoring platform (Datadog, PagerDuty) and subscribe to alert webhooks.
Steps:
1. Validate all required inputs are available
2. Execute the operation described above
3. Verify the result meets the expected output format
4. Handle errors gracefully: retry transient failures; log and alert on persistent ones
5. Return structured output with status and any relevant data
If any required data is missing, request it from the user before proceeding.
Copy this prompt into your agent to get started.