
Setting Up Team Alerts

Configure smart alerting for your team with escalation rules and notification channels.

By OpsKitty Team
Tags: alerts, team, notifications

Effective alerting is crucial for maintaining service reliability. Learn how to configure smart alerting that keeps your team informed without causing alert fatigue.

Notification Channels

OpsKitty supports multiple notification channels to ensure your team gets alerted through their preferred communication platform.

Slack Integration

Connect your Slack workspace to receive real-time alerts:

  1. Install OpsKitty App to your Slack workspace
  2. Choose channels for different alert types
  3. Configure alert formatting and mentions
  4. Test the integration before deploying

Features:

  • Rich message formatting with status indicators
  • Thread replies for alert updates
  • Direct mentions for critical alerts
  • Incident timeline in threads
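As a rough sketch of the message formatting above, here is how an alert payload for Slack's `chat.postMessage` API might be built. The emoji-to-status mapping and the mention rule are illustrative choices, not OpsKitty's actual formatting:

```python
def build_slack_alert(service, status, severity, mention=None):
    """Build a Slack message payload with a status indicator.

    The emoji mapping and the "mention only on critical" rule are
    illustrative assumptions, not OpsKitty's real behavior.
    """
    indicator = {
        "down": ":red_circle:",
        "degraded": ":large_yellow_circle:",
        "up": ":large_green_circle:",
    }.get(status, ":white_circle:")
    text = f"{indicator} *{service}* is {status} (severity: {severity})"
    if severity == "critical" and mention:
        # Direct mentions for critical alerts, per the feature list above
        text = f"<@{mention}> {text}"
    return {"text": text, "unfurl_links": False}
```

Posting the returned dict as JSON to Slack's API (or an incoming webhook URL) is left out here; the point is keeping formatting logic in one place so every channel renders consistently.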

Discord Webhooks

Set up Discord notifications for your DevOps channels:

  • Create webhook URL in Discord channel settings
  • Add webhook to OpsKitty alert configuration
  • Customize message content and embeds
  • Test webhook before going live

Best for:

  • Gaming companies
  • Community-driven projects
  • Teams already using Discord

Email Notifications

Traditional but reliable email alerting:

  • Configure multiple email recipients
  • HTML formatted messages with alert details
  • Attach monitoring graphs and logs
  • Group similar alerts in digest format

Use cases:

  • Non-technical stakeholders
  • Audit trail requirements
  • Backup notification method

PagerDuty Escalation

Integrate with PagerDuty for on-call management:

  • Automatic incident creation
  • Integration with on-call schedules
  • Escalation policy support
  • Incident acknowledgment sync

Webhook Integration

Custom webhooks for any system:

{
  "event": "alert.triggered",
  "service": "api.example.com",
  "status": "down",
  "timestamp": "2024-01-15T10:30:00Z",
  "details": {
    "response_time": null,
    "status_code": null,
    "error": "Connection timeout"
  }
}
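On the receiving end, a service consuming this webhook might parse the body into a flat summary like this. The field names follow the example payload above; the handling of missing keys is an illustrative choice:

```python
import json

def parse_alert_webhook(body):
    """Parse an OpsKitty-style webhook body into a flat summary dict.

    Field names follow the example payload in the docs; defaulting
    missing keys to None is an assumption for illustration.
    """
    event = json.loads(body)
    details = event.get("details", {})
    return {
        "event": event["event"],
        "service": event["service"],
        "is_down": event.get("status") == "down",
        "error": details.get("error"),
    }
```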

Alert Rules

Configure when and how alerts are triggered to balance notification frequency with incident response.

Severity Levels

Define different severity levels for various incident types:

Critical

  • Service completely down
  • Data loss occurring
  • Security breach detected
  • Action: Immediate notification to all channels

Warning

  • Elevated response times
  • Partial service degradation
  • Resource usage high
  • Action: Notification during business hours

Info

  • Service recovered
  • Maintenance completed
  • Configuration changed
  • Action: Log only, optional notification
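The three severity levels above boil down to a small routing decision. A minimal sketch, assuming a 09:00-17:00 business-hours window and hypothetical channel names:

```python
from datetime import datetime

def route_alert(severity, now):
    """Decide notification targets by severity.

    The channel names and the 09:00-17:00 business-hours window are
    assumptions for illustration, not OpsKitty defaults.
    """
    if severity == "critical":
        # Immediate notification to all channels
        return ["slack", "sms", "email", "pagerduty"]
    if severity == "warning":
        # Warnings only notify during business hours
        return ["slack"] if 9 <= now.hour < 17 else []
    # Info: log only, no notification
    return []
```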

Alert Conditions

Set specific conditions that trigger alerts:

# Example alert configuration
alert:
  name: "API Response Time"
  condition:
    metric: response_time
    operator: greater_than
    threshold: 1000ms
    duration: 5m
  actions:
    - notify: slack
    - create_incident: true

Threshold Configuration

Fine-tune when alerts fire:

  • Single threshold: Alert when metric crosses threshold
  • Multiple thresholds: Warning at 500ms, critical at 1000ms
  • Rate of change: Alert on sudden changes
  • Consecutive failures: Require multiple failed checks
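Combining the "multiple thresholds" and "consecutive failures" ideas, a classifier over recent response-time samples might look like this. The 500 ms / 1000 ms thresholds and the three-sample requirement mirror the examples above but are otherwise arbitrary:

```python
def evaluate(samples, warn_ms=500, crit_ms=1000, consecutive=3):
    """Classify a series of response-time samples (in ms).

    Requires the threshold to be breached for the last `consecutive`
    samples before alerting, per the "consecutive failures" rule.
    Returns "critical", "warning", or "ok".
    """
    recent = samples[-consecutive:]
    if len(recent) < consecutive:
        return "ok"  # not enough data to alert yet
    if all(s > crit_ms for s in recent):
        return "critical"
    if all(s > warn_ms for s in recent):
        return "warning"
    return "ok"
```

Requiring several consecutive breaches is what keeps a single slow request from paging anyone.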

Escalation Policies

Set up escalation rules for critical incidents.

Time-Based Escalation

Automatically escalate unacknowledged alerts:

0min  → Notify primary on-call via Slack
5min  → If not acknowledged, send SMS
15min → Escalate to secondary on-call
30min → Escalate to engineering manager
60min → Alert executive team
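The timeline above is just a lookup from "minutes unacknowledged" to "who gets paged next". A minimal sketch (the step labels mirror the timeline; nothing here is OpsKitty's actual implementation):

```python
# (minutes unacknowledged, target) pairs, mirroring the timeline above
ESCALATION_STEPS = [
    (0, "primary on-call via Slack"),
    (5, "primary on-call via SMS"),
    (15, "secondary on-call"),
    (30, "engineering manager"),
    (60, "executive team"),
]

def escalation_target(minutes_unacknowledged):
    """Return who should be notified after an alert has gone
    unacknowledged for the given number of minutes."""
    target = None
    for threshold, who in ESCALATION_STEPS:
        if minutes_unacknowledged >= threshold:
            target = who  # keep walking to the latest step reached
    return target
```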

Follow-the-Sun Coverage

Distribute on-call across time zones:

  • APAC Team: 00:00 - 08:00 UTC
  • EMEA Team: 08:00 - 16:00 UTC
  • Americas Team: 16:00 - 00:00 UTC
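Mapping a UTC hour to the rotation above is a one-liner worth encoding once rather than re-deriving during an incident:

```python
def oncall_team(utc_hour):
    """Map a UTC hour (0-23) to the follow-the-sun rotation above."""
    if 0 <= utc_hour < 8:
        return "APAC"
    if 8 <= utc_hour < 16:
        return "EMEA"
    return "Americas"  # 16:00 - 00:00 UTC
```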

Backup Escalation

Always have a backup plan:

  • Primary on-call unavailable → Secondary on-call
  • Entire team unavailable → Escalate to management
  • System-wide outage → All-hands notification
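The fallback chain above can be written as a short resolver so the logic lives in code instead of tribal knowledge. The target names are illustrative:

```python
def resolve_notify_target(primary_available, secondary_available,
                          system_wide_outage=False):
    """Walk the backup-escalation chain from the list above.

    Target labels are illustrative, not OpsKitty's actual identifiers.
    """
    if system_wide_outage:
        return "all-hands"
    if primary_available:
        return "primary on-call"
    if secondary_available:
        return "secondary on-call"
    return "management"  # entire team unavailable
```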

Alert Fatigue Prevention

Avoid overwhelming your team with too many alerts.

Alert Grouping

Combine related alerts:

  • Multiple endpoints on same service → Single “Service Down” alert
  • Database connection issues → Group all DB-related alerts
  • Time-window grouping → Batch alerts within 5-minute windows
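Time-window grouping amounts to bucketing alerts by (service, window). A minimal sketch, assuming alerts arrive as `(timestamp_seconds, service)` tuples and a 5-minute (300 s) window:

```python
def group_alerts(alerts, window_s=300):
    """Batch alerts into fixed time windows keyed by service.

    `alerts` is a list of (timestamp_seconds, service) tuples; the
    (service, window) grouping key is an illustrative choice.
    """
    groups = {}
    for ts, service in sorted(alerts):
        key = (service, ts // window_s)  # which 5-minute bucket
        groups.setdefault(key, []).append(ts)
    return groups
```

Each group then produces one notification (e.g. a single "Service Down" alert) instead of one per failing check.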

Maintenance Windows

Silence alerts during planned maintenance:

maintenance:
  start: "2024-01-15T02:00:00Z"
  end: "2024-01-15T06:00:00Z"
  services:
    - api-server
    - database
  reason: "Database migration"

Smart Throttling

Prevent duplicate alerts:

  • Suppress repeat alerts for same issue
  • Exponential backoff for recurring issues
  • Auto-acknowledge after resolution
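The exponential-backoff idea can be sketched as a delay schedule: each repeat of the same unresolved issue doubles the suppression window, up to a cap. The 5-minute base and 60-minute cap are illustrative numbers:

```python
def next_notify_delay(repeat_count, base_minutes=5, cap_minutes=60):
    """Minutes to wait before re-notifying about the same unresolved
    issue: doubles with each repeat, capped. Numbers are illustrative.
    """
    return min(base_minutes * (2 ** repeat_count), cap_minutes)
```

So a flapping check notifies after 5, 10, 20, 40, then 60 minutes, rather than every failure.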

Alert Quality

Improve alert actionability:

  • Context: Include relevant metrics and logs
  • Runbooks: Link to troubleshooting guides
  • History: Show past similar incidents
  • Impact: Explain business impact

Alert Response Workflow

Define clear processes for handling alerts.

1. Acknowledge

Team member acknowledges alert:

  • Stops escalation
  • Assigns owner
  • Starts timer for resolution

2. Investigate

Gather information:

  • Check monitoring dashboards
  • Review recent changes
  • Analyze logs
  • Test affected endpoints

3. Mitigate

Take immediate action:

  • Rollback recent deployments
  • Scale resources
  • Enable fallback systems
  • Communicate with stakeholders

4. Resolve

Fix root cause:

  • Deploy fix
  • Verify resolution
  • Monitor for recurrence
  • Update documentation

5. Post-Mortem

Learn from incident:

  • Document timeline
  • Identify root cause
  • List action items
  • Update alerting rules

Testing Your Alerts

Always test before deploying to production.

Test Checklist

  • Alerts trigger correctly for failures
  • All notification channels receive alerts
  • Alert messages are clear and actionable
  • Escalation timing works as expected
  • Alerts auto-resolve when service recovers
  • Maintenance windows suppress alerts
  • Team members know how to respond

Simulate Incidents

Create test scenarios:

# Simulate service outage
curl -X POST https://api.opskitty.com/simulate/outage \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"service": "test-api", "duration": "5m"}'

Best Practices

  1. Start Conservative: Begin with fewer, more critical alerts
  2. Iterate: Adjust thresholds based on actual incidents
  3. Document: Keep runbooks updated
  4. Review: Weekly alert review meetings
  5. Measure: Track MTTA (Mean Time To Acknowledge) and MTTR (Mean Time To Resolve)
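MTTA and MTTR are simple averages over incident timestamps. A minimal sketch, assuming each incident record is a `(triggered, acknowledged, resolved)` tuple of Unix timestamps (the record shape is an assumption):

```python
def mtta_mttr(incidents):
    """Compute MTTA and MTTR in minutes.

    Each incident is (triggered, acknowledged, resolved) as Unix
    timestamps in seconds; this record shape is illustrative.
    """
    n = len(incidents)
    mtta = sum(ack - trig for trig, ack, _ in incidents) / n / 60
    mttr = sum(res - trig for trig, _, res in incidents) / n / 60
    return round(mtta, 1), round(mttr, 1)
```

Tracking these two numbers week over week is the simplest way to tell whether threshold and escalation changes are actually helping.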

Common Mistakes

Too Many Alerts

  • Problem: Team ignores alerts
  • Solution: Increase thresholds, group alerts

Too Few Alerts

  • Problem: Incidents go unnoticed
  • Solution: Lower thresholds, add more checks

Unclear Alerts

  • Problem: Team doesn’t know what to do
  • Solution: Add context, link to runbooks

No Escalation

  • Problem: Alerts get ignored during off-hours
  • Solution: Implement escalation policies

Next Steps

Ready to set up smart alerting? Get started with OpsKitty!