
Setting Up Team Alerts

Configure smart alerting for your team with escalation rules and notification channels.

By OpsKitty Team
Tags: alerts, team, notifications

Effective alerting is crucial for maintaining service reliability. Learn how to configure smart alerting that keeps your team informed without causing alert fatigue.

Notification Channels

OpsKitty supports multiple notification channels to ensure your team gets alerted through their preferred communication platform.

Slack Integration

Connect your Slack workspace to receive real-time alerts:

  1. Install OpsKitty App to your Slack workspace
  2. Choose channels for different alert types
  3. Configure alert formatting and mentions
  4. Test the integration before deploying

Features:

  • Rich message formatting with status indicators
  • Thread replies for alert updates
  • Direct mentions for critical alerts
  • Incident timeline in threads
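As a rough sketch of the message formatting above, here is how an alert payload for Slack's `chat.postMessage` API might be built. The emoji-to-status mapping and the mention rule are illustrative choices, not OpsKitty's actual formatting:

```python
def build_slack_alert(service, status, severity, mention=None):
    """Build a Slack message payload with a status indicator.

    The emoji mapping and the "mention only on critical" rule are
    illustrative assumptions, not OpsKitty's real behavior.
    """
    indicator = {
        "down": ":red_circle:",
        "degraded": ":large_yellow_circle:",
        "up": ":large_green_circle:",
    }.get(status, ":white_circle:")
    text = f"{indicator} *{service}* is {status} (severity: {severity})"
    if severity == "critical" and mention:
        # Direct mentions for critical alerts, per the feature list above
        text = f"<@{mention}> {text}"
    return {"text": text, "unfurl_links": False}
```

Posting the returned dict as JSON to Slack's API (or an incoming webhook URL) is left out here; the point is keeping formatting logic in one place so every channel renders consistently.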

Discord Webhooks

Set up Discord notifications for your DevOps channels:

  • Create webhook URL in Discord channel settings
  • Add webhook to OpsKitty alert configuration
  • Customize message content and embeds
  • Test webhook before going live

Best for:

  • Gaming companies
  • Community-driven projects
  • Teams already using Discord

Email Notifications

Traditional but reliable email alerting:

  • Configure multiple email recipients
  • HTML formatted messages with alert details
  • Attach monitoring graphs and logs
  • Group similar alerts in digest format

Use cases:

  • Non-technical stakeholders
  • Audit trail requirements
  • Backup notification method

PagerDuty Escalation

Integrate with PagerDuty for on-call management:

  • Automatic incident creation
  • Integration with on-call schedules
  • Escalation policy support
  • Incident acknowledgment sync

Webhook Integration

Custom webhooks for any system:

{
  "event": "alert.triggered",
  "service": "api.example.com",
  "status": "down",
  "timestamp": "2024-01-15T10:30:00Z",
  "details": {
    "response_time": null,
    "status_code": null,
    "error": "Connection timeout"
  }
}
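On the receiving end, a service consuming this webhook might parse the body into a flat summary like this. The field names follow the example payload above; the handling of missing keys is an illustrative choice:

```python
import json

def parse_alert_webhook(body):
    """Parse an OpsKitty-style webhook body into a flat summary dict.

    Field names follow the example payload in the docs; defaulting
    missing keys to None is an assumption for illustration.
    """
    event = json.loads(body)
    details = event.get("details", {})
    return {
        "event": event["event"],
        "service": event["service"],
        "is_down": event.get("status") == "down",
        "error": details.get("error"),
    }
```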

Alert Rules

Configure when and how alerts are triggered to balance notification frequency with incident response.

Severity Levels

Define different severity levels for various incident types:

Critical

  • Service completely down
  • Data loss occurring
  • Security breach detected
  • Action: Immediate notification to all channels

Warning

  • Elevated response times
  • Partial service degradation
  • Resource usage high
  • Action: Notification during business hours

Info

  • Service recovered
  • Maintenance completed
  • Configuration changed
  • Action: Log only, optional notification
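The three severity levels above boil down to a small routing decision. A minimal sketch, assuming a 09:00-17:00 business-hours window and hypothetical channel names:

```python
from datetime import datetime

def route_alert(severity, now):
    """Decide notification targets by severity.

    The channel names and the 09:00-17:00 business-hours window are
    assumptions for illustration, not OpsKitty defaults.
    """
    if severity == "critical":
        # Immediate notification to all channels
        return ["slack", "sms", "email", "pagerduty"]
    if severity == "warning":
        # Warnings only notify during business hours
        return ["slack"] if 9 <= now.hour < 17 else []
    # Info: log only, no notification
    return []
```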

Alert Conditions

Set specific conditions that trigger alerts:

# Example alert configuration
alert:
  name: "API Response Time"
  condition:
    metric: response_time
    operator: greater_than
    threshold: 1000ms
    duration: 5m
  actions:
    - notify: slack
    - create_incident: true

Threshold Configuration

Fine-tune when alerts fire:

  • Single threshold: Alert when metric crosses threshold
  • Multiple thresholds: Warning at 500ms, critical at 1000ms
  • Rate of change: Alert on sudden changes
  • Consecutive failures: Require multiple failed checks
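Combining the "multiple thresholds" and "consecutive failures" ideas, a classifier over recent response-time samples might look like this. The 500 ms / 1000 ms thresholds and the three-sample requirement mirror the examples above but are otherwise arbitrary:

```python
def evaluate(samples, warn_ms=500, crit_ms=1000, consecutive=3):
    """Classify a series of response-time samples (in ms).

    Requires the threshold to be breached for the last `consecutive`
    samples before alerting, per the "consecutive failures" rule.
    Returns "critical", "warning", or "ok".
    """
    recent = samples[-consecutive:]
    if len(recent) < consecutive:
        return "ok"  # not enough data to alert yet
    if all(s > crit_ms for s in recent):
        return "critical"
    if all(s > warn_ms for s in recent):
        return "warning"
    return "ok"
```

Requiring several consecutive breaches is what keeps a single slow request from paging anyone.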

Escalation Policies

Set up escalation rules for critical incidents.

Time-Based Escalation

Automatically escalate unacknowledged alerts:

0min  → Notify primary on-call via Slack
5min  → If not acknowledged, send SMS
15min → Escalate to secondary on-call
30min → Escalate to engineering manager
60min → Alert executive team
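The timeline above is just a lookup from "minutes unacknowledged" to "who gets paged next". A minimal sketch (the step labels mirror the timeline; nothing here is OpsKitty's actual implementation):

```python
# (minutes unacknowledged, target) pairs, mirroring the timeline above
ESCALATION_STEPS = [
    (0, "primary on-call via Slack"),
    (5, "primary on-call via SMS"),
    (15, "secondary on-call"),
    (30, "engineering manager"),
    (60, "executive team"),
]

def escalation_target(minutes_unacknowledged):
    """Return who should be notified after an alert has gone
    unacknowledged for the given number of minutes."""
    target = None
    for threshold, who in ESCALATION_STEPS:
        if minutes_unacknowledged >= threshold:
            target = who  # keep walking to the latest step reached
    return target
```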

Follow-the-Sun Coverage

Distribute on-call across time zones:

  • APAC Team: 00:00 - 08:00 UTC
  • EMEA Team: 08:00 - 16:00 UTC
  • Americas Team: 16:00 - 00:00 UTC
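Mapping a UTC hour to the rotation above is a one-liner worth encoding once rather than re-deriving during an incident:

```python
def oncall_team(utc_hour):
    """Map a UTC hour (0-23) to the follow-the-sun rotation above."""
    if 0 <= utc_hour < 8:
        return "APAC"
    if 8 <= utc_hour < 16:
        return "EMEA"
    return "Americas"  # 16:00 - 00:00 UTC
```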

Backup Escalation

Always have a backup plan:

  • Primary on-call unavailable → Secondary on-call
  • Entire team unavailable → Escalate to management
  • System-wide outage → All-hands notification
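The fallback chain above can be written as a short resolver so the logic lives in code instead of tribal knowledge. The target names are illustrative:

```python
def resolve_notify_target(primary_available, secondary_available,
                          system_wide_outage=False):
    """Walk the backup-escalation chain from the list above.

    Target labels are illustrative, not OpsKitty's actual identifiers.
    """
    if system_wide_outage:
        return "all-hands"
    if primary_available:
        return "primary on-call"
    if secondary_available:
        return "secondary on-call"
    return "management"  # entire team unavailable
```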

Alert Fatigue Prevention

Avoid overwhelming your team with too many alerts.

Alert Grouping

Combine related alerts:

  • Multiple endpoints on same service → Single “Service Down” alert
  • Database connection issues → Group all DB-related alerts
  • Time-window grouping → Batch alerts within 5-minute windows
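Time-window grouping amounts to bucketing alerts by (service, window). A minimal sketch, assuming alerts arrive as `(timestamp_seconds, service)` tuples and a 5-minute (300 s) window:

```python
def group_alerts(alerts, window_s=300):
    """Batch alerts into fixed time windows keyed by service.

    `alerts` is a list of (timestamp_seconds, service) tuples; the
    (service, window) grouping key is an illustrative choice.
    """
    groups = {}
    for ts, service in sorted(alerts):
        key = (service, ts // window_s)  # which 5-minute bucket
        groups.setdefault(key, []).append(ts)
    return groups
```

Each group then produces one notification (e.g. a single "Service Down" alert) instead of one per failing check.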

Maintenance Windows

Silence alerts during planned maintenance:

maintenance:
  start: "2024-01-15T02:00:00Z"
  end: "2024-01-15T06:00:00Z"
  services:
    - api-server
    - database
  reason: "Database migration"

Smart Throttling

Prevent duplicate alerts:

  • Suppress repeat alerts for same issue
  • Exponential backoff for recurring issues
  • Auto-acknowledge after resolution
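The exponential-backoff idea can be sketched as a delay schedule: each repeat of the same unresolved issue doubles the suppression window, up to a cap. The 5-minute base and 60-minute cap are illustrative numbers:

```python
def next_notify_delay(repeat_count, base_minutes=5, cap_minutes=60):
    """Minutes to wait before re-notifying about the same unresolved
    issue: doubles with each repeat, capped. Numbers are illustrative.
    """
    return min(base_minutes * (2 ** repeat_count), cap_minutes)
```

So a flapping check notifies after 5, 10, 20, 40, then 60 minutes, rather than every failure.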

Alert Quality

Improve alert actionability:

  • Context: Include relevant metrics and logs
  • Runbooks: Link to troubleshooting guides
  • History: Show past similar incidents
  • Impact: Explain business impact

Alert Response Workflow

Define clear processes for handling alerts.

1. Acknowledge

Team member acknowledges alert:

  • Stops escalation
  • Assigns owner
  • Starts timer for resolution

2. Investigate

Gather information:

  • Check monitoring dashboards
  • Review recent changes
  • Analyze logs
  • Test affected endpoints

3. Mitigate

Take immediate action:

  • Rollback recent deployments
  • Scale resources
  • Enable fallback systems
  • Communicate with stakeholders

4. Resolve

Fix root cause:

  • Deploy fix
  • Verify resolution
  • Monitor for recurrence
  • Update documentation

5. Post-Mortem

Learn from incident:

  • Document timeline
  • Identify root cause
  • List action items
  • Update alerting rules

Testing Your Alerts

Always test before deploying to production.

Test Checklist

  • Alerts trigger correctly for failures
  • All notification channels receive alerts
  • Alert messages are clear and actionable
  • Escalation timing works as expected
  • Alerts auto-resolve when service recovers
  • Maintenance windows suppress alerts
  • Team members know how to respond

Simulate Incidents

Create test scenarios:

# Simulate service outage
curl -X POST https://api.opskitty.com/simulate/outage \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"service": "test-api", "duration": "5m"}'

Best Practices

  1. Start Conservative: Begin with fewer, more critical alerts
  2. Iterate: Adjust thresholds based on actual incidents
  3. Document: Keep runbooks updated
  4. Review: Weekly alert review meetings
  5. Measure: Track MTTA (Mean Time To Acknowledge) and MTTR (Mean Time To Resolve)
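MTTA and MTTR are simple averages over incident timestamps. A minimal sketch, assuming each incident record is a `(triggered, acknowledged, resolved)` tuple of Unix timestamps (the record shape is an assumption):

```python
def mtta_mttr(incidents):
    """Compute MTTA and MTTR in minutes.

    Each incident is (triggered, acknowledged, resolved) as Unix
    timestamps in seconds; this record shape is illustrative.
    """
    n = len(incidents)
    mtta = sum(ack - trig for trig, ack, _ in incidents) / n / 60
    mttr = sum(res - trig for trig, _, res in incidents) / n / 60
    return round(mtta, 1), round(mttr, 1)
```

Tracking these two numbers week over week is the simplest way to tell whether threshold and escalation changes are actually helping.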

Common Mistakes

Too Many Alerts

  • Problem: Team ignores alerts
  • Solution: Increase thresholds, group alerts

Too Few Alerts

  • Problem: Incidents go unnoticed
  • Solution: Lower thresholds, add more checks

Unclear Alerts

  • Problem: Team doesn’t know what to do
  • Solution: Add context, link to runbooks

No Escalation

  • Problem: Alerts get ignored during off-hours
  • Solution: Implement escalation policies

Next Steps

Ready to set up smart alerting? Get started with OpsKitty!