Queue Reliability Features - Operator Guide

Overview

This guide covers the Dead-Letter Queue (DLQ), Auto-Purge, and Health Check features added to VMPooler for improved queue reliability and observability.

Features

1. Dead-Letter Queue (DLQ)

The DLQ captures failed VM creation attempts and queue transitions, providing visibility into failures without losing data.

What gets captured:

  • VMs that fail during clone operations
  • VMs that timeout in pending queue
  • VMs that become unreachable in ready queue
  • Any permanent errors (template not found, permission denied, etc.)

Benefits:

  • Failed VMs are not lost - they're moved to DLQ for analysis
  • Complete failure context (error message, timestamp, retry count, request ID)
  • TTL-based expiration prevents unbounded growth
  • Size limiting prevents memory issues

Configuration:

:config:
  dlq_enabled: true
  dlq_ttl: 168  # hours (7 days)
  dlq_max_entries: 10000  # per DLQ queue

Querying DLQ via Redis CLI:

# View all pending DLQ entries
redis-cli ZRANGE vmpooler__dlq__pending 0 -1

# View DLQ entries with scores (timestamps)
redis-cli ZRANGE vmpooler__dlq__pending 0 -1 WITHSCORES

# Get DLQ size
redis-cli ZCARD vmpooler__dlq__pending

# View recent failures (last 10)
redis-cli ZREVRANGE vmpooler__dlq__clone 0 9

# View entries older than 1 hour (timestamp in seconds)
redis-cli ZRANGEBYSCORE vmpooler__dlq__pending -inf $(date -d '1 hour ago' +%s)

DLQ Keys:

  • vmpooler__dlq__pending - Failed pending VMs
  • vmpooler__dlq__clone - Failed clone operations
  • vmpooler__dlq__ready - Failed ready queue VMs
  • vmpooler__dlq__tasks - Failed tasks

Entry Format: Each DLQ entry contains:

{
  "vm": "pooler-happy-elephant",
  "pool": "centos-7-x86_64",
  "queue_from": "pending",
  "error_class": "StandardError",
  "error_message": "template centos-7-template does not exist",
  "failed_at": "2024-01-15T10:30:00Z",
  "retry_count": 3,
  "request_id": "req-abc123",
  "pool_alias": "centos-7"
}
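
Because each entry is a JSON string stored in a sorted set, jq works well for ad-hoc analysis. A few illustrative queries, assuming the field names shown in the example entry above:

# Pretty-print the most recent clone failure
redis-cli ZREVRANGE vmpooler__dlq__clone 0 0 | jq .

# Group pending-queue failures by error class
redis-cli ZRANGE vmpooler__dlq__pending 0 -1 | jq -r '.error_class' | sort | uniq -c | sort -rn

# List clone failures for a single pool
redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | jq -c 'select(.pool == "centos-7-x86_64")'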

2. Auto-Purge

Automatically removes stale entries from queues to prevent resource leaks and maintain queue health.

What gets purged:

  • Pending VMs: Stuck in pending queue longer than max_pending_age
  • Ready VMs: Idle in ready queue longer than max_ready_age
  • Completed VMs: In completed queue longer than max_completed_age
  • Orphaned Metadata: VM metadata without a corresponding queue entry

Benefits:

  • Prevents queue bloat from stuck/forgotten VMs
  • Automatically cleans up after process crashes or bugs
  • Configurable thresholds per environment
  • Dry-run mode for safe testing

Configuration:

:config:
  purge_enabled: true
  purge_interval: 3600  # seconds (1 hour) - how often to run
  purge_dry_run: false  # set to true to log but not purge
  
  # Age thresholds (in seconds)
  max_pending_age: 7200   # 2 hours
  max_ready_age: 86400    # 24 hours
  max_completed_age: 3600 # 1 hour
  max_orphaned_age: 86400 # 24 hours
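
To preview which pending VMs would be considered stale before enabling purge, the age check can be approximated by hand. A minimal sketch, assuming the vmpooler__vm__<hostname> hash stores the clone timestamp in its clone field (as in the queries later in this guide) and that the stored value parses with date -d:

# List pending VMs in one pool older than max_pending_age (7200s here)
now=$(date +%s)
for vm in $(redis-cli SMEMBERS vmpooler__pending__centos-7-x86_64); do
  cloned=$(redis-cli HGET vmpooler__vm__${vm} clone)
  age=$(( now - $(date -d "$cloned" +%s) ))
  [ "$age" -gt 7200 ] && echo "$vm age=${age}s"
done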

Testing Purge (Dry-Run Mode):

:config:
  purge_enabled: true
  purge_dry_run: true  # Logs what would be purged without actually purging
  max_pending_age: 600  # Use shorter thresholds for testing

Watch logs for:

[*] [purge][dry-run] Would purge stale pending VM 'pooler-happy-elephant' (age: 3650s, max: 600s)

Monitoring Purge: Check logs for purge cycles:

[*] [purge] Starting stale queue entry purge cycle
[!] [purge] Purged stale pending VM 'pooler-sad-dog' from 'centos-7-x86_64' (age: 7250s)
[!] [purge] Moved stale ready VM 'pooler-angry-cat' from 'ubuntu-2004-x86_64' to completed (age: 90000s)
[*] [purge] Completed purge cycle in 2.34s: 12 entries purged
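
Since these log lines use a fixed prefix, routine monitoring can be done with grep against the same vmpooler.log used elsewhere in this guide (adjust the patterns if your log prefix differs):

# Count live purges and dry-run detections in the current log
grep -c '\[purge\] Purged stale' vmpooler.log
grep -c '\[purge\]\[dry-run\] Would purge' vmpooler.log

# Which pools are purged most often?
grep '\[purge\] Purged stale' vmpooler.log | grep -o "from '[^']*'" | sort | uniq -c | sort -rn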

3. Health Checks

Monitors queue health and exposes metrics for alerting and dashboards.

What gets monitored:

  • Queue sizes (pending, ready, completed)
  • Queue ages (oldest VM, average age)
  • Stuck VMs (VMs in pending queue longer than threshold)
  • DLQ size
  • Orphaned metadata count
  • Task queue sizes (clone, on-demand)
  • Overall health status (healthy/degraded/unhealthy)

Benefits:

  • Proactive detection of queue issues
  • Metrics for alerting and dashboards
  • Historical health tracking
  • API endpoint for health status

Configuration:

:config:
  health_check_enabled: true
  health_check_interval: 300  # seconds (5 minutes)
  
  health_thresholds:
    pending_queue_max: 100
    ready_queue_max: 500
    dlq_max_warning: 100
    dlq_max_critical: 1000
    stuck_vm_age_threshold: 7200  # 2 hours
    stuck_vm_max_warning: 10
    stuck_vm_max_critical: 50

Health Status Levels:

  • Healthy: All metrics within normal thresholds
  • Degraded: Some metrics are elevated but the service is still functional (e.g. DLQ above the warning threshold, elevated queue sizes)
  • Unhealthy: Critical thresholds exceeded (e.g. DLQ above the critical threshold, many stuck VMs, queues backed up)
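
The DLQ portion of this classification can be reproduced by hand from Redis, which is useful when verifying thresholds. A minimal sketch using the warning/critical values from the configuration example above (the real health check also weighs stuck VMs and queue sizes):

total=0
for q in pending clone ready tasks; do
  total=$(( total + $(redis-cli ZCARD vmpooler__dlq__${q}) ))
done
if [ "$total" -gt 1000 ]; then echo "unhealthy (DLQ total=$total)"
elif [ "$total" -gt 100 ]; then echo "degraded (DLQ total=$total)"
else echo "healthy (DLQ total=$total)"
fi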

Viewing Health Status:

Via Redis:

# Get current health status
redis-cli HGETALL vmpooler__health

# Get specific health metric
redis-cli HGET vmpooler__health status
redis-cli HGET vmpooler__health last_check

Via Logs:

[*] [health] Status: HEALTHY | Queues: P=45 R=230 C=12 | DLQ=25 | Stuck=3 | Orphaned=5

Exposed Metrics:

The following metrics are pushed to the metrics system (Prometheus, Graphite, etc.):

# Health status (0=healthy, 1=degraded, 2=unhealthy)
vmpooler.health.status

# Error metrics
vmpooler.health.dlq.total_size
vmpooler.health.stuck_vms.count
vmpooler.health.orphaned_metadata.count

# Per-pool queue metrics
vmpooler.health.queue.<pool_name>.pending.size
vmpooler.health.queue.<pool_name>.pending.oldest_age
vmpooler.health.queue.<pool_name>.pending.stuck_count
vmpooler.health.queue.<pool_name>.ready.size
vmpooler.health.queue.<pool_name>.ready.oldest_age
vmpooler.health.queue.<pool_name>.completed.size

# DLQ metrics
vmpooler.health.dlq.<queue_type>.size

# Task metrics
vmpooler.health.tasks.clone.active
vmpooler.health.tasks.ondemand.active
vmpooler.health.tasks.ondemand.pending
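
The dotted names above are what VMPooler pushes to its metrics backend; how they appear on the monitoring side depends on the backend and exporter. The alert examples in Scenario 5 assume a Prometheus-style mapping of dots to underscores. Illustrative PromQL under that assumption:

# Overall status (0=healthy, 1=degraded, 2=unhealthy)
vmpooler_health_status

# Expressions of the same shape as the alerts in Scenario 5
vmpooler_health_dlq_total_size > 500
vmpooler_health_stuck_vms_count > 20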

Common Scenarios

Scenario 1: Investigating Failed VM Requests

Problem: User reports VM request failed.

Steps:

  1. Check DLQ for the request (a combined search over all DLQ queues is sketched after this list):

    redis-cli ZRANGE vmpooler__dlq__pending 0 -1 | grep "req-abc123"
    redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | grep "req-abc123"
    
  2. Parse the JSON entry to see failure details:

    redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | grep "req-abc123" | jq .
    
  3. Common failure reasons:

    • template does not exist - Template missing or renamed in provider
    • permission denied - VMPooler lacks permissions to clone template
    • timeout - VM failed to become ready within timeout period
    • failed to obtain IP - Network/DHCP issue
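
If you are not sure which queue the failure landed in, all DLQ queues can be searched in one pass, using the DLQ keys listed earlier (the request ID here is illustrative):

for q in pending clone ready tasks; do
  echo "== vmpooler__dlq__${q} =="
  redis-cli ZRANGE vmpooler__dlq__${q} 0 -1 | grep "req-abc123" | jq .
done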

Scenario 2: Queue Backup

Problem: Pending queue growing, VMs not moving to ready.

Steps:

  1. Check health status:

    redis-cli HGET vmpooler__health status
    
  2. Check pending queue metrics:

    # View stuck VMs
    redis-cli HGET vmpooler__health stuck_vm_count
    
    # Check when a pending VM was cloned (SMEMBERS order is arbitrary,
    # so this is a spot check rather than the oldest VM)
    redis-cli SMEMBERS vmpooler__pending__centos-7-x86_64 | head -1 | xargs -I {} redis-cli HGET vmpooler__vm__{} clone
    
  3. Check DLQ for recent failures:

    redis-cli ZREVRANGE vmpooler__dlq__clone 0 9
    
  4. Common causes:

    • Provider errors (vCenter unreachable, no resources)
    • Network issues (can't reach VMs, no DHCP)
    • Configuration issues (wrong template name, bad credentials)
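
To check whether clone failures are still occurring rather than being historical, the DLQ scores can be counted over a recent window (the scores are UNIX timestamps, as in the earlier queries):

# Clone failures recorded in the last hour
redis-cli ZCOUNT vmpooler__dlq__clone $(date -d '1 hour ago' +%s) +inf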

Scenario 3: High DLQ Size

Problem: DLQ size growing, indicating persistent failures.

Steps:

  1. Check DLQ size:

    redis-cli ZCARD vmpooler__dlq__pending
    redis-cli ZCARD vmpooler__dlq__clone
    
  2. Identify common failure patterns:

    redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | jq -r '.error_message' | sort | uniq -c | sort -rn
    
  3. Fix the underlying issues (missing templates, insufficient permissions, network problems)

  4. Once the issues are resolved, existing DLQ entries will expire after the TTL (default 7 days); see below for trimming them sooner
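
If you would rather not wait for the TTL, older entries can be trimmed manually (again assuming the scores are UNIX timestamps):

# Remove clone DLQ entries older than 24 hours
redis-cli ZREMRANGEBYSCORE vmpooler__dlq__clone -inf $(date -d '24 hours ago' +%s)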

Scenario 4: Testing Configuration Changes

Problem: Want to test new purge thresholds without affecting production.

Steps:

  1. Enable dry-run mode:

    :config:
      purge_dry_run: true
      max_pending_age: 3600  # Test with 1 hour
    
  2. Monitor logs for purge detections:

    tail -f vmpooler.log | grep "purge.*dry-run"
    
  3. Verify that the detected entries are genuinely stale

  4. Disable dry-run when ready:

    :config:
      purge_dry_run: false
    

Scenario 5: Alerting on Queue Health

Problem: Want to be notified when queues are unhealthy.

Steps:

  1. Set up Prometheus alerts based on health metrics:
    - alert: VMPoolerUnhealthy
      expr: vmpooler_health_status >= 2
      for: 10m
      annotations:
        summary: "VMPooler is unhealthy"
    
    - alert: VMPoolerHighDLQ
      expr: vmpooler_health_dlq_total_size > 500
      for: 30m
      annotations:
        summary: "VMPooler DLQ size is high"
    
    - alert: VMPoolerStuckVMs
      expr: vmpooler_health_stuck_vms_count > 20
      for: 15m
      annotations:
        summary: "Many VMs stuck in pending queue"
    

Troubleshooting

DLQ Not Capturing Failures

Check:

  1. Is DLQ enabled? redis-cli HGET vmpooler__config dlq_enabled
  2. Are failures actually occurring? Check logs for error messages
  3. Is Redis accessible? redis-cli PING

Purge Not Running

Check:

  1. Is purge enabled? Check config purge_enabled: true
  2. Check logs for purge thread startup: [*] [purge] Starting stale queue entry purge cycle
  3. Is purge interval too long? Default is 1 hour
  4. Check thread status in logs: [!] [queue_purge] worker thread died

Health Check Not Updating

Check:

  1. Is health check enabled? Check config health_check_enabled: true
  2. Check last update time: redis-cli HGET vmpooler__health last_check
  3. Check logs for health check runs: [*] [health] Status:
  4. Check thread status: [!] [health_check] worker thread died

Metrics Not Appearing

Check:

  1. Is metrics system configured? Check :statsd or :graphite config
  2. Are metrics being sent? Check logs for metric sends
  3. Check firewall/network to metrics server
  4. Inspect the raw health data directly: redis-cli HGETALL vmpooler__health

Best Practices

Development/Testing Environments

  • Enable DLQ with shorter TTL (24-48 hours)
  • Enable purge with dry-run mode initially
  • Use aggressive purge thresholds (30min pending, 6hr ready)
  • Enable health checks with 1-minute interval
  • Monitor logs closely for issues
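
Put together, a development/testing configuration following these recommendations might look like this (a sketch; key names follow the configuration sections above):

:config:
  dlq_enabled: true
  dlq_ttl: 48                # hours
  purge_enabled: true
  purge_dry_run: true        # start in dry-run
  max_pending_age: 1800      # 30 minutes
  max_ready_age: 21600       # 6 hours
  health_check_enabled: true
  health_check_interval: 60  # 1 minute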

Production Environments

  • Enable DLQ with 7-day TTL
  • Enable purge after testing in dev
  • Use conservative purge thresholds (2hr pending, 24hr ready)
  • Enable health checks with 5-minute interval
  • Set up alerting based on health metrics
  • Monitor DLQ size and set alerts (>500 = investigate)

Capacity Planning

  • Monitor queue sizes during peak times
  • Adjust thresholds based on actual usage patterns
  • Review DLQ entries weekly for systemic issues
  • Track purge counts to identify resource leaks

Debugging

  • Keep DLQ TTL long enough for investigation (7+ days)
  • Use dry-run mode when testing threshold changes
  • Correlate DLQ entries with provider logs
  • Check health metrics before and after changes

Migration Guide

Enabling Features in an Existing Deployment

  1. Phase 1: Enable DLQ

    • Add DLQ config with conservative TTL
    • Monitor DLQ size and entry patterns
    • Verify no performance impact
    • Adjust TTL as needed
  2. Phase 2: Enable Health Checks

    • Add health check config
    • Verify metrics are exposed
    • Set up dashboards
    • Configure alerting
  3. Phase 3: Enable Purge (Dry-Run)

    • Add purge config with purge_dry_run: true
    • Monitor logs for purge detections
    • Verify thresholds are appropriate
    • Adjust thresholds based on observations
  4. Phase 4: Enable Purge (Live)

    • Set purge_dry_run: false
    • Monitor queue sizes and purge counts
    • Watch for unexpected VM removal
    • Adjust thresholds if needed
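
For reference, a configuration with all three features enabled (roughly the end state of Phase 4, using the values from the earlier configuration sections; adjust per environment):

:config:
  # Dead-letter queue
  dlq_enabled: true
  dlq_ttl: 168               # hours (7 days)
  dlq_max_entries: 10000

  # Auto-purge
  purge_enabled: true
  purge_interval: 3600       # 1 hour
  purge_dry_run: false
  max_pending_age: 7200      # 2 hours
  max_ready_age: 86400       # 24 hours
  max_completed_age: 3600    # 1 hour
  max_orphaned_age: 86400    # 24 hours

  # Health checks
  health_check_enabled: true
  health_check_interval: 300 # 5 minutes
  # health_thresholds: as shown in the Health Checks section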

Performance Considerations

  • DLQ: Minimal overhead, uses Redis sorted sets
  • Purge: Runs in background thread, iterates through queues
  • Health Checks: Lightweight, caches metrics between runs

Expected impact:

  • Redis memory: +1-5MB for DLQ (depends on DLQ size)
  • CPU: +1-2% during purge/health check cycles
  • Network: Minimal, only metric pushes

Support

For issues or questions:

  1. Check logs for error messages
  2. Review DLQ entries for failure patterns
  3. Check health status and metrics
  4. Open issue on GitHub with logs and config