Queue Reliability Features - Operator Guide
Overview
This guide covers the Dead-Letter Queue (DLQ), Auto-Purge, and Health Check features added to VMPooler for improved queue reliability and observability.
Features
1. Dead-Letter Queue (DLQ)
The DLQ captures failed VM creation attempts and queue transitions, providing visibility into failures without losing data.
What gets captured:
- VMs that fail during clone operations
- VMs that timeout in pending queue
- VMs that become unreachable in ready queue
- Any permanent errors (template not found, permission denied, etc.)
Benefits:
- Failed VMs are not lost - they're moved to DLQ for analysis
- Complete failure context (error message, timestamp, retry count, request ID)
- TTL-based expiration prevents unbounded growth
- Size limiting prevents memory issues
Configuration:
:config:
  dlq_enabled: true
  dlq_ttl: 168 # hours (7 days)
  dlq_max_entries: 10000 # per DLQ queue
Querying DLQ via Redis CLI:
# View all pending DLQ entries
redis-cli ZRANGE vmpooler__dlq__pending 0 -1
# View DLQ entries with scores (timestamps)
redis-cli ZRANGE vmpooler__dlq__pending 0 -1 WITHSCORES
# Get DLQ size
redis-cli ZCARD vmpooler__dlq__pending
# View recent failures (last 10)
redis-cli ZREVRANGE vmpooler__dlq__clone 0 9
# View entries older than 1 hour (timestamp in seconds)
redis-cli ZRANGEBYSCORE vmpooler__dlq__pending -inf $(date -d '1 hour ago' +%s)
DLQ Keys:
- vmpooler__dlq__pending - Failed pending VMs
- vmpooler__dlq__clone - Failed clone operations
- vmpooler__dlq__ready - Failed ready queue VMs
- vmpooler__dlq__tasks - Failed tasks
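To get a quick size overview across all four DLQ queues, you can loop over the keys above with redis-cli (a minimal sketch; add host/port flags as needed for your Redis deployment):
# Report the size of each DLQ queue
for q in pending clone ready tasks; do
  echo "vmpooler__dlq__${q}: $(redis-cli ZCARD "vmpooler__dlq__${q}")"
done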
Entry Format: Each DLQ entry contains:
{
"vm": "pooler-happy-elephant",
"pool": "centos-7-x86_64",
"queue_from": "pending",
"error_class": "StandardError",
"error_message": "template centos-7-template does not exist",
"failed_at": "2024-01-15T10:30:00Z",
"retry_count": 3,
"request_id": "req-abc123",
"pool_alias": "centos-7"
}
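Because each entry is a JSON document with the fields above, jq can summarize failures straight from the sorted set, for example grouped by pool and error class (a sketch assuming jq is installed and your entries use these field names):
# Count clone failures by pool and error class
redis-cli ZRANGE vmpooler__dlq__clone 0 -1 \
  | jq -r '[.pool, .error_class] | @tsv' \
  | sort | uniq -c | sort -rn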
2. Auto-Purge
Automatically removes stale entries from queues to prevent resource leaks and maintain queue health.
What gets purged:
- Pending VMs: Stuck in pending queue longer than max_pending_age
- Ready VMs: Idle in ready queue longer than max_ready_age
- Completed VMs: In completed queue longer than max_completed_age
- Orphaned Metadata: VM metadata without corresponding queue entry
Benefits:
- Prevents queue bloat from stuck/forgotten VMs
- Automatically cleans up after process crashes or bugs
- Configurable thresholds per environment
- Dry-run mode for safe testing
Configuration:
:config:
  purge_enabled: true
  purge_interval: 3600 # seconds (1 hour) - how often to run
  purge_dry_run: false # set to true to log but not purge
  # Age thresholds (in seconds)
  max_pending_age: 7200 # 2 hours
  max_ready_age: 86400 # 24 hours
  max_completed_age: 3600 # 1 hour
  max_orphaned_age: 86400 # 24 hours
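When tuning these thresholds, it helps to know how old VMs actually get in your environment. The per-VM hash stores a clone timestamp (used again in Scenario 2 below); a rough age check for a single VM could look like this (a sketch assuming GNU date and a timestamp format date can parse; the VM name is an example):
# Rough age (in seconds) of one VM, based on its recorded clone timestamp
vm="pooler-happy-elephant"   # example VM name
clone_ts=$(redis-cli HGET "vmpooler__vm__${vm}" clone)
echo "${vm} age: $(( $(date +%s) - $(date -d "${clone_ts}" +%s) ))s"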
Testing Purge (Dry-Run Mode):
:config:
  purge_enabled: true
  purge_dry_run: true # Logs what would be purged without actually purging
  max_pending_age: 600 # Use shorter thresholds for testing
Watch logs for:
[*] [purge][dry-run] Would purge stale pending VM 'pooler-happy-elephant' (age: 3650s, max: 600s)
Monitoring Purge: Check logs for purge cycles:
[*] [purge] Starting stale queue entry purge cycle
[!] [purge] Purged stale pending VM 'pooler-sad-dog' from 'centos-7-x86_64' (age: 7250s)
[!] [purge] Moved stale ready VM 'pooler-angry-cat' from 'ubuntu-2004-x86_64' to completed (age: 90000s)
[*] [purge] Completed purge cycle in 2.34s: 12 entries purged
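To track purge volume over time, the log lines above can be counted directly (a sketch; substitute your actual log path):
# Total purge events logged so far
grep -c "\[purge\] Purged stale" vmpooler.log
# Most recent purge cycle summaries
grep "\[purge\] Completed purge cycle" vmpooler.log | tail -5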
3. Health Checks
Monitors queue health and exposes metrics for alerting and dashboards.
What gets monitored:
- Queue sizes (pending, ready, completed)
- Queue ages (oldest VM, average age)
- Stuck VMs (VMs in pending queue longer than threshold)
- DLQ size
- Orphaned metadata count
- Task queue sizes (clone, on-demand)
- Overall health status (healthy/degraded/unhealthy)
Benefits:
- Proactive detection of queue issues
- Metrics for alerting and dashboards
- Historical health tracking
- API endpoint for health status
Configuration:
:config:
  health_check_enabled: true
  health_check_interval: 300 # seconds (5 minutes)
  health_thresholds:
    pending_queue_max: 100
    ready_queue_max: 500
    dlq_max_warning: 100
    dlq_max_critical: 1000
    stuck_vm_age_threshold: 7200 # 2 hours
    stuck_vm_max_warning: 10
    stuck_vm_max_critical: 50
Health Status Levels:
- Healthy: All metrics within normal thresholds
- Degraded: Some metrics elevated but functional (DLQ > warning, queue sizes elevated)
- Unhealthy: Critical thresholds exceeded (DLQ > critical, many stuck VMs, queues backed up)
Viewing Health Status:
Via Redis:
# Get current health status
redis-cli HGETALL vmpooler__health
# Get specific health metric
redis-cli HGET vmpooler__health status
redis-cli HGET vmpooler__health last_check
Via Logs:
[*] [health] Status: HEALTHY | Queues: P=45 R=230 C=12 | DLQ=25 | Stuck=3 | Orphaned=5
Exposed Metrics:
The following metrics are pushed to the metrics system (Prometheus, Graphite, etc.):
# Health status (0=healthy, 1=degraded, 2=unhealthy)
vmpooler.health.status
# Error metrics
vmpooler.health.dlq.total_size
vmpooler.health.stuck_vms.count
vmpooler.health.orphaned_metadata.count
# Per-pool queue metrics
vmpooler.health.queue.<pool_name>.pending.size
vmpooler.health.queue.<pool_name>.pending.oldest_age
vmpooler.health.queue.<pool_name>.pending.stuck_count
vmpooler.health.queue.<pool_name>.ready.size
vmpooler.health.queue.<pool_name>.ready.oldest_age
vmpooler.health.queue.<pool_name>.completed.size
# DLQ metrics
vmpooler.health.dlq.<queue_type>.size
# Task metrics
vmpooler.health.tasks.clone.active
vmpooler.health.tasks.ondemand.active
vmpooler.health.tasks.ondemand.pending
Common Scenarios
Scenario 1: Investigating Failed VM Requests
Problem: User reports VM request failed.
Steps:
1. Check DLQ for the request:
   redis-cli ZRANGE vmpooler__dlq__pending 0 -1 | grep "req-abc123"
   redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | grep "req-abc123"
2. Parse the JSON entry to see failure details:
   redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | grep "req-abc123" | jq .
3. Common failure reasons:
   - template does not exist - Template missing or renamed in provider
   - permission denied - VMPooler lacks permissions to clone template
   - timeout - VM failed to become ready within timeout period
   - failed to obtain IP - Network/DHCP issue
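If you are not sure which DLQ queue the failure landed in, the steps above can be combined into one loop over all DLQ keys (a sketch; req-abc123 is an example request ID and jq is assumed to be installed):
# Search every DLQ queue for a request ID and pretty-print matches
req="req-abc123"
for q in pending clone ready tasks; do
  redis-cli ZRANGE "vmpooler__dlq__${q}" 0 -1 | grep "$req" | jq .
done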
Scenario 2: Queue Backup
Problem: Pending queue growing, VMs not moving to ready.
Steps:
1. Check health status:
   redis-cli HGET vmpooler__health status
2. Check pending queue metrics:
   # View stuck VMs
   redis-cli HGET vmpooler__health stuck_vm_count
   # Check oldest VM age
   redis-cli SMEMBERS vmpooler__pending__centos-7-x86_64 | head -1 | xargs -I {} redis-cli HGET vmpooler__vm__{} clone
3. Check DLQ for recent failures:
   redis-cli ZREVRANGE vmpooler__dlq__clone 0 9
4. Common causes:
- Provider errors (vCenter unreachable, no resources)
- Network issues (can't reach VMs, no DHCP)
- Configuration issues (wrong template name, bad credentials)
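To see every stuck VM in a pool rather than just the first one, list each pending VM alongside its recorded clone timestamp (a sketch built from the same keys used in step 2; the pool name is an example):
# List all pending VMs in a pool with their clone timestamps
pool="centos-7-x86_64"   # example pool name
for vm in $(redis-cli SMEMBERS "vmpooler__pending__${pool}"); do
  echo "$vm $(redis-cli HGET "vmpooler__vm__${vm}" clone)"
done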
Scenario 3: High DLQ Size
Problem: DLQ size growing, indicating persistent failures.
Steps:
1. Check DLQ size:
   redis-cli ZCARD vmpooler__dlq__pending
   redis-cli ZCARD vmpooler__dlq__clone
2. Identify common failure patterns:
   redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | jq -r '.error_message' | sort | uniq -c | sort -rn
3. Fix the underlying issues (missing templates, permissions, network)
4. Once the issues are resolved, DLQ entries expire after the TTL (default 7 days)
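If the underlying problem is fixed and you do not want to wait for the TTL, old entries can be trimmed manually, since each DLQ is a sorted set scored by failure timestamp (a sketch; double-check the key and cutoff before running against production):
# Remove clone-DLQ entries older than 7 days (score = failure timestamp in seconds)
redis-cli ZREMRANGEBYSCORE vmpooler__dlq__clone -inf $(date -d '7 days ago' +%s)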
Scenario 4: Testing Configuration Changes
Problem: Want to test new purge thresholds without affecting production.
Steps:
1. Enable dry-run mode:
   :config:
     purge_dry_run: true
     max_pending_age: 3600 # Test with 1 hour
2. Monitor logs for purge detections:
   tail -f vmpooler.log | grep "purge.*dry-run"
3. Verify the detections are correct
4. Disable dry-run when ready:
   :config:
     purge_dry_run: false
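Before switching purge_dry_run to false, it is worth tallying what the dry run would have removed (a sketch against the log format shown earlier; adjust the log path for your deployment):
# Count dry-run purge detections and show the most recent ones
grep -c "\[purge\]\[dry-run\] Would purge" vmpooler.log
grep "\[purge\]\[dry-run\] Would purge" vmpooler.log | tail -10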
Scenario 5: Alerting on Queue Health
Problem: Want to be notified when queues are unhealthy.
Steps:
- Set up Prometheus alerts based on health metrics:
- alert: VMPoolerUnhealthy
  expr: vmpooler_health_status >= 2
  for: 10m
  annotations:
    summary: "VMPooler is unhealthy"
- alert: VMPoolerHighDLQ
  expr: vmpooler_health_dlq_total_size > 500
  for: 30m
  annotations:
    summary: "VMPooler DLQ size is high"
- alert: VMPoolerStuckVMs
  expr: vmpooler_health_stuck_vms_count > 20
  for: 15m
  annotations:
    summary: "Many VMs stuck in pending queue"
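If you keep these rules in a file, promtool can validate them before Prometheus loads them (a sketch; the filename is an example, and the alerts above need to be nested under the usual groups:/rules: structure in that file):
# Validate the alert rules file before deploying it
promtool check rules vmpooler-alerts.yml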
Troubleshooting
DLQ Not Capturing Failures
Check:
- Is DLQ enabled? redis-cli HGET vmpooler__config dlq_enabled
- Are failures actually occurring? Check logs for error messages
- Is Redis accessible? redis-cli PING
Purge Not Running
Check:
- Is purge enabled? Check config: purge_enabled: true
- Check logs for purge thread startup: [*] [purge] Starting stale queue entry purge cycle
- Is the purge interval too long? Default is 1 hour
- Check thread status in logs: [!] [queue_purge] worker thread died
Health Check Not Updating
Check:
- Is health check enabled? Check config: health_check_enabled: true
- Check the last update time: redis-cli HGET vmpooler__health last_check
- Check logs for health check runs: [*] [health] Status:
- Check thread status: [!] [health_check] worker thread died
Metrics Not Appearing
Check:
- Is the metrics system configured? Check the :statsd or :graphite config
- Are metrics being sent? Check logs for metric sends
- Check firewall/network access to the metrics server
- Test metrics manually: redis-cli HGETALL vmpooler__health
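For the firewall/network check, a quick connectivity test against the metrics endpoint is often enough (a sketch; the hostnames and ports are examples, using the common Graphite plaintext port 2003 and statsd UDP port 8125):
# Test reachability of the metrics server (example hosts/ports)
nc -zv graphite.example.com 2003
nc -zuv statsd.example.com 8125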
Best Practices
Development/Testing Environments
- Enable DLQ with shorter TTL (24-48 hours)
- Enable purge with dry-run mode initially
- Use aggressive purge thresholds (30min pending, 6hr ready)
- Enable health checks with 1-minute interval
- Monitor logs closely for issues
Production Environments
- Enable DLQ with 7-day TTL
- Enable purge after testing in dev
- Use conservative purge thresholds (2hr pending, 24hr ready)
- Enable health checks with 5-minute interval
- Set up alerting based on health metrics
- Monitor DLQ size and set alerts (>500 = investigate)
Capacity Planning
- Monitor queue sizes during peak times
- Adjust thresholds based on actual usage patterns
- Review DLQ entries weekly for systemic issues
- Track purge counts to identify resource leaks
Debugging
- Keep DLQ TTL long enough for investigation (7+ days)
- Use dry-run mode when testing threshold changes
- Correlate DLQ entries with provider logs
- Check health metrics before and after changes
Migration Guide
Enabling Features in Existing Deployment
Phase 1: Enable DLQ
- Add DLQ config with conservative TTL
- Monitor DLQ size and entry patterns
- Verify no performance impact
- Adjust TTL as needed
Phase 2: Enable Health Checks
- Add health check config
- Verify metrics are exposed
- Set up dashboards
- Configure alerting
Phase 3: Enable Purge (Dry-Run)
- Add purge config with purge_dry_run: true
- Monitor logs for purge detections
- Verify thresholds are appropriate
- Adjust thresholds based on observations
Phase 4: Enable Purge (Live)
- Set purge_dry_run: false
- Monitor queue sizes and purge counts
- Watch for unexpected VM removal
- Adjust thresholds if needed
Performance Considerations
- DLQ: Minimal overhead, uses Redis sorted sets
- Purge: Runs in background thread, iterates through queues
- Health Checks: Lightweight, caches metrics between runs
Expected impact:
- Redis memory: +1-5MB for DLQ (depends on DLQ size)
- CPU: +1-2% during purge/health check cycles
- Network: Minimal, only metric pushes
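To check the Redis memory estimate against your own deployment, Redis 4.0+ can report per-key usage (a sketch over the DLQ keys listed earlier):
# Approximate memory used by each DLQ sorted set, in bytes
for q in pending clone ready tasks; do
  echo "vmpooler__dlq__${q}: $(redis-cli MEMORY USAGE "vmpooler__dlq__${q}")"
done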
Support
For issues or questions:
- Check logs for error messages
- Review DLQ entries for failure patterns
- Check health status and metrics
- Open issue on GitHub with logs and config