Redis Queue Reliability Features
Overview
This document describes the implementation of dead-letter queues (DLQ), auto-purge mechanisms, and health checks for VMPooler Redis queues.
Background
Current Queue Structure
VMPooler uses Redis sets and sorted sets for queue management:
- Pool Queues (Sets): vmpooler__pending__#{pool}, vmpooler__ready__#{pool}, vmpooler__running__#{pool}, vmpooler__completed__#{pool}, vmpooler__discovered__#{pool}, vmpooler__migrating__#{pool}
- Task Queues (Sorted Sets): vmpooler__odcreate__task (on-demand creation tasks), vmpooler__provisioning__processing
- Task Queues (Sets): vmpooler__tasks__disk, vmpooler__tasks__snapshot, vmpooler__tasks__snapshot-revert
- VM Metadata (Hashes): vmpooler__vm__#{vm} - contains clone time, IP, template, pool, domain, request_id, pool_alias, error details
- Request Metadata (Hashes): vmpooler__odrequest__#{request_id} - contains status, retry_count, token info
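As a small illustration of the naming scheme above, the key patterns can be wrapped in helpers so that callers never assemble key strings by hand. The helper names (pool_queue_key, vm_key, request_key) are hypothetical and are not taken from the VMPooler code base; only the key patterns come from the list above.

```ruby
# Hypothetical helpers; only the key patterns are taken from the document.
POOL_QUEUES = %w[pending ready running completed discovered migrating].freeze

# Builds the Redis key of a per-pool queue, e.g. vmpooler__pending__centos-7.
def pool_queue_key(queue, pool)
  raise ArgumentError, "unknown queue: #{queue}" unless POOL_QUEUES.include?(queue.to_s)

  "vmpooler__#{queue}__#{pool}"
end

# Builds the Redis key of the hash holding VM metadata.
def vm_key(vm)
  "vmpooler__vm__#{vm}"
end

# Builds the Redis key of the hash holding on-demand request metadata.
def request_key(request_id)
  "vmpooler__odrequest__#{request_id}"
end

puts pool_queue_key(:pending, 'centos-7')
```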
Current Error Handling
- Permanent errors (e.g., template not found) are detected in the _clone_vm rescue block
- Failed VMs are removed from pending queue
- Request status is set to 'failed' and re-queue is prevented in the outer clone_vm rescue block
- VM metadata expires after data_ttl hours
Problem Areas
- Lost visibility: Failed messages are removed without any centralized tracking
- Stale data: VMs stuck in queues due to process crashes or bugs
- No monitoring: No automated way to detect queue health issues
- Manual cleanup: Operators must manually identify and clean stale entries
Feature Requirements
1. Dead-Letter Queue (DLQ)
Purpose
Capture failed VM creation requests for visibility, debugging, and potential retry/recovery.
Design
DLQ Structure:
vmpooler__dlq__pending # Failed pending VMs (sorted set, scored by failure timestamp)
vmpooler__dlq__clone # Failed clone operations (sorted set)
vmpooler__dlq__ready # Failed ready queue VMs (sorted set)
vmpooler__dlq__tasks # Failed tasks (hash of task_type -> failed items)
DLQ Entry Format:
{
  "vm": "vm-name-abc123",
  "pool": "pool-name",
  "queue_from": "pending",
  "error_class": "StandardError",
  "error_message": "template does not exist",
  "failed_at": "2024-01-15T10:30:00Z",
  "retry_count": 3,
  "request_id": "req-123456",
  "pool_alias": "centos-7"
}
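The entry format above could be produced by a small builder like the following. This is a sketch only: build_dlq_entry and its keyword arguments are hypothetical names, and the timestamp is assumed to be UTC in ISO 8601 form, matching the sample entry.

```ruby
require 'json'
require 'time'

# Hypothetical helper: serializes one failure into the DLQ entry format.
# `error` is the rescued exception; `now` is injectable for testing.
def build_dlq_entry(vm:, pool:, queue_from:, error:, retry_count: 0,
                    request_id: nil, pool_alias: nil, now: Time.now)
  {
    'vm' => vm,
    'pool' => pool,
    'queue_from' => queue_from,
    'error_class' => error.class.name,
    'error_message' => error.message,
    'failed_at' => now.getutc.iso8601, # UTC timestamp such as 2024-01-15T10:30:00Z
    'retry_count' => retry_count,
    'request_id' => request_id,
    'pool_alias' => pool_alias
  }.to_json
end

puts build_dlq_entry(
  vm: 'vm-name-abc123', pool: 'pool-name', queue_from: 'pending',
  error: StandardError.new('template does not exist'), retry_count: 3
)
```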
Configuration:
:redis:
  dlq_enabled: true
  dlq_ttl: 168 # hours (7 days)
  dlq_max_entries: 10000 # per DLQ queue
Implementation Points:
- fail_pending_vm: Move to DLQ when VM fails during pending checks
- _clone_vm rescue: Move to DLQ on clone failure
- _check_ready_vm: Move to DLQ when ready VM becomes unreachable
- _destroy_vm rescue: Log destroy failures to DLQ
Acceptance Criteria:
- Failed VMs are automatically moved to appropriate DLQ
- DLQ entries contain complete failure context (error, timestamp, retry count)
- DLQ entries expire after configurable TTL
- DLQ size is limited to prevent unbounded growth
- DLQ entries are queryable via Redis CLI or API
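Because each DLQ is a sorted set scored by failure timestamp, TTL enforcement and size limiting can both be expressed as range removals. The sketch below is one possible shape for move_to_dlq, not the actual implementation. To keep it runnable without a server it uses FakeRedis, a hypothetical in-memory stand-in that imitates only the four sorted-set commands involved (ZADD, ZCARD, ZREMRANGEBYSCORE, ZREMRANGEBYRANK); the method signature and keyword names are assumptions.

```ruby
# Hypothetical in-memory stand-in for the sorted-set commands used below.
class FakeRedis
  def initialize
    @zsets = Hash.new { |hash, key| hash[key] = {} } # key => { member => score }
  end

  def zadd(key, score, member)
    @zsets[key][member] = score.to_f
  end

  def zcard(key)
    @zsets[key].size
  end

  # Removes members whose score lies within min..max (inclusive).
  def zremrangebyscore(key, min, max)
    before = @zsets[key].size
    @zsets[key].reject! { |_member, score| score >= min && score <= max }
    before - @zsets[key].size
  end

  # Removes members by ascending-score rank; a negative stop counts from the end.
  def zremrangebyrank(key, start, stop)
    ranked = @zsets[key].sort_by { |member, score| [score, member] }.map(&:first)
    stop += ranked.size if stop.negative?
    return 0 if stop.negative?

    doomed = ranked[start..stop] || []
    doomed.each { |member| @zsets[key].delete(member) }
    doomed.size
  end
end

# Sketch of move_to_dlq: add the entry, then enforce TTL and the size cap.
def move_to_dlq(redis, queue_from, entry_json, dlq_ttl_hours:, dlq_max_entries:, now: Time.now)
  key = "vmpooler__dlq__#{queue_from}"
  redis.zadd(key, now.to_f, entry_json)
  # TTL: drop entries that failed more than dlq_ttl_hours ago.
  redis.zremrangebyscore(key, 0, now.to_f - (dlq_ttl_hours * 3600))
  # Size cap: keep only the newest dlq_max_entries entries.
  redis.zremrangebyrank(key, 0, -(dlq_max_entries + 1))
  redis.zcard(key)
end
```

Trimming on every write keeps each DLQ bounded without a separate sweeper; a periodic sweep would work as well if write latency matters more.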
2. Auto-Purge Mechanism
Purpose
Automatically remove stale entries from queues to prevent resource leaks and improve queue health.
Design
Purge Targets:
- Pending VMs: Stuck in pending > max_pending_age (e.g., 2 hours)
- Ready VMs: Idle in ready queue > max_ready_age (e.g., 24 hours for on-demand, 48 hours for pool)
- Completed VMs: In completed queue > max_completed_age (e.g., 1 hour)
- Orphaned VM Metadata: VM hash exists but VM not in any queue
- Expired Requests: On-demand requests > max_request_age (e.g., 24 hours)
Configuration:
:config:
  purge_enabled: true
  purge_interval: 3600 # seconds (1 hour)
  max_pending_age: 7200 # seconds (2 hours)
  max_ready_age: 86400 # seconds (24 hours)
  max_completed_age: 3600 # seconds (1 hour)
  max_orphaned_age: 86400 # seconds (24 hours)
  max_request_age: 86400 # seconds (24 hours)
  purge_dry_run: false # if true, log what would be purged but don't purge
Purge Process:
- Scan each queue for stale entries (based on age thresholds)
- Check if VM still exists in provider (optional validation)
- Move stale entries to DLQ with reason
- Remove from original queue
- Log purge metrics
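The purge steps above can be sketched for a single queue as follows. This is a simplified, in-memory illustration under stated assumptions: the queue is modelled as a hash of VM name to the time it entered the queue, the DLQ as a plain array, and purge_stale is a hypothetical name. The optional provider validation step is left out.

```ruby
# Hypothetical sketch of one purge pass over a single queue.
# queue: { vm_name => Time the VM entered the queue } (assumed shape)
# dlq:   array that receives one record per purged VM (stand-in for the DLQ)
def purge_stale(queue, max_age:, dlq:, reason:, dry_run: false, now: Time.now)
  stale = queue.select { |_vm, entered_at| now - entered_at > max_age }

  stale.each do |vm, entered_at|
    age = (now - entered_at).round
    action = dry_run ? '[dry-run] would purge' : 'purging'
    puts "#{action} #{vm} (age #{age}s, reason: #{reason})"
    next if dry_run # dry-run only logs, it never mutates state

    dlq << { 'vm' => vm, 'reason' => reason, 'age' => age }
    queue.delete(vm)
  end

  stale.keys # names of the VMs detected as stale
end

now = Time.now
pending = { 'vm-old' => now - 10_000, 'vm-new' => now - 60 }
purge_stale(pending, max_age: 7200, dlq: [], reason: 'max_pending_age exceeded', dry_run: true, now: now)
```

Returning the detected names in both modes makes it easy to compare a dry run against a later live run.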
Implementation:
- New method: purge_stale_queue_entries - main purge loop
- Helper methods: check_pending_age, check_ready_age, check_completed_age, find_orphaned_metadata
- Scheduled task: Run every purge_interval seconds
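Orphan detection reduces to a set difference: VM names that have a metadata hash minus VM names that appear in any queue. A minimal sketch of find_orphaned_metadata follows. The argument shapes are assumptions: the caller is expected to have already collected the vmpooler__vm__* key names (for example with SCAN) and the members of each queue.

```ruby
require 'set'

VM_KEY_PREFIX = 'vmpooler__vm__'

# Hypothetical sketch: returns the VM names whose metadata hash exists
# but which are not a member of any queue.
# vm_hash_keys:  array of metadata keys, e.g. ['vmpooler__vm__abc']
# queue_members: { queue_key => [vm_name, ...] } (assumed shape)
def find_orphaned_metadata(vm_hash_keys, queue_members)
  queued = queue_members.values.flatten.to_set
  vm_hash_keys
    .map { |key| key.delete_prefix(VM_KEY_PREFIX) }
    .reject { |vm| queued.include?(vm) }
end

orphans = find_orphaned_metadata(
  %w[vmpooler__vm__abc vmpooler__vm__def],
  'vmpooler__ready__centos-7' => ['abc']
)
puts orphans.inspect
```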
Acceptance Criteria:
- Stale pending VMs are detected and moved to DLQ
- Stale ready VMs are detected and moved to completed queue
- Stale completed VMs are removed from queue
- Orphaned VM metadata is detected and expired
- Purge metrics are logged (count, age, reason)
- Dry-run mode available for testing
- Purge runs on configurable interval
3. Health Checks
Purpose
Monitor Redis queue health and expose metrics for alerting and dashboards.
Design
Health Metrics:
{
  queues: {
    pending: {
      pool_name: {
        size: 10,
        oldest_age: 3600, # seconds
        avg_age: 1200,
        stuck_count: 2 # VMs older than threshold
      }
    },
    ready: { ... },
    completed: { ... },
    dlq: { ... }
  },
  tasks: {
    clone: { active: 5, pending: 10 },
    ondemand: { active: 2, pending: 5 }
  },
  processing_rate: {
    clone_rate: 10.5, # VMs per minute
    destroy_rate: 8.2
  },
  errors: {
    dlq_size: 150,
    stuck_vm_count: 5,
    orphaned_metadata_count: 12
  },
  status: "healthy|degraded|unhealthy"
}
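The per-pool fields in the structure above (size, oldest_age, avg_age, stuck_count) can all be derived from the times at which VMs entered a queue. The following is a sketch of calculate_queue_metrics under that assumption; the argument names are hypothetical, ages are whole seconds, and avg_age uses integer division.

```ruby
# Hypothetical sketch: derives per-pool queue metrics from entry times.
# entered_at_times: array of Time objects, one per VM in the queue
# stuck_threshold:  age in seconds above which a VM counts as stuck
def calculate_queue_metrics(entered_at_times, stuck_threshold:, now: Time.now)
  ages = entered_at_times.map { |entered_at| (now - entered_at).to_i }
  return { size: 0, oldest_age: 0, avg_age: 0, stuck_count: 0 } if ages.empty?

  {
    size: ages.size,
    oldest_age: ages.max,
    avg_age: ages.sum / ages.size, # integer seconds
    stuck_count: ages.count { |age| age > stuck_threshold }
  }
end

now = Time.now
puts calculate_queue_metrics([now - 8000, now - 1000, now - 600], stuck_threshold: 7200, now: now)
```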
Health Status Criteria:
- Healthy: All queues within normal thresholds, DLQ size < 100, no stuck VMs
- Degraded: Some queues elevated but functional, DLQ size < 1000, few stuck VMs
- Unhealthy: Queues critically backed up, DLQ size >= 1000, many stuck VMs
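One way to turn these criteria into code is to compare the DLQ size and stuck VM count against warning and critical thresholds. The sketch below is an interpretation rather than the definitive rule: it assumes that reaching a threshold (>=) triggers the worse status, it uses the stuck VM thresholds from the health configuration rather than "no stuck VMs", and it ignores queue size thresholds for brevity.

```ruby
# Threshold defaults mirror the sample configuration values in this document.
DEFAULT_THRESHOLDS = {
  dlq_max_warning: 100,
  dlq_max_critical: 1000,
  stuck_vm_max_warning: 10,
  stuck_vm_max_critical: 50
}.freeze

# Hypothetical sketch of determine_health_status.
# Assumption: reaching a threshold (>=) is enough to trigger that status.
def determine_health_status(dlq_size:, stuck_vm_count:, thresholds: DEFAULT_THRESHOLDS)
  if dlq_size >= thresholds[:dlq_max_critical] ||
     stuck_vm_count >= thresholds[:stuck_vm_max_critical]
    'unhealthy'
  elsif dlq_size >= thresholds[:dlq_max_warning] ||
        stuck_vm_count >= thresholds[:stuck_vm_max_warning]
    'degraded'
  else
    'healthy'
  end
end

puts determine_health_status(dlq_size: 150, stuck_vm_count: 5)
```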
Configuration:
:config:
  health_check_enabled: true
  health_check_interval: 300 # seconds (5 minutes)
  health_thresholds:
    pending_queue_max: 100
    ready_queue_max: 500
    dlq_max_warning: 100
    dlq_max_critical: 1000
    stuck_vm_age_threshold: 7200 # 2 hours
    stuck_vm_max_warning: 10
    stuck_vm_max_critical: 50
Implementation:
- New method: check_queue_health - main health check
- Helper methods: calculate_queue_metrics, calculate_processing_rate, determine_health_status
- Expose via:
  - Redis hash: vmpooler__health (for API consumption)
  - Metrics: Push to existing $metrics system
  - Logs: Periodic health summary in logs
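A Redis hash stores flat field/value pairs, so the nested health structure has to be flattened (or serialized, for example as JSON) before it is written to vmpooler__health. The sketch below shows one flattening approach with dot-separated field names. The flatten_metrics helper and the field naming convention are assumptions, not part of the design.

```ruby
# Hypothetical sketch: flattens nested metrics into dot-separated
# field names with string values, suitable for a flat Redis hash.
def flatten_metrics(metrics, prefix = nil)
  metrics.each_with_object({}) do |(key, value), flat|
    field = [prefix, key].compact.join('.')
    if value.is_a?(Hash)
      flat.merge!(flatten_metrics(value, field))
    else
      flat[field] = value.to_s
    end
  end
end

health = {
  queues: { pending: { 'centos-7' => { size: 10, oldest_age: 3600 } } },
  errors: { dlq_size: 150 },
  status: 'degraded'
}
flatten_metrics(health).each { |field, value| puts "#{field} = #{value}" }
```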
Acceptance Criteria:
- Queue sizes are monitored per pool
- Queue ages are calculated (oldest, average)
- Stuck VMs are detected (age > threshold)
- DLQ size is monitored
- Processing rates are calculated
- Overall health status is determined
- Health metrics are exposed via Redis, metrics, and logs
- Health check runs on configurable interval
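Processing rates such as clone_rate (VMs per minute) can be derived from the change in a running counter between two health checks. The sketch below assumes such counters exist and are sampled once per health_check_interval; the argument names are hypothetical.

```ruby
# Hypothetical sketch of calculate_processing_rate.
# previous_count / current_count: a running total (e.g. clones completed)
# sampled at the previous and the current health check.
# Returns operations per minute, rounded to one decimal place.
def calculate_processing_rate(previous_count, current_count, elapsed_seconds)
  return 0.0 unless elapsed_seconds.positive?

  ((current_count - previous_count) * 60.0 / elapsed_seconds).round(1)
end

# 52 clones completed over a 300-second interval.
puts calculate_processing_rate(100, 152, 300)
```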
Implementation Plan
Phase 1: Dead-Letter Queue
- Add DLQ configuration parsing
- Implement move_to_dlq helper method
- Update fail_pending_vm to use DLQ
- Update _clone_vm rescue block to use DLQ
- Update _check_ready_vm to use DLQ
- Add DLQ TTL enforcement
- Add DLQ size limiting
- Unit tests for DLQ operations
Phase 2: Auto-Purge
- Add purge configuration parsing
- Implement purge_stale_queue_entries main loop
- Implement age-checking helper methods
- Implement orphan detection
- Add purge metrics logging
- Add dry-run mode
- Unit tests for purge logic
- Integration test for full purge cycle
Phase 3: Health Checks
- Add health check configuration parsing
- Implement check_queue_health main method
- Implement metric calculation helpers
- Implement health status determination
- Expose metrics via Redis hash
- Expose metrics via $metrics system
- Add periodic health logging
- Unit tests for health check logic
Phase 4: Integration & Documentation
- Update configuration examples
- Update operator documentation
- Update API documentation (if exposing health endpoint)
- Add troubleshooting guide for DLQ/purge
- Create runbook for operators
- Update TESTING.md with DLQ/purge/health check testing
Migration & Rollout
Backward Compatibility
- All features are opt-in via configuration
- Default: dlq_enabled: false, purge_enabled: false, health_check_enabled: false
- Existing behavior unchanged when features disabled
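The opt-in behavior can be sketched as a merge of user configuration over all-false defaults, so that a configuration file which never mentions the new keys leaves every feature off. The feature_flags helper and the use of string keys are assumptions about how the parsed configuration is represented.

```ruby
# Defaults taken from the list above: every new feature starts disabled.
FEATURE_DEFAULTS = {
  'dlq_enabled' => false,
  'purge_enabled' => false,
  'health_check_enabled' => false
}.freeze

# Hypothetical helper: returns the effective feature flags for a parsed
# configuration hash, ignoring unrelated keys.
def feature_flags(config)
  FEATURE_DEFAULTS.merge(config.slice(*FEATURE_DEFAULTS.keys))
end

puts feature_flags({}).inspect                      # all features disabled
puts feature_flags('dlq_enabled' => true).inspect   # only the DLQ enabled
```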
Rollout Strategy
- Deploy with features disabled
- Enable DLQ first, monitor for issues
- Enable health checks, validate metrics
- Enable auto-purge in dry-run mode, validate detection
- Enable auto-purge in live mode, monitor impact
Monitoring During Rollout
- Monitor DLQ growth rate
- Monitor purge counts and reasons
- Monitor health status changes
- Watch for unexpected VM removal
- Check for performance impact (Redis load, memory)
Testing Strategy
Unit Tests
- DLQ capture for various error scenarios
- DLQ TTL enforcement
- DLQ size limiting
- Age calculation for purge detection
- Orphan detection logic
- Health metric calculations
- Health status determination
Integration Tests
- End-to-end VM failure → DLQ flow
- End-to-end purge cycle
- Health check with real queue data
- DLQ + purge interaction (purge should respect DLQ entries)
Manual Testing
- Create VM with invalid template → verify DLQ entry
- Let VM sit in pending too long → verify purge detection
- Check health endpoint → verify metrics accuracy
- Run purge in dry-run → verify correct detection without deletion
- Run purge in live mode → verify stale entries removed
API Changes (Optional)
If exposing to API:
GET /api/v1/queue/health
Returns: Health metrics JSON
GET /api/v1/queue/dlq?queue=pending&limit=50
Returns: DLQ entries for specified queue
POST /api/v1/queue/purge?dry_run=true
Returns: Purge simulation results (admin only)
Metrics
New metrics to add:
vmpooler.dlq.pending.size
vmpooler.dlq.clone.size
vmpooler.dlq.ready.size
vmpooler.dlq.tasks.size
vmpooler.purge.pending.count
vmpooler.purge.ready.count
vmpooler.purge.completed.count
vmpooler.purge.orphaned.count
vmpooler.health.status # 0=healthy, 1=degraded, 2=unhealthy
vmpooler.health.stuck_vms.count
vmpooler.health.queue.#{queue_name}.size
vmpooler.health.queue.#{queue_name}.oldest_age
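The health metrics in this list can be assembled into a name-to-value map before being handed to the metrics system. The sketch below covers only the vmpooler.health.* names and the 0/1/2 status encoding given above. The health_gauges helper and its argument shapes are hypothetical, and how the values are pushed to $metrics is left open.

```ruby
# Status encoding from the metrics list: 0=healthy, 1=degraded, 2=unhealthy.
STATUS_GAUGE = { 'healthy' => 0, 'degraded' => 1, 'unhealthy' => 2 }.freeze

# Hypothetical helper: builds { metric_name => value } for the health metrics.
# queue_metrics: { queue_name => { size: Integer, oldest_age: Integer } }
def health_gauges(status:, stuck_vm_count:, queue_metrics:)
  gauges = {
    'vmpooler.health.status' => STATUS_GAUGE.fetch(status),
    'vmpooler.health.stuck_vms.count' => stuck_vm_count
  }
  queue_metrics.each do |queue_name, metrics|
    gauges["vmpooler.health.queue.#{queue_name}.size"] = metrics[:size]
    gauges["vmpooler.health.queue.#{queue_name}.oldest_age"] = metrics[:oldest_age]
  end
  gauges
end

gauges = health_gauges(
  status: 'degraded', stuck_vm_count: 5,
  queue_metrics: { 'pending' => { size: 10, oldest_age: 3600 } }
)
gauges.each { |name, value| puts "#{name} #{value}" }
```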
Configuration Example
---
:config:
  # Existing config...
  # Dead-Letter Queue
  dlq_enabled: true
  dlq_ttl: 168 # hours (7 days)
  dlq_max_entries: 10000
  # Auto-Purge
  purge_enabled: true
  purge_interval: 3600 # seconds (1 hour)
  purge_dry_run: false
  max_pending_age: 7200 # seconds (2 hours)
  max_ready_age: 86400 # seconds (24 hours)
  max_completed_age: 3600 # seconds (1 hour)
  max_orphaned_age: 86400 # seconds (24 hours)
  max_request_age: 86400 # seconds (24 hours)
  # Health Checks
  health_check_enabled: true
  health_check_interval: 300 # seconds (5 minutes)
  health_thresholds:
    pending_queue_max: 100
    ready_queue_max: 500
    dlq_max_warning: 100
    dlq_max_critical: 1000
    stuck_vm_age_threshold: 7200 # 2 hours
    stuck_vm_max_warning: 10
    stuck_vm_max_critical: 50
:redis:
  # Existing redis config...