# Queue Reliability Features - Operator Guide

## Overview

This guide covers the Dead-Letter Queue (DLQ), Auto-Purge, and Health Check features added to VMPooler for improved queue reliability and observability.

## Features

### 1. Dead-Letter Queue (DLQ)

The DLQ captures failed VM creation attempts and queue transitions, providing visibility into failures without losing data.

**What gets captured:**

- VMs that fail during clone operations
- VMs that time out in the pending queue
- VMs that become unreachable in the ready queue
- Any permanent errors (template not found, permission denied, etc.)

**Benefits:**

- Failed VMs are not lost - they are moved to the DLQ for analysis
- Complete failure context (error message, timestamp, retry count, request ID)
- TTL-based expiration prevents unbounded growth
- Size limiting prevents memory issues

**Configuration:**

```yaml
:config:
  dlq_enabled: true
  dlq_ttl: 168            # hours (7 days)
  dlq_max_entries: 10000  # per DLQ queue
```

**Querying DLQ via Redis CLI:**

```bash
# View all pending DLQ entries
redis-cli ZRANGE vmpooler__dlq__pending 0 -1

# View DLQ entries with scores (timestamps)
redis-cli ZRANGE vmpooler__dlq__pending 0 -1 WITHSCORES

# Get DLQ size
redis-cli ZCARD vmpooler__dlq__pending

# View recent failures (last 10)
redis-cli ZREVRANGE vmpooler__dlq__clone 0 9

# View entries older than 1 hour (timestamp in seconds)
redis-cli ZRANGEBYSCORE vmpooler__dlq__pending -inf $(date -d '1 hour ago' +%s)
```

**DLQ Keys:**

- `vmpooler__dlq__pending` - Failed pending VMs
- `vmpooler__dlq__clone` - Failed clone operations
- `vmpooler__dlq__ready` - Failed ready queue VMs
- `vmpooler__dlq__tasks` - Failed tasks

**Entry Format:**

Each DLQ entry contains:

```json
{
  "vm": "pooler-happy-elephant",
  "pool": "centos-7-x86_64",
  "queue_from": "pending",
  "error_class": "StandardError",
  "error_message": "template centos-7-template does not exist",
  "failed_at": "2024-01-15T10:30:00Z",
  "retry_count": 3,
  "request_id": "req-abc123",
  "pool_alias": "centos-7"
}
```

### 2. Auto-Purge

Automatically removes stale entries from queues to prevent resource leaks and maintain queue health.

**What gets purged:**

- **Pending VMs**: Stuck in the pending queue longer than `max_pending_age`
- **Ready VMs**: Idle in the ready queue longer than `max_ready_age`
- **Completed VMs**: In the completed queue longer than `max_completed_age`
- **Orphaned metadata**: VM metadata without a corresponding queue entry

**Benefits:**

- Prevents queue bloat from stuck or forgotten VMs
- Automatically cleans up after process crashes or bugs
- Configurable thresholds per environment
- Dry-run mode for safe testing

**Configuration:**

```yaml
:config:
  purge_enabled: true
  purge_interval: 3600   # seconds (1 hour) - how often to run
  purge_dry_run: false   # set to true to log but not purge
  # Age thresholds (in seconds)
  max_pending_age: 7200      # 2 hours
  max_ready_age: 86400       # 24 hours
  max_completed_age: 3600    # 1 hour
  max_orphaned_age: 86400    # 24 hours
```

**Testing Purge (Dry-Run Mode):**

```yaml
:config:
  purge_enabled: true
  purge_dry_run: true    # Logs what would be purged without actually purging
  max_pending_age: 600   # Use shorter thresholds for testing
```

Watch logs for:

```
[*] [purge][dry-run] Would purge stale pending VM 'pooler-happy-elephant' (age: 3650s, max: 600s)
```

**Monitoring Purge:**

Check logs for purge cycles:

```
[*] [purge] Starting stale queue entry purge cycle
[!] [purge] Purged stale pending VM 'pooler-sad-dog' from 'centos-7-x86_64' (age: 7250s)
[!] [purge] Moved stale ready VM 'pooler-angry-cat' from 'ubuntu-2004-x86_64' to completed (age: 90000s)
[*] [purge] Completed purge cycle in 2.34s: 12 entries purged
```
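The TTL-based expiration described in the DLQ section above can also be driven by hand for one-off cleanup. A minimal sketch, assuming the sorted-set scores are Unix-epoch failure timestamps (as shown by `WITHSCORES` above) and GNU `date`; VMPooler's own expiry should normally make this unnecessary:

```bash
# Manually remove DLQ entries older than the 7-day default TTL.
cutoff=$(date -d '7 days ago' +%s)
for q in pending clone ready tasks; do
  removed=$(redis-cli ZREMRANGEBYSCORE "vmpooler__dlq__${q}" -inf "$cutoff")
  echo "vmpooler__dlq__${q}: removed ${removed} stale entries"
done
```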
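Before tightening `max_pending_age`, it can help to see the actual age distribution of a pending queue. A rough sketch, assuming (as in Scenario 2 below) that the per-VM hash's `clone` field holds a timestamp GNU `date -d` can parse; the pool name is an example:

```bash
# Print the age of every VM in a pending queue, oldest first.
now=$(date +%s)
for vm in $(redis-cli SMEMBERS vmpooler__pending__centos-7-x86_64); do
  cloned=$(redis-cli HGET "vmpooler__vm__${vm}" clone)
  [ -n "$cloned" ] || continue   # skip VMs with no clone timestamp recorded
  age=$(( now - $(date -d "$cloned" +%s) ))
  echo "${vm}: ${age}s in pending"
done | sort -rn -k2
```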
### 3. Health Checks

Monitors queue health and exposes metrics for alerting and dashboards.

**What gets monitored:**

- Queue sizes (pending, ready, completed)
- Queue ages (oldest VM, average age)
- Stuck VMs (VMs in the pending queue longer than threshold)
- DLQ size
- Orphaned metadata count
- Task queue sizes (clone, on-demand)
- Overall health status (healthy/degraded/unhealthy)

**Benefits:**

- Proactive detection of queue issues
- Metrics for alerting and dashboards
- Historical health tracking
- API endpoint for health status

**Configuration:**

```yaml
:config:
  health_check_enabled: true
  health_check_interval: 300   # seconds (5 minutes)
  health_thresholds:
    pending_queue_max: 100
    ready_queue_max: 500
    dlq_max_warning: 100
    dlq_max_critical: 1000
    stuck_vm_age_threshold: 7200   # 2 hours
    stuck_vm_max_warning: 10
    stuck_vm_max_critical: 50
```

**Health Status Levels:**

- **Healthy**: All metrics within normal thresholds
- **Degraded**: Some metrics elevated but functional (DLQ above warning threshold, queue sizes elevated)
- **Unhealthy**: Critical thresholds exceeded (DLQ above critical threshold, many stuck VMs, queues backed up)

**Viewing Health Status:**

Via Redis:

```bash
# Get current health status
redis-cli HGETALL vmpooler__health

# Get specific health metrics
redis-cli HGET vmpooler__health status
redis-cli HGET vmpooler__health last_check
```

Via logs:

```
[*] [health] Status: HEALTHY | Queues: P=45 R=230 C=12 | DLQ=25 | Stuck=3 | Orphaned=5
```

**Exposed Metrics:**

The following metrics are pushed to the metrics system (Prometheus, Graphite, etc.), where `<pool>` is the pool name:

```
# Health status (0=healthy, 1=degraded, 2=unhealthy)
vmpooler.health.status

# Error metrics
vmpooler.health.dlq.total_size
vmpooler.health.stuck_vms.count
vmpooler.health.orphaned_metadata.count

# Per-pool queue metrics
vmpooler.health.queue.<pool>.pending.size
vmpooler.health.queue.<pool>.pending.oldest_age
vmpooler.health.queue.<pool>.pending.stuck_count
vmpooler.health.queue.<pool>.ready.size
vmpooler.health.queue.<pool>.ready.oldest_age
vmpooler.health.queue.<pool>.completed.size

# DLQ metrics
vmpooler.health.dlq.<pool>.size

# Task metrics
vmpooler.health.tasks.clone.active
vmpooler.health.tasks.ondemand.active
vmpooler.health.tasks.ondemand.pending
```

## Common Scenarios

### Scenario 1: Investigating Failed VM Requests

**Problem:** A user reports that a VM request failed.

**Steps:**

1. Check the DLQ for the request:

   ```bash
   redis-cli ZRANGE vmpooler__dlq__pending 0 -1 | grep "req-abc123"
   redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | grep "req-abc123"
   ```

2. Parse the JSON entry to see failure details:

   ```bash
   redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | grep "req-abc123" | jq .
   ```

3. Common failure reasons:
   - `template does not exist` - Template missing or renamed in the provider
   - `permission denied` - VMPooler lacks permissions to clone the template
   - `timeout` - VM failed to become ready within the timeout period
   - `failed to obtain IP` - Network/DHCP issue

### Scenario 2: Queue Backup

**Problem:** The pending queue is growing and VMs are not moving to ready.

**Steps:**

1. Check health status:

   ```bash
   redis-cli HGET vmpooler__health status
   ```

2. Check pending queue metrics:

   ```bash
   # View stuck VM count
   redis-cli HGET vmpooler__health stuck_vm_count

   # Check oldest VM age
   redis-cli SMEMBERS vmpooler__pending__centos-7-x86_64 | head -1 | xargs -I {} redis-cli HGET vmpooler__vm__{} clone
   ```

3. Check the DLQ for recent failures:

   ```bash
   redis-cli ZREVRANGE vmpooler__dlq__clone 0 9
   ```

4. Common causes:
   - Provider errors (vCenter unreachable, no resources)
   - Network issues (can't reach VMs, no DHCP)
   - Configuration issues (wrong template name, bad credentials)
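When triaging either a queue backup or a growing DLQ (Scenario 3 below), it is often quicker to read the most recent failures with human-readable timestamps. A small sketch, assuming each DLQ entry is single-line JSON with the fields shown under "Entry Format" and that scores are Unix-epoch timestamps:

```bash
# Show the 10 most recent clone failures with readable timestamps.
redis-cli ZREVRANGE vmpooler__dlq__clone 0 9 WITHSCORES | paste - - |
while IFS=$'\t' read -r entry score; do
  printf '%s  %s\n' "$(date -d "@${score%.*}" '+%F %T')" \
    "$(echo "$entry" | jq -r '"\(.pool): \(.error_message)"')"
done
```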
### Scenario 3: High DLQ Size

**Problem:** The DLQ is growing, indicating persistent failures.

**Steps:**

1. Check DLQ sizes:

   ```bash
   redis-cli ZCARD vmpooler__dlq__pending
   redis-cli ZCARD vmpooler__dlq__clone
   ```

2. Identify common failure patterns:

   ```bash
   redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | jq -r '.error_message' | sort | uniq -c | sort -rn
   ```

3. Fix the underlying issues (template exists, permissions, network).

4. Once the issues are resolved, DLQ entries will expire after the TTL (default 7 days).

### Scenario 4: Testing Configuration Changes

**Problem:** You want to test new purge thresholds without affecting production.

**Steps:**

1. Enable dry-run mode:

   ```yaml
   :config:
     purge_dry_run: true
     max_pending_age: 3600   # Test with 1 hour
   ```

2. Monitor logs for purge detections:

   ```bash
   tail -f vmpooler.log | grep "purge.*dry-run"
   ```

3. Verify the detections are correct.

4. Disable dry-run when ready:

   ```yaml
   :config:
     purge_dry_run: false
   ```

### Scenario 5: Alerting on Queue Health

**Problem:** You want to be notified when queues are unhealthy.

**Steps:**

1. Set up Prometheus alerts based on the health metrics:

   ```yaml
   - alert: VMPoolerUnhealthy
     expr: vmpooler_health_status >= 2
     for: 10m
     annotations:
       summary: "VMPooler is unhealthy"

   - alert: VMPoolerHighDLQ
     expr: vmpooler_health_dlq_total_size > 500
     for: 30m
     annotations:
       summary: "VMPooler DLQ size is high"

   - alert: VMPoolerStuckVMs
     expr: vmpooler_health_stuck_vms_count > 20
     for: 15m
     annotations:
       summary: "Many VMs stuck in pending queue"
   ```

## Troubleshooting

### DLQ Not Capturing Failures

**Check:**

1. Is the DLQ enabled? `redis-cli HGET vmpooler__config dlq_enabled`
2. Are failures actually occurring? Check logs for error messages.
3. Is Redis accessible? `redis-cli PING`

### Purge Not Running

**Check:**

1. Is purge enabled? Check the config for `purge_enabled: true`.
2. Check logs for purge thread startup: `[*] [purge] Starting stale queue entry purge cycle`
3. Is the purge interval too long? The default is 1 hour.
4. Check thread status in logs: `[!] [queue_purge] worker thread died`

### Health Check Not Updating

**Check:**

1. Is the health check enabled? Check the config for `health_check_enabled: true`.
2. Check the last update time: `redis-cli HGET vmpooler__health last_check`
3. Check logs for health check runs: `[*] [health] Status:`
4. Check thread status: `[!] [health_check] worker thread died`

### Metrics Not Appearing

**Check:**

1. Is a metrics system configured? Check the `:statsd` or `:graphite` config.
2. Are metrics being sent? Check logs for metric sends.
3. Check firewall/network connectivity to the metrics server.
4. Check the underlying health data manually: `redis-cli HGETALL vmpooler__health`
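To run the checks above in one pass, a small snapshot script can be handy. A sketch using only the keys and fields shown in this guide; adjust names to match your deployment:

```bash
#!/usr/bin/env bash
# One-shot snapshot of the queue-reliability state.
echo "redis ping:    $(redis-cli PING)"
echo "health status: $(redis-cli HGET vmpooler__health status)"
echo "last check:    $(redis-cli HGET vmpooler__health last_check)"
for q in pending clone ready tasks; do
  printf 'dlq %-8s %s entries\n' "${q}" "$(redis-cli ZCARD vmpooler__dlq__${q})"
done
```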
## Best Practices

### Development/Testing Environments

- Enable the DLQ with a shorter TTL (24-48 hours)
- Enable purge in dry-run mode initially
- Use aggressive purge thresholds (30 min pending, 6 hr ready)
- Enable health checks with a 1-minute interval
- Monitor logs closely for issues

### Production Environments

- Enable the DLQ with a 7-day TTL
- Enable purge after testing in dev
- Use conservative purge thresholds (2 hr pending, 24 hr ready)
- Enable health checks with a 5-minute interval
- Set up alerting based on health metrics
- Monitor DLQ size and set alerts (>500 entries = investigate)

### Capacity Planning

- Monitor queue sizes during peak times
- Adjust thresholds based on actual usage patterns
- Review DLQ entries weekly for systemic issues
- Track purge counts to identify resource leaks

### Debugging

- Keep the DLQ TTL long enough for investigation (7+ days)
- Use dry-run mode when testing threshold changes
- Correlate DLQ entries with provider logs
- Check health metrics before and after changes

## Migration Guide

### Enabling Features in an Existing Deployment

1. **Phase 1: Enable DLQ**
   - Add DLQ config with a conservative TTL
   - Monitor DLQ size and entry patterns
   - Verify there is no performance impact
   - Adjust the TTL as needed

2. **Phase 2: Enable Health Checks**
   - Add health check config
   - Verify metrics are exposed
   - Set up dashboards
   - Configure alerting

3. **Phase 3: Enable Purge (Dry-Run)**
   - Add purge config with `purge_dry_run: true`
   - Monitor logs for purge detections
   - Verify thresholds are appropriate
   - Adjust thresholds based on observations

4. **Phase 4: Enable Purge (Live)**
   - Set `purge_dry_run: false`
   - Monitor queue sizes and purge counts (see the watch-loop sketch at the end of this guide)
   - Watch for unexpected VM removal
   - Adjust thresholds if needed

## Performance Considerations

- **DLQ**: Minimal overhead; uses Redis sorted sets
- **Purge**: Runs in a background thread and iterates through queues
- **Health checks**: Lightweight; caches metrics between runs

Expected impact:

- Redis memory: +1-5 MB for the DLQ (depends on DLQ size)
- CPU: +1-2% during purge/health check cycles
- Network: minimal; only metric pushes

## Support

For issues or questions:

1. Check logs for error messages
2. Review DLQ entries for failure patterns
3. Check health status and metrics
4. Open an issue on GitHub with logs and config
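As a companion to Phase 4 of the migration guide above, a simple watch loop for keeping an eye on queue sizes and purge activity while live purge beds in. A sketch; the pool name and log path are placeholders, and `SCARD` matches the set-based queues queried with `SMEMBERS` elsewhere in this guide:

```bash
# Refresh queue sizes, health status, and recent purge log lines every minute.
watch -n 60 '
  echo "pending: $(redis-cli SCARD vmpooler__pending__centos-7-x86_64)"
  echo "ready:   $(redis-cli SCARD vmpooler__ready__centos-7-x86_64)"
  echo "health:  $(redis-cli HGET vmpooler__health status)"
  grep "\[purge\]" vmpooler.log | tail -5
'
```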