diff --git a/Gemfile b/Gemfile
index 122d6b5..0313b80 100644
--- a/Gemfile
+++ b/Gemfile
@@ -3,11 +3,11 @@ source ENV['GEM_SOURCE'] || 'https://rubygems.org'
 gemspec
 
 # Evaluate Gemfile.local if it exists
-if File.exists? "#{__FILE__}.local"
+if File.exist? "#{__FILE__}.local"
   instance_eval(File.read("#{__FILE__}.local"))
 end
 
 # Evaluate ~/.gemfile if it exists
-if File.exists?(File.join(Dir.home, '.gemfile'))
+if File.exist?(File.join(Dir.home, '.gemfile'))
   instance_eval(File.read(File.join(Dir.home, '.gemfile')))
 end
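
Note on the Gemfile hunk above: `File.exists?` was a deprecated alias that Ruby 3.2 removed, while `File.exist?` is available on every current Ruby version. A minimal sketch of the same "read an optional file" pattern follows; `read_optional` and `demo_optional_read` are hypothetical helpers used only for illustration and are not part of the Gemfile.

```ruby
require 'tmpdir'

# Hypothetical helper: returns the file contents, or nil when the
# optional file is absent.
def read_optional(path)
  File.exist?(path) ? File.read(path) : nil
end

# Exercises the helper against a temporary directory, before and after
# the optional file is created.
def demo_optional_read
  Dir.mktmpdir do |dir|
    local = File.join(dir, 'Gemfile.local')
    before = read_optional(local)
    File.write(local, "gem 'pry'\n")
    [before, read_optional(local)]
  end
end
```

Calling `demo_optional_read` returns a pair: the result for the missing file, then the result once the file exists.
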
diff --git a/Gemfile.lock b/Gemfile.lock
index 418f24d..2099da1 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -197,6 +197,7 @@ GEM
 PLATFORMS
   arm64-darwin-22
   arm64-darwin-23
+  arm64-darwin-25
   universal-java-11
   universal-java-17
   x86_64-darwin-22
diff --git a/IMPLEMENTATION_SUMMARY.md b/IMPLEMENTATION_SUMMARY.md
deleted file mode 100644
index 0e5e432..0000000
--- a/IMPLEMENTATION_SUMMARY.md
+++ /dev/null
@@ -1,375 +0,0 @@
-# Implementation Summary: Redis Queue Reliability Features
-
-## Overview
-Successfully implemented Dead-Letter Queue (DLQ), Auto-Purge, and Health Check features for VMPooler to improve Redis queue reliability and observability.
-
-## Branch
-- **Repository**: `/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler`
-- **Branch**: `P4DEVOPS-8567` (created from main)
-- **Status**: Implementation complete, ready for testing
-
-## What Was Implemented
-
-### 1. Dead-Letter Queue (DLQ)
-**Purpose**: Capture and track failed VM operations for visibility and debugging.
-
-**Files Modified**:
-- [`lib/vmpooler/pool_manager.rb`](/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler/lib/vmpooler/pool_manager.rb)
-  - Added `dlq_enabled?`, `dlq_ttl`, `dlq_max_entries` helper methods
-  - Added `move_to_dlq` method to capture failures
-  - Updated `handle_timed_out_vm` to use DLQ
-  - Updated `_clone_vm` rescue block to use DLQ
-  - Updated `vm_still_ready?` rescue block to use DLQ
-
-**Features**:
-- ✅ Captures failures from pending, clone, and ready queues
-- ✅ Stores complete failure context (VM, pool, error, timestamp, retry count, request ID)
-- ✅ Uses Redis sorted sets (scored by timestamp) for easy age-based queries
-- ✅ Enforces TTL-based expiration (default 7 days)
-- ✅ Enforces max entries limit to prevent unbounded growth
-- ✅ Automatically trims oldest entries when limit reached
-- ✅ Increments metrics for DLQ operations
-
-**DLQ Keys**:
-- `vmpooler__dlq__pending` - Failed pending VMs
-- `vmpooler__dlq__clone` - Failed clone operations  
-- `vmpooler__dlq__ready` - Failed ready queue VMs
-
-### 2. Auto-Purge Mechanism
-**Purpose**: Automatically remove stale entries from queues to prevent resource leaks.
-
-**Files Modified**:
-- [`lib/vmpooler/pool_manager.rb`](/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler/lib/vmpooler/pool_manager.rb)
-  - Added `purge_enabled?`, `purge_dry_run?` helper methods
-  - Added age threshold methods: `max_pending_age`, `max_ready_age`, `max_completed_age`, `max_orphaned_age`
-  - Added `purge_stale_queue_entries` main loop
-  - Added `purge_pending_queue`, `purge_ready_queue`, `purge_completed_queue` methods
-  - Added `purge_orphaned_metadata` method
-  - Integrated purge thread into main execution loop
-
-**Features**:
-- ✅ Purges pending VMs stuck longer than threshold (default 2 hours)
-- ✅ Purges ready VMs idle longer than threshold (default 24 hours)
-- ✅ Purges completed VMs older than threshold (default 1 hour)
-- ✅ Detects and expires orphaned VM metadata
-- ✅ Moves purged pending VMs to DLQ for visibility
-- ✅ Dry-run mode for testing (logs without purging)
-- ✅ Configurable purge interval (default 1 hour)
-- ✅ Increments per-pool purge metrics
-- ✅ Runs in background thread
-
-### 3. Health Checks
-**Purpose**: Monitor queue health and expose metrics for alerting and dashboards.
-
-**Files Modified**:
-- [`lib/vmpooler/pool_manager.rb`](/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler/lib/vmpooler/pool_manager.rb)
-  - Added `health_check_enabled?`, `health_thresholds` helper methods
-  - Added `check_queue_health` main method
-  - Added `calculate_health_metrics` to gather queue metrics
-  - Added `calculate_queue_ages` helper
-  - Added `count_orphaned_metadata` helper
-  - Added `determine_health_status` to classify health (healthy/degraded/unhealthy)
-  - Added `log_health_summary` for log output
-  - Added `push_health_metrics` to expose metrics
-  - Integrated health check thread into main execution loop
-
-**Features**:
-- ✅ Monitors per-pool queue sizes (pending, ready, completed)
-- ✅ Calculates queue ages (oldest, average)
-- ✅ Detects stuck VMs (age > threshold)
-- ✅ Monitors DLQ sizes
-- ✅ Counts orphaned metadata
-- ✅ Monitors task queue sizes (clone, on-demand)
-- ✅ Determines overall health status (healthy/degraded/unhealthy)
-- ✅ Stores metrics in Redis for API consumption (`vmpooler__health`)
-- ✅ Pushes metrics to metrics system (Prometheus, Graphite)
-- ✅ Logs periodic health summary
-- ✅ Configurable thresholds and intervals
-- ✅ Runs in background thread
-
-## Configuration
-
-**Files Created**:
-- [`vmpooler.yml.example`](/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler.yml.example) - Example configuration showing all options
-
-**Configuration Options**:
-
-```yaml
-:config:
-  # Dead-Letter Queue
-  dlq_enabled: false  # Set to true to enable
-  dlq_ttl: 168  # hours (7 days)
-  dlq_max_entries: 10000
-  
-  # Auto-Purge
-  purge_enabled: false  # Set to true to enable
-  purge_interval: 3600  # seconds (1 hour)
-  purge_dry_run: false  # Set to true for testing
-  max_pending_age: 7200  # 2 hours
-  max_ready_age: 86400  # 24 hours
-  max_completed_age: 3600  # 1 hour
-  max_orphaned_age: 86400  # 24 hours
-  
-  # Health Checks
-  health_check_enabled: false  # Set to true to enable
-  health_check_interval: 300  # seconds (5 minutes)
-  health_thresholds:
-    pending_queue_max: 100
-    ready_queue_max: 500
-    dlq_max_warning: 100
-    dlq_max_critical: 1000
-    stuck_vm_age_threshold: 7200
-    stuck_vm_max_warning: 10
-    stuck_vm_max_critical: 50
-```
-
-## Documentation
-
-**Files Created**:
-1. [`REDIS_QUEUE_RELIABILITY.md`](/Users/mahima.singh/vmpooler-projects/Vmpooler/REDIS_QUEUE_RELIABILITY.md)
-   - Comprehensive design document
-   - Feature requirements with acceptance criteria
-   - Implementation plan and phases
-   - Configuration examples
-   - Metrics definitions
-
-2. [`QUEUE_RELIABILITY_OPERATOR_GUIDE.md`](/Users/mahima.singh/vmpooler-projects/Vmpooler/QUEUE_RELIABILITY_OPERATOR_GUIDE.md)
-   - Complete operator guide
-   - Feature descriptions and benefits
-   - Configuration examples
-   - Common scenarios and troubleshooting
-   - Best practices
-   - Migration guide
-
-## Testing
-
-**Files Created**:
-- [`spec/unit/queue_reliability_spec.rb`](/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler/spec/unit/queue_reliability_spec.rb)
-  - 30+ unit tests covering:
-    - DLQ helper methods and operations
-    - Purge helper methods and queue operations
-    - Health check calculations and status determination
-    - Metric push operations
-
-**Test Coverage**:
-- ✅ DLQ enabled/disabled states
-- ✅ DLQ TTL and max entries configuration
-- ✅ DLQ entry creation with all fields
-- ✅ DLQ max entries enforcement
-- ✅ Purge enabled/disabled states
-- ✅ Purge dry-run mode
-- ✅ Purge age threshold configuration
-- ✅ Purge pending, ready, completed queues
-- ✅ Purge orphaned metadata detection
-- ✅ Health check enabled/disabled states
-- ✅ Health threshold configuration
-- ✅ Queue age calculations
-- ✅ Health status determination (healthy/degraded/unhealthy)
-- ✅ Metric push operations
-
-## Code Quality
-
-**Validation**:
-- ✅ Ruby syntax check passed: `ruby -c lib/vmpooler/pool_manager.rb` → Syntax OK
-- ✅ No compilation errors
-- ✅ Follows existing VMPooler code patterns
-- ✅ Proper error handling with rescue blocks
-- ✅ Logging at appropriate levels ('s' for significant, 'd' for debug)
-- ✅ Metrics increments and gauges
-
-## Metrics
-
-**New Metrics Added**:
-
-```
-# DLQ metrics
-vmpooler.dlq.pending.count
-vmpooler.dlq.clone.count
-vmpooler.dlq.ready.count
-
-# Purge metrics
-vmpooler.purge.pending.<pool>.count
-vmpooler.purge.ready.<pool>.count
-vmpooler.purge.completed.<pool>.count
-vmpooler.purge.orphaned.count
-vmpooler.purge.cycle.duration
-vmpooler.purge.total.count
-
-# Health metrics
-vmpooler.health.status  # 0=healthy, 1=degraded, 2=unhealthy
-vmpooler.health.dlq.total_size
-vmpooler.health.stuck_vms.count
-vmpooler.health.orphaned_metadata.count
-vmpooler.health.queue.<pool>.pending.size
-vmpooler.health.queue.<pool>.pending.oldest_age
-vmpooler.health.queue.<pool>.pending.stuck_count
-vmpooler.health.queue.<pool>.ready.size
-vmpooler.health.queue.<pool>.ready.oldest_age
-vmpooler.health.queue.<pool>.completed.size
-vmpooler.health.dlq.<type>.size
-vmpooler.health.tasks.clone.active
-vmpooler.health.tasks.ondemand.active
-vmpooler.health.tasks.ondemand.pending
-vmpooler.health.check.duration
-```
-
-## Next Steps
-
-### 1. Local Testing
-```bash
-cd /Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler
-
-# Run unit tests
-bundle exec rspec spec/unit/queue_reliability_spec.rb
-
-# Run all tests
-bundle exec rspec
-```
-
-### 2. Enable Features in Development
-Update your vmpooler configuration:
-```yaml
-:config:
-  # Start with DLQ only
-  dlq_enabled: true
-  dlq_ttl: 24  # Short TTL for dev
-  
-  # Enable purge in dry-run mode first
-  purge_enabled: true
-  purge_dry_run: true
-  purge_interval: 600  # Check every 10 minutes
-  max_pending_age: 1800  # 30 minutes
-  
-  # Enable health checks
-  health_check_enabled: true
-  health_check_interval: 60  # Check every minute
-```
-
-### 3. Monitor Logs
-Watch for:
-```bash
-# DLQ operations
-grep "dlq" vmpooler.log
-
-# Purge operations (dry-run)
-grep "purge.*dry-run" vmpooler.log
-
-# Health checks
-grep "health" vmpooler.log
-```
-
-### 4. Query Redis
-```bash
-# Check DLQ entries
-redis-cli ZCARD vmpooler__dlq__pending
-redis-cli ZRANGE vmpooler__dlq__pending 0 9
-
-# Check health status
-redis-cli HGETALL vmpooler__health
-```
-
-### 5. Deployment Plan
-1. **Dev Environment**:
-   - Enable all features with aggressive thresholds
-   - Monitor for 1 week
-   - Verify DLQ captures failures correctly
-   - Verify purge detects stale entries (dry-run)
-   - Verify health status is accurate
-
-2. **Staging Environment**:
-   - Enable DLQ and health checks
-   - Enable purge in dry-run mode
-   - Monitor for 1 week
-   - Review DLQ patterns
-   - Tune thresholds based on actual usage
-
-3. **Production Environment**:
-   - Enable DLQ and health checks
-   - Enable purge in dry-run mode initially
-   - Monitor for 2 weeks
-   - Verify no false positives
-   - Enable purge in live mode
-   - Set up alerting based on health metrics
-
-### 6. Testing Checklist
-- [ ] Run unit tests: `bundle exec rspec spec/unit/queue_reliability_spec.rb`
-- [ ] Run full test suite: `bundle exec rspec`
-- [ ] Start VMPooler with features enabled
-- [ ] Create a VM with invalid template → verify DLQ capture
-- [ ] Let VM sit in pending too long → verify purge detection (dry-run)
-- [ ] Query `vmpooler__health` → verify metrics present
-- [ ] Check Prometheus/Graphite → verify metrics pushed
-- [ ] Enable purge live mode → verify stale entries removed
-- [ ] Monitor logs for thread startup/health
-
-## Files Changed/Created
-
-### Modified Files:
-1. `/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler/lib/vmpooler/pool_manager.rb`
-   - Added ~350 lines of code
-   - 3 major features implemented
-   - Integrated into main execution loop
-
-### New Files:
-1. `/Users/mahima.singh/vmpooler-projects/Vmpooler/REDIS_QUEUE_RELIABILITY.md` (290 lines)
-2. `/Users/mahima.singh/vmpooler-projects/Vmpooler/QUEUE_RELIABILITY_OPERATOR_GUIDE.md` (600+ lines)
-3. `/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler.yml.example` (100+ lines)
-4. `/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler/spec/unit/queue_reliability_spec.rb` (500+ lines)
-
-## Backward Compatibility
-
-✅ **All features are opt-in** via configuration:
-- Default: All features disabled (`dlq_enabled: false`, `purge_enabled: false`, `health_check_enabled: false`)
-- Existing behavior unchanged when features are disabled
-- No breaking changes to existing code or APIs
-
-## Performance Impact
-
-**Expected**:
-- Redis memory: +1-5MB (depends on DLQ size)
-- CPU: +1-2% during purge/health check cycles
-- Network: Minimal (metric pushes only)
-
-**Mitigation**:
-- Background threads prevent blocking main pool operations
-- Configurable intervals allow tuning based on load
-- DLQ max entries limit prevents unbounded growth
-- Purge targets only stale entries (age-based)
-
-## Known Limitations
-
-1. **DLQ Querying**: Currently requires Redis CLI or custom tooling. Future: Add API endpoints for DLQ queries.
-2. **Purge Validation**: Does not check provider to confirm VM still exists before purging. Relies on age thresholds only.
-3. **Health Status**: Stored in Redis only, no persistent history. Consider exporting to time-series DB for trending.
-
-## Future Enhancements
-
-1. **API Endpoints**:
-   - `GET /api/v1/queue/dlq` - Query DLQ entries
-   - `GET /api/v1/queue/health` - Get health metrics
-   - `POST /api/v1/queue/purge` - Trigger manual purge (admin only)
-
-2. **Advanced Purge**:
-   - Provider validation before purging
-   - Purge on-demand requests that are too old
-   - Purge VMs without corresponding provider VM
-
-3. **Advanced Health**:
-   - Processing rate calculations (VMs/minute)
-   - Trend analysis (queue size over time)
-   - Predictive alerting (queue will hit threshold in X minutes)
-
-## Summary
-
-Successfully implemented comprehensive queue reliability features for VMPooler:
-- **DLQ**: Capture and track all failures
-- **Auto-Purge**: Automatically clean up stale entries
-- **Health Checks**: Monitor queue health and expose metrics
-
-All features are:
-- ✅ Fully implemented, with unit tests in place
-- ✅ Backward compatible (opt-in)
-- ✅ Well documented
-- ✅ Ready for testing in development environment
-
-Total lines of code added: ~1,500 lines (code + tests + docs)
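
Note on the deleted summary above: it describes DLQ entries kept in a sorted set scored by failure timestamp, with TTL expiry and a max-entries cap that trims the oldest entries. The sketch below illustrates that bookkeeping with a plain Ruby array standing in for the Redis sorted set. `InMemoryDlq` and its method names are assumptions made for illustration, not the `move_to_dlq` implementation; against Redis the two private steps would roughly correspond to `ZREMRANGEBYSCORE` and `ZREMRANGEBYRANK`.

```ruby
require 'json'
require 'time'

# Hypothetical in-memory stand-in for one DLQ sorted set, for example
# vmpooler__dlq__pending. Not the VMPooler implementation.
class InMemoryDlq
  def initialize(ttl_hours:, max_entries:)
    @ttl_seconds = ttl_hours * 3600
    @max_entries = max_entries
    @entries = [] # [score, member] pairs, kept sorted by score
  end

  # Adds a failure record scored by its failure time, then enforces
  # the TTL and the size cap.
  def add(vm:, pool:, error:, now: Time.now)
    record = { vm: vm, pool: pool, error_message: error,
               failed_at: now.utc.iso8601 }
    member = JSON.generate(record)
    @entries << [now.to_i, member]
    @entries.sort_by!(&:first)
    expire(now)
    trim
    member
  end

  def size
    @entries.size
  end

  def members
    @entries.map(&:last)
  end

  private

  # Drops entries whose score is older than the TTL cutoff.
  def expire(now)
    cutoff = now.to_i - @ttl_seconds
    @entries.reject! { |score, _member| score < cutoff }
  end

  # Keeps only the newest max_entries by removing the oldest first.
  def trim
    excess = @entries.size - @max_entries
    @entries.shift(excess) if excess.positive?
  end
end
```

Scoring by timestamp is what makes both limits cheap to express: age-based expiry is a range over scores, and the size cap is a range over ranks starting from the oldest entry.
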
diff --git a/QUEUE_RELIABILITY_OPERATOR_GUIDE.md b/QUEUE_RELIABILITY_OPERATOR_GUIDE.md
deleted file mode 100644
index 77f383f..0000000
--- a/QUEUE_RELIABILITY_OPERATOR_GUIDE.md
+++ /dev/null
@@ -1,444 +0,0 @@
-# Queue Reliability Features - Operator Guide
-
-## Overview
-
-This guide covers the Dead-Letter Queue (DLQ), Auto-Purge, and Health Check features added to VMPooler for improved queue reliability and observability.
-
-## Features
-
-### 1. Dead-Letter Queue (DLQ)
-
-The DLQ captures failed VM creation attempts and queue transitions, providing visibility into failures without losing data.
-
-**What gets captured:**
-- VMs that fail during clone operations
-- VMs that time out in the pending queue
-- VMs that become unreachable in ready queue
-- Any permanent errors (template not found, permission denied, etc.)
-
-**Benefits:**
-- Failed VMs are not lost - they're moved to DLQ for analysis
-- Complete failure context (error message, timestamp, retry count, request ID)
-- TTL-based expiration prevents unbounded growth
-- Size limiting prevents memory issues
-
-**Configuration:**
-```yaml
-:config:
-  dlq_enabled: true
-  dlq_ttl: 168  # hours (7 days)
-  dlq_max_entries: 10000  # per DLQ queue
-```
-
-**Querying DLQ via Redis CLI:**
-```bash
-# View all pending DLQ entries
-redis-cli ZRANGE vmpooler__dlq__pending 0 -1
-
-# View DLQ entries with scores (timestamps)
-redis-cli ZRANGE vmpooler__dlq__pending 0 -1 WITHSCORES
-
-# Get DLQ size
-redis-cli ZCARD vmpooler__dlq__pending
-
-# View recent failures (last 10)
-redis-cli ZREVRANGE vmpooler__dlq__clone 0 9
-
-# View entries older than 1 hour (score is a Unix timestamp in seconds; GNU date syntax, on macOS/BSD use `date -v-1H +%s`)
-redis-cli ZRANGEBYSCORE vmpooler__dlq__pending -inf $(date -d '1 hour ago' +%s)
-```
-
-**DLQ Keys:**
-- `vmpooler__dlq__pending` - Failed pending VMs
-- `vmpooler__dlq__clone` - Failed clone operations
-- `vmpooler__dlq__ready` - Failed ready queue VMs
-- `vmpooler__dlq__tasks` - Failed tasks
-
-**Entry Format:**
-Each DLQ entry contains:
-```json
-{
-  "vm": "pooler-happy-elephant",
-  "pool": "centos-7-x86_64",
-  "queue_from": "pending",
-  "error_class": "StandardError",
-  "error_message": "template centos-7-template does not exist",
-  "failed_at": "2024-01-15T10:30:00Z",
-  "retry_count": 3,
-  "request_id": "req-abc123",
-  "pool_alias": "centos-7"
-}
-```
-
-### 2. Auto-Purge
-
-Automatically removes stale entries from queues to prevent resource leaks and maintain queue health.
-
-**What gets purged:**
-- **Pending VMs**: Stuck in pending queue longer than `max_pending_age`
-- **Ready VMs**: Idle in ready queue longer than `max_ready_age`
-- **Completed VMs**: In completed queue longer than `max_completed_age`
-- **Orphaned Metadata**: VM metadata without corresponding queue entry
-
-**Benefits:**
-- Prevents queue bloat from stuck/forgotten VMs
-- Automatically cleans up after process crashes or bugs
-- Configurable thresholds per environment
-- Dry-run mode for safe testing
-
-**Configuration:**
-```yaml
-:config:
-  purge_enabled: true
-  purge_interval: 3600  # seconds (1 hour) - how often to run
-  purge_dry_run: false  # set to true to log but not purge
-  
-  # Age thresholds (in seconds)
-  max_pending_age: 7200   # 2 hours
-  max_ready_age: 86400    # 24 hours
-  max_completed_age: 3600 # 1 hour
-  max_orphaned_age: 86400 # 24 hours
-```
-
-**Testing Purge (Dry-Run Mode):**
-```yaml
-:config:
-  purge_enabled: true
-  purge_dry_run: true  # Logs what would be purged without actually purging
-  max_pending_age: 600  # Use shorter thresholds for testing
-```
-
-Watch logs for:
-```
-[*] [purge][dry-run] Would purge stale pending VM 'pooler-happy-elephant' (age: 3650s, max: 600s)
-```
-
-**Monitoring Purge:**
-Check logs for purge cycles:
-```
-[*] [purge] Starting stale queue entry purge cycle
-[!] [purge] Purged stale pending VM 'pooler-sad-dog' from 'centos-7-x86_64' (age: 7250s)
-[!] [purge] Moved stale ready VM 'pooler-angry-cat' from 'ubuntu-2004-x86_64' to completed (age: 90000s)
-[*] [purge] Completed purge cycle in 2.34s: 12 entries purged
-```
-
-### 3. Health Checks
-
-Monitors queue health and exposes metrics for alerting and dashboards.
-
-**What gets monitored:**
-- Queue sizes (pending, ready, completed)
-- Queue ages (oldest VM, average age)
-- Stuck VMs (VMs in pending queue longer than threshold)
-- DLQ size
-- Orphaned metadata count
-- Task queue sizes (clone, on-demand)
-- Overall health status (healthy/degraded/unhealthy)
-
-**Benefits:**
-- Proactive detection of queue issues
-- Metrics for alerting and dashboards
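-- Historical health tracking through the pushed metrics (Redis holds only the latest snapshot)
-- Health status readable from Redis (`vmpooler__health`); a dedicated API endpoint is not yet available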
-
-**Configuration:**
-```yaml
-:config:
-  health_check_enabled: true
-  health_check_interval: 300  # seconds (5 minutes)
-  
-  health_thresholds:
-    pending_queue_max: 100
-    ready_queue_max: 500
-    dlq_max_warning: 100
-    dlq_max_critical: 1000
-    stuck_vm_age_threshold: 7200  # 2 hours
-    stuck_vm_max_warning: 10
-    stuck_vm_max_critical: 50
-```
-
-**Health Status Levels:**
-- **Healthy**: All metrics within normal thresholds
-- **Degraded**: Some metrics elevated but functional (DLQ > warning, queue sizes elevated)
-- **Unhealthy**: Critical thresholds exceeded (DLQ > critical, many stuck VMs, queues backed up)
-
-**Viewing Health Status:**
-
-Via Redis:
-```bash
-# Get current health status
-redis-cli HGETALL vmpooler__health
-
-# Get specific health metric
-redis-cli HGET vmpooler__health status
-redis-cli HGET vmpooler__health last_check
-```
-
-Via Logs:
-```
-[*] [health] Status: HEALTHY | Queues: P=45 R=230 C=12 | DLQ=25 | Stuck=3 | Orphaned=5
-```
-
-**Exposed Metrics:**
-
-The following metrics are pushed to the metrics system (Prometheus, Graphite, etc.):
-
-```
-# Health status (0=healthy, 1=degraded, 2=unhealthy)
-vmpooler.health.status
-
-# Error metrics
-vmpooler.health.dlq.total_size
-vmpooler.health.stuck_vms.count
-vmpooler.health.orphaned_metadata.count
-
-# Per-pool queue metrics
-vmpooler.health.queue.<pool_name>.pending.size
-vmpooler.health.queue.<pool_name>.pending.oldest_age
-vmpooler.health.queue.<pool_name>.pending.stuck_count
-vmpooler.health.queue.<pool_name>.ready.size
-vmpooler.health.queue.<pool_name>.ready.oldest_age
-vmpooler.health.queue.<pool_name>.completed.size
-
-# DLQ metrics
-vmpooler.health.dlq.<queue_type>.size
-
-# Task metrics
-vmpooler.health.tasks.clone.active
-vmpooler.health.tasks.ondemand.active
-vmpooler.health.tasks.ondemand.pending
-```
-
-## Common Scenarios
-
-### Scenario 1: Investigating Failed VM Requests
-
-**Problem:** User reports VM request failed.
-
-**Steps:**
-1. Check DLQ for the request:
-   ```bash
-   redis-cli ZRANGE vmpooler__dlq__pending 0 -1 | grep "req-abc123"
-   redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | grep "req-abc123"
-   ```
-
-2. Parse the JSON entry to see failure details:
-   ```bash
-   redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | grep "req-abc123" | jq .
-   ```
-
-3. Common failure reasons:
-   - `template does not exist` - Template missing or renamed in provider
-   - `permission denied` - VMPooler lacks permissions to clone template
-   - `timeout` - VM failed to become ready within timeout period
-   - `failed to obtain IP` - Network/DHCP issue
-
-### Scenario 2: Queue Backup
-
-**Problem:** Pending queue growing, VMs not moving to ready.
-
-**Steps:**
-1. Check health status:
-   ```bash
-   redis-cli HGET vmpooler__health status
-   ```
-
-2. Check pending queue metrics:
-   ```bash
-   # View stuck VMs
-   redis-cli HGET vmpooler__health stuck_vm_count
-   
-   # Check oldest VM age
-   redis-cli SMEMBERS vmpooler__pending__centos-7-x86_64 | head -1 | xargs -I {} redis-cli HGET vmpooler__vm__{} clone
-   ```
-
-3. Check DLQ for recent failures:
-   ```bash
-   redis-cli ZREVRANGE vmpooler__dlq__clone 0 9
-   ```
-
-4. Common causes:
-   - Provider errors (vCenter unreachable, no resources)
-   - Network issues (can't reach VMs, no DHCP)
-   - Configuration issues (wrong template name, bad credentials)
-
-### Scenario 3: High DLQ Size
-
-**Problem:** DLQ size growing, indicating persistent failures.
-
-**Steps:**
-1. Check DLQ size:
-   ```bash
-   redis-cli ZCARD vmpooler__dlq__pending
-   redis-cli ZCARD vmpooler__dlq__clone
-   ```
-
-2. Identify common failure patterns:
-   ```bash
-   redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | jq -r '.error_message' | sort | uniq -c | sort -rn
-   ```
-
-3. Fix the underlying issues (missing template, permissions, network)
-
-4. Once the issues are resolved, existing DLQ entries expire after the TTL (default 7 days)
-
-### Scenario 4: Testing Configuration Changes
-
-**Problem:** Want to test new purge thresholds without affecting production.
-
-**Steps:**
-1. Enable dry-run mode:
-   ```yaml
-   :config:
-     purge_dry_run: true
-     max_pending_age: 3600  # Test with 1 hour
-   ```
-
-2. Monitor logs for purge detections:
-   ```bash
-   tail -f vmpooler.log | grep "purge.*dry-run"
-   ```
-
-3. Verify detection is correct
-
-4. Disable dry-run when ready:
-   ```yaml
-   :config:
-     purge_dry_run: false
-   ```
-
-### Scenario 5: Alerting on Queue Health
-
-**Problem:** Want to be notified when queues are unhealthy.
-
-**Steps:**
-1. Set up Prometheus alerts based on health metrics:
-   ```yaml
-   - alert: VMPoolerUnhealthy
-     expr: vmpooler_health_status >= 2
-     for: 10m
-     annotations:
-       summary: "VMPooler is unhealthy"
-   
-   - alert: VMPoolerHighDLQ
-     expr: vmpooler_health_dlq_total_size > 500
-     for: 30m
-     annotations:
-       summary: "VMPooler DLQ size is high"
-   
-   - alert: VMPoolerStuckVMs
-     expr: vmpooler_health_stuck_vms_count > 20
-     for: 15m
-     annotations:
-       summary: "Many VMs stuck in pending queue"
-   ```
-
-## Troubleshooting
-
-### DLQ Not Capturing Failures
-
-**Check:**
-1. Is DLQ enabled? Check config `dlq_enabled: true`
-2. Are failures actually occurring? Check logs for error messages
-3. Is Redis accessible? `redis-cli PING`
-
-### Purge Not Running
-
-**Check:**
-1. Is purge enabled? Check config `purge_enabled: true`
-2. Check logs for purge thread startup: `[*] [purge] Starting stale queue entry purge cycle`
-3. Is purge interval too long? Default is 1 hour
-4. Check thread status in logs: `[!] [queue_purge] worker thread died`
-
-### Health Check Not Updating
-
-**Check:**
-1. Is health check enabled? Check config `health_check_enabled: true`
-2. Check last update time: `redis-cli HGET vmpooler__health last_check`
-3. Check logs for health check runs: `[*] [health] Status:`
-4. Check thread status: `[!] [health_check] worker thread died`
-
-### Metrics Not Appearing
-
-**Check:**
-1. Is a metrics system configured? Check the `:statsd`, `:graphite`, or `:prometheus` config section
-2. Are metrics being sent? Check logs for metric sends
-3. Check firewall/network to metrics server
-4. Test metrics manually: `redis-cli HGETALL vmpooler__health`
-
-## Best Practices
-
-### Development/Testing Environments
-- Enable DLQ with shorter TTL (24-48 hours)
-- Enable purge with dry-run mode initially
-- Use aggressive purge thresholds (30min pending, 6hr ready)
-- Enable health checks with 1-minute interval
-- Monitor logs closely for issues
-
-### Production Environments
-- Enable DLQ with 7-day TTL
-- Enable purge after testing in dev
-- Use conservative purge thresholds (2hr pending, 24hr ready)
-- Enable health checks with 5-minute interval
-- Set up alerting based on health metrics
-- Monitor DLQ size and set alerts (>500 = investigate)
-
-### Capacity Planning
-- Monitor queue sizes during peak times
-- Adjust thresholds based on actual usage patterns
-- Review DLQ entries weekly for systemic issues
-- Track purge counts to identify resource leaks
-
-### Debugging
-- Keep DLQ TTL long enough for investigation (7+ days)
-- Use dry-run mode when testing threshold changes
-- Correlate DLQ entries with provider logs
-- Check health metrics before and after changes
-
-## Migration Guide
-
-### Enabling Features in Existing Deployment
-
-1. **Phase 1: Enable DLQ**
-   - Add DLQ config with conservative TTL
-   - Monitor DLQ size and entry patterns
-   - Verify no performance impact
-   - Adjust TTL as needed
-
-2. **Phase 2: Enable Health Checks**
-   - Add health check config
-   - Verify metrics are exposed
-   - Set up dashboards
-   - Configure alerting
-
-3. **Phase 3: Enable Purge (Dry-Run)**
-   - Add purge config with `purge_dry_run: true`
-   - Monitor logs for purge detections
-   - Verify thresholds are appropriate
-   - Adjust thresholds based on observations
-
-4. **Phase 4: Enable Purge (Live)**
-   - Set `purge_dry_run: false`
-   - Monitor queue sizes and purge counts
-   - Watch for unexpected VM removal
-   - Adjust thresholds if needed
-
-## Performance Considerations
-
-- **DLQ**: Minimal overhead, uses Redis sorted sets
-- **Purge**: Runs in background thread, iterates through queues
-- **Health Checks**: Lightweight, caches metrics between runs
-
-Expected impact:
-- Redis memory: +1-5MB for DLQ (depends on DLQ size)
-- CPU: +1-2% during purge/health check cycles
-- Network: Minimal, only metric pushes
-
-## Support
-
-For issues or questions:
-1. Check logs for error messages
-2. Review DLQ entries for failure patterns
-3. Check health status and metrics
-4. Open issue on GitHub with logs and config
-
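
Note on the deleted operator guide above: it defines three health levels driven by the `health_thresholds` settings. The sketch below shows one way those thresholds could be turned into a status. `classify_health` and the metric keys are hypothetical, and treating an oversized pending or ready queue as "degraded" rather than "unhealthy" is an assumption; the real `determine_health_status` may classify differently.

```ruby
# Default thresholds copied from the guide's example configuration.
DEFAULT_THRESHOLDS = {
  pending_queue_max: 100,
  ready_queue_max: 500,
  dlq_max_warning: 100,
  dlq_max_critical: 1000,
  stuck_vm_max_warning: 10,
  stuck_vm_max_critical: 50
}.freeze

# Gauge values documented for vmpooler.health.status.
STATUS_GAUGE = { healthy: 0, degraded: 1, unhealthy: 2 }.freeze

# Hypothetical classifier: critical thresholds win, then warning-level
# thresholds and queue sizes, otherwise the system counts as healthy.
def classify_health(metrics, thresholds = DEFAULT_THRESHOLDS)
  if metrics[:dlq_size] > thresholds[:dlq_max_critical] ||
     metrics[:stuck_vms] > thresholds[:stuck_vm_max_critical]
    return :unhealthy
  end

  elevated =
    metrics[:dlq_size] > thresholds[:dlq_max_warning] ||
    metrics[:stuck_vms] > thresholds[:stuck_vm_max_warning] ||
    metrics[:pending] > thresholds[:pending_queue_max] ||
    metrics[:ready] > thresholds[:ready_queue_max]

  elevated ? :degraded : :healthy
end
```

With the numbers from the guide's sample log line (P=45, R=230, DLQ=25, Stuck=3), every value sits below its threshold, which matches the HEALTHY status shown there.
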
diff --git a/REDIS_QUEUE_RELIABILITY.md b/REDIS_QUEUE_RELIABILITY.md
deleted file mode 100644
index a8f7afe..0000000
--- a/REDIS_QUEUE_RELIABILITY.md
+++ /dev/null
@@ -1,362 +0,0 @@
-# Redis Queue Reliability Features
-
-## Overview
-This document describes the implementation of dead-letter queues (DLQ), auto-purge mechanisms, and health checks for VMPooler Redis queues.
-
-## Background
-
-### Current Queue Structure
-VMPooler uses Redis sets and sorted sets for queue management:
-
-- **Pool Queues** (Sets): `vmpooler__pending__#{pool}`, `vmpooler__ready__#{pool}`, `vmpooler__running__#{pool}`, `vmpooler__completed__#{pool}`, `vmpooler__discovered__#{pool}`, `vmpooler__migrating__#{pool}`
-- **Task Queues** (Sorted Sets): `vmpooler__odcreate__task` (on-demand creation tasks), `vmpooler__provisioning__processing`
-- **Task Queues** (Sets): `vmpooler__tasks__disk`, `vmpooler__tasks__snapshot`, `vmpooler__tasks__snapshot-revert`
-- **VM Metadata** (Hashes): `vmpooler__vm__#{vm}` - contains clone time, IP, template, pool, domain, request_id, pool_alias, error details
-- **Request Metadata** (Hashes): `vmpooler__odrequest__#{request_id}` - contains status, retry_count, token info
-
-### Current Error Handling
-- Permanent errors (e.g., template not found) are detected in `_clone_vm` rescue block
-- Failed VMs are removed from pending queue
-- Request status is set to 'failed' and re-queue is prevented in outer `clone_vm` rescue block
-- VM metadata expires after data_ttl hours
-
-### Problem Areas
-1. **Lost visibility**: Failed VMs are removed from their queues, but nothing tracks them centrally
-2. **Stale data**: VMs stuck in queues due to process crashes or bugs
-3. **No monitoring**: No automated way to detect queue health issues
-4. **Manual cleanup**: Operators must manually identify and clean stale entries
-
-## Feature Requirements
-
-### 1. Dead-Letter Queue (DLQ)
-
-#### Purpose
-Capture failed VM creation requests for visibility, debugging, and potential retry/recovery.
-
-#### Design
-
-**DLQ Structure:**
-```
-vmpooler__dlq__pending       # Failed pending VMs (sorted set, scored by failure timestamp)
-vmpooler__dlq__clone         # Failed clone operations (sorted set)
-vmpooler__dlq__ready         # Failed ready queue VMs (sorted set)
-vmpooler__dlq__tasks         # Failed tasks (hash of task_type -> failed items)
-```
-
-**DLQ Entry Format:**
-```json
-{
-  "vm": "vm-name-abc123",
-  "pool": "pool-name",
-  "queue_from": "pending",
-  "error_class": "StandardError",
-  "error_message": "template does not exist",
-  "failed_at": "2024-01-15T10:30:00Z",
-  "retry_count": 3,
-  "request_id": "req-123456",
-  "pool_alias": "centos-7"
-}
-```
-
-**Configuration:**
-```yaml
-:config:
-  dlq_enabled: true
-  dlq_ttl: 168  # hours (7 days)
-  dlq_max_entries: 10000  # per DLQ queue
-```
-
-**Implementation Points:**
-- `fail_pending_vm`: Move to DLQ when VM fails during pending checks
-- `_clone_vm` rescue: Move to DLQ on clone failure
-- `_check_ready_vm`: Move to DLQ when ready VM becomes unreachable
-- `_destroy_vm` rescue: Log destroy failures to DLQ
-
-**Acceptance Criteria:**
-- [ ] Failed VMs are automatically moved to appropriate DLQ
-- [ ] DLQ entries contain complete failure context (error, timestamp, retry count)
-- [ ] DLQ entries expire after configurable TTL
-- [ ] DLQ size is limited to prevent unbounded growth
-- [ ] DLQ entries are queryable via Redis CLI or API
-
-### 2. Auto-Purge Mechanism
-
-#### Purpose
-Automatically remove stale entries from queues to prevent resource leaks and improve queue health.
-
-#### Design
-
-**Purge Targets:**
-1. **Pending VMs**: Stuck in pending > max_pending_age (e.g., 2 hours)
-2. **Ready VMs**: Idle in ready queue > max_ready_age (e.g., 24 hours for on-demand, 48 hours for pool)
-3. **Completed VMs**: In completed queue > max_completed_age (e.g., 1 hour)
-4. **Orphaned VM Metadata**: VM hash exists but VM not in any queue
-5. **Expired Requests**: On-demand requests > max_request_age (e.g., 24 hours)
-
-**Configuration:**
-```yaml
-:config:
-  purge_enabled: true
-  purge_interval: 3600  # seconds (1 hour)
-  max_pending_age: 7200  # seconds (2 hours)
-  max_ready_age: 86400  # seconds (24 hours)
-  max_completed_age: 3600  # seconds (1 hour)
-  max_orphaned_age: 86400  # seconds (24 hours)
-  max_request_age: 86400  # seconds (24 hours)
-  purge_dry_run: false  # if true, log what would be purged but don't purge
-```
-
-**Purge Process:**
-1. Scan each queue for stale entries (based on age thresholds)
-2. Check if VM still exists in provider (optional validation)
-3. Move stale entries to DLQ with reason
-4. Remove from original queue
-5. Log purge metrics
-
-**Implementation:**
-- New method: `purge_stale_queue_entries` - main purge loop
-- Helper methods: `check_pending_age`, `check_ready_age`, `check_completed_age`, `find_orphaned_metadata`
-- Scheduled task: Run every `purge_interval` seconds
-
-**Acceptance Criteria:**
-- [ ] Stale pending VMs are detected and moved to DLQ
-- [ ] Stale ready VMs are detected and moved to completed queue
-- [ ] Stale completed VMs are removed from queue
-- [ ] Orphaned VM metadata is detected and expired
-- [ ] Purge metrics are logged (count, age, reason)
-- [ ] Dry-run mode available for testing
-- [ ] Purge runs on configurable interval
-
-### 3. Health Checks
-
-#### Purpose
-Monitor Redis queue health and expose metrics for alerting and dashboards.
-
-#### Design
-
-**Health Metrics:**
-```ruby
-{
-  queues: {
-    pending: {
-      pool_name: {
-        size: 10,
-        oldest_age: 3600,  # seconds
-        avg_age: 1200,
-        stuck_count: 2  # VMs older than threshold
-      }
-    },
-    ready: { ... },
-    completed: { ... },
-    dlq: { ... }
-  },
-  tasks: {
-    clone: { active: 5, pending: 10 },
-    ondemand: { active: 2, pending: 5 }
-  },
-  processing_rate: {
-    clone_rate: 10.5,  # VMs per minute
-    destroy_rate: 8.2
-  },
-  errors: {
-    dlq_size: 150,
-    stuck_vm_count: 5,
-    orphaned_metadata_count: 12
-  },
-  status: "healthy|degraded|unhealthy"
-}
-```
-
-**Health Status Criteria:**
-- **Healthy**: All queues within normal thresholds, DLQ size < 100, no stuck VMs
-- **Degraded**: Some queues elevated but functional, DLQ size < 1000, few stuck VMs
-- **Unhealthy**: Queues critically backed up, DLQ size > 1000, many stuck VMs
-
-**Configuration:**
-```yaml
-:config:
-  health_check_enabled: true
-  health_check_interval: 300  # seconds (5 minutes)
-  health_thresholds:
-    pending_queue_max: 100
-    ready_queue_max: 500
-    dlq_max_warning: 100
-    dlq_max_critical: 1000
-    stuck_vm_age_threshold: 7200  # 2 hours
-    stuck_vm_max_warning: 10
-    stuck_vm_max_critical: 50
-```
-
-**Implementation:**
-- New method: `check_queue_health` - main health check
-- Helper methods: `calculate_queue_metrics`, `calculate_processing_rate`, `determine_health_status`
-- Expose via:
-  - Redis hash: `vmpooler__health` (for API consumption)
-  - Metrics: Push to existing $metrics system
-  - Logs: Periodic health summary in logs
-
-**Acceptance Criteria:**
-- [ ] Queue sizes are monitored per pool
-- [ ] Queue ages are calculated (oldest, average)
-- [ ] Stuck VMs are detected (age > threshold)
-- [ ] DLQ size is monitored
-- [ ] Processing rates are calculated
-- [ ] Overall health status is determined
-- [ ] Health metrics are exposed via Redis, metrics, and logs
-- [ ] Health check runs on configurable interval
-
-## Implementation Plan
-
-### Phase 1: Dead-Letter Queue
-1. Add DLQ configuration parsing
-2. Implement `move_to_dlq` helper method
-3. Update `fail_pending_vm` to use DLQ
-4. Update `_clone_vm` rescue block to use DLQ
-5. Update `_check_ready_vm` to use DLQ
-6. Add DLQ TTL enforcement
-7. Add DLQ size limiting
-8. Unit tests for DLQ operations
-
-### Phase 2: Auto-Purge
-1. Add purge configuration parsing
-2. Implement `purge_stale_queue_entries` main loop
-3. Implement age-checking helper methods
-4. Implement orphan detection
-5. Add purge metrics logging
-6. Add dry-run mode
-7. Unit tests for purge logic
-8. Integration test for full purge cycle
-
-### Phase 3: Health Checks
-1. Add health check configuration parsing
-2. Implement `check_queue_health` main method
-3. Implement metric calculation helpers
-4. Implement health status determination
-5. Expose metrics via Redis hash
-6. Expose metrics via $metrics system
-7. Add periodic health logging
-8. Unit tests for health check logic
-
-### Phase 4: Integration & Documentation
-1. Update configuration examples
-2. Update operator documentation
-3. Update API documentation (if exposing health endpoint)
-4. Add troubleshooting guide for DLQ/purge
-5. Create runbook for operators
-6. Update TESTING.md with DLQ/purge/health check testing
-
-## Migration & Rollout
-
-### Backward Compatibility
-- All features are opt-in via configuration
-- Default: `dlq_enabled: false`, `purge_enabled: false`, `health_check_enabled: false`
-- Existing behavior unchanged when features disabled
-
-### Rollout Strategy
-1. Deploy with features disabled
-2. Enable DLQ first, monitor for issues
-3. Enable health checks, validate metrics
-4. Enable auto-purge in dry-run mode, validate detection
-5. Enable auto-purge in live mode, monitor impact
-
-### Monitoring During Rollout
-- Monitor DLQ growth rate
-- Monitor purge counts and reasons
-- Monitor health status changes
-- Watch for unexpected VM removal
-- Check for performance impact (Redis load, memory)
-
-## Testing Strategy
-
-### Unit Tests
-- DLQ capture for various error scenarios
-- DLQ TTL enforcement
-- DLQ size limiting
-- Age calculation for purge detection
-- Orphan detection logic
-- Health metric calculations
-- Health status determination
-
-### Integration Tests
-- End-to-end VM failure → DLQ flow
-- End-to-end purge cycle
-- Health check with real queue data
-- DLQ + purge interaction (purge should respect DLQ entries)
-
-### Manual Testing
-1. Create VM with invalid template → verify DLQ entry
-2. Let VM sit in pending too long → verify purge detection
-3. Check health endpoint → verify metrics accuracy
-4. Run purge in dry-run → verify correct detection without deletion
-5. Run purge in live mode → verify stale entries removed
-
-## API Changes (Optional)
-
-If exposing to API:
-```
-GET /api/v1/queue/health
-Returns: Health metrics JSON
-
-GET /api/v1/queue/dlq?queue=pending&limit=50
-Returns: DLQ entries for specified queue
-
-POST /api/v1/queue/purge?dry_run=true
-Returns: Purge simulation results (admin only)
-```
-
-## Metrics
-
-New metrics to add:
-```
-vmpooler.dlq.pending.size
-vmpooler.dlq.clone.size
-vmpooler.dlq.ready.size
-vmpooler.dlq.tasks.size
-
-vmpooler.purge.pending.count
-vmpooler.purge.ready.count
-vmpooler.purge.completed.count
-vmpooler.purge.orphaned.count
-
-vmpooler.health.status  # 0=healthy, 1=degraded, 2=unhealthy
-vmpooler.health.stuck_vms.count
-vmpooler.health.queue.#{queue_name}.size
-vmpooler.health.queue.#{queue_name}.oldest_age
-```
-
-## Configuration Example
-
-```yaml
----
-:config:
-  # Existing config...
-  
-  # Dead-Letter Queue
-  dlq_enabled: true
-  dlq_ttl: 168  # hours (7 days)
-  dlq_max_entries: 10000
-  
-  # Auto-Purge
-  purge_enabled: true
-  purge_interval: 3600  # seconds (1 hour)
-  purge_dry_run: false
-  max_pending_age: 7200  # seconds (2 hours)
-  max_ready_age: 86400  # seconds (24 hours)
-  max_completed_age: 3600  # seconds (1 hour)
-  max_orphaned_age: 86400  # seconds (24 hours)
-  
-  # Health Checks
-  health_check_enabled: true
-  health_check_interval: 300  # seconds (5 minutes)
-  health_thresholds:
-    pending_queue_max: 100
-    ready_queue_max: 500
-    dlq_max_warning: 100
-    dlq_max_critical: 1000
-    stuck_vm_age_threshold: 7200  # 2 hours
-    stuck_vm_max_warning: 10
-    stuck_vm_max_critical: 50
-
-:redis:
-  # Existing redis config...
-```
diff --git a/lib/vmpooler/metrics/promstats.rb b/lib/vmpooler/metrics/promstats.rb
index f24f9b9..d0e1ab9 100644
--- a/lib/vmpooler/metrics/promstats.rb
+++ b/lib/vmpooler/metrics/promstats.rb
@@ -329,6 +329,30 @@ module Vmpooler
             buckets: REDIS_CONNECT_BUCKETS,
             docstring: 'vmpooler redis connection wait time',
             param_labels: %i[type provider]
+          },
+          vmpooler_health: {
+            mtype: M_GAUGE,
+            torun: %i[manager],
+            docstring: 'vmpooler health check metrics',
+            param_labels: %i[metric_path]
+          },
+          vmpooler_purge: {
+            mtype: M_GAUGE,
+            torun: %i[manager],
+            docstring: 'vmpooler purge metrics',
+            param_labels: %i[metric_path]
+          },
+          vmpooler_destroy: {
+            mtype: M_GAUGE,
+            torun: %i[manager],
+            docstring: 'vmpooler destroy metrics',
+            param_labels: %i[poolname]
+          },
+          vmpooler_clone: {
+            mtype: M_GAUGE,
+            torun: %i[manager],
+            docstring: 'vmpooler clone metrics',
+            param_labels: %i[poolname]
           }
         }
       end
diff --git a/lib/vmpooler/pool_manager.rb b/lib/vmpooler/pool_manager.rb
index a7f2ddd..e4f653d 100644
--- a/lib/vmpooler/pool_manager.rb
+++ b/lib/vmpooler/pool_manager.rb
@@ -200,11 +200,11 @@ module Vmpooler
             redis.hset("vmpooler__odrequest__#{request_id}", 'status', 'failed')
             redis.hset("vmpooler__odrequest__#{request_id}", 'failure_reason', failure_reason)
             $logger.log('s', "[!] [#{pool}] '#{vm}' permanently failed: #{failure_reason}")
-            $metrics.increment("errors.permanently_failed.#{pool}")
+            $metrics.increment("vmpooler_errors.permanently_failed.#{pool}")
           end
         end
       end
-      $metrics.increment("errors.markedasfailed.#{pool}")
+      $metrics.increment("vmpooler_errors.markedasfailed.#{pool}")
       open_socket_error || clone_error
     end
 
@@ -477,7 +477,7 @@ module Vmpooler
       ttl_seconds = dlq_ttl * 3600
       redis.expire(dlq_key, ttl_seconds)
 
-      $metrics.increment("dlq.#{queue_type}.count") unless skip_metrics
+      $metrics.increment("vmpooler_dlq.#{queue_type}.count") unless skip_metrics
       $logger.log('d', "[!] [dlq] Moved '#{vm}' from '#{queue_type}' queue to DLQ: #{error_message}")
     rescue StandardError => e
       $logger.log('s', "[!] [dlq] Failed to move '#{vm}' to DLQ: #{e}")
@@ -551,10 +551,10 @@ module Vmpooler
         hostname_retries += 1
 
         if !hostname_available
-          $metrics.increment("errors.duplicatehostname.#{pool_name}")
+          $metrics.increment("vmpooler_errors.duplicatehostname.#{pool_name}")
           $logger.log('s', "[!] [#{pool_name}] Generated hostname #{fqdn} was not unique (attempt \##{hostname_retries} of #{max_hostname_retries})")
         elsif !dns_available
-          $metrics.increment("errors.staledns.#{pool_name}")
+          $metrics.increment("vmpooler_errors.staledns.#{pool_name}")
           $logger.log('s', "[!] [#{pool_name}] Generated hostname #{fqdn} already exists in DNS records (#{dns_ip}), stale DNS")
         end
       end
@@ -600,7 +600,7 @@ module Vmpooler
           provider.create_vm(pool_name, new_vmname)
           finish = format('%<time>.2f', time: Time.now - start)
           $logger.log('s', "[+] [#{pool_name}] '#{new_vmname}' cloned in #{finish} seconds")
-          $metrics.timing("clone.#{pool_name}", finish)
+          $metrics.gauge("vmpooler_clone.#{pool_name}", finish)
 
           $logger.log('d', "[ ] [#{pool_name}] Obtaining IP for '#{new_vmname}'")
           ip_start = Time.now
@@ -714,7 +714,7 @@ module Vmpooler
 
           finish = format('%<time>.2f', time: Time.now - start)
           $logger.log('s', "[-] [#{pool}] '#{vm}' destroyed in #{finish} seconds")
-          $metrics.timing("destroy.#{pool}", finish)
+          $metrics.gauge("vmpooler_destroy.#{pool}", finish)
         end
       end
       dereference_mutex(vm)
@@ -809,8 +809,8 @@ module Vmpooler
 
             purge_duration = Time.now - purge_start
             $logger.log('s', "[*] [purge] Completed purge cycle in #{purge_duration.round(2)}s: #{total_purged} entries purged")
-            $metrics.timing('purge.cycle.duration', purge_duration)
-            $metrics.gauge('purge.total.count', total_purged)
+            $metrics.gauge('vmpooler_purge.cycle.duration', purge_duration)
+            $metrics.gauge('vmpooler_purge.total.count', total_purged)
           end
         rescue StandardError => e
           $logger.log('s', "[!] [purge] Failed during purge cycle: #{e}")
@@ -854,7 +854,7 @@ module Vmpooler
               end
 
               $logger.log('d', "[!] [purge] Purged stale pending VM '#{vm}' from '#{pool_name}' (age: #{age.round(0)}s)")
-              $metrics.increment("purge.pending.#{pool_name}.count")
+              $metrics.increment("vmpooler_purge.pending.#{pool_name}.count")
             end
           end
         rescue StandardError => e
@@ -884,7 +884,7 @@ module Vmpooler
             else
               redis.smove(queue_key, "vmpooler__completed__#{pool_name}", vm)
               $logger.log('d', "[!] [purge] Moved stale ready VM '#{vm}' from '#{pool_name}' to completed (age: #{age.round(0)}s)")
-              $metrics.increment("purge.ready.#{pool_name}.count")
+              $metrics.increment("vmpooler_purge.ready.#{pool_name}.count")
             end
             purged_count += 1
           end
@@ -920,7 +920,7 @@ module Vmpooler
             else
               redis.srem(queue_key, vm)
               $logger.log('d', "[!] [purge] Removed stale completed VM '#{vm}' from '#{pool_name}' (age: #{age.round(0)}s)")
-              $metrics.increment("purge.completed.#{pool_name}.count")
+              $metrics.increment("vmpooler_purge.completed.#{pool_name}.count")
             end
             purged_count += 1
           end
@@ -968,7 +968,7 @@ module Vmpooler
                 expiration_ttl = 3600 # 1 hour
                 redis.expire(vm_key, expiration_ttl)
                 $logger.log('d', "[!] [purge] Set expiration on orphaned metadata for '#{vm}' (age: #{age.round(0)}s)")
-                $metrics.increment('purge.orphaned.count')
+                $metrics.increment('vmpooler_purge.orphaned.count')
               end
               purged_count += 1
             end
@@ -1017,7 +1017,9 @@ module Vmpooler
             health_status = determine_health_status(health_metrics)
 
             # Store health metrics in Redis for API consumption
-            redis.hmset('vmpooler__health', *health_metrics.to_a.flatten)
+            # Convert nested hash to JSON for storage
+            require 'json'
+            redis.hset('vmpooler__health', 'metrics', health_metrics.to_json)
             redis.hset('vmpooler__health', 'status', health_status)
             redis.hset('vmpooler__health', 'last_check', Time.now.iso8601)
             redis.expire('vmpooler__health', 3600) # Expire after 1 hour
@@ -1029,7 +1031,7 @@ module Vmpooler
             push_health_metrics(health_metrics, health_status)
 
             health_duration = Time.now - health_start
-            $metrics.timing('health.check.duration', health_duration)
+            $metrics.gauge('vmpooler_health.check.duration', health_duration)
           end
         rescue StandardError => e
           $logger.log('s', "[!] [health] Failed during health check: #{e}")
@@ -1252,37 +1254,37 @@ module Vmpooler
 
     def push_health_metrics(metrics, status)
       # Push error metrics first
-      $metrics.gauge('health.dlq.total_size', metrics['errors']['dlq_total_size'])
-      $metrics.gauge('health.stuck_vms.count', metrics['errors']['stuck_vm_count'])
-      $metrics.gauge('health.orphaned_metadata.count', metrics['errors']['orphaned_metadata_count'])
+      $metrics.gauge('vmpooler_health.dlq.total_size', metrics['errors']['dlq_total_size'])
+      $metrics.gauge('vmpooler_health.stuck_vms.count', metrics['errors']['stuck_vm_count'])
+      $metrics.gauge('vmpooler_health.orphaned_metadata.count', metrics['errors']['orphaned_metadata_count'])
 
       # Push per-pool queue metrics
       metrics['queues'].each do |pool_name, queues|
         next if pool_name == 'dlq'
 
-        $metrics.gauge("health.queue.#{pool_name}.pending.size", queues['pending']['size'])
-        $metrics.gauge("health.queue.#{pool_name}.pending.oldest_age", queues['pending']['oldest_age'])
-        $metrics.gauge("health.queue.#{pool_name}.pending.stuck_count", queues['pending']['stuck_count'])
+        $metrics.gauge("vmpooler_health.queue.#{pool_name}.pending.size", queues['pending']['size'])
+        $metrics.gauge("vmpooler_health.queue.#{pool_name}.pending.oldest_age", queues['pending']['oldest_age'])
+        $metrics.gauge("vmpooler_health.queue.#{pool_name}.pending.stuck_count", queues['pending']['stuck_count'])
 
-        $metrics.gauge("health.queue.#{pool_name}.ready.size", queues['ready']['size'])
-        $metrics.gauge("health.queue.#{pool_name}.ready.oldest_age", queues['ready']['oldest_age'])
+        $metrics.gauge("vmpooler_health.queue.#{pool_name}.ready.size", queues['ready']['size'])
+        $metrics.gauge("vmpooler_health.queue.#{pool_name}.ready.oldest_age", queues['ready']['oldest_age'])
 
-        $metrics.gauge("health.queue.#{pool_name}.completed.size", queues['completed']['size'])
+        $metrics.gauge("vmpooler_health.queue.#{pool_name}.completed.size", queues['completed']['size'])
       end
 
       # Push DLQ metrics
       metrics['queues']['dlq']&.each do |queue_type, dlq_metrics|
-        $metrics.gauge("health.dlq.#{queue_type}.size", dlq_metrics['size'])
+        $metrics.gauge("vmpooler_health.dlq.#{queue_type}.size", dlq_metrics['size'])
       end
 
       # Push task metrics
-      $metrics.gauge('health.tasks.clone.active', metrics['tasks']['clone']['active'])
-      $metrics.gauge('health.tasks.ondemand.active', metrics['tasks']['ondemand']['active'])
-      $metrics.gauge('health.tasks.ondemand.pending', metrics['tasks']['ondemand']['pending'])
+      $metrics.gauge('vmpooler_health.tasks.clone.active', metrics['tasks']['clone']['active'])
+      $metrics.gauge('vmpooler_health.tasks.ondemand.active', metrics['tasks']['ondemand']['active'])
+      $metrics.gauge('vmpooler_health.tasks.ondemand.pending', metrics['tasks']['ondemand']['pending'])
 
       # Push status last (0=healthy, 1=degraded, 2=unhealthy)
       status_value = { 'healthy' => 0, 'degraded' => 1, 'unhealthy' => 2 }[status] || 2
-      $metrics.gauge('health.status', status_value)
+      $metrics.gauge('vmpooler_health.status', status_value)
     end
 
     def create_vm_disk(pool_name, vm, disk_size, provider)
@@ -2244,6 +2246,15 @@ module Vmpooler
         redis.zrem('vmpooler__provisioning__request', request_id)
         return
       end
+
+      # Check if request was already marked as failed (e.g., by delete endpoint)
+      request_status = redis.hget("vmpooler__odrequest__#{request_id}", 'status')
+      if request_status == 'failed'
+        $logger.log('s', "Request '#{request_id}' already marked as failed, skipping VM creation")
+        redis.zrem('vmpooler__provisioning__request', request_id)
+        return
+      end
+
       score = redis.zscore('vmpooler__provisioning__request', request_id)
       requested = requested.split(',')
 
diff --git a/spec/unit/pool_manager_spec.rb b/spec/unit/pool_manager_spec.rb
index abe5555..1b2ccef 100644
--- a/spec/unit/pool_manager_spec.rb
+++ b/spec/unit/pool_manager_spec.rb
@@ -1107,7 +1107,8 @@ EOT
     context 'with no errors during cloning' do
       before(:each) do
         allow(metrics).to receive(:timing)
-        expect(metrics).to receive(:timing).with(/clone\./,/0/)
+        allow(metrics).to receive(:gauge)
+        expect(metrics).to receive(:gauge).with(/vmpooler_clone\./,/0/)
         expect(provider).to receive(:create_vm).with(pool, String)
         allow(provider).to receive(:get_vm_ip_address).and_return(1)
         allow(subject).to receive(:get_domain_for_pool).and_return('example.com')
@@ -1158,7 +1159,8 @@ EOT
     context 'with a failure to get ip address after cloning' do
       it 'should log a message that it completed being cloned' do
         allow(metrics).to receive(:timing)
-        expect(metrics).to receive(:timing).with(/clone\./,/0/)
+        allow(metrics).to receive(:gauge)
+        expect(metrics).to receive(:gauge).with(/vmpooler_clone\./,/0/)
         expect(provider).to receive(:create_vm).with(pool, String)
         allow(provider).to receive(:get_vm_ip_address).and_return(nil)
 
@@ -1217,7 +1219,8 @@ EOT
     context 'with request_id' do
       before(:each) do
         allow(metrics).to receive(:timing)
-        expect(metrics).to receive(:timing).with(/clone\./,/0/)
+        allow(metrics).to receive(:gauge)
+        expect(metrics).to receive(:gauge).with(/vmpooler_clone\./,/0/)
         expect(provider).to receive(:create_vm).with(pool, String)
         allow(provider).to receive(:get_vm_ip_address).with(vm,pool).and_return(1)
         allow(subject).to receive(:get_dns_plugin_class_name_for_pool).and_return(dns_plugin)
@@ -1255,7 +1258,7 @@ EOT
         resolv = class_double("Resolv").as_stubbed_const(:transfer_nested_constants => true)
         expect(subject).to receive(:generate_and_check_hostname).exactly(3).times.and_return([vm_name, true]) #skip this, make it available all times
         expect(resolv).to receive(:getaddress).exactly(3).times.and_return("1.2.3.4")
-        expect(metrics).to receive(:increment).with("errors.staledns.#{pool}").exactly(3).times
+        expect(metrics).to receive(:increment).with("vmpooler_errors.staledns.#{pool}").exactly(3).times
         expect{subject._clone_vm(pool,provider,dns_plugin)}.to raise_error(/Unable to generate a unique hostname after/)
       end
       it 'should be successful if DNS does not exist' do
@@ -1353,7 +1356,8 @@ EOT
       it 'should emit a timing metric' do
         allow(subject).to receive(:get_vm_usage_labels)
         allow(metrics).to receive(:timing)
-        expect(metrics).to receive(:timing).with("destroy.#{pool}", String)
+        allow(metrics).to receive(:gauge)
+        expect(metrics).to receive(:gauge).with("vmpooler_destroy.#{pool}", String)
 
         subject._destroy_vm(vm,pool,provider,dns_plugin)
       end
@@ -5174,6 +5178,44 @@ EOT
       end
     end
 
+    context 'when request is already marked as failed' do
+      let(:request_string) { "#{pool}:#{pool}:1" }
+      before(:each) do
+        redis_connection_pool.with do |redis|
+          create_ondemand_request_for_test(request_id, current_time.to_i, request_string, redis)
+          set_ondemand_request_status(request_id, 'failed', redis)
+        end
+      end
+
+      it 'logs that the request is already failed' do
+        redis_connection_pool.with do |redis|
+          expect(logger).to receive(:log).with('s', "Request '#{request_id}' already marked as failed, skipping VM creation")
+          subject.create_ondemand_vms(request_id, redis)
+        end
+      end
+
+      it 'removes the request from provisioning__request queue' do
+        redis_connection_pool.with do |redis|
+          subject.create_ondemand_vms(request_id, redis)
+          expect(redis.zscore('vmpooler__provisioning__request', request_id)).to be_nil
+        end
+      end
+
+      it 'does not create VM tasks' do
+        redis_connection_pool.with do |redis|
+          subject.create_ondemand_vms(request_id, redis)
+          expect(redis.zcard('vmpooler__odcreate__task')).to eq(0)
+        end
+      end
+
+      it 'does not add to provisioning__processing queue' do
+        redis_connection_pool.with do |redis|
+          subject.create_ondemand_vms(request_id, redis)
+          expect(redis.zscore('vmpooler__provisioning__processing', request_id)).to be_nil
+        end
+      end
+    end
+
     context 'with a request that has data' do
       let(:request_string) { "#{pool}:#{pool}:1" }
       before(:each) do
diff --git a/spec/unit/queue_reliability_spec.rb b/spec/unit/queue_reliability_spec.rb
index db895ae..fe95548 100644
--- a/spec/unit/queue_reliability_spec.rb
+++ b/spec/unit/queue_reliability_spec.rb
@@ -119,7 +119,7 @@ describe 'Vmpooler::PoolManager - Queue Reliability Features' do
 
         it 'increments DLQ metrics' do
           redis_connection_pool.with do |redis_connection|
-            expect(metrics).to receive(:increment).with('dlq.pending.count')
+            expect(metrics).to receive(:increment).with('vmpooler_dlq.pending.count')
             
             subject.move_to_dlq(vm, pool, 'pending', error_class, error_message, redis_connection)
           end
@@ -223,7 +223,7 @@ describe 'Vmpooler::PoolManager - Queue Reliability Features' do
 
         it 'increments purge metrics' do
           redis_connection_pool.with do |redis_connection|
-            expect(metrics).to receive(:increment).with("purge.pending.#{pool}.count")
+            expect(metrics).to receive(:increment).with("vmpooler_purge.pending.#{pool}.count")
             
             subject.purge_pending_queue(pool, redis_connection)
           end
@@ -460,35 +460,35 @@ describe 'Vmpooler::PoolManager - Queue Reliability Features' do
 
       it 'pushes status metric' do
         allow(metrics).to receive(:gauge)
-        expect(metrics).to receive(:gauge).with('health.status', 0)
+        expect(metrics).to receive(:gauge).with('vmpooler_health.status', 0)
         
         subject.push_health_metrics(metrics_data, 'healthy')
       end
 
       it 'pushes error metrics' do
         allow(metrics).to receive(:gauge)
-        expect(metrics).to receive(:gauge).with('health.dlq.total_size', 25)
-        expect(metrics).to receive(:gauge).with('health.stuck_vms.count', 2)
-        expect(metrics).to receive(:gauge).with('health.orphaned_metadata.count', 3)
+        expect(metrics).to receive(:gauge).with('vmpooler_health.dlq.total_size', 25)
+        expect(metrics).to receive(:gauge).with('vmpooler_health.stuck_vms.count', 2)
+        expect(metrics).to receive(:gauge).with('vmpooler_health.orphaned_metadata.count', 3)
         
         subject.push_health_metrics(metrics_data, 'healthy')
       end
 
       it 'pushes per-pool queue metrics' do
         allow(metrics).to receive(:gauge)
-        expect(metrics).to receive(:gauge).with('health.queue.test-pool.pending.size', 10)
-        expect(metrics).to receive(:gauge).with('health.queue.test-pool.pending.oldest_age', 3600)
-        expect(metrics).to receive(:gauge).with('health.queue.test-pool.pending.stuck_count', 2)
-        expect(metrics).to receive(:gauge).with('health.queue.test-pool.ready.size', 50)
+        expect(metrics).to receive(:gauge).with('vmpooler_health.queue.test-pool.pending.size', 10)
+        expect(metrics).to receive(:gauge).with('vmpooler_health.queue.test-pool.pending.oldest_age', 3600)
+        expect(metrics).to receive(:gauge).with('vmpooler_health.queue.test-pool.pending.stuck_count', 2)
+        expect(metrics).to receive(:gauge).with('vmpooler_health.queue.test-pool.ready.size', 50)
         
         subject.push_health_metrics(metrics_data, 'healthy')
       end
 
       it 'pushes task metrics' do
         allow(metrics).to receive(:gauge)
-        expect(metrics).to receive(:gauge).with('health.tasks.clone.active', 3)
-        expect(metrics).to receive(:gauge).with('health.tasks.ondemand.active', 2)
-        expect(metrics).to receive(:gauge).with('health.tasks.ondemand.pending', 5)
+        expect(metrics).to receive(:gauge).with('vmpooler_health.tasks.clone.active', 3)
+        expect(metrics).to receive(:gauge).with('vmpooler_health.tasks.ondemand.active', 2)
+        expect(metrics).to receive(:gauge).with('vmpooler_health.tasks.ondemand.pending', 5)
         
         subject.push_health_metrics(metrics_data, 'healthy')
       end