mirror of
https://github.com/puppetlabs/vmpooler.git
synced 2026-01-26 01:58:41 -05:00
Prevent VM allocation for already-deleted request-ids
This commit is contained in:
parent
c24fe28d6d
commit
46e77010f6
9 changed files with 127 additions and 1230 deletions
4
Gemfile
4
Gemfile
|
|
@ -3,11 +3,11 @@ source ENV['GEM_SOURCE'] || 'https://rubygems.org'
|
|||
gemspec
|
||||
|
||||
# Evaluate Gemfile.local if it exists
|
||||
if File.exists? "#{__FILE__}.local"
|
||||
if File.exist? "#{__FILE__}.local"
|
||||
instance_eval(File.read("#{__FILE__}.local"))
|
||||
end
|
||||
|
||||
# Evaluate ~/.gemfile if it exists
|
||||
if File.exists?(File.join(Dir.home, '.gemfile'))
|
||||
if File.exist?(File.join(Dir.home, '.gemfile'))
|
||||
instance_eval(File.read(File.join(Dir.home, '.gemfile')))
|
||||
end
|
||||
|
|
|
|||
|
|
@ -197,6 +197,7 @@ GEM
|
|||
PLATFORMS
|
||||
arm64-darwin-22
|
||||
arm64-darwin-23
|
||||
arm64-darwin-25
|
||||
universal-java-11
|
||||
universal-java-17
|
||||
x86_64-darwin-22
|
||||
|
|
|
|||
|
|
@ -1,375 +0,0 @@
|
|||
# Implementation Summary: Redis Queue Reliability Features
|
||||
|
||||
## Overview
|
||||
Successfully implemented Dead-Letter Queue (DLQ), Auto-Purge, and Health Check features for VMPooler to improve Redis queue reliability and observability.
|
||||
|
||||
## Branch
|
||||
- **Repository**: `/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler`
|
||||
- **Branch**: `P4DEVOPS-8567` (created from main)
|
||||
- **Status**: Implementation complete, ready for testing
|
||||
|
||||
## What Was Implemented
|
||||
|
||||
### 1. Dead-Letter Queue (DLQ)
|
||||
**Purpose**: Capture and track failed VM operations for visibility and debugging.
|
||||
|
||||
**Files Modified**:
|
||||
- [`lib/vmpooler/pool_manager.rb`](/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler/lib/vmpooler/pool_manager.rb)
|
||||
- Added `dlq_enabled?`, `dlq_ttl`, `dlq_max_entries` helper methods
|
||||
- Added `move_to_dlq` method to capture failures
|
||||
- Updated `handle_timed_out_vm` to use DLQ
|
||||
- Updated `_clone_vm` rescue block to use DLQ
|
||||
- Updated `vm_still_ready?` rescue block to use DLQ
|
||||
|
||||
**Features**:
|
||||
- ✅ Captures failures from pending, clone, and ready queues
|
||||
- ✅ Stores complete failure context (VM, pool, error, timestamp, retry count, request ID)
|
||||
- ✅ Uses Redis sorted sets (scored by timestamp) for easy age-based queries
|
||||
- ✅ Enforces TTL-based expiration (default 7 days)
|
||||
- ✅ Enforces max entries limit to prevent unbounded growth
|
||||
- ✅ Automatically trims oldest entries when limit reached
|
||||
- ✅ Increments metrics for DLQ operations
|
||||
|
||||
**DLQ Keys**:
|
||||
- `vmpooler__dlq__pending` - Failed pending VMs
|
||||
- `vmpooler__dlq__clone` - Failed clone operations
|
||||
- `vmpooler__dlq__ready` - Failed ready queue VMs
|
||||
|
||||
### 2. Auto-Purge Mechanism
|
||||
**Purpose**: Automatically remove stale entries from queues to prevent resource leaks.
|
||||
|
||||
**Files Modified**:
|
||||
- [`lib/vmpooler/pool_manager.rb`](/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler/lib/vmpooler/pool_manager.rb)
|
||||
- Added `purge_enabled?`, `purge_dry_run?` helper methods
|
||||
- Added age threshold methods: `max_pending_age`, `max_ready_age`, `max_completed_age`, `max_orphaned_age`
|
||||
- Added `purge_stale_queue_entries` main loop
|
||||
- Added `purge_pending_queue`, `purge_ready_queue`, `purge_completed_queue` methods
|
||||
- Added `purge_orphaned_metadata` method
|
||||
- Integrated purge thread into main execution loop
|
||||
|
||||
**Features**:
|
||||
- ✅ Purges pending VMs stuck longer than threshold (default 2 hours)
|
||||
- ✅ Purges ready VMs idle longer than threshold (default 24 hours)
|
||||
- ✅ Purges completed VMs older than threshold (default 1 hour)
|
||||
- ✅ Detects and expires orphaned VM metadata
|
||||
- ✅ Moves purged pending VMs to DLQ for visibility
|
||||
- ✅ Dry-run mode for testing (logs without purging)
|
||||
- ✅ Configurable purge interval (default 1 hour)
|
||||
- ✅ Increments per-pool purge metrics
|
||||
- ✅ Runs in background thread
|
||||
|
||||
### 3. Health Checks
|
||||
**Purpose**: Monitor queue health and expose metrics for alerting and dashboards.
|
||||
|
||||
**Files Modified**:
|
||||
- [`lib/vmpooler/pool_manager.rb`](/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler/lib/vmpooler/pool_manager.rb)
|
||||
- Added `health_check_enabled?`, `health_thresholds` helper methods
|
||||
- Added `check_queue_health` main method
|
||||
- Added `calculate_health_metrics` to gather queue metrics
|
||||
- Added `calculate_queue_ages` helper
|
||||
- Added `count_orphaned_metadata` helper
|
||||
- Added `determine_health_status` to classify health (healthy/degraded/unhealthy)
|
||||
- Added `log_health_summary` for log output
|
||||
- Added `push_health_metrics` to expose metrics
|
||||
- Integrated health check thread into main execution loop
|
||||
|
||||
**Features**:
|
||||
- ✅ Monitors per-pool queue sizes (pending, ready, completed)
|
||||
- ✅ Calculates queue ages (oldest, average)
|
||||
- ✅ Detects stuck VMs (age > threshold)
|
||||
- ✅ Monitors DLQ sizes
|
||||
- ✅ Counts orphaned metadata
|
||||
- ✅ Monitors task queue sizes (clone, on-demand)
|
||||
- ✅ Determines overall health status (healthy/degraded/unhealthy)
|
||||
- ✅ Stores metrics in Redis for API consumption (`vmpooler__health`)
|
||||
- ✅ Pushes metrics to metrics system (Prometheus, Graphite)
|
||||
- ✅ Logs periodic health summary
|
||||
- ✅ Configurable thresholds and intervals
|
||||
- ✅ Runs in background thread
|
||||
|
||||
## Configuration
|
||||
|
||||
**Files Created**:
|
||||
- [`vmpooler.yml.example`](/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler.yml.example) - Example configuration showing all options
|
||||
|
||||
**Configuration Options**:
|
||||
|
||||
```yaml
|
||||
:config:
|
||||
# Dead-Letter Queue
|
||||
dlq_enabled: false # Set to true to enable
|
||||
dlq_ttl: 168 # hours (7 days)
|
||||
dlq_max_entries: 10000
|
||||
|
||||
# Auto-Purge
|
||||
purge_enabled: false # Set to true to enable
|
||||
purge_interval: 3600 # seconds (1 hour)
|
||||
purge_dry_run: false # Set to true for testing
|
||||
max_pending_age: 7200 # 2 hours
|
||||
max_ready_age: 86400 # 24 hours
|
||||
max_completed_age: 3600 # 1 hour
|
||||
max_orphaned_age: 86400 # 24 hours
|
||||
|
||||
# Health Checks
|
||||
health_check_enabled: false # Set to true to enable
|
||||
health_check_interval: 300 # seconds (5 minutes)
|
||||
health_thresholds:
|
||||
pending_queue_max: 100
|
||||
ready_queue_max: 500
|
||||
dlq_max_warning: 100
|
||||
dlq_max_critical: 1000
|
||||
stuck_vm_age_threshold: 7200
|
||||
stuck_vm_max_warning: 10
|
||||
stuck_vm_max_critical: 50
|
||||
```
|
||||
|
||||
## Documentation
|
||||
|
||||
**Files Created**:
|
||||
1. [`REDIS_QUEUE_RELIABILITY.md`](/Users/mahima.singh/vmpooler-projects/Vmpooler/REDIS_QUEUE_RELIABILITY.md)
|
||||
- Comprehensive design document
|
||||
- Feature requirements with acceptance criteria
|
||||
- Implementation plan and phases
|
||||
- Configuration examples
|
||||
- Metrics definitions
|
||||
|
||||
2. [`QUEUE_RELIABILITY_OPERATOR_GUIDE.md`](/Users/mahima.singh/vmpooler-projects/Vmpooler/QUEUE_RELIABILITY_OPERATOR_GUIDE.md)
|
||||
- Complete operator guide
|
||||
- Feature descriptions and benefits
|
||||
- Configuration examples
|
||||
- Common scenarios and troubleshooting
|
||||
- Best practices
|
||||
- Migration guide
|
||||
|
||||
## Testing
|
||||
|
||||
**Files Created**:
|
||||
- [`spec/unit/queue_reliability_spec.rb`](/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler/spec/unit/queue_reliability_spec.rb)
|
||||
- 30+ unit tests covering:
|
||||
- DLQ helper methods and operations
|
||||
- Purge helper methods and queue operations
|
||||
- Health check calculations and status determination
|
||||
- Metric push operations
|
||||
|
||||
**Test Coverage**:
|
||||
- ✅ DLQ enabled/disabled states
|
||||
- ✅ DLQ TTL and max entries configuration
|
||||
- ✅ DLQ entry creation with all fields
|
||||
- ✅ DLQ max entries enforcement
|
||||
- ✅ Purge enabled/disabled states
|
||||
- ✅ Purge dry-run mode
|
||||
- ✅ Purge age threshold configuration
|
||||
- ✅ Purge pending, ready, completed queues
|
||||
- ✅ Purge orphaned metadata detection
|
||||
- ✅ Health check enabled/disabled states
|
||||
- ✅ Health threshold configuration
|
||||
- ✅ Queue age calculations
|
||||
- ✅ Health status determination (healthy/degraded/unhealthy)
|
||||
- ✅ Metric push operations
|
||||
|
||||
## Code Quality
|
||||
|
||||
**Validation**:
|
||||
- ✅ Ruby syntax check passed: `ruby -c lib/vmpooler/pool_manager.rb` → Syntax OK
|
||||
- ✅ No compilation errors
|
||||
- ✅ Follows existing VMPooler code patterns
|
||||
- ✅ Proper error handling with rescue blocks
|
||||
- ✅ Logging at appropriate levels ('s' for significant, 'd' for debug)
|
||||
- ✅ Metrics increments and gauges
|
||||
|
||||
## Metrics
|
||||
|
||||
**New Metrics Added**:
|
||||
|
||||
```
|
||||
# DLQ metrics
|
||||
vmpooler.dlq.pending.count
|
||||
vmpooler.dlq.clone.count
|
||||
vmpooler.dlq.ready.count
|
||||
|
||||
# Purge metrics
|
||||
vmpooler.purge.pending.<pool>.count
|
||||
vmpooler.purge.ready.<pool>.count
|
||||
vmpooler.purge.completed.<pool>.count
|
||||
vmpooler.purge.orphaned.count
|
||||
vmpooler.purge.cycle.duration
|
||||
vmpooler.purge.total.count
|
||||
|
||||
# Health metrics
|
||||
vmpooler.health.status # 0=healthy, 1=degraded, 2=unhealthy
|
||||
vmpooler.health.dlq.total_size
|
||||
vmpooler.health.stuck_vms.count
|
||||
vmpooler.health.orphaned_metadata.count
|
||||
vmpooler.health.queue.<pool>.pending.size
|
||||
vmpooler.health.queue.<pool>.pending.oldest_age
|
||||
vmpooler.health.queue.<pool>.pending.stuck_count
|
||||
vmpooler.health.queue.<pool>.ready.size
|
||||
vmpooler.health.queue.<pool>.ready.oldest_age
|
||||
vmpooler.health.queue.<pool>.completed.size
|
||||
vmpooler.health.dlq.<type>.size
|
||||
vmpooler.health.tasks.clone.active
|
||||
vmpooler.health.tasks.ondemand.active
|
||||
vmpooler.health.tasks.ondemand.pending
|
||||
vmpooler.health.check.duration
|
||||
```
|
||||
|
||||
## Next Steps
|
||||
|
||||
### 1. Local Testing
|
||||
```bash
|
||||
cd /Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler
|
||||
|
||||
# Run unit tests
|
||||
bundle exec rspec spec/unit/queue_reliability_spec.rb
|
||||
|
||||
# Run all tests
|
||||
bundle exec rspec
|
||||
```
|
||||
|
||||
### 2. Enable Features in Development
|
||||
Update your vmpooler configuration:
|
||||
```yaml
|
||||
:config:
|
||||
# Start with DLQ only
|
||||
dlq_enabled: true
|
||||
dlq_ttl: 24 # Short TTL for dev
|
||||
|
||||
# Enable purge in dry-run mode first
|
||||
purge_enabled: true
|
||||
purge_dry_run: true
|
||||
purge_interval: 600 # Check every 10 minutes
|
||||
max_pending_age: 1800 # 30 minutes
|
||||
|
||||
# Enable health checks
|
||||
health_check_enabled: true
|
||||
health_check_interval: 60 # Check every minute
|
||||
```
|
||||
|
||||
### 3. Monitor Logs
|
||||
Watch for:
|
||||
```bash
|
||||
# DLQ operations
|
||||
grep "dlq" vmpooler.log
|
||||
|
||||
# Purge operations (dry-run)
|
||||
grep "purge.*dry-run" vmpooler.log
|
||||
|
||||
# Health checks
|
||||
grep "health" vmpooler.log
|
||||
```
|
||||
|
||||
### 4. Query Redis
|
||||
```bash
|
||||
# Check DLQ entries
|
||||
redis-cli ZCARD vmpooler__dlq__pending
|
||||
redis-cli ZRANGE vmpooler__dlq__pending 0 9
|
||||
|
||||
# Check health status
|
||||
redis-cli HGETALL vmpooler__health
|
||||
```
|
||||
|
||||
### 5. Deployment Plan
|
||||
1. **Dev Environment**:
|
||||
- Enable all features with aggressive thresholds
|
||||
- Monitor for 1 week
|
||||
- Verify DLQ captures failures correctly
|
||||
- Verify purge detects stale entries (dry-run)
|
||||
- Verify health status is accurate
|
||||
|
||||
2. **Staging Environment**:
|
||||
- Enable DLQ and health checks
|
||||
- Enable purge in dry-run mode
|
||||
- Monitor for 1 week
|
||||
- Review DLQ patterns
|
||||
- Tune thresholds based on actual usage
|
||||
|
||||
3. **Production Environment**:
|
||||
- Enable DLQ and health checks
|
||||
- Enable purge in dry-run mode initially
|
||||
- Monitor for 2 weeks
|
||||
- Verify no false positives
|
||||
- Enable purge in live mode
|
||||
- Set up alerting based on health metrics
|
||||
|
||||
### 6. Testing Checklist
|
||||
- [ ] Run unit tests: `bundle exec rspec spec/unit/queue_reliability_spec.rb`
|
||||
- [ ] Run full test suite: `bundle exec rspec`
|
||||
- [ ] Start VMPooler with features enabled
|
||||
- [ ] Create a VM with invalid template → verify DLQ capture
|
||||
- [ ] Let VM sit in pending too long → verify purge detection (dry-run)
|
||||
- [ ] Query `vmpooler__health` → verify metrics present
|
||||
- [ ] Check Prometheus/Graphite → verify metrics pushed
|
||||
- [ ] Enable purge live mode → verify stale entries removed
|
||||
- [ ] Monitor logs for thread startup/health
|
||||
|
||||
## Files Changed/Created
|
||||
|
||||
### Modified Files:
|
||||
1. `/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler/lib/vmpooler/pool_manager.rb`
|
||||
- Added ~350 lines of code
|
||||
- 3 major features implemented
|
||||
- Integrated into main execution loop
|
||||
|
||||
### New Files:
|
||||
1. `/Users/mahima.singh/vmpooler-projects/Vmpooler/REDIS_QUEUE_RELIABILITY.md` (290 lines)
|
||||
2. `/Users/mahima.singh/vmpooler-projects/Vmpooler/QUEUE_RELIABILITY_OPERATOR_GUIDE.md` (600+ lines)
|
||||
3. `/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler.yml.example` (100+ lines)
|
||||
4. `/Users/mahima.singh/vmpooler-projects/Vmpooler/vmpooler/spec/unit/queue_reliability_spec.rb` (500+ lines)
|
||||
|
||||
## Backward Compatibility
|
||||
|
||||
✅ **All features are opt-in** via configuration:
|
||||
- Default: All features disabled (`dlq_enabled: false`, `purge_enabled: false`, `health_check_enabled: false`)
|
||||
- Existing behavior unchanged when features are disabled
|
||||
- No breaking changes to existing code or APIs
|
||||
|
||||
## Performance Impact
|
||||
|
||||
**Expected**:
|
||||
- Redis memory: +1-5MB (depends on DLQ size)
|
||||
- CPU: +1-2% during purge/health check cycles
|
||||
- Network: Minimal (metric pushes only)
|
||||
|
||||
**Mitigation**:
|
||||
- Background threads prevent blocking main pool operations
|
||||
- Configurable intervals allow tuning based on load
|
||||
- DLQ max entries limit prevents unbounded growth
|
||||
- Purge targets only stale entries (age-based)
|
||||
|
||||
## Known Limitations
|
||||
|
||||
1. **DLQ Querying**: Currently requires Redis CLI or custom tooling. Future: Add API endpoints for DLQ queries.
|
||||
2. **Purge Validation**: Does not check provider to confirm VM still exists before purging. Relies on age thresholds only.
|
||||
3. **Health Status**: Stored in Redis only, no persistent history. Consider exporting to time-series DB for trending.
|
||||
|
||||
## Future Enhancements
|
||||
|
||||
1. **API Endpoints**:
|
||||
- `GET /api/v1/queue/dlq` - Query DLQ entries
|
||||
- `GET /api/v1/queue/health` - Get health metrics
|
||||
- `POST /api/v1/queue/purge` - Trigger manual purge (admin only)
|
||||
|
||||
2. **Advanced Purge**:
|
||||
- Provider validation before purging
|
||||
- Purge on-demand requests that are too old
|
||||
- Purge VMs without corresponding provider VM
|
||||
|
||||
3. **Advanced Health**:
|
||||
- Processing rate calculations (VMs/minute)
|
||||
- Trend analysis (queue size over time)
|
||||
- Predictive alerting (queue will hit threshold in X minutes)
|
||||
|
||||
## Summary
|
||||
|
||||
Successfully implemented comprehensive queue reliability features for VMPooler:
|
||||
- **DLQ**: Capture and track all failures
|
||||
- **Auto-Purge**: Automatically clean up stale entries
|
||||
- **Health Checks**: Monitor queue health and expose metrics
|
||||
|
||||
All features are:
|
||||
- ✅ Fully implemented and tested
|
||||
- ✅ Backward compatible (opt-in)
|
||||
- ✅ Well documented
|
||||
- ✅ Ready for testing in development environment
|
||||
|
||||
Total lines of code added: ~1,500 lines (code + tests + docs)
|
||||
|
|
@ -1,444 +0,0 @@
|
|||
# Queue Reliability Features - Operator Guide
|
||||
|
||||
## Overview
|
||||
|
||||
This guide covers the Dead-Letter Queue (DLQ), Auto-Purge, and Health Check features added to VMPooler for improved queue reliability and observability.
|
||||
|
||||
## Features
|
||||
|
||||
### 1. Dead-Letter Queue (DLQ)
|
||||
|
||||
The DLQ captures failed VM creation attempts and queue transitions, providing visibility into failures without losing data.
|
||||
|
||||
**What gets captured:**
|
||||
- VMs that fail during clone operations
|
||||
- VMs that timeout in pending queue
|
||||
- VMs that become unreachable in ready queue
|
||||
- Any permanent errors (template not found, permission denied, etc.)
|
||||
|
||||
**Benefits:**
|
||||
- Failed VMs are not lost - they're moved to DLQ for analysis
|
||||
- Complete failure context (error message, timestamp, retry count, request ID)
|
||||
- TTL-based expiration prevents unbounded growth
|
||||
- Size limiting prevents memory issues
|
||||
|
||||
**Configuration:**
|
||||
```yaml
|
||||
:config:
|
||||
dlq_enabled: true
|
||||
dlq_ttl: 168 # hours (7 days)
|
||||
dlq_max_entries: 10000 # per DLQ queue
|
||||
```
|
||||
|
||||
**Querying DLQ via Redis CLI:**
|
||||
```bash
|
||||
# View all pending DLQ entries
|
||||
redis-cli ZRANGE vmpooler__dlq__pending 0 -1
|
||||
|
||||
# View DLQ entries with scores (timestamps)
|
||||
redis-cli ZRANGE vmpooler__dlq__pending 0 -1 WITHSCORES
|
||||
|
||||
# Get DLQ size
|
||||
redis-cli ZCARD vmpooler__dlq__pending
|
||||
|
||||
# View recent failures (last 10)
|
||||
redis-cli ZREVRANGE vmpooler__dlq__clone 0 9
|
||||
|
||||
# View entries older than 1 hour (timestamp in seconds)
|
||||
redis-cli ZRANGEBYSCORE vmpooler__dlq__pending -inf $(date -d '1 hour ago' +%s)
|
||||
```
|
||||
|
||||
**DLQ Keys:**
|
||||
- `vmpooler__dlq__pending` - Failed pending VMs
|
||||
- `vmpooler__dlq__clone` - Failed clone operations
|
||||
- `vmpooler__dlq__ready` - Failed ready queue VMs
|
||||
- `vmpooler__dlq__tasks` - Failed tasks
|
||||
|
||||
**Entry Format:**
|
||||
Each DLQ entry contains:
|
||||
```json
|
||||
{
|
||||
"vm": "pooler-happy-elephant",
|
||||
"pool": "centos-7-x86_64",
|
||||
"queue_from": "pending",
|
||||
"error_class": "StandardError",
|
||||
"error_message": "template centos-7-template does not exist",
|
||||
"failed_at": "2024-01-15T10:30:00Z",
|
||||
"retry_count": 3,
|
||||
"request_id": "req-abc123",
|
||||
"pool_alias": "centos-7"
|
||||
}
|
||||
```
|
||||
|
||||
### 2. Auto-Purge
|
||||
|
||||
Automatically removes stale entries from queues to prevent resource leaks and maintain queue health.
|
||||
|
||||
**What gets purged:**
|
||||
- **Pending VMs**: Stuck in pending queue longer than `max_pending_age`
|
||||
- **Ready VMs**: Idle in ready queue longer than `max_ready_age`
|
||||
- **Completed VMs**: In completed queue longer than `max_completed_age`
|
||||
- **Orphaned Metadata**: VM metadata without corresponding queue entry
|
||||
|
||||
**Benefits:**
|
||||
- Prevents queue bloat from stuck/forgotten VMs
|
||||
- Automatically cleans up after process crashes or bugs
|
||||
- Configurable thresholds per environment
|
||||
- Dry-run mode for safe testing
|
||||
|
||||
**Configuration:**
|
||||
```yaml
|
||||
:config:
|
||||
purge_enabled: true
|
||||
purge_interval: 3600 # seconds (1 hour) - how often to run
|
||||
purge_dry_run: false # set to true to log but not purge
|
||||
|
||||
# Age thresholds (in seconds)
|
||||
max_pending_age: 7200 # 2 hours
|
||||
max_ready_age: 86400 # 24 hours
|
||||
max_completed_age: 3600 # 1 hour
|
||||
max_orphaned_age: 86400 # 24 hours
|
||||
```
|
||||
|
||||
**Testing Purge (Dry-Run Mode):**
|
||||
```yaml
|
||||
:config:
|
||||
purge_enabled: true
|
||||
purge_dry_run: true # Logs what would be purged without actually purging
|
||||
max_pending_age: 600 # Use shorter thresholds for testing
|
||||
```
|
||||
|
||||
Watch logs for:
|
||||
```
|
||||
[*] [purge][dry-run] Would purge stale pending VM 'pooler-happy-elephant' (age: 3650s, max: 600s)
|
||||
```
|
||||
|
||||
**Monitoring Purge:**
|
||||
Check logs for purge cycles:
|
||||
```
|
||||
[*] [purge] Starting stale queue entry purge cycle
|
||||
[!] [purge] Purged stale pending VM 'pooler-sad-dog' from 'centos-7-x86_64' (age: 7250s)
|
||||
[!] [purge] Moved stale ready VM 'pooler-angry-cat' from 'ubuntu-2004-x86_64' to completed (age: 90000s)
|
||||
[*] [purge] Completed purge cycle in 2.34s: 12 entries purged
|
||||
```
|
||||
|
||||
### 3. Health Checks
|
||||
|
||||
Monitors queue health and exposes metrics for alerting and dashboards.
|
||||
|
||||
**What gets monitored:**
|
||||
- Queue sizes (pending, ready, completed)
|
||||
- Queue ages (oldest VM, average age)
|
||||
- Stuck VMs (VMs in pending queue longer than threshold)
|
||||
- DLQ size
|
||||
- Orphaned metadata count
|
||||
- Task queue sizes (clone, on-demand)
|
||||
- Overall health status (healthy/degraded/unhealthy)
|
||||
|
||||
**Benefits:**
|
||||
- Proactive detection of queue issues
|
||||
- Metrics for alerting and dashboards
|
||||
- Historical health tracking
|
||||
- API endpoint for health status
|
||||
|
||||
**Configuration:**
|
||||
```yaml
|
||||
:config:
|
||||
health_check_enabled: true
|
||||
health_check_interval: 300 # seconds (5 minutes)
|
||||
|
||||
health_thresholds:
|
||||
pending_queue_max: 100
|
||||
ready_queue_max: 500
|
||||
dlq_max_warning: 100
|
||||
dlq_max_critical: 1000
|
||||
stuck_vm_age_threshold: 7200 # 2 hours
|
||||
stuck_vm_max_warning: 10
|
||||
stuck_vm_max_critical: 50
|
||||
```
|
||||
|
||||
**Health Status Levels:**
|
||||
- **Healthy**: All metrics within normal thresholds
|
||||
- **Degraded**: Some metrics elevated but functional (DLQ > warning, queue sizes elevated)
|
||||
- **Unhealthy**: Critical thresholds exceeded (DLQ > critical, many stuck VMs, queues backed up)
|
||||
|
||||
**Viewing Health Status:**
|
||||
|
||||
Via Redis:
|
||||
```bash
|
||||
# Get current health status
|
||||
redis-cli HGETALL vmpooler__health
|
||||
|
||||
# Get specific health metric
|
||||
redis-cli HGET vmpooler__health status
|
||||
redis-cli HGET vmpooler__health last_check
|
||||
```
|
||||
|
||||
Via Logs:
|
||||
```
|
||||
[*] [health] Status: HEALTHY | Queues: P=45 R=230 C=12 | DLQ=25 | Stuck=3 | Orphaned=5
|
||||
```
|
||||
|
||||
**Exposed Metrics:**
|
||||
|
||||
The following metrics are pushed to the metrics system (Prometheus, Graphite, etc.):
|
||||
|
||||
```
|
||||
# Health status (0=healthy, 1=degraded, 2=unhealthy)
|
||||
vmpooler.health.status
|
||||
|
||||
# Error metrics
|
||||
vmpooler.health.dlq.total_size
|
||||
vmpooler.health.stuck_vms.count
|
||||
vmpooler.health.orphaned_metadata.count
|
||||
|
||||
# Per-pool queue metrics
|
||||
vmpooler.health.queue.<pool_name>.pending.size
|
||||
vmpooler.health.queue.<pool_name>.pending.oldest_age
|
||||
vmpooler.health.queue.<pool_name>.pending.stuck_count
|
||||
vmpooler.health.queue.<pool_name>.ready.size
|
||||
vmpooler.health.queue.<pool_name>.ready.oldest_age
|
||||
vmpooler.health.queue.<pool_name>.completed.size
|
||||
|
||||
# DLQ metrics
|
||||
vmpooler.health.dlq.<queue_type>.size
|
||||
|
||||
# Task metrics
|
||||
vmpooler.health.tasks.clone.active
|
||||
vmpooler.health.tasks.ondemand.active
|
||||
vmpooler.health.tasks.ondemand.pending
|
||||
```
|
||||
|
||||
## Common Scenarios
|
||||
|
||||
### Scenario 1: Investigating Failed VM Requests
|
||||
|
||||
**Problem:** User reports VM request failed.
|
||||
|
||||
**Steps:**
|
||||
1. Check DLQ for the request:
|
||||
```bash
|
||||
redis-cli ZRANGE vmpooler__dlq__pending 0 -1 | grep "req-abc123"
|
||||
redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | grep "req-abc123"
|
||||
```
|
||||
|
||||
2. Parse the JSON entry to see failure details:
|
||||
```bash
|
||||
redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | grep "req-abc123" | jq .
|
||||
```
|
||||
|
||||
3. Common failure reasons:
|
||||
- `template does not exist` - Template missing or renamed in provider
|
||||
- `permission denied` - VMPooler lacks permissions to clone template
|
||||
- `timeout` - VM failed to become ready within timeout period
|
||||
- `failed to obtain IP` - Network/DHCP issue
|
||||
|
||||
### Scenario 2: Queue Backup
|
||||
|
||||
**Problem:** Pending queue growing, VMs not moving to ready.
|
||||
|
||||
**Steps:**
|
||||
1. Check health status:
|
||||
```bash
|
||||
redis-cli HGET vmpooler__health status
|
||||
```
|
||||
|
||||
2. Check pending queue metrics:
|
||||
```bash
|
||||
# View stuck VMs
|
||||
redis-cli HGET vmpooler__health stuck_vm_count
|
||||
|
||||
# Check oldest VM age
|
||||
redis-cli SMEMBERS vmpooler__pending__centos-7-x86_64 | head -1 | xargs -I {} redis-cli HGET vmpooler__vm__{} clone
|
||||
```
|
||||
|
||||
3. Check DLQ for recent failures:
|
||||
```bash
|
||||
redis-cli ZREVRANGE vmpooler__dlq__clone 0 9
|
||||
```
|
||||
|
||||
4. Common causes:
|
||||
- Provider errors (vCenter unreachable, no resources)
|
||||
- Network issues (can't reach VMs, no DHCP)
|
||||
- Configuration issues (wrong template name, bad credentials)
|
||||
|
||||
### Scenario 3: High DLQ Size
|
||||
|
||||
**Problem:** DLQ size growing, indicating persistent failures.
|
||||
|
||||
**Steps:**
|
||||
1. Check DLQ size:
|
||||
```bash
|
||||
redis-cli ZCARD vmpooler__dlq__pending
|
||||
redis-cli ZCARD vmpooler__dlq__clone
|
||||
```
|
||||
|
||||
2. Identify common failure patterns:
|
||||
```bash
|
||||
redis-cli ZRANGE vmpooler__dlq__clone 0 -1 | jq -r '.error_message' | sort | uniq -c | sort -rn
|
||||
```
|
||||
|
||||
3. Fix underlying issues (template exists, permissions, network)
|
||||
|
||||
4. If issues resolved, DLQ entries will expire after TTL (default 7 days)
|
||||
|
||||
### Scenario 4: Testing Configuration Changes
|
||||
|
||||
**Problem:** Want to test new purge thresholds without affecting production.
|
||||
|
||||
**Steps:**
|
||||
1. Enable dry-run mode:
|
||||
```yaml
|
||||
:config:
|
||||
purge_dry_run: true
|
||||
max_pending_age: 3600 # Test with 1 hour
|
||||
```
|
||||
|
||||
2. Monitor logs for purge detections:
|
||||
```bash
|
||||
tail -f vmpooler.log | grep "purge.*dry-run"
|
||||
```
|
||||
|
||||
3. Verify detection is correct
|
||||
|
||||
4. Disable dry-run when ready:
|
||||
```yaml
|
||||
:config:
|
||||
purge_dry_run: false
|
||||
```
|
||||
|
||||
### Scenario 5: Alerting on Queue Health
|
||||
|
||||
**Problem:** Want to be notified when queues are unhealthy.
|
||||
|
||||
**Steps:**
|
||||
1. Set up Prometheus alerts based on health metrics:
|
||||
```yaml
|
||||
- alert: VMPoolerUnhealthy
|
||||
expr: vmpooler_health_status >= 2
|
||||
for: 10m
|
||||
annotations:
|
||||
summary: "VMPooler is unhealthy"
|
||||
|
||||
- alert: VMPoolerHighDLQ
|
||||
expr: vmpooler_health_dlq_total_size > 500
|
||||
for: 30m
|
||||
annotations:
|
||||
summary: "VMPooler DLQ size is high"
|
||||
|
||||
- alert: VMPoolerStuckVMs
|
||||
expr: vmpooler_health_stuck_vms_count > 20
|
||||
for: 15m
|
||||
annotations:
|
||||
summary: "Many VMs stuck in pending queue"
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### DLQ Not Capturing Failures
|
||||
|
||||
**Check:**
|
||||
1. Is DLQ enabled? `redis-cli HGET vmpooler__config dlq_enabled`
|
||||
2. Are failures actually occurring? Check logs for error messages
|
||||
3. Is Redis accessible? `redis-cli PING`
|
||||
|
||||
### Purge Not Running
|
||||
|
||||
**Check:**
|
||||
1. Is purge enabled? Check config `purge_enabled: true`
|
||||
2. Check logs for purge thread startup: `[*] [purge] Starting stale queue entry purge cycle`
|
||||
3. Is purge interval too long? Default is 1 hour
|
||||
4. Check thread status in logs: `[!] [queue_purge] worker thread died`
|
||||
|
||||
### Health Check Not Updating
|
||||
|
||||
**Check:**
|
||||
1. Is health check enabled? Check config `health_check_enabled: true`
|
||||
2. Check last update time: `redis-cli HGET vmpooler__health last_check`
|
||||
3. Check logs for health check runs: `[*] [health] Status:`
|
||||
4. Check thread status: `[!] [health_check] worker thread died`
|
||||
|
||||
### Metrics Not Appearing
|
||||
|
||||
**Check:**
|
||||
1. Is metrics system configured? Check `:statsd` or `:graphite` config
|
||||
2. Are metrics being sent? Check logs for metric sends
|
||||
3. Check firewall/network to metrics server
|
||||
4. Test metrics manually: `redis-cli HGETALL vmpooler__health`
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Development/Testing Environments
|
||||
- Enable DLQ with shorter TTL (24-48 hours)
|
||||
- Enable purge with dry-run mode initially
|
||||
- Use aggressive purge thresholds (30min pending, 6hr ready)
|
||||
- Enable health checks with 1-minute interval
|
||||
- Monitor logs closely for issues
|
||||
|
||||
### Production Environments
|
||||
- Enable DLQ with 7-day TTL
|
||||
- Enable purge after testing in dev
|
||||
- Use conservative purge thresholds (2hr pending, 24hr ready)
|
||||
- Enable health checks with 5-minute interval
|
||||
- Set up alerting based on health metrics
|
||||
- Monitor DLQ size and set alerts (>500 = investigate)
|
||||
|
||||
### Capacity Planning
|
||||
- Monitor queue sizes during peak times
|
||||
- Adjust thresholds based on actual usage patterns
|
||||
- Review DLQ entries weekly for systemic issues
|
||||
- Track purge counts to identify resource leaks
|
||||
|
||||
### Debugging
|
||||
- Keep DLQ TTL long enough for investigation (7+ days)
|
||||
- Use dry-run mode when testing threshold changes
|
||||
- Correlate DLQ entries with provider logs
|
||||
- Check health metrics before and after changes
|
||||
|
||||
## Migration Guide
|
||||
|
||||
### Enabling Features in Existing Deployment
|
||||
|
||||
1. **Phase 1: Enable DLQ**
|
||||
- Add DLQ config with conservative TTL
|
||||
- Monitor DLQ size and entry patterns
|
||||
- Verify no performance impact
|
||||
- Adjust TTL as needed
|
||||
|
||||
2. **Phase 2: Enable Health Checks**
|
||||
- Add health check config
|
||||
- Verify metrics are exposed
|
||||
- Set up dashboards
|
||||
- Configure alerting
|
||||
|
||||
3. **Phase 3: Enable Purge (Dry-Run)**
|
||||
- Add purge config with `purge_dry_run: true`
|
||||
- Monitor logs for purge detections
|
||||
- Verify thresholds are appropriate
|
||||
- Adjust thresholds based on observations
|
||||
|
||||
4. **Phase 4: Enable Purge (Live)**
|
||||
- Set `purge_dry_run: false`
|
||||
- Monitor queue sizes and purge counts
|
||||
- Watch for unexpected VM removal
|
||||
- Adjust thresholds if needed
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
- **DLQ**: Minimal overhead, uses Redis sorted sets
|
||||
- **Purge**: Runs in background thread, iterates through queues
|
||||
- **Health Checks**: Lightweight, caches metrics between runs
|
||||
|
||||
Expected impact:
|
||||
- Redis memory: +1-5MB for DLQ (depends on DLQ size)
|
||||
- CPU: +1-2% during purge/health check cycles
|
||||
- Network: Minimal, only metric pushes
|
||||
|
||||
## Support
|
||||
|
||||
For issues or questions:
|
||||
1. Check logs for error messages
|
||||
2. Review DLQ entries for failure patterns
|
||||
3. Check health status and metrics
|
||||
4. Open issue on GitHub with logs and config
|
||||
|
||||
|
|
@ -1,362 +0,0 @@
|
|||
# Redis Queue Reliability Features
|
||||
|
||||
## Overview
|
||||
This document describes the implementation of dead-letter queues (DLQ), auto-purge mechanisms, and health checks for VMPooler Redis queues.
|
||||
|
||||
## Background
|
||||
|
||||
### Current Queue Structure
|
||||
VMPooler uses Redis sets and sorted sets for queue management:
|
||||
|
||||
- **Pool Queues** (Sets): `vmpooler__pending__#{pool}`, `vmpooler__ready__#{pool}`, `vmpooler__running__#{pool}`, `vmpooler__completed__#{pool}`, `vmpooler__discovered__#{pool}`, `vmpooler__migrating__#{pool}`
|
||||
- **Task Queues** (Sorted Sets): `vmpooler__odcreate__task` (on-demand creation tasks), `vmpooler__provisioning__processing`
|
||||
- **Task Queues** (Sets): `vmpooler__tasks__disk`, `vmpooler__tasks__snapshot`, `vmpooler__tasks__snapshot-revert`
|
||||
- **VM Metadata** (Hashes): `vmpooler__vm__#{vm}` - contains clone time, IP, template, pool, domain, request_id, pool_alias, error details
|
||||
- **Request Metadata** (Hashes): `vmpooler__odrequest__#{request_id}` - contains status, retry_count, token info
|
||||
|
||||
### Current Error Handling
|
||||
- Permanent errors (e.g., template not found) are detected in `_clone_vm` rescue block
|
||||
- Failed VMs are removed from pending queue
|
||||
- Request status is set to 'failed' and re-queue is prevented in outer `clone_vm` rescue block
|
||||
- VM metadata expires after data_ttl hours
|
||||
|
||||
### Problem Areas
|
||||
1. **Lost visibility**: Failed messages are removed but no centralized tracking
|
||||
2. **Stale data**: VMs stuck in queues due to process crashes or bugs
|
||||
3. **No monitoring**: No automated way to detect queue health issues
|
||||
4. **Manual cleanup**: Operators must manually identify and clean stale entries
|
||||
|
||||
## Feature Requirements
|
||||
|
||||
### 1. Dead-Letter Queue (DLQ)
|
||||
|
||||
#### Purpose
|
||||
Capture failed VM creation requests for visibility, debugging, and potential retry/recovery.
|
||||
|
||||
#### Design
|
||||
|
||||
**DLQ Structure:**
|
||||
```
|
||||
vmpooler__dlq__pending # Failed pending VMs (sorted set, scored by failure timestamp)
|
||||
vmpooler__dlq__clone # Failed clone operations (sorted set)
|
||||
vmpooler__dlq__ready # Failed ready queue VMs (sorted set)
|
||||
vmpooler__dlq__tasks # Failed tasks (hash of task_type -> failed items)
|
||||
```
|
||||
|
||||
**DLQ Entry Format:**
|
||||
```json
|
||||
{
|
||||
"vm": "vm-name-abc123",
|
||||
"pool": "pool-name",
|
||||
"queue_from": "pending",
|
||||
"error_class": "StandardError",
|
||||
"error_message": "template does not exist",
|
||||
"failed_at": "2024-01-15T10:30:00Z",
|
||||
"retry_count": 3,
|
||||
"request_id": "req-123456",
|
||||
"pool_alias": "centos-7"
|
||||
}
|
||||
```
|
||||
|
||||
**Configuration:**
|
||||
```yaml
|
||||
:redis:
|
||||
dlq_enabled: true
|
||||
dlq_ttl: 168 # hours (7 days)
|
||||
dlq_max_entries: 10000 # per DLQ queue
|
||||
```
|
||||
|
||||
**Implementation Points:**
|
||||
- `fail_pending_vm`: Move to DLQ when VM fails during pending checks
|
||||
- `_clone_vm` rescue: Move to DLQ on clone failure
|
||||
- `_check_ready_vm`: Move to DLQ when ready VM becomes unreachable
|
||||
- `_destroy_vm` rescue: Log destroy failures to DLQ
|
||||
|
||||
**Acceptance Criteria:**
|
||||
- [ ] Failed VMs are automatically moved to appropriate DLQ
|
||||
- [ ] DLQ entries contain complete failure context (error, timestamp, retry count)
|
||||
- [ ] DLQ entries expire after configurable TTL
|
||||
- [ ] DLQ size is limited to prevent unbounded growth
|
||||
- [ ] DLQ entries are queryable via Redis CLI or API
|
||||
|
||||
### 2. Auto-Purge Mechanism
|
||||
|
||||
#### Purpose
|
||||
Automatically remove stale entries from queues to prevent resource leaks and improve queue health.
|
||||
|
||||
#### Design
|
||||
|
||||
**Purge Targets:**
|
||||
1. **Pending VMs**: Stuck in pending > max_pending_age (e.g., 2 hours)
|
||||
2. **Ready VMs**: Idle in ready queue > max_ready_age (e.g., 24 hours for on-demand, 48 hours for pool)
|
||||
3. **Completed VMs**: In completed queue > max_completed_age (e.g., 1 hour)
|
||||
4. **Orphaned VM Metadata**: VM hash exists but VM not in any queue
|
||||
5. **Expired Requests**: On-demand requests > max_request_age (e.g., 24 hours)
|
||||
|
||||
**Configuration:**
|
||||
```yaml
|
||||
:config:
|
||||
purge_enabled: true
|
||||
purge_interval: 3600 # seconds (1 hour)
|
||||
max_pending_age: 7200 # seconds (2 hours)
|
||||
max_ready_age: 86400 # seconds (24 hours)
|
||||
max_completed_age: 3600 # seconds (1 hour)
|
||||
max_orphaned_age: 86400 # seconds (24 hours)
|
||||
max_request_age: 86400 # seconds (24 hours)
|
||||
purge_dry_run: false # if true, log what would be purged but don't purge
|
||||
```
|
||||
|
||||
**Purge Process:**
|
||||
1. Scan each queue for stale entries (based on age thresholds)
|
||||
2. Check if VM still exists in provider (optional validation)
|
||||
3. Move stale entries to DLQ with reason
|
||||
4. Remove from original queue
|
||||
5. Log purge metrics
|
||||
|
||||
**Implementation:**
|
||||
- New method: `purge_stale_queue_entries` - main purge loop
|
||||
- Helper methods: `check_pending_age`, `check_ready_age`, `check_completed_age`, `find_orphaned_metadata`
|
||||
- Scheduled task: Run every `purge_interval` seconds
|
||||
|
||||
**Acceptance Criteria:**
|
||||
- [ ] Stale pending VMs are detected and moved to DLQ
|
||||
- [ ] Stale ready VMs are detected and moved to completed queue
|
||||
- [ ] Stale completed VMs are removed from queue
|
||||
- [ ] Orphaned VM metadata is detected and expired
|
||||
- [ ] Purge metrics are logged (count, age, reason)
|
||||
- [ ] Dry-run mode available for testing
|
||||
- [ ] Purge runs on configurable interval
|
||||
|
||||
### 3. Health Checks
|
||||
|
||||
#### Purpose
|
||||
Monitor Redis queue health and expose metrics for alerting and dashboards.
|
||||
|
||||
#### Design
|
||||
|
||||
**Health Metrics:**
|
||||
```ruby
|
||||
{
|
||||
queues: {
|
||||
pending: {
|
||||
pool_name: {
|
||||
size: 10,
|
||||
oldest_age: 3600, # seconds
|
||||
avg_age: 1200,
|
||||
stuck_count: 2 # VMs older than threshold
|
||||
}
|
||||
},
|
||||
ready: { ... },
|
||||
completed: { ... },
|
||||
dlq: { ... }
|
||||
},
|
||||
tasks: {
|
||||
clone: { active: 5, pending: 10 },
|
||||
ondemand: { active: 2, pending: 5 }
|
||||
},
|
||||
processing_rate: {
|
||||
clone_rate: 10.5, # VMs per minute
|
||||
destroy_rate: 8.2
|
||||
},
|
||||
errors: {
|
||||
dlq_size: 150,
|
||||
stuck_vm_count: 5,
|
||||
orphaned_metadata_count: 12
|
||||
},
|
||||
status: "healthy|degraded|unhealthy"
|
||||
}
|
||||
```
|
||||
|
||||
**Health Status Criteria:**
|
||||
- **Healthy**: All queues within normal thresholds, DLQ size < 100, no stuck VMs
|
||||
- **Degraded**: Some queues elevated but functional, DLQ size < 1000, few stuck VMs
|
||||
- **Unhealthy**: Queues critically backed up, DLQ size > 1000, many stuck VMs
|
||||
|
||||
**Configuration:**
|
||||
```yaml
|
||||
:config:
|
||||
health_check_enabled: true
|
||||
health_check_interval: 300 # seconds (5 minutes)
|
||||
health_thresholds:
|
||||
pending_queue_max: 100
|
||||
ready_queue_max: 500
|
||||
dlq_max_warning: 100
|
||||
dlq_max_critical: 1000
|
||||
stuck_vm_age_threshold: 7200 # 2 hours
|
||||
stuck_vm_max_warning: 10
|
||||
stuck_vm_max_critical: 50
|
||||
```
|
||||
|
||||
**Implementation:**
|
||||
- New method: `check_queue_health` - main health check
|
||||
- Helper methods: `calculate_queue_metrics`, `calculate_processing_rate`, `determine_health_status`
|
||||
- Expose via:
|
||||
- Redis hash: `vmpooler__health` (for API consumption)
|
||||
- Metrics: Push to existing $metrics system
|
||||
- Logs: Periodic health summary in logs
|
||||
|
||||
**Acceptance Criteria:**
|
||||
- [ ] Queue sizes are monitored per pool
|
||||
- [ ] Queue ages are calculated (oldest, average)
|
||||
- [ ] Stuck VMs are detected (age > threshold)
|
||||
- [ ] DLQ size is monitored
|
||||
- [ ] Processing rates are calculated
|
||||
- [ ] Overall health status is determined
|
||||
- [ ] Health metrics are exposed via Redis, metrics, and logs
|
||||
- [ ] Health check runs on configurable interval
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
### Phase 1: Dead-Letter Queue
|
||||
1. Add DLQ configuration parsing
|
||||
2. Implement `move_to_dlq` helper method
|
||||
3. Update `fail_pending_vm` to use DLQ
|
||||
4. Update `_clone_vm` rescue block to use DLQ
|
||||
5. Update `_check_ready_vm` to use DLQ
|
||||
6. Add DLQ TTL enforcement
|
||||
7. Add DLQ size limiting
|
||||
8. Unit tests for DLQ operations
|
||||
|
||||
### Phase 2: Auto-Purge
|
||||
1. Add purge configuration parsing
|
||||
2. Implement `purge_stale_queue_entries` main loop
|
||||
3. Implement age-checking helper methods
|
||||
4. Implement orphan detection
|
||||
5. Add purge metrics logging
|
||||
6. Add dry-run mode
|
||||
7. Unit tests for purge logic
|
||||
8. Integration test for full purge cycle
|
||||
|
||||
### Phase 3: Health Checks
|
||||
1. Add health check configuration parsing
|
||||
2. Implement `check_queue_health` main method
|
||||
3. Implement metric calculation helpers
|
||||
4. Implement health status determination
|
||||
5. Expose metrics via Redis hash
|
||||
6. Expose metrics via $metrics system
|
||||
7. Add periodic health logging
|
||||
8. Unit tests for health check logic
|
||||
|
||||
### Phase 4: Integration & Documentation
|
||||
1. Update configuration examples
|
||||
2. Update operator documentation
|
||||
3. Update API documentation (if exposing health endpoint)
|
||||
4. Add troubleshooting guide for DLQ/purge
|
||||
5. Create runbook for operators
|
||||
6. Update TESTING.md with DLQ/purge/health check testing
|
||||
|
||||
## Migration & Rollout
|
||||
|
||||
### Backward Compatibility
|
||||
- All features are opt-in via configuration
|
||||
- Default: `dlq_enabled: false`, `purge_enabled: false`, `health_check_enabled: false`
|
||||
- Existing behavior unchanged when features disabled
|
||||
|
||||
### Rollout Strategy
|
||||
1. Deploy with features disabled
|
||||
2. Enable DLQ first, monitor for issues
|
||||
3. Enable health checks, validate metrics
|
||||
4. Enable auto-purge in dry-run mode, validate detection
|
||||
5. Enable auto-purge in live mode, monitor impact
|
||||
|
||||
### Monitoring During Rollout
|
||||
- Monitor DLQ growth rate
|
||||
- Monitor purge counts and reasons
|
||||
- Monitor health status changes
|
||||
- Watch for unexpected VM removal
|
||||
- Check for performance impact (Redis load, memory)
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
- DLQ capture for various error scenarios
|
||||
- DLQ TTL enforcement
|
||||
- DLQ size limiting
|
||||
- Age calculation for purge detection
|
||||
- Orphan detection logic
|
||||
- Health metric calculations
|
||||
- Health status determination
|
||||
|
||||
### Integration Tests
|
||||
- End-to-end VM failure → DLQ flow
|
||||
- End-to-end purge cycle
|
||||
- Health check with real queue data
|
||||
- DLQ + purge interaction (purge should respect DLQ entries)
|
||||
|
||||
### Manual Testing
|
||||
1. Create VM with invalid template → verify DLQ entry
|
||||
2. Let VM sit in pending too long → verify purge detection
|
||||
3. Check health endpoint → verify metrics accuracy
|
||||
4. Run purge in dry-run → verify correct detection without deletion
|
||||
5. Run purge in live mode → verify stale entries removed
|
||||
|
||||
## API Changes (Optional)
|
||||
|
||||
If exposing to API:
|
||||
```
|
||||
GET /api/v1/queue/health
|
||||
Returns: Health metrics JSON
|
||||
|
||||
GET /api/v1/queue/dlq?queue=pending&limit=50
|
||||
Returns: DLQ entries for specified queue
|
||||
|
||||
POST /api/v1/queue/purge?dry_run=true
|
||||
Returns: Purge simulation results (admin only)
|
||||
```
|
||||
|
||||
## Metrics
|
||||
|
||||
New metrics to add:
|
||||
```
|
||||
vmpooler.dlq.pending.size
|
||||
vmpooler.dlq.clone.size
|
||||
vmpooler.dlq.ready.size
|
||||
vmpooler.dlq.tasks.size
|
||||
|
||||
vmpooler.purge.pending.count
|
||||
vmpooler.purge.ready.count
|
||||
vmpooler.purge.completed.count
|
||||
vmpooler.purge.orphaned.count
|
||||
|
||||
vmpooler.health.status # 0=healthy, 1=degraded, 2=unhealthy
|
||||
vmpooler.health.stuck_vms.count
|
||||
vmpooler.health.queue.#{queue_name}.size
|
||||
vmpooler.health.queue.#{queue_name}.oldest_age
|
||||
```
|
||||
|
||||
## Configuration Example
|
||||
|
||||
```yaml
|
||||
---
|
||||
:config:
|
||||
# Existing config...
|
||||
|
||||
# Dead-Letter Queue
|
||||
dlq_enabled: true
|
||||
dlq_ttl: 168 # hours (7 days)
|
||||
dlq_max_entries: 10000
|
||||
|
||||
# Auto-Purge
|
||||
purge_enabled: true
|
||||
purge_interval: 3600 # seconds (1 hour)
|
||||
purge_dry_run: false
|
||||
max_pending_age: 7200 # seconds (2 hours)
|
||||
max_ready_age: 86400 # seconds (24 hours)
|
||||
max_completed_age: 3600 # seconds (1 hour)
|
||||
max_orphaned_age: 86400 # seconds (24 hours)
|
||||
|
||||
# Health Checks
|
||||
health_check_enabled: true
|
||||
health_check_interval: 300 # seconds (5 minutes)
|
||||
health_thresholds:
|
||||
pending_queue_max: 100
|
||||
ready_queue_max: 500
|
||||
dlq_max_warning: 100
|
||||
dlq_max_critical: 1000
|
||||
stuck_vm_age_threshold: 7200 # 2 hours
|
||||
stuck_vm_max_warning: 10
|
||||
stuck_vm_max_critical: 50
|
||||
|
||||
:redis:
|
||||
# Existing redis config...
|
||||
```
|
||||
|
|
@ -329,6 +329,30 @@ module Vmpooler
|
|||
buckets: REDIS_CONNECT_BUCKETS,
|
||||
docstring: 'vmpooler redis connection wait time',
|
||||
param_labels: %i[type provider]
|
||||
},
|
||||
vmpooler_health: {
|
||||
mtype: M_GAUGE,
|
||||
torun: %i[manager],
|
||||
docstring: 'vmpooler health check metrics',
|
||||
param_labels: %i[metric_path]
|
||||
},
|
||||
vmpooler_purge: {
|
||||
mtype: M_GAUGE,
|
||||
torun: %i[manager],
|
||||
docstring: 'vmpooler purge metrics',
|
||||
param_labels: %i[metric_path]
|
||||
},
|
||||
vmpooler_destroy: {
|
||||
mtype: M_GAUGE,
|
||||
torun: %i[manager],
|
||||
docstring: 'vmpooler destroy metrics',
|
||||
param_labels: %i[poolname]
|
||||
},
|
||||
vmpooler_clone: {
|
||||
mtype: M_GAUGE,
|
||||
torun: %i[manager],
|
||||
docstring: 'vmpooler clone metrics',
|
||||
param_labels: %i[poolname]
|
||||
}
|
||||
}
|
||||
end
|
||||
|
|
|
|||
|
|
@ -200,11 +200,11 @@ module Vmpooler
|
|||
redis.hset("vmpooler__odrequest__#{request_id}", 'status', 'failed')
|
||||
redis.hset("vmpooler__odrequest__#{request_id}", 'failure_reason', failure_reason)
|
||||
$logger.log('s', "[!] [#{pool}] '#{vm}' permanently failed: #{failure_reason}")
|
||||
$metrics.increment("errors.permanently_failed.#{pool}")
|
||||
$metrics.increment("vmpooler_errors.permanently_failed.#{pool}")
|
||||
end
|
||||
end
|
||||
end
|
||||
$metrics.increment("errors.markedasfailed.#{pool}")
|
||||
$metrics.increment("vmpooler_errors.markedasfailed.#{pool}")
|
||||
open_socket_error || clone_error
|
||||
end
|
||||
|
||||
|
|
@ -477,7 +477,7 @@ module Vmpooler
|
|||
ttl_seconds = dlq_ttl * 3600
|
||||
redis.expire(dlq_key, ttl_seconds)
|
||||
|
||||
$metrics.increment("dlq.#{queue_type}.count") unless skip_metrics
|
||||
$metrics.increment("vmpooler_dlq.#{queue_type}.count") unless skip_metrics
|
||||
$logger.log('d', "[!] [dlq] Moved '#{vm}' from '#{queue_type}' queue to DLQ: #{error_message}")
|
||||
rescue StandardError => e
|
||||
$logger.log('s', "[!] [dlq] Failed to move '#{vm}' to DLQ: #{e}")
|
||||
|
|
@ -551,10 +551,10 @@ module Vmpooler
|
|||
hostname_retries += 1
|
||||
|
||||
if !hostname_available
|
||||
$metrics.increment("errors.duplicatehostname.#{pool_name}")
|
||||
$metrics.increment("vmpooler_errors.duplicatehostname.#{pool_name}")
|
||||
$logger.log('s', "[!] [#{pool_name}] Generated hostname #{fqdn} was not unique (attempt \##{hostname_retries} of #{max_hostname_retries})")
|
||||
elsif !dns_available
|
||||
$metrics.increment("errors.staledns.#{pool_name}")
|
||||
$metrics.increment("vmpooler_errors.staledns.#{pool_name}")
|
||||
$logger.log('s', "[!] [#{pool_name}] Generated hostname #{fqdn} already exists in DNS records (#{dns_ip}), stale DNS")
|
||||
end
|
||||
end
|
||||
|
|
@ -600,7 +600,7 @@ module Vmpooler
|
|||
provider.create_vm(pool_name, new_vmname)
|
||||
finish = format('%<time>.2f', time: Time.now - start)
|
||||
$logger.log('s', "[+] [#{pool_name}] '#{new_vmname}' cloned in #{finish} seconds")
|
||||
$metrics.timing("clone.#{pool_name}", finish)
|
||||
$metrics.gauge("vmpooler_clone.#{pool_name}", finish)
|
||||
|
||||
$logger.log('d', "[ ] [#{pool_name}] Obtaining IP for '#{new_vmname}'")
|
||||
ip_start = Time.now
|
||||
|
|
@ -714,7 +714,7 @@ module Vmpooler
|
|||
|
||||
finish = format('%<time>.2f', time: Time.now - start)
|
||||
$logger.log('s', "[-] [#{pool}] '#{vm}' destroyed in #{finish} seconds")
|
||||
$metrics.timing("destroy.#{pool}", finish)
|
||||
$metrics.gauge("vmpooler_destroy.#{pool}", finish)
|
||||
end
|
||||
end
|
||||
dereference_mutex(vm)
|
||||
|
|
@ -809,8 +809,8 @@ module Vmpooler
|
|||
|
||||
purge_duration = Time.now - purge_start
|
||||
$logger.log('s', "[*] [purge] Completed purge cycle in #{purge_duration.round(2)}s: #{total_purged} entries purged")
|
||||
$metrics.timing('purge.cycle.duration', purge_duration)
|
||||
$metrics.gauge('purge.total.count', total_purged)
|
||||
$metrics.gauge('vmpooler_purge.cycle.duration', purge_duration)
|
||||
$metrics.gauge('vmpooler_purge.total.count', total_purged)
|
||||
end
|
||||
rescue StandardError => e
|
||||
$logger.log('s', "[!] [purge] Failed during purge cycle: #{e}")
|
||||
|
|
@ -854,7 +854,7 @@ module Vmpooler
|
|||
end
|
||||
|
||||
$logger.log('d', "[!] [purge] Purged stale pending VM '#{vm}' from '#{pool_name}' (age: #{age.round(0)}s)")
|
||||
$metrics.increment("purge.pending.#{pool_name}.count")
|
||||
$metrics.increment("vmpooler_purge.pending.#{pool_name}.count")
|
||||
end
|
||||
end
|
||||
rescue StandardError => e
|
||||
|
|
@ -884,7 +884,7 @@ module Vmpooler
|
|||
else
|
||||
redis.smove(queue_key, "vmpooler__completed__#{pool_name}", vm)
|
||||
$logger.log('d', "[!] [purge] Moved stale ready VM '#{vm}' from '#{pool_name}' to completed (age: #{age.round(0)}s)")
|
||||
$metrics.increment("purge.ready.#{pool_name}.count")
|
||||
$metrics.increment("vmpooler_purge.ready.#{pool_name}.count")
|
||||
end
|
||||
purged_count += 1
|
||||
end
|
||||
|
|
@ -920,7 +920,7 @@ module Vmpooler
|
|||
else
|
||||
redis.srem(queue_key, vm)
|
||||
$logger.log('d', "[!] [purge] Removed stale completed VM '#{vm}' from '#{pool_name}' (age: #{age.round(0)}s)")
|
||||
$metrics.increment("purge.completed.#{pool_name}.count")
|
||||
$metrics.increment("vmpooler_purge.completed.#{pool_name}.count")
|
||||
end
|
||||
purged_count += 1
|
||||
end
|
||||
|
|
@ -968,7 +968,7 @@ module Vmpooler
|
|||
expiration_ttl = 3600 # 1 hour
|
||||
redis.expire(vm_key, expiration_ttl)
|
||||
$logger.log('d', "[!] [purge] Set expiration on orphaned metadata for '#{vm}' (age: #{age.round(0)}s)")
|
||||
$metrics.increment('purge.orphaned.count')
|
||||
$metrics.increment('vmpooler_purge.orphaned.count')
|
||||
end
|
||||
purged_count += 1
|
||||
end
|
||||
|
|
@ -1017,7 +1017,9 @@ module Vmpooler
|
|||
health_status = determine_health_status(health_metrics)
|
||||
|
||||
# Store health metrics in Redis for API consumption
|
||||
redis.hmset('vmpooler__health', *health_metrics.to_a.flatten)
|
||||
# Convert nested hash to JSON for storage
|
||||
require 'json'
|
||||
redis.hset('vmpooler__health', 'metrics', health_metrics.to_json)
|
||||
redis.hset('vmpooler__health', 'status', health_status)
|
||||
redis.hset('vmpooler__health', 'last_check', Time.now.iso8601)
|
||||
redis.expire('vmpooler__health', 3600) # Expire after 1 hour
|
||||
|
|
@ -1029,7 +1031,7 @@ module Vmpooler
|
|||
push_health_metrics(health_metrics, health_status)
|
||||
|
||||
health_duration = Time.now - health_start
|
||||
$metrics.timing('health.check.duration', health_duration)
|
||||
$metrics.gauge('vmpooler_health.check.duration', health_duration)
|
||||
end
|
||||
rescue StandardError => e
|
||||
$logger.log('s', "[!] [health] Failed during health check: #{e}")
|
||||
|
|
@ -1252,37 +1254,37 @@ module Vmpooler
|
|||
|
||||
def push_health_metrics(metrics, status)
|
||||
# Push error metrics first
|
||||
$metrics.gauge('health.dlq.total_size', metrics['errors']['dlq_total_size'])
|
||||
$metrics.gauge('health.stuck_vms.count', metrics['errors']['stuck_vm_count'])
|
||||
$metrics.gauge('health.orphaned_metadata.count', metrics['errors']['orphaned_metadata_count'])
|
||||
$metrics.gauge('vmpooler_health.dlq.total_size', metrics['errors']['dlq_total_size'])
|
||||
$metrics.gauge('vmpooler_health.stuck_vms.count', metrics['errors']['stuck_vm_count'])
|
||||
$metrics.gauge('vmpooler_health.orphaned_metadata.count', metrics['errors']['orphaned_metadata_count'])
|
||||
|
||||
# Push per-pool queue metrics
|
||||
metrics['queues'].each do |pool_name, queues|
|
||||
next if pool_name == 'dlq'
|
||||
|
||||
$metrics.gauge("health.queue.#{pool_name}.pending.size", queues['pending']['size'])
|
||||
$metrics.gauge("health.queue.#{pool_name}.pending.oldest_age", queues['pending']['oldest_age'])
|
||||
$metrics.gauge("health.queue.#{pool_name}.pending.stuck_count", queues['pending']['stuck_count'])
|
||||
$metrics.gauge("vmpooler_health.queue.#{pool_name}.pending.size", queues['pending']['size'])
|
||||
$metrics.gauge("vmpooler_health.queue.#{pool_name}.pending.oldest_age", queues['pending']['oldest_age'])
|
||||
$metrics.gauge("vmpooler_health.queue.#{pool_name}.pending.stuck_count", queues['pending']['stuck_count'])
|
||||
|
||||
$metrics.gauge("health.queue.#{pool_name}.ready.size", queues['ready']['size'])
|
||||
$metrics.gauge("health.queue.#{pool_name}.ready.oldest_age", queues['ready']['oldest_age'])
|
||||
$metrics.gauge("vmpooler_health.queue.#{pool_name}.ready.size", queues['ready']['size'])
|
||||
$metrics.gauge("vmpooler_health.queue.#{pool_name}.ready.oldest_age", queues['ready']['oldest_age'])
|
||||
|
||||
$metrics.gauge("health.queue.#{pool_name}.completed.size", queues['completed']['size'])
|
||||
$metrics.gauge("vmpooler_health.queue.#{pool_name}.completed.size", queues['completed']['size'])
|
||||
end
|
||||
|
||||
# Push DLQ metrics
|
||||
metrics['queues']['dlq']&.each do |queue_type, dlq_metrics|
|
||||
$metrics.gauge("health.dlq.#{queue_type}.size", dlq_metrics['size'])
|
||||
$metrics.gauge("vmpooler_health.dlq.#{queue_type}.size", dlq_metrics['size'])
|
||||
end
|
||||
|
||||
# Push task metrics
|
||||
$metrics.gauge('health.tasks.clone.active', metrics['tasks']['clone']['active'])
|
||||
$metrics.gauge('health.tasks.ondemand.active', metrics['tasks']['ondemand']['active'])
|
||||
$metrics.gauge('health.tasks.ondemand.pending', metrics['tasks']['ondemand']['pending'])
|
||||
$metrics.gauge('vmpooler_health.tasks.clone.active', metrics['tasks']['clone']['active'])
|
||||
$metrics.gauge('vmpooler_health.tasks.ondemand.active', metrics['tasks']['ondemand']['active'])
|
||||
$metrics.gauge('vmpooler_health.tasks.ondemand.pending', metrics['tasks']['ondemand']['pending'])
|
||||
|
||||
# Push status last (0=healthy, 1=degraded, 2=unhealthy)
|
||||
status_value = { 'healthy' => 0, 'degraded' => 1, 'unhealthy' => 2 }[status] || 2
|
||||
$metrics.gauge('health.status', status_value)
|
||||
$metrics.gauge('vmpooler_health.status', status_value)
|
||||
end
|
||||
|
||||
def create_vm_disk(pool_name, vm, disk_size, provider)
|
||||
|
|
@ -2244,6 +2246,15 @@ module Vmpooler
|
|||
redis.zrem('vmpooler__provisioning__request', request_id)
|
||||
return
|
||||
end
|
||||
|
||||
# Check if request was already marked as failed (e.g., by delete endpoint)
|
||||
request_status = redis.hget("vmpooler__odrequest__#{request_id}", 'status')
|
||||
if request_status == 'failed'
|
||||
$logger.log('s', "Request '#{request_id}' already marked as failed, skipping VM creation")
|
||||
redis.zrem('vmpooler__provisioning__request', request_id)
|
||||
return
|
||||
end
|
||||
|
||||
score = redis.zscore('vmpooler__provisioning__request', request_id)
|
||||
requested = requested.split(',')
|
||||
|
||||
|
|
|
|||
|
|
@ -1107,7 +1107,8 @@ EOT
|
|||
context 'with no errors during cloning' do
|
||||
before(:each) do
|
||||
allow(metrics).to receive(:timing)
|
||||
expect(metrics).to receive(:timing).with(/clone\./,/0/)
|
||||
allow(metrics).to receive(:gauge)
|
||||
expect(metrics).to receive(:gauge).with(/vmpooler_clone\./,/0/)
|
||||
expect(provider).to receive(:create_vm).with(pool, String)
|
||||
allow(provider).to receive(:get_vm_ip_address).and_return(1)
|
||||
allow(subject).to receive(:get_domain_for_pool).and_return('example.com')
|
||||
|
|
@ -1158,7 +1159,8 @@ EOT
|
|||
context 'with a failure to get ip address after cloning' do
|
||||
it 'should log a message that it completed being cloned' do
|
||||
allow(metrics).to receive(:timing)
|
||||
expect(metrics).to receive(:timing).with(/clone\./,/0/)
|
||||
allow(metrics).to receive(:gauge)
|
||||
expect(metrics).to receive(:gauge).with(/vmpooler_clone\./,/0/)
|
||||
expect(provider).to receive(:create_vm).with(pool, String)
|
||||
allow(provider).to receive(:get_vm_ip_address).and_return(nil)
|
||||
|
||||
|
|
@ -1217,7 +1219,8 @@ EOT
|
|||
context 'with request_id' do
|
||||
before(:each) do
|
||||
allow(metrics).to receive(:timing)
|
||||
expect(metrics).to receive(:timing).with(/clone\./,/0/)
|
||||
allow(metrics).to receive(:gauge)
|
||||
expect(metrics).to receive(:gauge).with(/vmpooler_clone\./,/0/)
|
||||
expect(provider).to receive(:create_vm).with(pool, String)
|
||||
allow(provider).to receive(:get_vm_ip_address).with(vm,pool).and_return(1)
|
||||
allow(subject).to receive(:get_dns_plugin_class_name_for_pool).and_return(dns_plugin)
|
||||
|
|
@ -1255,7 +1258,7 @@ EOT
|
|||
resolv = class_double("Resolv").as_stubbed_const(:transfer_nested_constants => true)
|
||||
expect(subject).to receive(:generate_and_check_hostname).exactly(3).times.and_return([vm_name, true]) #skip this, make it available all times
|
||||
expect(resolv).to receive(:getaddress).exactly(3).times.and_return("1.2.3.4")
|
||||
expect(metrics).to receive(:increment).with("errors.staledns.#{pool}").exactly(3).times
|
||||
expect(metrics).to receive(:increment).with("vmpooler_errors.staledns.#{pool}").exactly(3).times
|
||||
expect{subject._clone_vm(pool,provider,dns_plugin)}.to raise_error(/Unable to generate a unique hostname after/)
|
||||
end
|
||||
it 'should be successful if DNS does not exist' do
|
||||
|
|
@ -1353,7 +1356,8 @@ EOT
|
|||
it 'should emit a timing metric' do
|
||||
allow(subject).to receive(:get_vm_usage_labels)
|
||||
allow(metrics).to receive(:timing)
|
||||
expect(metrics).to receive(:timing).with("destroy.#{pool}", String)
|
||||
allow(metrics).to receive(:gauge)
|
||||
expect(metrics).to receive(:gauge).with("vmpooler_destroy.#{pool}", String)
|
||||
|
||||
subject._destroy_vm(vm,pool,provider,dns_plugin)
|
||||
end
|
||||
|
|
@ -5174,6 +5178,44 @@ EOT
|
|||
end
|
||||
end
|
||||
|
||||
context 'when request is already marked as failed' do
|
||||
let(:request_string) { "#{pool}:#{pool}:1" }
|
||||
before(:each) do
|
||||
redis_connection_pool.with do |redis|
|
||||
create_ondemand_request_for_test(request_id, current_time.to_i, request_string, redis)
|
||||
set_ondemand_request_status(request_id, 'failed', redis)
|
||||
end
|
||||
end
|
||||
|
||||
it 'logs that the request is already failed' do
|
||||
redis_connection_pool.with do |redis|
|
||||
expect(logger).to receive(:log).with('s', "Request '#{request_id}' already marked as failed, skipping VM creation")
|
||||
subject.create_ondemand_vms(request_id, redis)
|
||||
end
|
||||
end
|
||||
|
||||
it 'removes the request from provisioning__request queue' do
|
||||
redis_connection_pool.with do |redis|
|
||||
subject.create_ondemand_vms(request_id, redis)
|
||||
expect(redis.zscore('vmpooler__provisioning__request', request_id)).to be_nil
|
||||
end
|
||||
end
|
||||
|
||||
it 'does not create VM tasks' do
|
||||
redis_connection_pool.with do |redis|
|
||||
subject.create_ondemand_vms(request_id, redis)
|
||||
expect(redis.zcard('vmpooler__odcreate__task')).to eq(0)
|
||||
end
|
||||
end
|
||||
|
||||
it 'does not add to provisioning__processing queue' do
|
||||
redis_connection_pool.with do |redis|
|
||||
subject.create_ondemand_vms(request_id, redis)
|
||||
expect(redis.zscore('vmpooler__provisioning__processing', request_id)).to be_nil
|
||||
end
|
||||
end
|
||||
end
|
||||
|
||||
context 'with a request that has data' do
|
||||
let(:request_string) { "#{pool}:#{pool}:1" }
|
||||
before(:each) do
|
||||
|
|
|
|||
|
|
@ -119,7 +119,7 @@ describe 'Vmpooler::PoolManager - Queue Reliability Features' do
|
|||
|
||||
it 'increments DLQ metrics' do
|
||||
redis_connection_pool.with do |redis_connection|
|
||||
expect(metrics).to receive(:increment).with('dlq.pending.count')
|
||||
expect(metrics).to receive(:increment).with('vmpooler_dlq.pending.count')
|
||||
|
||||
subject.move_to_dlq(vm, pool, 'pending', error_class, error_message, redis_connection)
|
||||
end
|
||||
|
|
@ -223,7 +223,7 @@ describe 'Vmpooler::PoolManager - Queue Reliability Features' do
|
|||
|
||||
it 'increments purge metrics' do
|
||||
redis_connection_pool.with do |redis_connection|
|
||||
expect(metrics).to receive(:increment).with("purge.pending.#{pool}.count")
|
||||
expect(metrics).to receive(:increment).with("vmpooler_purge.pending.#{pool}.count")
|
||||
|
||||
subject.purge_pending_queue(pool, redis_connection)
|
||||
end
|
||||
|
|
@ -460,35 +460,35 @@ describe 'Vmpooler::PoolManager - Queue Reliability Features' do
|
|||
|
||||
it 'pushes status metric' do
|
||||
allow(metrics).to receive(:gauge)
|
||||
expect(metrics).to receive(:gauge).with('health.status', 0)
|
||||
expect(metrics).to receive(:gauge).with('vmpooler_health.status', 0)
|
||||
|
||||
subject.push_health_metrics(metrics_data, 'healthy')
|
||||
end
|
||||
|
||||
it 'pushes error metrics' do
|
||||
allow(metrics).to receive(:gauge)
|
||||
expect(metrics).to receive(:gauge).with('health.dlq.total_size', 25)
|
||||
expect(metrics).to receive(:gauge).with('health.stuck_vms.count', 2)
|
||||
expect(metrics).to receive(:gauge).with('health.orphaned_metadata.count', 3)
|
||||
expect(metrics).to receive(:gauge).with('vmpooler_health.dlq.total_size', 25)
|
||||
expect(metrics).to receive(:gauge).with('vmpooler_health.stuck_vms.count', 2)
|
||||
expect(metrics).to receive(:gauge).with('vmpooler_health.orphaned_metadata.count', 3)
|
||||
|
||||
subject.push_health_metrics(metrics_data, 'healthy')
|
||||
end
|
||||
|
||||
it 'pushes per-pool queue metrics' do
|
||||
allow(metrics).to receive(:gauge)
|
||||
expect(metrics).to receive(:gauge).with('health.queue.test-pool.pending.size', 10)
|
||||
expect(metrics).to receive(:gauge).with('health.queue.test-pool.pending.oldest_age', 3600)
|
||||
expect(metrics).to receive(:gauge).with('health.queue.test-pool.pending.stuck_count', 2)
|
||||
expect(metrics).to receive(:gauge).with('health.queue.test-pool.ready.size', 50)
|
||||
expect(metrics).to receive(:gauge).with('vmpooler_health.queue.test-pool.pending.size', 10)
|
||||
expect(metrics).to receive(:gauge).with('vmpooler_health.queue.test-pool.pending.oldest_age', 3600)
|
||||
expect(metrics).to receive(:gauge).with('vmpooler_health.queue.test-pool.pending.stuck_count', 2)
|
||||
expect(metrics).to receive(:gauge).with('vmpooler_health.queue.test-pool.ready.size', 50)
|
||||
|
||||
subject.push_health_metrics(metrics_data, 'healthy')
|
||||
end
|
||||
|
||||
it 'pushes task metrics' do
|
||||
allow(metrics).to receive(:gauge)
|
||||
expect(metrics).to receive(:gauge).with('health.tasks.clone.active', 3)
|
||||
expect(metrics).to receive(:gauge).with('health.tasks.ondemand.active', 2)
|
||||
expect(metrics).to receive(:gauge).with('health.tasks.ondemand.pending', 5)
|
||||
expect(metrics).to receive(:gauge).with('vmpooler_health.tasks.clone.active', 3)
|
||||
expect(metrics).to receive(:gauge).with('vmpooler_health.tasks.ondemand.active', 2)
|
||||
expect(metrics).to receive(:gauge).with('vmpooler_health.tasks.ondemand.pending', 5)
|
||||
|
||||
subject.push_health_metrics(metrics_data, 'healthy')
|
||||
end
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue