This change fixes template alias evaluation to ensure that the correct
data is set when generating on demand requests for pools that have a
backend weight configured with a value of 0. Without this change vmpooler
will return an empty selection in the API for template alias evaluation.
To support this change, tests are added that first reproduce the
failure and then verify that it is resolved by the patch. Additionally,
test coverage is added to ensure that code paths that include pickup gem
usage are covered.
Introducing the Prometheus stats code into ABS showed that the clarity
could be improved a bit with better variable naming, some refactoring
to reduce repetition, and documentation of the metrics table itself.
These changes are filtered back into the vmpooler code base.
This commit updates the method used for checking the status of an
ondemand request to ensure that if multiple aliases are used to fulfill
a request that they are correctly presented as a single pool again when
everything is ready. Without this change it is possible for only one
group of an aliased pool to show up in pending or completed requests.
Ensure that the correct stats are registered for the manager and the API
respectively. E.g. all checkout counters are for the API only, whereas
clone times belong to the manager.
Also new ondemand functionality stats weren't registered, so add these
along with missing delete stats.
Use the example provided in the Ruby client to provide a customised
collector appropriate for logging all calls to the API. The customised
filtering is used to replace individual node names and templates
for the /vm endpoints and request IDs for the /ondemand endpoints.
This module was failing our rubocop checks, so it has been updated
since it now forms part of vmpooler.
Separate trapping for litmus jobs is also included so that they don't
interfere with stats from the Jenkins pipelines.
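For illustration, the path filtering described above might look like the sketch below; the endpoint patterns and the hook the real collector uses are assumptions, not the actual vmpooler code:

```ruby
# Illustrative only: normalise request paths before using them as metric
# labels, so individual hostnames and request IDs don't explode cardinality.
def strip_identifiers(path)
  path
    .gsub(%r{^(/api/v\d+/vm)/[^/]+}, '\1/:hostname')
    .gsub(%r{^(/api/v\d+/ondemand(?:vm)?)/[^/]+}, '\1/:requestid')
end

strip_identifiers('/api/v1/vm/fq6qlpjlsskycq6')    # => "/api/v1/vm/:hostname"
strip_identifiers('/api/v1/ondemandvm/1234567890') # => "/api/v1/ondemandvm/:requestid"
```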
* (POOLER-174) Reduce duplication of on demand code introduced in POOLER-158
refactored the parsing of requests of the form 'pool_alias:pool:count' into a
utility class that is used by pool_manager and the API v1 class (see the
sketch after this list)
* add some metrics to the on demand request generation
* fix rubocop offenses, we are now friends
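A minimal sketch of the kind of parsing the utility class centralises; the method and key names are illustrative, not the actual vmpooler code:

```ruby
# Illustrative only: parse entries of the form 'pool_alias:pool:count'.
def parse_pool_requests(payload)
  payload.split(',').map do |entry|
    pool_alias, pool, count = entry.split(':')
    raise ArgumentError, "malformed pool request: #{entry}" if count.nil?
    { pool_alias: pool_alias, pool: pool, count: Integer(count) }
  end
end

parse_pool_requests('centos-7-x86_64:centos-7-x86_64-pooled:2')
# => [{ pool_alias: "centos-7-x86_64", pool: "centos-7-x86_64-pooled", count: 2 }]
```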
This change adds a capability to vmpooler to provision instances on
demand. Without this change vmpooler only supports retrieving machines
from pre-provisioned pools.
Additionally, this change refactors redis interactions to reduce round
trips to redis. Specifically, multi and pipelined redis commands are
added where possible to reduce the number of times we are calling redis.
To support the redis refactor, the redis interaction has changed to
leverage a connection pool. In addition to offering multiple
connections for pool manager to use, the redis interactions in pool
manager are now thread safe.
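A minimal sketch of the pattern, assuming the connection_pool gem and a recent redis-rb; the key names follow the vmpooler__<queue>__<pool> convention quoted later in this document but are otherwise illustrative:

```ruby
require 'connection_pool'
require 'redis'

# Shared pool of redis connections; each thread borrows one with #with.
REDIS_POOL = ConnectionPool.new(size: 10, timeout: 5) { Redis.new }

def record_clone(pool_name, vm_name)
  REDIS_POOL.with do |redis|
    # multi sends both commands in a single round trip and applies them atomically.
    redis.multi do |tx|
      tx.sadd("vmpooler__pending__#{pool_name}", vm_name)
      tx.hset("vmpooler__vm__#{vm_name}", 'clone', Time.now.to_s)
    end
  end
end
```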
Ready TTL is now a global parameter that can be set as a default for all
pools. A default of 0 has been removed, because this is an unreasonable
default behavior, which would leave a provisioned instance in the pool
indefinitely.
Pool empty messages have been removed when the pool size is set to 0.
Without this change, when a pool was set to a size of 0 the API and pool
manager would both show that a pool is empty.
Before this change, if the smove returned false, we would continue handing out the VM,
which presumably could still be in the 'ready' state. Upon 'delete' that ready VM would not be
picked up and would return a 404, which is consistent with the behavior seen. A metric is added to keep
track of the smove failures since this is not expected. I think some API logging would be good
to add in the future.
Before this PR, the current running time was being inspected to decide if the
vm lifetime could be extended. But since vm lifetime is absolute, not relative,
this check is now removed.
This commit fixes a bug in update_clone_target where I believe `=` was
intended, not `==`, because `==` here is just a comparison in void context.
Thanks Rubocop Lint/Void!
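For illustration, the class of bug Lint/Void catches looks like this sketch (not the actual vmpooler code):

```ruby
# Hypothetical illustration of the bug class flagged by Lint/Void.
clone_target = 'cluster-a'
new_target   = 'cluster-b'

clone_target == new_target   # buggy: comparison in void context, result discarded
puts clone_target            # => cluster-a (unchanged)

clone_target = new_target    # intended: assignment
puts clone_target            # => cluster-b
```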
This commit adds a capability to vmpooler to reset a pool, deleting its ready and pending instances and replacing them with fresh ones. Without this change vmpooler does not offer a mechanism to reset a pool without also changing its template.
* (POOLER-123) Implement a max TTL
Before this change, we could check out a vm and set the lifetime to a
very high number, which would essentially keep the vm running forever.
This implements a config setting, max_lifetime_upper_limit, which enforces
a maximum lifetime in hours both for the initial checkout and for extending a
running vm (see the sketch after this list).
* (POOLER-123) Improve PUT vm endpoint error messaging
Prior to this commit the PUT vm endpoint didn't give any useful
information about why a user's request failed.
This commit updates PUT to output a more helpful set of error messages
in the `failure` key that gets returned in the JSON response.
* (POOLER-123) Update max_lifetime_upper_limit key
This commit switches the max_lifetime_upper_limit key from being a
symbol to being a string, which is what the config hash seems to contain.
* (maint) Add option to disable Redis persistence in docker-compose
This commit is just a handy little command override to the redis
container to prevent persistence.
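A minimal sketch of the max lifetime enforcement described in the first commit above; the method shape is hypothetical, and only the max_lifetime_upper_limit string key comes from the commits themselves:

```ruby
# Illustrative only: reject lifetimes above the configured upper limit (hours).
def validate_lifetime!(requested_hours, config)
  limit = config['max_lifetime_upper_limit']
  return requested_hours if limit.nil?
  if requested_hours.to_i > limit.to_i
    raise ArgumentError,
          "requested lifetime #{requested_hours}h exceeds the maximum of #{limit}h"
  end
  requested_hours
end
```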
This commit adds a shared mutex to vmpooler API so that checkout requests can be synchronized across threads. Without this change it is possible in some scenarios for vmpooler to allocate the same SUT to different API requests for a VM.
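A minimal sketch of that synchronization, assuming a single process-wide mutex; names are illustrative:

```ruby
# Illustrative only: serialise VM checkouts so two API requests cannot
# be handed the same SUT.
CHECKOUT_MUTEX = Mutex.new

def with_checkout_lock
  CHECKOUT_MUTEX.synchronize { yield }
end

# e.g. with_checkout_lock { fetch_single_vm(template) }
```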
This change adds the running host for a VM to the API data available via /vm/hostname. Without this change the running host would be logged to vmpooler log, but not available any other way. Additionally, the data will specify if a machine has been migrated. Without this change parent host data for a vmpooler machine is not available via the vmpooler API.
This allows the user to change the cluster that the targeted pool
will clone into. Upon configuration change, the thread will wake up and
execute the change within 1 second.
This commit updates the reference to domain from vmpooler config. Without this change the domain value is read as an empty string and breaks checkouts.
This commit duplicates the vm_ready? check at the API layer to allow the API to validate that a VM is alive at checkout. Without this change the API relies upon the checks in pool_manager validating pools. This change should allow for additional insight into whether a machine is in a ready state and responding at checkout time.
Before this change looping over many pools would query the redis backend
for each pool, leading to slow responses from the backend for configurations
with many pools (60+).
Changed the requests to use redis pipelines https://redis.io/topics/pipelining
Pipelining has been supported since the beginning, so this will not force any redis
update for users. The pipelined method runs the queries in batches; we then loop
over the results. This reduces the number of requests to redis by a factor of N,
where N is the number of pools in the configuration.
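A minimal sketch of the batching, assuming a recent redis-rb; key names follow the vmpooler__ready__<pool> convention quoted later in this document:

```ruby
# Illustrative only: one pipelined round trip instead of one SCARD per pool.
def ready_counts(redis, pool_names)
  counts = redis.pipelined do |pipeline|
    pool_names.each { |name| pipeline.scard("vmpooler__ready__#{name}") }
  end
  pool_names.zip(counts).to_h
end
```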
Before this change we used the API /status endpoint to get specific information
on pools such as the number of ready VMs and the max.
This commit creates two new endpoints to get that information much more quickly:
1) poolstat?pool= takes a comma-separated list of pools to return, and will provide
the max, ready and alias values.
2) /totalrunning will calculate the total number of running VMs across all pools
This commit updates how a VM is checked out to ensure that there is no window where the VM could be considered discovered, and therefore destroyed. Without this change the VM is retrieved by calling 'spop' on the ready queue, and then adding it to the running queue. This change moves to selecting the VM by retrieving the last member of the set, and moving it with 'smove' from ready to running. As a result of this change vmpooler moves from retrieving the VMs from the ready state randomly, to instead retrieve the oldest VM in the queue. This change should reduce churn where it would otherwise not be required to satisfy demand.
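A minimal sketch of the checkout flow described above; the key names, the member-selection call, and the failure metric name are assumptions, not the actual vmpooler implementation:

```ruby
# Illustrative only.
def checkout_vm(redis, metrics, pool)
  ready_key   = "vmpooler__ready__#{pool}"
  running_key = "vmpooler__running__#{pool}"

  vm = redis.smembers(ready_key).last  # "last member of the set" per the description
  return nil if vm.nil?

  # smove is atomic, so the VM is never absent from both queues and cannot be
  # picked up as 'discovered' and destroyed mid-checkout.
  return vm if redis.smove(ready_key, running_key, vm)

  metrics.increment("checkout.smove_failure.#{pool}")  # hypothetical metric name
  nil
end
```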
This commit updates how vmpooler retrieves VMs to add a VM to the running queue as soon as it is checked out. Without this change it is possible that a VM can be discovered when it is checked out before it is added to the running queue if multiple systems are requested. Additionally, the dockerfile is updated to support specifying the version of vmpooler to install.
This change updates handling of pool aliases to allow for more than a
single pool to be configured as an alias pool. Without this change, if
multiple pools are configured with the same alias, only the last one to
configure it is registered for that alias.
Additionally, redis testing requirements are removed in favor of
mock_redis. Without this change a redis server is required to run
vmpooler tests.
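A minimal sketch of the multiple-alias handling described above; the data structure is illustrative, not the actual vmpooler code:

```ruby
# Illustrative only: map each alias to every pool that declares it,
# instead of letting the last pool win.
def build_alias_map(pools)
  aliases = Hash.new { |hash, key| hash[key] = [] }
  pools.each do |pool|
    Array(pool['alias']).each { |name| aliases[name] << pool['name'] }
  end
  aliases
end

# build_alias_map(config[:pools])['centos-6-64'] # => ["centos-6-x86_64", ...]
```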
This commit updates the dashboard for vmpooler to ensure it is synchronized with any redis based configuration values before displaying the dashboard. Without this change the pool size value may be displayed incorrectly if the value has been set via the /config/poolsize API endpoint.
This commit updates fetch_single_vm to return the name of the template that was requested, instead of the name of the pool providing the VM to meet the request. Without this change, when an alias is used for fetching a VM, a different pool title may be returned containing the requested VMs than the user initially requested.
This commit updates get_vm in the vmpooler API to allow for setting weights for backends. Additionally, when an alias for a pool exists, and the backend configured is not weighted, then the selection of the pool based on alias will be randomly sampled. Without this change any pool with the title of the alias is exhausted before an alternate pool with the configured alias is used, which results in an uneven distribution of VMs. When all backends involved are configured with weighted values the VM selection will be based on probability using those weights.
A bug is fixed when setting the default ttl for check_ready_vm.
Pickup is added to handle weighted VM selection.
A dockerfile is added that allows for building and installing vmpooler
from the current HEAD in docker to make for easy testing.
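A minimal sketch of the weighted selection with the pickup gem mentioned above; the pool names and weights are made up:

```ruby
require 'pickup'

# Weights would come from the backend weight configuration; these are examples.
weighted_pools = {
  'centos-7-x86_64-a' => 80,
  'centos-7-x86_64-b' => 20
}

# Pickup chooses a key with probability proportional to its weight.
chosen_pool = Pickup.new(weighted_pools).pick
puts chosen_pool
```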
* (POOLER-81) Add time_remaining information
Before, the only time calculation displayed for a given VM was the
lifetime parameter. Added the remaining parameter, which displays
the time until the VM is destroyed as a float.
Additionally, start_time and end_time were added to the API, returned
as UTC based times (e.g. 2018-07-10 11:01:03 -0700).
* Remove abs eval from GET, rework spec tests to check each field.
This allows us to account for "flakiness" of the remaining return.
* Change datetime to RFC3339 for start_time and end_time
* Revert "(POOLER-34) Ship clone request to ready time to metrics (#277)"
This reverts commit a865e6bd2f.
* Revert "(POOLER-81) Add time_remaining information (#276)"
This reverts commit 1910cffaf7.
* (POOLER-81) Add time_remaining information
Before, the only time calculation displayed for a given VM was the
lifetime parameter. Added the time_remaining parameter, which displays
the time until the VM is destroyed in hours, minutes, and seconds (see
the sketch after this list).
Additionally, updated the running parameter to display in a similar
fashion as time_remaining.
* Add spec testing for testing time_remaining stat
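For illustration, a remaining-time value in hours, minutes, and seconds could be derived from the absolute end time as in this sketch; the method and argument names are assumptions:

```ruby
# Illustrative only: format the time until destruction as HH:MM:SS.
def time_remaining_hms(end_time, now = Time.now)
  seconds = (end_time - now).to_i
  return '00:00:00' if seconds <= 0
  format('%02d:%02d:%02d', seconds / 3600, (seconds % 3600) / 60, seconds % 60)
end
```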
This commit adds a configuration endpoint to the vmpooler API. Pool
size, and pool template, can be adjusted for pools that are configured
at vmpooler application start time. Pool template changes trigger a pool
refresh, and the new template has delta disks created automatically by
vmpooler.
Additionally, the capability to create template delta disks is added to
the vsphere provider, and this is implemented to ensure that templates
have delta disks created at application start time.
The mechanism used to find template VM objects is simplified to make the flow of logic easier to understand. As an additional benefit, performance of this lookup is improved by using FindByInventoryPath.
A table of contents is added to API.md to ease navigation. Without this change API.md has no table of contents and is difficult to navigate.
Add mutex object for managing pool configuration updates
This commit adds a mutex object for ensuring that pool configuration changes are synchronized across multiple running threads, removing the possibility of two threads attempting to update something at once, without relying on redis data. Without this change this is managed crudely by specifying in redis that a configuration update is taking place. This redis data is left so the REPOPULATE section of _check_pool can still identify when a configuration change is in progress, and prevent a pool from repopulating at that time.
Add wake up event for pool template changes
This commit adds a wake up event to detect pool template changes.
Additionally, GET /config has a template_ready section added to the
output for each pool, which makes clear when a pool is ready to populate
itself.
This commit adds a redis hash with one key per pool, where the
stored value is the last time a VM was booted, e.g. the last time
a VM went from 'pending' to 'ready'. This is also displayed in the
API as lastBoot:'2018-03-23 17:43:39 +0000'. The data can then be
used by any external system, in this case our alerting system.
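A minimal sketch of the hash usage; the hash key name is an assumption:

```ruby
require 'redis'

redis = Redis.new
pool_name = 'centos-7-x86_64' # example pool

# Written when a VM moves from 'pending' to 'ready'.
redis.hset('vmpooler__lastboot', pool_name, Time.now.to_s)

# Read back by the API / alerting system.
redis.hget('vmpooler__lastboot', pool_name) # => e.g. "2018-03-23 17:43:39 +0000"
```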
The status endpoint provides a lot of statistics. This commit extends it
by supporting a query parameter called 'view', which may contain one or
more comma-separated names of the top-level statistics returned
in the JSON response. status is always returned.
Optional elements are capacity, queue, clone, boot, and pools.
Everything is returned when 'view' is not specified, which is
backwards compatible with the current behavior.
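A minimal sketch of the filtering; the parameter handling and hash keys are illustrative, not the actual handler:

```ruby
# Illustrative only: keep 'status' plus whatever top-level sections were requested.
def filter_status_sections(result, view_param)
  return result if view_param.nil? || view_param.empty?
  requested = view_param.split(',') + ['status']
  result.select { |section, _| requested.include?(section) }
end

# filter_status_sections(full_result, 'capacity,queue')
# => keeps "status", "capacity" and "queue" only
```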
Before this change, if a pool had an alias configured, the information would not be
made public in the API. This commit adds the alias key in the pool object for each
pool, if configured. The alias key can be absent, a string, or an array of one
or more strings. The value of the alias is copied from the configuration and can
represent another name for the pool, or another configured pool.
Prior to this the only per-pool statistics that could be extracted from the API
were a list of empty pools in the "status" section of the returned results of
the `/status` endpoint.
This adds a new "pools" section to the '/status' results which lists, for each
pool, the following results:
- The number of ready vms in the pool
- The number of running vms in the pool
- The number of pending vms in the pool
- The maximum size of the pool (as specified in the vmpooler configuration)
Example:
```
{
  "boot": {
    "duration": {
      "average": 163.6,
      "min": 65.49,
      "max": 830.07,
      "total": 247744.71000000002
    },
    "count": {
      "total": 1514
    }
  },
  # ...
  "pools": {
    "pool1": {
      "ready": 5,
      "running": 2,
      "pending": 1,
      "max": 15
    },
    "pool2": {
      "ready": 0,
      "running": 10,
      "pending": 0,
      "max": 10
    }
  }
}
```
This includes spec coverage for this change (we could use more specs on `/status` in general), as well as a couple of general spec improvements.
* [QENG-3919] Make vmpooler checkouts be all or nothing (#153)
* (QENG-3919) spike for implementation of all-or-nothing checkout
* Fix two botched variable references
* Aggregate API helper methods
* Add specs for failed multi-vm allocation API endpoints
* (QENG-3919) Add tests for multiple vm requests
* (QENG-3919) Add (failing) specs for POST /vm/pool1+pool2 usages
This exposes the old (bad) behavior on this other code path. Will fix this up next.
* (QENG-3919) Bring query params version in line with JSON post version
Not clear to me why these had to be implemented so differently.
* (QENG-3919) extract common method from both methods of VM allocation
* (QENG-3919) Naming fix, cosmetic cleanups
I mean, I presume all these commits are going to get squashed away on merge anyway.
* (QENG-3919) Update API docs
We consider it a bug that the actual behavior was not this behavior, but the
documentation was also silent on this point.
* (QENG-3919) minor readability tweak in refactored method
* (QENG-3919) Clean up interim comments re: status codes
* (QENG-3919) Drop now-orphaned `checkout_vm` method
We kept this up-to-date while we were upgrading and refactoring, but, turns out,
this method is no longer called anywhere. 💀🔥
* (QENG-3919) Return 503 status on failed allocation
Making sure we go back to the original functionality, which was:
- status 200 when vms successfully allocated
- status 404 when a pool name is unknown
- status 404 when no pool name is specified
- status 503 when vm allocation failed
* (QENG-3919) add net-ldap to Gemfile
Maybe we shouldn't foil-ball gems onto servers.
* (QENG-3919) Turns out, spush isn't a redis command
And hence we see once again the weakness of mockist tests.
* (QENG-3919) Pin the net-ldap gem to 0.11 for the jrubies, etc.
* (QENG-3919) Correct an old spelling error in spec descriptions
* (QENG-3919) Further tweak net-ldap version
* (QENG-3919) return_single_vm -> return_vm_to_ready_state
cc @shermdog
* (RE-7014) Add support for statsd
The way we were using graphite was incorrect for the type of data we were sending it. statsd is the appropriate mechanism for our needs.
statsd and graphite are mutually exclusive and configuring statsd will take precedence over graphite. Example configuration is in vmpooler.yaml.example.
* (RE-7014) Add tracking of vm gets via statsd
Add the tracking of successful, failed, invalid, and empty pool vm gets. It is possible we may want to tweak this, but it has been validated with spec tests and pcaps.
```
vmpooler-tmp-dev.ready.debian-7-x86_64:1|c
vmpooler-tmp-dev.running.debian-7-x86_64:1|c
vmpooler-tmp-dev.checkout.invalid:1|c
vmpooler-tmp-dev.checkout.success.debian-7-x86_64:1|c
vmpooler-tmp-dev.checkout.empty:1|c
vmpooler-tmp-dev.running.debian-7-x86_64:1|c
vmpooler-tmp-dev.clone.debian-7-x86_64:12.10|ms
vmpooler-tmp-dev.ready.debian-7-x86_64:1|c
```
* (RE-7014) statsd nitpicks and additional rspec
Cleaned up some code review nitpicks and added pool_manager_spec for empty pool.
* (RE-7014) update statsd to use gauge for running/ready
Previously it was using increment, which was incorrect for that particular application.
* Revert "Merge pull request #155 from shermdog/RE-7014-cinext"
This reverts commit cc03a86f6a, reversing
changes made to 5aaab7c5c2.
* (QENG-4070) Consistently return 503 if valid pool is empty
There were several problems with how the pooler checked out vms with
respect to empty pools, invalid pools, and aliases:
- If the vmpooler config did not contain any aliases and the caller
requested a vm from an empty pool or a non-existent one, the vmpooler
would error with:
NoMethodError - undefined method `[]' for nil:NilClass
If the config contained a non-nil alias section, then:
- If the caller requested a vm from an empty pool and either the vm
didn't have an alias or the aliased pool was empty or non-existent, then
the request for that vm would be silently ignored. The vmpooler would
return 200 if the caller asked for multiple vms and the vmpooler was
able to checkout at least one vm. Otherwise it would return 404.
- Similarly, if the caller requested a vm from a non-existent pool, then
the request was silently ignored.
This commit adds a `pool_names` Set to the config containing all valid
pool names including aliases. This is used to determine whether a
requested template name is valid or not. This is necessary because redis
does not distinguish between empty and non-existent sets, e.g. the
following returns false in both cases:
backend.exists('vmpooler__ready__' + key)
If the caller requests a vm (single or multiple), and any vm references
an invalid pool name, we immediately return 404. Otherwise, we know the
request is for valid pool names, since the vmpooler requires a restart
to change pool names and counts.
We then attempt to acquire each vm, trying to match on pool name or
failing back to aliased pool name, as was the previous behavior.
The resulting behavior is:
- If the caller asks for at least one vm from an unknown pool, then
don't try to checkout any vms and respond with 404.
- If the caller asks for a vm, and at least one pool is empty, then
respond with 503, returning checked out vms back to the pool.
- Otherwise return 200 with the list of checked out vms.
This commit also makes `alias` optional again.
This commit also re-enables tests that were merged in from master, but
originally commented out due to the bugs described above.
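A minimal sketch of the `pool_names` Set described above; the construction details beyond what the text states are assumptions:

```ruby
require 'set'

# Illustrative only: every valid pool name, including aliases.
def valid_pool_names(config)
  names = Set.new
  config[:pools].each do |pool|
    names << pool['name']
    names.merge(Array(pool['alias']))
  end
  names
end

# The request is rejected with 404 before any checkout is attempted if any
# requested name falls outside this set.
```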
* (maint) Add json pessimistic pin
json 2.0.x was released on July 1 and is not compatible with ruby < 2.0.
Since we still support that version, add a pessimistic pin, which is
what we were using prior to July 1.
* [QENG-4070] Make json version conditional on RUBY_VERSION
* Drop extraneous mocks from updated test
* Revert "Revert "Merge pull request #155 from shermdog/RE-7014-cinext""
This reverts commit 0fd6fff934.
* Fix some spec errors
These were caused in part by dropping changes from the original PR when we
dropped the v1_spec.rb master test file (in favor of the updated and separated
versions).
* [QENG-4075] Fix bug with template name on allocation failure
We're returning [nil,nil] in this case, meaning that name will not be set. This
means we'll get an error trying to concatenate the stats string. Use the
requested template name here instead.
* [QENG-4075] Refactor statsd methods / classes
Prior to this we could easily run into situations where `statsd_prefix` would
be `nil` (and possibly the `statsd` handle itself). There was some significant
complexity and brittleness in how statsd was set up.
Refactored so that:
- `statsd_prefix` is no longer exposed to any callers of statsd methods
- there is now a `Vmpooler::DummyStatsd` class which can be returned when we are not actually going to publish stats, but would like to keep the calling interface consistent (see the sketch after this list)
- setup of the statsd handle is via just passing in `config[:statsd]`; if `nil`, this will result in a dummy handle being returned
- defaulting of `server` values was fixed -- this did not actually work in the previous implementation. `config[:statsd][:server]` is now required.
- tests use a `DummyStatsd` instance instead of an rspec double.
- calls to `statsd.increment` were taking incorrect arguments (some our fault, some part of the prior implementation), and were not collecting data on which pools were "invalid" or "empty". Fixed this and are now explicitly tracking the invalid/empty pool names.
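For illustration, the shape of such a null-object metrics class might look like the sketch below; this is not the actual body of `Vmpooler::DummyStatsd`:

```ruby
# Illustrative only: accepts the same calls as the real metrics classes
# and silently does nothing.
module Vmpooler
  class DummyStatsd
    def increment(*)
      true
    end

    def gauge(*)
      true
    end

    def timing(*)
      true
    end
  end
end
```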
* [QENG-4075] Drop now-superfluous :statsd config defaulting
* [QENG-4075] Unify graphite and statsd for the pool manager
Prior to this, the `pool_manager.rb` library could take handles for both
graphite and statsd endpoints (which were considered mutually exclusive),
and then would use one. There was a bevy of conditional logic around sending
metrics to the graphite/statsd handles (and actually at least one bug of
omission).
Here we refactor more, building on earlier work:
- Our graphite class comes into line with the API of our Statsd and DummyStatsd classes
- In `pool_manager.rb` we now accept a single "metrics" handle, and we drop all the conditional logic around statsd vs. graphite
- We move the inconsistent error handling out of the calling classes and into our metrics classes, actually logging to `$stderr` when we can't publish metrics
- We unify the setup code to use `config` to determine whether statsd, graphite, or a dummy metrics handle should be used, and make that happen.
- Cleaned up some tests. We could probably stand to do a bit more work in this area.
* [QENG-4075] Clean up pool manager, specs
Prior to this, `pool_manager.rb` allowed the `metrics` argument to be optional,
but at this point it will be an instance of `Vmpooler::Statsd`,
'Vmpooler::Graphite', or `Vmpooler::DummyStatsd`, so making this non-optional.
Cleaned up that file's tests, cosmetically, as well as recognizing that the
behavioral difference between graphite and statsd does not depend on the pool
manager.
* [QENG-4075] update example vmpooler.yaml file
This documents the changes to :server being mandatory for all metrics
endpoints, as well as the graphite endpoint supporting an optional :port
configuration value.
* [QENG-4075] Rename usages of statsd -> metrics
Really, let's just support a generic metrics interface.
* (maint) move statsd-ruby require into Vmpooler::Statsd class
We've managed to move mentions of this out of the calling code, so let's
move the require.
* (maint) metrics.log -> metrics.timing
We missed this during the refactoring. Bringing this up to date.
* [QENG-4075] Allow specifying 'graphs:' for dashboard
Prior to this the dashboard front-end would use the configuration settings
for `graphite[:server]`/`graphite[:prefix]` to locate a graphite server
to use for rendering graphs.
Now that we have multiple possible metrics backends, the front-end graph
host for the dashboard could be entirely different from the back-end metrics
server that we publish to (if any).
This decouples those settings:
- use `graphs[:server]` / `graphs[:prefix]` for the graphite-compatible web front-end to use for dashboard display graphs
- fall back to `graphite[:server]`/`graphite[:prefix]` if `graphs` is not specified, in order to support legacy `vmpooler.yaml` configurations.
Note that since `statsd` takes precedence over `graphite`, it's possible to specify both `statsd` (for publishing) and `graphite` (for reading). We still prefer `graphs` over `graphite`.
Updated the example `vmpooler.yaml` config file.
* (maint) fix variable reference in new_metrics
This was referencing config directly, when what we want is for a
hash to be passed in (derived from config).
* (maint) Fix typo in updated graph link call
* (maint) default :graphs prefix to 'vmpooler'
* (maint) Fix parse error in vmpooler script
The things you find through manual QA 🧌
* (maint) use strings instead of symbols in config
Nested hash data comes back with string keys, not symbols. Be consistent.
* [QENG-4075] Factor out Vmpooler::DummyStatsd
This makes it visible to lib/vmpooler.rb, as well as putting this dummy
metrics endpoint in its own file for easier discovery.
* (maint) clean up statsd inclusion and require lines
The library is actually required as 'statsd' and not 'ruby-statsd', best I can tell.
* (maint) construct ::Statsd instead of Statsd
Because it's ambiguous in this scope, and, well, it doesn't
actually work in production.
* [QENG-4075] Also track completely invalid requests
When we don't even get a pool name we still want metrics to be recorded.
Add an additional disk to a running VM via the vmpooler API.
````
$ curl -X POST -H X-AUTH-TOKEN:a9znth9dn01t416hrguu56ze37t790bl --url vmpooler.company.com/api/v1/vm/fq6qlpjlsskycq6/disk/8
````
````json
{
"ok": true,
"fq6qlpjlsskycq6": {
"disk": "+8mb"
}
}
````
Provisioning and attaching disks can take a moment, but once the task completes it will be reflected in a `GET /vm/<hostname>` query:
````
$ curl --url vmpooler.company.com/api/v1/vm/fq6qlpjlsskycq6
````
````json
{
"ok": true,
"fq6qlpjlsskycq6": {
"template": "debian-7-x86_64",
"lifetime": 2,
"running": 0.08,
"state": "running",
"disk": [
"+8mb"
],
"domain": "delivery.puppetlabs.net"
}
}
````
The following pool configuration would allow a pool to be aliased in POST
requests as 'centos-6-x86_64', 'centos-6-amd64', or 'centos-6-64':
````yaml
- name: 'centos-6-x86_64'
alias: [ 'centos-6-amd64', 'centos-6-64' ]
template: 'templates/centos-6-x86_64'
folder: 'vmpooler/centos-6-x86_64'
datastore: 'instance1'
size: 5
````
The 'alias' configuration can be either a string or an array.
Note that even when requesting an alias, the pool's 'name' is returned in
the JSON response:
````
$ curl -d '{"centos-6-64":"1"}' --url vmpooler/api/v1/vm
````
````json
{
"ok": true,
"centos-6-x86_64": {
"hostname": "cuna2qeahwlzji7"
},
"domain": "company.com"
}
````