We very often see situations where:
- all EC2 slots are used and some (or even most) instances are idle but kept running until the end of the paid hour,
- some jobs are blocked in the queue because they need to run on EC2 on specific instance types that are not running or not idle.
While I understand the need to optimise cost and startup speed, this causes a long feedback loop.
In the worst-case scenario a job can be stuck for more than 30 minutes while all EC2 agents sit idle. The EC2 agent can then take as much as 20 minutes to start (spot-instance bidding time + VM startup). So we create nearly an hour of wait time to save a few cents...
Please consider the following improvement to ease the pain for those corner cases:
IF all the EC2 slots are used
AND there is a job in the queue that can only run in EC2
AND no EC2 image currently running has the right requirements
THEN disable and kill an idle EC2 image
The exact algorithm for selecting which idle EC2 image to kill is left as an exercise for the reader: the most represented type among idle images? The longest idle?
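To make the proposal concrete, here is a minimal sketch of the selection logic in Python. Everything here is hypothetical (the `Instance`/`Job` data model and `pick_instance_to_kill` are illustrative names, not part of any real scheduler API), and it combines the two heuristics floated above: pick the most represented idle type, then the longest-idle instance of that type.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime

# Hypothetical data model for illustration only.
@dataclass
class Instance:
    instance_type: str
    busy: bool
    idle_since: datetime  # when the instance last became idle

@dataclass
class Job:
    required_type: str  # the EC2 instance type this job needs

def pick_instance_to_kill(instances, queued_jobs):
    """Return an idle instance to terminate, or None.

    Mirrors the IF/AND conditions above: the caller has already checked
    that all EC2 slots are used; this function checks that some queued
    job needs a type that no running instance provides, then picks a
    victim among the idle instances.
    """
    running_types = {i.instance_type for i in instances}
    blocked = [j for j in queued_jobs if j.required_type not in running_types]
    if not blocked:
        return None  # every queued job can be served by a running instance
    idle = [i for i in instances if not i.busy]
    if not idle:
        return None  # nothing safe to kill
    # Heuristic: among idle instances, take the most represented type,
    # then the longest-idle instance of that type.
    counts = Counter(i.instance_type for i in idle)
    most_common_type, _ = counts.most_common(1)[0]
    candidates = [i for i in idle if i.instance_type == most_common_type]
    return min(candidates, key=lambda i: i.idle_since)
```

The trade-off baked into this heuristic is deliberate: killing an instance of the most represented idle type hurts the least, since other instances of that type remain available for future jobs.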