What would you like to be added:
A dedicated API, likely a dedicated Hero/Uber ClusterQueue type or mode, for running "hero" workloads.
There is no single definition of a hero workload, but they share common characteristics: they often take >50% of the cluster capacity, run for a prolonged time (weeks), and are "super high priority". Such workloads are often used for AI training, and they effectively impact the cluster for the entire organization for that period.
Why is this needed:
There is currently no go-to setup for running "hero" workloads.
For example, suddenly obtaining >50% of the cluster quota is hard: even if the workload has "super high priority", it still cannot preempt workloads from CQs that are below their nominal quota.
Different users take different approaches, each with its own pros and cons, and eventually arrive at something that "works", but this amounts to reinventing the wheel.
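To make the quota example above concrete, here is a minimal sketch of the arithmetic. It is a deliberately simplified model, not Kueue's actual preemption algorithm: the `ClusterQueue` dataclass, the `obtainable` helper, and all numbers are hypothetical, and it assumes that only idle capacity and quota borrowed above nominal can be taken from other CQs.

```python
from dataclasses import dataclass


@dataclass
class ClusterQueue:
    name: str
    nominal: int  # nominal quota, e.g. in GPUs (hypothetical unit)
    usage: int    # quota currently in use


def obtainable(requester: str, queues: list[ClusterQueue]) -> int:
    """Capacity the requester could get in this simplified model.

    Assumption: usage at or below nominal in other CQs is protected
    from preemption, regardless of the requester's priority.
    """
    idle = sum(cq.nominal for cq in queues) - sum(cq.usage for cq in queues)
    borrowed_by_others = sum(
        max(0, cq.usage - cq.nominal) for cq in queues if cq.name != requester
    )
    return idle + borrowed_by_others


# A 100-GPU cohort where every team stays within its nominal quota.
queues = [
    ClusterQueue("team-a", nominal=40, usage=40),
    ClusterQueue("team-b", nominal=40, usage=30),
    ClusterQueue("hero", nominal=20, usage=0),
]

hero_request = 60  # more than 50% of the cohort
available = obtainable("hero", queues)
fits = available >= hero_request
```

In this scenario only 30 GPUs are obtainable, so a 60-GPU hero workload cannot be admitted no matter how high its priority is, which is the gap a dedicated API would need to close.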
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.