PREEMPTIBLE_NVIDIA_V100_GPUS quota exceeded
Categories: Release Engineering :: Firefox-CI Administration, defect
Tracking: Not tracked
People: Reporter: jcristau, Assigned: jcristau
Attachments: 2 files
Since yesterday we seem to be getting spammed with QUOTA_EXCEEDED errors from worker-manager, mostly due to the PREEMPTIBLE_NVIDIA_V100_GPUS quota (128 per region), but also some for N2_CPUS (256 per region).
- is there unusual load on the translations pools at the moment for some reason?
- should we try to get the quota increased?
- we should probably adjust maxCapacity for these pools to not be far above the relevant quotas
Comment 1 (Assignee) • 1 year ago
There's no point having a maxCapacity that's higher than our GPU quota.
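To make that concrete, here is a rough sketch of the arithmetic for picking a ceiling. The quota figures (128 GPUs and 256 N2 CPUs per region) come from the description; everything else is an assumption for illustration: that one instance counts as one unit of capacity, that the `gpu-4` pools use 4 GPUs per instance and the `64-512` pool uses 64 vCPUs per instance (guessed from the pool names), and the function name itself is hypothetical, not part of worker-manager or ci-configuration.

```python
def max_capacity_ceiling(quota_per_region: int,
                         units_per_instance: int,
                         regions: int = 1) -> int:
    """Largest instance count that still fits inside a per-region quota.

    Assumes one instance equals one unit of pool capacity and that the
    pool is spread evenly over `regions` regions (both assumptions).
    """
    if quota_per_region < 0 or units_per_instance <= 0 or regions <= 0:
        raise ValueError("quota must be >= 0; units and regions must be > 0")
    # Whole instances only: a partial instance cannot be provisioned.
    return (quota_per_region // units_per_instance) * regions


# Quotas from the description; per-instance sizes are guesses from pool names.
gpu_ceiling = max_capacity_ceiling(128, 4)    # PREEMPTIBLE_NVIDIA_V100_GPUS
cpu_ceiling = max_capacity_ceiling(256, 64)   # N2_CPUS

print(gpu_ceiling, cpu_ceiling)
```

Under those assumptions a single region fits 32 four-GPU instances and 4 of the 64-vCPU instances, so any maxCapacity well above `ceiling * regions` only produces QUOTA_EXCEEDED errors.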
Comment 2 (Assignee) • 1 year ago
Comment 3 (Assignee) • 1 year ago
I've made a couple of changes to the live pool configs for now.
Comment 4 (Assignee) • 1 year ago
translations-1/b-linux-large-gcp-1tb-64-512-std-d2g is taking up all the non-preemptible cpu quota with 8 running instances (maxCapacity was set to 1000).
translations-1/b-linux-v100-gpu-4 is taking up all the gpu quota with 75 running instances (maxCapacity was 400), and starving other pools like translations-1/b-linux-v100-gpu-4-300gb.
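A quick way to spot this kind of starvation is to compare what each pool could consume at its maxCapacity against the quota the pools share. The sketch below is illustrative only: the maxCapacity of 400 is from this comment, but the 4 GPUs per instance, the three regions, and the maxCapacity for the 300gb pool are hypothetical values, and `Pool` and `can_starve_others` are made-up names rather than anything in worker-manager.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    max_capacity: int        # instances the pool is allowed to request
    units_per_instance: int  # quota units (e.g. GPUs) per instance; assumed


def demand_at_max(pool: Pool) -> int:
    """Quota units the pool would use if it scaled to maxCapacity."""
    return pool.max_capacity * pool.units_per_instance


def can_starve_others(pool: Pool, total_quota: int) -> bool:
    """True if this pool alone could consume the entire shared quota."""
    return demand_at_max(pool) >= total_quota


# 128 GPUs per region is from the bug; 3 regions is a hypothetical figure.
total_gpu_quota = 128 * 3

pools = [
    Pool("translations-1/b-linux-v100-gpu-4", max_capacity=400, units_per_instance=4),
    # maxCapacity below is a placeholder; the bug does not state it.
    Pool("translations-1/b-linux-v100-gpu-4-300gb", max_capacity=50, units_per_instance=4),
]

for pool in pools:
    flag = "can starve others" if can_starve_others(pool, total_gpu_quota) else "ok"
    print(f"{pool.name}: up to {demand_at_max(pool)} of {total_gpu_quota} GPUs ({flag})")
```

With these numbers the first pool could ask for 1600 GPUs against a shared 384, which matches the behaviour seen here: whichever pool scales up first takes everything.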
Comment 5 • 1 year ago
We're training a bunch of languages now (see this dashboard), so it's expected that the number of machines allocated simultaneously has increased. As for standard CPU instances, it's a temporary workaround until we split the alignments task into smaller ones and switch to smaller preemptible machines. We can probably set the quota for those CPU machines to the number of languages we're training (say 20).
Comment 6 • 1 year ago
(In reply to Julien Cristau [:jcristau] from comment #0)
> Since yesterday we seem to be getting spammed with QUOTA_EXCEEDED errors from worker-manager, mostly due to the PREEMPTIBLE_NVIDIA_V100_GPUS quota (128 per region), but also some for N2_CPUS (256 per region).
> - is there unusual load on the translations pools at the moment for some reason?
Evgeny answered this part.
> - should we try to get the quota increased?
We've already increased the GPU quota as much as GCP allows. (I didn't follow up after that to make these adjustments though, sorry.)
> - we should probably adjust maxCapacity for these pools to not be far above the relevant quotas
+1. Thank you for putting these patches up.
Pushed by jcristau@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/d2e7bbe58115
reduce maxCapacity for translations gpu pools. r=bhearsum
https://hg.mozilla.org/ci/ci-configuration/rev/6ddb2f086b98
reduce maxCapacity for non-preemptible translations pools. r=bhearsum
Comment 8 (Assignee) • 1 year ago
(In reply to Evgeny Pavlov from comment #5)
> We're training a bunch of languages now (see this dashboard)
Wrong link?
Comment 9 • 1 year ago
I think he meant this link.
The link gets updated occasionally but it's available at the top of this spreadsheet.