Closed Bug 1905474 Opened 1 year ago Closed 1 year ago

PREEMPTIBLE_NVIDIA_V100_GPUS quota exceeded

Categories

(Release Engineering :: Firefox-CI Administration, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jcristau, Assigned: jcristau)

Details

Attachments

(2 files)

Since yesterday we seem to be getting spammed with QUOTA_EXCEEDED errors from worker-manager, mostly due to the PREEMPTIBLE_NVIDIA_V100_GPUS quota (128 per region), but also some for N2_CPUS (256 per region).

  • is there unusual load on the translations pools at the moment for some reason?
  • should we try to get the quota increased?
  • we should probably adjust maxCapacity for these pools to not be far above the relevant quotas
Flags: needinfo?(gtatum)
Flags: needinfo?(epavlov)
Flags: needinfo?(bhearsum)

There's no point having a maxCapacity that's higher than our GPU quota.

Assignee: nobody → jcristau
Status: NEW → ASSIGNED
Keywords: leave-open

I've made a couple of changes to the live pool configs for now.

translations-1/b-linux-large-gcp-1tb-64-512-std-d2g is taking up all the non-preemptible CPU quota with 8 running instances (maxCapacity was set to 1000).
translations-1/b-linux-v100-gpu-4 is taking up all the GPU quota with 75 running instances (maxCapacity was set to 400), and is starving other pools like translations-1/b-linux-v100-gpu-4-300gb.
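
For illustration, here's a rough sketch of the sanity check implied above: work out how many instances fit under a per-region quota and flag pools whose maxCapacity is higher than that. This is not code from ci-configuration or worker-manager. The `Pool` class, both function names, and the per-instance figures (4 GPUs and 64 vCPUs, guessed from the pool names) are assumptions, as is the idea that one unit of maxCapacity equals one instance. The number of regions is left as a parameter because the thread doesn't state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """Hypothetical, simplified view of a worker pool (not the real config schema)."""
    name: str
    max_capacity: int        # configured maxCapacity, assumed to count instances
    units_per_instance: int  # quota units (GPUs or vCPUs) one instance consumes


def quota_ceiling(quota_per_region: int, units_per_instance: int, regions: int) -> int:
    """Largest number of instances that fits under a per-region quota."""
    if units_per_instance <= 0 or regions <= 0:
        raise ValueError("units_per_instance and regions must be positive")
    # Quota is enforced per region, so round down per region before multiplying.
    return (quota_per_region // units_per_instance) * regions


def oversized_pools(pools, quota_per_region: int, regions: int):
    """Return (name, max_capacity, ceiling) for pools whose maxCapacity exceeds the ceiling."""
    flagged = []
    for pool in pools:
        ceiling = quota_ceiling(quota_per_region, pool.units_per_instance, regions)
        if pool.max_capacity > ceiling:
            flagged.append((pool.name, pool.max_capacity, ceiling))
    return flagged


if __name__ == "__main__":
    # maxCapacity values are the ones quoted in this bug; units per instance are guesses.
    gpu_pools = [Pool("translations-1/b-linux-v100-gpu-4", 400, 4)]
    cpu_pools = [Pool("translations-1/b-linux-large-gcp-1tb-64-512-std-d2g", 1000, 64)]
    # regions=1 is a placeholder; use the number of regions the pool can launch in.
    print(oversized_pools(gpu_pools, quota_per_region=128, regions=1))
    print(oversized_pools(cpu_pools, quota_per_region=256, regions=1))
```

Pools that draw on the same quota (for example the two v100 pools mentioned above) compete for it, so the ceiling really applies to their combined capacity rather than to each pool separately.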

We're training a bunch of languages now (see this dashboard), so it's expected that more machines are allocated simultaneously. Using standard CPU instances is a temporary workaround until we split the alignments task into smaller ones and switch to smaller preemptible machines. We can probably cap the number of those CPU machines at the number of languages we're training (say 20).

Flags: needinfo?(epavlov)

(In reply to Julien Cristau [:jcristau] from comment #0)

> Since yesterday we seem to be getting spammed with QUOTA_EXCEEDED errors from worker-manager, mostly due to the PREEMPTIBLE_NVIDIA_V100_GPUS quota (128 per region), but also some for N2_CPUS (256 per region).
>
>   • is there unusual load on the translations pools at the moment for some reason?

Evgeny answered this part.

>   • should we try to get the quota increased?

We've already increased the GPU quota as much as GCP allows. (I didn't follow up after that to make these adjustments though, sorry.)

>   • we should probably adjust maxCapacity for these pools to not be far above the relevant quotas

+1. Thank you for putting these patches up.

Flags: needinfo?(gtatum)
Flags: needinfo?(bhearsum)
Keywords: leave-open

Pushed by jcristau@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/d2e7bbe58115
reduce maxCapacity for translations gpu pools. r=bhearsum
https://hg.mozilla.org/ci/ci-configuration/rev/6ddb2f086b98
reduce maxCapacity for non-preemptible translations pools. r=bhearsum

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED

(In reply to Evgeny Pavlov from comment #5)

> We're training a bunch of languages now (see this dashboard)

Wrong link?

I think he meant this link.

The link gets updated occasionally but it's available at the top of this spreadsheet.
