Closed Bug 1658297 Opened 5 years ago Closed 3 years ago

gcp builds fail to get workers since August 7th

Categories

(Release Engineering :: Firefox-CI Administration, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Assigned: tomprince)

Attachments

(1 file)

That time correlates with https://hg.mozilla.org/ci/ci-configuration/rev/40e3c565b9f64672e34fbc3814b4b37ec94b15fd

I do see at least one worker of that type running tasks recently:
https://firefox-ci-tc.services.mozilla.com/tasks/QB9z4ClPSdWmwvOICSWwng
https://firefox-ci-tc.services.mozilla.com/tasks/cGiNQvPORwSTbcL5vE3GXQ

So it looks like things are OK now? :bhearsum, any ideas what's up here?

Flags: needinfo?(bhearsum)
Component: Operations and Service Requests → Firefox-CI Administration
Product: Taskcluster → Release Engineering

I think those are the new docker-worker (d-w) images we made for the artifact upload fix. They were untested on our end, so I'm not completely surprised if they broke. This might be a tc-side issue after all.

The exceptions are DEADLINE_EXCEEDED, so it seems like the pool was having trouble spinning up. Why that would be is unclear to me, but perhaps it's a one-time thing when we deploy new images in GCP? Adding Tom, in case he has any insight.

Flags: needinfo?(bhearsum) → needinfo?(mozilla)

Looking at https://firefox-ci-tc.services.mozilla.com/worker-manager/gecko-3%2Fb-linux-gcp/errors, it appears that the new image is bigger than the old image, which prevents the workers from provisioning.

Flags: needinfo?(mozilla)
Assignee: nobody → mozilla
Status: NEW → ASSIGNED
Pushed by bhearsum@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/12f116b3c893
Use a large system disks for gcp workers; r=bhearsum

For the record, the error is

Invalid value for field 'resource.disks[0].initializeParams.diskSizeGb': '10'. Requested disk size cannot be smaller than the image size (20 GB)
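
In case it helps anyone scanning the worker-manager errors page for the same failure, here is a minimal sketch that pulls the two numbers out of that message. The regex and the function name `parse_disk_size_error` are my own for illustration, not part of Taskcluster or any GCP client library, and the pattern only assumes the wording quoted above.

```python
import re

# Pattern written against the exact error wording quoted above (an assumption:
# GCP may phrase this differently in other contexts).
_DISK_ERROR = re.compile(
    r"diskSizeGb': '(?P<requested>\d+)'.*?"
    r"smaller than the image size \((?P<image>\d+) GB\)"
)


def parse_disk_size_error(message):
    """Return (requested_gb, image_gb), or None if the message does not match."""
    match = _DISK_ERROR.search(message)
    if match is None:
        return None
    return int(match.group("requested")), int(match.group("image"))


error = (
    "Invalid value for field 'resource.disks[0].initializeParams.diskSizeGb': '10'. "
    "Requested disk size cannot be smaller than the image size (20 GB)"
)
requested_gb, image_gb = parse_disk_size_error(error)
# How many more GB the disk needs before the image fits.
shortfall_gb = image_gb - requested_gb
print(f"requested {requested_gb} GB, image needs {image_gb} GB, short by {shortfall_gb} GB")
```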

I don't have permission to look at machine images in either of the fxci projects, so I can't tell what the old image size was.

My guess would be that the image size difference is due to
https://github.com/taskcluster/taskcluster/pull/3312

That PR switched:

  • tc-proxy 5.1.0 -> v36.0.0 (7.7MB -> 8.7MB)
  • livelog v4 -> v36.0.0 (13.3MB -> 10.2MB)

so maybe that's not a good guess.
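
As a quick sanity check on that, a rough sketch summing the two size changes listed above. The numbers are the ones from the list; the dictionary layout is only for illustration.

```python
# Binary sizes in MB as listed above: (before, after).
sizes_mb = {
    "tc-proxy": (7.7, 8.7),
    "livelog": (13.3, 10.2),
}

# Net effect of the two version bumps on the image contents.
net_change_mb = sum(after - before for before, after in sizes_mb.values())
print(f"net change: {net_change_mb:+.1f} MB")
```

The net change is a small decrease, while the error shows the image needing 20 GB against a 10 GB disk, so these two binaries cannot account for the growth on their own.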

https://github.com/taskcluster/monopacker/pull/70 might also be responsible -- that packages the appropriate version of node.

Anyway, bumping the root disk size probably makes sense.
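
For what it's worth, the check GCP applies could be approximated before a config lands. This is only a sketch: `undersized_disks` is a hypothetical helper, the dict shape mirrors the `disks[0].initializeParams.diskSizeGb` path from the error rather than the actual ci-configuration schema, the image size has to be supplied by whoever can read it, and it assumes every listed disk is created from that image.

```python
def undersized_disks(instance, image_size_gb):
    """Return (index, requested_gb) for each disk smaller than the image."""
    problems = []
    for index, disk in enumerate(instance.get("disks", [])):
        params = disk.get("initializeParams", {})
        requested = params.get("diskSizeGb")
        # A missing diskSizeGb is skipped: nothing was requested explicitly.
        if requested is not None and int(requested) < image_size_gb:
            problems.append((index, int(requested)))
    return problems


# The value from the error message, and the smallest value that would pass.
old_config = {"disks": [{"initializeParams": {"diskSizeGb": 10}}]}
bumped_config = {"disks": [{"initializeParams": {"diskSizeGb": 20}}]}

print(undersized_disks(old_config, image_size_gb=20))
print(undersized_disks(bumped_config, image_size_gb=20))
```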

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
QA Contact: mgoossens
Resolution: --- → FIXED