Closed Bug 1830415 Opened 1 year ago Closed 2 months ago

Upgrade image builder

Categories

(NSS :: Test, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jschanck, Assigned: ahal)

References

(Blocks 1 open bug)

Details

(Whiteboard: [nss-ci])

Attachments

(3 files)

We use a docker image builder that was copied from M-C in Bug 1396772. We should try upgrading to the image builder currently used by taskgraph (mozillareleases/image_builder:5.0.0). As I mentioned in the team meeting, I think this might fix our LSAN issue (Bug 1755267).

I don't think the work on sharing files between images that was mentioned in Bug 1396772 was ever done, so upgrading might be as simple as applying

-     image: "nssdev/image_builder:0.1.5",
+     image: "mozillareleases/image_builder:5.0.0",

to nss/automation/taskcluster/graph/src/image_builder.js and then reconfiguring the environment variables. I gave this a try, but I don't understand the environment variables well enough to make it work.

Severity: -- → S4
Priority: -- → P3
Whiteboard: [nss-ci]

:jcristau, we're no longer able to build docker images in NSS CI. The issue (afaict) is that the version of docker used in our nssdev/image_builder:0.1.5 image only supports the deprecated Docker Hub v1 API (failure log). I cut nssdev/image_builder:0.1.7 with a new version of docker, and that works for me locally, but I can't get it to run in CI (try).

It would be great if we could use the same image builder as Firefox. I opened this bug a while ago to push us in that direction. Is this something you could help us with?

Severity: S4 → N/A
Flags: needinfo?(jcristau)
Priority: P3 → P1

I was hoping bug 1854095 would get us there, but I guess not in time :/
We might be able to kludge our way through it, but I'm busy with other stuff at the moment. Keeping the needinfo.

For some context:
IIRC the old image_builder uses docker-in-docker, which means it relies on the host's docker daemon, which is version 1.6 or something similarly ancient.
The image_builder we currently use in gecko and taskgraph uses kaniko instead of docker, so it's not reliant on the host's version, but it has different expectations: its input is a tarball, exported as an artifact by the decision task, that contains the Dockerfile and the relevant files from the source tree. That tarball gets passed to the image builder using the CONTEXT_TASK_ID and CONTEXT_PATH environment variables. Hopefully it shouldn't be too hard to massage the nss task builder to work this way.
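
To make the two variables concrete, here is a minimal sketch of how they could be assembled. The helper names imageBuilderEnv and contextArtifactUrl are hypothetical and not part of the NSS graph code; the public/docker-contexts/<name>.tar.gz layout is assumed from the example context path that appears elsewhere in this bug, and the URL shape is the usual Taskcluster queue artifact URL.

```javascript
// Hypothetical helper (not part of the NSS task graph code): builds the
// environment a kaniko-based image_builder task expects.
function imageBuilderEnv(contextTaskId, imageName) {
  return {
    // Task that published the docker context tarball as an artifact.
    CONTEXT_TASK_ID: contextTaskId,
    // Artifact name of the tarball; this path layout is an assumption.
    CONTEXT_PATH: `public/docker-contexts/${imageName}.tar.gz`,
  };
}

// Hypothetical helper: where such an artifact would be reachable through
// the Taskcluster queue API, useful for downloading a context to inspect it.
function contextArtifactUrl(rootUrl, env) {
  return `${rootUrl}/api/queue/v1/task/${env.CONTEXT_TASK_ID}/artifacts/${env.CONTEXT_PATH}`;
}

const env = imageBuilderEnv("UF7oK-roQi6ODDHviGgGpA", "docker-clang-format");
console.log(env);
console.log(contextArtifactUrl("https://firefox-ci-tc.services.mozilla.com", env));
```

Downloading the tarball from that URL and listing its contents is one way to check that a generated context contains the Dockerfile and the files it references.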

As a short-term workaround, would hardcoding https://hg.mozilla.org/projects/nss/file/tip/automation/taskcluster/graph/src/context_hash.js#l51 to April 2024 unblock things for now?

I made some progress here. Currently hitting:
https://firefox-ci-tc.services.mozilla.com/tasks/fiprCGAhR4O48V1704IKEQ/runs/0/logs/public/logs/live.log

Need to compare the .tar.gz docker contexts generated here with the ones generated by Taskgraph.

Assignee: nobody → ahal
Status: NEW → ASSIGNED

Interestingly I can run the following command successfully:

docker run --rm -e CONTEXT_TASK_ID=UF7oK-roQi6ODDHviGgGpA -e CONTEXT_PATH=public/docker-contexts/docker-clang-format.tar.gz -e TASKCLUSTER_ROOT_URL=https://firefox-ci-tc.services.mozilla.com -e container=docker mozillareleases/image_builder:5.0.0

So I think the docker-context artifact is fine and the issue is related to the workers building the image. I noticed the image builders here are using linux-gcp instead of images-gcp, so maybe that's the problem? Both pools use the same VM image, but the latter has a dindImage setting. I'm not sure why that's necessary, as we don't use dind, but I don't have any other ideas, so I'm going to try setting up these pools.

Keywords: leave-open
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/2686eb62e566
Setup nss images pools, r=releng-reviewers,jcristau

Sigh, same error on the images-gcp pool. I don't understand why, given that:
A) The same Dockerfile works locally
B) The same image_builder image + worker pool works for other projects

Unless it's a combination of these specific Dockerfiles combined with the (presumably) old version of Docker the workers are running? But it doesn't look like there's anything very unique about any of these Dockerfiles.

Looks like this was because the tasks still had the dind feature enabled. Turning that off gets them to work again. Still testing, but it's looking promising.
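
As a rough illustration of that change, the sketch below strips the dind feature from a docker-worker style payload. The helper name withoutDind and the sample payload are hypothetical; the actual NSS graph code may structure its task definitions differently.

```javascript
// Hypothetical helper: returns a copy of a task payload with the
// docker-in-docker feature removed, leaving other features untouched.
function withoutDind(payload) {
  // Pull `dind` out of the features object and keep the rest.
  const { dind, ...otherFeatures } = payload.features || {};
  return { ...payload, features: otherFeatures };
}

// Illustrative payload only; the field values are assumptions.
const before = {
  image: "mozillareleases/image_builder:5.0.0",
  features: { dind: true, chainOfTrust: true },
};

const after = withoutDind(before);
console.log(after.features);
```

The helper copies the payload rather than mutating it, so the original definition can still be compared against the modified one while debugging.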

Attachment #9401498 - Attachment description: WIP: Bug 1830415 - Switch to the mozillareleases/image_builder image → Bug 1830415 - Switch to the mozillareleases/image_builder image, r?jcristau!
Flags: needinfo?(jcristau)

Hi,

It seems that the yaml linter found some defects in .taskcluster.yml. See here: https://phabricator.services.mozilla.com/D210674.

Do my modifications make sense? https://phabricator.services.mozilla.com/D210674#change-5vhjnTHSmNXQ

Flags: needinfo?(ahal)

Oops, I missed that, thanks for fixing!

This actually landed already, so you'll have to rebase and submit a new patch. Feel free to flag me for review and I can take a look.

Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Flags: needinfo?(ahal)
Resolution: --- → FIXED
Attachment #9403140 - Attachment description: WIP: Bug 1830415 - Modification of .taskcluster.yml due to mozlint defects → Bug 1830415 - Modification of .taskcluster.yml due to mozlint indent defects

A patch has been attached on this bug, which was already closed. Filing a separate bug will ensure better tracking. If this was not by mistake and further action is needed, please alert the appropriate party. (Or: if the patch doesn't change behavior -- e.g. landing a test case, or fixing a typo -- then feel free to disregard this message)
