Upgrade image builder
Categories: NSS :: Test, task, P1
Tracking: Not tracked
People: Reporter: jschanck, Assigned: ahal
References: Blocks 1 open bug
Details: Whiteboard: [nss-ci]
Attachments: 3 files
We use a docker image builder that was copied from M-C in Bug 1396772. We should try upgrading to the image builder currently used by taskgraph (mozillareleases/image_builder:5.0.0). As I mentioned in the team meeting, I think this might fix our LSAN issue (Bug 1755267).
I don't think the work on sharing files between images that was mentioned in Bug 1396772 was ever done, so upgrading might be as simple as applying
- image: "nssdev/image_builder:0.1.5",
+ image: "mozillareleases/image_builder:5.0.0",
to nss/automation/taskcluster/graph/src/image_builder.js, and then reconfiguring the environment variables. I gave this a try, but I don't understand the environment variables well enough to make it work.
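For reference, a rough sketch of where that swap lands in the task payload image_builder.js generates; the field layout below is assumed from the standard docker-worker payload schema rather than copied from the file:

// Sketch only: the image swap inside the generated docker-worker payload.
// The env block is the part that still needs to be reworked for the new builder.
const payload = {
  image: "mozillareleases/image_builder:5.0.0",  // was "nssdev/image_builder:0.1.5"
  env: {
    // to be rewired for the new image builder
  },
  maxRunTime: 3600,  // illustrative; command, artifacts, etc. unchanged
};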
Reporter
Comment 1•7 months ago
:jcristau, we're no longer able to build docker images in NSS CI. The issue (afaict) is that the version of docker used in our nssdev/image_builder:0.1.5 image only supports the deprecated Docker Hub v1 API (failure log). I cut nssdev/image_builder:0.1.7 with a new version of docker, and that works for me locally, but I can't get it to run in CI (try).
It would be great if we could use the same image builder as Firefox. I opened this bug a while ago to push us in that direction. Is this something you could help us with?
Comment 2•7 months ago
I was hoping bug 1854095 would get us there, but I guess not in time :/
We might be able to kludge our way through it, but I'm busy with other stuff at the moment... Keeping the needinfo.
For some context:
IIRC the old image_builder uses docker-in-docker, which means it relies on the host's docker daemon, which is version 1.6 or something similarly ancient.
The image_builder we currently use in gecko and taskgraph uses kaniko instead of docker, so it's not reliant on the host's version, but it has different expectations: its input comes as a tarball, exported by the decision task as an artifact, that contains the Dockerfile and relevant files from the source tree. That gets passed in to the image builder using the CONTEXT_TASK_ID and CONTEXT_PATH environment variables. It shouldn't be too hard to massage the nss task builder this way, hopefully.
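A minimal sketch of what that wiring might look like in the nss task builder, assuming the standard docker-worker payload format; the two CONTEXT_* names come from this comment, while decisionTaskId, imageName, and the exact artifact path are illustrative:

// Hypothetical inputs; in image_builder.js these would come from the task graph.
const decisionTaskId = process.env.TASK_ID;  // assumption: the builder runs inside the decision task
const imageName = "docker-clang-format";     // example image name from this bug

// Sketch only: pointing the kaniko-based image builder at the docker-context
// tarball exported by the decision task as an artifact.
const payload = {
  image: "mozillareleases/image_builder:5.0.0",
  env: {
    CONTEXT_TASK_ID: decisionTaskId,  // task that uploaded the context tarball
    CONTEXT_PATH: `public/docker-contexts/${imageName}.tar.gz`,  // artifact path on that task
  },
  // remaining payload fields (command, artifacts, maxRunTime) as before
};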
As a short-term workaround, would hardcoding https://hg.mozilla.org/projects/nss/file/tip/automation/taskcluster/graph/src/context_hash.js#l51 to April 2024 unblock things for now?
Assignee
Comment 3•7 months ago

Assignee
Comment 4•7 months ago
I made some progress here. Currently hitting:
https://firefox-ci-tc.services.mozilla.com/tasks/fiprCGAhR4O48V1704IKEQ/runs/0/logs/public/logs/live.log
Need to compare the .tar.gz docker contexts generated here with the ones generated by Taskgraph.
Assignee
Comment 5•7 months ago
Interestingly I can run the following command successfully:
docker run --rm -e CONTEXT_TASK_ID=UF7oK-roQi6ODDHviGgGpA -e CONTEXT_PATH=public/docker-contexts/docker-clang-format.tar.gz -e TASKCLUSTER_ROOT_URL=https://firefox-ci-tc.services.mozilla.com -e container=docker mozillareleases/image_builder:5.0.0
So I think the docker-context artifact is fine and the issue is related to the workers building the image. I noticed the image builders here are using linux-gcp instead of images-gcp, so maybe that's the problem? Both pools use the same VM image, but the latter has a dindImage setting. I'm not sure why that's necessary as we don't use dind, but I don't have any other ideas, so I'm going to try setting up these pools.
Assignee
Comment 6•7 months ago
Assignee
Comment 8•7 months ago
Sigh, same error on the images-gcp pool. I don't understand why:
A) The same Dockerfile works locally
B) The same image_builder image + worker pool works for other projects
Unless it's these specific Dockerfiles combined with the (presumably) old version of Docker the workers are running? But it doesn't look like there's anything very unique about any of these Dockerfiles.
Assignee
Comment 9•7 months ago
Looks like this was because the tasks still had the dind feature enabled; turning that off gets them to work again. Still testing, but looking promising.
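For the record, a hedged sketch of the kind of change involved; the dind feature name is from this comment, and the surrounding payload shape is assumed:

// Sketch only: the old docker-based builder needed docker-in-docker,
// the kaniko-based one has to run without it.
const payload = {
  image: "mozillareleases/image_builder:5.0.0",
  features: {
    // dind: true,  // removed: leaving docker-in-docker enabled broke the kaniko builder
  },
  env: { /* CONTEXT_TASK_ID / CONTEXT_PATH as above */ },
};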
Assignee
Comment 10•7 months ago

Comment 11•7 months ago
Hi,
It seems that the yaml linter found some defects in .taskcluster.yml. See here: https://phabricator.services.mozilla.com/D210674.
Do my modifications make sense? https://phabricator.services.mozilla.com/D210674#change-5vhjnTHSmNXQ
Assignee
Comment 12•7 months ago
Oops, I missed that, thanks for fixing!
This actually landed already, so you'll have to rebase and submit a new patch. Feel free to flag me for review and I can take a look.
Comment 13•7 months ago

Comment 14•7 months ago
A patch has been attached on this bug, which was already closed. Filing a separate bug will ensure better tracking. If this was not by mistake and further action is needed, please alert the appropriate party. (Or: if the patch doesn't change behavior -- e.g. landing a test case, or fixing a typo -- then feel free to disregard this message)