Closed
Bug 1488148
Opened 6 years ago
Closed 6 years ago
Rework images used for taskcluster
Categories
(NSS :: Test, enhancement)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mt, Assigned: mt)
References
Details
Attachments
(1 file)
The current head of clang isn't compiling correctly. We should fix the version of clang that we compile against to something less risky. Ubuntu 18.04 uses clang-6.0 as a default, which seems reasonable. We're currently on 16.04, which is perhaps a little out of date, so maybe update the base docker image at the same time. Why we decided that pulling the latest source was a good idea is a mystery, but maybe that was the only way to get a reliable build for fuzzing.
Assignee | ||
Comment 1•6 years ago
|
||
OK, I think that I am close: https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=4bea4dd4f3cd4e41c86682e478ead7f499f1fe8f&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable&filter-resultStatus=retry&filter-resultStatus=success Apart from one thing: https://github.com/google/sanitizers/issues/954 Ubuntu 18.04 has a recent version of clang, but that version is incompatible with the glibc version (2.27). That manifests in the above bug. The fix <https://github.com/llvm-project/llvm-project-20170507/commit/a8112539b3e38fa2503eacb41aacdbce98391a3b> isn't in an LLVM release yet. So our options are to build a special image for fuzzing 32-bit versions, disabling 32-bit fuzzing for a while, or building clang from source (we've done this before). Downgrading glibc doesn't seem feasible. We'll see if we can wait this one out, but we have failing builds right now, so it can't take too long.
Assignee: nobody → martin.thomson
Assignee | ||
Comment 2•6 years ago
|
||
This does some fairly major restructuring of the docker images we use for CI. The genesis of the change is that we were pulling a version of clang that didn't work for fuzzing tests. It turns out that is a use case that is not well-supported anyway, and we have open bugs on it, but this installs workarounds for all the problems I found. Firstly, our images were bloated. This slims down most of the images. The biggest gains are in the clang-format image (down to around a fifth of its previous size). The main linux image we use for building and running tests is also less than half its original size. To achieve that, I had to make two new images. One for all the esoteric builds we run (we compile with multiple gcc and clang versions, and I've added some more to that list). That's a fairly sizeable image. The other is for the interop and bogo suites, where we rely on having Rust and go available. go adds a tidy 250Mb to an image, and Rust adds 750Mb. Using an image with both of those for regular builds can't be good for performance. I didn't expect to see real performance gains here, but the Linux build (32-bit, default config) went from 4:18 down to 2:42 (roughly). The bulk of that time is accounted for by downloading the docker image, so it's clear that an optimization worth spending the time on. Secondly, we had a lot of custom configuration stuff in the builds. This removes most of that in favour of using stock Ubuntu packages from 18.04. The one exception here is - I hope - temporary. As noted in the bug comments, the current release of LLVM 6 has a bug where you can't run address sanitizer on a 32-bit machine if it has glibc 2.27 (which Ubuntu 18.04 does). That's fairly crippling because we need a newer version of LLVM than runs by default on Ubuntu 16.04, so we're stuck with installing a non-stock version for 32-bit runs. I've opted to (temporarily) run 16.04 with an LLVM from the LLVM project. The final change, which is minor, but a little odd and worth noting: the images come with a pre-modified /etc/hosts file. This was originally part of the post-run scripts that taskcluster runs, which makes little sense. I also discovered some problems with name resolution in some of the builds related to this. Adding the -4 option to tstclnt to the chains script, as we have everywhere else, seems to fix it. (The scripts still include the same tweak, for the aarch64 builds, which seem to be a rule unto themselves, but that's a matter for bug 1488325.) There is still more work to be done for the HACL* and SAW builds, which use some very strange configurations. Also, all of the aarch64 images aren't built automatically, so we use images from Franziskus' dockerhub account. This is not good. After digging around a little, there's probably something to be done with QEMU, but I decided that was a major project.
Comment 3•6 years ago
|
||
Comment on attachment 9006447 [details] Bug 1488148 - Rework CI images for clang version stability, r?jcj,dipen J.C. Jones [:jcj] (he/him) has approved the revision.
Attachment #9006447 -
Flags: review+
Assignee | ||
Comment 4•6 years ago
|
||
Hi Pete, I'm having a problem with the way that taskcluster runs these docker images that you might be able to help with. When running tests, NSS has two requirements for the hostname: 1. it resolves to 127.0.0.1 2. it has a dot in it We've been running with "localhost.localdomain" for the longest time, but I don't understand how it works. And it's started to break. There are lines in our scripts that attempt to add this to /etc/hosts, but I'm fairly sure that this isn't happening (we run tests as an unprivileged user, and only run that code when we are root). Ideally I would like to have docker run with the `--add-host nss.test:127.0.0.1` argument, though I see that taskcluster uses a third-party module for managing containers and I can't see a way to make that happen. I would settle for being able to set the hostname to a dotted name. What do you recommend? Do you know how this worked in the past?
Flags: needinfo?(pmoore)
Assignee | ||
Comment 5•6 years ago
|
||
As an aside, https://github.com/taskcluster/docker-worker/blob/master/config.yml#L171 gives the impression that we were getting a synthesized hostname. Something like blah.taskcluster-worker.net. That doesn't seem to be the case when I test it. It looks like there isn't any tweaking of the hostname at all: when I disable our overrides for HOST and DOMSUF, I get (for example) HOST="1ef45bb94e5a" DOMSUF="(none)". I'm guessing that is because the value is assigned to a variable that is never read: https://github.com/taskcluster/docker-worker/blob/master/src/lib/task.js#L861
Assignee | ||
Comment 6•6 years ago
|
||
Ahh, now I know why we don't have this problem currently. We don't have "USER worker" in the Dockerfile for the image that we use for running builds. This is bad, but it explains what I'm seeing. If I have to remove that from the new images here, that would be not-awesome, but it offers a way out. We'll see what comes of my request in comment 4 first.
Assignee | ||
Updated•6 years ago
|
Summary: Don't use clang HEAD for CI → Rework images used for taskcluster
Comment 7•6 years ago
|
||
Hi Martin, :wcosta is our domain expert here and will know much more than me. :-)
Flags: needinfo?(pmoore) → needinfo?(wcosta)
Comment 8•6 years ago
|
||
Comment on attachment 9006447 [details] Bug 1488148 - Rework CI images for clang version stability, r?jcj,dipen Dipen Patel [:dipen] has approved the revision.
Attachment #9006447 -
Flags: review+
Comment hidden (Intermittent Failures Robot) |
Comment 10•6 years ago
|
||
(In reply to Martin Thomson [:mt:] from comment #4) > Hi Pete, > > I'm having a problem with the way that taskcluster runs these docker images > that you might be able to help with. > > When running tests, NSS has two requirements for the hostname: > > 1. it resolves to 127.0.0.1 > 2. it has a dot in it > > We've been running with "localhost.localdomain" for the longest time, but I > don't understand how it works. And it's started to break. There are lines > in our scripts that attempt to add this to /etc/hosts, but I'm fairly sure > that this isn't happening (we run tests as an unprivileged user, and only > run that code when we are root). > > Ideally I would like to have docker run with the `--add-host > nss.test:127.0.0.1` argument, though I see that taskcluster uses a > third-party module for managing containers and I can't see a way to make > that happen. I would settle for being able to set the hostname to a dotted > name. What do you recommend? Do you know how this worked in the past? I am afraid I didn't understand the issue. "localhost.localdomain" used to resolve to 127.0.0.1 and not anymore?
Flags: needinfo?(wcosta) → needinfo?(martin.thomson)
Assignee | ||
Comment 11•6 years ago
|
||
Sorry, I think that I can explain more clearly now. I was confused by some apparent strangeness. We currently meet those requirements by running the container as root (no "USER worker" in the Dockerfile) and we then run `echo 127.0.0.1 localhost.localdomain >> /etc/hosts`. I would like to be able to run without root access if possible - the way we manage that causes our setup to be brittle and inconsistent. However, to do that, we need a way to add hosts at the time of docker invocation because of the weird requirement to have a dot in the name. That is, when you run the image in the worker, can you set the hostname please? Or, can you allow us to set it through configuration? Ideally, we'd have a way to do the equivalent of --add-host, but I would settle for --hostname. As I said in comment 5, it looks like docker-worker attempts to set a hostname, but fails to. The name that it apparently tries to use would be fine, but it doesn't do anything. Fixing that might do.
Flags: needinfo?(martin.thomson) → needinfo?(wcosta)
Comment 12•6 years ago
|
||
I deployed in hg-worker a new docker-worker that resolves localhost.localdomain to 127.0.0.1, could you please give it a try?
Flags: needinfo?(wcosta)
Assignee | ||
Comment 13•6 years ago
|
||
Thanks! That works nicely. It even worked on our aarch64 images.
Comment hidden (Intermittent Failures Robot) |
Comment 15•6 years ago
|
||
Commit pushed to master at https://github.com/taskcluster/docker-worker https://github.com/taskcluster/docker-worker/commit/48f7d78c25f2cce22290d8c85fda973d693562b8 Bug 1488148: Add localhost.localdomain host name This is a requirement for nss-try. We could work on allowing additional configuration from the VM using the user data, but as we aim to replace docker-worker, lets postpone more elaborate configs for newer worker.
Updated•6 years ago
|
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 16•6 years ago
|
||
This ain't resolved. We still have to land the other pieces yet.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 17•6 years ago
|
||
Took a while, but here we go: https://hg.mozilla.org/projects/nss/rev/e56241dc5b61300dc426884b7c3cf50f28e9586b
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → FIXED
Version: 3.39 → 3.40
Assignee | ||
Comment 18•6 years ago
|
||
And because no big landing is complete without at least one follow-up: https://hg.mozilla.org/projects/nss/rev/a74229a54349be704bdede56b6f567e0988b9653
You need to log in
before you can comment on or make changes to this bug.
Description
•