Closed Bug 1488148 Opened 6 years ago Closed 6 years ago

Rework images used for taskcluster

Categories

(NSS :: Test, enhancement)

3.40
enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mt, Assigned: mt)

References

Details

Attachments

(1 file)

The current head of clang isn't compiling correctly.  We should fix the version of clang that we compile against to something less risky.  Ubuntu 18.04 uses clang-6.0 as a default, which seems reasonable.  We're currently on 16.04, which is perhaps a little out of date, so maybe update the base docker image at the same time.

Why we decided that pulling the latest source was a good idea is a mystery, but maybe that was the only way to get a reliable build for fuzzing.
OK, I think that I am close:

https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=4bea4dd4f3cd4e41c86682e478ead7f499f1fe8f&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable&filter-resultStatus=retry&filter-resultStatus=success

Apart from one thing: https://github.com/google/sanitizers/issues/954 

Ubuntu 18.04 has a recent version of clang, but that version is incompatible with the glibc version (2.27).  That manifests in the above bug.

The fix <https://github.com/llvm-project/llvm-project-20170507/commit/a8112539b3e38fa2503eacb41aacdbce98391a3b> isn't in an LLVM release yet.  So our options are to build a special image for fuzzing 32-bit versions, disabling 32-bit fuzzing for a while, or building clang from source (we've done this before).  Downgrading glibc doesn't seem feasible.

We'll see if we can wait this one out, but we have failing builds right now, so it can't take too long.
Assignee: nobody → martin.thomson
This does some fairly major restructuring of the docker images we use for CI.
The genesis of the change is that we were pulling a version of clang that didn't
work for fuzzing tests.  It turns out that is a use case that is not
well-supported anyway, and we have open bugs on it, but this installs
workarounds for all the problems I found.

Firstly, our images were bloated.  This slims down most of the images.  The
biggest gains are in the clang-format image (down to around a fifth of its
previous size).  The main linux image we use for building and running tests is
also less than half its original size.

To achieve that, I had to make two new images.  One for all the esoteric builds
we run (we compile with multiple gcc and clang versions, and I've added some
more to that list).  That's a fairly sizeable image.  The other is for the
interop and bogo suites, where we rely on having Rust and go available.  go adds
a tidy 250Mb to an image, and Rust adds 750Mb.  Using an image with both of
those for regular builds can't be good for performance.  I didn't expect to see
real performance gains here, but the Linux build (32-bit, default config) went
from 4:18 down to 2:42 (roughly).  The bulk of that time is accounted for by
downloading the docker image, so it's clear that an optimization worth spending
the time on.

Secondly, we had a lot of custom configuration stuff in the builds.  This
removes most of that in favour of using stock Ubuntu packages from 18.04.

The one exception here is - I hope - temporary.  As noted in the bug comments,
the current release of LLVM 6 has a bug where you can't run address sanitizer on
a 32-bit machine if it has glibc 2.27 (which Ubuntu 18.04 does).  That's fairly
crippling because we need a newer version of LLVM than runs by default on Ubuntu
16.04, so we're stuck with installing a non-stock version for 32-bit runs.  I've
opted to (temporarily) run 16.04 with an LLVM from the LLVM project.

The final change, which is minor, but a little odd and worth noting: the images
come with a pre-modified /etc/hosts file.  This was originally part of the
post-run scripts that taskcluster runs, which makes little sense.  I also
discovered some problems with name resolution in some of the builds related to
this.  Adding the -4 option to tstclnt to the chains script, as we have
everywhere else, seems to fix it.  (The scripts still include the same tweak,
for the aarch64 builds, which seem to be a rule unto themselves, but that's a
matter for bug 1488325.)

There is still more work to be done for the HACL* and SAW builds, which use some
very strange configurations.  Also, all of the aarch64 images aren't built
automatically, so we use images from Franziskus' dockerhub account.  This is not
good.  After digging around a little, there's probably something to be done with
QEMU, but I decided that was a major project.
Comment on attachment 9006447 [details]
Bug 1488148 - Rework CI images for clang version stability, r?jcj,dipen

J.C. Jones [:jcj] (he/him) has approved the revision.
Attachment #9006447 - Flags: review+
Hi Pete,

I'm having a problem with the way that taskcluster runs these docker images that you might be able to help with.

When running tests, NSS has two requirements for the hostname:

1. it resolves to 127.0.0.1
2. it has a dot in it

We've been running with "localhost.localdomain" for the longest time, but I don't understand how it works.  And it's started to break.  There are lines in our scripts that attempt to add this to /etc/hosts, but I'm fairly sure that this isn't happening (we run tests as an unprivileged user, and only run that code when we are root).

Ideally I would like to have docker run with the `--add-host nss.test:127.0.0.1` argument, though I see that taskcluster uses a third-party module for managing containers and I can't see a way to make that happen.  I would settle for being able to set the hostname to a dotted name.  What do you recommend?  Do you know how this worked in the past?
Flags: needinfo?(pmoore)
As an aside, https://github.com/taskcluster/docker-worker/blob/master/config.yml#L171 gives the impression that we were getting a synthesized hostname.  Something like blah.taskcluster-worker.net.  That doesn't seem to be the case when I test it.  It looks like there isn't any tweaking of the hostname at all: when I disable our overrides for HOST and DOMSUF, I get (for example) HOST="1ef45bb94e5a" DOMSUF="(none)".

I'm guessing that is because the value is assigned to a variable that is never read: https://github.com/taskcluster/docker-worker/blob/master/src/lib/task.js#L861
Ahh, now I know why we don't have this problem currently.  We don't have "USER worker" in the Dockerfile for the image that we use for running builds.  This is bad, but it explains what I'm seeing.  If I have to remove that from the new images here, that would be not-awesome, but it offers a way out.  We'll see what comes of my request in comment 4 first.
Summary: Don't use clang HEAD for CI → Rework images used for taskcluster
Hi Martin,

:wcosta is our domain expert here and will know much more than me. :-)
Flags: needinfo?(pmoore) → needinfo?(wcosta)
Comment on attachment 9006447 [details]
Bug 1488148 - Rework CI images for clang version stability, r?jcj,dipen

Dipen Patel [:dipen] has approved the revision.
Attachment #9006447 - Flags: review+
(In reply to Martin Thomson [:mt:] from comment #4)
> Hi Pete,
> 
> I'm having a problem with the way that taskcluster runs these docker images
> that you might be able to help with.
> 
> When running tests, NSS has two requirements for the hostname:
> 
> 1. it resolves to 127.0.0.1
> 2. it has a dot in it
> 
> We've been running with "localhost.localdomain" for the longest time, but I
> don't understand how it works.  And it's started to break.  There are lines
> in our scripts that attempt to add this to /etc/hosts, but I'm fairly sure
> that this isn't happening (we run tests as an unprivileged user, and only
> run that code when we are root).
> 
> Ideally I would like to have docker run with the `--add-host
> nss.test:127.0.0.1` argument, though I see that taskcluster uses a
> third-party module for managing containers and I can't see a way to make
> that happen.  I would settle for being able to set the hostname to a dotted
> name.  What do you recommend?  Do you know how this worked in the past?

I am afraid I didn't understand the issue. "localhost.localdomain" used to resolve to 127.0.0.1 and not anymore?
Flags: needinfo?(wcosta) → needinfo?(martin.thomson)
Sorry, I think that I can explain more clearly now.  I was confused by some apparent strangeness.

We currently meet those requirements by running the container as root (no "USER worker" in the Dockerfile) and we then run `echo 127.0.0.1 localhost.localdomain >> /etc/hosts`.  I would like to be able to run without root access if possible - the way we manage that causes our setup to be brittle and inconsistent.  However, to do that, we need a way to add hosts at the time of docker invocation because of the weird requirement to have a dot in the name.

That is, when you run the image in the worker, can you set the hostname please?  Or, can you allow us to set it through configuration?  Ideally, we'd have a way to do the equivalent of --add-host, but I would settle for --hostname.

As I said in comment 5, it looks like docker-worker attempts to set a hostname, but fails to.  The name that it apparently tries to use would be fine, but it doesn't do anything.  Fixing that might do.
Flags: needinfo?(martin.thomson) → needinfo?(wcosta)
I deployed in hg-worker a new docker-worker that resolves localhost.localdomain to 127.0.0.1, could you please give it a try?
Flags: needinfo?(wcosta)
Thanks!  That works nicely.  It even worked on our aarch64 images.
Commit pushed to master at https://github.com/taskcluster/docker-worker

https://github.com/taskcluster/docker-worker/commit/48f7d78c25f2cce22290d8c85fda973d693562b8
Bug 1488148: Add localhost.localdomain host name

This is a requirement for nss-try. We could work on allowing additional
configuration from the VM using the user data, but as we aim to replace
docker-worker, lets postpone more elaborate configs for newer worker.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
This ain't resolved.  We still have to land the other pieces yet.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Took a while, but here we go: https://hg.mozilla.org/projects/nss/rev/e56241dc5b61300dc426884b7c3cf50f28e9586b
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Version: 3.39 → 3.40
And because no big landing is complete without at least one follow-up: https://hg.mozilla.org/projects/nss/rev/a74229a54349be704bdede56b6f567e0988b9653
See Also: → 1559766
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: