Closed Bug 831491 (asan-tests) Opened 7 years ago Closed 6 years ago

run tests on ASAN builds

Categories

(Release Engineering :: General, defect)

Hardware: All
OS: Linux
Type: defect
Priority: Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: sec-want)

Attachments

(8 files)

Spinning out from bug#753148.

This bug is to track figuring out *which* testsuites to run, on *which* ASAN builds, on *which* branches. (I think I remember :decoder saying tests on opt nightly builds only for now and then revisit later, but that was a few hours ago, so asking here to verify). Once we have this list of testsuites, please reassign back. 

It would be helpful to have an idea of the duration and frequency of running these testsuites on ASAN builds, to help us with capacity planning.

To set expectations, note that our test machines are currently very overloaded, so enabling additional test load needs careful evaluation until we get more machines or can offload other existing test suites.
As discussed in the meeting, if there are resource problems with running the tests, we would be very happy if we could at least run the tests on optimized builds (they're a lot faster and therefore consume fewer resources) and only once a day.

Furthermore, we don't need to run all tests. Currently we run the following set:

reftest,crashtest,xpcshell,jsreftest,mochitests

Let me know if anything else is required :)
Depends on: 833018
Builds: Opt Linux64 on mozilla-central branch
Tests: reftest, crashtest, xpcshell, jsreftest, mochitests (from comment 1)
Assigning to John to see what the next steps are to make this happen. The tests run fine manually. There are two known broken tests (see "depends on" list) that we can disable if necessary to get this landed, but of course fixing the tests would be better.
Assignee: choller → joduinn
There are some test failures right now but no blocking issues. The same set of tests already runs daily via a scheduled try push, which can then be disabled.
As discussed, here are the times that the different tests take (extracted from a try run):


mochitest-1: elapsed: 17 mins, 28 secs
mochitest-2: elapsed: 19 mins, 10 secs
mochitest-3: elapsed: 37 mins, 51 secs
mochitest-4: elapsed: 12 mins, 40 secs
mochitest-5: elapsed: 20 mins, 23 secs
mochitest-o: elapsed: 19 mins, 5 secs
mochitest-bc: elapsed: 1 hrs, 6 mins, 39 secs
crashtest: elapsed: 13 mins, 10 secs
reftest: elapsed: 34 mins, 21 secs
jsreftest: elapsed: 21 mins, 28 secs
xpcshell: elapsed: 57 mins, 49 secs
(In reply to Christian Holler (:decoder) from comment #5)
> As discussed, here are the times that the different tests take (extracted
> from a try run):
Ok, I'm comparing these to end-to-end runs of the same jobs on an Ubuntu 64 opt build off of mozilla-central. So, this is the timing for the entire job (including setup and teardown), because that is what matters when it comes to capacity planning - i.e. how quickly we can start and finish a job.
> 
> 
> mochitest-1: elapsed: 17 mins, 28 secs
mochitest-1 on Ubuntu 64 is usually around 24 minutes. Not sure how on earth you're faster on an ASAN build than that.
> mochitest-2: elapsed: 19 mins, 10 secs
This is usually 9 minutes
> mochitest-3: elapsed: 37 mins, 51 secs
This is usually 23 minutes
> mochitest-4: elapsed: 12 mins, 40 secs
Usually 6 minutes
> mochitest-5: elapsed: 20 mins, 23 secs
Usually 11 minutes
> mochitest-o: elapsed: 19 mins, 5 secs
Usually 16 mins
> mochitest-bc: elapsed: 1 hrs, 6 mins, 39 secs
Usually 34 mins
> crashtest: elapsed: 13 mins, 10 secs
Usually 7 mins
> reftest: elapsed: 34 mins, 21 secs
Usually 28 mins
> jsreftest: elapsed: 21 mins, 28 secs
Usually 10 mins
> xpcshell: elapsed: 57 mins, 49 secs
Usually about 32 mins

We've often said that an ASAN build is about the speed of a debug build (roughly 2x slower than an opt build), and these numbers seem to bear that out. Given that we should NOT run Talos tests on an ASAN build, I don't think it would be a bad thing to turn these on per-push for linux 64. Given that linux is a platform we can put in the cloud, that would provide some amount of useful data to developers about their patches with the least amount of impact to the overall automation infrastructure.

I would encourage you (decoder) to dig into that mochitest-1 number and understand why you're *faster* on an ASAN build than we generally are. Did the mochitest suite in question crash half-way through or something?

But otherwise, I don't see any reason why we couldn't move ahead with this per-push for linux 64 platforms given that we can virtualize most of that impact.
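As a quick sanity check on the capacity discussion above, here is a minimal sketch in plain Python that puts the comment 5 and comment 6 timings side by side. The per-suite numbers are copied from those two comments; everything else (the totals, the formatting) is just illustrative arithmetic, not measured data.

# Rough capacity-planning arithmetic for the ASan test jobs.
# Timings (in minutes) are taken from comment 5 (ASan opt try run) and
# comment 6 (typical Ubuntu 64 opt end-to-end times).
# Note: the mochitest-1 ASan number was anomalous and was later remeasured
# at ~36 minutes once the broken WebGL tests were disabled (comment 7).

asan_minutes = {
    "mochitest-1": 17 + 28 / 60,
    "mochitest-2": 19 + 10 / 60,
    "mochitest-3": 37 + 51 / 60,
    "mochitest-4": 12 + 40 / 60,
    "mochitest-5": 20 + 23 / 60,
    "mochitest-o": 19 + 5 / 60,
    "mochitest-bc": 66 + 39 / 60,
    "crashtest": 13 + 10 / 60,
    "reftest": 34 + 21 / 60,
    "jsreftest": 21 + 28 / 60,
    "xpcshell": 57 + 49 / 60,
}

opt_minutes = {
    "mochitest-1": 24, "mochitest-2": 9, "mochitest-3": 23,
    "mochitest-4": 6, "mochitest-5": 11, "mochitest-o": 16,
    "mochitest-bc": 34, "crashtest": 7, "reftest": 28,
    "jsreftest": 10, "xpcshell": 32,
}

for suite, asan in sorted(asan_minutes.items()):
    ratio = asan / opt_minutes[suite]
    print(f"{suite:14s} {asan:6.1f} min  ({ratio:.1f}x opt)")

total_hours = sum(asan_minutes.values()) / 60
print(f"total ASan test machine time per push: {total_hours:.1f} hours")

Most suites come out around the "roughly 2x a plain opt run" mark mentioned above, which is why the per-push cost question matters for capacity planning.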
(In reply to Clint Talbert ( :ctalbert ) from comment #6)
> 
> I would encourage you (decoder) to dig into that mochitest-1 number and
> understand why you're *faster* on an ASAN build than we generally are. Did
> the mochitest suite in question crash half-way through or something?

You're right, this is a more recent regression in M1 and the crash is in the WebGL testsuite. I'll try to explicitly disable the faulty tests and do another measurement. There also seem to be some random oranges on M-oth that could interfere here; I'll see if I can disable these too and get them on file. This shouldn't block us, though, from pushing the tests forward in the meantime.
Depends on: 872577
Depends on: 899802
I've disabled the WebGL tests in a try push (we need to re-enable them when we have upgraded our Mesa version) and mochitest-1 runs in 36 minutes now.

I'm also working actively on resolving another timeout on mochitest-bc, which might be due to OOM (the cloud Linux machines seem to have more memory, enough not to trigger my "low-memory" configuration, but still not enough to run in default mode; testing this now).

There are also two more orange bugs open right now, but we can easily disable these tests until we have a fix.

Clint, what would be the next steps to get this on mozilla-inbound and have the tests enabled? Or would we first enable them on m-c only and then move to inbound? The ultimate goal is to have the tests running + unhidden on tbpl (which is not going to happen with m-c only).
Flags: needinfo?(ctalbert)
Alias: asan-tests
Depends on: 902132
No longer depends on: 899802
Depends on: 902157
:decoder:


Some questions while mtg w/ctalbert just now:
1) Before we start enabling these new tests on mozilla-inbound/b2g-inbound/fx-team, I recommend we get these working and all green on a lower-traffic branch such as a project branch or mozilla-central first. Once these are all green, we can enable the tests on other high-volume branches.

2) The changes to what tests are run on what branches are handled within buildbot scheduling logic. This bug is in the correct component for that once all dep.bugs are resolved. (A rough sketch of the kind of scheduling change involved follows below.)

3) Thanks for the runtimes in comment#5, comment#6. The question about what *frequency* of tests is still unresolved - how often do you *need* these run? Is once-per-nightly enough? Given current infrastructure load, we can only support running additional tests like this on a virtualized OS like ubuntu (not a physical OS like fedora), and either way there's a financial $$$ cost to balance here.
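For illustration of point 2, here is a minimal sketch of the kind of per-branch, per-platform change the scheduling logic consumes, assuming a simplified BRANCHES-style dict. The key names, dict layout, and platform label are hypothetical, not the real buildbot-configs schema.

# Illustrative sketch only: a simplified picture of per-branch test scheduling
# data. The dict layout and key names are hypothetical, not actual buildbot-configs.

ASAN_SUITES = ["reftest", "crashtest", "xpcshell", "jsreftest", "mochitests"]

BRANCHES = {
    # Start on a low-traffic branch (or mozilla-central) until green...
    "mozilla-central": {"platforms": {"linux64-asan": {"opt_unittest_suites": ASAN_SUITES}}},
    # ...then fan out to the high-volume integration branches.
    "mozilla-inbound": {"platforms": {}},
    "b2g-inbound": {"platforms": {}},
    "fx-team": {"platforms": {}},
}

def enable_asan_tests(branches, branch_names, suites=ASAN_SUITES):
    """Turn on the ASan opt test suites for the given branches."""
    for name in branch_names:
        platforms = branches[name].setdefault("platforms", {})
        platforms.setdefault("linux64-asan", {})["opt_unittest_suites"] = list(suites)

# Once mozilla-central is green, the follow-up step discussed in this comment:
enable_asan_tests(BRANCHES, ["mozilla-inbound", "b2g-inbound", "fx-team"])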
Assignee: joduinn → nobody
Flags: needinfo?(ctalbert)
(In reply to John O'Duinn [:joduinn] from comment #9)
> :decoder:
> 
> 
> Some questions while mtg w/ctalbert just now:
> 1) Before we start enabling these new tests on
> mozilla-inbound/b2g-inbound/fx-team, I recommend we get these working and
> all green on lower-traffic branch such as a project branch or
> mozilla-central first. 

I've done a try push today and it's almost green now. Getting it entirely green on mozilla-central/inbound is just a matter of days. Once that is green, we can enable the tests on mozilla-central and then on mozilla-inbound (I recommend doing that quickly, though, before people start introducing more regressions again).

> 
> 3) Thanks for the runtimes in comment#5, comment#6. The question about what
> *frequency* of tests still unresolved - how often do you *need* these run?
> Is once-per-nightly enough? 

The sheriffs say we can only unhide the tests on tbpl if we have them on mozilla-inbound such that we can easily identify the regressing changeset.

> Given current infrastructure load, we can only
> support running additional tests like this on virtualized OS like ubuntu
> (not physical OS like fedora), and either way there's a financial $$$ cost
> to balance here.

I guess just Ubuntu is fine :) About the costs, this is something managers need to work out. Dan Veditz told me we have support for doing this, so I assume we also have the financial support :) I'm just driving the technical side of this. 

Thanks!
So with regard to the requirement from the sheriffs, I totally understand where they are coming from. I also understand Joduinn's concerns about turning on a test system that will consume our test slaves for twice as long as normal. There has to be a compromise here.

How often do we expect ASAN builds to show regressions that are worthy of a backout? If they are going to be mostly green, then what if we run them on m-c for each push, and on m-i and the other integration branches every 4 hours? How stable have the ASAN tests been to date?

We need a way to turn them on in the short term, and in order to do that, we need to find a way to make them work without massive impacts to our current infrastructure load.

For the long-term solution of running them per push in the cloud, we need to get more money allocated to our Amazon bill. For that, I'd recommend Dveditz start a thread with me, joduinn, and bmoss and make the case with regard to what having per-push ASAN builds on every tree will buy us in the long term. (I'd write it with an eye toward what it will save us on having to do security re-spins.) I'm happy to help edit it if you email me directly, but I don't want to try to make your case for you.

Ed, flagging you for more info on how we can turn these on in a way that won't severely impact slave wait times and will still allow the sheriffs to ascertain what went wrong when these tests highlight an issue. (See my proposal about three paragraphs above).
Flags: needinfo?(emorley)
ASan tests should be stable enough for that; it's not that there is a new failure every hour (not even every day). It's just that when a failure happens, sheriffs need to be able to blame someone without going through a full merge, I guess.
(In reply to comment #11)
> How often do we expect ASAN builds to show regressions that are worthy of a
> backout?

This is a very difficult question to answer without guessing. I don't think we can reliably answer this question for any of our other test suites, FWIW. And given the fact that these tests are stable, failures that show up only in ASAN tests are probably more serious than the average failure that we back stuff out for these days.
Product: mozilla.org → Release Engineering
(Commenting here at decoder's request)

I'm fine with Clint's proposal. We already have a precedent in that PGO builds are only run at a set frequency, and we've managed to survive thus far. Ultimately, it's hard to judge whether this will work for ASAN or not - we don't know what we don't know :). I'm open to trying it as proposed for now, as long as we remain open to increasing the frequency should we find it to be unworkable for whatever reason.
Asan builds only run on one platform (Fedora64) x opt+debug (100 and 150 mins respectively). This is relatively little load compared to all the other builds and test runs combined. As such, why are we worried this will increase load unnecessarily?

More importantly, why are we singling out ASan builds in particular for "we think they won't fail often, so let's not run them all the time"? The same could be said of many of the unit tests running on all three variants of OS X for example, which account for much more machine time.

Until we have the architecture in place to make regression hunting easy (e.g. bisect in the cloud), I'm really quite averse to running ASan builds anything other than per push, given the small infra load saving. Once bisect in the cloud is up and running, we'll be able to save on ASan and many more suites combined, making this look like a drop in the ocean... :-)
Flags: needinfo?(emorley)
Oh, tests; I misread the prior comments (sorry, still playing catch-up, got back late from the ER, yey). In which case then yeah, maybe we just need to run these periodically on non-mozilla-central trees, at least until we know whether they fail too frequently for that to be viable.
(In reply to Ed Morley [:edmorley UTC+1] from comment #16)
> Oh tests, mis-read read prior comments (sorry still playing catch up, got
> back late from the ER, yey). In which case then yeah maybe we just need to
> run these periodically on non-mozilla-central trees at least until we know
> if they fail too frequently for that to be viable.

Why non-mozilla-central? We are already creating builds for mozilla-central and have been working quite hard to get the tests green there, because the next step was supposed to be enabling them on mozilla-central. And once that is stable, mozilla-inbound/unhiding on tbpl.
Per push on mozilla-central, periodically for non-mozilla-central.
Depends on: 905636
No longer depends on: 750932
Depends on: 906100
OS: Mac OS X → Linux
Hardware: x86 → All
ASan is now green on mozilla-central:

https://tbpl.mozilla.org/?tree=Try&rev=01a5d7808288

(The orange Build is ok; that's just because of the way this was pushed to try. The builds in the ASan build job are green, of course.)
Per meeting w/dveditz, :decoder and joduinn just now:

(In reply to John O'Duinn [:joduinn] from comment #9)
> :decoder:
> 
> Some questions while mtg w/ctalbert just now:
> 1) Before we start enabling these new tests on
> mozilla-inbound/b2g-inbound/fx-team, I recommend we get these working and
> all green on lower-traffic branch such as a project branch or
> mozilla-central first. Once these are all green, we can enable the tests on
> other high-volume branches.
decoder now has some testsuites running green for ASAN builds on ubuntu64, specifically: reftest, crashtest, xpcshell, jsreftest, mochitests. To make sure these tests *stay* green, we'd like to enable them on other branches, so developers can see bustages, and sheriffs can do backouts-as-needed.

This needs matching changes to trychooser, to enable running these ASan-builds, and tests-on-ASan-builds. Bug#847973 tracks getting those changes into trychooser. Bug#887641 tracks supporting those builds on Try, as not-default. 

These tests-on-opt-asan builds are slower than the usual tests-on-opt builds, and are approximately the same as debug builds (see comment#5, comment#6 for details). There is no need to run these ASAN tests on debug+asan builds, as those would be *super* slow.


> 2) The changes to what tests are run on what branches, are handled within
> buildbot scheduling logic. This bug is in the correct component for that
> once all dep.bugs are resolved.
For sheriffs to be able to support these builds+tests on mozilla-central, we also need to have these builds + tests on all 3 inbounds (mozilla-inbound, b2g-inbound, fx-team) and non-default-on-try. Ideally, security folks would like ASan builds+tests to be run per-checkin, so let's start with this, as this is most helpful to developers doing landings.

Other project branches may also choose to have ASan builds, but they should file bugs asking for them as/when needed.

> 3) Thanks for the runtimes in comment#5, comment#6. The question about what
> *frequency* of tests still unresolved - how often do you *need* these run?
> Is once-per-nightly enough? Given current infrastructure load, we can only
> support running additional tests like this on virtualized OS like ubuntu
> (not physical OS like fedora), and either way there's a financial $$$ cost
> to balance here.
Because these are running on Ubuntu64 (on AWS), not Fedora64 (physical hardware), we can enable this without impacting other test jobs. If we find the $$$ on AWS to be a problem, we could reduce the cadence - maybe to the same cadence as the PGO builds for windows? It may also be possible to run these just once per night on the nightly ASan build, and then whenever a problem is detected, have sheriffs file a bug with the regression range to the previous good nightly, and let developers or security folks figure it out using try?
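To make the cadence options concrete, here is a small sketch using plain upstream Buildbot 0.8-style schedulers. The builder names are hypothetical and this is not how the production buildbotcustom/buildbot-configs scheduling is actually wired; it only illustrates the per-push vs. fixed-cadence trade-off discussed above.

# Sketch of the two cadence options, using plain upstream Buildbot schedulers.
# Builder names and wiring are illustrative assumptions, not the real setup.

from buildbot.changes.filter import ChangeFilter
from buildbot.schedulers.basic import SingleBranchScheduler
from buildbot.schedulers.timed import Nightly

ASAN_TEST_BUILDERS = [
    # hypothetical builder names
    "Ubuntu ASAN VM 12.04 x64 mozilla-central opt test mochitest-1",
    "Ubuntu ASAN VM 12.04 x64 mozilla-central opt test xpcshell",
]

# Ideal case from this comment: run per checkin.
per_push = SingleBranchScheduler(
    name="asan-tests-per-push",
    change_filter=ChangeFilter(branch="mozilla-central"),
    treeStableTimer=None,          # None => one build per change, i.e. per push
    builderNames=ASAN_TEST_BUILDERS,
)

# Fallback if the AWS cost becomes a problem: a fixed cadence, e.g. the
# once-per-night option mentioned above.
nightly = Nightly(
    name="asan-tests-nightly",
    branch="mozilla-central",
    builderNames=ASAN_TEST_BUILDERS,
    hour=3, minute=0,              # once a day, at 03:00
)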
Depends on: 847973
What's the hold-up here? Let's get this stood up in AWS. Per comment 1, once a day is fine, and over the course of the 7 months this has been open we must have figured out which tests need to be run and fixed whatever tests needed to be fixed. I cannot imagine that running this once a day is going to substantially impact our overall AWS bill. We can revisit this if I am wrong. Is it really a requirement in the short term to stand up trychooser (there does seem to be a workaround for the short run), or can that follow?
(In reply to Bob Moss :bmoss from comment #21)
> What's the hold up here? Let's get this stood up in AWS. Per comment 1 once
> a day is fine 

In the meeting with joduinn today, we discussed how to proceed, and this is going to be put into production soon. Per the comment above, it won't be sufficient to do this once a day. Rather, we will be starting per push, because otherwise we cannot unhide this on TBPL.

There is also a meeting scheduled now for the Tuesday in two weeks, in case this hasn't been put into production by then. Thanks!
Do we need tests on both the opt and debug asan builds?
(In reply to Chris AtLee [:catlee] from comment #23)
> Do we need tests on both the opt and debug asan builds?

Just opt:

(In reply to John O'Duinn [:joduinn] from comment #20)
> These  tests-on-opt-asan builds are slower then usual tests-on-opt builds,
> and  are approx same as debug builds (see comment#5, comment#6 for details).
> There is no need to run these ASAN tests on debug+asan builds, as these 
> would be *super* slow.
Ah, missed that, thanks!
Should we run jetpack tests on these builds?
also turns on builds on try
Attachment #797416 - Flags: review?(rail)
Attachment #797416 - Flags: review?(rail) → review+
Attachment #797420 - Flags: review?(rail) → review+
Attachment #797420 - Flags: checked-in+
Attachment #797428 - Flags: review?(rail) → review+
Attachment #797428 - Flags: checked-in+
Comment on attachment 797416 [details] [diff] [review]
get asan tests running on cedar, try

should be live on cedar/try on the next reconfig
Attachment #797416 - Flags: checked-in+
In production.
Depends on: 911237
Attachment #797922 - Flags: review?(rail) → review+
Attached patch: missing watches (Splinter Review)
noticed these builder types are missing from watch_pending.cfg too
Attachment #797924 - Flags: review?(rail)
Attachment #797924 - Flags: review?(rail) → review+
Attachment #797922 - Flags: checked-in+
Attachment #797924 - Flags: checked-in+
The patches here have been merged to production (and presumably reconfiged), but the tests aren't being scheduled:
https://tbpl.mozilla.org/?showall=1&jobname=asan

catlee, any ideas? :-)
Flags: needinfo?(catlee)
it's only enabled on cedar right now
Flags: needinfo?(catlee)
Oh, I misread the earlier comments to mean the patches were for all trees; I can see the patch description states the opposite, sorry! :-)
enable tests on m-c, and disable builds/tests on cedar
Attachment #799112 - Flags: review?(rail)
Attachment #799112 - Flags: review?(rail) → review+
Attachment #799112 - Flags: checked-in+
Latest patch is in production.
All done! Jobs are running (but hidden) on tbpl:

https://tbpl.mozilla.org/?showall=1&jobname=asan
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
(In reply to John O'Duinn [:joduinn] from comment #20)
> For sheriffs to be able to support these builds+tests on mozilla-central, we
> also need to have these builds + tests on all 3 inbounds (mozilla-inbound,
> b2g-inbound, fx-team) and non-default-on-try. Ideally, security folks would 
> like ASan builds+tests to be run per-checkin, so lets start with this, as
> this is most helpful to developers doing landings. 

Needed on more than mozilla-central :-)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
To be honest we might as well enable them for all trunk-matching repos:
1) mozilla-central + 3xinbounds will be 95% of the trunk push count, so we won't exactly save much by leaving them off elsewhere (and will only have bad surprises when merging into m-c from project repos otherwise).
2) the asan test jobs are actually quicker than debug runs, and we're only doing them on one platform - so in the grand scheme of things I still think we're worrying over nothing.
And then once you think about writing yet another loop to remove the job from release branches, this probably ought to have a follow-the-trains loop instead, because we don't really want to use memory properly on trunk, then screw up in an insecure way while porting a patch to a release branch, and find out that we did by paying yet another bounty to someone who does run ASan on release branches.
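A tiny hypothetical sketch of what such a follow-the-trains gate could look like: enable the jobs on every branch whose Gecko version is at least the version where they first landed on trunk, so the configuration rides the trains to aurora/beta/release. The version numbers and the branch dict are made up for illustration only.

# Hypothetical follow-the-trains gating; branch list and versions are illustrative.

ASAN_TESTS_FIRST_GECKO = 26  # hypothetical: trunk version when this landed

BRANCHES = {
    "mozilla-central": {"gecko_version": 26},
    "mozilla-inbound": {"gecko_version": 26},
    "mozilla-aurora":  {"gecko_version": 25},
    "mozilla-beta":    {"gecko_version": 24},
    "mozilla-release": {"gecko_version": 23},
}

for name, cfg in sorted(BRANCHES.items()):
    cfg["enable_asan_tests"] = cfg["gecko_version"] >= ASAN_TESTS_FIRST_GECKO
    print(f"{name:16s} asan tests: {cfg['enable_asan_tests']}")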
Attachment #800749 - Flags: review?(rail)
Attachment #800749 - Flags: review?(rail) → review+
Attachment #800749 - Flags: checked-in+
something here is in production
Both builds and tests are now running on other trees, and are unhidden on m-c, so _everything_ here is in production!
Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Attached patch: buildapi (Splinter Review)
Attachment #804464 - Flags: review?(catlee)
Attachment #804464 - Flags: review?(catlee) → review+
Depends on: 917242
We've just had ASan-only test breakage that was easily identifiable because we're doing per-push builds \o/ :-)

https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=813a35c5b24a
(In reply to Ed Morley [:edmorley UTC+1] from comment #48)
> We've just had ASan only test breakage, that was easily identifiable due to
> us doing per push builds \o/ :-)
> 
> https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=813a35c5b24a

Thanks for letting us know! Philor also told me about this. Is it possible that the sheriffs could record such cases when they see them? At least for a while. We are of course interested in how many failures we're catching now that would otherwise have been missed or only identified later. That would be super awesome.
Yup we can do :-)

CCing the remaining sheriffs not yet CCed - please see comment 49 :-)
For future reference, that was bug 918041, and bug 895091 comment 106 shows a backout.
Depends on: 919145
Depends on: 920055
Depends on: 925873
Depends on: 929024
Depends on: 939513
Depends on: 980997
Depends on: 981000
Component: General Automation → General