Closed Bug 1536848 Opened 5 years ago Closed 5 years ago

[meta] Require GCC 7

Categories

(Firefox Build System :: Toolchains, enhancement)

Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla72
Tracking Status
firefox72 --- fixed

People

(Reporter: emilio, Assigned: froydnj)

References

Details

(Keywords: meta)

Attachments

(1 file)

No description provided.
See Also: → 1467153, 1410217

First attempt at:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=cfe8a1f3acbf35fd71cfbeb553eb6d4a2b8da411

Sixgill is crashing, and I still hit bug 1467153. There were multiple fishy bits in the build scripts; I pushed with those and bug 1467153 worked around in:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=dfce3c566e2075aabb5d0050dc3a8ba0c3289693

Mmm, it might be better not to touch the toolchains, especially clang.

I think you'll also need some of the options from the patch in bug 1410217.

Depends on: 1410959

Looks like Nathan has self-assigned a dupe of this, so I'll dupe that one over to here, since this one has more information.

Assignee: nobody → nfroyd
Depends on: 1560667
Blocks: cxx17

We need this for full C++17 support.

Depends on D51450
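
For context, a minimal plain-Python sketch of the kind of minimum-version gate such a patch adds at configure time; the real check lives in the build system's configure code and is phrased differently, so the names, default values, and error message here are illustrative assumptions only:

```python
# Hypothetical sketch of the idea behind the patch: reject GCC older than 7 at
# configure time.  Not the actual patch; names and message are illustrative.
import subprocess

MINIMUM_GCC_MAJOR = 7

def gcc_major_version(compiler="gcc"):
    """Return the compiler's major version, e.g. 7 for GCC 7.4.0."""
    out = subprocess.check_output([compiler, "-dumpversion"], text=True).strip()
    return int(out.split(".")[0])

def check_minimum_gcc(compiler="gcc"):
    major = gcc_major_version(compiler)
    if major < MINIMUM_GCC_MAJOR:
        raise SystemExit("Only GCC %d or newer is supported (found GCC %d)."
                         % (MINIMUM_GCC_MAJOR, major))
```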

The code coverage build is still using GCC 6; we were planning to switch to Clang but haven't done it yet. If we raise the minimum GCC version to 7, we should switch the ccov build to GCC 7 too.

Depends on: 1593673

(In reply to Marco Castelluccio [:marco] from comment #8)

The code coverage build is still using GCC 6; we were planning to switch to Clang but haven't done it yet. If we raise the minimum GCC version to 7, we should switch the ccov build to GCC 7 too.

That's a good point. The ccov jobs build just fine with GCC 7; what else do I need to test to make sure the upgrade works?

Flags: needinfo?(mcastelluccio)

(In reply to Nathan Froyd [:froydnj] from comment #9)

(In reply to Marco Castelluccio [:marco] from comment #8)

The code coverage build is still using GCC 6; we were planning to switch to Clang but haven't done it yet. If we raise the minimum GCC version to 7, we should switch the ccov build to GCC 7 too.

That's a good point. The ccov jobs build just fine with GCC 7; what else do I need to test to make sure the upgrade works?

When I tested GCC 7 at the time, there were a lot of oranges in tests (I think they were mostly timeouts). Did you run tests too? Are they looking more or less as good as on mozilla-central?

Once you have a try build with all tests (well, the same tests as we run on mozilla-central: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=ccov), we can see if the overall coverage percentage is more or less the same as a basic sanity check (you can use the script mentioned in the "Generate report locally" paragraph at https://developer.mozilla.org/en-US/docs/Mozilla/Testing/Measuring_Code_Coverage_on_Firefox, or just ping me and I'll do it).
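
As an illustration of that sanity check, here is a hedged sketch comparing the overall line-coverage percentage of two report summaries; the JSON layout and file names are assumptions made for the example, not the actual output of the report script mentioned above:

```python
# Hypothetical sketch: compare a baseline (GCC 6) coverage summary against a
# candidate (GCC 7) one.  The {file: {"covered", "total"}} layout and the file
# names are assumptions; the real report format may differ.
import json

def overall_percentage(report_path):
    """Overall covered/total line percentage across all files in a summary."""
    with open(report_path) as f:
        summary = json.load(f)
    covered = sum(entry["covered"] for entry in summary.values())
    total = sum(entry["total"] for entry in summary.values())
    return 100.0 * covered / total if total else 0.0

baseline = overall_percentage("coverage-gcc6.json")
candidate = overall_percentage("coverage-gcc7.json")
print("baseline %.2f%% vs candidate %.2f%% (delta %+.2f)"
      % (baseline, candidate, candidate - baseline))
```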

Flags: needinfo?(mcastelluccio)

Hm, apparently the timeout situation hasn't gotten any better since the last time:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3f58cc80362ee5a7b74d497ef310b0932b056a09

What can we do here? The GCC 7 update is blocking other work.

Flags: needinfo?(mcastelluccio)

(In reply to Nathan Froyd [:froydnj] from comment #11)

Hm, apparently the timeout situation hasn't gotten any better since the last time:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3f58cc80362ee5a7b74d497ef310b0932b056a09

What can we do here? The GCC 7 update is blocking other work.

The two options are 1) switching the build to Clang (bug 1499663) or 2) switching to GCC 7. I'm afraid, though, that in both cases there will be test failures/timeouts to root out.
We could simply try to increase the timeouts/chunking and see where that gets us, then increase the per-suite single-test timeouts, and then disable or increase the timeout for specific tests that still fail.

Flags: needinfo?(mcastelluccio)

OK, so I'll start by saying this is completely my fault: I didn't realize the coverage build was so sensitive to the version of GCC being used, and I failed to factor that into the amount of work that the C++17 project would require. I should have scoped the project out better.

That being said, having a decent amount of work held back by tier 2 jobs that aren't even present in the default list of try jobs is frustrating. I am assuming here that we don't want to shut them all off, but neither do I want to pause the C++17 work for a quarter or more while we figure out how to handle hundreds of failing tests.

Options that I can see:

  1. Shut the jobs off for the time being (listed for completeness)
  2. Demote the jobs to tier 3 (only orange ones, if possible) -- not sure this is really possible
  3. skip-if = ccov or the equivalent for as many things as possible (which could be strictly worse than option 2)
  4. Increase timeouts some more -- I don't think this will actually solve anything, as there are a number of tests that fail without timing out, which suggests that the tests themselves are sensitive to the ordering of events in the test.

We could simply try to increase the timeouts/chunking and see where that gets us, then increase the per-suite single-test timeouts, and then disable or increase the timeout for specific tests that still fail.

I guess this is a mix of option 4 and option 3 above? Where does one modify the timeouts and the chunking of the coverage tests?

Flags: needinfo?(mcastelluccio)

(In reply to Nathan Froyd [:froydnj] from comment #13)

OK, so I'll start by saying this is completely my fault: I didn't realize the coverage build was so sensitive to the version of GCC being used, and I failed to factor that into the amount of work that the C++17 project would require. I should have scoped the project out better.

That being said, having a decent amount of work held back by tier 2 jobs that aren't even present in the default list of try jobs is frustrating. I am assuming here that we don't want to shut them all off, but neither do I want to pause the C++17 work for a quarter or more while we figure out how to handle hundreds of failing tests.

I think they're tier 2 mostly to reduce cost (running them on autoland or try by default would be costly), but ideally they'd be tier 1.

Options that I can see:

  1. Shut the jobs off for the time being (listed for completeness)
  2. Demote the jobs to tier 3 (only orange ones, if possible) -- not sure this is really possible
  3. skip-if = ccov or the equivalent for as many things as possible (which could be strictly worse than option 2)
  4. Increase timeouts some more -- I don't think this will actually solve anything, as there are a number of tests that fail without timing out, which suggests that the tests themselves are sensitive to the ordering of events in the test.

We could simply try to increase the timeouts/chunking and see where that gets us, then increase the per-suite single-test timeouts, and then disable or increase the timeout for specific tests that still fail.

I guess this is a mix of option 4 and option 3 above? Where does one modify the timeouts and the chunking of the coverage tests?

Yes.
The chunking and job-level timeouts can be modified in the yaml configuration files like taskcluster/ci/test/mochitest.yml (just search for "cov").
The per-suite test-level timeouts can be modified in suite-specific configuration files (e.g. https://searchfox.org/mozilla-central/rev/8b7aa8af652f87d39349067a5bc9c0256bf6dedc/testing/mochitest/runtests.py#1941).
The individual test-level timeouts are also suite-specific, but they can usually be modified in the test files themselves (e.g. by using requestLongerTimeout, https://searchfox.org/mozilla-central/rev/8b7aa8af652f87d39349067a5bc9c0256bf6dedc/toolkit/components/antitracking/test/browser/head.js#57) or in the configuration files linked to the tests.
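
As a rough illustration of the per-suite knob described above (this is not the actual runtests.py code, and the default timeout and coverage multiplier are made-up values), the calculation amounts to something like:

```python
# Rough, hypothetical sketch of a per-suite timeout knob for coverage builds.
def per_test_timeout(base_timeout=45, coverage_build=False, multiplier=4):
    """Return the timeout (in seconds) each individual test gets on this build."""
    if coverage_build:
        # Instrumented ccov builds run much slower, so give every test more time.
        return base_timeout * multiplier
    return base_timeout

assert per_test_timeout() == 45
assert per_test_timeout(coverage_build=True) == 180
```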

It might also be worth trying with Clang (but bug 1509665 would have to be fixed first) and/or after moving the tests from debug to opt (I tried this a while ago, without changing the compiler, and the jobs were faster but there was a crash which I did not have time to look into: https://hg.mozilla.org/try/rev/e09ca3c7ce755d91e4362ebb1e0a38bcdf5ab196).

(Make sure you only run the tasks that we are currently running on mozilla-central; otherwise the situation will look more orange than it actually is. You should be able to do that by commenting out https://searchfox.org/mozilla-central/rev/8b7aa8af652f87d39349067a5bc9c0256bf6dedc/tools/tryselect/selectors/fuzzy.py#44, then running "mach try fuzzy" and selecting all "linux64-ccov/debug" jobs.)
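
A hypothetical sketch of what that selection step amounts to, with illustrative task labels (in practice the selection happens interactively in "mach try fuzzy"):

```python
# Hypothetical sketch: keep only the tasks that also run on mozilla-central,
# then pick the linux64-ccov/debug ones.  Task labels are illustrative, not a
# real task-graph listing.
def select_ccov_tasks(try_tasks, central_tasks):
    central = set(central_tasks)
    return [label for label in try_tasks
            if label in central and "linux64-ccov/debug" in label]

print(select_ccov_tasks(
    ["test-linux64-ccov/debug-mochitest-e10s-1", "test-linux64/opt-mochitest-e10s-1"],
    ["test-linux64-ccov/debug-mochitest-e10s-1"]))
```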

Flags: needinfo?(mcastelluccio)

Bumping the timeouts (somewhat conservatively for reftests, aggressively for everything else) and actually waiting for things to finish:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c&selectedJob=275113895

There are suites that still time out, but there are also hundreds of individual tests that just fail, likely due to timing issues of one sort or another. Note that those are actual failures, not individual test timeouts (though there are some of those), at least as far as the error messages suggest.

(In reply to Nathan Froyd [:froydnj] from comment #15)

Bumping the timeouts (somewhat conservatively for reftests, aggressively for everything else) and actually waiting for things to finish:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c&selectedJob=275113895

There are suites that still time out, but there are also hundreds of individual tests that just fail, likely due to timing issues of one sort or another. Note that those are actual failures, not individual test timeouts (though there are some of those), at least as far as the error messages suggest.

Most of the failures are still due to suite timeouts as far as I can see; maybe we can try an even more aggressive bump just to see how much time they would actually need to finish (and then decide whether we actually want a longer timeout, or to increase the chunking).

If you look at https://treeherder.mozilla.org/pushhealth.html?repo=try&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c (though I'm not sure if it's accurate), there seem to be 509 failures, but some are listed twice and a bit more than 100 are individual test timeouts.

Also, it'd be better to just trigger jobs that we actually trigger on mozilla-central, to avoid seeing things more orange than they actually are (see my previous comment for an easy way to do that).

(In reply to Marco Castelluccio [:marco] from comment #16)

(In reply to Nathan Froyd [:froydnj] from comment #15)

Bumping the timeouts (somewhat conservatively for reftests, aggressively for everything else) and actually waiting for things to finish:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c&selectedJob=275113895

There are suites that still time out, but there are also hundreds of individual tests that just fail, likely due to timing issues of one sort or another. Note that those are actual failures, not individual test timeouts (though there are some of those), at least as far as the error messages suggest.

Most of the failures are still due to suite timeouts as far as I can see; maybe we can try an even more aggressive bump just to see how much time they would actually need to finish (and then decide whether we actually want a longer timeout, or to increase the chunking).

Just to make sure that we are talking about the same thing, when I say "suites timeout", I mean things like:

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=275129199&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c

where you get "[taskcluster:error] Task timeout after 7200 seconds. Force killing container." When I say "individual test failures", I mean things like:

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=275113960&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c

where individual tests within a suite are giving TEST-UNEXPECTED-FAIL. "Individual test timeouts" would be more like:

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=275113941&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c

which also has the entire suite timing out, as above.

(In reply to Marco Castelluccio [:marco] from comment #16)

Also, it'd be better to just trigger jobs that we actually trigger on mozilla-central, to avoid seeing things more orange than they actually are (see my previous comment for an easy way to do that).

I am running mach try fuzzy with the full list of jobs and selecting all the ccov ones. I guess we're not running the fission tests under coverage on mozilla-central, so the tests look a little worse than they could otherwise be.

(In reply to Nathan Froyd [:froydnj] from comment #17)

(In reply to Marco Castelluccio [:marco] from comment #16)

(In reply to Nathan Froyd [:froydnj] from comment #15)

Bumping the timeouts (somewhat conservatively for reftests, aggressively for everything else) and actually waiting for things to finish:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c&selectedJob=275113895

There are suites that still time out, but there are also hundreds of individual tests that just fail, likely due to timing issues of one sort or another. Note that those are actual failures, not individual test timeouts (though there are some of those), at least as far as the error messages suggest.

Most of the failures are still due to suite timeouts as far as I can see; maybe we can try an even more aggressive bump just to see how much time they would actually need to finish (and then decide whether we actually want a longer timeout, or to increase the chunking).

Just to make sure that we are talking about the same thing, when I say "suites timeout", I mean things like:

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=275129199&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c

where you get "[taskcluster:error] Task timeout after 7200 seconds. Force killing container." When I say "individual test failures", I mean things like:

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=275113960&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c

where individual tests within a suite are giving TEST-UNEXPECTED-FAIL. "Individual test timeouts" would be more like:

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=275113941&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c

which also has the entire suite timing out, as above.

Yep! Most of the job failures are in the first bucket if I'm not mistaken.

(In reply to Marco Castelluccio [:marco] from comment #16)

Also, it'd be better to just trigger jobs that we actually trigger on mozilla-central, to avoid seeing things more orange than they actually are (see my previous comment for an easy way to do that).

I am running mach try fuzzy with the full list of jobs and selecting all the ccov ones. I guess we're not running the fission tests under coverage on mozilla-central, so the tests look a little worse than they could otherwise be.

Yeah, probably also a few other jobs. The best way to select exactly what you need is the small hack I described at the end of comment 14.

Depends on: 1596275
Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/56a7bf975576
raise the minimum gcc version to 7; r=dmajor
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72