Closed Bug 1536848 Opened 5 years ago Closed 5 years ago

[meta] Require GCC 7

Categories

(Firefox Build System :: Toolchains, enhancement)

Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla72
Tracking Status
firefox72 --- fixed

People

(Reporter: emilio, Assigned: froydnj)

References

Details

(Keywords: meta)

Attachments

(1 file)

No description provided.
See Also: → 1467153, 1410217

First attempt at:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=cfe8a1f3acbf35fd71cfbeb553eb6d4a2b8da411

Sixgill is crashing, and I still hit bug 1467153. There were multiple fishy bits in the build scripts; I pushed with those and bug 1467153 worked around in:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=dfce3c566e2075aabb5d0050dc3a8ba0c3289693

Mmm, it might be better not to touch the toolchains, especially clang.

I think you'll also need some of the options from the patch in bug 1410217.

Depends on: 1410959

Looks like Nathan has self-assigned a dupe of this, so I'll dupe that one over to here, since this one has more information.

Assignee: nobody → nfroyd
Depends on: 1560667
Blocks: cxx17

We need this for full C++17 support.

Depends on D51450
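
For context, a minimal plain-Python sketch of the kind of minimum-version gate such a patch adds at configure time; the real check lives in the build system's configure code and is phrased differently, so the names, default values, and error message here are illustrative assumptions only:

```python
# Hypothetical sketch of the idea behind the patch: reject GCC older than 7 at
# configure time.  Not the actual patch; names and message are illustrative.
import subprocess

MINIMUM_GCC_MAJOR = 7

def gcc_major_version(compiler="gcc"):
    """Return the compiler's major version, e.g. 7 for GCC 7.4.0."""
    out = subprocess.check_output([compiler, "-dumpversion"], text=True).strip()
    return int(out.split(".")[0])

def check_minimum_gcc(compiler="gcc"):
    major = gcc_major_version(compiler)
    if major < MINIMUM_GCC_MAJOR:
        raise SystemExit("Only GCC %d or newer is supported (found GCC %d)."
                         % (MINIMUM_GCC_MAJOR, major))
```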

The code coverage build is still using GCC 6; we were planning to switch to Clang but haven't done it yet. If we raise the minimum GCC version to 7, we should switch the ccov build to GCC 7 too.

Depends on: 1593673

(In reply to Marco Castelluccio [:marco] from comment #8)

The code coverage build is still using GCC 6; we were planning to switch to Clang but haven't done it yet. If we raise the minimum GCC version to 7, we should switch the ccov build to GCC 7 too.

That's a good point. The ccov jobs build just fine with GCC 7; what else do I need to test to make sure the upgrade works?

Flags: needinfo?(mcastelluccio)

(In reply to Nathan Froyd [:froydnj] from comment #9)

(In reply to Marco Castelluccio [:marco] from comment #8)

The code coverage build is still using GCC 6; we were planning to switch to Clang but haven't done it yet. If we raise the minimum GCC version to 7, we should switch the ccov build to GCC 7 too.

That's a good point. The ccov jobs build just fine with GCC 7; what else do I need to test to make sure the upgrade works?

When I tested GCC 7 at the time, there were a lot of oranges in tests (I think they were mostly timeouts). Did you run tests too? Are they looking more or less as good as on mozilla-central?

Once you have a try build with all tests (well, the same tests as we run on mozilla-central: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=ccov), we can see if the overall coverage percentage is more or less the same as a basic sanity check (you can use the script mentioned in the "Generate report locally" paragraph at https://developer.mozilla.org/en-US/docs/Mozilla/Testing/Measuring_Code_Coverage_on_Firefox, or just ping me and I'll do it).
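
As an illustration of that sanity check, here is a hedged sketch comparing the overall line-coverage percentage of two report summaries; the JSON layout and file names are assumptions made for the example, not the actual output of the report script mentioned above:

```python
# Hypothetical sketch: compare a baseline (GCC 6) coverage summary against a
# candidate (GCC 7) one.  The {file: {"covered", "total"}} layout and the file
# names are assumptions; the real report format may differ.
import json

def overall_percentage(report_path):
    """Overall covered/total line percentage across all files in a summary."""
    with open(report_path) as f:
        summary = json.load(f)
    covered = sum(entry["covered"] for entry in summary.values())
    total = sum(entry["total"] for entry in summary.values())
    return 100.0 * covered / total if total else 0.0

baseline = overall_percentage("coverage-gcc6.json")
candidate = overall_percentage("coverage-gcc7.json")
print("baseline %.2f%% vs candidate %.2f%% (delta %+.2f)"
      % (baseline, candidate, candidate - baseline))
```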

Flags: needinfo?(mcastelluccio)

Hm, apparently the timeout situation hasn't gotten any better since the last time:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3f58cc80362ee5a7b74d497ef310b0932b056a09

What can we do here? The GCC 7 update is blocking other work.

Flags: needinfo?(mcastelluccio)

(In reply to Nathan Froyd [:froydnj] from comment #11)

Hm, apparently the timeout situation hasn't gotten any better since the last time:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3f58cc80362ee5a7b74d497ef310b0932b056a09

What can we do here? The GCC 7 update is blocking other work.

The two options are 1) switching the build to Clang (bug 1499663) or 2) switching to GCC 7. I'm afraid, though, that in both cases there will be test failures/timeouts to root out.
We could simply try to increase the timeouts/chunking and see where that gets us, then increase the per-suite single-test timeouts, and then disable or increase the timeout for specific tests that still fail.

Flags: needinfo?(mcastelluccio)

OK, so I'll start by saying this is completely my fault: I didn't realize the coverage build was so sensitive to the version of GCC being used, and I failed to factor that into the amount of work that the C++17 project would require. I should have scoped the project out better.

That being said, having a decent amount of work held back by tier 2 jobs that aren't even present in the default list of try jobs is frustrating. I am assuming here that we don't want to shut them all off, but neither do I want to pause the C++17 work for a quarter or more while we figure out how to handle hundreds of failing tests.

Options that I can see:

  1. Shut the jobs off for the time being (listed for completeness)
  2. Demote the jobs to tier 3 (only orange ones, if possible) -- not sure this is really possible
  3. skip-if = ccov or the equivalent for as many things as possible (which could be strictly worse than option 2)
  4. Increase timeouts some more -- I don't think this will actually solve anything, as there are a number of tests that fail without timing out, which suggests that the tests themselves are sensitive to the ordering of events in the test.

We could simply try to increase the timeouts/chunking and see where that gets us, then increase the per-suite single-test timeouts, and then disable or increase the timeout for specific tests that still fail.

I guess this is a mix of option 4 and option 3 above? Where does one modify the timeouts and the chunking of the coverage tests?

Flags: needinfo?(mcastelluccio)

(In reply to Nathan Froyd [:froydnj] from comment #13)

OK, so I'll start by saying this is completely my fault: I didn't realize the coverage build was so sensitive to the version of GCC being used, and I failed to factor that into the amount of work that the C++17 project would require. I should have scoped the project out better.

That being said, having a decent amount of work held back by tier 2 jobs that aren't even present in the default list of try jobs is frustrating. I am assuming here that we don't want to shut them all off, but neither do I want to pause the C++17 work for a quarter or more while we figure out how to handle hundreds of failing tests.

I think they're tier 2 mostly to reduce cost (running them on autoland or try by default would be costly), but ideally they'd be tier 1.

Options that I can see:

  1. Shut the jobs off for the time being (listed for completeness)
  2. Demote the jobs to tier 3 (only orange ones, if possible) -- not sure this is really possible
  3. skip-if = ccov or the equivalent for as many things as possible (which could be strictly worse than option 2)
  4. Increase timeouts some more -- I don't think this will actually solve anything, as there are a number of tests that fail without timing out, which suggests that the tests themselves are sensitive to the ordering of events in the test.

We could simply try to increase the timeouts/chunking and see where that gets us, then increase the per-suite single-test timeouts, and then disable or increase the timeout for specific tests that still fail.

I guess this is a mix of option 4 and option 3 above? Where does one modify the timeouts and the chunking of the coverage tests?

Yes.
The chunking and job-level timeouts can be modified in the yaml configuration files like taskcluster/ci/test/mochitest.yml (just search for "cov").
The per-suite test-level timeouts can be modified in suite-specific configuration files (e.g. https://searchfox.org/mozilla-central/rev/8b7aa8af652f87d39349067a5bc9c0256bf6dedc/testing/mochitest/runtests.py#1941).
The individual test-level timeouts are also suite-specific, but they can usually be modified in the test files themselves (e.g. by using requestLongerTimeout, https://searchfox.org/mozilla-central/rev/8b7aa8af652f87d39349067a5bc9c0256bf6dedc/toolkit/components/antitracking/test/browser/head.js#57) or in the configuration files linked to the tests.
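
As a rough illustration of the per-suite knob described above (this is not the actual runtests.py code, and the default timeout and coverage multiplier are made-up values), the calculation amounts to something like:

```python
# Rough, hypothetical sketch of a per-suite timeout knob for coverage builds.
def per_test_timeout(base_timeout=45, coverage_build=False, multiplier=4):
    """Return the timeout (in seconds) each individual test gets on this build."""
    if coverage_build:
        # Instrumented ccov builds run much slower, so give every test more time.
        return base_timeout * multiplier
    return base_timeout

assert per_test_timeout() == 45
assert per_test_timeout(coverage_build=True) == 180
```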

It might also be worth trying with Clang (but bug 1509665 would have to be fixed first) and/or after moving the tests from debug to opt (I tried this a while ago, without changing the compiler, and the jobs were faster but there was a crash which I did not have time to look into: https://hg.mozilla.org/try/rev/e09ca3c7ce755d91e4362ebb1e0a38bcdf5ab196).

(Make sure you only run the tasks that we are currently running on mozilla-central; otherwise the situation will look more orange than it actually is. You should be able to do that by commenting out https://searchfox.org/mozilla-central/rev/8b7aa8af652f87d39349067a5bc9c0256bf6dedc/tools/tryselect/selectors/fuzzy.py#44, then running "mach try fuzzy" and selecting all "linux64-ccov/debug" jobs.)
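
A hypothetical sketch of what that selection step amounts to, with illustrative task labels (in practice the selection happens interactively in "mach try fuzzy"):

```python
# Hypothetical sketch: keep only the tasks that also run on mozilla-central,
# then pick the linux64-ccov/debug ones.  Task labels are illustrative, not a
# real task-graph listing.
def select_ccov_tasks(try_tasks, central_tasks):
    central = set(central_tasks)
    return [label for label in try_tasks
            if label in central and "linux64-ccov/debug" in label]

print(select_ccov_tasks(
    ["test-linux64-ccov/debug-mochitest-e10s-1", "test-linux64/opt-mochitest-e10s-1"],
    ["test-linux64-ccov/debug-mochitest-e10s-1"]))
```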

Flags: needinfo?(mcastelluccio)

Bumping the timeouts (somewhat conservatively for reftests, aggressively for everything else) and actually waiting for things to finish:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c&selectedJob=275113895

There are suites that still time out, but there are also hundreds of individual tests that just fail, likely due to timing issues of one sort or another. Note that those are actual failures, not individual test timeouts (though there are some of those), at least as far as the error messages suggest.

(In reply to Nathan Froyd [:froydnj] from comment #15)

Bumping the timeouts (somewhat conservatively for reftests, aggressively for everything else) and actually waiting for things to finish:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c&selectedJob=275113895

There are suites that still time out, but there are also hundreds of individual tests that just fail, likely due to timing issues of one sort or another. Note that those are actual failures, not individual test timeouts (though there are some of those), at least as far as the error messages suggest.

Most of the failures are still due to suite timeouts as far as I can see; maybe we can try an even more aggressive bump just to see how much time they would actually need to finish (and then decide whether we actually want a longer timeout, or to increase the chunking).

If you look at https://treeherder.mozilla.org/pushhealth.html?repo=try&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c (though I'm not sure if it's accurate), there seem to be 509 failures, but some are listed twice and a bit more than 100 are individual test timeouts.

Also, it'd be better to just trigger jobs that we actually trigger on mozilla-central, to avoid seeing things more orange than they actually are (see my previous comment for an easy way to do that).

(In reply to Marco Castelluccio [:marco] from comment #16)

(In reply to Nathan Froyd [:froydnj] from comment #15)

Bumping the timeouts (somewhat conservatively for reftests, aggressively for everything else) and actually waiting for things to finish:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c&selectedJob=275113895

There are suites that still time out, but there are also hundreds of individual tests that just fail, likely due to timing issues of one sort or another. Note that those are actual failures, not individual test timeouts (though there are some of those), at least as far as the error messages suggest.

Most of the failures are still due to suite timeouts as far as I can see; maybe we can try an even more aggressive bump just to see how much time they would actually need to finish (and then decide whether we actually want a longer timeout, or to increase the chunking).

Just to make sure that we are talking about the same thing, when I say "suites timeout", I mean things like:

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=275129199&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c

where you get "[taskcluster:error] Task timeout after 7200 seconds. Force killing container." When I say "individual test failures", I mean things like:

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=275113960&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c

where individual tests within a suite are giving TEST-UNEXPECTED-FAIL. "Individual test timeouts" would be more like:

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=275113941&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c

which also has the entire suite timing out, as above.

(In reply to Marco Castelluccio [:marco] from comment #16)

Also, it'd be better to just trigger jobs that we actually trigger on mozilla-central, to avoid seeing things more orange than they actually are (see my previous comment for an easy way to do that).

I am running mach try fuzzy with the full list of jobs and selecting all the ccov ones. I guess we're not running the fission tests under coverage on mozilla-central, so the tests look a little worse than they could otherwise be.

(In reply to Nathan Froyd [:froydnj] from comment #17)

(In reply to Marco Castelluccio [:marco] from comment #16)

(In reply to Nathan Froyd [:froydnj] from comment #15)

Bumping the timeouts (somewhat conservatively for reftests, aggressively for everything else) and actually waiting for things to finish:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c&selectedJob=275113895

There are suites that still time out, but there are also hundreds of individual tests that just fail, likely due to timing issues of one sort or another. Note that those are actual failures, not individual test timeouts (though there are some of those), at least as far as the error messages suggest.

Most of the failures are still due to suite timeouts as far as I can see; maybe we can try an even more aggressive bump just to see how much time they would actually need to finish (and then decide whether we actually want a longer timeout, or to increase the chunking).

Just to make sure that we are talking about the same thing, when I say "suites timeout", I mean things like:

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=275129199&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c

where you get "[taskcluster:error] Task timeout after 7200 seconds. Force killing container." When I say "individual test failures", I mean things like:

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=275113960&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c

where individual tests within a suite are giving TEST-UNEXPECTED-FAIL. "Individual test timeouts" would be more like:

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=275113941&revision=9f4f9c0addea99dee743d6c1dcf019d4e5df9e0c

which also has the entire suite timing out, as above.

Yep! Most of the job failures are in the first bucket if I'm not mistaken.

(In reply to Marco Castelluccio [:marco] from comment #16)

Also, it'd be better to just trigger jobs that we actually trigger on mozilla-central, to avoid seeing things more orange than they actually are (see my previous comment for an easy way to do that).

I am running mach try fuzzy with the full list of jobs and selecting all the ccov ones. I guess we're not running the fission tests under coverage on mozilla-central, so the tests look a little worse than they could otherwise be.

Yeah, probably also a few other jobs. The best way to select exactly what you need is the small hack I described at the end of comment 14.

Depends on: 1596275
Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/56a7bf975576
raise the minimum gcc version to 7; r=dmajor
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72