Closed Bug 1318187 Opened 8 years ago Closed 7 years ago

Ensure that we actually test Stylo's parallel traversal on CI: enable parallel traversal in e10s tests, sequential traversal in non-e10s tests

Categories

(Core :: CSS Parsing and Computation, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: bholley, Assigned: froydnj)

References

(Blocks 1 open bug)

Details

(Whiteboard: [Stylo])

Attachments

(1 file)

I was just looking at a crash dump from linux64 try, and noticed that it was using sequential traversal. This is presumably because the VM configuration causes our heuristics to decide that there's no point to parallel traversal.

We should investigate what's going on, and see whether we can either improve our VM configuration or artificially force parallel traversal.

Not really urgent but something to do before shipping.
Blocks: stylo-tooling
No longer blocks: stylo
Summary: stylo: Ensure that we actually test parallel traversal on CI → Ensure that we actually test the parallel stylo traversal on CI
Whiteboard: [Stylo]
I guess we may want to test both parallel and sequential traversal. Should we have an additional set of tasks to run tests in the other way?
Bug 1347399 indicates that this is getting to be a more urgent problem.

(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #1)
> I guess we may want to test both parallel and sequential traversal. Should
> we have an additional set of tasks to run tests in the other way?

It would be nice to do that in a way that doesn't increase our automation load.

A reasonable compromise might be to run the parallel traversal for the e10s tests, and the sequential traversal for the non-e10s tests. This basically means passing different environmental variables to the two test configurations - STYLO_THREADS=1 and STYLO_THREADS=4.

jgriffin, can you find someone to help us out with this?
Flags: needinfo?(jgriffin)
Redirecting to ted since jgriffin is out.
Flags: needinfo?(jgriffin) → needinfo?(ted)
Note: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8e19102a1ba20356ad368935ae93619193fb3137 is a green try run with a patch to force parallel traversal (and another to fix bug 1347399). We should flip this on now to be sure it stays green.
See also bug 1348754.
So one thing that came up in IRC discussion yesterday is that the taskcluster workers are configured to use a single core, because when jobs were being migrated from buildbot that matched the configuration of those workers, and we were seeing lots of extra intermittent failures when running with extra cores.

If we want to revisit that decision for Stylo tests I think that'd be fine, but it might require some work to make things green.
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #6)
> If we want to revisit that decision for Stylo tests I think that'd be fine,
> but it might require some work to make things green.

Ted, did these intermittent test failures on multi-core taskcluster workers affect all sorts of tests? Fixing other teams' intermittent test failures does not sound like something the Stylo team wants to take on. :)

Can we limit the tests that run on multi-core workers just to the reftests relevant to style? We might get some testing value by enabling STYLO_THREADS=4 even on single-core workers until we can fix those multi-core test failures.
Flags: needinfo?(ted)
Summary: Ensure that we actually test the parallel stylo traversal on CI → Ensure that we actually test the parallel stylo traversal on CI: enable parallel traversal in e10s tests, sequential traversal in non-e10s tests
Priority: -- → P2
Summary: Ensure that we actually test the parallel stylo traversal on CI: enable parallel traversal in e10s tests, sequential traversal in non-e10s tests → stylo: Ensure that we actually test Stylo's parallel traversal on CI: enable parallel traversal in e10s tests, sequential traversal in non-e10s tests
To be clear, I'm not asking for multicore testing (at least not yet). I just want to enabled parallel traversal on the existing single-core testers. That will tell us when we're e.g. tripping an NS_IsMainThread() assertion on an FFI call.

So my ask in this bug is just to get help setting up an environmental variable in the appropriate configurations. Should be straightforward for someone familiar with that stuff.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #8)
> To be clear, I'm not asking for multicore testing (at least not yet). I just
> want to enabled parallel traversal on the existing single-core testers. That
> will tell us when we're e.g. tripping an NS_IsMainThread() assertion on an
> FFI call.
> 
> So my ask in this bug is just to get help setting up an environmental
> variable in the appropriate configurations. Should be straightforward for
> someone familiar with that stuff.

This is straightforward, and I can look into doing this.  Reftests only, or mochitests too?

Do we care if flipping this makes the test jobs fall over, or fall over harder than they were already doing (i.e. assertions and whatnot)?
Flags: needinfo?(bobbyholley)
(In reply to Nathan Froyd [:froydnj] from comment #9)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #8)
> > To be clear, I'm not asking for multicore testing (at least not yet). I just
> > want to enabled parallel traversal on the existing single-core testers. That
> > will tell us when we're e.g. tripping an NS_IsMainThread() assertion on an
> > FFI call.
> > 
> > So my ask in this bug is just to get help setting up an environmental
> > variable in the appropriate configurations. Should be straightforward for
> > someone familiar with that stuff.
> 
> This is straightforward, and I can look into doing this.

Thank you!

> Reftests only, or
> mochitests too?

Both, ideally.

> 
> Do we care if flipping this makes the test jobs fall over, or fall over
> harder than they were already doing (i.e. assertions and whatnot)?

I tested a few weeks ago and we were green. However, I think bug 1351200 is currently going to make everything crash. I'll try to get that accelerated, but in the mean time let's get this patch ready to go so we can land it a soon as we're green.
Flags: needinfo?(ted)
Flags: needinfo?(bobbyholley)
Nathan said he would enable parallel traversal for Stylo's e10s tests.

We still want sequential traversal for Stylo's non-e10s tests so we are testing both parallel and sequential code paths.
Assignee: nobody → nfroyd
Summary: stylo: Ensure that we actually test Stylo's parallel traversal on CI: enable parallel traversal in e10s tests, sequential traversal in non-e10s tests → Ensure that we actually test Stylo's parallel traversal on CI: enable parallel traversal in e10s tests, sequential traversal in non-e10s tests
OK, so the test definitions for stylo's test jobs live here:

http://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/test/tests.yml#913 (mochitest, e10s and non-e10s)
http://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/test/tests.yml#935 (mochitest-chrome, non-e10s only (?))
http://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/test/tests.yml#1050 (reftest, e10s only, bug 1343301, bug 1339604)

The comments for the reftest disabling suggest that we don't care about the non-e10 case because those tests are going away for 57 anyway...but we still care about the non-e10s case in the here and now for everything else?

For reftest, we can pass in an option to set the STYLO_THREADS environment variable appropriately always--assuming that we're not going to turn on non-e10s there.  For mochitest-chrome, the right thing already happens, apparently...or we adopt the same solution for mochitest-chrome, see below.

For mochitest...I see two options:

1) We duplicate the definition in the .yml file, one for non-e10s and one for e10s; the e10s one then gets whatever appropriate options we need.

2) We write a transform to check whether the test job is for stylo and has e10s enabled; if so, we add whatever we need to turn on the appropriate options.

The second solution is obviously a bit more elegant and general, but does depend on the transforms being run in a particular order (which may already be guaranteed).  Dustin, does this solution sound like the right way to approach things?
Flags: needinfo?(dustin)
I only understand about half of that, but what I've got is "e10s and non-e10s for these jobs are substantially different".  In that case, I think your option (1) makes more sense and will give you the flexibility and readability you need.
Flags: needinfo?(dustin)
(In reply to Dustin J. Mitchell [:dustin] from comment #13)
> I only understand about half of that, but what I've got is "e10s and
> non-e10s for these jobs are substantially different".  In that case, I think
> your option (1) makes more sense and will give you the flexibility and
> readability you need.

OK.  Does knowing that "substantially different" means "whether we pass a command-line option to the mochitest test runner" change your mind?  I'm totally willing to cut-and-paste here, just want to make sure we're on the same page before I do that.
Flags: needinfo?(dustin)
Oh, no, that doesn't seem so bad.  Thanks for the clarification.  In that case, I think (2) is the right solution, and you'd just need to make sure your transform occurs after "e10s: both" is split into e10s: true and e10s: false.  Transforms within a single file run in order, and the order of files is given in taskcluster/ci/tests/kind.yml.
Flags: needinfo?(dustin)
Depends on: 1351200
We'd like to ensure that both parallel and serial traversal in Stylo are
tested on automation.  Since e10s is the future, we've chosen to force
parallel traversal on during e10s tests, and force serial traversal on
during non-e10s tests.

You can see it working sort-of-as-intended here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e477c06323a0981f88649605ad270ea012d0f68e

The e10s tests are falling over due to 1351200, so we'll have to wait to land
this until at least that bug is fixed.

Dustin for the taskcluster transform bit, Chris for the mozharness change.
Attachment #8855316 - Flags: review?(dustin)
Attachment #8855316 - Flags: review?(cmanchester)
Attachment #8855316 - Flags: review?(dustin) → review+
Comment on attachment 8855316 [details] [diff] [review]
turn on parallel Stylo traversal for e10s tests

Review of attachment 8855316 [details] [diff] [review]:
-----------------------------------------------------------------

Can't we set environment variables directly from TaskCluster? It looks like we can from taskcluster/taskgraph/transforms/job/mozharness_test.py
Attachment #8855316 - Flags: review?(cmanchester)
(In reply to Chris Manchester (:chmanchester) from comment #17)
> Can't we set environment variables directly from TaskCluster? It looks like
> we can from taskcluster/taskgraph/transforms/job/mozharness_test.py

We could do that, but we'd have to implement the environment variable in three different places for the (currently supported) three different worker types.  Admittedly, we're only running on Linux atm, so we only need it for one worker type, but presumably we'll start testing on more platforms at some point, and it'd be nice to not have to remember to go back and fix the other workers at that point.  WDYT?
Flags: needinfo?(cmanchester)
(In reply to Nathan Froyd [:froydnj] from comment #18)
> (In reply to Chris Manchester (:chmanchester) from comment #17)
> > Can't we set environment variables directly from TaskCluster? It looks like
> > we can from taskcluster/taskgraph/transforms/job/mozharness_test.py
> 
> We could do that, but we'd have to implement the environment variable in
> three different places for the (currently supported) three different worker
> types.  Admittedly, we're only running on Linux atm, so we only need it for
> one worker type, but presumably we'll start testing on more platforms at
> some point, and it'd be nice to not have to remember to go back and fix the
> other workers at that point.  WDYT?

Someone may correct me on this, but reading the first patch in bug 1343327 makes me think we should be setting these as environment variables directly from TC. Given the sort of things I'm seeing in that file it rather seems like this will end up being de-duplicated as a matter of bringing more platforms on line. Let's check in with someone more involved with the migration.
Flags: needinfo?(cmanchester)
Depends on: 1355097
(In reply to Chris Manchester (:chmanchester) from comment #19)
> (In reply to Nathan Froyd [:froydnj] from comment #18)
> > (In reply to Chris Manchester (:chmanchester) from comment #17)
> > > Can't we set environment variables directly from TaskCluster? It looks like
> > > we can from taskcluster/taskgraph/transforms/job/mozharness_test.py
> > 
> > We could do that, but we'd have to implement the environment variable in
> > three different places for the (currently supported) three different worker
> > types.  Admittedly, we're only running on Linux atm, so we only need it for
> > one worker type, but presumably we'll start testing on more platforms at
> > some point, and it'd be nice to not have to remember to go back and fix the
> > other workers at that point.  WDYT?
> 
> Someone may correct me on this, but reading the first patch in bug 1343327
> makes me think we should be setting these as environment variables directly
> from TC. Given the sort of things I'm seeing in that file it rather seems
> like this will end up being de-duplicated as a matter of bringing more
> platforms on line. Let's check in with someone more involved with the
> migration.

Do you have someone in mind?  I asked in #taskcluster:

9:37 AM <froydnj> is it preferred to set environment variables in taskcluster code directly, or to set them in mozharness or whatever framework is controlling the job?
9:39 AM <@dustin> froydnj: depends on the var, but generally in mozharness
9:39 AM <@dustin> froydnj: especially for gecko tests
9:40 AM <froydnj> dustin: ok, that's what I thought.  thanks

which would support the command-line-option approach taken in this patch; the approach taken in this bug is also the approach taken by things like the --allow-software-gl-layers mozharness option.  My (limited) understanding is that Stylo is also going to be basically x86-64 Linux testing only until it gets turned on in Nightly, at which point there's not going to be a lot of test bringup on other platforms prior to that, so it'd be *really* nice if things just magically worked when we flipped the mozconfig switch for Stylo on Windows and Mac, rather than having to remember to twiddle things here.
Flags: needinfo?(cmanchester)
Depends on: 1354772
I guess that's fine then.

Obviously something's going to need to be twiddled when we want this to work outside linux, this patch is only checking against "linux64-stylo/" as is.
Flags: needinfo?(cmanchester)
All the deps are fixed - is this green now?
Flags: needinfo?(nfroyd)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #22)
> All the deps are fixed - is this green now?

Nope, filed bug 1355814, bug 1355813, and bug 1355811 for issues that have come up.
Depends on: 1355811, 1355813, 1355814
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #23)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #22)
> > All the deps are fixed - is this green now?
> 
> Nope, filed bug 1355814, bug 1355813, and bug 1355811 for issues that have
> come up.

These are all from Manish's font stuff - he's going to dig into them today.
Manish is going to disable the font metric stuff for now to unblock this.
Flags: needinfo?(manishearth)
Depends on: 1356122
Disabled
Flags: needinfo?(manishearth)
Nathan, fyi: the Stylo team is now asking for a new "linux64-stylo-sequential" platform (bug 1356122) that will run the Stylo tests in sequential mode on central and try. All non-e10s tests (not just Stylo's) are going away in 57 and we need to continue running both parallel and sequential Stylo modes.

As long as we have the non-e10s tests running on autoland and inbound for 55-57, it would be nice to be able to run them in sequential mode. linux64-stylo-sequential would only run tests on central, while sequential non-e10s could catch regressions sooner on autoland or inbound.
Flags: needinfo?(nfroyd)
Flags: needinfo?(nfroyd)
Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b43d7c9d5a5a
turn on parallel Stylo traversal for e10s tests; r=dustin,chmanchester
https://hg.mozilla.org/mozilla-central/rev/b43d7c9d5a5a
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: