Bug 1586790 Comment 14 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

Original comment by

James Graham [:jgraham]

on 2021-02-24 09:43:38 PST

If I understand correctly, it's not just disabled tests you care about. In wpt, there are basically three categories of differences you might care about:

* Tests that are disabled in fission but not on other configurations. These tests either don't run at all (when whole test files are disabled) or are run but the results are ignored (when specific subtests are disabled; this is rare). The wpt sync never disables tests; it's only done by humans.
* Tests that have a fixed expectation that's different between fission and non-fission configurations e.g. `expected: FAIL` for fission, but `expected: PASS` for non-fission. These are things which clearly need to be fixed or at least understood.
* Tests that have an intermittent result of some kind, especially one that differs between fission and non-fission. This is problematic because we don't have a great system for telling which of the intermittent results actually occur in practice. For example, `expected: PASS` in non-fission and `expected: [PASS, FAIL]` in fission might reflect a one-time failure that the sync added and that's now a perma-pass, or it might be a perma-fail. This can also affect cases where there isn't a fission-specific expectation, e.g. `expected: [PASS, FAIL]` in both fission and non-fission could hide a test that perma-fails in fission and perma-passes in non-fission.

Now answering the questions:

> can we re-enable all currently disabled WPTs for Fission, and then let the wpt-sync script disable the still failing ones, so that we don't have to individually check each intermittent to see if it can be re-enabled?

It makes sense to go through the expectation ini files, remove any non-perma expectations that differ between fission and non-fission, re-enable any tests that run on fission, run all of that through try with some rebuilds, and then use the `mach wpt-update` command to update the expectations to the observed results. We can't get the wpt-sync to do this directly, but we can do basically the same thing it would do. To make updating the expectations as straightforward as possible, we should try to run all the configurations found on mozilla-central. For comparison [1] is a recent wpt-sync try push; note that it uses `--disable-target-task-filter` to enable some additional tasks.

> Can the wpt-sync script or an adjacent script file a new bug under this meta bug for each Fission WPT that gets disabled so we have a consolidated list? Filing a new bug for each Fission disabled WPT should continue for each wpt-sync run so we don't miss tests that get disabled for Fission.

Getting the sync to do this would be quite non-trivial. The current way the sync files bugs is by looking at the per-PR results, both on GitHub and in Gecko CI. We don't have fission runs in either of those places yet, and we don't have a mechanism to do something special with failures that only happen in a specific configuration. We also don't have any integration between bug filing and the try pushes we do immediately before landing, so we would miss anything that failed in that case which had passed in the per-PR run; this is fairly common in general.

Instead of tying this to the sync directly I suggest writing a job that will run on central pushes and create an artifact of ini file differences representing regressions between fission and non-fission. This is pretty straightforward and will ensure that we capture all the differences that are annotated in the ini file. It might still miss cases where the expectation is intermittent across configurations but the results are actually different between fission and non-fission. We could capture these by looking at the actual recorded results, but without historical data there are likely to be false-positives from tests that are actually just intermittent.

If you have a complete list of fission regressions, the remaining question is how to track those. The wpt-sync uses an external metadata repo to check if there are already bugs filed for a specific test failure. I don't think we want to reuse that here. Also, the fact that the sync files bugs per-PR means that we can keep the volume of bugs filed down to a reasonable level. The problem with auto-filing bugs in general is that it's very hard to script a solution to "are these issues the same bug or a different bug". Humans don't do this perfectly but they are at least better at making informed guesses ;)

Given all of this, I'd prefer if actual bugs were filed by people. Of course we can still figure out some way to ensure that you know when there are fission regressions which are not associated with any bug (e.g. new ones). The main question in doing this is where we want the association between test result and bug to live. This can go in the wpt metadata (I think there's some precedent for that in the fission project, and certainly there is in general). That has the advantage that it's easy for the script that summarizes the regressions to tag each one with a bug number. The big disadvantage is that it means you need to make an actual m-c commit to update the annotations. The other option is that the association lives outside the source tree and we have some way (e.g. a script) to update this with the latest data from mozilla-central. That could be in Bugzilla if we find some way to pack the data into the bugs, but it's not designed for it. It could probably be something like Google Sheets, assuming there's some API we could use to update a sheet.

Does that make sense?

Revision 1 by

James Graham [:jgraham]

on 2021-02-24 09:49:09 PST

If I understand correctly, it's not just disabled tests you care about. In wpt, there are basically three categories of differences you might care about:

* Tests that are disabled in fission but not on other configurations. These tests either don't run at all (when whole test files are disabled) or are run but the results are ignored (when specific subtests are disabled; this is rare). The wpt sync never disables tests; it's only done by humans.
* Tests that have a fixed expectation that's different between fission and non-fission configurations e.g. `expected: FAIL` for fission, but `expected: PASS` for non-fission. These are things which clearly need to be fixed or at least understood.
* Tests that have an intermittent result of some kind, especially one that differs between fission and non-fission. This is problematic because we don't have a great system for telling which of the intermittent results actually occur in practice. For example, `expected: PASS` in non-fission and `expected: [PASS, FAIL]` in fission might reflect a one-time failure that the sync added and that's now a perma-pass, or it might be a perma-fail. This can also affect cases where there isn't a fission-specific expectation, e.g. `expected: [PASS, FAIL]` in both fission and non-fission could hide a test that perma-fails in fission and perma-passes in non-fission.
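
To make the three categories concrete, here is a minimal sketch in Python. It does not parse real wpt metadata; it assumes each test's expectation has already been reduced to a list of statuses per configuration, with `None` standing for "disabled". The function name `classify` and the category labels are made up for illustration.

```python
from typing import Optional, Sequence


def classify(fission: Optional[Sequence[str]],
             non_fission: Optional[Sequence[str]]) -> str:
    """Classify one test's expectations across the two configurations.

    Each argument is the list of expected statuses for that configuration,
    or None if the test is disabled there (an assumption of this sketch).
    """
    # Category 1: disabled in fission but still running elsewhere.
    if fission is None and non_fission is not None:
        return "disabled-in-fission"
    # Disabled everywhere, or disabled only in non-fission: not of interest here.
    if fission is None or non_fission is None:
        return "other"

    f, n = set(fission), set(non_fission)
    # Category 3: more than one expected status in either configuration.
    if len(f) > 1 or len(n) > 1:
        return "intermittent"
    # Category 2: a single fixed status in each, but they differ.
    if f != n:
        return "fixed-difference"
    return "same"
```

Note that `classify(["PASS", "FAIL"], ["PASS", "FAIL"])` is reported as intermittent even though the annotations match, which is exactly the case where the real results could still differ between configurations.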

Now answering the questions:

> can we re-enable all currently disabled WPTs for Fission, and then let the wpt-sync script disable the still failing ones, so that we don't have to individually check each intermittent to see if it can be re-enabled?

It makes sense to go through the expectation ini files, remove any non-perma expectations that differ between fission and non-fission, re-enable any tests that run on fission, run all of that through try with some rebuilds, and then use the `mach wpt-update` command to update the expectations to the observed results. We can't get the wpt-sync to do this directly, but we can do basically the same thing it would do. To make updating the expectations as straightforward as possible, we should try to run all the configurations found on mozilla-central. For comparison [1] is a recent wpt-sync try push; note that it uses `--disable-target-task-filter` to enable some additional tasks.

> Can the wpt-sync script or an adjacent script file a new bug under this meta bug for each Fission WPT that gets disabled so we have a consolidated list? Filing a new bug for each Fission disabled WPT should continue for each wpt-sync run so we don't miss tests that get disabled for Fission.

Getting the sync to do this would be quite non-trivial. The current way the sync files bugs is by looking at the per-PR results, both on GitHub and in Gecko CI. We don't have fission runs in either of those places yet, and we don't have a mechanism to do something special with failures that only happen in a specific configuration. We also don't have any integration between bug filing and the try pushes we do immediately before landing, so we would miss anything that failed in that case which had passed in the per-PR run; this is fairly common in general.

Instead of tying this to the sync directly I suggest writing a job that will run on central pushes and create an artifact of ini file differences representing regressions between fission and non-fission. This is pretty straightforward and will ensure that we capture all the differences that are annotated in the ini file. It might still miss cases where the expectation is intermittent across configurations but the results are actually different between fission and non-fission. We could capture these by looking at the actual recorded results, but without historical data there are likely to be false-positives from tests that are actually just intermittent.
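
A rough sketch of what such a job could compute, in Python. This is only an illustration under assumptions: the input is a plain dict of already-extracted expectations rather than real ini files, "regression" is taken to mean "non-fission is expected to always pass but fission is not", and the names `fission_regressions` and `write_artifact` are hypothetical.

```python
import json

# Statuses treated as "good" in this sketch.
GOOD = {"PASS", "OK"}


def fission_regressions(expectations):
    """Return the tests whose annotations are worse in fission.

    expectations maps a test id to
    {"fission": [...statuses] or None, "non-fission": [...statuses] or None},
    where None means the test is disabled in that configuration.
    """
    regressions = {}
    for test_id, by_config in sorted(expectations.items()):
        fission = by_config["fission"]
        non_fission = by_config["non-fission"]
        if fission == non_fission:
            # Identical annotations, including shared intermittents, are
            # invisible to a pure ini diff.
            continue
        non_fission_good = non_fission is not None and set(non_fission) <= GOOD
        fission_good = fission is not None and set(fission) <= GOOD
        if non_fission_good and not fission_good:
            regressions[test_id] = {"fission": fission,
                                    "non-fission": non_fission}
    return regressions


def write_artifact(expectations, path):
    """Write the regressions as a JSON artifact at the given path."""
    with open(path, "w") as f:
        json.dump(fission_regressions(expectations), f, indent=2, sort_keys=True)
```

The `continue` branch is where the limitation described above shows up: a test annotated `[PASS, FAIL]` in both configurations is skipped, even if it actually always fails in fission.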

If you have a complete list of fission regressions, the remaining question is how to track those. The wpt-sync uses an external metadata repo to check if there are already bugs filed for a specific test failure. I don't think we want to reuse that here. Also, the fact that the sync files bugs per-PR means that we can keep the volume of bugs filed down to a reasonable level. The problem with auto-filing bugs in general is that it's very hard to script a solution to "are these issues the same bug or a different bug". Humans don't do this perfectly but they are at least better at making informed guesses ;)

Given all of this, I'd prefer if actual bugs were filed by people. Of course we can still figure out some way to ensure that you know when there are fission regressions which are not associated with any bug (e.g. new ones). The main question in doing this is where we want the association between test result and bug to live. This can go in the wpt metadata (I think there's some precedent for that in the fission project, and certainly there is in general). That has the advantage that it's easy for the script that summarizes the regressions to tag each one with a bug number. The big disadvantage is that it means you need to make an actual m-c commit to update the annotations. The other option is that the association lives outside the source tree and we have some way (e.g. a script) to update this with the latest data from mozilla-central. That could be in Bugzilla if we find some way to pack the data into the bugs, but it's not designed for it. It could probably be something like Google Sheets, assuming there's some API we could use to update a sheet.
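
Wherever the association ends up living, the "tell me about regressions with no bug" step is simple once both lists exist. A minimal sketch, assuming the regressions are a list of test ids and the association is a plain mapping from test id to bug number (the function name and data shapes are hypothetical, not an existing tool):

```python
def tag_with_bugs(regressions, bug_for_test):
    """Split regressions into those with a known bug and those without.

    regressions: iterable of test ids.
    bug_for_test: dict mapping test id -> bug number, loaded from wherever
    the association is stored (wpt metadata, a spreadsheet, ...).
    Returns (tagged, untracked): a dict of test id -> bug number, and a
    sorted list of test ids that still need a human to file or link a bug.
    """
    regressions = list(regressions)
    tagged = {t: bug_for_test[t] for t in regressions if t in bug_for_test}
    untracked = sorted(t for t in regressions if t not in bug_for_test)
    return tagged, untracked
```

The `untracked` list is the part people would look at; deciding whether two entries in it are really the same bug stays a human call.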

Does that make sense?

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=026810e69f92c9d345500a8d45e77b13b2c3edfc
