Closed Bug 1305877 Opened 3 years ago Closed 1 year ago

web-platform-test referrer-policy directory takes too long

Categories

(Testing :: web-platform-tests, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: gps)

References

(Blocks 1 open bug)

Details

Attachments

(7 files, 1 obsolete file)

By default, the WPT harness chunks tests using an "equal_time" chunker. This groups tests such that everything under a/b/c is in the same group.

Either the directory grouping behavior is insufficient, buggy, or a certain directory takes way too long to execute because the W7 job on Windows is taking forever - 100 or more minutes in some cases. A number of WPT chunks take <20 minutes.

A log (https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-win64-debug/1474989110/autoland_win8_64-debug_test-web-platform-tests-7-bm110-tests1-windows-build167.txt.gz) shows the first test:

11:26:49     INFO - TEST-START | /referrer-policy/no-referrer-when-downgrade/meta-referrer/cross-origin/http-http/img-tag/insecure-protocol.keep-origin-redirect.http.html

and the last test:

12:59:44     INFO - TEST-START | /referrer-policy/strict-origin/attr-referrer/same-origin/http-https/img-tag/upgrade-protocol.swap-origin-redirect.http.html

All tests under referrer-policy/. Without digging in too deeply, first impression is the directory chunker is not going 3 levels deep as advertised.

If the WPT tests run in random order and aren't as brittle as other tests, I vote for using the "hash" chunker to chunk them. The results will be consistent and we should have similar numbers of tests in all chunks.
Comment on attachment 8795552 [details]
Bug 1305877 - Change WPT chunking default to directory hash;

https://reviewboard.mozilla.org/r/81568/#review80244

::: testing/mozharness/scripts/web_platform_tests.py:169
(Diff revision 1)
>  
> +        # Chunk by filename hash because chunking by directory (the default
> +        # behavior) results in some chunks taking much longer to execute than
> +        # others. Also, exercising tests in a more random order introduces more
> +        # chaos and has a better chance at stumbling upon bugs.
> +        cmd.append('--chunk-type=hash')

So, in general the idea appeals to me (which is why there's a hash chunker to begin with). But I am pretty worried about breaking developer expectations running the tests in a toally random order. I also worry that it will make the total runtime worse because tests that set the same prefs, or use the same protocol, will no longer be run together, so forcing a relatively expensive slow-path to be taken more often.
Attachment #8795552 - Flags: review?(james) → review-
Comment on attachment 8795551 [details]
Bug 1305877 - Make HashChunker stable;

https://reviewboard.mozilla.org/r/81566/#review80242
Attachment #8795551 - Flags: review?(james) → review+
https://treeherder.mozilla.org/#/jobs?repo=try&revision=233355e0803e&selectedJob=28154611 is a working try run; this does at least seem to even out the chunk runtime with 5 chunks, less so with 12. But I still don't think it's actually a good idea.
Comment on attachment 8795552 [details]
Bug 1305877 - Change WPT chunking default to directory hash;

https://reviewboard.mozilla.org/r/81568/#review80244

> So, in general the idea appeals to me (which is why there's a hash chunker to begin with). But I am pretty worried about breaking developer expectations running the tests in a toally random order. I also worry that it will make the total runtime worse because tests that set the same prefs, or use the same protocol, will no longer be run together, so forcing a relatively expensive slow-path to be taken more often.

It doesn't strictly run them in totally random order: it roughly runs every Nth test. But yes, that could break developer expectations.

That being said, I challenge the assertion that there are developer expectations here. I think developers care most that they can run a test and that they can reproduce a test failure. Mach commands already ensure they can run tests. `mach web-platform-tests` already exposes chunking. So as long as chunking behavior is the same locally as it is in automation, that should go a long way towards reproducing failures. One-click loaners and further unifying the execution environment (e.g. using `mach` in automation) will close most of the remaining gap.

As for making total runtime worse by not running tests together, I'm willing to believe the claim, but I want to see data first. Fortunately we can get this data rather easily.

As for similar tests no longer running together, I argue this is a good thing. I've long argued that we should randomize test execution order. This will tease out poorly written tests that have introduced subtle execution order dependencies. This will tease out platform features that also have behavior that differs depending on execution order. Randomizing test execution order better reflects the nature of the average user interacting with the browser and therefore I believe it to be a superior testing method. If you look at https://treeherder.mozilla.org/#/jobs?repo=try&revision=75736a8fa5a8e4bf540c0defd2c70bc2fb544210, the failure in chunk 47 does seem to indicate we flushed out something wonky.

Out of curiosity, how do other browser vendors execute WPT? I know Chrome (at least used to) execute tests in random order and limited the max number of tests in a browser process. If a test failed, they produced a document (JSON possibly) listing the tests that were executed so developers could reproduce that sequence easily. I wonder if they do that for WPT today?

This bug was filed because W7 on Windows is dangerously close to max execution time. Furthermore EqualTimeChunker is yielding highly inconsistent test execution times across chunks. **Something needs to be done.** Taking every Nth test (via hashing or introducing a new chunker to literally take every Nth test) or grouping the tests in equal sized chunks without regards for directory boudaries seem like the simplest solutions from where I'm standing. But I'm not the expert on WPT that you are.
James: this is blocking my work to run WPT from source checkout since the source checkout can bump us over the 2 hour job execution time limit. I'd prefer to resolve this chunking problem this week. Please tell me what solution you prefer.
Flags: needinfo?(james)
I have run the experiment to see if this affects overall performance, and it seems that it doesn't make a substantial difference.

With that in mind, I think I'm OK with taking this approach, precisely because it doesn't randomise the order of the tests, except during rechunking. I think anything else we do (other than test number % N, which would rearrange the tests too often) will be less effective because slow running tests aren't randomly distributed, but occur in clumps. However I *do* think this is a loss for developer ergonomics. In particular I think that people will be surprised that 

./mach path/to/directory/containing/failing/test

doesn't rerun the same sequence of tests as in automation; that currently works as expected in every other test harness. I also think that it will make rechunking the tests more problematic as that's the only time at which the sequence of tests will change. I imagine this approach could later cause problems with any attempt to dynamically pick a number of chunks at runtime.

One way to avoid those problems might be to hash based on the path excluding the filename, although that would obviously be less effective at equalizing the chunk sizes. Nevertheless it could be worth running the experiment.

In any case we need to change the mach command to use the same kind of chunking as mozharness. I also think that we should adjust the number of chunks to be a power of two, and maintain that property, so that when we go from N to 2N chunks a test in chunk i either ends up in chunk i or chunk N+i.
Flags: needinfo?(james)
(In reply to James Graham [:jgraham] from comment #9)
> In any case we need to change the mach command to use the same kind of
> chunking as mozharness.

I'm not sure what you mean. mach's default is currently to run a single chunk. And it is possible to pass arguments to yield the same chunking behavior as automation. So what do we need to change?

> I also think that we should adjust the number of
> chunks to be a power of two, and maintain that property, so that when we go
> from N to 2N chunks a test in chunk i either ends up in chunk i or chunk N+i.

I'd like to think this doesn't matter. If someone wants to reproduce what automation does exactly, they can plug in the chunk counts from automation. We can provide tools to make that easier if it is too difficult.

I just have a strong reaction against establishing chunks such that there is order to the distribution of tests in chunks. Grouping of tests within the same chunk is fine. But relying on the same test/group to be consistently in or around chunk N is a bit too far, IMO. I want to see more chaos, not less.
The Try push using the directory hashing mode showed some consistent failures that I can't wrap my head around. Could you please take a look? https://treeherder.mozilla.org/#/jobs?repo=try&revision=4cb558a3879affd898d21cfe595df05e3a1ffaff
Flags: needinfo?(james)
(In reply to Gregory Szorc [:gps] from comment #13)
> (In reply to James Graham [:jgraham] from comment #9)
> > In any case we need to change the mach command to use the same kind of
> > chunking as mozharness.
> 
> I'm not sure what you mean. mach's default is currently to run a single
> chunk. And it is possible to pass arguments to yield the same chunking
> behavior as automation. So what do we need to change?

if kwargs["total_chunks"] > 1 and kwargs["chunk_type"] is None:
    kwargs["chunk_type"] = "hash"

> > I also think that we should adjust the number of
> > chunks to be a power of two, and maintain that property, so that when we go
> > from N to 2N chunks a test in chunk i either ends up in chunk i or chunk N+i.
> 
> I'd like to think this doesn't matter. If someone wants to reproduce what
> automation does exactly, they can plug in the chunk counts from automation.
> We can provide tools to make that easier if it is too difficult.

This isn't about reproducing. It's about the fact that we have tools that track e.g. chunk N failed because test X was intermittent. Now, those tools are *wrong* but they're also easy to write given our current infra and so are quite common. This strategy will allow us to keep using that kind of tool and change the chunk size in a way that will be easy to comprehend compared to the alternative of "tests randomly end up in different chunks".

> I just have a strong reaction against establishing chunks such that there is
> order to the distribution of tests in chunks. Grouping of tests within the
> same chunk is fine. But relying on the same test/group to be consistently in
> or around chunk N is a bit too far, IMO. I want to see more chaos, not less.

Sure, if our goal is to flush out issues in the tests I agree that this is worthwhile. But your proposal doesn't actually help much because the people who see the failing tests are infra people and only when the chunking changes. Going from "I have a new orange when the chunk size changes" to "I have a developer with the time any inclination to fix the underlying issue (which may be a test problem, or may be a gecko problem)" is non-trivial; historically we have not been very successful at getting attention on intermittent oranges, which are a problem that affects developers directly, so getting attention on a problem that doesn't affect them directly seems even harder.

If you were to propose a tier-3 suite running wpt in a totally randomised order (maybe using something from the build as the random seed), I would totally support that. If we got that to be consistently green I would support making that the default way to run these tests. But I don't support conflating this work to equalise chunk sizes with a "more chaos" manifesto because I don't think this is actually going to help you achieve that goal.
Based on the logs it looks like we are ending up with some windows that don't get closed from previous tests, despite the best effort of the test harness. Therefore cross-test contamination seems quite plausible. The directory structure here also means that chunk-by-directory is effectively chunk by file (I hate the mixed-content tests). I need to try running the whole chunk to get a better idea (unfortunately running wpt on one-click-loaners doesn't seem to work yet).
Flags: needinfo?(james)
Comment on attachment 8795552 [details]
Bug 1305877 - Change WPT chunking default to directory hash;

https://reviewboard.mozilla.org/r/81568/#review81008

::: testing/mozharness/scripts/web_platform_tests.py:157
(Diff revision 2)
>  
> +        # Chunk by filename hash because chunking by directory (the default
> +        # behavior) results in some chunks taking much longer to execute than
> +        # others. Also, exercising tests in a more random order introduces more
> +        # chaos and has a better chance at stumbling upon bugs.
> +        cmd.append('--chunk-type=hash')

Subject to getting a green try run ofc.

::: testing/mozharness/scripts/web_platform_tests.py:157
(Diff revision 2)
>  
> +        # Chunk by filename hash because chunking by directory (the default
> +        # behavior) results in some chunks taking much longer to execute than
> +        # others. Also, exercising tests in a more random order introduces more
> +        # chaos and has a better chance at stumbling upon bugs.
> +        cmd.append('--chunk-type=hash')

Subject to getting a green try run ofc.
Attachment #8795552 - Flags: review?(james) → review+
Comment on attachment 8796327 [details]
Bug 1305877 - Add a directory hash chunker;

https://reviewboard.mozilla.org/r/82212/#review81006

::: testing/web-platform/harness/wptrunner/testloader.py:51
(Diff revision 1)
>      def __call__(self, manifest):
>          chunk_index = self.chunk_number - 1
>          for test_path, tests in manifest:
> -            h = int(hashlib.md5(test_path).hexdigest(), 16)
> +            # We hash the directory of the test so all tests in the same
> +            # directory end up in the same chunk.
> +            h = int(hashlib.md5(os.path.dirname(test_path)).hexdigest(), 16)

I'm happy with this approach or the previous one. I think this is safer, but it clearly doesn't equalise chunk times nearly as well.

(I still think making #chunks a power of two would be a big help).

I'm tempeted to leave both options in the code as hash and hash_directory or so.
Attachment #8796327 - Flags: review?(james) → review+
Comment on attachment 8796663 [details]
Bug 1305877 - Disable wpt navigation test causing hang later in the test run,

https://reviewboard.mozilla.org/r/82430/#review81052
Attachment #8796663 - Flags: review?(gps) → review+
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/38c9ddd81b65
Disable wpt navigation test causing hang later in the test run, r=gps
Attachment #8796689 - Attachment is obsolete: true
Attachment #8796689 - Flags: review?(gps)
jgraham: somehow - I don't know how - we seem to have a new perma orange as a result of the dir chunking. https://treeherder.mozilla.org/#/jobs?repo=try&revision=e48a917c4d7d1d85c03eeaaa9f31b7b0b1a775b0. Search for UnicodeDecodeError in the logs of W8. I can has sad.
Flags: needinfo?(james)
Keywords: leave-open
The grouping of tests between your Try run and my latest one are different. I think my latest try push (without the explicit --chunk-type argument to mozharness) is somehow defaulting to "none". I may re-add that explicit argument. If it is green, I'll push since you did r+ the mozharness script change before.

I guess this means that if shuffle test execution order, we can expect tons of intermittent failures. Ugh.
(In reply to James Graham [:jgraham] from comment #32)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=a500f16134c5&filter-
> searchStr=web-platform-tests seems to be clean fwiw.

Looking again, I don't think your push switched away from the "equal_time" default chunker. I've performed multiple pushes - including a rebase onto a central commit from today - and every time the "dir_hash" chunker sees a failure in W8 on Linux. See https://treeherder.mozilla.org/#/jobs?repo=try&revision=c52b7f8166229079653fcecc515baad228e93429 from today and https://treeherder.mozilla.org/#/jobs?repo=try&revision=4ac2476da9eeead89267ef50d4cb031bf4a4e510 from yesterday.

mixed-content/optionally-blockable/no-opt-in/cross-origin-http/link-prefetch-tag/top-level/keep-scheme-redirect/no-opt-in-allows.https.html is the test that's raising. But I suspect test state is "corrupted" before it gets to this test...
Are you sure you didn't lose my patch to disable testing/web-platform/tests/html/browsers/the-window-object/apis-for-creating-and-navigating-browsing-contexts-by-name/close_script_defer.html? Running the chunk locally it was obvious that that test was leaving the browser in a bad state, and things passed locally once the test was disabled.
Flags: needinfo?(james)
Yes. Disabling of that test landed in 38c9ddd81b65 and is already on central. My latest (failing) try push is rebased on top of that commit.
Attachment #8796661 - Flags: review?(james)
Attachment #8796662 - Flags: review?(james)
Comment on attachment 8797279 [details]
Bug 1305877 - Disable appcache-ordering-main.https.html;

https://reviewboard.mozilla.org/r/82880/#review81518

::: testing/web-platform/meta/service-workers/service-worker/appcache-ordering-main.https.ini:4
(Diff revision 1)
> +[appcache-ordering-main.https.html]
> +  type: testharness
> +  expected:
> +    if e10s and ((os == "win") or ((os == "linux") and (bits = 64)): TIMEOUT

I think you may need to add the subtest explicitly:

    [appcache-ordering-main.https.html]
      type: testharness
      expected:
        if e10s and ((os == "win") or ((os == "linux") and (bits = 64)): TIMEOUT
      [serviceworkers take priority over appcaches]
        expected: if e10s and ((os == "win") or ((os == "linux") and (bits = 64)): NOTRUN
Comment on attachment 8797279 [details]
Bug 1305877 - Disable appcache-ordering-main.https.html;

https://reviewboard.mozilla.org/r/82880/#review81734
Attachment #8797279 - Flags: review?(james) → review+
Comment on attachment 8797279 [details]
Bug 1305877 - Disable appcache-ordering-main.https.html;

https://reviewboard.mozilla.org/r/82880/#review81764
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/83f4e83b4f1a
Make HashChunker stable; r=jgraham
https://hg.mozilla.org/integration/autoland/rev/93d193ddedc9
Add a directory hash chunker; r=jgraham
https://hg.mozilla.org/integration/autoland/rev/60ab329a382b
Change WPT chunking default to directory hash; r=jgraham
https://hg.mozilla.org/integration/autoland/rev/604451ab5829
Disable appcache-ordering-main.https.html; r=jgraham
I followed up with bkelly about the disabled test in bug 1295331.
Blocks: 1295331
Depends on: 1310307
The leave-open keyword is there and there is no activity for 6 months.
:gps, maybe it's time to close this bug?
Flags: needinfo?(gps)
James, wdyt ?
Flags: needinfo?(gps) → needinfo?(james)
Version: Version 3 → unspecified
IDK why this was leave-open.
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Flags: needinfo?(james)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.