Closed Bug 1238435 Opened 10 years ago Closed 9 years ago

Intermittent e10s TEST-UNEXPECTED-TIMEOUT | /html/dom/reflection-forms.html, /html/dom/reflection-embedded.html, /html/dom/reflection-grouping.html followed by busting the whole run

Categories

(Testing :: web-platform-tests, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

(e10s+, firefox46 fixed, firefox47 fixed, firefox48 fixed)

RESOLVED FIXED
mozilla48
Tracking Status
e10s + ---
firefox46 --- fixed
firefox47 --- fixed
firefox48 --- fixed

People

(Reporter: philor, Assigned: jgraham)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [e10s-orangeblockers])

Attachments

(1 file, 4 obsolete files)

Summary: Intermittent TEST-UNEXPECTED-TIMEOUT | /html/dom/reflection-forms.html, /html/dom/reflection-embedded.html followed by busting the whole run → Intermittent e10s TEST-UNEXPECTED-TIMEOUT | /html/dom/reflection-forms.html, /html/dom/reflection-embedded.html followed by busting the whole run
Summary: Intermittent e10s TEST-UNEXPECTED-TIMEOUT | /html/dom/reflection-forms.html, /html/dom/reflection-embedded.html followed by busting the whole run → Intermittent e10s TEST-UNEXPECTED-TIMEOUT | /html/dom/reflection-forms.html, /html/dom/reflection-embedded.html, /html/dom/reflection-grouping.html followed by busting the whole run
See Also: → 1230148
Ms2ger, do you have any ideas how these tests could be busting the harness? This started after some new tests landed in a later chunk, pushing these tests from chunk 4 to chunk 3.
Flags: needinfo?(Ms2ger)
Does it happen on try? I can try poking a bit.
Flags: needinfo?(Ms2ger)
Brasstacks shows it's triggered on m-i, m-c, b-i, and f-t. I assume it will trigger on try as well. Here is a try build: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8df76c0e0e3a
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b78a4c59ad8d says yes, you can hit it on try without much problem.
NI in case you did not see comment 4. Any help is appreciated while jgraham is on PTO. Thanks!
Flags: needinfo?(Ms2ger)
It seems to me that the test is hitting a harness issue. It never really starts the test:

15:50:18 INFO - TEST-START | /html/dom/reflection-forms.html
15:50:18 INFO - Setting pref dom.forms.inputmode (true)
15:50:19 INFO - PROCESS | 2017 | ++DOCSHELL 0x7f1721e11800 == 29 [pid = 2068] [id = 520]
15:50:19 INFO - PROCESS | 2017 | ++DOMWINDOW == 40 (0x7f17239f6c00) [pid = 2068] [serial = 1604] [outer = (nil)]
15:50:19 INFO - PROCESS | 2017 | ++DOMWINDOW == 41 (0x7f17239f7c00) [pid = 2068] [serial = 1605] [outer = 0x7f17239f6c00]
15:50:19 INFO - PROCESS | 2017 | ++DOMWINDOW == 42 (0x7f17291d2c00) [pid = 2068] [serial = 1606] [outer = 0x7f17239f6c00]
15:50:19 INFO - PROCESS | 2017 | [Parent 2017] WARNING: Suboptimal indexes for the SQL statement 0x7fd54f8322d0 (http://mzl.la/1FuID0j).: file /builds/slave/m-in-l64-d-0000000000000000000/build/src/storage/mozStoragePrivateHelpers.cpp, line 114
15:50:19 INFO - PROCESS | 2017 | [Parent 2017] WARNING: Suboptimal indexes for the SQL statement 0x7fd54f8321a0 (http://mzl.la/1FuID0j).: file /builds/slave/m-in-l64-d-0000000000000000000/build/src/storage/mozStoragePrivateHelpers.cpp, line 114
15:50:20 INFO - PROCESS | 2017 | ++DOMWINDOW == 43 (0x7f17291db000) [pid = 2068] [serial = 1607] [outer = 0x7f17239f6c00]
15:50:20 INFO - PROCESS | 2017 | [Child 2068] WARNING: attempt to modify an immutable nsStandardURL: file /builds/slave/m-in-l64-d-0000000000000000000/build/src/netwerk/base/nsStandardURL.cpp, line 1302
15:50:29 INFO - PROCESS | 2017 | --DOCSHELL 0x7f17312a4000 == 28 [pid = 2068] [id = 519]
15:50:37 INFO - PROCESS | 2017 | --DOMWINDOW == 42 (0x7f17239f7c00) [pid = 2068] [serial = 1605] [outer = (nil)] [url = about:blank]
15:50:37 INFO - PROCESS | 2017 | --DOMWINDOW == 41 (0x7f17291d2c00) [pid = 2068] [serial = 1606] [outer = (nil)] [url = about:blank]
15:50:37 INFO - PROCESS | 2017 | --DOMWINDOW == 40 (0x7f17291de000) [pid = 2068] [serial = 1601] [outer = (nil)] [url = http://web-platform.test:8000/html/dom/reflection-embedded.html]
15:50:41 INFO - PROCESS | 2017 | --DOMWINDOW == 39 (0x7f1730e8dc00) [pid = 2068] [serial = 1603] [outer = (nil)] [url = about:blank]
15:53:19 INFO - PROCESS | 2017 | MARIONETTE LOG: INFO: Timeout fired
15:53:19 INFO - PROCESS | 2017 | JavaScript error: executormarionette.py, line 33: Error: Permission denied to access property "timeout"
15:53:24 INFO - TEST-UNEXPECTED-TIMEOUT | /html/dom/reflection-forms.html | expected OK
15:53:24 INFO - TEST-INFO took 185535ms
Let's temporarily disable the test on linux debug until James gets back and has time to look at it. https://treeherder.mozilla.org/#/jobs?repo=try&revision=f12ff2b1929a
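For context, wpt tests are disabled per-configuration through their expectation files under testing/web-platform/meta/. The snippet below is only a hedged sketch of what such an entry might look like; the exact condition keys and syntax should be copied from existing .ini files rather than from here.

    [reflection-forms.html]
      disabled:
        if debug and (os == "linux"): https://bugzilla.mozilla.org/show_bug.cgi?id=1238435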
Flags: needinfo?(Ms2ger)
Attachment #8707479 - Flags: review?(Ms2ger)
Comment on attachment 8707479 [details] [diff] [review]
Temporarilly disable html/dom/reflection-forms.html wpt test on linux debug. r=ms2ger

This just moved the failure to another test...
Attachment #8707479 - Flags: review?(Ms2ger)
When I run `./mach web-platform-tests --this-chunk=3 --total-chunks=8 --e10s` locally, I get crashes in all runs:

1:18.93 PROCESS_OUTPUT: ProcessReader (pid:2902) "FATAL ERROR: Non-local network connections are disabled and a connection attempt to site.example (92.242.140.2) was made."
1:18.93 PROCESS_OUTPUT: ProcessReader (pid:2902) "You should only access hostnames available via the test networking proxy (if running mochitests) or from a test-specific httpd.js server (if running xpcshell tests). Browser services should be disabled or redirected to a local server."
1:18.93 PROCESS_OUTPUT: ProcessReader (pid:2902) "Hit MOZ_CRASH(Attempting to connect to non-local address!) at /home/yury/Work/mozilla-central/netwerk/base/nsSocketTransport2.cpp:1253"

However, running via `./mach web-platform-tests testing/web-platform/tests/html/dom/reflection-embedded.html` does not result in a crash. (I disabled most of the tests in the chunk via https://pastebin.mozilla.org/8856611 for convenience.)

Ben, we probably need to disable all reflection tests, since reflection.js tests "site.example" URLs.
It's not clear to me that that's related, but feel free to update the URLs in reflection.js.
Attachment #8707479 - Attachment is obsolete: true
Attachment #8707844 - Flags: review?(Ms2ger)
Comment on attachment 8707844 [details] [diff] [review]
Disable wpt on linux that might use site.example

Review of attachment 8707844 [details] [diff] [review]:
-----------------------------------------------------------------

Please instead update reflection.js to use the value of `location.host` rather than `site.example`.
Attachment #8707844 - Flags: review?(Ms2ger) → review-
Blocks: e10s-tests
tracking-e10s: --- → +
I don't think that will work, though; at my location, location.host can be resolved to some IP (due to DNS setup). Could the same problem be happening in our testing environment?
Comment on attachment 8707857 [details] [diff] [review]
Renames site.example to location.host

Review of attachment 8707857 [details] [diff] [review]:
-----------------------------------------------------------------

The *value* of `location.host`, not the literal string "location.host".
Attachment #8707857 - Flags: review?(Ms2ger) → review-
(In reply to :Ms2ger from comment #16)
> Please instead update reflection.js to use the value of `location.host`
> rather than `site.example`.

The tests are failing now with:

TEST-UNEXPECTED-FAIL | /html/dom/reflection-embedded.html | img.lowsrc: setAttribute() to "//web-platform.test:8000/path???@#l" followed by IDL get - assert_equals: expected "http://web-platform.test:8000/path???@#l" but got "//web-platform.test:8000/path???@#l"
Linux64 debug wpt-e10s-3 hidden on all trunk trees and try. Give a shout when it gets better.
Attachment #8707873 - Attachment is obsolete: true
(In reply to Yury Delendik (:yury) from comment #21)
> (In reply to :Ms2ger from comment #16)
> > Please instead update reflection.js to use the value of `location.host`
> > rather than `site.example`.
>
> The tests are failing now with:
>
> TEST-UNEXPECTED-FAIL | /html/dom/reflection-embedded.html | img.lowsrc:
> setAttribute() to "//web-platform.test:8000/path???@#l" followed by IDL get
> - assert_equals: expected "http://web-platform.test:8000/path???@#l" but got
> "//web-platform.test:8000/path???@#l"

You'll need to update `testing/web-platform/meta/html/dom/reflection-embedded.html.ini` where it mentions site.example.
So, to be clear that I understand what's going on here:

* After adding the streams tests, something (possibly unrelated to the streams tests, possibly related to chunking) happened that causes W3/linux debug (only) to fail after restart.
* This bug contains a patch to change the reflection tests to not inadvertently access an external server, but there is no particular evidence that's related to the intermittent failure, except that a crash in those tests will force a restart, which will then cause the above bug to manifest itself.
* As a result we are currently hiding W3 on linux debug.

Is that right? Just making sure I didn't miss anything.
I believe that's correct. Any help figuring this out would be appreciated!
Blocks: 1241297
And hidden on mozilla-aurora.
If we're starting to see OrangeFactor reports of this issue, it's because on TC we're running with 12 chunks instead of 8 and the issue has moved to chunk 4. https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=tc%20W%284%29&group_state=expanded&exclusion_profile=false

It seems that changing the chunking is causing new issues to pop up. I will go back to a chunk size of 8 and see if it goes away.
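To make the chunk-shuffling effect discussed above concrete, here is a minimal, illustration-only sketch of consecutive chunking (this is not wptrunner's actual chunking logic, and the test names are placeholders): changing --total-chunks, or adding tests that change the ordering, moves a given test into a different job.

    # Illustration only, not wptrunner's real chunking algorithm: a naive
    # consecutive split of an ordered test list into N chunks.
    def chunk(tests, total_chunks, this_chunk):
        per_chunk = -(-len(tests) // total_chunks)  # ceiling division
        start = (this_chunk - 1) * per_chunk
        return tests[start:start + per_chunk]

    tests = ["/test-%03d.html" % i for i in range(100)]  # placeholder test paths

    for total in (8, 12):
        holder = next(i for i in range(1, total + 1)
                      if "/test-050.html" in chunk(tests, total, i))
        print("with %d chunks, /test-050.html runs in chunk %d" % (total, holder))
    # With 8 chunks the test lands in chunk 4; with 12 chunks it lands in chunk 6.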
Latest wpt update moved this into chunk 5 on TC, and I'm not in the mood to play merge-and-tree whac-a-mole so I've just hidden both 4 and 5.
Severity: major → critical
Why do we do this? https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/harness/wptrunner/wptrunner.py#235

I tried running the test manually and I'm dropped directly into pdb:

/home/worker/workspace/build/venv/bin/python -u /home/worker/workspace/build/tests/web-platform/runtests.py --log-raw=- --log-raw=/home/worker/workspace/build/blobber_upload_dir/wpt_raw.log --log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log --binary=/home/worker/workspace/build/application/firefox/firefox --symbols-path=https://queue.taskcluster.net/v1/task/arQ-ntrqTT-BbuWRFKGkuQ/artifacts/public/build/target.crashreporter-symbols.zip --stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk --test-type=testharness --total-chunks=12 --this-chunk=5 --prefs-root=/home/worker/workspace/build/tests/web-platform/prefs --processes=1 --config=/home/worker/workspace/build/tests/web-platform/wptrunner.ini --ca-cert-path=/home/worker/workspace/build/tests/web-platform/certs/cacert.pem --host-key-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.key --host-cert-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem --certutil-binary=/home/worker/workspace/build/tests/bin/certutil html/dom/reflection-forms.html
The issue was that I was using it in combination with --this-chunk/--total-chunks.
Right, that is a bug; it should either be removed or put behind some kind of flag.
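As a sketch of the "put it behind some kind of flag" option mentioned above (the flag name and helper are hypothetical, not actual wptrunner code), the debugger hook could be made opt-in instead of being triggered implicitly when a single test is run:

    # Hypothetical sketch; the "pdb" flag and helper are made up for
    # illustration and are not part of wptrunner.
    import pdb
    import sys

    def maybe_post_mortem(kwargs):
        """Enter the post-mortem debugger only when explicitly requested,
        rather than inferring it from how many tests were passed, which
        interacts badly with --this-chunk/--total-chunks."""
        if not kwargs.get("pdb"):  # e.g. an opt-in --pdb command-line flag
            return
        _, _, tb = sys.exc_info()
        if tb is not None:
            pdb.post_mortem(tb)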
What ended up landing here? I don't see any recent changes to this file: https://hg.mozilla.org/mozilla-central/filelog/584870f1cbc5d060a57e147ce249f736956e2b62/testing/web-platform/meta/html/dom/reflection-forms.html.ini

It seems that the failure disappeared after the 15th (https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1238435&startday=2016-01-11&endday=2016-01-22&tree=all), even though I still see it in the TaskCluster tasks.
That's the wrong "what happened?" to be looking at, there you are looking at when I hid it, so it stopped being starred. The one to look at is on January 30th, when a w-p-t update from bug 1244461 moved things around so that in the buildbot 8-chunk it moved back into a chunk where it was happy, but in your 12-chunk it did not.
Thanks, philor! It seems I never moved it back to 8 chunks (I thought I had).
Going back to 8 chunks for TaskCluster has cleared this issue. For now, this bug is a ticking bomb. Could someone investigate it? You can make this issue happen on try with different chunking. You can probably reproduce it locally inside of Docker. You can also use --interactive in your push to see it happen live on the production machines inside of Docker.
:jgraham, I was under the impression we ran web-platform-tests with --run-by-dir (where a fresh browser is started for each directory). Assuming that, are we splitting a directory up when changing chunks? Can you help us figure out what the root cause might be, so this bug can turn into something actionable and we can start playing with more chunks?
Flags: needinfo?(james)
We aren't using --run-by-dir in production for wpt at the moment, although the harness has support for it. It's hard to tell what's going on here without being able to inspect the state when the tests fail. The TEST-UNEXPECTED-TIMEOUT might just be that the test is long-running and hits the limit of even our long timeout; if that's the case, it's possible to split up the tests further or something. The fact that the run is sometimes then busted is more troublesome, because it means that either there's a bug in the harness or the server is getting into a bad state for some reason.
Flags: needinfo?(james)
OK, that is good to know. In fact, this could be related to Docker running the tests roughly 20% slower than buildbot; that is the case for many of our tests, where we now hit timeouts we were not hitting before.
jmaher: remember that this bug was originally filed for Buildbot rather than TaskCluster. As philor pointed out in comment 36, some tests moved from one chunk to another and things started working again. In other words, just trying to clarify that TC/Docker is slower, but Buildbot was hitting this as well.
Getting quite old. I modified my previous exclusion to cover just buildbot's wpt-3 on debug linux64 for Gecko 46, since it's going to ride that train into 46's grave, and created a new one for taskcluster's wpt-3, since I'm just ignoring every result from it.
And we're back to failing on trunk with buildbot's wpt-3.
This is probably the top wpt intermittent at the moment and appears to be harness related IIUC? Is there a better component for it to be in otherwise?
Flags: needinfo?(Ms2ger)
I don't think there's a better component. The fundamental problem here is that I haven't got a good way to reproduce the issue, so it's hard to tell exactly what's going on. Disabling these tests might help a bit but it wouldn't fix the underlying issue, so we would just have an even harder to reproduce serious intermittent.
Flags: needinfo?(Ms2ger)
Could you finish your patch to see if it helps?
Flags: needinfo?(ydelendik)
:jgraham, this is a linux64-only error; is it on buildbot or taskcluster? This has 79 instances in the last week, which is a lot given that the trees were closed often. Assuming you cannot reproduce this or don't have other ideas, can you please disable these tests ASAP?
Flags: needinfo?(james)
Done, but I can't guarantee that the problem won't manifest itself in some other way, since we still don't understand it at all.
Given https://treeherder.mozilla.org/logviewer.html#?job_id=155462&repo=ash, I don't have high hopes of disabling our way to victory here.
Whiteboard: [e10s-orangeblockers]
I now have a more plausible theory for what's going on here. It seems some other tests in the chunk were trying to send a 4.5 MB image one byte at a time, with a pause between each byte. Not only is this very silly, it also exposes the fact that the trickle function in wpt is hilariously inefficient and (in this case) allocates multiple objects, including a callback function, per byte of input. So I suspect what happens is that these tests run, the server's memory usage balloons, and it is later killed by the OOM killer (or is just unable to serve requests because it's so busy trying to GC all those objects). I will try fixing the tests and wptserve and see if that helps in this case.
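For readers unfamiliar with the failure mode described above, the following is a minimal sketch (not the actual wptserve trickle implementation; the function names are illustrative) of the difference between allocating one callback per byte and writing the payload in a few larger chunks:

    # Minimal sketch, not the actual wptserve code: per-byte callbacks versus
    # chunked writes for a large payload.
    import time

    def trickle_per_byte(payload, write, delay=0.001):
        # One closure per byte: millions of objects alive at once for a ~4.5 MB image.
        callbacks = [(lambda b=bytes([byte]): write(b)) for byte in payload]
        for cb in callbacks:
            cb()
            time.sleep(delay)

    def trickle_chunked(payload, write, chunk_size=64 * 1024, delay=0.001):
        # A few dozen writes and sleeps; memory use stays flat.
        for start in range(0, len(payload), chunk_size):
            write(payload[start:start + chunk_size])
            time.sleep(delay)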
Flags: needinfo?(james)
Attachment #8732088 - Flags: review?(Ms2ger) → review+
Comment on attachment 8732088 [details]
MozReview Request: Bug 1238435 - Use a slow-loading image in some tests that will load after a minute and a half rather than seven and a half weeks, r=Ms2ger

https://reviewboard.mozilla.org/r/40981/#review37555
(In reply to Ryan VanderMeulen [:RyanVM] from comment #61)
> https://hg.mozilla.org/mozilla-central/rev/7657ca3b8c00

Looks like this fixed it for good! https://treeherder.mozilla.org/#/jobs?repo=try&revision=0e87d90350f8

Rev 7657ca3b8c00 will need uplift to Aurora/Beta to take care of the failures there as well.
Assignee: nobody → james
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(ydelendik)
Keywords: leave-open
Resolution: --- → FIXED
Whiteboard: [e10s-orangeblockers] → [e10s-orangeblockers][checkin-needed-aurora][checkin-needed-beta]
Target Milestone: --- → mozilla48
Let's backout 246c969ea49f, then.
(In reply to :Ms2ger from comment #64)
> Let's backout 246c969ea49f, then.

Yeah, I've got it queued for the next time I'm pushing.
Whiteboard: [e10s-orangeblockers][checkin-needed-aurora][checkin-needed-beta] → [e10s-orangeblockers][checkin-needed-beta]