Closed
Bug 1238435
Opened 8 years ago
Closed 8 years ago
Intermittent e10s TEST-UNEXPECTED-TIMEOUT | /html/dom/reflection-forms.html, /html/dom/reflection-embedded.html, /html/dom/reflection-grouping.html followed by busting the whole run
Categories
(Testing :: web-platform-tests, defect)
Testing
web-platform-tests
Tracking
(e10s+, firefox46 fixed, firefox47 fixed, firefox48 fixed)
RESOLVED
FIXED
mozilla48
People
(Reporter: philor, Assigned: jgraham)
References
(Blocks 1 open bug)
Details
(Keywords: intermittent-failure, Whiteboard: [e10s-orangeblockers])
Attachments
(1 file, 4 obsolete files)
Busting the whole run isn't very nice behavior.
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=19539321
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=19541676
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=19542521
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=19547542
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=19553990
Reporter
Updated•8 years ago
Summary: Intermittent TEST-UNEXPECTED-TIMEOUT | /html/dom/reflection-forms.html, /html/dom/reflection-embedded.html followed by busting the whole run → Intermittent e10s TEST-UNEXPECTED-TIMEOUT | /html/dom/reflection-forms.html, /html/dom/reflection-embedded.html followed by busting the whole run
Reporter
Updated•8 years ago
Summary: Intermittent e10s TEST-UNEXPECTED-TIMEOUT | /html/dom/reflection-forms.html, /html/dom/reflection-embedded.html followed by busting the whole run → Intermittent e10s TEST-UNEXPECTED-TIMEOUT | /html/dom/reflection-forms.html, /html/dom/reflection-embedded.html, /html/dom/reflection-grouping.html followed by busting the whole run
Comment 1•8 years ago
Ms2ger, do you have any ideas how these tests could be busting the harness? This started after some new tests landed in a later chunk pushing these tests from chunk 4 to chunk 3.
Flags: needinfo?(Ms2ger)
Comment 3•8 years ago
Brasstacks shows it's triggered on m-i, m-c, b-i, and f-t. I assume it will trigger on try as well. Here is a try build: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8df76c0e0e3a
Reporter
Comment 4•8 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b78a4c59ad8d says yes, you can hit it on try without much problem.
Comment hidden (Intermittent Failures Robot)
Comment 6•8 years ago
NI in case you did not see comment 4. Any help is appreciated while jgraham is on PTO. Thanks!
Flags: needinfo?(Ms2ger)
Comment hidden (Intermittent Failures Robot)
Comment 8•8 years ago
It seems to me that the test is hitting an intermittent harness issue. It never really starts the test:

15:50:18 INFO - TEST-START | /html/dom/reflection-forms.html
15:50:18 INFO - Setting pref dom.forms.inputmode (true)
15:50:19 INFO - PROCESS | 2017 | ++DOCSHELL 0x7f1721e11800 == 29 [pid = 2068] [id = 520]
15:50:19 INFO - PROCESS | 2017 | ++DOMWINDOW == 40 (0x7f17239f6c00) [pid = 2068] [serial = 1604] [outer = (nil)]
15:50:19 INFO - PROCESS | 2017 | ++DOMWINDOW == 41 (0x7f17239f7c00) [pid = 2068] [serial = 1605] [outer = 0x7f17239f6c00]
15:50:19 INFO - PROCESS | 2017 | ++DOMWINDOW == 42 (0x7f17291d2c00) [pid = 2068] [serial = 1606] [outer = 0x7f17239f6c00]
15:50:19 INFO - PROCESS | 2017 | [Parent 2017] WARNING: Suboptimal indexes for the SQL statement 0x7fd54f8322d0 (http://mzl.la/1FuID0j).: file /builds/slave/m-in-l64-d-0000000000000000000/build/src/storage/mozStoragePrivateHelpers.cpp, line 114
15:50:19 INFO - PROCESS | 2017 | [Parent 2017] WARNING: Suboptimal indexes for the SQL statement 0x7fd54f8321a0 (http://mzl.la/1FuID0j).: file /builds/slave/m-in-l64-d-0000000000000000000/build/src/storage/mozStoragePrivateHelpers.cpp, line 114
15:50:20 INFO - PROCESS | 2017 | ++DOMWINDOW == 43 (0x7f17291db000) [pid = 2068] [serial = 1607] [outer = 0x7f17239f6c00]
15:50:20 INFO - PROCESS | 2017 | [Child 2068] WARNING: attempt to modify an immutable nsStandardURL: file /builds/slave/m-in-l64-d-0000000000000000000/build/src/netwerk/base/nsStandardURL.cpp, line 1302
15:50:29 INFO - PROCESS | 2017 | --DOCSHELL 0x7f17312a4000 == 28 [pid = 2068] [id = 519]
15:50:37 INFO - PROCESS | 2017 | --DOMWINDOW == 42 (0x7f17239f7c00) [pid = 2068] [serial = 1605] [outer = (nil)] [url = about:blank]
15:50:37 INFO - PROCESS | 2017 | --DOMWINDOW == 41 (0x7f17291d2c00) [pid = 2068] [serial = 1606] [outer = (nil)] [url = about:blank]
15:50:37 INFO - PROCESS | 2017 | --DOMWINDOW == 40 (0x7f17291de000) [pid = 2068] [serial = 1601] [outer = (nil)] [url = http://web-platform.test:8000/html/dom/reflection-embedded.html]
15:50:41 INFO - PROCESS | 2017 | --DOMWINDOW == 39 (0x7f1730e8dc00) [pid = 2068] [serial = 1603] [outer = (nil)] [url = about:blank]
15:53:19 INFO - PROCESS | 2017 | MARIONETTE LOG: INFO: Timeout fired
15:53:19 INFO - PROCESS | 2017 | JavaScript error: executormarionette.py, line 33: Error: Permission denied to access property "timeout"
15:53:24 INFO - TEST-UNEXPECTED-TIMEOUT | /html/dom/reflection-forms.html | expected OK
15:53:24 INFO - TEST-INFO took 185535ms
Comment 9•8 years ago
Let's temporarily disable the test on linux debug until James gets back and has time to look at it. https://treeherder.mozilla.org/#/jobs?repo=try&revision=f12ff2b1929a
Flags: needinfo?(Ms2ger)
Attachment #8707479 - Flags: review?(Ms2ger)
Updated•8 years ago
Keywords: leave-open
Comment 10•8 years ago
Comment on attachment 8707479 [details] [diff] [review]
Temporarily disable html/dom/reflection-forms.html wpt test on linux debug. r=ms2ger

This just moved the failure to another test...
Attachment #8707479 - Flags: review?(Ms2ger)
Comment 11•8 years ago
When I'm running `./mach web-platform-tests --this-chunk=3 --total-chunks=8 --e10s` locally, I'm getting crashes in all runs:

1:18.93 PROCESS_OUTPUT: ProcessReader (pid:2902) "FATAL ERROR: Non-local network connections are disabled and a connection attempt to site.example (92.242.140.2) was made."
1:18.93 PROCESS_OUTPUT: ProcessReader (pid:2902) "You should only access hostnames available via the test networking proxy (if running mochitests) or from a test-specific httpd.js server (if running xpcshell tests). Browser services should be disabled or redirected to a local server."
1:18.93 PROCESS_OUTPUT: ProcessReader (pid:2902) "Hit MOZ_CRASH(Attempting to connect to non-local address!) at /home/yury/Work/mozilla-central/netwerk/base/nsSocketTransport2.cpp:1253"

However, running via `./mach web-platform-tests testing/web-platform/tests/html/dom/reflection-embedded.html` does not result in a crash. (I disabled most of the tests in the chunk via https://pastebin.mozilla.org/8856611 for convenience.)

Ben, we probably need to disable all reflection tests, since reflection.js tests "site.example" URLs.
Comment hidden (Intermittent Failures Robot)
Comment 13•8 years ago
It's not clear to me that that's related, but feel free to update the URLs in reflection.js.
Comment 14•8 years ago
Checking without "site.example" URL tests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=57e46988740f&selectedJob=15428225
Comment 15•8 years ago
Disables tests only for DEBUG and on Linux.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab52b9bf1722
Attachment #8707479 - Attachment is obsolete: true
Attachment #8707844 - Flags: review?(Ms2ger)
Comment 16•8 years ago
Comment on attachment 8707844 [details] [diff] [review]
Disable wpt on linux that might use site.example

Review of attachment 8707844 [details] [diff] [review]:
-----------------------------------------------------------------

Please instead update reflection.js to use the value of `location.host` rather than `site.example`.
Attachment #8707844 - Flags: review?(Ms2ger) → review-
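Ms2ger's suggestion in comment 16 amounts to building the tests' URLs from the live value of `location.host` rather than from a hardcoded external hostname. A minimal sketch of the idea (the `makeTestURL` helper is invented for illustration and is not the actual reflection.js code):

```javascript
// Sketch of comment 16's suggestion. Hardcoding "site.example" sends
// requests off the test network (real DNS can resolve it); building the
// URL from location.host keeps them on the local wpt server.
function makeTestURL(host, path) {
  return "http://" + host + path;
}

// In the browser the tests run in, location.host is e.g.
// "web-platform.test:8000"; fall back to that value outside a browser.
const host = typeof location !== "undefined" ? location.host : "web-platform.test:8000";

const external = makeTestURL("site.example", "/path"); // leaves the test network
const local = makeTestURL(host, "/path");              // stays local

console.log(external); // http://site.example/path
console.log(local);    // http://web-platform.test:8000/path
```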
Updated•8 years ago
Blocks: e10s-tests
tracking-e10s:
--- → +
Comment 17•8 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5de4cf5c623c
Attachment #8707844 - Attachment is obsolete: true
Attachment #8707857 - Flags: review?(Ms2ger)
Comment 18•8 years ago
I don't think that will work, though; at my location, location.host can be resolved to some IP (due to DNS setup). Can the same problem be happening in our testing environment?
Comment 19•8 years ago
Comment on attachment 8707857 [details] [diff] [review]
Renames site.example to location.host

Review of attachment 8707857 [details] [diff] [review]:
-----------------------------------------------------------------

The *value* of `location.host`, not the literal string "location.host".
Attachment #8707857 - Flags: review?(Ms2ger) → review-
Comment 20•8 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d62c764e6baf
Attachment #8707857 - Attachment is obsolete: true
Comment 21•8 years ago
(In reply to :Ms2ger from comment #16)
> Please instead update reflection.js to use the value of `location.host` rather than `site.example`.

The tests are failing now with:

TEST-UNEXPECTED-FAIL | /html/dom/reflection-embedded.html | img.lowsrc: setAttribute() to "//web-platform.test:8000/path???@#l" followed by IDL get - assert_equals: expected "http://web-platform.test:8000/path???@#l" but got "//web-platform.test:8000/path???@#l"
Comment hidden (Intermittent Failures Robot)
Reporter
Comment 23•8 years ago
Linux64 debug wpt-e10s-3 hidden on all trunk trees and try. Give a shout when it gets better.
Updated•8 years ago
Attachment #8707873 - Attachment is obsolete: true
Comment 24•8 years ago
(In reply to Yury Delendik (:yury) from comment #21)
> (In reply to :Ms2ger from comment #16)
> > Please instead update reflection.js to use the value of `location.host` rather than `site.example`.
>
> The tests are failing now with:
>
> TEST-UNEXPECTED-FAIL | /html/dom/reflection-embedded.html | img.lowsrc: setAttribute() to "//web-platform.test:8000/path???@#l" followed by IDL get - assert_equals: expected "http://web-platform.test:8000/path???@#l" but got "//web-platform.test:8000/path???@#l"

You'll need to update `testing/web-platform/meta/html/dom/reflection-embedded.html.ini` where it mentions site.example.
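The assertion in that failure is about URL resolution: a scheme-relative value like `//web-platform.test:8000/path`, once set on a reflected URL attribute, is expected to read back resolved against the document's base, per WHATWG URL semantics. A quick illustration of the expected resolution (the tests exercise this through element attributes, not `new URL()`):

```javascript
// Resolving a scheme-relative URL against the page's base, which is what
// the IDL getter is expected to return in the failing assertion above.
const base = "http://web-platform.test:8000/html/dom/reflection-embedded.html";
const resolved = new URL("//web-platform.test:8000/path", base).href;
console.log(resolved); // http://web-platform.test:8000/path
```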
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 26•8 years ago
So to be clear that I understand what's going on here:

* After adding the streams tests, something (possibly unrelated to the streams tests, possibly related to chunking) happened that causes W3/linux debug (only) to fail after restart.
* This bug contains a patch to change the reflection tests to not inadvertently access an external server, but there is no particular evidence that's related to the intermittent failure, except that a crash in those tests will force a restart, which will then cause the above bug to manifest itself.
* As a result we are currently hiding W3 on linux debug.

Is that right? Just making sure I didn't miss anything.
Comment 27•8 years ago
I believe that's correct. Any help figuring this out would be appreciated!
Comment hidden (Intermittent Failures Robot)
Reporter
Comment 29•8 years ago
And hidden on mozilla-aurora.
Comment 30•8 years ago
If we're starting to see OrangeFactor reports of this issue again, it's because on TC we're running with 12 chunks instead of 8 and the issue has moved to chunk 4:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=tc%20W%284%29&group_state=expanded&exclusion_profile=false
It seems that changing the chunking is causing new issues to pop up. I will go back to a chunk size of 8 and see if the issue goes away.
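The chunk-shuffling effect described here is easy to picture: under any deterministic assignment of tests to chunks, the chunk a given test lands in depends on the total chunk count, so moving between 8 and 12 chunks regroups the tests. A toy round-robin sketch (purely illustrative; wptrunner's real chunker is more elaborate):

```javascript
// Toy round-robin chunk assignment: the same test index lands in a
// different chunk when the total number of chunks changes.
function chunkOf(testIndex, totalChunks) {
  return (testIndex % totalChunks) + 1; // 1-based chunk number
}

console.log(chunkOf(20, 8));  // 5 (with 8 chunks)
console.log(chunkOf(20, 12)); // 9 (with 12 chunks)
```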
Reporter
Comment 31•8 years ago
Latest wpt update moved this into chunk 5 on TC, and I'm not in the mood to play merge-and-tree whack-a-mole, so I've just hidden both 4 and 5.
Severity: major → critical
Comment 32•8 years ago
Why do we do this? https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/harness/wptrunner/wptrunner.py#235

I tried running the test manually and I'm dropping directly into pdb:

/home/worker/workspace/build/venv/bin/python -u /home/worker/workspace/build/tests/web-platform/runtests.py \
  --log-raw=- \
  --log-raw=/home/worker/workspace/build/blobber_upload_dir/wpt_raw.log \
  --log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log \
  --binary=/home/worker/workspace/build/application/firefox/firefox \
  --symbols-path=https://queue.taskcluster.net/v1/task/arQ-ntrqTT-BbuWRFKGkuQ/artifacts/public/build/target.crashreporter-symbols.zip \
  --stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk \
  --test-type=testharness \
  --total-chunks=12 --this-chunk=5 \
  --prefs-root=/home/worker/workspace/build/tests/web-platform/prefs \
  --processes=1 \
  --config=/home/worker/workspace/build/tests/web-platform/wptrunner.ini \
  --ca-cert-path=/home/worker/workspace/build/tests/web-platform/certs/cacert.pem \
  --host-key-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.key \
  --host-cert-path=/home/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem \
  --certutil-binary=/home/worker/workspace/build/tests/bin/certutil \
  html/dom/reflection-forms.html
Comment 33•8 years ago
The issue was that I was using it in combination with --this-chunk/--total-chunks.
Assignee
Comment 34•8 years ago
Right, that is a bug; it should either be removed or put behind some kind of flag.
Comment 35•8 years ago
What ended up landing in here? I don't see any recent changes to this file:
https://hg.mozilla.org/mozilla-central/filelog/584870f1cbc5d060a57e147ce249f736956e2b62/testing/web-platform/meta/html/dom/reflection-forms.html.ini
It seems that it disappeared after the 15th:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1238435&startday=2016-01-11&endday=2016-01-22&tree=all
Even though I still face it for the TaskCluster tasks.
Reporter
Comment 36•8 years ago
That's the wrong "what happened?" to be looking at; there you are looking at when I hid it, so it stopped being starred. The one to look at is January 30th, when a w-p-t update from bug 1244461 moved things around so that in the buildbot 8-chunk run it moved back into a chunk where it was happy, but in your 12-chunk run it did not.
Comment 37•8 years ago
Thanks philor! It seems I never moved it back to 8 chunks (I thought I had).
Comment 38•8 years ago
Going back to 8 chunks for TaskCluster has cleared this issue. For now, this bug is a ticking bomb. Could someone investigate this? You can make this issue happen on try with different chunking. You can probably reproduce this issue locally inside of docker. You can also use --interactive in your push to see it happen live on the production machines inside of docker.
Comment 39•8 years ago
:jgraham, I was under the impression we ran web-platform-tests in --run-by-dir mode (where a fresh browser is used for each specific directory). Assuming that, are we splitting a directory up when changing chunks? Can you help us figure out what the root cause might be, so this bug can turn into something actionable and we can start playing with more chunks?
Flags: needinfo?(james)
Assignee
Comment 40•8 years ago
We aren't using --run-by-dir in production at the moment for wpt, although the harness has support. It's hard to tell what's going on here without inspecting the state when the tests fail. The TEST-UNEXPECTED-TIMEOUT might just be that the test is long-running and hits the limit of even our long timeout; if that's the case, it's possible to split up the tests further or something. The fact that sometimes the run is then busted is more troublesome, because it means that either there's a bug in the harness or the server is getting into a bad state for some reason.
Flags: needinfo?(james)
Comment 41•8 years ago
OK, that is good to know. In fact, this could be related to docker running the tests roughly 20% slower than buildbot; that is the case with many of our tests, where we hit timeouts we were not hitting before.
Comment 42•8 years ago
jmaher: remember that this bug was originally filed for Buildbot rather than TaskCluster. As philor pointed out in comment 36, some tests moved from one chunk to another and things started working again. In other words, just trying to clarify that TC/docker is slower, but Buildbot was hitting this as well.
Comment 43•8 years ago
It seems that going from 12 chunks to 8 is causing the wpt jobs to retry as if they were infra issues (bug 1246176):
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=tc-W%20%283&group_state=expanded&fromchange=648e852030e3&exclusion_profile=false
Reporter
Comment 44•8 years ago
Getting quite old. Modified my previous exclusion to just buildbot's wpt-3 on debug linux64 for Gecko 46, since it's going to ride that train into 46's grave, and created a new one for taskcluster's wpt-3, since I'm just ignoring every result from it.
Reporter
Comment 45•8 years ago
And we're back to failing on trunk with buildbot's wpt-3.
Comment hidden (Intermittent Failures Robot)
Comment 47•8 years ago
This is probably the top wpt intermittent at the moment, and it appears to be harness-related, IIUC. Is there a better component for it to be in otherwise?
Flags: needinfo?(Ms2ger)
Assignee
Comment 48•8 years ago
I don't think there's a better component. The fundamental problem here is that I haven't got a good way to reproduce the issue, so it's hard to tell exactly what's going on. Disabling these tests might help a bit but it wouldn't fix the underlying issue, so we would just have an even harder to reproduce serious intermittent.
Flags: needinfo?(Ms2ger)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 52•8 years ago
:jgraham, this is a linux64-only error; is this on buildbot or taskcluster? This has 79 instances in the last week, which is a lot given that the trees were closed often. Assuming you cannot reproduce this or don't have other ideas, can you please disable these tests ASAP?
Flags: needinfo?(james)
Assignee
Comment 54•8 years ago
Done, but I can't guarantee that the problem won't manifest itself in some other way, since we still don't understand it at all.
Comment 55•8 years ago
Given https://treeherder.mozilla.org/logviewer.html#?job_id=155462&repo=ash, I don't have high hopes of disabling our way to victory here.
Comment 56•8 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/246c969ea49f
Updated•8 years ago
Whiteboard: [e10s-orangeblockers]
Assignee
Comment 57•8 years ago
I now have a more plausible theory for what's going on here. It seems some other tests in the chunk were trying to send a 4.5 MB image one byte at a time, with a pause between each byte. Not only was this very silly, it also exposed the fact that the trickle function in wpt is hilariously inefficient and (in this case) allocates multiple objects, including a callback function, per byte of input.

So I suspect what happens is that these tests run, the server's memory usage balloons, and it is later killed by the OOM killer (or is just unable to serve requests because it's so busy trying to GC all those objects). So I will try fixing the tests and wptserve and see if that helps in this case.
Flags: needinfo?(james)
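The "seven and a half weeks" in the eventual patch title falls straight out of this arithmetic, assuming roughly one second of trickle delay per byte (that per-byte delay is an assumed figure for illustration):

```javascript
// Back-of-the-envelope time to trickle a ~4.5 MB image one byte at a time.
const imageBytes = 4.5 * 1000 * 1000; // ~4.5 MB (assumed size)
const pausePerByte = 1;               // seconds between bytes (assumed)
const totalSeconds = imageBytes * pausePerByte;
const weeks = totalSeconds / (7 * 24 * 3600);
console.log(weeks.toFixed(1) + " weeks"); // 7.4 weeks
```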
Assignee
Comment 58•8 years ago
Review commit: https://reviewboard.mozilla.org/r/40981/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/40981/
Attachment #8732088 - Flags: review?(Ms2ger)
Updated•8 years ago
Attachment #8732088 - Flags: review?(Ms2ger) → review+
Comment 59•8 years ago
Comment on attachment 8732088 [details]
MozReview Request: Bug 1238435 - Use a slow-loading image in some tests that will load after a minute and a half rather than seven and a half weeks, r=Ms2ger

https://reviewboard.mozilla.org/r/40981/#review37555
Comment 61•8 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/7657ca3b8c00
Comment hidden (Intermittent Failures Robot)
Comment 63•8 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #61)
> https://hg.mozilla.org/mozilla-central/rev/7657ca3b8c00

Looks like this fixed it for good! https://treeherder.mozilla.org/#/jobs?repo=try&revision=0e87d90350f8

Rev 7657ca3b8c00 will need uplift to Aurora/Beta to take care of the failures there as well.
Assignee: nobody → james
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox47:
--- → affected
status-firefox48:
--- → fixed
Flags: needinfo?(ydelendik)
Keywords: leave-open
Resolution: --- → FIXED
Whiteboard: [e10s-orangeblockers] → [e10s-orangeblockers][checkin-needed-aurora][checkin-needed-beta]
Target Milestone: --- → mozilla48
Comment 64•8 years ago
Let's backout 246c969ea49f, then.
Comment 65•8 years ago
(In reply to :Ms2ger from comment #64)
> Let's backout 246c969ea49f, then.

Yeah, I've got it queued for the next time I'm pushing.
Comment 66•8 years ago
https://hg.mozilla.org/releases/mozilla-aurora/rev/d28c43c1f217
Whiteboard: [e10s-orangeblockers][checkin-needed-aurora][checkin-needed-beta] → [e10s-orangeblockers][checkin-needed-beta]
Comment 67•8 years ago
Backout: https://hg.mozilla.org/integration/mozilla-inbound/rev/1e994715d314
Comment 68•8 years ago
https://hg.mozilla.org/releases/mozilla-aurora/rev/7fad9b1e32fd
https://hg.mozilla.org/releases/mozilla-beta/rev/b997674eb546
Whiteboard: [e10s-orangeblockers][checkin-needed-beta] → [e10s-orangeblockers]