Intermittent dom/crypto/test/test_WebCrypto_Workers.html | Test timed out.

RESOLVED INCOMPLETE

Status

()

Core
DOM: Security
P3
normal
RESOLVED INCOMPLETE
4 months ago
a month ago

People

(Reporter: Treeherder Bug Filer, Unassigned)

Tracking

({bulk-close-intermittents, intermittent-failure})

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [domsecurity-intermittent][stockwell unknown])

(Reporter)

Description

4 months ago
treeherder
Filed by: philringnalda [at] gmail.com

https://treeherder.mozilla.org/logviewer.html#?job_id=107017145&repo=mozilla-inbound

https://queue.taskcluster.net/v1/task/eMbsJ-rGRoS_6C7heHLYMg/runs/0/artifacts/public/logs/live_backing.log
Priority: -- → P3
Whiteboard: [domsecurity-intermittent]

Comment 1

4 months ago
8 failures in 814 pushes (0.01 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 5
* autoland: 3

Platform breakdown:
* linux32: 5
* linux64: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1373107&startday=2017-06-12&endday=2017-06-18&tree=all
this is a linux debug failure- which started June 14th and on the 16th/17th the failure rate increased and has been fairly consistent:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1373107&startday=2017-06-12&endday=2017-06-23&tree=all

here is a recent linux64 log:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=109432872&lineNumber=4046

an related text:
[task 2017-06-23T02:08:04.344963Z] 02:08:04     INFO - TEST-START | dom/crypto/test/test_WebCrypto_Workers.html
[task 2017-06-23T02:08:04.609558Z] 02:08:04     INFO - GECKO(1636) | ++DOMWINDOW == 20 (0xd9069800) [pid = 1677] [serial = 36] [outer = 0xdaedc000]
[task 2017-06-23T02:08:06.786626Z] 02:08:06     INFO - GECKO(1636) | --DOMWINDOW == 19 (0xd701f800) [pid = 1677] [serial = 30] [outer = (nil)] [url = http://mochi.test:8888/tests/dom/crypto/test/test_WebCrypto_RSA_PSS.html]
[task 2017-06-23T02:08:06.788955Z] 02:08:06     INFO - GECKO(1636) | --DOMWINDOW == 18 (0xd7041c00) [pid = 1677] [serial = 24] [outer = (nil)] [url = http://mochi.test:8888/tests/dom/crypto/test/test_WebCrypto_Normalize.html]
[task 2017-06-23T02:08:06.790412Z] 02:08:06     INFO - GECKO(1636) | --DOMWINDOW == 17 (0xd9ea3400) [pid = 1677] [serial = 22] [outer = (nil)] [url = http://mochi.test:8888/tests/dom/crypto/test/test_WebCrypto_JWK.html]
[task 2017-06-23T02:08:06.792926Z] 02:08:06     INFO - GECKO(1636) | --DOMWINDOW == 16 (0xd7242000) [pid = 1677] [serial = 32] [outer = (nil)] [url = http://mochi.test:8888/tests/dom/crypto/test/test_WebCrypto_Reject_Generating_Keys_Without_Usages.html]
[task 2017-06-23T02:08:06.797010Z] 02:08:06     INFO - GECKO(1636) | --DOMWINDOW == 15 (0xd701bc00) [pid = 1677] [serial = 28] [outer = (nil)] [url = http://mochi.test:8888/tests/dom/crypto/test/test_WebCrypto_RSA_OAEP.html]
[task 2017-06-23T02:08:06.801651Z] 02:08:06     INFO - GECKO(1636) | --DOMWINDOW == 14 (0xd7026c00) [pid = 1677] [serial = 31] [outer = (nil)] [url = http://mochi.test:8888/tests/SimpleTest/iframe-between-tests.html]
[task 2017-06-23T02:08:06.809054Z] 02:08:06     INFO - GECKO(1636) | --DOMWINDOW == 13 (0xd703c400) [pid = 1677] [serial = 20] [outer = (nil)] [url = http://mochi.test:8888/tests/dom/crypto/test/test_WebCrypto_Import_Multiple_Identical_Keys.html]
[task 2017-06-23T02:08:06.811307Z] 02:08:06     INFO - GECKO(1636) | --DOMWINDOW == 12 (0xd9ed4400) [pid = 1677] [serial = 16] [outer = (nil)] [url = http://mochi.test:8888/tests/dom/crypto/test/test_WebCrypto_ECDSA.html]
[task 2017-06-23T02:08:06.818212Z] 02:08:06     INFO - GECKO(1636) | --DOMWINDOW == 11 (0xd7249400) [pid = 1677] [serial = 26] [outer = (nil)] [url = http://mochi.test:8888/tests/dom/crypto/test/test_WebCrypto_PBKDF2.html]
[task 2017-06-23T02:08:06.822956Z] 02:08:06     INFO - GECKO(1636) | --DOMWINDOW == 10 (0xd9e1d000) [pid = 1677] [serial = 14] [outer = (nil)] [url = http://mochi.test:8888/tests/dom/crypto/test/test_WebCrypto_ECDH.html]
[task 2017-06-23T02:08:06.825758Z] 02:08:06     INFO - GECKO(1636) | --DOMWINDOW == 9 (0xe8f27c00) [pid = 1677] [serial = 8] [outer = (nil)] [url = http://mochi.test:8888/tests/dom/crypto/test/test_WebCrypto.html]
[task 2017-06-23T02:08:06.829537Z] 02:08:06     INFO - GECKO(1636) | --DOMWINDOW == 8 (0xd9ee8000) [pid = 1677] [serial = 12] [outer = (nil)] [url = http://mochi.test:8888/tests/dom/crypto/test/test_WebCrypto_DH.html]
[task 2017-06-23T02:08:06.831982Z] 02:08:06     INFO - GECKO(1636) | --DOMWINDOW == 7 (0xd9ea6c00) [pid = 1677] [serial = 18] [outer = (nil)] [url = http://mochi.test:8888/tests/dom/crypto/test/test_WebCrypto_HKDF.html]
[task 2017-06-23T02:08:08.043324Z] 02:08:08     INFO - GECKO(1636) | [Child 1677] WARNING: Unable to find interface object on global: file /home/worker/workspace/build/src/dom/base/nsDOMClassInfo.cpp, line 1807
[task 2017-06-23T02:08:09.817066Z] 02:08:09     INFO - GECKO(1636) | MEMORY STAT | vsize 707MB | residentFast 98MB | heapAllocated 19MB
[task 2017-06-23T02:08:10.359171Z] 02:08:10     INFO - TEST-OK | dom/crypto/test/test_WebCrypto_Workers.html | took 6005ms
[task 2017-06-23T02:08:12.457565Z] 02:08:12     INFO - GECKO(1636) | --DOMWINDOW == 6 (0xd7041800) [pid = 1677] [serial = 33] [outer = (nil)] [url = http://mochi.test:8888/tests/SimpleTest/iframe-between-tests.html]
[task 2017-06-23T02:08:21.316946Z] 02:08:21     INFO - GECKO(1636) | --DOMWINDOW == 5 (0xd724f000) [pid = 1677] [serial = 35] [outer = (nil)] [url = http://mochi.test:8888/tests/SimpleTest/iframe-between-tests.html]
[task 2017-06-23T02:08:21.321160Z] 02:08:21     INFO - GECKO(1636) | --DOMWINDOW == 4 (0xd7245c00) [pid = 1677] [serial = 34] [outer = (nil)] [url = http://mochi.test:8888/tests/dom/crypto/test/test_WebCrypto_Structured_Cloning.html]
[task 2017-06-23T02:12:59.271678Z] 02:12:59     INFO - GECKO(1636) | [Parent 1636] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80070057: file /home/worker/workspace/build/src/netwerk/base/nsChannelClassifier.cpp, line 316
[task 2017-06-23T02:16:59.326236Z] 02:16:59     INFO - GECKO(1636) | [Parent 1636] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80070057: file /home/worker/workspace/build/src/netwerk/base/nsChannelClassifier.cpp, line 316
[task 2017-06-23T02:16:59.345594Z] 02:16:59     INFO - GECKO(1636) | [Parent 1636] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80070057: file /home/worker/workspace/build/src/netwerk/base/nsChannelClassifier.cpp, line 316
[task 2017-06-23T02:18:24.996084Z] 02:18:24     INFO - TEST-INFO | started process screentopng
[task 2017-06-23T02:18:25.605053Z] 02:18:25     INFO - TEST-INFO | screentopng: exit 0
[task 2017-06-23T02:18:25.614986Z] 02:18:25     INFO - TEST-UNEXPECTED-FAIL | dom/crypto/test/test_WebCrypto_Workers.html | Test timed out. 
[task 2017-06-23T02:18:25.615368Z] 02:18:25     INFO -     reportError@SimpleTest/TestRunner.js:121:7
[task 2017-06-23T02:18:25.615678Z] 02:18:25     INFO -     TestRunner._checkForHangs@SimpleTest/TestRunner.js:142:7
[task 2017-06-23T02:18:25.617029Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.620088Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.623350Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.625330Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.627228Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.629397Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.631383Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.633353Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.637250Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.639222Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.641271Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.643185Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.645113Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.647033Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.649124Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.651079Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.653802Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.655888Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.657765Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.659601Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.661428Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.663228Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.665073Z] 02:18:25     INFO -     setTimeout handler*TestRunner._checkForHangs@SimpleTest/TestRunner.js:163:5
[task 2017-06-23T02:18:25.666913Z] 02:18:25     INFO -     TestRunner.runTests@SimpleTest/TestRunner.js:380:5
[task 2017-06-23T02:18:25.670526Z] 02:18:25     INFO -     RunSet.runtests@SimpleTest/setup.js:194:3
[task 2017-06-23T02:18:25.672827Z] 02:18:25     INFO -     RunSet.runall@SimpleTest/setup.js:173:5
[task 2017-06-23T02:18:25.674755Z] 02:18:25     INFO -     hookupTests@SimpleTest/setup.js:266:5
[task 2017-06-23T02:18:25.676572Z] 02:18:25     INFO - parseTestManifest@http://mochi.test:8888/manifestLibrary.js:36:5
[task 2017-06-23T02:18:25.678498Z] 02:18:25     INFO - getTestManifest/req.onload@http://mochi.test:8888/manifestLibrary.js:49:11
[task 2017-06-23T02:18:25.680365Z] 02:18:25     INFO - EventHandlerNonNull*getTestManifest@http://mochi.test:8888/manifestLibrary.js:45:3
[task 2017-06-23T02:18:25.682173Z] 02:18:25     INFO -     hookup@SimpleTest/setup.js:246:5
[task 2017-06-23T02:18:25.684134Z] 02:18:25     INFO - EventHandlerNonNull*@http://mochi.test:8888/tests?autorun=1&closeWhenDone=1&consoleLevel=INFO&manifestFile=tests.json&dumpOutputDirectory=%2Ftmp&cleanupCrashes=true:11:1
[task 2017-06-23T02:18:25.996258Z] 02:18:25    ERROR - [SimpleTest.finish()] this test already called finish!
[task 2017-06-23T02:18:26.016026Z] 02:18:26     INFO - TEST-UNEXPECTED-ERROR | dom/crypto/test/test_WebCrypto_Workers.html | called finish() multiple times
[task 2017-0


in fact we get a lot more info in the log, it appears we call finish many times.  This sounds like we are receiving events or stuck in a loop where we finsish but never clean up the handler.

:ckerschb, I see you are the triage owner for the dom:security component, could you find someone from the dom:security team to look into this in the next couple weeks?
Flags: needinfo?(ckerschb)
Whiteboard: [domsecurity-intermittent] → [domsecurity-intermittent][stockwell needswork]
as a note, this is run at a reduced frequency via SETA, so on high load every 5th push, overall I think we average out to every 4th push.  So 40 failures in 880 pushes, looks like 1 failure/22 pushes, but if we run infrequently this is more like 1 failure for every 5 or 6 pushes.
I am trying to retrigger to see if there is a smaller range of commits that caused this:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=linux%20debug%20mochitest-e10s-3&tochange=1b084196d6c8f89152212a6d3a2d9bc694c3391d&fromchange=5a73c211f58d6dca5b6938970311111aeb8c6692
the retriggers didn't find the root cause, going back another 20 changesets:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=linux%20debug%20mochitest-e10s-3&tochange=1b084196d6c8f89152212a6d3a2d9bc694c3391d&fromchange=11fc0da380a2f3fcdee7f557aef97fe031646844&selectedJob=106796851

I suspect we might have to go back further.
(In reply to Joel Maher ( :jmaher) from comment #2)
> :ckerschb, I see you are the triage owner for the dom:security component,
> could you find someone from the dom:security team to look into this in the
> next couple weeks?

Tim, any chance you could take that on?
Flags: needinfo?(ckerschb) → needinfo?(ttaubert)

Comment 7

4 months ago
31 failures in 892 pushes (0.035 failures/push) were associated with this bug in the last 7 days.   

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. ** 

Repository breakdown:
* mozilla-inbound: 12
* mozilla-central: 8
* autoland: 7
* try: 3
* pine: 1

Platform breakdown:
* linux32: 16
* linux64: 14
* windows7-32-vm: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1373107&startday=2017-06-19&endday=2017-06-25&tree=all
This looks weird. It reports TEST-OK for test_WebCrypto_Workers.html, so that test has finished, but then the suite just hangs for 10 minutes.
Flags: needinfo?(ttaubert)

Comment 9

4 months ago
26 failures in 718 pushes (0.036 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 13
* autoland: 9
* mozilla-central: 3
* try: 1

Platform breakdown:
* linux32: 25
* windows7-32-vm: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1373107&startday=2017-06-26&endday=2017-07-02&tree=all

Comment 10

3 months ago
23 failures in 656 pushes (0.035 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 10
* autoland: 6
* mozilla-central: 4
* try: 2
* pine: 1

Platform breakdown:
* linux32: 17
* windows7-32-vm: 3
* linux64: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1373107&startday=2017-07-03&endday=2017-07-09&tree=all
:ttaubert, any chance you are making progress on this bug?
Flags: needinfo?(ttaubert)

Comment 12

3 months ago
22 failures in 720 pushes (0.031 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 7
* autoland: 6
* try: 4
* mozilla-central: 2
* oak: 1
* mozilla-beta: 1
* cedar: 1

Platform breakdown:
* linux32: 18
* linux64: 2
* windows8-64: 1
* windows7-32-vm: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1373107&startday=2017-07-10&endday=2017-07-16&tree=all
Looks like forced termination of workers could be a problem here. The last message printed before we time out is:

> TEST-PASS | dom/crypto/test/test_WebCrypto_Workers.html | Web crypto in terminating Worker

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-beta&job_id=113339254&lineNumber=2804

That test was added in bug 1281874. Maybe Ben has an idea what's going wrong here? Maybe a deadlock of some sort?
Flags: needinfo?(ttaubert) → needinfo?(bkelly)
It looks like I can easily reproduce this locally by just running:

./mach mochitest dom/crypto/test/test_WebCrypto_Workers.html
Well, any of the dom/crypto tests seem to time out after completion. Not sure if that's an issue with the test runner/suite itself. The mochitest runner is killed after 330s of inactivity.

Comment 16

3 months ago
(In reply to Tim Taubert [:ttaubert] from comment #13)
> Looks like forced termination of workers could be a problem here.

The whole point of the test was to validate webcrypto handles worker shutdown correctly.  In that past it just caused UAF which was quite bad.

I'm sorry, but someone familiar with webcrypto needs to see where the test is getting stuck.  What async step is not firing, etc.  Then we can try to diagnose how this can happen in the worker shutdown code.

For example, during worker shutdown WorkerRunnable::Dispatch() can fail.  In cases where it needs to be infallible code should fall back to dispatching a control runnable.
Flags: needinfo?(bkelly)
(In reply to Tim Taubert [:ttaubert] from comment #15)
> Well, any of the dom/crypto tests seem to time out after completion. Not
> sure if that's an issue with the test runner/suite itself. The mochitest
> runner is killed after 330s of inactivity.

I mistakenly thought that the process would quit when running single tests, looks like it doesn't. I can't reproduce any of this on my machine, unfortunately.

Comment 18

3 months ago
Can you add debug to see if this fails in automation when the test times out?

http://searchfox.org/mozilla-central/source/dom/crypto/WebCryptoTask.cpp#422
Flags: needinfo?(ttaubert)

Comment 19

3 months ago
It would also be useful to know if Cancel() is being called:

http://searchfox.org/mozilla-central/source/dom/crypto/WebCryptoTask.cpp#443

Comment 20

3 months ago
Also note, we should probably change this:

http://searchfox.org/mozilla-central/source/dom/crypto/WebCryptoTask.cpp#387

To use nsIGlobalObject::EventTargetFor() so the runnables are properly labeled on main thread.  This should work for Workers as well as of today's inbound.

Comment 21

3 months ago
13 failures in 822 pushes (0.016 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* pine: 7
* autoland: 4
* mozilla-inbound: 1
* mozilla-central: 1

Platform breakdown:
* linux32: 11
* linux64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1373107&startday=2017-07-17&endday=2017-07-23&tree=all

Comment 22

3 months ago
12 failures in 1008 pushes (0.012 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 5
* autoland: 4
* mozilla-central: 2
* try: 1

Platform breakdown:
* linux32: 5
* windows7-32: 3
* linux64: 2
* windows8-64: 1
* windows7-32-vm: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1373107&startday=2017-07-24&endday=2017-07-30&tree=all
(In reply to Ben Kelly [:bkelly] from comment #18)
> Can you add debug to see if this fails in automation when the test times out?
> 
> http://searchfox.org/mozilla-central/source/dom/crypto/WebCryptoTask.cpp#422

I unfortunately don't think I'll have time to do this in the next weeks, or probably until 57. I agree that the best way forward here might be to add a bunch of debug info and either just land them or push to try and retrigger until we hit it.
Flags: needinfo?(ttaubert)

Comment 24

3 months ago
I was actually hoping this would be fixed by bug 1340941, but it seems we've had a couple triggers since that landed.

Updated

3 months ago
Whiteboard: [domsecurity-intermittent][stockwell needswork] → [domsecurity-intermittent][stockwell unknown]
Status: NEW → RESOLVED
Last Resolved: a month ago
Keywords: bulk-close-intermittents
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.