browser_blocking* test cause ENOMEM worker crashes on Linux asan, workers keep on retrying until it turns into exception
Categories
(Testing :: Mochitest, defect, P2)
Tracking
(Not tracked)
People
(Reporter: bogdan_tara, Unassigned)
References
Details
Attachments
(4 files, 3 obsolete files)
Updated•5 years ago
|
Reporter | ||
Comment 1•5 years ago
|
||
Ehsan this seems to start when bug 1570802 landed. Could you, please, take a look?
Comment 2•5 years ago
|
||
I had a look, and I'm not sure what to do with this. The jobs that turn into exception don't seem to produce a log (at least as far as I can tell). They only have a "live.log" log but when I try to view those link they all time out for me, so I can't actually get an idea of what's running during this test and what happens that causes it to not finish.
Looking at the Linux64 asan bc5 e10s test on the prior push I see that it doesn't include the antitracking tests which makes this a very weird coincidence, since I would have expected that if this is something related to bug 1570802, then antitracking tests would run within this suite.
Is there any way to get a log from these test runs?
Reporter | ||
Comment 3•5 years ago
|
||
Ehsan, thank you for looking.
I found logs for "retry" runs, eg. :
https://taskcluster-artifacts.net/XECr5BNLTQiXHC_56h_21Q/4/public/logs/live_backing.log
Is this expected? [taskcluster:error] Task has been aborted prematurely. Reason: worker-shutdown
[task 2019-08-08T12:44:30.446Z] 12:44:30 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_URLBarSetURI.js
[task 2019-08-08T12:44:34.488Z] 12:44:34 INFO - GECKO(2468) | MEMORY STAT vsizeMaxContiguous not supported in this build configuration.
[task 2019-08-08T12:44:34.489Z] 12:44:34 INFO - GECKO(2468) | MEMORY STAT heapAllocated not supported in this build configuration.
[task 2019-08-08T12:44:34.491Z] 12:44:34 INFO - GECKO(2468) | MEMORY STAT | vsize 20975057MB | residentFast 1108MB
[task 2019-08-08T12:44:34.493Z] 12:44:34 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_URLBarSetURI.js | took 4050ms
[task 2019-08-08T12:44:34.529Z] 12:44:34 INFO - checking window state
[task 2019-08-08T12:44:34.550Z] 12:44:34 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:34.566Z] 12:44:34 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_UrlbarInput_formatValue.js
[task 2019-08-08T12:44:50.880Z] 12:44:50 INFO - GECKO(2468) | MEMORY STAT | vsize 20974981MB | residentFast 1286MB
[task 2019-08-08T12:44:50.881Z] 12:44:50 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_UrlbarInput_formatValue.js | took 16318ms
[task 2019-08-08T12:44:50.923Z] 12:44:50 INFO - checking window state
[task 2019-08-08T12:44:50.923Z] 12:44:50 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:50.958Z] 12:44:50 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_UrlbarInput_hiddenFocus.js
[task 2019-08-08T12:44:51.567Z] 12:44:51 INFO - GECKO(2468) | MEMORY STAT | vsize 20975084MB | residentFast 1331MB
[task 2019-08-08T12:44:51.568Z] 12:44:51 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_UrlbarInput_hiddenFocus.js | took 618ms
[task 2019-08-08T12:44:51.609Z] 12:44:51 INFO - checking window state
[task 2019-08-08T12:44:51.629Z] 12:44:51 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:51.658Z] 12:44:51 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_UrlbarInput_overflow.js
[task 2019-08-08T12:44:53.771Z] 12:44:53 INFO - GECKO(2468) | MEMORY STAT | vsize 20975116MB | residentFast 1373MB
[task 2019-08-08T12:44:53.773Z] 12:44:53 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_UrlbarInput_overflow.js | took 2118ms
[task 2019-08-08T12:44:53.814Z] 12:44:53 INFO - checking window state
[task 2019-08-08T12:44:53.816Z] 12:44:53 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:53.825Z] 12:44:53 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_UrlbarInput_overflow_resize.js
[task 2019-08-08T12:44:56.009Z] 12:44:56 INFO - GECKO(2468) | MEMORY STAT | vsize 20975128MB | residentFast 1209MB
[task 2019-08-08T12:44:56.012Z] 12:44:56 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_UrlbarInput_overflow_resize.js | took 2186ms
[task 2019-08-08T12:44:56.054Z] 12:44:56 INFO - checking window state
[task 2019-08-08T12:44:56.055Z] 12:44:56 INFO - GECKO(2468) | must wait for focus
[task 2019-08-08T12:44:56.071Z] 12:44:56 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:56.075Z] 12:44:56 INFO - Console message: [JavaScript Error: "uncaught exception: Object" {file: "resource://testing-common/PromiseTestUtils.jsm" line: 112}]
[task 2019-08-08T12:44:56.092Z] 12:44:56 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_UrlbarInput_tooltip.js
[task 2019-08-08T12:44:56.696Z] 12:44:56 INFO - GECKO(2468) | MEMORY STAT | vsize 20975124MB | residentFast 1209MB
[task 2019-08-08T12:44:56.696Z] 12:44:56 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_UrlbarInput_tooltip.js | took 609ms
[task 2019-08-08T12:44:56.734Z] 12:44:56 INFO - checking window state
[task 2019-08-08T12:44:56.750Z] 12:44:56 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:56.754Z] 12:44:56 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_UrlbarInput_trimURLs.js
[task 2019-08-08T12:44:57.341Z] 12:44:57 INFO - GECKO(2468) | MEMORY STAT | vsize 20975144MB | residentFast 1241MB
[task 2019-08-08T12:44:57.342Z] 12:44:57 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_UrlbarInput_trimURLs.js | took 586ms
[task 2019-08-08T12:44:57.382Z] 12:44:57 INFO - checking window state
[task 2019-08-08T12:44:57.403Z] 12:44:57 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:57.411Z] 12:44:57 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_UrlbarLoadRace.js
[task 2019-08-08T12:44:58.661Z] 12:44:58 INFO - GECKO(2468) | JavaScript error: resource://gre/modules/sessionstore/Utils.jsm, line 27: TypeError: aStream.available is not a function
[task 2019-08-08T12:44:58.766Z] 12:44:58 INFO - GECKO(2468) | MEMORY STAT | vsize 20975171MB | residentFast 1292MB
[task 2019-08-08T12:44:58.766Z] 12:44:58 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_UrlbarLoadRace.js | took 1355ms
[task 2019-08-08T12:44:58.793Z] 12:44:58 INFO - checking window state
[task 2019-08-08T12:44:58.814Z] 12:44:58 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:58.814Z] 12:44:58 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_action_searchengine.js
[task 2019-08-08T12:44:59.637Z] 12:44:59 INFO - GECKO(2468) | MEMORY STAT | vsize 20975183MB | residentFast 1313MB
[task 2019-08-08T12:44:59.638Z] 12:44:59 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_action_searchengine.js | took 819ms
[task 2019-08-08T12:44:59.666Z] 12:44:59 INFO - checking window state
[task 2019-08-08T12:44:59.687Z] 12:44:59 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:59.691Z] 12:44:59 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_action_searchengine_alias.js
[task 2019-08-08T12:45:00.512Z] 12:45:00 INFO - GECKO(2468) | MEMORY STAT | vsize 20975189MB | residentFast 1326MB
[task 2019-08-08T12:45:00.513Z] 12:45:00 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_action_searchengine_alias.js | took 816ms
[task 2019-08-08T12:45:00.552Z] 12:45:00 INFO - checking window state
[task 2019-08-08T12:45:00.573Z] 12:45:00 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:45:00.590Z] 12:45:00 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_autoFill_backspaced.js
[task 2019-08-08T12:45:04.079Z] 12:45:04 INFO - GECKO(2468) | MEMORY STAT | vsize 20975159MB | residentFast 1272MB
[task 2019-08-08T12:45:04.079Z] 12:45:04 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_autoFill_backspaced.js | took 3485ms
[task 2019-08-08T12:45:04.119Z] 12:45:04 INFO - checking window state
[task 2019-08-08T12:45:04.140Z] 12:45:04 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:45:04.160Z] 12:45:04 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_autoFill_canonize.js
[task 2019-08-08T12:45:04.357Z] 12:45:04 INFO - GECKO(2468) | waitForDocLoadAndStopIt: Waiting for URL: http://example.com/
[task 2019-08-08T12:45:04.373Z] 12:45:04 INFO - GECKO(2468) | waitForDocLoadAndStopIt: onStateChange f0001:http://example.com/
[task 2019-08-08T12:45:04.373Z] 12:45:04 INFO - GECKO(2468) | waitForDocLoadAndStopIt: Document start: http://example.com/
[task 2019-08-08T12:45:04.470Z] 12:45:04 INFO - GECKO(2468) | waitForDocLoadAndStopIt: Waiting for URL: http://www.exam.com/
[task 2019-08-08T12:45:04.486Z] 12:45:04 INFO - GECKO(2468) | waitForDocLoadAndStopIt: onStateChange f0001:http://www.exam.com/
[task 2019-08-08T12:45:04.486Z] 12:45:04 INFO - GECKO(2468) | waitForDocLoadAndStopIt: Document start: http://www.exam.com/
[task 2019-08-08T12:45:04.528Z] 12:45:04 INFO - GECKO(2468) | MEMORY STAT | vsize 20975162MB | residentFast 1276MB
[task 2019-08-08T12:45:04.529Z] 12:45:04 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_autoFill_canonize.js | took 376ms
[task 2019-08-08T12:45:04.557Z] 12:45:04 INFO - checking window state
[task 2019-08-08T12:45:04.598Z] 12:45:04 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:45:04.598Z] 12:45:04 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_autoFill_caretNotAtEnd.js
[task 2019-08-08T12:45:04.901Z] 12:45:04 INFO - GECKO(2468) | MEMORY STAT | vsize 20975159MB | residentFast 1276MB
[task 2019-08-08T12:45:04.902Z] 12:45:04 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_autoFill_caretNotAtEnd.js | took 297ms
[task 2019-08-08T12:45:04.942Z] 12:45:04 INFO - checking window state
[task 2019-08-08T12:45:04.966Z] 12:45:04 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:45:04.987Z] 12:45:04 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_autoFill_firstResult.js
[task 2019-08-08T12:45:06.496Z] 12:45:06 INFO - GECKO(2468) | MEMORY STAT | vsize 20975148MB | residentFast 1291MB
[task 2019-08-08T12:45:06.496Z] 12:45:06 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_autoFill_firstResult.js | took 1508ms
[task 2019-08-08T12:45:06.540Z] 12:45:06 INFO - checking window state
[task 2019-08-08T12:45:06.560Z] 12:45:06 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:45:06.562Z] 12:45:06 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_autoFill_paste.js
[task 2019-08-08T12:45:07.013Z] 12:45:07 INFO - GECKO(2468) | MEMORY STAT | vsize 20975145MB | residentFast 1282MB
[task 2019-08-08T12:45:07.014Z] 12:45:07 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_autoFill_paste.js | took 451ms
[task 2019-08-08T12:45:07.049Z] 12:45:07 INFO - checking window state
[task 2019-08-08T12:45:07.070Z] 12:45:07 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:45:07.077Z] 12:45:07 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_autoFill_placeholder.js
[task 2019-08-08T12:45:09.709Z] 12:45:09 INFO - GECKO(2468) | MEMORY STAT | vsize 20975129MB | residentFast 1300MB
[task 2019-08-08T12:45:09.710Z] 12:45:09 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_autoFill_placeholder.js | took 2626ms
[task 2019-08-08T12:45:09.746Z] 12:45:09 INFO - checking window state
[task 2019-08-08T12:45:09.761Z] 12:45:09 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:45:09.763Z] 12:45:09 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_autoFill_preserve.js
[task 2019-08-08T12:45:12.181Z] 12:45:12 INFO - GECKO(2468) | MEMORY STAT | vsize 20975094MB | residentFast 1263MB
[task 2019-08-08T12:45:12.182Z] 12:45:12 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_autoFill_preserve.js | took 2420ms
[task 2019-08-08T12:45:12.238Z] 12:45:12 INFO - checking window state
[taskcluster:error] Task has been aborted prematurely. Reason: worker-shutdown
[taskcluster 2019-08-08 12:45:12.651Z] === Task Finished ===
[taskcluster 2019-08-08 12:45:12.651Z] Unsuccessful task run with exit code: -1 completed in 272.774 seconds
Reporter | ||
Comment 4•5 years ago
|
||
ps: i asked for the exception log on irc.
Comment 5•5 years ago
|
||
(In reply to Bogdan Tara[:bogdan_tara] from comment #3)
Ehsan, thank you for looking.
I found logs for "retry" runs, eg. :
https://taskcluster-artifacts.net/XECr5BNLTQiXHC_56h_21Q/4/public/logs/live_backing.log
Is this expected? [taskcluster:error] Task has been aborted prematurely. Reason: worker-shutdown
Hmm, that is most definitely not expected. That message on the last line isn't coming from code in the tree as far as I can tell based on searching in searchfox. It looks like the whole task took about 5 minutes, so this can't be caused by things having taken too long... Also as I had suspected none of the antitracking tests have been running in this job, so the test changes in the patch for bug 1570802 cannot possibly be the culprit. As far as the code changes in that patch go, to the best of my knowledge, that code is not executed by anything in the tree except the tests in toolkit/components/antitracking
, which again aren't running here, so in all honesty I don't see how the code changes even could have caused this. I'm really out of ideas.
Can we ask help from someone more familiar with TaskCluster by any chance?
Reporter | ||
Comment 6•5 years ago
|
||
Dave, could this be caused by the switch to m5.large?
Comment 7•5 years ago
|
||
Any idea why these job gets killed after ~35 minutes?
[taskcluster:error] Task has been aborted prematurely. Reason: worker-shutdown
Log is from https://tools.taskcluster.net/groups/JafjMBAkSim35ccRf4TOdA/tasks/XECr5BNLTQiXHC_56h_21Q/runs/4
Comment 8•5 years ago
|
||
worker-shutdown means that we received notice from the cloud provider that the spot instance is being terminated.
Comment 9•5 years ago
|
||
Could this be excessive resource usage?
Comment 10•5 years ago
|
||
FWIW I'm also testing a backout of my patch on try, just in case...
https://treeherder.mozilla.org/#/jobs?repo=try&revision=edf037efb83ea782c1d065e18b178e21e4f912af
Comment 11•5 years ago
|
||
(In reply to :Ehsan Akhgari from comment #10)
FWIW I'm also testing a backout of my patch on try, just in case...
https://treeherder.mozilla.org/#/jobs?repo=try&revision=edf037efb83ea782c1d065e18b178e21e4f912af
To my surprise, that push came back green! Now I'm pushing just the code changes without the test changes to see if that would break the job...
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5e33b0393d780d77c32c1b26573bcac073d32ce4
I still do not understand at all what's going on here. I verified once again that the code that I changed in this patch doesn't get executed in the tests that run as part of this test suite. So you would think that my patch would have no impact on the test suite...
Reporter | ||
Comment 12•5 years ago
|
||
I also did a try run and got the bc5 job running successfully:
Reporter | ||
Comment 13•5 years ago
|
||
I'm seeing a retry on my test backout, this time on bc10.
No logs atm for task run 0.
https://tools.taskcluster.net/groups/JbSb3wcQS3aU9rU98jP_6g/tasks/QwwjM_5vQhqGivp7ZBqlRQ/runs/0/logs/public%2Flogs%2Flive.log
Comment 14•5 years ago
|
||
I'm doing more try pushes. Currently just the code changes are green: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5e33b0393d780d77c32c1b26573bcac073d32ce4. I'm doing another push with the one test I removed from browser.ini https://treeherder.mozilla.org/#/jobs?repo=try&revision=f68cf2e2fbfb0129fc3c99833f53a5d2a5903b35 as well as one with the rest of the test changes: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1c7b8337b1b3e872e98d154231f50413ae749152
Comment 15•5 years ago
|
||
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #9)
Could this be excessive resource usage?
:dustin, is that something you've seen before? Or is this possibly the billing term/time for that spot request expired? I'm assuming it isn't a hardware failure terminate/retire since it is a spot instance.
Comment 16•5 years ago
|
||
:dustin, is that something you've seen before?
I'm not sure what "excessive resource usage" means here.
Or is this possibly the billing term/time for that spot request expired?
The worker-shutdown
errors do mean that. It's not to do with time (Amazon does per-second billing), but with Amazon pre-empting our use of that instance.
It may be that this particular instance type behaves this way. I don't know the details of the AWS spot "market" anymore, but as of a few years ago it's not much of a "market" at all in the sense that Amazon just sets prices. Maybe when they raise prices, all instances at the lower price-point get terminated?
Comment 17•5 years ago
|
||
OK, this is indeed caused by my patch! It seems that the removing browser_storageAccessPrivateWindow.js from browser.ini causes this.
By sheer luck I was looking at this taskcluster log view window from this other try push on top of the above and there I saw log messages like this:
[task 2019-08-08T17:39:13.687Z] 17:39:13 INFO - GECKO(3714) | MEMORY STAT | vsize 20976516MB | residentFast 3472MB
[task 2019-08-08T17:39:13.687Z] 17:39:13 INFO - TEST-OK | toolkit/components/antitracking/test/browser/browser_partitionedSharedWorkers.js | took 14949ms
[task 2019-08-08T17:39:13.723Z] 17:39:13 INFO - checking window state
[task 2019-08-08T17:39:13.743Z] 17:39:13 INFO - GECKO(3714) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T17:39:13.743Z] 17:39:13 INFO - TEST-START | toolkit/components/antitracking/test/browser/browser_permissionInNormalWindows.js
[task 2019-08-08T17:39:23.087Z] 17:39:23 INFO - GECKO(3714) | MEMORY STAT | vsize 20976484MB | residentFast 3560MB
[task 2019-08-08T17:39:23.088Z] 17:39:23 INFO - TEST-OK | toolkit/components/antitracking/test/browser/browser_permissionInNormalWindows.js | took 9342ms
[task 2019-08-08T17:39:23.112Z] 17:39:23 INFO - checking window state
[task 2019-08-08T17:39:23.140Z] 17:39:23 INFO - GECKO(3714) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T17:39:23.140Z] 17:39:23 INFO - TEST-START | toolkit/components/antitracking/test/browser/browser_permissionInPrivateWindows.js
[task 2019-08-08T17:39:37.194Z] 17:39:37 INFO - GECKO(3714) | JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 912: TypeError: aBrowser is null
[task 2019-08-08T17:39:37.487Z] 17:39:37 INFO - GECKO(3714) | JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 912: TypeError: aBrowser is null
[task 2019-08-08T17:39:37.771Z] 17:39:37 INFO - GECKO(3714) | JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 912: TypeError: aBrowser is null
[task 2019-08-08T17:39:39.255Z] 17:39:39 INFO - GECKO(3714) | JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 912: TypeError: aBrowser is null
[task 2019-08-08T17:39:40.714Z] 17:39:40 INFO - GECKO(3714) | JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 912: TypeError: aBrowser is null
[task 2019-08-08T17:39:41.386Z] 17:39:41 INFO - GECKO(3714) | MEMORY STAT | vsize 20976568MB | residentFast 3559MB
[task 2019-08-08T17:39:41.403Z] 17:39:41 INFO - TEST-OK | toolkit/components/antitracking/test/browser/browser_permissionInPrivateWindows.js | took 18245ms
[task 2019-08-08T17:39:41.435Z] 17:39:41 INFO - checking window state
[task 2019-08-08T17:39:41.436Z] 17:39:41 INFO - GECKO(3714) | must wait for focus
[task 2019-08-08T17:39:41.471Z] 17:39:41 INFO - GECKO(3714) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T17:39:41.507Z] 17:39:41 INFO - Console message: [JavaScript Error: "uncaught exception: Object" {file: "resource://testing-common/PromiseTestUtils.jsm" line: 112}]
[task 2019-08-08T17:39:41.510Z] 17:39:41 INFO - TEST-START | toolkit/components/antitracking/test/browser/browser_referrerDefaultPolicy.js
That's where the log stopped.
So I think what happened here is that removing that one test file changed the test suite chunking decisions, and maybe now the test is taking too long to finish or something (the tests in toolkit/components/antitracking/test/browser
are well known to take their sweet time to run!) and it causes this problem.
I can easily add that test back for now to work around this...
Comment 18•5 years ago
|
||
Comment 19•5 years ago
|
||
Pushed by eakhgari@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2a975bc4c88c Add back browser_storageAccessPrivateWindow.js to undo the test chunking changes which broke Linux64 ASan BC5 tests;
Comment 20•5 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #16)
:dustin, is that something you've seen before?
I'm not sure what "excessive resource usage" means here.
I read that as equivalent to throttling our overall spot pool (?).
Or is this possibly the billing term/time for that spot request expired?
The
worker-shutdown
errors do mean that. It's not to do with time (Amazon does per-second billing), but with Amazon pre-empting our use of that instance.
Thx, then maybe we need to flag these tests as interrupted and not failing or catch-all-exception.
It may be that this particular instance type behaves this way. I don't know the details of the AWS spot "market" anymore, but as of a few years ago it's not much of a "market" at all in the sense that Amazon just sets prices. Maybe when they raise prices, all instances at the lower price-point get terminated?
If the docs are correct, it says they'll change the price for the instance up to the max we set:
In the circumstance EC2 needs to reclaim your Spot instance it can be for two possible reasons, with the primary one being Amazon EC2 capacity requirements (e.g. On Demand or Reserved Instance usage). Secondarily, if you have chosen to set a “maximum Spot price” and the Spot price rises above this, your instance will be reclaimed with a two-minute notification
https://aws.amazon.com/ec2/faqs/#Spot_instances
Comment 21•5 years ago
•
|
||
(In reply to :Ehsan Akhgari from comment #17)
OK, this is indeed caused by my patch! It seems that the removing browser_storageAccessPrivateWindow.js from browser.ini causes this.
Maybe this exposed the "[taskcluster:error] Task has been aborted prematurely. Reason: worker-shutdown" because of running longer. And that it just happened to be instances that aws took away from us for capacity for non-spot users.
Comment 22•5 years ago
|
||
(In reply to Dave House [:dhouse] from comment #21)
(In reply to :Ehsan Akhgari from comment #17)
OK, this is indeed caused by my patch! It seems that the removing browser_storageAccessPrivateWindow.js from browser.ini causes this.
Maybe this exposed the "[taskcluster:error] Task has been aborted prematurely. Reason: worker-shutdown" because of running longer. And that it just happened to be instances that aws took away from us for capacity for non-spot users.
That makes perfect sense.
Comment 23•5 years ago
|
||
Thx, then maybe we need to flag these tests as interrupted and not failing or catch-all-exception.
The worker-shutdown exception is (or at least, was) translated into a blue? purple? color in treeherder which indicates that. If that's no longer the case, that's a bug in threeherder.
I think we may be talking about different errors here? The log that ehsan linked to in comment 17 failed with exception claim-expired, which occurs when the worker disappears. Often this is a result of an OOM or forkbomb or other instance-killing issue in the tests. These are difficult to debug, unfortunately. I'm glad this was tracked to a source-code change, as it's especially difficult to debug these things when they are caused by subtle differences between instance types.
If the docs are correct, it says they'll change the price for the instance up to the max we set:
Our maxPrice is $8.00, and the spot price has hovered around $0.034 for months. So, this must be due to demand for on-demand instances.
Comment 24•5 years ago
|
||
After this patch, the test chunking changed and now bc5 fails on Windows 10 x64 Quantum Render. :-(
Comment 26•5 years ago
|
||
Pushed by ccoroiu@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f8dfdd9df98f Increase Windows 10 x64 QR debug browser-chrome max runtime to 90 minutes. r=gbrown on a CLOSED TREE
Comment 27•5 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/2a975bc4c88c
https://hg.mozilla.org/mozilla-central/rev/f8dfdd9df98f
Updated•5 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment 29•5 years ago
|
||
This is still failing:
:Ehsan Akhgari can you take a look?
Updated•5 years ago
|
Comment 30•5 years ago
|
||
Comment 31•5 years ago
|
||
I have no other ideas besides increasing the test timeout.
Comment 32•5 years ago
|
||
Please see grbown's comment on https://phabricator.services.mozilla.com/D41371. He doesn't believe this will help.
I'll be afk for a few hours, and don't really have any ideas what to do with this. Please feel free to land the patch I attached or back out whatever is needed.
Resigning from this bug because I don't believe I know enough about TaskCluster to be the correct assignee here.
Comment 33•5 years ago
|
||
https://tools.taskcluster.net/groups/ULQlogRdQCCcjQyhTDqUiw/tasks/Jm8GoUc6R6qIHRifzj06yw/runs/0 and reruns fail with claim-expired
. Who can investigate this permafailure?
Dave, is it possible to get the log for one of those and the reason why it gets terminated or system stats like memory?
Comment 34•5 years ago
|
||
Comment 35•5 years ago
|
||
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #33)
https://tools.taskcluster.net/groups/ULQlogRdQCCcjQyhTDqUiw/tasks/Jm8GoUc6R6qIHRifzj06yw/runs/0 and reruns fail with
claim-expired
. Who can investigate this permafailure?Dave, is it possible to get the log for one of those and the reason why it gets terminated or system stats like memory?
I attached the log for the worker id i-0078cc110cbf454a6 which failed at 2019-08-09T12:32:49.202Z on test-linux64-asan/opt-mochitest-browser-chrome-e10s-10 task id: Jm8GoUc6R6qIHRifzj06yw and had started at 2019-08-09T11:57:21.817Z
I'll look in aws to see if we have some system stats like memory collected.
Comment 36•5 years ago
|
||
It looks like disk read activity was high at the end.
We have CPU usage, disk activity, and network. Not memory.
If you have access to the aws project, you can view the chart here:
https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#metricsV2:graph=~(metrics~(~(~'AWS*2fEC2~'CPUUtilization~'InstanceId~'i-0078cc110cbf454a6~(yAxis~'right~period~60~stat~'Average))~(~'.~'EBSReadOps~'.~'.~(period~60~stat~'Average))~(~'.~'EBSIOBalance*25~'.~'.~(period~60~stat~'Average))~(~'.~'EBSWriteBytes~'.~'.~(period~60~stat~'Average))~(~'.~'EBSByteBalance*25~'.~'.~(period~60~stat~'Average))~(~'.~'EBSWriteOps~'.~'.~(period~60~stat~'Average))~(~'.~'EBSReadBytes~'.~'.~(period~60~stat~'Average))~(~'.~'NetworkIn~'.~'.~(period~60~stat~'Average))~(~'.~'NetworkOut~'.~'.~(period~60~stat~'Average))~(~'.~'NetworkPacketsIn~'.~'.~(period~60~stat~'Average))~(~'.~'NetworkPacketsOut~'.~'.~(period~60~stat~'Average))~(~'.~'StatusCheckFailed_Instance~'.~'.~(period~60~stat~'Average))~(~'.~'StatusCheckFailed~'.~'.~(period~60~stat~'Average))~(~'.~'StatusCheckFailed_System~'.~'.~(period~60~stat~'Average)))~view~'timeSeries~stacked~false~region~'us-east-1~start~'-PT12H~end~'P0D~liveData~false);query=~'*7bAWS*2fEC2*2cInstanceId*7d*20InstanceId*3di-0078cc110cbf454a6
Comment 37•5 years ago
|
||
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #33)
https://tools.taskcluster.net/groups/ULQlogRdQCCcjQyhTDqUiw/tasks/Jm8GoUc6R6qIHRifzj06yw/runs/0 and reruns fail with
claim-expired
. Who can investigate this permafailure?
claim-expired means the task had expired, right? Is it somehow getting too-short of an expiration set on the task definition?
Comment 38•5 years ago
|
||
No, claim-expired means the claim expired. Which means the worker crashed or disappeared or was otherwise unable to renew its claim on the task.
Comment 39•5 years ago
|
||
I notice the log in comment 34 contains:
Aug 09 05:28:11 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker: Uncaught Exception! Attempting to report to Sentry and crash.
Aug 09 05:28:14 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker: Error: spawn ENOMEM
Aug 09 05:28:14 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker: at _errnoException (util.js:1019:11)
Aug 09 05:28:14 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker: at ChildProcess.spawn (internal/child_process.js:325:11)
Aug 09 05:28:17 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker: at exports.spawn (child_process.js:494:9)
Aug 09 05:28:17 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker: at Object.exports.execFile (child_process.js:209:15)
Aug 09 05:28:17 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker: at exports.exec (child_process.js:139:18)
Aug 09 05:28:21 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker: at Object.check (/home/ubuntu/docker_worker/node_modules/diskspace/diskspace.js:56:3)
Aug 09 05:28:21 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker: at Object.module.exports (/home/ubuntu/docker_worker/src/lib/stats/host_metrics.js:43:13)
Aug 09 05:28:21 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker: at ontimeout (timers.js:469:11)
Aug 09 05:28:21 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker: at tryOnTimeout (timers.js:304:5)
Aug 09 05:28:21 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker: at Timer.listOnTimeout (timers.js:264:5)
(I don't know what that means / don't know if it is significant.)
Comment 40•5 years ago
|
||
I cannot load the test log for most of the retries or exceptions, but I found a few that I could examine. One of those ended abruptly during toolkit/components/antitracking/test/browser/browser_blockingIndexedDb.js. I pushed to try with that test skipped on asan, but the retries continued.
Comment 41•5 years ago
|
||
Next I tried skipping all the browser_blocking* tests and there were no persistent retries:
Comment 42•5 years ago
|
||
It's significant -- it means the worker got ENOMEM, which would crash the worker and then lead to a claim-expired.
Comment 43•5 years ago
|
||
dhouse, can you double-check that these new instances have the same memory as the old? I looked yesterday and it seemed that they have 512M more, but maybe I misread something?
Comment 44•5 years ago
|
||
(In reply to Geoff Brown [:gbrown] (away Aug 10-18) from comment #41)
Next I tried skipping all the browser_blocking* tests and there were no persistent retries:
One this that is "special" about these tests is that they open and close a lot of browser windows, FWIW. Maybe tens of windows per each test file...
Updated•5 years ago
|
Comment 45•5 years ago
|
||
Pushed by archaeopteryx@coole-files.de: https://hg.mozilla.org/integration/autoland/rev/8cb96bd4a262 Disable browser_blocking* tests on Linux asan to fix permafailing chunk. a=Aryx CLOSED TREE
Comment hidden (Intermittent Failures Robot) |
Comment 47•5 years ago
|
||
bugherder |
Updated•5 years ago
|
Comment hidden (Intermittent Failures Robot) |
Updated•5 years ago
|
Comment 49•5 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #43)
dhouse, can you double-check that these new instances have the same memory as the old? I looked yesterday and it seemed that they have 512M more, but maybe I misread something?
I'm reading it as 512M more also. An external site that scrapes the aws docs show the same: https://www.ec2instances.info/?selected=m3.large,m5.large
Comment 50•5 years ago
|
||
(In reply to Pulsebot from comment #45)
Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/8cb96bd4a262
Disable browser_blocking* tests on Linux asan to fix permafailing chunk.
a=Aryx CLOSED TREE
Sebastian, what is the process for re-enabling these tests? Are we waiting for a resolution to this bug? Thanks!
Comment 51•5 years ago
|
||
The ENOMEM worker crashes with these tests have to be resolved to run them again on Linux x64 asan. If there is anything in a captured profile if the test runs locally which points to the reason, we could get someone more specific to look at it.
Updated•5 years ago
|
Updated•5 years ago
|
Comment 52•5 years ago
|
||
(In reply to Pulsebot from comment #19)
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2a975bc4c88c
Add back browser_storageAccessPrivateWindow.js to undo the test chunking
changes which broke Linux64 ASan BC5 tests;
It seems that the thing that broke the camel's back was this patch.
This try push reverts that changeset and enables all of these tests on Linux ASan and everything seems to be as green as things can be these days:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e6cc2d057ba9a4e56b53776b17f22eb1bfc69c4d
I'm gonna submit a patch to revert this all.
Comment 53•5 years ago
|
||
Updated•5 years ago
|
Comment 54•5 years ago
|
||
(In reply to :Ehsan Akhgari from comment #52)
(In reply to Pulsebot from comment #19)
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2a975bc4c88c
Add back browser_storageAccessPrivateWindow.js to undo the test chunking
changes which broke Linux64 ASan BC5 tests;It seems that the thing that broke the camel's back was this patch.
This try push reverts that changeset and enables all of these tests on Linux ASan and everything seems to be as green as things can be these days:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e6cc2d057ba9a4e56b53776b17f22eb1bfc69c4dI'm gonna submit a patch to revert this all.
I did another push on a new baseline from m-c https://treeherder.mozilla.org/#/jobs?repo=try&revision=df4dccf2ca913f43375a41b7e87536ca8336e3f8 which was worse, it showed that some other test directory had changed somewhere outside of toolkit/components/antitracking/test/browser which had shifted tests around again and reintroduced the same problem. :-(
So re-enabling these tests would create a time bomb... Sigh
Comment 55•5 years ago
|
||
The priority flag is not set for this bug.
:ahal, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 56•5 years ago
|
||
Bug 1572271 may be helpful here. It reduces the memory usage of these tests significantly...
Comment 58•5 years ago
|
||
Comment 59•5 years ago
|
||
Pushed by eakhgari@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6cb67960abb5 Backout changeset 2a975bc4c88c because it wasn't useful to eliminate the test failures
Updated•5 years ago
|
Comment 60•5 years ago
|
||
bugherder |
Comment 61•4 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:ahal, maybe it's time to close this bug?
Comment 62•4 years ago
|
||
Looks like the fix was backed out so still an issue.
Updated•2 years ago
|
Description
•