Open Bug 1572393 Opened 5 years ago Updated 2 years ago

browser_blocking* test cause ENOMEM worker crashes on Linux asan, workers keep on retrying until it turns into exception

Categories

(Testing :: Mochitest, defect, P2)

Version 3
x86_64
Linux
defect

Tracking

(Not tracked)

REOPENED

People

(Reporter: bogdan_tara, Unassigned)

References

Details

Attachments

(4 files, 3 obsolete files)

Summary: Intermittent Linux asan bc5 keeps on retrying until it turns into exception → Almost permafail Linux asan bc5 keeps on retrying until it turns into exception

Ehsan, this seems to have started when bug 1570802 landed. Could you please take a look?

Flags: needinfo?(ehsan)

I had a look, and I'm not sure what to do with this. The jobs that turn into exception don't seem to produce a log (at least as far as I can tell). They only have a "live.log" log, but when I try to view those links they all time out for me, so I can't actually get an idea of what's running during this test and what happens that causes it not to finish.

Looking at the Linux64 asan bc5 e10s test on the prior push, I see that it doesn't include the antitracking tests, which makes this a very weird coincidence: if this were something related to bug 1570802, I would have expected the antitracking tests to run within this suite.

Is there any way to get a log from these test runs?

Flags: needinfo?(ehsan) → needinfo?(btara)

Ehsan, thank you for looking.

I found logs for "retry" runs, eg. :
https://taskcluster-artifacts.net/XECr5BNLTQiXHC_56h_21Q/4/public/logs/live_backing.log
Is this expected? [taskcluster:error] Task has been aborted prematurely. Reason: worker-shutdown

[task 2019-08-08T12:44:30.446Z] 12:44:30 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_URLBarSetURI.js
[task 2019-08-08T12:44:34.488Z] 12:44:34 INFO - GECKO(2468) | MEMORY STAT vsizeMaxContiguous not supported in this build configuration.
[task 2019-08-08T12:44:34.489Z] 12:44:34 INFO - GECKO(2468) | MEMORY STAT heapAllocated not supported in this build configuration.
[task 2019-08-08T12:44:34.491Z] 12:44:34 INFO - GECKO(2468) | MEMORY STAT | vsize 20975057MB | residentFast 1108MB
[task 2019-08-08T12:44:34.493Z] 12:44:34 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_URLBarSetURI.js | took 4050ms
[task 2019-08-08T12:44:34.529Z] 12:44:34 INFO - checking window state
[task 2019-08-08T12:44:34.550Z] 12:44:34 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:34.566Z] 12:44:34 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_UrlbarInput_formatValue.js
[task 2019-08-08T12:44:50.880Z] 12:44:50 INFO - GECKO(2468) | MEMORY STAT | vsize 20974981MB | residentFast 1286MB
[task 2019-08-08T12:44:50.881Z] 12:44:50 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_UrlbarInput_formatValue.js | took 16318ms
[task 2019-08-08T12:44:50.923Z] 12:44:50 INFO - checking window state
[task 2019-08-08T12:44:50.923Z] 12:44:50 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:50.958Z] 12:44:50 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_UrlbarInput_hiddenFocus.js
[task 2019-08-08T12:44:51.567Z] 12:44:51 INFO - GECKO(2468) | MEMORY STAT | vsize 20975084MB | residentFast 1331MB
[task 2019-08-08T12:44:51.568Z] 12:44:51 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_UrlbarInput_hiddenFocus.js | took 618ms
[task 2019-08-08T12:44:51.609Z] 12:44:51 INFO - checking window state
[task 2019-08-08T12:44:51.629Z] 12:44:51 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:51.658Z] 12:44:51 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_UrlbarInput_overflow.js
[task 2019-08-08T12:44:53.771Z] 12:44:53 INFO - GECKO(2468) | MEMORY STAT | vsize 20975116MB | residentFast 1373MB
[task 2019-08-08T12:44:53.773Z] 12:44:53 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_UrlbarInput_overflow.js | took 2118ms
[task 2019-08-08T12:44:53.814Z] 12:44:53 INFO - checking window state
[task 2019-08-08T12:44:53.816Z] 12:44:53 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:53.825Z] 12:44:53 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_UrlbarInput_overflow_resize.js
[task 2019-08-08T12:44:56.009Z] 12:44:56 INFO - GECKO(2468) | MEMORY STAT | vsize 20975128MB | residentFast 1209MB
[task 2019-08-08T12:44:56.012Z] 12:44:56 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_UrlbarInput_overflow_resize.js | took 2186ms
[task 2019-08-08T12:44:56.054Z] 12:44:56 INFO - checking window state
[task 2019-08-08T12:44:56.055Z] 12:44:56 INFO - GECKO(2468) | must wait for focus
[task 2019-08-08T12:44:56.071Z] 12:44:56 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:56.075Z] 12:44:56 INFO - Console message: [JavaScript Error: "uncaught exception: Object" {file: "resource://testing-common/PromiseTestUtils.jsm" line: 112}]
[task 2019-08-08T12:44:56.092Z] 12:44:56 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_UrlbarInput_tooltip.js
[task 2019-08-08T12:44:56.696Z] 12:44:56 INFO - GECKO(2468) | MEMORY STAT | vsize 20975124MB | residentFast 1209MB
[task 2019-08-08T12:44:56.696Z] 12:44:56 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_UrlbarInput_tooltip.js | took 609ms
[task 2019-08-08T12:44:56.734Z] 12:44:56 INFO - checking window state
[task 2019-08-08T12:44:56.750Z] 12:44:56 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:56.754Z] 12:44:56 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_UrlbarInput_trimURLs.js
[task 2019-08-08T12:44:57.341Z] 12:44:57 INFO - GECKO(2468) | MEMORY STAT | vsize 20975144MB | residentFast 1241MB
[task 2019-08-08T12:44:57.342Z] 12:44:57 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_UrlbarInput_trimURLs.js | took 586ms
[task 2019-08-08T12:44:57.382Z] 12:44:57 INFO - checking window state
[task 2019-08-08T12:44:57.403Z] 12:44:57 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:57.411Z] 12:44:57 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_UrlbarLoadRace.js
[task 2019-08-08T12:44:58.661Z] 12:44:58 INFO - GECKO(2468) | JavaScript error: resource://gre/modules/sessionstore/Utils.jsm, line 27: TypeError: aStream.available is not a function
[task 2019-08-08T12:44:58.766Z] 12:44:58 INFO - GECKO(2468) | MEMORY STAT | vsize 20975171MB | residentFast 1292MB
[task 2019-08-08T12:44:58.766Z] 12:44:58 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_UrlbarLoadRace.js | took 1355ms
[task 2019-08-08T12:44:58.793Z] 12:44:58 INFO - checking window state
[task 2019-08-08T12:44:58.814Z] 12:44:58 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:58.814Z] 12:44:58 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_action_searchengine.js
[task 2019-08-08T12:44:59.637Z] 12:44:59 INFO - GECKO(2468) | MEMORY STAT | vsize 20975183MB | residentFast 1313MB
[task 2019-08-08T12:44:59.638Z] 12:44:59 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_action_searchengine.js | took 819ms
[task 2019-08-08T12:44:59.666Z] 12:44:59 INFO - checking window state
[task 2019-08-08T12:44:59.687Z] 12:44:59 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:44:59.691Z] 12:44:59 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_action_searchengine_alias.js
[task 2019-08-08T12:45:00.512Z] 12:45:00 INFO - GECKO(2468) | MEMORY STAT | vsize 20975189MB | residentFast 1326MB
[task 2019-08-08T12:45:00.513Z] 12:45:00 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_action_searchengine_alias.js | took 816ms
[task 2019-08-08T12:45:00.552Z] 12:45:00 INFO - checking window state
[task 2019-08-08T12:45:00.573Z] 12:45:00 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:45:00.590Z] 12:45:00 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_autoFill_backspaced.js
[task 2019-08-08T12:45:04.079Z] 12:45:04 INFO - GECKO(2468) | MEMORY STAT | vsize 20975159MB | residentFast 1272MB
[task 2019-08-08T12:45:04.079Z] 12:45:04 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_autoFill_backspaced.js | took 3485ms
[task 2019-08-08T12:45:04.119Z] 12:45:04 INFO - checking window state
[task 2019-08-08T12:45:04.140Z] 12:45:04 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:45:04.160Z] 12:45:04 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_autoFill_canonize.js
[task 2019-08-08T12:45:04.357Z] 12:45:04 INFO - GECKO(2468) | waitForDocLoadAndStopIt: Waiting for URL: http://example.com/
[task 2019-08-08T12:45:04.373Z] 12:45:04 INFO - GECKO(2468) | waitForDocLoadAndStopIt: onStateChange f0001:http://example.com/
[task 2019-08-08T12:45:04.373Z] 12:45:04 INFO - GECKO(2468) | waitForDocLoadAndStopIt: Document start: http://example.com/
[task 2019-08-08T12:45:04.470Z] 12:45:04 INFO - GECKO(2468) | waitForDocLoadAndStopIt: Waiting for URL: http://www.exam.com/
[task 2019-08-08T12:45:04.486Z] 12:45:04 INFO - GECKO(2468) | waitForDocLoadAndStopIt: onStateChange f0001:http://www.exam.com/
[task 2019-08-08T12:45:04.486Z] 12:45:04 INFO - GECKO(2468) | waitForDocLoadAndStopIt: Document start: http://www.exam.com/
[task 2019-08-08T12:45:04.528Z] 12:45:04 INFO - GECKO(2468) | MEMORY STAT | vsize 20975162MB | residentFast 1276MB
[task 2019-08-08T12:45:04.529Z] 12:45:04 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_autoFill_canonize.js | took 376ms
[task 2019-08-08T12:45:04.557Z] 12:45:04 INFO - checking window state
[task 2019-08-08T12:45:04.598Z] 12:45:04 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:45:04.598Z] 12:45:04 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_autoFill_caretNotAtEnd.js
[task 2019-08-08T12:45:04.901Z] 12:45:04 INFO - GECKO(2468) | MEMORY STAT | vsize 20975159MB | residentFast 1276MB
[task 2019-08-08T12:45:04.902Z] 12:45:04 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_autoFill_caretNotAtEnd.js | took 297ms
[task 2019-08-08T12:45:04.942Z] 12:45:04 INFO - checking window state
[task 2019-08-08T12:45:04.966Z] 12:45:04 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:45:04.987Z] 12:45:04 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_autoFill_firstResult.js
[task 2019-08-08T12:45:06.496Z] 12:45:06 INFO - GECKO(2468) | MEMORY STAT | vsize 20975148MB | residentFast 1291MB
[task 2019-08-08T12:45:06.496Z] 12:45:06 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_autoFill_firstResult.js | took 1508ms
[task 2019-08-08T12:45:06.540Z] 12:45:06 INFO - checking window state
[task 2019-08-08T12:45:06.560Z] 12:45:06 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:45:06.562Z] 12:45:06 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_autoFill_paste.js
[task 2019-08-08T12:45:07.013Z] 12:45:07 INFO - GECKO(2468) | MEMORY STAT | vsize 20975145MB | residentFast 1282MB
[task 2019-08-08T12:45:07.014Z] 12:45:07 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_autoFill_paste.js | took 451ms
[task 2019-08-08T12:45:07.049Z] 12:45:07 INFO - checking window state
[task 2019-08-08T12:45:07.070Z] 12:45:07 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:45:07.077Z] 12:45:07 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_autoFill_placeholder.js
[task 2019-08-08T12:45:09.709Z] 12:45:09 INFO - GECKO(2468) | MEMORY STAT | vsize 20975129MB | residentFast 1300MB
[task 2019-08-08T12:45:09.710Z] 12:45:09 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_autoFill_placeholder.js | took 2626ms
[task 2019-08-08T12:45:09.746Z] 12:45:09 INFO - checking window state
[task 2019-08-08T12:45:09.761Z] 12:45:09 INFO - GECKO(2468) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T12:45:09.763Z] 12:45:09 INFO - TEST-START | browser/components/urlbar/tests/browser/browser_autoFill_preserve.js
[task 2019-08-08T12:45:12.181Z] 12:45:12 INFO - GECKO(2468) | MEMORY STAT | vsize 20975094MB | residentFast 1263MB
[task 2019-08-08T12:45:12.182Z] 12:45:12 INFO - TEST-OK | browser/components/urlbar/tests/browser/browser_autoFill_preserve.js | took 2420ms
[task 2019-08-08T12:45:12.238Z] 12:45:12 INFO - checking window state

[taskcluster:error] Task has been aborted prematurely. Reason: worker-shutdown
[taskcluster 2019-08-08 12:45:12.651Z] === Task Finished ===
[taskcluster 2019-08-08 12:45:12.651Z] Unsuccessful task run with exit code: -1 completed in 272.774 seconds

Flags: needinfo?(btara) → needinfo?(ehsan)

PS: I asked for the exception log on IRC.

(In reply to Bogdan Tara[:bogdan_tara] from comment #3)

Ehsan, thank you for looking.

I found logs for "retry" runs, eg. :
https://taskcluster-artifacts.net/XECr5BNLTQiXHC_56h_21Q/4/public/logs/live_backing.log
Is this expected? [taskcluster:error] Task has been aborted prematurely. Reason: worker-shutdown

Hmm, that is most definitely not expected. That message on the last line isn't coming from code in the tree, as far as I can tell from searching in searchfox. It looks like the whole task took about 5 minutes, so this can't be caused by things having taken too long... Also, as I had suspected, none of the antitracking tests ran in this job, so the test changes in the patch for bug 1570802 cannot possibly be the culprit. As far as the code changes in that patch go, to the best of my knowledge that code is not executed by anything in the tree except the tests in toolkit/components/antitracking, which again aren't running here, so in all honesty I don't see how the code changes could have caused this. I'm really out of ideas.

Can we ask help from someone more familiar with TaskCluster by any chance?

Flags: needinfo?(ehsan)

Dave, could this be caused by the switch to m5.large?

Flags: needinfo?(dhouse)

Any idea why these jobs get killed after ~35 minutes?

[taskcluster:error] Task has been aborted prematurely. Reason: worker-shutdown

Log is from https://tools.taskcluster.net/groups/JafjMBAkSim35ccRf4TOdA/tasks/XECr5BNLTQiXHC_56h_21Q/runs/4

Flags: needinfo?(dustin)

worker-shutdown means that we received notice from the cloud provider that the spot instance is being terminated.

Flags: needinfo?(dustin)
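For reference, spot interruptions are announced through the EC2 instance metadata service roughly two minutes before shutdown. The sketch below shows one way a worker could check for that notice; it assumes the documented termination-time metadata endpoint and is not docker-worker's actual implementation.

// check-spot-notice.js - illustrative sketch only, not docker-worker source.
const http = require('http');

// The endpoint returns 404 until AWS schedules an interruption, then
// returns the planned shutdown timestamp until the instance is reclaimed.
function checkSpotTermination(callback) {
  const req = http.get(
    'http://169.254.169.254/latest/meta-data/spot/termination-time',
    res => {
      let body = '';
      res.on('data', chunk => (body += chunk));
      res.on('end', () => callback(res.statusCode === 200 ? body : null));
    }
  );
  req.on('error', () => callback(null)); // metadata service unreachable
}

// Poll every few seconds; a non-null value means the instance is going away
// and the worker should stop claiming tasks and report worker-shutdown.
setInterval(() => {
  checkSpotTermination(time => {
    if (time) console.log('spot termination scheduled for', time);
  });
}, 5000);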

FWIW I'm also testing a backout of my patch on try, just in case...
https://treeherder.mozilla.org/#/jobs?repo=try&revision=edf037efb83ea782c1d065e18b178e21e4f912af

(In reply to :Ehsan Akhgari from comment #10)

FWIW I'm also testing a backout of my patch on try, just in case...
https://treeherder.mozilla.org/#/jobs?repo=try&revision=edf037efb83ea782c1d065e18b178e21e4f912af

To my surprise, that push came back green! Now I'm pushing just the code changes without the test changes to see if that would break the job...
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5e33b0393d780d77c32c1b26573bcac073d32ce4

I still do not understand at all what's going on here. I verified once again that the code that I changed in this patch doesn't get executed in the tests that run as part of this test suite. So you would think that my patch would have no impact on the test suite...

I'm seeing a retry on my test backout, this time on bc10.

No logs atm for task run 0.
https://tools.taskcluster.net/groups/JbSb3wcQS3aU9rU98jP_6g/tasks/QwwjM_5vQhqGivp7ZBqlRQ/runs/0/logs/public%2Flogs%2Flive.log

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #9)

Could this be excessive resource usage?

:dustin, is that something you've seen before? Or is this possibly the billing term/time for that spot request expired? I'm assuming it isn't a hardware failure terminate/retire since it is a spot instance.

Flags: needinfo?(dhouse) → needinfo?(dustin)

:dustin, is that something you've seen before?

I'm not sure what "excessive resource usage" means here.

Or is this possibly the billing term/time for that spot request expired?

The worker-shutdown errors do mean that. It's not to do with time (Amazon does per-second billing), but with Amazon pre-empting our use of that instance.

It may be that this particular instance type behaves this way. I don't know the details of the AWS spot "market" anymore, but as of a few years ago it's not much of a "market" at all in the sense that Amazon just sets prices. Maybe when they raise prices, all instances at the lower price-point get terminated?

Flags: needinfo?(dustin)

OK, this is indeed caused by my patch! It seems that removing browser_storageAccessPrivateWindow.js from browser.ini causes this.

By sheer luck I was looking at the TaskCluster log view for another try push on top of the one above, and there I saw log messages like this:

[task 2019-08-08T17:39:13.687Z] 17:39:13     INFO - GECKO(3714) | MEMORY STAT | vsize 20976516MB | residentFast 3472MB
[task 2019-08-08T17:39:13.687Z] 17:39:13     INFO - TEST-OK | toolkit/components/antitracking/test/browser/browser_partitionedSharedWorkers.js | took 14949ms
[task 2019-08-08T17:39:13.723Z] 17:39:13     INFO - checking window state
[task 2019-08-08T17:39:13.743Z] 17:39:13     INFO - GECKO(3714) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T17:39:13.743Z] 17:39:13     INFO - TEST-START | toolkit/components/antitracking/test/browser/browser_permissionInNormalWindows.js
[task 2019-08-08T17:39:23.087Z] 17:39:23     INFO - GECKO(3714) | MEMORY STAT | vsize 20976484MB | residentFast 3560MB
[task 2019-08-08T17:39:23.088Z] 17:39:23     INFO - TEST-OK | toolkit/components/antitracking/test/browser/browser_permissionInNormalWindows.js | took 9342ms
[task 2019-08-08T17:39:23.112Z] 17:39:23     INFO - checking window state
[task 2019-08-08T17:39:23.140Z] 17:39:23     INFO - GECKO(3714) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T17:39:23.140Z] 17:39:23     INFO - TEST-START | toolkit/components/antitracking/test/browser/browser_permissionInPrivateWindows.js
[task 2019-08-08T17:39:37.194Z] 17:39:37     INFO - GECKO(3714) | JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 912: TypeError: aBrowser is null
[task 2019-08-08T17:39:37.487Z] 17:39:37     INFO - GECKO(3714) | JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 912: TypeError: aBrowser is null
[task 2019-08-08T17:39:37.771Z] 17:39:37     INFO - GECKO(3714) | JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 912: TypeError: aBrowser is null
[task 2019-08-08T17:39:39.255Z] 17:39:39     INFO - GECKO(3714) | JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 912: TypeError: aBrowser is null
[task 2019-08-08T17:39:40.714Z] 17:39:40     INFO - GECKO(3714) | JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 912: TypeError: aBrowser is null
[task 2019-08-08T17:39:41.386Z] 17:39:41     INFO - GECKO(3714) | MEMORY STAT | vsize 20976568MB | residentFast 3559MB
[task 2019-08-08T17:39:41.403Z] 17:39:41     INFO - TEST-OK | toolkit/components/antitracking/test/browser/browser_permissionInPrivateWindows.js | took 18245ms
[task 2019-08-08T17:39:41.435Z] 17:39:41     INFO - checking window state
[task 2019-08-08T17:39:41.436Z] 17:39:41     INFO - GECKO(3714) | must wait for focus
[task 2019-08-08T17:39:41.471Z] 17:39:41     INFO - GECKO(3714) | JavaScript error: resource://testing-common/PromiseTestUtils.jsm, line 112: uncaught exception: Object
[task 2019-08-08T17:39:41.507Z] 17:39:41     INFO - Console message: [JavaScript Error: "uncaught exception: Object" {file: "resource://testing-common/PromiseTestUtils.jsm" line: 112}]
[task 2019-08-08T17:39:41.510Z] 17:39:41     INFO - TEST-START | toolkit/components/antitracking/test/browser/browser_referrerDefaultPolicy.js

That's where the log stopped.

So I think what happened here is that removing that one test file changed the test suite chunking decisions, and maybe now the job is taking too long to finish or something (the tests in toolkit/components/antitracking/test/browser are well known to take their sweet time to run!), which causes this problem.

I can easily add that test back for now to work around this...
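As a rough illustration of the chunking effect described above: tests are split into a fixed number of chunks by accumulated runtime, so removing a single manifest entry can shift later tests (and whole directories) into a different chunk. This is a conceptual sketch with made-up runtimes, not the actual harness code.

// chunking-sketch.js - hypothetical data; greedy split by accumulated runtime.
function chunkByRuntime(tests, nChunks) {
  const total = tests.reduce((sum, t) => sum + t.runtime, 0);
  const target = total / nChunks;
  const chunks = [[]];
  let acc = 0;
  for (const t of tests) {
    if (acc >= target && chunks.length < nChunks) {
      chunks.push([]);
      acc = 0;
    }
    chunks[chunks.length - 1].push(t.name);
    acc += t.runtime;
  }
  return chunks;
}

const tests = [
  { name: 'urlbar/browser_a.js', runtime: 60 },
  { name: 'antitracking/browser_storageAccessPrivateWindow.js', runtime: 60 },
  { name: 'antitracking/browser_blockingIndexedDb.js', runtime: 90 },
  { name: 'other/browser_b.js', runtime: 30 },
];

// Removing one entry moves browser_blockingIndexedDb.js into a different
// chunk, changing which chunk carries the heavy antitracking tests.
console.log(chunkByRuntime(tests, 2));
console.log(chunkByRuntime(tests.filter(t => !t.name.includes('storageAccess')), 2));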

Blocks: 1570802
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2a975bc4c88c
Add back browser_storageAccessPrivateWindow.js to undo the test chunking changes which broke Linux64 ASan BC5 tests;

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #16)

:dustin, is that something you've seen before?
I'm not sure what "excessive resource usage" means here.

I read that as equivalent to throttling our overall spot pool (?).

Or is this possibly the billing term/time for that spot request expired?

The worker-shutdown errors do mean that. It's not to do with time (Amazon does per-second billing), but with Amazon pre-empting our use of that instance.

Thanks; then maybe we need to flag these runs as interrupted rather than as failing or as a catch-all exception.

It may be that this particular instance type behaves this way. I don't know the details of the AWS spot "market" anymore, but as of a few years ago it's not much of a "market" at all in the sense that Amazon just sets prices. Maybe when they raise prices, all instances at the lower price-point get terminated?

If the docs are correct, it says they'll change the price for the instance up to the max we set:

In the circumstance EC2 needs to reclaim your Spot instance it can be for two possible reasons, with the primary one being Amazon EC2 capacity requirements (e.g. On Demand or Reserved Instance usage). Secondarily, if you have chosen to set a “maximum Spot price” and the Spot price rises above this, your instance will be reclaimed with a two-minute notification
https://aws.amazon.com/ec2/faqs/#Spot_instances

(In reply to :Ehsan Akhgari from comment #17)

OK, this is indeed caused by my patch! It seems that removing browser_storageAccessPrivateWindow.js from browser.ini causes this.

Maybe this exposed the "[taskcluster:error] Task has been aborted prematurely. Reason: worker-shutdown" failures because the runs got longer, and it just happened to be instances that AWS took away from us to provide capacity for non-spot users.

(In reply to Dave House [:dhouse] from comment #21)

(In reply to :Ehsan Akhgari from comment #17)

OK, this is indeed caused by my patch! It seems that removing browser_storageAccessPrivateWindow.js from browser.ini causes this.

Maybe this exposed the "[taskcluster:error] Task has been aborted prematurely. Reason: worker-shutdown" failures because the runs got longer, and it just happened to be instances that AWS took away from us to provide capacity for non-spot users.

That makes perfect sense.

Thanks; then maybe we need to flag these runs as interrupted rather than as failing or as a catch-all exception.

The worker-shutdown exception is (or at least was) translated into a blue or purple color in Treeherder, which indicates that. If that's no longer the case, that's a bug in Treeherder.

I think we may be talking about different errors here? The log that ehsan linked to in comment 17 failed with exception claim-expired, which occurs when the worker disappears. Often this is a result of an OOM or forkbomb or other instance-killing issue in the tests. These are difficult to debug, unfortunately. I'm glad this was tracked to a source-code change, as it's especially difficult to debug these things when they are caused by subtle differences between instance types.

If the docs are correct, it says they'll change the price for the instance up to the max we set:

Our maxPrice is $8.00, and the spot price has hovered around $0.034 for months. So, this must be due to demand for on-demand instances.

After this patch, the test chunking changed and now bc5 fails on Windows 10 x64 Quantum Render. :-(

Pushed by ccoroiu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f8dfdd9df98f
Increase Windows 10 x64 QR debug browser-chrome max runtime to 90 minutes. r=gbrown on a CLOSED TREE
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70
Assignee: nobody → ehsan
Status: RESOLVED → REOPENED
Flags: needinfo?(ehsan)
Resolution: FIXED → ---
Target Milestone: mozilla70 → ---

I have no other ideas besides increasing the test timeout.

Flags: needinfo?(ehsan)

Please see gbrown's comment on https://phabricator.services.mozilla.com/D41371. He doesn't believe this will help.

I'll be AFK for a few hours and don't really have any ideas about what to do with this. Please feel free to land the patch I attached or back out whatever is needed.

Resigning from this bug because I don't believe I know enough about TaskCluster to be the correct assignee here.

Assignee: ehsan → nobody
Flags: needinfo?(aciure)

https://tools.taskcluster.net/groups/ULQlogRdQCCcjQyhTDqUiw/tasks/Jm8GoUc6R6qIHRifzj06yw/runs/0 and reruns fail with claim-expired. Who can investigate this permafailure?

Dave, is it possible to get the log for one of those and the reason why it gets terminated or system stats like memory?

Flags: needinfo?(aciure) → needinfo?(dhouse)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #33)

https://tools.taskcluster.net/groups/ULQlogRdQCCcjQyhTDqUiw/tasks/Jm8GoUc6R6qIHRifzj06yw/runs/0 and reruns fail with claim-expired. Who can investigate this permafailure?

Dave, is it possible to get the log for one of those and the reason why it gets terminated or system stats like memory?

I attached the log for worker id i-0078cc110cbf454a6, which failed at 2019-08-09T12:32:49.202Z on test-linux64-asan/opt-mochitest-browser-chrome-e10s-10 (task id: Jm8GoUc6R6qIHRifzj06yw) and had started at 2019-08-09T11:57:21.817Z.

I'll look in AWS to see if we have some system stats, like memory, collected.

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #33)

https://tools.taskcluster.net/groups/ULQlogRdQCCcjQyhTDqUiw/tasks/Jm8GoUc6R6qIHRifzj06yw/runs/0 and reruns fail with claim-expired. Who can investigate this permafailure?

claim-expired means the task had expired, right? Is it somehow getting too-short of an expiration set on the task definition?

Flags: needinfo?(aryx.bugmail)

No, claim-expired means the claim expired, which means the worker crashed, disappeared, or was otherwise unable to renew its claim on the task.

I notice the log in comment 34 contains:

Aug 09 05:28:11 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker: Uncaught Exception! Attempting to report to Sentry and crash.
Aug 09 05:28:14 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker: Error: spawn ENOMEM
Aug 09 05:28:14 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker:     at _errnoException (util.js:1019:11)
Aug 09 05:28:14 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker:     at ChildProcess.spawn (internal/child_process.js:325:11)
Aug 09 05:28:17 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker:     at exports.spawn (child_process.js:494:9)
Aug 09 05:28:17 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker:     at Object.exports.execFile (child_process.js:209:15)
Aug 09 05:28:17 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker:     at exports.exec (child_process.js:139:18)
Aug 09 05:28:21 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker:     at Object.check (/home/ubuntu/docker_worker/node_modules/diskspace/diskspace.js:56:3)
Aug 09 05:28:21 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker:     at Object.module.exports (/home/ubuntu/docker_worker/src/lib/stats/host_metrics.js:43:13)
Aug 09 05:28:21 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker:     at ontimeout (timers.js:469:11)
Aug 09 05:28:21 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker:     at tryOnTimeout (timers.js:304:5)
Aug 09 05:28:21 docker-worker.aws-provisioner.us-east-1b.ami-0dedc5db10ec003af.m5-large.i-0078cc110cbf454a6 docker-worker:     at Timer.listOnTimeout (timers.js:264:5)

(I don't know what that means / don't know if it is significant.)

I cannot load the test log for most of the retries or exceptions, but I found a few that I could examine. One of those ended abruptly during toolkit/components/antitracking/test/browser/browser_blockingIndexedDb.js. I pushed to try with that test skipped on asan, but the retries continued.

Next I tried skipping all the browser_blocking* tests and there were no persistent retries:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=ec7d17b05ac335e4fdd8376df672b00ca39ab676

It's significant -- it means the worker got ENOMEM, which would crash the worker and then lead to a claim-expired.

dhouse, can you double-check that these new instances have the same memory as the old? I looked yesterday and it seemed that they have 512M more, but maybe I misread something?
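For context, the failure mode described above can be sketched as follows: a periodic metrics check in the worker spawns a child process, the kernel cannot allocate memory for the fork, and the resulting "spawn ENOMEM" error goes unhandled, so the Node process dies and can never renew its task claim. This is an illustrative sketch only, not docker-worker source.

// enomem-crash-sketch.js - illustrative only.
const { execFile } = require('child_process');

setInterval(() => {
  // Collect host metrics by shelling out, as the stack trace above does
  // via the diskspace module. Under memory pressure the OS refuses to
  // fork and execFile reports Error: spawn ENOMEM.
  execFile('df', ['-k'], (err, stdout) => {
    if (err) throw err; // unhandled -> uncaught exception -> worker process dies
    // ...report disk usage to the metrics backend...
  });
}, 60 * 1000);

// Once the process is gone, the task's claim is never renewed and the
// queue eventually marks the run as claim-expired.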

(In reply to Geoff Brown [:gbrown] (away Aug 10-18) from comment #41)

Next I tried skipping all the browser_blocking* tests and there were no persistent retries:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=ec7d17b05ac335e4fdd8376df672b00ca39ab676

One thing that is "special" about these tests is that they open and close a lot of browser windows, FWIW. Maybe tens of windows per test file...
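For illustration, the pattern these tests repeat many times per file looks roughly like the sketch below. It uses the standard BrowserTestUtils helpers but is simplified and not copied from any specific test in the suite; under ASan each new window carries a much larger memory footprint, which fits the ENOMEM seen on the 8 GB m5.large workers.

// window-churn-sketch.js - simplified; not from the antitracking suite.
add_task(async function test_window_churn() {
  for (let i = 0; i < 10; i++) {
    // Alternate normal and private windows (illustrative).
    const win = await BrowserTestUtils.openNewBrowserWindow({ private: i % 2 === 1 });
    const tab = await BrowserTestUtils.openNewForegroundTab(
      win.gBrowser,
      "https://example.com/"
    );
    // ...exercise storage access / tracking protection in the tab...
    BrowserTestUtils.removeTab(tab);
    await BrowserTestUtils.closeWindow(win);
  }
});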

Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/8cb96bd4a262
Disable browser_blocking* tests on Linux asan to fix permafailing chunk. a=Aryx CLOSED TREE
Summary: Almost permafail Linux asan bc5 keeps on retrying until it turns into exception → browser_blocking* test cause ENOMEM worker crashes on Linux asan, workers keep on retrying until it turns into exception
Whiteboard: [stockwell disable-recommended]

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #43)

dhouse, can you double-check that these new instances have the same memory as the old? I looked yesterday and it seemed that they have 512M more, but maybe I misread something?

I'm reading it as 512M more also. An external site that scrapes the AWS docs shows the same: https://www.ec2instances.info/?selected=m3.large,m5.large

(In reply to Pulsebot from comment #45)

Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/8cb96bd4a262
Disable browser_blocking* tests on Linux asan to fix permafailing chunk.
a=Aryx CLOSED TREE

Sebastian, what is the process for re-enabling these tests? Are we waiting for a resolution to this bug? Thanks!

The ENOMEM worker crashes with these tests have to be resolved before they can run again on Linux x64 asan. If a profile captured while running the tests locally shows anything that points to the cause, we could get someone more specific to look at it.
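For reference, the disabling was done through the mochitest manifest, so re-enabling means removing annotations of roughly this shape from toolkit/components/antitracking/test/browser/browser.ini. The snippet below only illustrates the skip-if syntax; it is not the exact content of the changeset.

[browser_blockingIndexedDb.js]
skip-if = asan  # Bug 1572393 - ENOMEM worker crashes on Linux asan
# ...and similar skip-if = asan annotations on the other browser_blocking* entries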

Attachment #9084293 - Attachment is obsolete: true
Attachment #9084370 - Attachment mime type: application/octet-stream → text/plain

(In reply to Pulsebot from comment #19)

Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2a975bc4c88c
Add back browser_storageAccessPrivateWindow.js to undo the test chunking
changes which broke Linux64 ASan BC5 tests;

It seems that the straw that broke the camel's back was this patch.

This try push reverts that changeset and enables all of these tests on Linux ASan and everything seems to be as green as things can be these days:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e6cc2d057ba9a4e56b53776b17f22eb1bfc69c4d

I'm gonna submit a patch to revert this all.

Attachment #9084922 - Attachment is obsolete: true

(In reply to :Ehsan Akhgari from comment #52)

(In reply to Pulsebot from comment #19)

Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2a975bc4c88c
Add back browser_storageAccessPrivateWindow.js to undo the test chunking
changes which broke Linux64 ASan BC5 tests;

It seems that the straw that broke the camel's back was this patch.

This try push reverts that changeset and enables all of these tests on Linux ASan and everything seems to be as green as things can be these days:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e6cc2d057ba9a4e56b53776b17f22eb1bfc69c4d

I'm gonna submit a patch to revert this all.

I did another push on a new baseline from m-c, https://treeherder.mozilla.org/#/jobs?repo=try&revision=df4dccf2ca913f43375a41b7e87536ca8336e3f8, which was worse: it showed that some other test directory had changed somewhere outside of toolkit/components/antitracking/test/browser, which shifted tests around again and reintroduced the same problem. :-(

So re-enabling these tests would create a time bomb... Sigh

The priority flag is not set for this bug.
:ahal, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(ahal)

Bug 1572271 may be helpful here. It reduces the memory usage of these tests significantly...

Feel free to adjust as needed.

Flags: needinfo?(ahal)
Priority: -- → P2
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6cb67960abb5
Backout changeset 2a975bc4c88c because it wasn't useful to eliminate the test failures
Attachment #9084103 - Attachment is obsolete: true

The leave-open keyword is there and there is no activity for 6 months.
:ahal, maybe it's time to close this bug?

Flags: needinfo?(ahal)

Looks like the fix was backed out, so this is still an issue.

Flags: needinfo?(ahal)
Keywords: leave-open
Severity: normal → S3
