Intermittent dom/promise/tests/test_promise_argument.html | application crashed [@ RunWatchdog]

Status: REOPENED
Product: Core :: DOM
Priority: P2
Severity: normal
Opened: 9 months ago
Last updated: 2 months ago

People: (Reporter: Treeherder Bug Filer, Assigned: gmierz)

Tracking: (Depends on: 1 bug, {intermittent-failure})
Firefox Tracking Flags: firefox55 affected, firefox56 affected


Attachments

(1 attachment)

Description (Reporter: treeherder, 9 months ago)

Filed by: wkocher [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=82263788&repo=mozilla-inbound

https://queue.taskcluster.net/v1/task/Xx9bTrSAQrqrs04vcGCnVA/runs/0/artifacts/public/logs/live_backing.log

Comment 1

8 months ago
7 failures in 790 pushes (0.009 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 3
* autoland: 2
* try: 1
* graphics: 1

Platform breakdown:
* linux64: 7

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345284&startday=2017-03-06&endday=2017-03-12&tree=all

Comment 2

8 months ago
7 failures in 777 pushes (0.009 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 2
* autoland: 2
* try: 1
* mozilla-central: 1
* graphics: 1

Platform breakdown:
* linux32: 4
* linux64: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345284&startday=2017-03-13&endday=2017-03-19&tree=all

Comment 3

8 months ago
8 failures in 845 pushes (0.009 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 4
* mozilla-inbound: 2
* mozilla-central: 1
* graphics: 1

Platform breakdown:
* linux32: 5
* linux64: 2
* windows7-32-vm: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345284&startday=2017-03-27&endday=2017-04-02&tree=all

Comment 4

7 months ago
11 failures in 894 pushes (0.012 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 4
* autoland: 3
* mozilla-central: 2
* try: 1
* oak: 1

Platform breakdown:
* linux32: 10
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345284&startday=2017-04-10&endday=2017-04-16&tree=all
Depends on: 1358898

Comment 5

7 months ago
14 failures in 817 pushes (0.017 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 5
* autoland: 4
* graphics: 3
* oak: 1
* mozilla-central: 1

Platform breakdown:
* linux32: 13
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345284&startday=2017-04-17&endday=2017-04-23&tree=all

Comment 6

7 months ago
12 failures in 883 pushes (0.014 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-central: 4
* autoland: 4
* mozilla-inbound: 2
* try: 1
* graphics: 1

Platform breakdown:
* linux32: 7
* linux64: 4
* windows7-32-vm: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345284&startday=2017-04-24&endday=2017-04-30&tree=all
Duplicate of this bug: 1361529

Comment 8

7 months ago
17 failures in 770 pushes (0.022 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* try: 6
* mozilla-inbound: 5
* autoland: 3
* mozilla-central: 2
* graphics: 1

Platform breakdown:
* linux32: 14
* linux64: 2
* windows7-32-vm: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345284&startday=2017-05-01&endday=2017-05-07&tree=all

Comment 9

6 months ago
22 failures in 879 pushes (0.025 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 8
* mozilla-central: 5
* mozilla-inbound: 3
* mozilla-beta: 3
* graphics: 2
* try: 1

Platform breakdown:
* linux64: 11
* linux32: 10
* linux64-qr: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345284&startday=2017-05-08&endday=2017-05-14&tree=all

Comment 10

6 months ago
24 failures in 777 pushes (0.031 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 7
* autoland: 7
* try: 4
* mozilla-central: 3
* oak: 2
* graphics: 1

Platform breakdown:
* linux32: 19
* linux64: 5

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345284&startday=2017-05-15&endday=2017-05-21&tree=all

Comment 11

6 months ago
10 failures in 891 pushes (0.011 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 3
* autoland: 3
* try: 2
* mozilla-central: 1
* graphics: 1

Platform breakdown:
* linux32: 10

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345284&startday=2017-05-22&endday=2017-05-28&tree=all
We are working to run all our tests on Ubuntu 16.04, and when running on try we hit this error 33% of the time:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8d191df67e0a14bbde01312a28d827587fa9e8fb&filter-searchStr=linux%20debug%20mochitest-e10s-4

:overholt, could we get some help in debugging this?
Blocks: 1319782
Flags: needinfo?(overholt)
Luke may know what's up.
Flags: needinfo?(overholt) → needinfo?(luke)

Comment 14

6 months ago
Wow, I hadn't seen this intermittent before.

Andrew, is there anyone who could help give me a bit more background on the nature of this failure?  Looking at the logs, I see test_webassembly_compile.html starting and finishing successfully, and it seems like the failure happens when the subsequent browser shutdown times out.  Since many other tests run in the same test run / browser instance, it seems like any of them could be contributing to the shutdown hang.  So one question is: how does the test harness choose whom to blame?  I'm guessing test_webassembly_compile.html is being blamed b/c it was the last test to run (due to it going alphabetically last), but maybe someone more familiar with the harness could explain whether there is a more sophisticated mechanism?
Flags: needinfo?(luke) → needinfo?(overholt)

Comment 15

6 months ago
(Oops, didn't mean to clear ni?me.)
Flags: needinfo?(luke)
Maybe Joel knows the harness and/or blaming mechanism (I don't, sorry)?
Flags: needinfo?(overholt) → needinfo?(jmaher)
Oh, this is my fault; I overlooked something here and see that this is a crash on shutdown.  We can bisect this and figure out which test(s) are responsible for the shutdown crash.  I will get back tomorrow with more actionable data.
Flags: needinfo?(jmaher)

Updated

6 months ago
Flags: needinfo?(luke)
Comment 18

Coming back to this, I realize that one of the last two tests in the mochitest.ini is causing the problem:
https://dxr.mozilla.org/mozilla-central/source/dom/promise/tests/mochitest.ini#19

this leaves:
test_promise_callback_retval.html
test_promise_argument.html


you can see a try push here:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&selectedJob=103704658

I looked at the last two and, commenting them out one at a time, I still see the crash take place, so something in these tests is causing the problem; it doesn't help that they run at the end of the suite.

Does this help narrow it down?  Is there anything else you need to narrow this down further?
Flags: needinfo?(luke)

Comment 19

6 months ago
Thanks!  Looks like wasm is off the hook, so forwarding ni? back to Andrew.
Flags: needinfo?(luke) → needinfo?(overholt)

Comment 20

6 months ago
15 failures in 820 pushes (0.018 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 5
* autoland: 5
* mozilla-central: 4
* graphics: 1

Platform breakdown:
* linux32: 13
* linux64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1345284&startday=2017-05-29&endday=2017-06-04&tree=all
(Assignee)

Comment 21

6 months ago
I ran a little test, changing where 'SimpleTest.waitForExplicitFinish()' is called at the top of the script, and it seems to have changed the error without changing how often it comes up: https://treeherder.mozilla.org/#/jobs?repo=try&revision=2de2c9438fe0a07fed2d63114ad5eff98033bca2
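For readers unfamiliar with the harness API involved, here is a minimal stub of the mochitest SimpleTest object; it is illustrative only (the real harness lives in testing/mochitest) and shows why the placement of waitForExplicitFinish() matters: the call has to happen before any async work starts, otherwise the harness can treat the test as finished as soon as the script body returns.

```javascript
// Minimal stand-in for the mochitest SimpleTest harness (illustrative only).
const SimpleTest = {
  explicitFinish: false,
  finished: false,
  // Tells the harness not to end the test when the script body returns.
  waitForExplicitFinish() { this.explicitFinish = true; },
  // Called by the test itself once all async work is done.
  finish() { this.finished = true; },
};

// Call it first, before any promise work is kicked off.  Comment 21's
// experiment was moving this call around relative to the rest of the script.
SimpleTest.waitForExplicitFinish();

Promise.resolve().then(() => {
  // The actual test assertions would run asynchronously here.
  SimpleTest.finish();
});
```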
(Assignee)

Comment 22

6 months ago
I've found what looks like the root cause of this problem; you can see a test run at [1]. It seems to be happening because of the forced garbage collection that this test does. I don't see why it is necessary, and in my opinion we can get rid of it.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=cfb03725715d45e789dad1b69f1fca5c6cfc6fd0
Comment 23

:luke, I see you authored this test. Could you weigh in with your thoughts on what :gmierz is proposing, and likewise look at the code in the above try push to see if it is reasonable or offer up other suggestions?
Flags: needinfo?(luke)

Comment 24

6 months ago
If we're still only talking about the tests in comment 18 (test_promise_callback_retval.html and test_promise_argument.html), then hg says bz is the author, so forwarding ni? to him.
Flags: needinfo?(luke) → needinfo?(bzbarsky)
Summary: Intermittent dom/promise/tests/test_webassembly_compile.html | application crashed [@ RunWatchdog] → Intermittent dom/promise/tests/test_promise_argument.html | application crashed [@ RunWatchdog]
(Assignee)

Comment 25

6 months ago
So instead of removing the forced garbage collection, I've made it so that the PrefEnv changes made by the two tests mentioned in comment 18 are undone after they finish. So far I haven't seen any failures, and I am retrying 10 more times at [1].

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9aa868009850487e772b02c22ac2277d2cff2b18
(Assignee)

Comment 26

6 months ago
:luke, can you explain to me why the call to SpecialPowers.gc() is needed in this test? Thanks!
(Assignee)

Updated

6 months ago
Flags: needinfo?(luke)
Comment 27

Comment 23 was not about the promise tests.

As far as the promise tests go... I don't see anything in there offhand that should affect shutdown, except maybe the SpecialPowers usage?
Flags: needinfo?(bzbarsky)

Comment 28

6 months ago
(In reply to Boris Zbarsky [:bz] (if a patch has no decent message, automatic r-) from comment #27)
Unless I'm missing something, comment 18 points out two promise tests that alone cause this hang.
Flags: needinfo?(luke)
I can't make sense of comment 18 (and in particular, I don't see how it corresponds to the try run it links to).  If there are details, I would love to know what they are.

Comment 22 is about a change solely to the wasm test that makes the problem go away and comment 23 is a question about that change to the wasm tests.

So far it sounds to me like some combination of all these tests (or at least of gc and "something") is causing the problem.  How I wish we could just record in rr on infra.... :(
Flags: needinfo?(overholt)

Comment 30

6 months ago
Ah, sorry for the noise, I hadn't read the patches in the try push; I had assumed they were for the two tests mentioned in comment 18.  I think I added the System.gc() just to attempt to reveal and catch bugs: basically to ensure the rooting I did for wasm compilation promises was done correctly.  Moving the waitForExplicitFinish() to the top seems perfectly fine; does that solve the issue, or is it necessary to remove System.gc()?  Are there any known problems with calling System.gc()?
(Assignee)

Comment 31

6 months ago
Not a problem :)

Sorry for not removing that line, it didn't help with anything but it did change the error: https://treeherder.mozilla.org/#/jobs?repo=try&revision=2de2c9438fe0a07fed2d63114ad5eff98033bca2

Removing the gc() line is all that was needed to fix the problem. Comment 25 goes a step further and fixes the two tests mentioned in comment 18 by resetting the pushPrefEnv change that they make. I'm not sure of any known problems, but it looks like there's one here.
(Assignee)

Comment 32

6 months ago
So far so good for the fix in comment 25; I'm going to run another ~20 tests on it to make sure.
> by reseting the pushPrefEnv change that they make.

The whole point of pushPrefEnv is that it gets undone automatically when your test ends.  So what you're adding is more pushes that then should all get unwound.

Specifically, see the SpecialPowers.flushPrefEnv call that happens inside SimpleTest.finish.
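To make the unwinding behavior bz describes concrete, here is a synchronous toy model of the pushPrefEnv/flushPrefEnv lifecycle.  The real SpecialPowers API is asynchronous and crosses process boundaries, so everything below other than the two function names is a simplified stand-in.

```javascript
// Toy model of pref-environment push/flush as a stack (illustrative only).
const prefs = new Map([["dom.promise.enabled", true]]);
const prefEnvStack = [];

function pushPrefEnv({ set }) {
  // Record the previous values so this push can be unwound later.
  const saved = set.map(([name]) => [name, prefs.get(name)]);
  prefEnvStack.push(saved);
  for (const [name, value] of set) prefs.set(name, value);
}

function flushPrefEnv() {
  // SimpleTest.finish() effectively does this: pop every outstanding
  // pref environment so the test leaves no pref state behind.
  while (prefEnvStack.length) {
    for (const [name, value] of prefEnvStack.pop()) prefs.set(name, value);
  }
}

pushPrefEnv({ set: [["dom.promise.enabled", false]] });
console.log(prefs.get("dom.promise.enabled")); // false while the test runs
flushPrefEnv();
console.log(prefs.get("dom.promise.enabled")); // true again after finish()
```

This is why manually reverting the prefs (as in comment 25) is redundant: the unwinding already happens automatically at the end of each test.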
(Assignee)

Comment 34

6 months ago
Thanks for the information and clarification, :bz. It did eventually fail as well. I now think it's because of where the variables are declared; see [1]. I moved the variables that were declared within the functions to just outside of them, and after 45 test runs there are still no errors. So what I think is happening here is that the garbage collector destroys those variables before they can be used. This would also mean that there is no problem with gc().

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e0f394490a9ade0931ae5abada29cdf76bcd42b1
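For context, the patch in question has roughly the following shape, with a plain object standing in for the SpecialPowers global wrapper (the real test calls SpecialPowers.Cu.getGlobalForObject); all names here are illustrative, and note that bz questions below whether this change should affect behavior at all.

```javascript
// Sketch of the scoping change from comment 34 (illustrative only).

// Before: each helper declared its working variable locally, so the value
// became unreachable as soon as the call returned.
function checkGlobalBefore(obj) {
  var localWrapper = Object.getPrototypeOf(obj); // stand-in for getGlobalForObject
  return localWrapper !== null;
}

// After: the declaration is hoisted to script scope, keeping the value
// reachable for the lifetime of the test file.
var globalWrapper;
function checkGlobalAfter(obj) {
  globalWrapper = Object.getPrototypeOf(obj);
  return globalWrapper !== null;
}
```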
Comment hidden (mozreview-request)

Comment 36

6 months ago
mozreview-review
Comment on attachment 8875326 [details]
Bug 1345284 - Move variable declarations in test_promise_argument.html, and test_promise_callback_retval.html.

https://reviewboard.mozilla.org/r/146744/#review150856

thanks for fixing this.
Attachment #8875326 - Flags: review?(jmaher) → review+
Greg, I don't understand that patch.  What exactly do you think is being destroyed and at what point?  globalWrapper is only touched right after the getGlobalForObject call right there; why would it need to be scoped to anything outside that function?

Similarly, why do we care what "func" is scoped to?  It's not used after the passThroughCallbackPromise call returns.

Comment 38

6 months ago
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ab4f6b6cd8dd
Move variable declarations in test_promise_argument.html, and test_promise_callback_retval.html. r=jmaher
Or put another way, I don't see why that patch should change any behavior at all.
And please consider that r- from a module peer... :(
(Assignee)

Comment 41

6 months ago
:bz, I was looking through some common things that cause oranges and found this: https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges#Using_objects_without_accounting_for_the_possibility_of_their_death

Which is about the same as what these tests are doing.

In my mind, what happens is that because the gc is called in the other test, the garbage collector gets called again once we get to these tests and unfortunately it happens right between those two lines of code. When we don't force the gc in the other test, then the garbage collection happens elsewhere most often (either before or after those function calls) and that is why it severely reduces the frequency of occurrence. Does this explanation help?
> and found this: https://developer.mozilla.org/en-US/docs/Mozilla

Right, that's very specific to objects which you expect to call back to you later but are not owned by anyone other than yourself.  That's not the situation in this case.

> because the gc is called in the other test, the garbage collector gets called again once we get to these
> tests

That's possible, sure, in that it can affect gc scheduling in general...

> and unfortunately it happens right between those two lines of code

Between what two lines of code?

> Does this explanation help?

Sorry, no.
(Assignee)

Comment 43

6 months ago
Sorry, I meant between these two lines: https://dxr.mozilla.org/mozilla-central/source/dom/promise/tests/test_promise_argument.html?q=test_promise_argument.html&redirect_type=direct#19-20

Do you think this is another problem, and if so, what do you think it is?
> Sorry, I meant between these two lines:

Why would it matter if a gc happens between those two lines?  At that point, globalWrapper is clearly alive and referenced...

> Do you think this is another problem, and if so, what do you think it is?

I have no idea what the problem is, but it's not because globalWrapper dies while there's a live variable pointing to it!

Stepping back for a moment, the fundamental problem here is a shutdown hang, right?  Which part of shutdown is hanging?  Have we figured _that_ out?

Comment 45

6 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/ab4f6b6cd8dd
Status: NEW → RESOLVED
Last Resolved: 6 months ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Assignee: nobody → gmierz2
Sure looks like this bug needs to be open still given the recent comments.
Status: RESOLVED → REOPENED
status-firefox55: fixed → affected
status-firefox56: --- → affected
Resolution: FIXED → ---
Target Milestone: mozilla55 → ---
(Assignee)

Comment 47

5 months ago
Thank you for reopening this. I spent some time looking at the logs to see if anything stands out during shutdown, and I noticed that when the test fails, another DOM window (with a different PID) is used just before the test ends; you can see this at [1]. [2] is a passing test, and there the DOM window's PID does not change right before the end of 'test_webassembly_compile.html'. Furthermore, in [1], the shutdown hangs as it tries to close a DOCSHELL with the same PID as the DOM window used at the end of the test. I haven't been able to find another passing test with this difference yet, though.

One question I have is where would I find the code that deals with the shutdown phase?

[1]: https://pastebin.mozilla.org/9024477
[2]: https://pastebin.mozilla.org/9024479
Probably in xpcom/build/XPCOMInit.cpp in general...

Priority: -- → P2