Closed Bug 1345284 Opened 7 years ago Closed 1 year ago

Intermittent dom/promise/tests/test_promise_argument.html | application crashed [@ RunWatchdog]

Categories

(Core :: DOM: Core & HTML, defect, P3)

defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox55 --- affected
firefox56 --- affected

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

we are working to run all our tests on ubuntu 16.04, and when running on try we found this error 33% of the time:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8d191df67e0a14bbde01312a28d827587fa9e8fb&filter-searchStr=linux%20debug%20mochitest-e10s-4

:overholt, could we get some help in debugging this?
Blocks: 1319782
Flags: needinfo?(overholt)
Luke may know what's up.
Flags: needinfo?(overholt) → needinfo?(luke)
Wow, I hadn't seen this intermittent before.

Andrew, is there anyone who could help give me a bit more background on the nature of this failure?  Looking at the logs, I see test_webassembly_compile.html starting and finishing successfully, and it seems like the failure happens when the subsequent browser shutdown times out.  Since many other tests run in the same test run / browser instance, it seems like any of them could be causing to the shutdown hang.  So one question is: how does the test harness choose who to blame?  I'm guessing test_webassembly_compile.html is being blamed b/c it was the last test to run (due to it going alphabetically last), but maybe someone more familiar with the harness could explain if there is a more sophisticated mechanism?
Flags: needinfo?(luke) → needinfo?(overholt)
(Oops, didn't mean to clear ni?me.)
Flags: needinfo?(luke)
Maybe Joel knows the harness and/or blaming mechanism (I don't, sorry)?
Flags: needinfo?(overholt) → needinfo?(jmaher)
oh, this is my fault- I overlooked something here and see that this is a crash on shutdown.  We can bisect this and figure out which test(s) are responsible for the shutdown crash, I will get back tomorrow with more actionable data.
Flags: needinfo?(jmaher)
Flags: needinfo?(luke)
coming back to this, I realize that one of the last 2 tests in the mochitest.ini are causing the problem:
https://dxr.mozilla.org/mozilla-central/source/dom/promise/tests/mochitest.ini#19

this leaves:
test_promise_callback_retval.html
test_promise_argument.html


you can see a try push here:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&selectedJob=103704658

I looked at the last two and commenting one out at a time, I see the crash take place- so something in these tests are causing the problem, and it doesn't help that they run at the end of the suite.

Does this help narrow it down?  Are there more things you would like help in narrowing this down?
Flags: needinfo?(luke)
Thanks!  Looks like wasm is off the hook, so forwarding ni? back to Andrew.
Flags: needinfo?(luke) → needinfo?(overholt)
I ran a little test with changing where 'SimpleTest.waitForExplicitFinish()' is called at the top of the script and it seems to have affected the error without changing how often it comes up: https://treeherder.mozilla.org/#/jobs?repo=try&revision=2de2c9438fe0a07fed2d63114ad5eff98033bca2
I've found a solution or the root cause of this problem, you can see a test run at [1]. It seems to be happening because of the forced garbage collection that this test does. I also don't see why it is necessary, and in my opinion, we can get rid of it.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=cfb03725715d45e789dad1b69f1fca5c6cfc6fd0
:luke, I see you authored this test, could you weigh in with your thoughts on what :gmierz is proposing, likewise maybe look at the code in the above try push to see if that is reasonable or offer up other suggestions?
Flags: needinfo?(luke)
If we're still only talking about the tests in comment 18 (test_promise_callback_retval.html and test_promise_argument.html), then hg says bz is the author, so forwarding ni? to him.
Flags: needinfo?(luke) → needinfo?(bzbarsky)
Summary: Intermittent dom/promise/tests/test_webassembly_compile.html | application crashed [@ RunWatchdog] → Intermittent dom/promise/tests/test_promise_argument.html | application crashed [@ RunWatchdog]
So instead of removing the forced garbage collection, I've made it so that the PrefEnv changes made by the two test mentioned in comment 18 are removed after they finish. So far I haven't seen any failures and I am retrying 10 more times at [1]. 

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9aa868009850487e772b02c22ac2277d2cff2b18
:luke, can you explain to me why the call to SpecialPowers.gc() is needed in this test? Thanks!
Flags: needinfo?(luke)
Comment 23 was not about the promise tests.

As far as the promise tests go... I don't see anything in there offhand that should affect shutdown, except maybe the SpecialPowers usage?
Flags: needinfo?(bzbarsky)
(In reply to Boris Zbarsky [:bz] (if a patch has no decent message, automatic r-) from comment #27)
Unless I'm missing something, comment 18 points out two promise tests that alone cause this hang.
Flags: needinfo?(luke)
I can't make sense of comment 18 (and in particular, I don't see how it corresponds to the try run it links to).  If there are details, I would love to know what they are.

Comment 22 is about a change solely to the wasm test that makes the problem go away and comment 23 is a question about that change to the wasm tests.

So far it sounds to me like some combination of all these tests (or at least of gc and "something") is causing the problem.  How I wish we could just record in rr on infra.... :(
Flags: needinfo?(overholt)
Ah, sorry for the noise, I hadn't read the patches in the try push; I had assumed they were for the two tests mentioned in comment 18.  I think I added the System.gc() just to to attempt to reveal and catch bugs: basically to ensure the rooting I did for wasm compilation promises was done correctly.  Moving the waitForExplicitFinish() to the top seems perfectly fine; does that solve the issue or is it necessary to remove System.gc()?  Are there any known problems with calling System.gc()?
Not a problem :)

Sorry for not removing that line, it didn't help with anything but it did change the error: https://treeherder.mozilla.org/#/jobs?repo=try&revision=2de2c9438fe0a07fed2d63114ad5eff98033bca2

Removing the gc() line is all that was needed to fix the problem. Comment 25 goes a step further and fixes the two tests mentioned in comment 18 by reseting the pushPrefEnv change that they make. I'm not sure of any known problems, but it looks like there's one here.
Or so far so good for the fix in comment 25, I'm going to run another ~20 tests on it to make sure.
> by reseting the pushPrefEnv change that they make.

The whole point of pushPrefEnv is that it gets undone automatically when your test ends.  So what you're adding is more pushes that then should all get unwound.

Specifically, see the SpecialPowers.flushPrefEnv call that happens inside SimpleTest.finish.
Thanks for the information and clarification :bz. It did fail eventually also. I now think it's because of where the variables are declared, see [1]. I declared the variables that were within the functions, just outside of them, and after 45 test runs there are still no errors. So what I think is happening here is that the garbage collector destroys those variables before they can be used. This also means that there are no problems with gc().

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e0f394490a9ade0931ae5abada29cdf76bcd42b1
Comment on attachment 8875326 [details]
Bug 1345284 - Move variable declarations in test_promise_argument.html, and test_promise_callback_retval.html.

https://reviewboard.mozilla.org/r/146744/#review150856

thanks for fixing this.
Attachment #8875326 - Flags: review?(jmaher) → review+
Greg, I don't understand that patch.  What exactly do you think is being destroyed and at what point?  globalWrapper is only touched right after the getGlobalForObject call right there; why would it need to be scoped to anything outside that function?

Similarly, why do we care what "func" is scoped to?  It's not used after the passThroughCallbackPromise call returns.
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ab4f6b6cd8dd
Move variable declarations in test_promise_argument.html, and test_promise_callback_retval.html. r=jmaher
Or put another way, I don't see why that patch should change any behavior at all.
And please consider that r- from a module peer... :(
:bz, I was looking through some common things that cause oranges and found this: https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges#Using_objects_without_accounting_for_the_possibility_of_their_death

Which is about the same as what these tests are doing.

In my mind, what happens is that because the gc is called in the other test, the garbage collector gets called again once we get to these tests and unfortunately it happens right between those two lines of code. When we don't force the gc in the other test, then the garbage collection happens elsewhere most often (either before or after those function calls) and that is why it severely reduces the frequency of occurrence. Does this explanation help?
> and found this: https://developer.mozilla.org/en-US/docs/Mozilla

Right, that's very specific to objects which you expect to call bacak to you later but are not owned by anyone other than yourself.  That's not the situation in this case.

> because the gc is called in the other test, the garbage collector gets called again once we get to these
> tests

That's possible, sure, in that it can affect gc scheduling in general...

> and unfortunately it happens right between those two lines of code

Between what two lines of code?

> Does this explanation help?

Sorry, no.
Sorry, I meant between these two lines: https://dxr.mozilla.org/mozilla-central/source/dom/promise/tests/test_promise_argument.html?q=test_promise_argument.html&redirect_type=direct#19-20

Do you think this is another problem, and if so, what do you think it is?
> Sorry, I meant between these two lines:

Why would it matter if a gc happens between those two lines?  At that point, globalWrapper is clearly alive and referenced...

> Do you think this is another problem, and if so, what do you think it is?

I have no idea what the problem is, but it's not because globalWrapper dies while there's a live variable pointing to it!

Stepping back for a moment, the fundamental problem here is a shutdown hang, right?  Which part of shutdown is hanging?  Have we figured _that_ out?
https://hg.mozilla.org/mozilla-central/rev/ab4f6b6cd8dd
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Assignee: nobody → gmierz2
Sure looks like this bug needs to be open still given the recent comments.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla55 → ---
Thank you for reopening this. I spent some time looking at the logs to see if there is anything that stands out during shutdown and I noticed that when the test fails, it uses another DOM window (with a different PID) just before the test ends, and you can see this at [1]. [2] is a passing test, and in there, there isn't a change in the DOM window's PID right before the end of 'test_webassembly_compile.html'. Furthermore, in [1], the shutdown hangs as it tries to close a DOCSHELL with the same PID as the one that is used for the DOM window, at the end of the test, in [1]. I haven't been able to find another passing test with this difference yet though.

One question I have is where would I find the code that deals with the shutdown phase?

[1]: https://pastebin.mozilla.org/9024477
[2]: https://pastebin.mozilla.org/9024479
Probably in xpcom/build/XPCOMInit.cpp in general...

.
Priority: -- → P2
Greg, are you going to be able to look at this?
Flags: needinfo?(gmierz2)
:overholt, I don't think I'll have a chance to hack on this in the near future - you can assign someone else on this bug if needed. I'm happy to help answer any questions they may have though.
Flags: needinfo?(gmierz2)
OK, thanks, Greg.

Edgar, can you please take a look when you have a minute? Thanks.
Assignee: gmierz2 → echen
Flags: needinfo?(echen)
Component: DOM → DOM: Core & HTML
Priority: P2 → P3
Assignee: echen → nobody
Flags: needinfo?(echen)
Severity: normal → S3

I do not see any instances of this intermittent any more. But I do see that those tests are enabled only in DEBUG, apparently since they landed (so not as a reaction on the intermittent). Filed bug 1799820 to verify, this can be closed then.

Status: REOPENED → RESOLVED
Closed: 7 years ago1 year ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: