Closed Bug 689580 Opened 13 years ago Closed 10 years ago

RANDOM ORANGE: TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed (content-tabs | application crashed (minidump found) - in test_plugin*)

Categories

(Thunderbird :: General, defect)

x86_64
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: mconley, Unassigned)

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 1 obsolete file)

At this point, seems to only be appearing on 64 bit builds on OSX.

My theory is that we're attempting to delete the minidumps before they've actually been written.  Patch forthcoming.
Ok, so on closer inspection of http://tinderbox.mozilla.org/showlog.cgi?log=ThunderbirdTrunk/1317102966.1317103963.27461.gz, it looks like the JS Bridge is timing out.

Something I've noticed:  when doing OOP crash on OSX 64, my disk will spin up and my processor will go 100% for a few minutes.  Finally, it'll calm down, and I get the crash notifications UI as expected.

I suspect now that something similar is happening on some of our testing machines:  the crash is happening, OSX goes crazy figuring out what just happened, and then the JS Bridge times out because OSX is taking longer than the 60 second timeout to do something.

So - do we try to speed up our crash handling on OSX (which may or may not be out of Gecko's hands), or do we try to increase the timeout for this particular test?
Whiteboard: [tb-orange]
Summary: RANDOM ORANGE: TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed (content-tabs | application crashed (minidump found)) → RANDOM ORANGE: TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed (content-tabs | application crashed (minidump found) - in test_plugin*)
Attached patch Patch v1 (obsolete) — Splinter Review
Try builds are coming in here:  http://build.mozillamessaging.com/tinderboxpushlog/?tree=ThunderbirdTry&rev=a83aabec281d

A variation of this patch did rather well a few days ago:  http://build.mozillamessaging.com/tinderboxpushlog/?tree=ThunderbirdTry&rev=f15ec5945c93

If this patch works, I can't really say for sure why.  I was pretty certain that a JS Bridge timeout was occurring, and then the minidumps weren't being cleaned up.  This patch cleans out an alternate directory that contains *pending* minidumps, and also disengages the crash-reporter during runtime (at sid0's suggestion), which might help OSX 64 not choke so much on a crash.

We'll see.
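
For illustration, the cleanup idea described above (clearing both the regular and the *pending* minidump directories) could look roughly like this. It is a minimal sketch and not the attached patch: the function name clear_minidumps, the directory arguments, and the .dmp / .extra file suffixes are assumptions made for the example.

```python
from pathlib import Path

# Assumption: dump files end in .dmp, with optional .extra sidecar files.
# The real harness may use different names or locations.
DUMP_SUFFIXES = {".dmp", ".extra"}


def clear_minidumps(dump_dirs):
    """Delete dump files from every directory in dump_dirs.

    Returns the list of removed paths so the caller can log them.
    Directories that do not exist are skipped rather than treated as errors.
    """
    removed = []
    for directory in map(Path, dump_dirs):
        if not directory.is_dir():
            continue
        for entry in sorted(directory.iterdir()):
            if entry.is_file() and entry.suffix in DUMP_SUFFIXES:
                entry.unlink()
                removed.append(entry)
    return removed
```

Returning the removed paths makes it easy to log what was cleaned up, which would help tell "nothing was written yet" apart from "dumps were written somewhere else".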
Looks like this fix didn't work:  http://build.mozillamessaging.com/tinderboxpushlog/?tree=ThunderbirdTry&rev=aec8b56bcf00

So here are a few options:

1)  Disable the plugin crash test for OSX 64-bit
2)  Communicate with the breakpad team to see what is going on with OSX 64 blocking on the main thread during a plugin crash - see if there's something we can do to mitigate it
3)  Increase the JS Bridge timeout (this will involve some re-writing of runtest.py, and is more of an "end-of-pipe" solution...)

I'm going to start on option 2 today until I hear otherwise.
Ok, so I was just talking to ted in #breakpad, and he suggested the possibility that the OSX internal crash reporter framework could be at fault here.

On OSX, after a plugin crashes, not only does breakpad take action, but the OSX internal crash reporter wakes up - and that can take a while (could be > 60 seconds...which is awful).  Once the framework is done loading, Mozmill wakes up and realizes that > 60 seconds have gone by since it last heard from Thunderbird, and then shuts down the test as a failure.

That's my theory right now, anyways.
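
To make the theory concrete, here is a minimal sketch of the kind of silence-based timeout being described. It is not Mozmill or JS Bridge code: the class name BridgeWatchdog and its methods are hypothetical, and only the 60 second figure comes from the discussion above.

```python
import time


class BridgeWatchdog:
    """Tracks how long it has been since the application last responded.

    Hypothetical illustration only; not part of Mozmill or JS Bridge.
    """

    def __init__(self, timeout=60.0, clock=time.monotonic):
        self.timeout = timeout
        self._clock = clock
        self._last_heard = clock()

    def heard_from_app(self):
        """Record that a message just arrived from the application."""
        self._last_heard = self._clock()

    def timed_out(self):
        """Return True if the application has been silent longer than timeout."""
        return self._clock() - self._last_heard > self.timeout
```

If the whole machine stalls for more than 60 seconds while the OS crash reporter loads, a check like timed_out() would report a failure even though Thunderbird itself recovered, which matches the symptom in the logs.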

Here are some related bugs about the OSX crash reporter:  

https://bugzilla.mozilla.org/show_bug.cgi?id=607015
https://bugzilla.mozilla.org/show_bug.cgi?id=577661

and here's the bug for bypassing the OSX crash reporter:

https://bugzilla.mozilla.org/show_bug.cgi?id=577673

Ted suggested that I contact smichaud, who is the resident OSX expert, and may have worked around this before.  I've sent him mail and CC'd him on this bug - I'll update when I hear more.
Given what we know about this failure, can we just extend the timeout for now so we can get the tree mostly green again? If we find an alternate solution, we can always pick that up later.
Mark:

Extending the timeout would mean some re-write of runtest.py, since we cannot nicely / directly affect the timeout length when subclassing mozmill.CLI.

A quicker solution would be to disable this test for OSX 64.  Would that be acceptable?
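
As a rough sketch of what a platform-based skip could look like on the Python side of the harness (this is an assumption about the approach, not the patch that eventually landed; is_64bit_macos and should_skip are hypothetical names):

```python
import sys


def is_64bit_macos(platform=sys.platform, maxsize=sys.maxsize):
    """Return True when running a 64-bit Python on macOS.

    Note: this checks the interpreter running the harness, which is only
    a stand-in for the bitness of the application build under test.
    """
    return platform == "darwin" and maxsize > 2**32


def should_skip(test_name, skip_on_mac64=("test-plugin-crashing.js",), **env):
    """Return True if test_name is in the skip list and we are on 64-bit macOS."""
    return test_name in skip_on_mac64 and is_64bit_macos(**env)
```

The platform and maxsize parameters exist only so the check can be exercised on any machine; a real harness would likely decide based on the build being tested.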

-Mike
Hm - this is interesting.  On trunk, it looks like the tests are not timing out, but crashes *still* aren't being cleared:

http://build.mozillamessaging.com/tinderboxpushlog/?tree=ThunderbirdTrunk&rev=1b02fbd2f3b0
I'd be happy with disabling on OSX 64 for now as we've got coverage on the other two platforms.
Mark:

Alright - I've got something cooking on try right now, to see if it can clear spurious minidumps.

Failing that, I'll have a patch for disabling the test on OSX 64 this afternoon.

-Mike
Comment on attachment 565295 [details] [diff] [review]
[checked in] Temporarily turn off test-plugin-crashing.js on 64-bit OSX

Ok, try server is suffering from its upgrade at the moment. This looks right, so r=me and we'll land it on trunk.
Attachment #565295 - Flags: review+
Comment on attachment 565295 [details] [diff] [review]
[checked in] Temporarily turn off test-plugin-crashing.js on 64-bit OSX

Checked in: http://hg.mozilla.org/comm-central/rev/e2ec39613c79
Attachment #565295 - Attachment description: Temporarily turn of test-plugin-crashing.js on 64-bit OSX → [checked in] Temporarily turn of test-plugin-crashing.js on 64-bit OSX
Attachment #565295 - Attachment description: [checked in] Temporarily turn of test-plugin-crashing.js on 64-bit OSX → [checked in] Temporarily turn off test-plugin-crashing.js on 64-bit OSX
Attachment #565295 - Flags: approval-comm-aurora?
Comment on attachment 565295 [details] [diff] [review]
[checked in] Temporarily turn off test-plugin-crashing.js on 64-bit OSX

I've landed this on aurora so we can get this fixed there:

http://hg.mozilla.org/releases/comm-aurora/rev/2a43d7f6ab1c

Not going to track it here.
Attachment #565295 - Flags: approval-comm-aurora?
Assignee: nobody → mconley
Target Milestone: --- → Thunderbird 10.0
Assignee: mconley → nobody
Target Milestone: Thunderbird 10.0 → ---
Whiteboard: [tb-orange]
Are we good now?
No TBPL robot hits since 2013-03-20.
Flags: needinfo?(mconley)
Sure, let's close it off.
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(mconley)
Resolution: --- → WORKSFORME