Closed Bug 1377692 Opened 7 years ago Closed 5 years ago

Shutdown debug crash in Mozmill: PROCESS-CRASH | folder-display | application crashed

Categories

(Thunderbird :: Testing Infrastructure, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 68.0

People

(Reporter: jorgk-bmo, Unassigned)

References

Details

(Whiteboard: [Thunderbird-testfailure: Z all debug])

Attachments

(3 files)

First seen Sun Jun 25, 2017, 20:12:10
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=2a185ec18e89412466841a9755c49561f05bf142

Different symptoms on different platforms:
Windows:
PROCESS-CRASH | folder-display | application crashed [@ strnlen + 0xb7]
PROCESS-CRASH | content-tabs | application crashed [@ nptest.dll + 0x26ac] 

Linux:
PROCESS-CRASH | folder-tree-modes | application crashed [@ libc-2.15.so + 0x4ac36]

Mac started crashing Tue Jun 27, 2017, 16:34:12
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=6ccad3aa817af34d04fd8585ba0ce642f0b9193b

PROCESS-CRASH | folder-display | application crashed [@ strlen + 0x12]

During this push on Sat Jul 1, 2017, 15:40:25
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=df37b1887f4dc4170cc657bf4317f5870209485c
we have:
Windows+Mac:
PROCESS-CRASH | folder-display | application crashed [@ strnlen + 0xb7]
PROCESS-CRASH | folder-display | application crashed [@ strlen + 0x12]

Linux:
PROCESS-CRASH | folder-tree-modes | application crashed [@ libc-2.15.so + 0x4ac36]
PROCESS-CRASH | quick-filter-bar | application crashed [@ libc-2.15.so + 0x44a6a]

Looks like this is a shutdown crash at the end of the Mozmill run:
08:46:13     INFO -  INFO Passed: 314
08:46:13     INFO -  INFO Failed: 0
08:46:13     INFO -  INFO Skipped: 0
08:46:13  WARNING -  PROCESS-CRASH | folder-display | application crashed [@ strnlen + 0xb7]

Or on Linux:
09:51:40     INFO -  INFO Passed: 33
09:51:40     INFO -  INFO Failed: 0
09:51:40     INFO -  INFO Skipped: 0
09:51:40  WARNING -  PROCESS-CRASH | folder-tree-modes | application crashed [@ libc-2.15.so + 0x4ac36]
Also see bug 1342858 comment #5.
Whiteboard: [Thunderbird-testfailure: Z all debug]
Looking through the log again, the crash really started were stated in comment #0. So:

M-C last good: c01aa84ded7eb0b3e691f8bcc5cd887c96
M-C first bad: d50abca6521baeae8ac6b07ddf843d63a1

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=c01aa84ded7eb0b3e691f8bcc5cd887c96&tochange=d50abca6521baeae8ac6b07ddf843d63a1

Nothing landed on M-C in this range, so M-C changes can't explain it. Also, the crashes on Mac started tow days later, so this is really weird.
OK, I ran |mozmake SOLO_TEST=folder-display mozmill-one| manually I got exactly what the servers report:

...
db left open c:\mozilla-source\comm-central\obj-i686-pc-mingw32\_tests\mozmill\mozmillprofile\Mail\Local Folders\Search1.msf
db left open c:\mozilla-source\comm-central\obj-i686-pc-mingw32\_tests\mozmill\mozmillprofile\Mail\Local Folders\Search2.msf
db left open
INFO Passed: 314
INFO Failed: 0
INFO Skipped: 0
PROCESS-CRASH | folder-display | application crashed [unknown top frame]
Crash dump filename: c:\mozilla-source\comm-central\obj-i686-pc-mingw32\_tests\mozmill\mozmillprofile\minidumps\94dd3ef8-e352-4564-8f6e-478d1ecdc0ef.dmp
MINIDUMP_STACKWALK not set, can't process dump.
TinderboxPrint: folder-display<br/><em class="testfail">CRASH</em>
SUMMARY-PASS | test-archive-messages.js::setupModule
SUMMARY-PASS | test-archive-messages.js::test_batch_archiver
...

The dmp file is not there to inspect after the run and no debugger offered itself.

Aceman, do you have more ideas? Perhaps run this on Linux and see what happens?
Flags: needinfo?(acelists)
This is really puzzling, here we have a run without Mac crash:
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=33ae1055884a922fbf05bdb37452f982bdbc0186
The crash is still there on Windows.
I ran |mozmake SOLO_TEST=folder-display mozmill-one| manually and this time I got:
INFO Passed: 314
INFO Failed: 0
INFO Skipped: 0
SUMMARY-PASS | test-archive-messages.js::setupModule

So no crash. But a push today on Treeherder has:
00:33:26     INFO -  INFO Passed: 314
00:33:26     INFO -  INFO Failed: 0
00:33:26     INFO -  INFO Skipped: 0
00:33:26  WARNING -  PROCESS-CRASH | folder-display | application crashed [@ strnlen + 0xb7]
for Windows debug.

Mysterious!
Blocks: 1468691
Ben, you did great fixing the tokeniser debug test failure. Here we have another bug related to debug crashes.

Our debug runs are never green, we always see:
PROCESS-CRASH | attachment | application crashed [@ linux-gate.so + 0xcd9] (Linux)
or
PROCESS-CRASH | content-tabs | application crashed [@ nptest.dll + 0x2910] (Windows).

As far as I'm aware, it's a shutdown crash, you need to run an entire test directory with - say -
  make SOLO_TEST=attachment mozmill-one
and it might crash at the very end. Be aware of bug 1342858 which is another tough one and may be related. To mitigate that open db business a bit, I've already landed bug 1468691.

If you're interested in some detective work, this could be for you. No obligation! If you could find the cause quickly, you'd be my hero ;-)
Flags: needinfo?(acelists) → needinfo?(benc)
Not too much to report, although I did manage to narrow down my testcase a bit (Linux build):

    $ make SOLO_TEST=attachment/test-attachment-events.js mozmill-one

and within test-attachment-events.js the trigger appears to be in `test_attachments_added_on_multiple()`
(I can disable all the other test functions in the file, and that one causes a crash for me).
Will pick it up tomorrow and try and focus an even smaller test case, then pile in with the debugger.
Flags: needinfo?(benc)
No breakthroughs, just a dump of where I'm up to (mainly for my own reference).

This is my stripped-down mozmill test case (in comm/test/mozmill/attachment/).

    $ cd obj-x86_64-pc-linux-gnu/
    $ make SOLO_TEST=attachment/test-attachment-crashcase.js mozmill-one

Attaching gdb on the fly, I see the crash during MOZ_gdk_display_close().

Next steps: get a breakpoint set to single-step through all the actions performed by the offending 2 lines in this cut-down mozmill script, and try and spot it doing anything naughty.
Attached file crashlog.txt
and here's a log of my cut-down test running (and crashing).
Any updates? I'd love to clear this almost perma-red test failure one day.
Flags: needinfo?(benc)
Not yet - I got massively distracted with other stuff, but I'm back on it now, so shall keep you posted.
No solution so far, but a few more thoughts:

The end-symptom of this issue under linux is that during GUI shutdown, one of the third-party library shutdown functions will assert because not all their resources have been freed.

around toolkit/xre/nsAppRunner.cpp:2930, in MOZ_gdk_display_close(), one of these will assert:
cairo_debug_reset_static_data()  (libcario)
FcFini()  (libfontconfig)

I'm guessing the code path on windows is more tolerant of these leftover resources, but I'd guess the underlying problem is still happening there.

I'm rather short on ideas on how to track down the underlying cause. I'll bet its a bad refcount somewhere.
There's such a large surface area in both in JS and C++ land, that I don't really know where to begin.
The best clue is that adding an attachment to an email can trigger it.
I've tried digging into the related code there, but ended up drowning in complexity - I just don't know the code well enough yet.

My manual reproduction steps:
1) run TB (eg `./mach run --debug`)
2) click "Write" in the toolbar to start composing a new message
3) add an attachment to the new message
4) close TB (eg CTRL-Q)

Valgrind would likely help a lot (by telling us which bit of code allocated the dangling objects in the first place), but valgrind doesn't seem to work with Thunderbird at the moment. I thought valgrind was being run as part of the automated builds? Is there some special technique for running it?

My scorched-earth idea is to figure out a gdb script to log out _every_ addref/release, and hack up a tool to run through this and figure out which ones are leaking...
And of course, having written that, it occurs to me that XPCOM probably has something for that built in, and sure enough XPCOM_MEM_LEAK_LOG is a thing. I'll see if it still works and post back here with any progress.
(In reply to Ben Campbell from comment #13)
> And of course, having written that, it occurs to me that XPCOM probably has
> something for that built in, and sure enough XPCOM_MEM_LEAK_LOG is a thing.
> I'll see if it still works and post back here with any progress.
Yes, I used it years ago to find leaks:
https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Refcount_tracing_and_balancing
https://www-archive.mozilla.org/performance/leak-tutorial.html
https://www-archive.mozilla.org/performance/refcnt-balancer.html
https://wiki.mozilla.org/User:Mook:Leak_Notes

I also attach my notes which may be 100% useless.

Fixed by bug 1551119. The last incarnation of the failure was in attachment/test-attachment-events.js as per comment #8.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(benc)
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 68.0
See Also: → 1677202
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: