Closed Bug 1631149 Opened 4 years ago Closed 3 years ago

tb win7 perma fail: comm/mail/test/browser/composition/browser_replyAddresses.js, browser_replyMultipartCharset.js,browser_saveChangesOnQuit.js,browser_sendFormat.js,browser_signatureUpdating.js,browser_replyFormatFlowed.js | application crashed [@]

Categories

(MailNews Core :: Composition, defect)

defect

Tracking

(thunderbird_esr78 wontfix)

RESOLVED FIXED
89 Branch
Tracking Status
thunderbird_esr78 --- wontfix

People

(Reporter: intermittent-bug-filer, Assigned: benc)

References

Details

(Keywords: crash, intermittent-failure)

Crash Data

Attachments

(2 files)

Because this bug's Severity has not been changed from the default since it was filed, and it's Priority is -- (non,) indicating it has has not been previously triaged, the bug's Severity is being updated to -- (default, untriaged.)

Severity: normal → --
Summary: Intermittent comm/mail/test/browser/composition/browser_replyMultipartCharset.js | application crashed [@ mozalloc_handle_oom(unsigned int)] → Intermittent comm/mail/test/browser/composition/browser_*.js | application crashed [@ mozalloc_handle_oom(unsigned int)]

TreeHerder didn't like my title change. :-(

Summary: Intermittent comm/mail/test/browser/composition/browser_*.js | application crashed [@ mozalloc_handle_oom(unsigned int)] → Intermittent comm/mail/test/browser/composition/browser_replyMultipartCharset.js, browser_sendFormat.js, browser_saveChangesOnQuit.js | application crashed [@ mozalloc_handle_oom(unsigned int)]

We made it further today before running out of memory. I guess that's progress.

Summary: Intermittent comm/mail/test/browser/composition/browser_replyMultipartCharset.js, browser_sendFormat.js, browser_saveChangesOnQuit.js | application crashed [@ mozalloc_handle_oom(unsigned int)] → Intermittent comm/mail/test/browser/composition/browser_replyMultipartCharset.js, browser_saveChangesOnQuit.js, browser_sendFormat.js, browser_signatureUpdating.js | application crashed [@ mozalloc_handle_oom(unsigned int)]
Summary: Intermittent comm/mail/test/browser/composition/browser_replyMultipartCharset.js, browser_saveChangesOnQuit.js, browser_sendFormat.js, browser_signatureUpdating.js | application crashed [@ mozalloc_handle_oom(unsigned int)] → Intermittent comm/mail/test/browser/composition/browser_replyAddresses.js, browser_replyMultipartCharset.js, browser_saveChangesOnQuit.js, browser_sendFormat.js, browser_signatureUpdating.js | application crashed [@ mozalloc_handle_oom(unsigned int)]

My notes on this so far:

I think there's something a little odd about the way mochitest drives Thunderbird - it always causes some extra leaks to be reported (via the XPCOM_MEM_LEAK_LOG mechanisms) at the end of a test, even a stubbed-out do-nothing test. I'm assuming these leaks are related to whatever's causing the failures we're seeing here.

It's not clear why it's these mail/test/browser/composition/ tests which are crossing the memory-use threshold and failing. Maybe the composition code is just more susceptable/memory-hungry, causing us to hit the memory limit on the CI machine. Or maybe it's just got some nasty reference cycles holding objects in existance too long...
The other suspicious thing is that the composition tests are run twice - once for the cpp and one for jsm. I'm assuming this is to cover both the old C++ sending code and the new JS-based version? The timing seems a little suspect - I think the JS side landed around the same time this issue started. Or could just be coincidental.

In any case, the logged leaks don't seem to occur outside mochitest, which makes it a little trickier to examine.

I've done a DMD-enabled build, and mach mochitest helpfully supplies a --dump-dmd-after-test option which looks super useful... but unfortunately the dumping crashes:

 0:37.17 GECKO(333022) Assertion failure: gcTotal == rtStats.gcHeapChunkTotal, at /home/ben/tb/mozilla/js/xpconnect/src/XPCJSRuntime.cpp:2018

Which kind of implies the memory tracking is really screwed up here...

some possible next steps I'm mulling:

  • figure out what's wrong that causes that DMD assert
  • do DMD dumps on running tests (If I can get them working) to try and spot leaks
  • dig deeper using the XPCOM_MEM_xxxx_LOG settings to try and pick apart the reported leaks.
  • work out exactly how mochitest drives the app, to figure out why it seems to exit without cleaning up properly (and leaving objects to leak which don't when you run TB directly).
  • try and replicate the tests "manually", without mochitest, and running regular DMD dumps.
  • try and replicate the failure more properly by limited memory available during the test (using ulimit or a virtual machine or something)...
Summary: Intermittent comm/mail/test/browser/composition/browser_replyAddresses.js, browser_replyMultipartCharset.js, browser_saveChangesOnQuit.js, browser_sendFormat.js, browser_signatureUpdating.js | application crashed [@ mozalloc_handle_oom(unsigned int)] → thunderbird debug perma fail: comm/mail/test/browser/composition/browser_replyAddresses.js, browser_replyMultipartCharset.js, browser_saveChangesOnQuit.js, browser_sendFormat.js, browser_signatureUpdating.js | application crashed [@]

Just some more notes:
Watching the memory usage on one of the tests (browser_sendFormat.js) as it runs under both the JS (MessageSend.jsm) and C++ (msgSend.cpp) paths, I don't see any obvious runaway memory usage that would cause the OOM error on the CI run.

I've tried running builds with various memory/leak tracking instrumentation - ASAN/LSAN, DMD and nsTraceRefCnt, and I can never get a satisfying list of leaking things. There are definitely conditions where things leak, but it's hard to tell if that's happening in normal operation or a side effect of something more catastrophic. I tried nailing down some manual steps to trigger a composition leak without having to run a mochitest, but hit some bigger crash (written up in Bug 1693435). And I've seen that same crash in of the mochitest runs here.

Bug 1657025 also seems relevant (issues with the DMD reporting).

Run the whole folder of tests, not just one.

Summary: thunderbird debug perma fail: comm/mail/test/browser/composition/browser_replyAddresses.js, browser_replyMultipartCharset.js, browser_saveChangesOnQuit.js, browser_sendFormat.js, browser_signatureUpdating.js | application crashed [@] → thunderbird win7 perma fail: comm/mail/test/browser/composition/browser_replyAddresses.js, browser_replyMultipartCharset.js, browser_saveChangesOnQuit.js, browser_sendFormat.js, browser_signatureUpdating.js | application crashed [@]
See Also: → 1693435
Assignee: nobody → benc

(I've found a definite memory leak here, which I'll outline in next comment. But first, some notes!)

Steps that I think replicate the memory leak in a non-mochitest situation:

Setup:

  1. do a debug build
  2. set up a profile with a local folder (I just set up an IMAP account, then deleted it after step 4, but it doesn't matter either way)
  3. create a subfolder under "local folders"
  4. copy a single message into that subfolder
  5. quit

To see the memory leak:

  1. run with XPCOM_MEM_LEAK_LOG set (to a filename, or to "1" for stdout), eg:
    env XPCOM_MEM_LEAK_LOG=1 ./mach run
  2. right click on the single message in the subfolder and click "edit message as new".
  3. close the two windows (composition and main) to quit
  4. see the leak log

It looks roughly like there's a localmailfolder + nsMsgDatabase leaking (along with a bunch of other stuff.
Without opening the composition window (step 2), the leak doesn't occur.
(So, given that the memory-leaching mochitests here all involve the composition window, this seems a likely contributor).

At this point you'd then go on to capture addref/release logging for specific classes you're interested (by setting other XPCOM_MEM_* variables), search for mismatched refcounts, then use the stackdumps to see where the mismatches occurred.
But there are currently a few issues that make this tricky. First one is in Bug 1701105 (can be hacked around somewhat by fiddling with to xpcom/base/nsTraceRefcnt.cpp).
The other issue is that I think the log filtering gets a little confused by derived classes (eg AddRef()ing a ptr in a base class, then Release()ing it in the derived class). So you can end up seeing a Release() with no corresponding AddRef(), say. The tools don't seem to like apparently negative refcounts :-) I need to investigate a little more here.

Anyway. After lots of workarounds and fiddling, I've found a definite leak:

nsMsgDBThreadEnumerator adds itself as a listener of the nsMsgDatabase it's iterating through. So, if that database is killed in mid-run, the iterator can stop rather than just crash. The enumerator unregisters it itself from the database in it's dtor (~nsMsgDBThreadEnumerator()).
However, because the database holds all it's listeners in nsCOMPtr<>s, the enumerator is held open and ~nsMsgDBThreadEnumerator() never gets called. So the enumerator and db hold each other in existence, along with all their hangers-on (including the folder!).

I'm not totally clear why opening the composition window causes a thread enumeration... but it seems to, going by the STR in comment 37. In any case, it's a real issue and needs fixing.

The fix I'm working on is to have the enumerator use a WeakPtr to reference the DB, and get rid of the Listener registration, so hope to have a fix ready soon.

Just a housekeeping patch first, to move some enumerator implementations out of the nsMsgDatabase files (the concrete enumerator types are still private to nsMsgDatabase, hence placing nsMsgDatabaseEnumerator.h in src/ rather than out in public/).

Status: NEW → ASSIGNED

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/1ca6fe46b3c7
Move nsMsgDatabase enumerator classes into their own file. r=mkmelin

Summary: thunderbird win7 perma fail: comm/mail/test/browser/composition/browser_replyAddresses.js, browser_replyMultipartCharset.js, browser_saveChangesOnQuit.js, browser_sendFormat.js, browser_signatureUpdating.js | application crashed [@] → tb win7 perma fail: comm/mail/test/browser/composition/browser_replyAddresses.js, browser_replyMultipartCharset.js,browser_saveChangesOnQuit.js,browser_sendFormat.js,browser_signatureUpdating.js,browser_replyFormatFlowed.js | application crashed [@]
Attachment #9213368 - Attachment description: Bug 1631149 - Move nsMsgDatabase enumerator classes into their own file. r=mkmelin → [checked in] Bug 1631149 - Move nsMsgDatabase enumerator classes into their own file. r=mkmelin

Annoyingly this breaks an assert in test_retention.js which checks that a
cached db for the test folder is removed after a garbage collection. I'm not
quite sure of the intention of that assert, so this patch removes it, and the
rest of the test runs fine.

oh - the long hg comment from the phab revision appears here in bugzilla? cool! (re comment #45)

Anyway, this patch isn't the whole story, but I think it should help - it untangles the nsMsgDBThreadEnumerator <-> nsMsgDatabase mutual deathgrip I found.
There are still a bunch of corner-cases to work through. Mainly making sure the enumerators release what they can upon completion, and getting them to clean up properly after errors.

Try run here to try and see if it helps the memory usage... but I don't think it hits the affected windows platforms/configs... is there a list of try platform targets anywhere?

https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=a15877fa6ad4c63553186005d1f98590dd675325

Attachment #9215020 - Attachment description: Bug 1631149 - Untangle mutual nsMsgDBThreadEnumerator <-> nsMsgDatabase refcount lock. r=mkmelin → Bug 1631149 - Untangle mutual nsMsgDatabase <-> enumerator refcount lock. r=mkmelin
Target Milestone: --- → 89 Branch

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/57534f0028f7
Untangle mutual nsMsgDatabase <-> enumerator refcount lock. r=mkmelin

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

I don't know the answer to this question because I'm new on the scene to this bug but it looks like the main guts of this have just landed, so I'm asking this non-hypothetical - how solid is this considering we branch the next beta on Monday? (We don't want to find a serious problem on beta.)

And on a lighter note - is this likely to help bug 805023 and friends?

Flags: needinfo?(benc)

I'm rather biased as I've been wading through this stuff and know how brittle it all is... but I think this fix is a big improvement. I'm kind of amazed that the older code didn't cause more serious issues.
So I'm happy that the approach I used is solid, and a good step forward (but realise that saying that out loud means that some horrible flaw in my implementation will now surface at the most embarrassing moment :- )

Flags: needinfo?(benc)
See Also: → 1705898
Regressions: 1710768
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: