tb win7 perma fail: comm/mail/test/browser/composition/browser_replyAddresses.js, browser_replyMultipartCharset.js,browser_saveChangesOnQuit.js,browser_sendFormat.js,browser_signatureUpdating.js,browser_replyFormatFlowed.js | application crashed [@]
Categories
(MailNews Core :: Composition, defect)
Tracking
(thunderbird_esr78 wontfix)
Tracking | Status | |
---|---|---|
thunderbird_esr78 | --- | wontfix |
People
(Reporter: intermittent-bug-filer, Assigned: benc)
References
Details
(Keywords: crash, intermittent-failure)
Crash Data
Attachments
(2 files)
Filed by: geoff [at] darktrojan.net
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=298132095&repo=comm-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/BDtR2RxEQ16vjYVZx8lnjA/runs/0/artifacts/public/logs/live_backing.log
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 4•4 years ago
|
||
Because this bug's Severity has not been changed from the default since it was filed, and it's Priority is --
(non,) indicating it has has not been previously triaged, the bug's Severity is being updated to --
(default, untriaged.)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•4 years ago
|
Comment 7•4 years ago
|
||
TreeHerder didn't like my title change. :-(
Comment hidden (Intermittent Failures Robot) |
Comment 9•4 years ago
|
||
We made it further today before running out of memory. I guess that's progress.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•3 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 27•3 years ago
|
||
My notes on this so far:
I think there's something a little odd about the way mochitest drives Thunderbird - it always causes some extra leaks to be reported (via the XPCOM_MEM_LEAK_LOG mechanisms) at the end of a test, even a stubbed-out do-nothing test. I'm assuming these leaks are related to whatever's causing the failures we're seeing here.
It's not clear why it's these mail/test/browser/composition/
tests which are crossing the memory-use threshold and failing. Maybe the composition code is just more susceptable/memory-hungry, causing us to hit the memory limit on the CI machine. Or maybe it's just got some nasty reference cycles holding objects in existance too long...
The other suspicious thing is that the composition tests are run twice - once for the cpp and one for jsm. I'm assuming this is to cover both the old C++ sending code and the new JS-based version? The timing seems a little suspect - I think the JS side landed around the same time this issue started. Or could just be coincidental.
In any case, the logged leaks don't seem to occur outside mochitest, which makes it a little trickier to examine.
I've done a DMD-enabled build, and mach mochitest
helpfully supplies a --dump-dmd-after-test
option which looks super useful... but unfortunately the dumping crashes:
0:37.17 GECKO(333022) Assertion failure: gcTotal == rtStats.gcHeapChunkTotal, at /home/ben/tb/mozilla/js/xpconnect/src/XPCJSRuntime.cpp:2018
Which kind of implies the memory tracking is really screwed up here...
some possible next steps I'm mulling:
- figure out what's wrong that causes that DMD assert
- do DMD dumps on running tests (If I can get them working) to try and spot leaks
- dig deeper using the XPCOM_MEM_xxxx_LOG settings to try and pick apart the reported leaks.
- work out exactly how mochitest drives the app, to figure out why it seems to exit without cleaning up properly (and leaving objects to leak which don't when you run TB directly).
- try and replicate the tests "manually", without mochitest, and running regular DMD dumps.
- try and replicate the failure more properly by limited memory available during the test (using ulimit or a virtual machine or something)...
Updated•3 years ago
|
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 29•3 years ago
|
||
Just some more notes:
Watching the memory usage on one of the tests (browser_sendFormat.js) as it runs under both the JS (MessageSend.jsm) and C++ (msgSend.cpp) paths, I don't see any obvious runaway memory usage that would cause the OOM error on the CI run.
I've tried running builds with various memory/leak tracking instrumentation - ASAN/LSAN, DMD and nsTraceRefCnt, and I can never get a satisfying list of leaking things. There are definitely conditions where things leak, but it's hard to tell if that's happening in normal operation or a side effect of something more catastrophic. I tried nailing down some manual steps to trigger a composition leak without having to run a mochitest, but hit some bigger crash (written up in Bug 1693435). And I've seen that same crash in of the mochitest runs here.
Bug 1657025 also seems relevant (issues with the DMD reporting).
Comment 30•3 years ago
|
||
Run the whole folder of tests, not just one.
Comment hidden (Intermittent Failures Robot) |
Updated•3 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 37•3 years ago
|
||
(I've found a definite memory leak here, which I'll outline in next comment. But first, some notes!)
Steps that I think replicate the memory leak in a non-mochitest situation:
Setup:
- do a debug build
- set up a profile with a local folder (I just set up an IMAP account, then deleted it after step 4, but it doesn't matter either way)
- create a subfolder under "local folders"
- copy a single message into that subfolder
- quit
To see the memory leak:
- run with XPCOM_MEM_LEAK_LOG set (to a filename, or to "1" for stdout), eg:
env XPCOM_MEM_LEAK_LOG=1 ./mach run
- right click on the single message in the subfolder and click "edit message as new".
- close the two windows (composition and main) to quit
- see the leak log
It looks roughly like there's a localmailfolder + nsMsgDatabase leaking (along with a bunch of other stuff.
Without opening the composition window (step 2), the leak doesn't occur.
(So, given that the memory-leaching mochitests here all involve the composition window, this seems a likely contributor).
At this point you'd then go on to capture addref/release logging for specific classes you're interested (by setting other XPCOM_MEM_* variables), search for mismatched refcounts, then use the stackdumps to see where the mismatches occurred.
But there are currently a few issues that make this tricky. First one is in Bug 1701105 (can be hacked around somewhat by fiddling with to xpcom/base/nsTraceRefcnt.cpp
).
The other issue is that I think the log filtering gets a little confused by derived classes (eg AddRef()ing a ptr in a base class, then Release()ing it in the derived class). So you can end up seeing a Release() with no corresponding AddRef(), say. The tools don't seem to like apparently negative refcounts :-) I need to investigate a little more here.
Assignee | ||
Comment 38•3 years ago
•
|
||
Anyway. After lots of workarounds and fiddling, I've found a definite leak:
nsMsgDBThreadEnumerator adds itself as a listener of the nsMsgDatabase it's iterating through. So, if that database is killed in mid-run, the iterator can stop rather than just crash. The enumerator unregisters it itself from the database in it's dtor (~nsMsgDBThreadEnumerator()).
However, because the database holds all it's listeners in nsCOMPtr<>s, the enumerator is held open and ~nsMsgDBThreadEnumerator() never gets called. So the enumerator and db hold each other in existence, along with all their hangers-on (including the folder!).
I'm not totally clear why opening the composition window causes a thread enumeration... but it seems to, going by the STR in comment 37. In any case, it's a real issue and needs fixing.
The fix I'm working on is to have the enumerator use a WeakPtr to reference the DB, and get rid of the Listener registration, so hope to have a fix ready soon.
Assignee | ||
Comment 39•3 years ago
|
||
Assignee | ||
Comment 40•3 years ago
|
||
Just a housekeeping patch first, to move some enumerator implementations out of the nsMsgDatabase
files (the concrete enumerator types are still private to nsMsgDatabase
, hence placing nsMsgDatabaseEnumerator.h
in src/
rather than out in public/
).
Updated•3 years ago
|
Updated•3 years ago
|
Comment 41•3 years ago
|
||
Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/1ca6fe46b3c7
Move nsMsgDatabase enumerator classes into their own file. r=mkmelin
Comment hidden (Intermittent Failures Robot) |
Updated•3 years ago
|
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 45•3 years ago
|
||
Annoyingly this breaks an assert in test_retention.js which checks that a
cached db for the test folder is removed after a garbage collection. I'm not
quite sure of the intention of that assert, so this patch removes it, and the
rest of the test runs fine.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 47•3 years ago
|
||
oh - the long hg comment from the phab revision appears here in bugzilla? cool! (re comment #45)
Anyway, this patch isn't the whole story, but I think it should help - it untangles the nsMsgDBThreadEnumerator <-> nsMsgDatabase mutual deathgrip I found.
There are still a bunch of corner-cases to work through. Mainly making sure the enumerators release what they can upon completion, and getting them to clean up properly after errors.
Try run here to try and see if it helps the memory usage... but I don't think it hits the affected windows platforms/configs... is there a list of try platform targets anywhere?
Updated•3 years ago
|
Updated•3 years ago
|
Comment 48•3 years ago
|
||
Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/57534f0028f7
Untangle mutual nsMsgDatabase <-> enumerator refcount lock. r=mkmelin
Comment 49•3 years ago
|
||
I don't know the answer to this question because I'm new on the scene to this bug but it looks like the main guts of this have just landed, so I'm asking this non-hypothetical - how solid is this considering we branch the next beta on Monday? (We don't want to find a serious problem on beta.)
And on a lighter note - is this likely to help bug 805023 and friends?
Assignee | ||
Comment 50•3 years ago
|
||
I'm rather biased as I've been wading through this stuff and know how brittle it all is... but I think this fix is a big improvement. I'm kind of amazed that the older code didn't cause more serious issues.
So I'm happy that the approach I used is solid, and a good step forward (but realise that saying that out loud means that some horrible flaw in my implementation will now surface at the most embarrassing moment :- )
Comment hidden (Intermittent Failures Robot) |
Description
•