Last Comment Bug 782899 - Daily hangs and leaks memory like crazy when trying to delete messages in the results of a filter search
: Daily hangs and leaks memory like crazy when trying to delete messages in the...
Status: RESOLVED FIXED
[gs][regression:TB15][gssolved]
: hang, mlk, perf, regression
Product: Thunderbird
Classification: Client Software
Component: Folder and Message Lists (show other bugs)
: Trunk
: x86 All
: -- major (vote)
: Thunderbird 18.0
Assigned To: :Irving Reid (No longer working on Firefox)
:
Mentors:
: 753502 787121 787321 787348 788541 789019 789722 790755 (view as bug list)
Depends on: 790204
Blocks: cpg 784286
  Show dependency treegraph
 
Reported: 2012-08-15 02:03 PDT by Dave Miller [:justdave] (justdave@bugzilla.org)
Modified: 2012-10-09 23:54 PDT (History)
25 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
+
fixed
+
fixed
+
fixed


Attachments
Process sample via Activity Manager while hung (171.85 KB, text/plain)
2012-08-15 02:03 PDT, Dave Miller [:justdave] (justdave@bugzilla.org)
no flags Details
testcase (278.90 KB, message/rfc822)
2012-08-23 13:17 PDT, Ludovic Hirlimann [:Usul]
no flags Details
Don't accumulate the entire message body in MIME emitter when creating message summaries (3.07 KB, patch)
2012-08-28 13:53 PDT, :Irving Reid (No longer working on Firefox)
bugmail: review+
standard8: approval‑comm‑aurora+
standard8: approval‑comm‑beta+
standard8: approval‑comm‑release+
Details | Diff | Review

Description Dave Miller [:justdave] (justdave@bugzilla.org) 2012-08-15 02:03:30 PDT
Created attachment 652043 [details]
Process sample via Activity Manager while hung

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:17.0) Gecko/17.0 Thunderbird/17.0a1
BuildID: 20120814030529

For the last few weeks, typing a string in the filter box at the top of a folder message list pane, and the selecting multiple items in the results and attempting to delete them results in Daily immediately consuming multiple GB of RAM and hanging until it's force-quit.
Comment 1 Ludovic Hirlimann [:Usul] 2012-08-15 02:11:14 PDT
Confirming,; just having multiple selection kills the ponies.
Comment 2 Ludovic Hirlimann [:Usul] 2012-08-15 03:01:57 PDT
Dave is mail.operate_on_msgs_in_collapsed_threads set to true ?
Comment 3 Wayne Mery (:wsmwk, NI for questions) 2012-08-15 04:59:23 PDT
perhaps same as bug 753502
Comment 4 Dave Miller [:justdave] (justdave@bugzilla.org) 2012-08-15 20:44:20 PDT
(In reply to Ludovic Hirlimann [:Usul] from comment #2)
> Dave is mail.operate_on_msgs_in_collapsed_threads set to true ?

It is set to the default value, which is true, yes,  I also have threading disabled in most of my folders (including the ones I've run into this in) in case it matters)
Comment 5 Ludovic Hirlimann [:Usul] 2012-08-15 22:41:34 PDT
Dave do these emails have attachments ?
Comment 6 Dave Miller [:justdave] (justdave@bugzilla.org) 2012-08-16 04:44:05 PDT
They are "page changed" notifications from Confluence, which are multipart/html with embedded pics and css, so yes.

I've also discovered that simply opening one of these can cause the RAM usage explosion.
Comment 7 Wayne Mery (:wsmwk, NI for questions) 2012-08-16 06:51:54 PDT
(In reply to Dave Miller [:justdave] from comment #6)
> They are "page changed" notifications from Confluence, which are
> multipart/html with embedded pics and css, so yes.
> 
> I've also discovered that simply opening one of these can cause the RAM
> usage explosion.

how much memory per message?
regression?
Comment 8 Dave Miller [:justdave] (justdave@bugzilla.org) 2012-08-16 08:09:51 PDT
(In reply to Wayne Mery (:wsmwk) from comment #7)
> (In reply to Dave Miller [:justdave] from comment #6)
> > They are "page changed" notifications from Confluence, which are
> > multipart/html with embedded pics and css, so yes.
> > 
> > I've also discovered that simply opening one of these can cause the RAM
> > usage explosion.
> 
> how much memory per message?

It usually immediately jumps to around 3.5 GB to 4.0 GB physical in use, and then starts swapping.  It continually grabs more memory once it starts, until I force quit.

> regression?

Yes.  I could read these without getting this problem a few weeks ago.
Comment 9 Wayne Mery (:wsmwk, NI for questions) 2012-08-16 08:27:27 PDT
the oldest regression report I find is Bug 759092 - though I'm not convinced it's the same issue.

justdave, could you hunt the regression range?
Bug 759092 cites failure using 2012-05-25 build
Comment 10 Mike Conley (:mconley) - (needinfo me!) 2012-08-23 11:57:28 PDT
Profiling this shows us spending an inordinate amount of time in jsmimeemitter.js, in the writeBody function:

http://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/components/jsmimeemitter.js#455

Profile posted here:

http://people.mozilla.com/~bgirard/cleopatra/?report=ace22cfa16f76a49a787d5d705c5474c68e72633

Cc'ing asuth and protz because this is Gloda stuff.
Comment 11 Joshua Cranmer [:jcranmer] 2012-08-23 12:09:48 PDT
This is the function in question:

455   writeBody: function mime_emitter_writeBody(aBuf, aSize, aOutAmountWritten) {
456     if (this._writeBody &&
457         (!this._saneBodySize ||
458          this._curPart.body.length < MAX_SANE_BODY_PART_SIZE))
459       this._curPart.body += aBuf;
460   },

If memory is leaking like crazy, then probably this accumulation is where it's going. Likely that means that either the sane body part size isn't being checked, or there are a lot of not-quite-max-size parts getting added.

It sounds like there's a specific message (or set of messages) that's causing this issue; can this be posted?
Comment 12 Jonathan Protzenko [:protz] 2012-08-23 13:01:23 PDT
Shot in the dark: it may be a message with an insane amount of parts, or many such messages.
Comment 13 Ludovic Hirlimann [:Usul] 2012-08-23 13:17:15 PDT
Created attachment 654764 [details]
testcase

(In reply to Jonathan Protzenko [:protz] from comment #12)
> Shot in the dark: it may be a message with an insane amount of parts, or
> many such messages.

NO not more than 3. Message coming up. Can't save it from tb so you'll get the untouched message eg without the mbox headers.
Comment 14 Joshua Cranmer [:jcranmer] 2012-08-23 14:43:21 PDT
(In reply to Ludovic Hirlimann [:Usul] from comment #13)
> Created attachment 654764 [details]
> testcase
> 
> (In reply to Jonathan Protzenko [:protz] from comment #12)
> > Shot in the dark: it may be a message with an insane amount of parts, or
> > many such messages.
> 
> NO not more than 3. Message coming up. Can't save it from tb so you'll get
> the untouched message eg without the mbox headers.

52 parts. It looks like it's basically of the form:

multipart/mixed
  multipart/alternative
    text/msg
    multipart/related
      text/html
      ... Several cid: images ...
  application/pdf

Each image averages about 1/3KB of de-base64'd size. Oh, and the images are all Content-Disposition: attachment too, fwiw.
Comment 15 Andrew Sutherland [:asuth] 2012-08-23 17:22:35 PDT
I see MultiMessageSummary in the profile; not a lot, but a little.  I assume this is a case where MultiMessageSummary is either overwhelmed by what it is streaming to provide a useful summary and/or it is making things worse by re-computing things.  I think the summaries have some logic to wait for quiescence based on a timer; if the system has become overwhelmed, it is conceivable that things appear to quiesce to pour some gasoline on the fire.  It's been long enough that I don't recall the exact event sequence for the summaries so I don't know how possible that is.

I doubt gloda's indexer is involved because all types of deletion should appear to gloda as either a fast-pathed move or an actual deletion.  (gloda does show up in the profile a teeny bit, but that could just be background-ness.)
Comment 16 Andrew Sutherland [:asuth] 2012-08-23 17:25:23 PDT
It's worth noting that because of the way XPConnect works, even if the emitter has stopped concatenating the strings because saturation has been reached, all those calls still happen and those tiny strings will still generate tons of garbage.  And I think in the case of HTML, those calls may not be lined-based, but tag-based, in which case the garbage generation could be a ridiculous multiplier since HTML e-mails can include a ridiculous number of tags.  I suspect Joshua may know more about that (and his MIME parser may not have this pathological problem).
Comment 17 Ludovic Hirlimann [:Usul] 2012-08-23 19:54:06 PDT
FWIT before I came back from vacation , I could read these emails without any issues.
Comment 18 :Irving Reid (No longer working on Firefox) 2012-08-24 06:02:16 PDT
I have this reproduced, so I'll dig in with debugger and tracing. If someone else can find a regression window by testing old Daily builds that would be great.
Comment 19 :Irving Reid (No longer working on Firefox) 2012-08-27 07:15:54 PDT
Oh now this is interesting (and annoying...) The Aug 1 nightly build (https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012-08-01-04-20-04-comm-aurora/) beachballs and consumes a big chunk of CPU when I select several messages in a folder including the test message, but it *doesn't* consume gigabytes of memory - whereas trunk as of last Thursday eats both CPU and memory.

That means we get to bisect two problems - the CPU load before Aug 1 and the memory bloat after.
Comment 20 Andrew Sutherland [:asuth] 2012-08-27 08:20:13 PDT
Could be massive garbage generation in both cases, but the GC is just running more frequently (and/or running full cycle collecting GC's more frequently), so we don't actually bloat as much.  The profiler should label GC's, so if that's the case now, it should be pretty obvious pretty fast.  (Based on the previous profile, I would expect the I/O should still be very obvious.)

It's probably also worth checking with gdb that the I/O hasn't stopped being buffered or something like that. Specifically, check the length of the strings to make sure they are not 1 character long...
Comment 21 :Irving Reid (No longer working on Firefox) 2012-08-27 09:25:02 PDT
Parts of the problem seem to have come and gone. The basic regression window is:

OK:  https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012/05/2012-05-14-05-31-54-comm-aurora/

FAILED:  https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012/05/2012-05-15-03-05-24-comm-central/

so the problem seems to have been introduced May 14. The May 15 build has both CPU load and memory bloat; the May 30 build has CPU load but not memory bloat, and the memory reappears some time after Aug 1.
Comment 22 :Irving Reid (No longer working on Firefox) 2012-08-27 09:45:56 PDT
Bah, I messed up and mixed Aurora and Nightly builds in my bisection. The problem is earlier than May 14 in nightlies. Bisect bisect bisect.
Comment 23 :Irving Reid (No longer working on Firefox) 2012-08-27 18:42:37 PDT
OK, this time for sure ;->

Good nightly: https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012/05/2012-05-03-03-16-32-comm-central/thunderbird-15.0a1.en-US.mac.txt

Bad nightly: https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012/05/2012-05-04-03-36-16-comm-central/thunderbird-15.0a1.en-US.mac.txt

Looking at the changes, there's nothing obvious in comm-central; unfortunately I can't get a successful build of either the good or bad revision to test/bisect further.

I haven't read all the mozilla-central changes during that time frame yet.
Comment 24 David :Bienvenu 2012-08-27 21:10:43 PDT
I've always suspected the compartment changes in moz-central that landed around that time.
Comment 25 :Irving Reid (No longer working on Firefox) 2012-08-28 06:05:13 PDT
David, good catch - compartment-per-global for XPConnect landed that day. If I recall correctly, mconley's profiler runs showed a bunch of time spent in the cross-compartment JS call path.
Comment 26 :Irving Reid (No longer working on Firefox) 2012-08-28 12:40:39 PDT
OK, so it's a combination of things:

1) the message summary generator uses Gloda's mime structure builder to parse content out of messages, in order to get the first couple of lines to summarize: https://mxr.mozilla.org/comm-central/source/mail/base/content/selectionsummaries.js#585

2) libmime splits messages into text lines and calls the mime "emitter" with the message content line-by-line

3) Gloda's builder reassembles that content into a single string for plaintext and html parts: https://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/components/jsmimeemitter.js#455

Unfortunately the string building at step 3 happens in a way that interacts terribly with compartment-per-global; for some reason the string assignment is happening across a compartment boundary and among other things this forces the JS string to be flattened, so there is insane amounts of memory allocation going on inside a tight loop that doesn't have any GC pauses - so we both burn huge amounts of CPU copying strings, and we blow out the heap by keeping all the copies.

The quick fix is to make selectionsummaries.js pass the "saneBodySize:true" flag to MsgHdrToMimeMessage() - this stops the gloda builder from reassembling the entire string.

This still leaves behind a cruel trap for anyone else who uses the Gloda mime engine, we need to fix the underlying problem as well.
Comment 27 Luke Wagner [:luke] 2012-08-28 12:56:44 PDT
(In reply to Irving Reid (:irving) from comment #26)
Any number of things can cause a string to be flattened so, in general, code should not rely on strings not being flattened.  However, part of comment 26 sounded worrying so I'd like to ask for a clarification: are all these huge strings live (i.e., they are in some array or other object structure that is reachable) or should they be garbage that, for some reason, isn't being collected?  If it is the latter, this may be a GC scheduling bug, independent but exacerbated by bug 650353.
Comment 28 Andrew Sutherland [:asuth] 2012-08-28 12:59:38 PDT
Ah, the flattening... good investigatory find!  As a simple horribleness-reducer, we could change to just putting the strings in a list and calling join at the end.
Comment 29 Joshua Cranmer [:jcranmer] 2012-08-28 13:19:36 PDT
(In reply to Irving Reid (:irving) from comment #26)
> Unfortunately the string building at step 3 happens in a way that interacts
> terribly with compartment-per-global; for some reason the string assignment
> is happening across a compartment boundary and among other things this
> forces the JS string to be flattened, so there is insane amounts of memory
> allocation going on inside a tight loop that doesn't have any GC pauses - so
> we both burn huge amounts of CPU copying strings, and we blow out the heap
> by keeping all the copies.

This may be caused by how the gloda mimemsg.js architecture works. mimemsg.js is a JS module that is imported by Cu.import, while the code that appends body parts is a JS component called back by libmime. The this._curPart is synthesized from a function in mimemsg.js, so it generates a cross-compartment wrapper, so most of our manipulation from the JS emitter is going through cross-compartment calls (and xpconnect!).
Comment 30 :Irving Reid (No longer working on Firefox) 2012-08-28 13:23:57 PDT
Luke: I agree that we don't want to rely on flattener and GC behaviour; in fact, while the CPU burn on this bug has been constant since CPG landed the memory bloat has come and gone a few times so it could very well be related to GC scheduling.

Step one of my preferred approach would be to figure out a way to have all the objects in the mime message assembler be inside the same compartment so that we don't need to proxy any of the calls; then we can look at fixing the inefficient way we assemble the big string.
Comment 31 :Irving Reid (No longer working on Firefox) 2012-08-28 13:53:33 PDT
Created attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

Passing this option prevents the gloda MIME emitter from building extra-large strings containing MIME part contents, which reduces the impact of string copying and CPG enough to get by.

asuth: Should we also pass partsOnDemand:true to avoid further downloading, or does aAllowDownload=false already avoid problems with getting MIME parts from the server?

callek: mxr didn't find any other non-test calls to MsgHdrToMimeMessage(), but someone should check SM to make sure it doesn't have a similar problem if it uses the gloda MIME emitter.
Comment 32 Justin Wood (:Callek) 2012-08-28 14:05:15 PDT
Comment on attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

redirecting to Neil for SeaMonkey "OMG will this break us, or do we need to port" Q, since I know very little about this part of our code.
Comment 33 Mike Conley (:mconley) - (needinfo me!) 2012-08-28 14:10:19 PDT
FWIW, I just tried this patch on a collection of messages that the TB generally chokes on when multi-selecting.

There is a significant improvement to performance - it's still a little sticky, but it certainly doesn't send TB into a tailspin like it did before.

Great detective work, Irving!
Comment 34 neil@parkwaycc.co.uk 2012-08-28 14:20:23 PDT
Comment on attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

I'm looking at newsblogOverlay.js because I think that's the only shared code (we don't have the code that summarises message selections). Is there no way to get just the headers, or better still, just the content-base header from the message, and no body at all?
Comment 35 Jonathan Protzenko [:protz] 2012-08-28 14:28:58 PDT
> asuth: Should we also pass partsOnDemand:true to avoid further downloading, or
does aAllowDownload=false already avoid problems with getting MIME parts from
the server?

aAllowDownload=false means: "if the message is not available locally (offline sync or message cache), don't download it over the internet, and use the very little information we have from the headers in the .msf file"

partsOnDemand=true means: "if fetching the message remotely (because it isn't fully available locally), and the server is IMAP, and the server supports BODYSTRUCTURE, use it to not download parts too big, but still create a complete Mime structure for the message"

I think aAllowDownload=false takes precedence.
Comment 36 Andrew Sutherland [:asuth] 2012-08-28 15:14:18 PDT
Comment on attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

As protz says, the default (which is now being explicitly set) for allow download (=false) should avoid any network traffic anyways.  I think setting saneBodySize is definitely a good stop-gap.

I don't think there's any way to avoid the JS component being in its own compartment without changes to the way the emitter is instantiated, which leaves us moving all of the representation gunk from mimemsg.js to jsmimeemitter.js and leaving the calls that trigger the streaming in mimemsg.js.

Specifically, gloda.manifest lists the category manage entry: "category mime-emitter @mozilla.org/messenger/mimeemitter;1?type=application/x-js-mime-message @mozilla.org/gloda/jsmimeemitter;1" which is how it gets instantiated  We could have the streaming process take a pre-constructed XPConnect-looking object and have all the MIME helper stuff operate by using the subscript loader inside the same context as whatever wants to consume it.  We would probably want to use a RequireJS-type idiom (with 'use strict' to avoid global pollution).  This would likely want to entail a change to similarly most of the gloda modules in a similar fashion.

Having said that, I think just having the representation classes loaded in the emitter compartment should be sufficient to stop the worst of the pain.  That and maybe encouraging the C++ shim that defines the emitter layer to perform a little more internal batching so the XPConnect wrapping is not quite so over-the-top.
Comment 37 Ludovic Hirlimann [:Usul] 2012-08-30 06:47:53 PDT
*** Bug 753502 has been marked as a duplicate of this bug. ***
Comment 38 Ludovic Hirlimann [:Usul] 2012-08-31 01:17:52 PDT
*** Bug 787121 has been marked as a duplicate of this bug. ***
Comment 39 Wayne Mery (:wsmwk, NI for questions) 2012-08-31 05:50:20 PDT
several potentially matching topics taggged https://getsatisfaction.com/mozilla_messaging/tags/tb15perf

TB15.0.1 coming?? :)
Comment 40 :Irving Reid (No longer working on Firefox) 2012-08-31 07:37:18 PDT
Checked in:

https://hg.mozilla.org/comm-central/rev/4ac6d862637f

but see Bug 784286 to track further work necessary...
Comment 41 :aceman 2012-08-31 08:42:05 PDT
Is this specific to TB17 or can it be in TB15 too?
Here are some bugs like this: bug 787348, bug 787321.
Comment 42 Matt Dorn 2012-08-31 09:19:14 PDT
@aceman: yes, I believe the bug is in TB15 official release.  Here's my experience after upgrading: bug 787121

@:wsmwk -- I agree that a TB 15.0.1 release containing this fix is needed -- I actually had to roll back to TB14, and am happily productive again.
Comment 43 :aceman 2012-09-05 04:27:54 PDT
There are potential reports that may be a dupe of this. Will this patch be backported to some branches?
Comment 44 Mark Banner (:standard8) 2012-09-05 14:29:48 PDT
Comment on attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

[Triage Comment]
Taking this onto beta so we can get some testing.
Comment 46 :aceman 2012-09-06 05:16:29 PDT
Is this going into 15.0.1 please? So we can see if the tons of similar bugs are dupes.
Comment 47 :aceman 2012-09-06 07:15:09 PDT
*** Bug 789019 has been marked as a duplicate of this bug. ***
Comment 48 Mark Banner (:standard8) 2012-09-07 07:57:18 PDT
Comment on attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

[Triage Comment]
The general feedback so far has been that this and bug 784286 are fixing issues, so taking onto the release branch.
Comment 49 Mark Banner (:standard8) 2012-09-07 08:00:06 PDT
Checked in:

http://hg.mozilla.org/releases/comm-release/rev/4c39c77cfe36
Comment 50 Wayne Mery (:wsmwk, NI for questions) 2012-09-08 17:37:54 PDT
*** Bug 789722 has been marked as a duplicate of this bug. ***
Comment 51 :aceman 2012-09-11 00:02:25 PDT
Irving, we get reports in other bugs about this, that in TB15.0.1 the memory (freeze) problem is solved but now not all of the selected messages are summarized in the message pane. Is that a known side effect of the patch?
Comment 52 :aceman 2012-09-11 05:47:35 PDT
*** Bug 788541 has been marked as a duplicate of this bug. ***
Comment 53 :aceman 2012-09-11 06:05:09 PDT
*** Bug 787321 has been marked as a duplicate of this bug. ***
Comment 54 :aceman 2012-09-12 01:15:04 PDT
*** Bug 787348 has been marked as a duplicate of this bug. ***
Comment 55 Andrew Sutherland [:asuth] 2012-09-12 15:22:42 PDT
(In reply to :aceman from comment #51)
> Irving, we get reports in other bugs about this, that in TB15.0.1 the memory
> (freeze) problem is solved but now not all of the selected messages are
> summarized in the message pane. Is that a known side effect of the patch?

There should be no side-effect like that.  But be aware that the summarization logic caps MAX_THREADS to 100 and MAX_MESSAGES to 100, so it could just be people are being more observant and noticing this limitation.
Comment 56 :aceman 2012-09-13 00:10:24 PDT
Are they informed about that? Does the preview pane contain some "further messages were hidden" footnote?
Comment 57 Ludovic Hirlimann [:Usul] 2012-09-13 00:38:18 PDT
(In reply to :aceman from comment #56)
> Are they informed about that? Does the preview pane contain some "further
> messages were hidden" footnote?

yes at the bottom of the summary view
Comment 58 Jeff Gehlhaar 2012-09-13 12:37:56 PDT
In my case, I don't see any notice that not everything was summarized.  For 9 messages, total size is 14MB.
Comment 59 Wayne Mery (:wsmwk, NI for questions) 2012-09-16 06:39:41 PDT
https://getsatisfaction.com/mozilla_messaging/topics/tb15_0_is_horrid_slow marked solved
Comment 60 :aceman 2012-10-09 23:54:54 PDT
*** Bug 790755 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.