Closed Bug 782899 Opened 12 years ago Closed 12 years ago

Daily hangs and leaks memory like crazy when trying to delete messages in the results of a filter search

Categories

(Thunderbird :: Folder and Message Lists, defect)

x86
All
defect
Not set
major

Tracking

(thunderbird15+ fixed, thunderbird16+ fixed, thunderbird17+ fixed)

RESOLVED FIXED
Thunderbird 18.0

People

(Reporter: justdave, Assigned: Irving)

References

(Blocks 1 open bug)

Details

(4 keywords, Whiteboard: [gs][regression:TB15][gssolved])

Attachments

(3 files)

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:17.0) Gecko/17.0 Thunderbird/17.0a1
BuildID: 20120814030529

For the last few weeks, typing a string in the filter box at the top of a folder message list pane, then selecting multiple items in the results and attempting to delete them, results in Daily immediately consuming multiple GB of RAM and hanging until it's force-quit.
Confirming; just having a multiple selection kills the ponies.
Severity: normal → critical
Keywords: hang
Dave, is mail.operate_on_msgs_in_collapsed_threads set to true?
Keywords: mlk
Perhaps the same as bug 753502.
(In reply to Ludovic Hirlimann [:Usul] from comment #2)
> Dave is mail.operate_on_msgs_in_collapsed_threads set to true ?

It is set to the default value, which is true, yes. I also have threading disabled in most of my folders (including the ones I've run into this in), in case it matters.
Severity: critical → normal
Dave, do these emails have attachments?
They are "page changed" notifications from Confluence, which are multipart/html with embedded pics and css, so yes.

I've also discovered that simply opening one of these can cause the RAM usage explosion.
(In reply to Dave Miller [:justdave] from comment #6)
> They are "page changed" notifications from Confluence, which are
> multipart/html with embedded pics and css, so yes.
> 
> I've also discovered that simply opening one of these can cause the RAM
> usage explosion.

how much memory per message?
regression?
(In reply to Wayne Mery (:wsmwk) from comment #7)
> (In reply to Dave Miller [:justdave] from comment #6)
> > They are "page changed" notifications from Confluence, which are
> > multipart/html with embedded pics and css, so yes.
> > 
> > I've also discovered that simply opening one of these can cause the RAM
> > usage explosion.
> 
> how much memory per message?

It usually immediately jumps to around 3.5 GB to 4.0 GB physical in use, and then starts swapping.  It continually grabs more memory once it starts, until I force quit.

> regression?

Yes.  I could read these without getting this problem a few weeks ago.
The oldest regression report I can find is Bug 759092, though I'm not convinced it's the same issue.

justdave, could you hunt down the regression range?
Bug 759092 cites a failure using the 2012-05-25 build.
Severity: normal → major
Profiling this shows us spending an inordinate amount of time in jsmimeemitter.js, in the writeBody function:

http://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/components/jsmimeemitter.js#455

Profile posted here:

http://people.mozilla.com/~bgirard/cleopatra/?report=ace22cfa16f76a49a787d5d705c5474c68e72633

Cc'ing asuth and protz because this is Gloda stuff.
OS: Mac OS X → All
This is the function in question:

455   writeBody: function mime_emitter_writeBody(aBuf, aSize, aOutAmountWritten) {
456     if (this._writeBody &&
457         (!this._saneBodySize ||
458          this._curPart.body.length < MAX_SANE_BODY_PART_SIZE))
459       this._curPart.body += aBuf;
460   },

If memory is leaking like crazy, then probably this accumulation is where it's going. Likely that means that either the sane body part size isn't being checked, or there are a lot of not-quite-max-size parts getting added.

It sounds like there's a specific message (or set of messages) that's causing this issue; can this be posted?
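
A throwaway diagnostic (hypothetical, not part of any patch) to tell those two cases apart would be to log what writeBody actually sees:

// Hypothetical instrumentation for jsmimeemitter.js writeBody, to see whether
// the saneBodySize cap is actually in effect and how large each part grows.
writeBody: function mime_emitter_writeBody(aBuf, aSize, aOutAmountWritten) {
  dump("writeBody: saneBodySize=" + this._saneBodySize +
       " partLen=" + this._curPart.body.length +
       " bufLen=" + aBuf.length + "\n");
  if (this._writeBody &&
      (!this._saneBodySize ||
       this._curPart.body.length < MAX_SANE_BODY_PART_SIZE))
    this._curPart.body += aBuf;
},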
Shot in the dark: it may be a message with an insane amount of parts, or many such messages.
Attached file testcase
(In reply to Jonathan Protzenko [:protz] from comment #12)
> Shot in the dark: it may be a message with an insane amount of parts, or
> many such messages.

No, not more than 3. Message coming up. Can't save it from TB, so you'll get the untouched message, i.e. without the mbox headers.
(In reply to Ludovic Hirlimann [:Usul] from comment #13)
> Created attachment 654764 [details]
> testcase
> 
> (In reply to Jonathan Protzenko [:protz] from comment #12)
> > Shot in the dark: it may be a message with an insane amount of parts, or
> > many such messages.
> 
> NO not more than 3. Message coming up. Can't save it from tb so you'll get
> the untouched message eg without the mbox headers.

52 parts. It looks like it's basically of the form:

multipart/mixed
  multipart/alternative
    text/msg
    multipart/related
      text/html
      ... Several cid: images ...
  application/pdf

Each image averages about 1/3KB of de-base64'd size. Oh, and the images are all Content-Disposition: attachment too, fwiw.
I see MultiMessageSummary in the profile; not a lot, but a little.  I assume this is a case where MultiMessageSummary is either overwhelmed by what it is streaming to provide a useful summary and/or is making things worse by re-computing things.  I think the summaries have some logic to wait for quiescence based on a timer; if the system has become overwhelmed, it is conceivable that things appear to quiesce and then pour some gasoline on the fire.  It's been long enough that I don't recall the exact event sequence for the summaries, so I don't know how plausible that is.

I doubt gloda's indexer is involved because all types of deletion should appear to gloda as either a fast-pathed move or an actual deletion.  (gloda does show up in the profile a teeny bit, but that could just be background-ness.)
It's worth noting that because of the way XPConnect works, even if the emitter has stopped concatenating the strings because saturation has been reached, all those calls still happen and those tiny strings will still generate tons of garbage.  And I think in the case of HTML, those calls may not be line-based, but tag-based, in which case the garbage generation could be a ridiculous multiplier since HTML e-mails can include a ridiculous number of tags.  I suspect Joshua may know more about that (and his MIME parser may not have this pathological problem).
FWIW, before I came back from vacation I could read these emails without any issues.
I have this reproduced, so I'll dig in with debugger and tracing. If someone else can find a regression window by testing old Daily builds that would be great.
Assignee: nobody → irving
Status: NEW → ASSIGNED
Oh now this is interesting (and annoying...) The Aug 1 nightly build (https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012-08-01-04-20-04-comm-aurora/) beachballs and consumes a big chunk of CPU when I select several messages in a folder including the test message, but it *doesn't* consume gigabytes of memory - whereas trunk as of last Thursday eats both CPU and memory.

That means we get to bisect two problems - the CPU load before Aug 1 and the memory bloat after.
Could be massive garbage generation in both cases, but the GC is just running more frequently (and/or running full cycle-collecting GCs more frequently), so we don't actually bloat as much.  The profiler should label GCs, so if that's the case now, it should be pretty obvious pretty fast.  (Based on the previous profile, I would expect the I/O should still be very obvious.)

It's probably also worth checking with gdb that the I/O hasn't stopped being buffered or something like that. Specifically, check the length of the strings to make sure they are not 1 character long...
Parts of the problem seem to have come and gone. The basic regression window is:

OK:  https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012/05/2012-05-14-05-31-54-comm-aurora/

FAILED:  https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012/05/2012-05-15-03-05-24-comm-central/

So the problem seems to have been introduced on May 14. The May 15 build has both the CPU load and the memory bloat; the May 30 build has the CPU load but not the memory bloat, and the memory bloat reappears some time after Aug 1.
Bah, I messed up and mixed Aurora and Nightly builds in my bisection. The problem is earlier than May 14 in nightlies. Bisect bisect bisect.
OK, this time for sure ;->

Good nightly: https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012/05/2012-05-03-03-16-32-comm-central/thunderbird-15.0a1.en-US.mac.txt

Bad nightly: https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012/05/2012-05-04-03-36-16-comm-central/thunderbird-15.0a1.en-US.mac.txt

Looking at the changes, there's nothing obvious in comm-central; unfortunately I can't get a successful build of either the good or bad revision to test/bisect further.

I haven't read all the mozilla-central changes during that time frame yet.
I've always suspected the compartment changes in moz-central that landed around that time.
David, good catch - compartment-per-global for XPConnect landed that day. If I recall correctly, mconley's profiler runs showed a bunch of time spent in the cross-compartment JS call path.
OK, so it's a combination of things:

1) the message summary generator uses Gloda's mime structure builder to parse content out of messages, in order to get the first couple of lines to summarize: https://mxr.mozilla.org/comm-central/source/mail/base/content/selectionsummaries.js#585

2) libmime splits messages into text lines and calls the mime "emitter" with the message content line-by-line

3) Gloda's builder reassembles that content into a single string for plaintext and html parts: https://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/components/jsmimeemitter.js#455

Unfortunately the string building at step 3 happens in a way that interacts terribly with compartment-per-global; for some reason the string assignment happens across a compartment boundary, and among other things this forces the JS string to be flattened, so there is an insane amount of memory allocation going on inside a tight loop that doesn't have any GC pauses - we both burn huge amounts of CPU copying strings and blow out the heap by keeping all the copies.

The quick fix is to make selectionsummaries.js pass the "saneBodySize:true" flag to MsgHdrToMimeMessage() - this stops the gloda builder from reassembling the entire string.

This still leaves behind a cruel trap for anyone else who uses the Gloda mime engine; we need to fix the underlying problem as well.
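
For reference, a minimal sketch of the quick fix as a caller would see it, assuming the MsgHdrToMimeMessage(aMsgHdr, aCallbackThis, aCallback, aAllowDownload, aOptions) entry point in gloda's mimemsg.js; msgHdr and the callback body here are illustrative, not the actual selectionsummaries.js code:

// Ask gloda's MIME streamer to cap per-part body accumulation instead of
// rebuilding the whole message body as one giant string.
MsgHdrToMimeMessage(msgHdr, null, function (aMsgHdr, aMimeMsg) {
  if (aMimeMsg == null)
    return; // streaming failed, e.g. the message isn't available offline
  // build the summary snippet from aMimeMsg here
}, false /* aAllowDownload */, { saneBodySize: true });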
Blocks: cpg
(In reply to Irving Reid (:irving) from comment #26)
(In reply to Irving Reid (:irving) from comment #26)
Any number of things can cause a string to be flattened so, in general, code should not rely on strings not being flattened.  However, part of comment 26 sounded worrying, so I'd like to ask for a clarification: are all these huge strings live (i.e., they are in some array or other object structure that is reachable), or should they be garbage that, for some reason, isn't being collected?  If it is the latter, this may be a GC scheduling bug, independent of but exacerbated by bug 650353.
Ah, the flattening... good investigatory find!  As a simple horribleness-reducer, we could change to just putting the strings in a list and calling join at the end.
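
A minimal sketch of that list-and-join change, against the writeBody quoted above (untested; _bodyChunks and _curPartBodySize are hypothetical fields that would be reset wherever a new part is started):

// Collect chunks in an array and join once when the part is finished,
// instead of repeatedly concatenating (and, across a compartment boundary,
// flattening) one ever-growing string.
writeBody: function mime_emitter_writeBody(aBuf, aSize, aOutAmountWritten) {
  if (this._writeBody &&
      (!this._saneBodySize ||
       this._curPartBodySize < MAX_SANE_BODY_PART_SIZE)) {
    this._bodyChunks.push(aBuf);           // O(1) append, no copy of prior data
    this._curPartBodySize += aBuf.length;
  }
},
// ...and wherever the current part is closed off:
// this._curPart.body = this._bodyChunks.join("");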
(In reply to Irving Reid (:irving) from comment #26)
> Unfortunately the string building at step 3 happens in a way that interacts
> terribly with compartment-per-global; for some reason the string assignment
> is happening across a compartment boundary and among other things this
> forces the JS string to be flattened, so there is insane amounts of memory
> allocation going on inside a tight loop that doesn't have any GC pauses - so
> we both burn huge amounts of CPU copying strings, and we blow out the heap
> by keeping all the copies.

This may be caused by how the gloda mimemsg.js architecture works. mimemsg.js is a JS module that is imported by Cu.import, while the code that appends body parts is a JS component called back by libmime. The this._curPart object is synthesized by a function in mimemsg.js, so it gets a cross-compartment wrapper, and most of our manipulation from the JS emitter goes through cross-compartment calls (and XPConnect!).
Luke: I agree that we don't want to rely on flattener and GC behaviour; in fact, while the CPU burn on this bug has been constant since CPG landed, the memory bloat has come and gone a few times, so it could very well be related to GC scheduling.

Step one of my preferred approach would be to figure out a way to have all the objects in the mime message assembler be inside the same compartment so that we don't need to proxy any of the calls; then we can look at fixing the inefficient way we assemble the big string.
Passing this option prevents the gloda MIME emitter from building extra-large strings containing MIME part contents, which reduces the impact of string copying and CPG enough to get by.

asuth: Should we also pass partsOnDemand:true to avoid further downloading, or does aAllowDownload=false already avoid problems with getting MIME parts from the server?

callek: mxr didn't find any other non-test calls to MsgHdrToMimeMessage(), but someone should check SM to make sure it doesn't have a similar problem if it uses the gloda MIME emitter.
Attachment #656201 - Flags: superreview?(bugspam.Callek)
Attachment #656201 - Flags: review?(bugmail)
Comment on attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

redirecting to Neil for SeaMonkey "OMG will this break us, or do we need to port" Q, since I know very little about this part of our code.
Attachment #656201 - Flags: superreview?(bugspam.Callek) → superreview?(neil)
FWIW, I just tried this patch on a collection of messages that TB generally chokes on when multi-selecting.

There is a significant improvement to performance - it's still a little sticky, but it certainly doesn't send TB into a tailspin like it did before.

Great detective work, Irving!
Comment on attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

I'm looking at newsblogOverlay.js because I think that's the only shared code (we don't have the code that summarises message selections). Is there no way to get just the headers, or better still, just the content-base header from the message, and no body at all?
> asuth: Should we also pass partsOnDemand:true to avoid further downloading, or
> does aAllowDownload=false already avoid problems with getting MIME parts from
> the server?

aAllowDownload=false means: "if the message is not available locally (offline sync or message cache), don't download it over the internet, and use the very little information we have from the headers in the .msf file"

partsOnDemand=true means: "if fetching the message remotely (because it isn't fully available locally), and the server is IMAP, and the server supports BODYSTRUCTURE, use it to avoid downloading parts that are too big, but still create a complete MIME structure for the message"

I think aAllowDownload=false takes precedence.
Comment on attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

As protz says, the default (which is now being explicitly set) for allow download (=false) should avoid any network traffic anyway.  I think setting saneBodySize is definitely a good stop-gap.

I don't think there's any way to avoid the JS component being in its own compartment without changes to the way the emitter is instantiated, which leaves us moving all of the representation gunk from mimemsg.js to jsmimeemitter.js and leaving the calls that trigger the streaming in mimemsg.js.

Specifically, gloda.manifest lists the category manager entry: "category mime-emitter @mozilla.org/messenger/mimeemitter;1?type=application/x-js-mime-message @mozilla.org/gloda/jsmimeemitter;1", which is how it gets instantiated.  We could have the streaming process take a pre-constructed XPConnect-looking object and have all the MIME helper stuff operate by using the subscript loader inside the same context as whatever wants to consume it.  We would probably want to use a RequireJS-type idiom (with 'use strict' to avoid global pollution).  This would likely entail changing most of the gloda modules in a similar fashion.

Having said that, I think just having the representation classes loaded in the emitter compartment should be sufficient to stop the worst of the pain.  That and maybe encouraging the C++ shim that defines the emitter layer to perform a little more internal batching so the XPConnect wrapping is not quite so over-the-top.
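
A rough sketch of the "load the representation classes in the emitter's compartment" idea, assuming the standard mozIJSSubScriptLoader service; the mimeReps.js file name is purely illustrative (today those classes live in mimemsg.js):

// Hypothetical: instead of Cu.import-ing the MIME representation classes
// (which puts them in mimemsg.js's own compartment), evaluate them directly
// inside the emitter's compartment, so this._curPart and friends become
// same-compartment objects and body appends never cross an XPConnect wrapper.
Components.classes["@mozilla.org/moz/jssubscript-loader;1"]
          .getService(Components.interfaces.mozIJSSubScriptLoader)
          .loadSubScript("resource:///modules/gloda/mimeReps.js", this);
// mimeReps.js would be an illustrative split of the representation classes
// out of mimemsg.js; the calls that trigger streaming would stay in mimemsg.js
// as suggested above.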
Attachment #656201 - Flags: review?(bugmail) → review+
Several potentially matching topics are tagged at https://getsatisfaction.com/mozilla_messaging/tags/tb15perf

TB15.0.1 coming?? :)
Whiteboard: [gs]
Checked in:

https://hg.mozilla.org/comm-central/rev/4ac6d862637f

but see Bug 784286 to track further work necessary...
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Is this specific to TB17 or can it be in TB15 too?
Here are some bugs like this: bug 787348, bug 787321.
@aceman: yes, I believe the bug is in the official TB15 release.  Here's my experience after upgrading: bug 787121

@:wsmwk -- I agree that a TB 15.0.1 release containing this fix is needed -- I actually had to roll back to TB14, and am happily productive again.
Attachment #656201 - Flags: superreview?(neil)
Blocks: TB2SM
No longer blocks: TB2SM
Whiteboard: [gs] → [gs][regression:TB15]
Target Milestone: --- → Thunderbird 17.0
There are reports that are potentially dupes of this. Will this patch be backported to some branches?
Comment on attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

[Triage Comment]
Taking this onto beta so we can get some testing.
Attachment #656201 - Flags: approval-comm-beta+
Attachment #656201 - Flags: approval-comm-aurora+
Is this going into 15.0.1, please? Then we can see whether the tons of similar bugs are dupes.
Comment on attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

[Triage Comment]
The general feedback so far has been that this and bug 784286 are fixing issues, so taking onto the release branch.
Attachment #656201 - Flags: approval-comm-release+
Irving, we get reports in other bugs that with TB 15.0.1 the memory (freeze) problem is solved, but now not all of the selected messages are summarized in the message pane. Is that a known side effect of the patch?
Depends on: 790204
(In reply to :aceman from comment #51)
> Irving, we get reports in other bugs about this, that in TB15.0.1 the memory
> (freeze) problem is solved but now not all of the selected messages are
> summarized in the message pane. Is that a known side effect of the patch?

There should be no side-effect like that.  But be aware that the summarization logic caps MAX_THREADS to 100 and MAX_MESSAGES to 100, so it could just be that people are being more observant and noticing this limitation.
Are they informed about that? Does the preview pane contain some "further messages were hidden" footnote?
(In reply to :aceman from comment #56)
> Are they informed about that? Does the preview pane contain some "further
> messages were hidden" footnote?

Yes, at the bottom of the summary view.
In my case, I don't see any notice that not everything was summarized.  For 9 messages, total size is 14MB.
https://getsatisfaction.com/mozilla_messaging/topics/tb15_0_is_horrid_slow marked solved
Whiteboard: [gs][regression:TB15] → [gs][regression:TB15][gssolved]
Blocks: 1330872