Closed Bug 782899 Opened 12 years ago Closed 12 years ago

Daily hangs and leaks memory like crazy when trying to delete messages in the results of a filter search

Categories

(Thunderbird :: Folder and Message Lists, defect)

x86
All
defect
Not set
major

Tracking

(thunderbird15+ fixed, thunderbird16+ fixed, thunderbird17+ fixed)

RESOLVED FIXED
Thunderbird 18.0

People

(Reporter: justdave, Assigned: Irving)

References

(Blocks 1 open bug)

Details

(4 keywords, Whiteboard: [gs][regression:TB15][gssolved])

Attachments

(3 files)

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:17.0) Gecko/17.0 Thunderbird/17.0a1
BuildID: 20120814030529

For the last few weeks, typing a string in the filter box at the top of a folder message list pane, then selecting multiple items in the results and attempting to delete them, results in Daily immediately consuming multiple GB of RAM and hanging until it's force-quit.
Confirming; just having a multiple selection kills the ponies.
Severity: normal → critical
Keywords: hang
Dave, is mail.operate_on_msgs_in_collapsed_threads set to true?
Keywords: mlk
Perhaps the same as bug 753502.
(In reply to Ludovic Hirlimann [:Usul] from comment #2)
> Dave is mail.operate_on_msgs_in_collapsed_threads set to true ?

It is set to the default value, which is true, yes. I also have threading disabled in most of my folders (including the ones I've run into this in), in case it matters.
Severity: critical → normal
Dave, do these emails have attachments?
They are "page changed" notifications from Confluence, which are multipart/html with embedded pics and css, so yes.

I've also discovered that simply opening one of these can cause the RAM usage explosion.
(In reply to Dave Miller [:justdave] from comment #6)
> They are "page changed" notifications from Confluence, which are
> multipart/html with embedded pics and css, so yes.
> 
> I've also discovered that simply opening one of these can cause the RAM
> usage explosion.

how much memory per message?
regression?
(In reply to Wayne Mery (:wsmwk) from comment #7)
> (In reply to Dave Miller [:justdave] from comment #6)
> > They are "page changed" notifications from Confluence, which are
> > multipart/html with embedded pics and css, so yes.
> > 
> > I've also discovered that simply opening one of these can cause the RAM
> > usage explosion.
> 
> how much memory per message?

It usually immediately jumps to around 3.5 GB to 4.0 GB physical in use, and then starts swapping.  It continually grabs more memory once it starts, until I force quit.

> regression?

Yes.  I could read these without getting this problem a few weeks ago.
The oldest regression report I can find is Bug 759092, though I'm not convinced it's the same issue.

justdave, could you hunt down the regression range?
Bug 759092 cites a failure using the 2012-05-25 build.
Severity: normal → major
Profiling this shows us spending an inordinate amount of time in jsmimeemitter.js, in the writeBody function:

http://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/components/jsmimeemitter.js#455

Profile posted here:

http://people.mozilla.com/~bgirard/cleopatra/?report=ace22cfa16f76a49a787d5d705c5474c68e72633

Cc'ing asuth and protz because this is Gloda stuff.
OS: Mac OS X → All
This is the function in question:

455   writeBody: function mime_emitter_writeBody(aBuf, aSize, aOutAmountWritten) {
456     if (this._writeBody &&
457         (!this._saneBodySize ||
458          this._curPart.body.length < MAX_SANE_BODY_PART_SIZE))
459       this._curPart.body += aBuf;
460   },

If memory is leaking like crazy, then probably this accumulation is where it's going. Likely that means that either the sane body part size isn't being checked, or there are a lot of not-quite-max-size parts getting added.

It sounds like there's a specific message (or set of messages) that's causing this issue; can this be posted?
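
A throwaway diagnostic (hypothetical, not part of any patch) to tell those two cases apart would be to log what writeBody actually sees:

// Hypothetical instrumentation for jsmimeemitter.js writeBody, to see whether
// the saneBodySize cap is actually in effect and how large each part grows.
writeBody: function mime_emitter_writeBody(aBuf, aSize, aOutAmountWritten) {
  dump("writeBody: saneBodySize=" + this._saneBodySize +
       " partLen=" + this._curPart.body.length +
       " bufLen=" + aBuf.length + "\n");
  if (this._writeBody &&
      (!this._saneBodySize ||
       this._curPart.body.length < MAX_SANE_BODY_PART_SIZE))
    this._curPart.body += aBuf;
},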
Shot in the dark: it may be a message with an insane amount of parts, or many such messages.
Attached file testcase
(In reply to Jonathan Protzenko [:protz] from comment #12)
> Shot in the dark: it may be a message with an insane amount of parts, or
> many such messages.

No, not more than 3. Message coming up. Can't save it from TB, so you'll get the untouched message, i.e. without the mbox headers.
(In reply to Ludovic Hirlimann [:Usul] from comment #13)
> Created attachment 654764 [details]
> testcase
> 
> (In reply to Jonathan Protzenko [:protz] from comment #12)
> > Shot in the dark: it may be a message with an insane amount of parts, or
> > many such messages.
> 
> NO not more than 3. Message coming up. Can't save it from tb so you'll get
> the untouched message eg without the mbox headers.

52 parts. It looks like it's basically of the form:

multipart/mixed
  multipart/alternative
    text/msg
    multipart/related
      text/html
      ... Several cid: images ...
  application/pdf

Each image averages about 1/3KB of de-base64'd size. Oh, and the images are all Content-Disposition: attachment too, fwiw.
I see MultiMessageSummary in the profile; not a lot, but a little.  I assume this is a case where MultiMessageSummary is either overwhelmed by what it is streaming to provide a useful summary and/or is making things worse by re-computing things.  I think the summaries have some logic to wait for quiescence based on a timer; if the system has become overwhelmed, it is conceivable that things appear to quiesce and then pour some gasoline on the fire.  It's been long enough that I don't recall the exact event sequence for the summaries, so I don't know how plausible that is.

I doubt gloda's indexer is involved because all types of deletion should appear to gloda as either a fast-pathed move or an actual deletion.  (gloda does show up in the profile a teeny bit, but that could just be background-ness.)
It's worth noting that because of the way XPConnect works, even if the emitter has stopped concatenating the strings because saturation has been reached, all those calls still happen and those tiny strings will still generate tons of garbage.  And I think in the case of HTML, those calls may not be line-based, but tag-based, in which case the garbage generation could be a ridiculous multiplier since HTML e-mails can include a ridiculous number of tags.  I suspect Joshua may know more about that (and his MIME parser may not have this pathological problem).
FWIW, before I came back from vacation I could read these emails without any issues.
I have this reproduced, so I'll dig in with debugger and tracing. If someone else can find a regression window by testing old Daily builds that would be great.
Assignee: nobody → irving
Status: NEW → ASSIGNED
Oh now this is interesting (and annoying...) The Aug 1 nightly build (https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012-08-01-04-20-04-comm-aurora/) beachballs and consumes a big chunk of CPU when I select several messages in a folder including the test message, but it *doesn't* consume gigabytes of memory - whereas trunk as of last Thursday eats both CPU and memory.

That means we get to bisect two problems - the CPU load before Aug 1 and the memory bloat after.
Could be massive garbage generation in both cases, but the GC is just running more frequently (and/or running full cycle-collecting GCs more frequently), so we don't actually bloat as much.  The profiler should label GCs, so if that's the case now, it should be pretty obvious pretty fast.  (Based on the previous profile, I would expect the I/O should still be very obvious.)

It's probably also worth checking with gdb that the I/O hasn't stopped being buffered or something like that. Specifically, check the length of the strings to make sure they are not 1 character long...
Parts of the problem seem to have come and gone. The basic regression window is:

OK:  https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012/05/2012-05-14-05-31-54-comm-aurora/

FAILED:  https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012/05/2012-05-15-03-05-24-comm-central/

So the problem seems to have been introduced on May 14. The May 15 build has both the CPU load and the memory bloat; the May 30 build has the CPU load but not the memory bloat, and the memory bloat reappears some time after Aug 1.
Bah, I messed up and mixed Aurora and Nightly builds in my bisection. The problem is earlier than May 14 in nightlies. Bisect bisect bisect.
OK, this time for sure ;->

Good nightly: https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012/05/2012-05-03-03-16-32-comm-central/thunderbird-15.0a1.en-US.mac.txt

Bad nightly: https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2012/05/2012-05-04-03-36-16-comm-central/thunderbird-15.0a1.en-US.mac.txt

Looking at the changes, there's nothing obvious in comm-central; unfortunately I can't get a successful build of either the good or bad revision to test/bisect further.

I haven't read all the mozilla-central changes during that time frame yet.
I've always suspected the compartment changes in moz-central that landed around that time.
David, good catch - compartment-per-global for XPConnect landed that day. If I recall correctly, mconley's profiler runs showed a bunch of time spent in the cross-compartment JS call path.
OK, so it's a combination of things:

1) the message summary generator uses Gloda's mime structure builder to parse content out of messages, in order to get the first couple of lines to summarize: https://mxr.mozilla.org/comm-central/source/mail/base/content/selectionsummaries.js#585

2) libmime splits messages into text lines and calls the mime "emitter" with the message content line-by-line

3) Gloda's builder reassembles that content into a single string for plaintext and html parts: https://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/components/jsmimeemitter.js#455

Unfortunately the string building at step 3 happens in a way that interacts terribly with compartment-per-global; for some reason the string assignment happens across a compartment boundary, and among other things this forces the JS string to be flattened, so there is an insane amount of memory allocation going on inside a tight loop that doesn't have any GC pauses - we both burn huge amounts of CPU copying strings and blow out the heap by keeping all the copies.

The quick fix is to make selectionsummaries.js pass the "saneBodySize:true" flag to MsgHdrToMimeMessage() - this stops the gloda builder from reassembling the entire string.

This still leaves behind a cruel trap for anyone else who uses the Gloda mime engine; we need to fix the underlying problem as well.
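
For reference, a minimal sketch of the quick fix as a caller would see it, assuming the MsgHdrToMimeMessage(aMsgHdr, aCallbackThis, aCallback, aAllowDownload, aOptions) entry point in gloda's mimemsg.js; msgHdr and the callback body here are illustrative, not the actual selectionsummaries.js code:

// Ask gloda's MIME streamer to cap per-part body accumulation instead of
// rebuilding the whole message body as one giant string.
MsgHdrToMimeMessage(msgHdr, null, function (aMsgHdr, aMimeMsg) {
  if (aMimeMsg == null)
    return; // streaming failed, e.g. the message isn't available offline
  // build the summary snippet from aMimeMsg here
}, false /* aAllowDownload */, { saneBodySize: true });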
Blocks: cpg
(In reply to Irving Reid (:irving) from comment #26)
(In reply to Irving Reid (:irving) from comment #26)
Any number of things can cause a string to be flattened so, in general, code should not rely on strings not being flattened.  However, part of comment 26 sounded worrying, so I'd like to ask for a clarification: are all these huge strings live (i.e., they are in some array or other object structure that is reachable), or should they be garbage that, for some reason, isn't being collected?  If it is the latter, this may be a GC scheduling bug, independent of but exacerbated by bug 650353.
Ah, the flattening... good investigatory find!  As a simple horribleness-reducer, we could change to just putting the strings in a list and calling join at the end.
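
A minimal sketch of that list-and-join change, against the writeBody quoted above (untested; _bodyChunks and _curPartBodySize are hypothetical fields that would be reset wherever a new part is started):

// Collect chunks in an array and join once when the part is finished,
// instead of repeatedly concatenating (and, across a compartment boundary,
// flattening) one ever-growing string.
writeBody: function mime_emitter_writeBody(aBuf, aSize, aOutAmountWritten) {
  if (this._writeBody &&
      (!this._saneBodySize ||
       this._curPartBodySize < MAX_SANE_BODY_PART_SIZE)) {
    this._bodyChunks.push(aBuf);           // O(1) append, no copy of prior data
    this._curPartBodySize += aBuf.length;
  }
},
// ...and wherever the current part is closed off:
// this._curPart.body = this._bodyChunks.join("");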
(In reply to Irving Reid (:irving) from comment #26)
> Unfortunately the string building at step 3 happens in a way that interacts
> terribly with compartment-per-global; for some reason the string assignment
> is happening across a compartment boundary and among other things this
> forces the JS string to be flattened, so there is insane amounts of memory
> allocation going on inside a tight loop that doesn't have any GC pauses - so
> we both burn huge amounts of CPU copying strings, and we blow out the heap
> by keeping all the copies.

This may be caused by how the gloda mimemsg.js architecture works. mimemsg.js is a JS module that is imported by Cu.import, while the code that appends body parts is a JS component called back by libmime. The this._curPart object is synthesized by a function in mimemsg.js, so it gets a cross-compartment wrapper, and most of our manipulation from the JS emitter goes through cross-compartment calls (and XPConnect!).
Luke: I agree that we don't want to rely on flattener and GC behaviour; in fact, while the CPU burn on this bug has been constant since CPG landed, the memory bloat has come and gone a few times, so it could very well be related to GC scheduling.

Step one of my preferred approach would be to figure out a way to have all the objects in the mime message assembler be inside the same compartment so that we don't need to proxy any of the calls; then we can look at fixing the inefficient way we assemble the big string.
Passing this option prevents the gloda MIME emitter from building extra-large strings containing MIME part contents, which reduces the impact of string copying and CPG enough to get by.

asuth: Should we also pass partsOnDemand:true to avoid further downloading, or does aAllowDownload=false already avoid problems with getting MIME parts from the server?

callek: mxr didn't find any other non-test calls to MsgHdrToMimeMessage(), but someone should check SM to make sure it doesn't have a similar problem if it uses the gloda MIME emitter.
Attachment #656201 - Flags: superreview?(bugspam.Callek)
Attachment #656201 - Flags: review?(bugmail)
Comment on attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

redirecting to Neil for SeaMonkey "OMG will this break us, or do we need to port" Q, since I know very little about this part of our code.
Attachment #656201 - Flags: superreview?(bugspam.Callek) → superreview?(neil)
FWIW, I just tried this patch on a collection of messages that TB generally chokes on when multi-selecting.

There is a significant improvement to performance - it's still a little sticky, but it certainly doesn't send TB into a tailspin like it did before.

Great detective work, Irving!
Comment on attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

I'm looking at newsblogOverlay.js because I think that's the only shared code (we don't have the code that summarises message selections). Is there no way to get just the headers, or better still, just the content-base header from the message, and no body at all?
> asuth: Should we also pass partsOnDemand:true to avoid further downloading, or
> does aAllowDownload=false already avoid problems with getting MIME parts from
> the server?

aAllowDownload=false means: "if the message is not available locally (offline sync or message cache), don't download it over the internet, and use the very little information we have from the headers in the .msf file"

partsOnDemand=true means: "if fetching the message remotely (because it isn't fully available locally), and the server is IMAP, and the server supports BODYSTRUCTURE, use it to avoid downloading parts that are too big, but still create a complete MIME structure for the message"

I think aAllowDownload=false takes precedence.
Comment on attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

As protz says, the default (which is now being explicitly set) for allow download (=false) should avoid any network traffic anyway.  I think setting saneBodySize is definitely a good stop-gap.

I don't think there's any way to avoid the JS component being in its own compartment without changes to the way the emitter is instantiated, which leaves us moving all of the representation gunk from mimemsg.js to jsmimeemitter.js and leaving the calls that trigger the streaming in mimemsg.js.

Specifically, gloda.manifest lists the category manager entry: "category mime-emitter @mozilla.org/messenger/mimeemitter;1?type=application/x-js-mime-message @mozilla.org/gloda/jsmimeemitter;1", which is how it gets instantiated.  We could have the streaming process take a pre-constructed XPConnect-looking object and have all the MIME helper stuff operate by using the subscript loader inside the same context as whatever wants to consume it.  We would probably want to use a RequireJS-type idiom (with 'use strict' to avoid global pollution).  This would likely entail changing most of the gloda modules in a similar fashion.

Having said that, I think just having the representation classes loaded in the emitter compartment should be sufficient to stop the worst of the pain.  That and maybe encouraging the C++ shim that defines the emitter layer to perform a little more internal batching so the XPConnect wrapping is not quite so over-the-top.
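
A rough sketch of the "load the representation classes in the emitter's compartment" idea, assuming the standard mozIJSSubScriptLoader service; the mimeReps.js file name is purely illustrative (today those classes live in mimemsg.js):

// Hypothetical: instead of Cu.import-ing the MIME representation classes
// (which puts them in mimemsg.js's own compartment), evaluate them directly
// inside the emitter's compartment, so this._curPart and friends become
// same-compartment objects and body appends never cross an XPConnect wrapper.
Components.classes["@mozilla.org/moz/jssubscript-loader;1"]
          .getService(Components.interfaces.mozIJSSubScriptLoader)
          .loadSubScript("resource:///modules/gloda/mimeReps.js", this);
// mimeReps.js would be an illustrative split of the representation classes
// out of mimemsg.js; the calls that trigger streaming would stay in mimemsg.js
// as suggested above.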
Attachment #656201 - Flags: review?(bugmail) → review+
Several potentially matching topics are tagged at https://getsatisfaction.com/mozilla_messaging/tags/tb15perf

TB15.0.1 coming?? :)
Whiteboard: [gs]
Checked in:

https://hg.mozilla.org/comm-central/rev/4ac6d862637f

but see Bug 784286 to track further work necessary...
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Is this specific to TB17 or can it be in TB15 too?
Here are some bugs like this: bug 787348, bug 787321.
@aceman: yes, I believe the bug is in the official TB15 release.  Here's my experience after upgrading: bug 787121

@:wsmwk -- I agree that a TB 15.0.1 release containing this fix is needed -- I actually had to roll back to TB14, and am happily productive again.
Attachment #656201 - Flags: superreview?(neil)
Blocks: TB2SM
No longer blocks: TB2SM
Whiteboard: [gs] → [gs][regression:TB15]
Target Milestone: --- → Thunderbird 17.0
There are reports that are potentially dupes of this. Will this patch be backported to some branches?
Comment on attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

[Triage Comment]
Taking this onto beta so we can get some testing.
Attachment #656201 - Flags: approval-comm-beta+
Attachment #656201 - Flags: approval-comm-aurora+
Is this going into 15.0.1, please? Then we can see whether the tons of similar bugs are dupes.
Comment on attachment 656201 [details] [diff] [review]
Don't accumulate the entire message body in MIME emitter when creating message summaries

[Triage Comment]
The general feedback so far has been that this and bug 784286 are fixing issues, so taking onto the release branch.
Attachment #656201 - Flags: approval-comm-release+
Irving, we get reports in other bugs that with TB 15.0.1 the memory (freeze) problem is solved, but now not all of the selected messages are summarized in the message pane. Is that a known side effect of the patch?
Depends on: 790204
(In reply to :aceman from comment #51)
> Irving, we get reports in other bugs about this, that in TB15.0.1 the memory
> (freeze) problem is solved but now not all of the selected messages are
> summarized in the message pane. Is that a known side effect of the patch?

There should be no side-effect like that.  But be aware that the summarization logic caps MAX_THREADS to 100 and MAX_MESSAGES to 100, so it could just be that people are being more observant and noticing this limitation.
Are they informed about that? Does the preview pane contain some "further messages were hidden" footnote?
(In reply to :aceman from comment #56)
> Are they informed about that? Does the preview pane contain some "further
> messages were hidden" footnote?

Yes, at the bottom of the summary view.
In my case, I don't see any notice that not everything was summarized.  For 9 messages, total size is 14MB.
https://getsatisfaction.com/mozilla_messaging/topics/tb15_0_is_horrid_slow marked solved
Whiteboard: [gs][regression:TB15] → [gs][regression:TB15][gssolved]
Blocks: 1330872