Last Comment Bug 543737 - gloda needs to avoid indexing ridiculously large message bodies (especially from offline store corruption)
: gloda needs to avoid indexing ridiculously large message bodies (especially f...
Status: RESOLVED FIXED
[gloda key]
: perf
Product: MailNews Core
Classification: Components
Component: Database (show other bugs)
: Trunk
: All All
: -- major (vote)
: Thunderbird 3.1rc1
Assigned To: Andrew Sutherland [:asuth]
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-02-02 07:13 PST by Landry Breuil (:gaston)
Modified: 2010-05-15 09:48 PDT (History)
8 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
rc1+
rc1-fixed


Attachments
a body has limits (20.83 KB, patch)
2010-05-13 18:51 PDT, Andrew Sutherland [:asuth]
no flags Details | Diff | Review
v2 a body has limits (22.65 KB, patch)
2010-05-13 19:55 PDT, Andrew Sutherland [:asuth]
no flags Details | Diff | Review
log of error (26.15 KB, text/plain)
2010-05-14 12:45 PDT, David :Bienvenu
no flags Details
v3 a body has limits (23.84 KB, patch)
2010-05-14 20:59 PDT, Andrew Sutherland [:asuth]
mozilla: review+
Details | Diff | Review

Description Landry Breuil (:gaston) 2010-02-02 07:13:02 PST
User-Agent:       Midori/0.2.2 (X11; OpenBSD amd64; U; fr-fr) WebKit/531.2+
Build Identifier: Mozilla/5.0 (X11; U; OpenBSD amd64; en-US; rv:1.9.1.7) Gecko/20100129 Lightning/1.0b1 Thunderbird/3.0.1

I'm working on the port of Thunderbird 3.0.1 for OpenBSD, we currently ship 2.0.0.23. The 3.0.1 build is with the embedded sqlite 3.6.20, i also tried with systemwide sqlite 3.6.16 & 3.6.22, both with SQLITE_ENABLE_FTS3 set.

I have 3 imap accounts configured in TB2, with around 5000 messages total. On the server, total Maildir size for all accounts is around 2Gb, and the local cache in ImapMail is around 500M (i didn't browse to each and every dir in each account).
After getting TB3 to build successfully, it started migrating my profile (ie syncing all mails from the server for offline use) and indexing all of them.

Rough values : it indexes approx 1 message every 10 secs, the machine is _totally_ unusable, i stopped the indexing after two hours and the sqlite index is now around 5.6Gb.

Slow indexing, i can understand that, but the index size is just crazy. vacuuming it didn't help btw.

528M    .thunderbird/ci580w50.default/ImapMail                                                                                             
5.6G    .thunderbird/ci580w50.default/global-messages-db.sqlite                                                                            
                                                                                                                                           
echo 'select count(*) from messages;' | sqlite3 .thunderbird/ci580w50.default/global-messages-db.sqlite                                    
2646

I created a new clean profile with the exact same accounts, and after a little while (very short compared to the TB2 migration), syncing & indexing was done okay (around 10 msg/sec indexed), and here are the values :

1019M   .thunderbird/7yh39hls.Test/ImapMail                                                                                                
71.8M   .thunderbird/7yh39hls.Test/global-messages-db.sqlite                                                                               
                                                                                                                                           
echo 'select count(*) from messages;' | sqlite3 .thunderbird/7yh39hls.Test/global-messages-db.sqlite                                       
5121     

I'm fine with telling the users that they have to create a new profile and don't try to migrate their existing profile (or otherwise disable GLODA), but the index size just puzzles me.

Reproducible: Always

Steps to Reproduce:
1. Install and configure thunderbird 2
2. Upgrade to thunderbird 3
3.
Actual Results:  
Index is huge compared to relative ImapMail size :
528M	.thunderbird/ci580w50.default/ImapMail/
5.6G	.thunderbird/ci580w50.default/global-messages-db.sqlite

Expected Results:  
Index should have a more reasonable size.
Comment 1 Andrew Sutherland [:asuth] 2010-02-04 12:54:08 PST
From offline conversation, it appears the problem is that the 2.0 profile had some serious offline store corruption.

Landry indicates having "~140 messages which have a body size of ~15Mo" where the body "contains a concatenation of lots of messages and attachements".

I think it's pretty clear that gloda should have some form of heuristic limiting the maximum body size it indexes.

bienvenu, for the 2-to-3 migration, though, should we be considering some form of sanity-checking of the offline store?  Gloda could likely participate in that if need be.
Comment 2 David :Bienvenu 2010-02-04 13:13:40 PST
So, sanity checking of the offline store message size would be great, but we have to deal with Exchange's often wildly incorrect message sizes (i.e., the sanest sanity check is to compare the offline msg size with the online message size).
Comment 3 Andrew Sutherland [:asuth] 2010-02-04 13:35:25 PST
For offline store checking we could leverage the fact that offline messages should not overlap each other in the store.  Of course, this only works if the  store was never compacted.  Once the compaction happens we just have absurdly large messages... and I think we don't escape From in the offline store so we can't super easily detect from the contents what has gone wrong?

For overlap detection the naive implementation would be to sort the messages by offset and then make sure that 1) no one else has the same offset and 2) the size does not take us beyond the start of the next message.  In the worst case, that could get expensive, but that could be mitigated by a data-structure like we use for newsrc ranges or selection that fuses adjacent ranges together.
Comment 4 David :Bienvenu 2010-02-04 13:39:39 PST
(In reply to comment #3)
> For offline store checking we could leverage the fact that offline messages
> should not overlap each other in the store.  Of course, this only works if the 
> store was never compacted.  Once the compaction happens we just have absurdly
> large messages... and I think we don't escape From in the offline store so we
> can't super easily detect from the contents what has gone wrong?

I think that's right.
> 
> For overlap detection the naive implementation would be to sort the messages by
> offset and then make sure that 1) no one else has the same offset and 2) the
> size does not take us beyond the start of the next message.  In the worst case,
> that could get expensive, but that could be mitigated by a data-structure like
> we use for newsrc ranges or selection that fuses adjacent ranges together.

Once compaction happens, I think we're horked, like you say. We could have a heuristic like the message can't have an offline size 3x the message size. Even Exchange doesn't get it that wrong...
Comment 5 Dan Mosedale (:dmose) 2010-04-27 14:24:59 PDT
I'm having a hard time convincing myself that this meets either of our current blocking criteria:

a) make the upgrade experience from TB2 very painful for a large number of users

or 

b) be a new, reproducible, severe quality issue (eg dataloss, frequent crashes)

Am I missing something?
Comment 6 Andrew Sutherland [:asuth] 2010-04-27 16:45:09 PDT
I think a non-trivial number of TB2 upgraders are going to have corrupt messages that end up disturbingly large and that will have performance and result quality fall-out.  (Memory usage during indexing, SQLite performance during querying, result quality problems in terms of false positives, et.c)

But honestly I was abusing the blocking flag as an important to-do list since this is definitely on the easy end of the spectrum with serious potential to eliminate a whole set of performance problems.  We can downgraded to needed if you like (or something else that is full of things we are really going to try and implement).
Comment 7 Andrew Sutherland [:asuth] 2010-05-12 23:16:57 PDT
I have a fix with tests but unfortunately the tests are revealing a subtle glitch in the imap online-to-offline case that previous tests did not catch.  Fix by tomorrow afternoon.
Comment 8 Andrew Sutherland [:asuth] 2010-05-13 18:51:30 PDT
Created attachment 445258 [details] [diff] [review]
a body has limits
Comment 9 Andrew Sutherland [:asuth] 2010-05-13 18:52:37 PDT
I had to punt on the more thorough performance impact analysis of this, but I'll run with my cheap systemtap script to sanity check things.
Comment 10 Andrew Sutherland [:asuth] 2010-05-13 19:55:32 PDT
Created attachment 445266 [details] [diff] [review]
v2 a body has limits

Whoops.  The bulk indexing I was doing for performance analysis revealed that the fix surfaced an underlying assumption violation going on...

We only wanted/thought we only got to stream the bodies of text/plain and text/html.  It turns out that all text/* is getting streamed.  In that case, there is no body attribute because we don't care about it, but "+=" when provided with a string on the right and null/undefined on the left just coerces null/undefined into a string.  This was observed on darcs patches which end up text/x-darcs-patch.

I just changed the logic to inside the JS mime emitter to guard against this.
Comment 11 David :Bienvenu 2010-05-14 07:34:38 PDT
Comment on attachment 445258 [details] [diff] [review]
a body has limits

punchline missing:

+   *  does; when indexing an already existing object, all mutations happen on
+   *  a clone of the existing object so that

xpcshell tests have been failing for me in gloda so it may be a bit tricky for me to test the tests.

20K is a pretty reasonable limit, IMO.
Comment 12 David :Bienvenu 2010-05-14 12:45:06 PDT
Created attachment 445422 [details]
log of error

I tried just running test_mime_emitter.js and got a bunch of failures. This is on comm-central-1.9.2 and moz-central-1.9.2
Comment 13 David :Bienvenu 2010-05-14 13:53:01 PDT
I get the same failures w/o your patch, but that doesn't help me run the tests :-(
Comment 14 David :Bienvenu 2010-05-14 14:02:01 PDT
(In reply to comment #13)
> I get the same failures w/o your patch, but that doesn't help me run the tests
> :-(

er, scratch that - I don't see the failures w/o the patch.
Comment 15 David :Bienvenu 2010-05-14 14:19:43 PDT
It looks like _writeBody isn't getting set correctly (or not read correctly). If I comment out that part of the conditional, test_mime_emitter.js succeeds.

  writeBody: function mime_emitter_writeBody(aBuf, aSize, aOutAmountWritten) {
    if (/* this._writeBody && */
Comment 16 Andrew Sutherland [:asuth] 2010-05-14 15:57:26 PDT
I'll try a windows build...
Comment 17 Andrew Sutherland [:asuth] 2010-05-14 20:35:41 PDT
It was a cross-platform control-flow bone-headed move on my part.  Sadly, with that fixed there's still an order of operations that is confusing me since most of my emitter memory is now in long term storage:

"Sad Llama In Your Car" is doing this:

*** CONTENT TYPE: multipart/related
***

*** CONTENT TYPE: text/html
***

*** CONTENT TYPE: image/png
***
&&& writeBody 0 bytes with write: false
&&& writeBody 6 bytes with write: false
&&& writeBody 0 bytes with write: false
&&& writeBody 6 bytes with write: false
&&& writeBody 0 bytes with write: false
&&& writeBody 7 bytes with write: false
&&& writeBody 0 bytes with write: false
&&& writeBody 6 bytes with write: false
&&& writeBody 16 bytes with write: false
&&& writeBody 7 bytes with write: false
&&& writeBody 0 bytes with write: false
&&& writeBody 7 bytes with write: false
&&& writeBody 1 bytes with write: false
&&& writeBody 1 bytes with write: false


I expect this is the normal multipart/related 'defer processing the body' logic, but I'm a bit confused as to how it ever worked if the image/png was clobbering...

I will research more.

Oh, and as you can tell from the debug, this suggests that the standard libmime function call costs are probably pretty high but the XPConnect crossing costs for HTML messages have to be off the chart.
Comment 18 Andrew Sutherland [:asuth] 2010-05-14 20:47:21 PDT
ah, addHeaderField is fixing things up.  it needs to play a part in the writeBody game.
Comment 19 Andrew Sutherland [:asuth] 2010-05-14 20:59:00 PDT
Created attachment 445516 [details] [diff] [review]
v3 a body has limits

Sorry about the glitch on the previous patch; I must have done something dumb like run the tests without the patch applied.

My rationale for the _writeBody flag is that it's not really a surprise that a lot of calls get made to writeBody (although the 0-length calls in the excessively writey HTML logic are somewhat surprising) and I was trying to optimize that check.
Comment 20 David :Bienvenu 2010-05-15 09:20:33 PDT
Comment on attachment 445516 [details] [diff] [review]
v3 a body has limits

much better :-)
Comment 21 Andrew Sutherland [:asuth] 2010-05-15 09:48:34 PDT
pushed to comm-1.9.2:
http://hg.mozilla.org/releases/comm-1.9.2/rev/0ab2eb35d388

pushed to comm-central despite the tree being seriously messed up:
http://hg.mozilla.org/comm-central/rev/991adaa32680

Note You need to log in before you can comment on or make changes to this bug.