Closed Bug 543737 Opened 14 years ago Closed 14 years ago

gloda needs to avoid indexing ridiculously large message bodies (especially from offline store corruption)

Categories

(MailNews Core :: Database, defect)

defect
Not set
major

Tracking

(blocking-thunderbird3.1 rc1+, thunderbird3.1 rc1-fixed)

RESOLVED FIXED
Thunderbird 3.1rc1
Tracking Status
blocking-thunderbird3.1 --- rc1+
thunderbird3.1 --- rc1-fixed

People

(Reporter: gaston, Assigned: asuth)

Details

(Keywords: perf, Whiteboard: [gloda key])

Attachments

(2 files, 2 obsolete files)

User-Agent:       Midori/0.2.2 (X11; OpenBSD amd64; U; fr-fr) WebKit/531.2+
Build Identifier: Mozilla/5.0 (X11; U; OpenBSD amd64; en-US; rv:1.9.1.7) Gecko/20100129 Lightning/1.0b1 Thunderbird/3.0.1

I'm working on the port of Thunderbird 3.0.1 for OpenBSD; we currently ship 2.0.0.23. The 3.0.1 build uses the embedded sqlite 3.6.20; I also tried with system-wide sqlite 3.6.16 & 3.6.22, both with SQLITE_ENABLE_FTS3 set.

I have 3 IMAP accounts configured in TB2, with around 5000 messages total. On the server, the total Maildir size for all accounts is around 2 GB, and the local cache in ImapMail is around 500 MB (I didn't browse to each and every dir in each account).
After getting TB3 to build successfully, it started migrating my profile (i.e. syncing all mails from the server for offline use) and indexing all of them.

Rough values: it indexes approximately 1 message every 10 seconds, the machine is _totally_ unusable; I stopped the indexing after two hours and the sqlite index is now around 5.6 GB.

Slow indexing I can understand, but the index size is just crazy. Vacuuming it didn't help, by the way.

528M    .thunderbird/ci580w50.default/ImapMail                                                                                             
5.6G    .thunderbird/ci580w50.default/global-messages-db.sqlite                                                                            
                                                                                                                                           
echo 'select count(*) from messages;' | sqlite3 .thunderbird/ci580w50.default/global-messages-db.sqlite                                    
2646

I created a new clean profile with the exact same accounts, and after a little while (very short compared to the TB2 migration), syncing & indexing was done okay (around 10 msg/sec indexed), and here are the values :

1019M   .thunderbird/7yh39hls.Test/ImapMail                                                                                                
71.8M   .thunderbird/7yh39hls.Test/global-messages-db.sqlite                                                                               
                                                                                                                                           
echo 'select count(*) from messages;' | sqlite3 .thunderbird/7yh39hls.Test/global-messages-db.sqlite                                       
5121     

I'm fine with telling users that they have to create a new profile and not try to migrate their existing one (or otherwise disable GLODA), but the index size just puzzles me.

Reproducible: Always

Steps to Reproduce:
1. Install and configure thunderbird 2
2. Upgrade to thunderbird 3
3.
Actual Results:  
Index is huge compared to relative ImapMail size :
528M	.thunderbird/ci580w50.default/ImapMail/
5.6G	.thunderbird/ci580w50.default/global-messages-db.sqlite

Expected Results:  
Index should have a more reasonable size.
Keywords: perf
Whiteboard: [223 Migration]
Component: Search → Database
Product: Thunderbird → MailNews Core
QA Contact: search → database
Version: unspecified → 1.9.1 Branch
From offline conversation, it appears the problem is that the 2.0 profile had some serious offline store corruption.

Landry indicates having "~140 messages which have a body size of ~15 MB" where the body "contains a concatenation of lots of messages and attachments".

I think it's pretty clear that gloda should have some form of heuristic limiting the maximum body size it indexes.

bienvenu, for the 2-to-3 migration, though, should we be considering some form of sanity-checking of the offline store?  Gloda could likely participate in that if need be.
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: OpenBSD → All
Summary: Migrating a Thunderbird 2 profile creates a huge GLODA index → gloda needs to avoid indexing ridiculously large message bodies (especially from offline store corruption)
Whiteboard: [223 Migration] → [gloda key]
Version: 1.9.1 Branch → Trunk
So, sanity checking of the offline store message size would be great, but we have to deal with Exchange's often wildly incorrect message sizes (i.e., the sanest sanity check is to compare the offline msg size with the online message size).
For offline store checking we could leverage the fact that offline messages should not overlap each other in the store. Of course, this only works if the store was never compacted. Once compaction happens we just have absurdly large messages... and I think we don't escape From in the offline store, so we can't super easily detect from the contents what has gone wrong?

For overlap detection the naive implementation would be to sort the messages by offset and then make sure that 1) no one else has the same offset and 2) the size does not take us beyond the start of the next message.  In the worst case, that could get expensive, but that could be mitigated by a data-structure like we use for newsrc ranges or selection that fuses adjacent ranges together.
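The naive approach above can be sketched as follows. This is an illustrative sketch only, not the actual implementation; the function name and the { offset, size } record shape are assumptions made for the example.

```javascript
// Naive overlap detection over offline-store message records.
// Each record is { offset, size }: the message's byte offset in the
// store and its recorded size.
function findOverlaps(messages) {
  // Sort by offset so each message only needs comparing to its successor.
  const sorted = [...messages].sort((a, b) => a.offset - b.offset);
  const suspects = [];
  for (let i = 0; i < sorted.length - 1; i++) {
    const cur = sorted[i];
    const next = sorted[i + 1];
    if (cur.offset === next.offset ||          // 1) duplicate offset
        cur.offset + cur.size > next.offset) { // 2) runs past next message
      suspects.push(cur);
    }
  }
  return suspects;
}
```

In the worst (heavily corrupted) case every adjacent pair overlaps, which is where the fused-range data structure mentioned above would help keep bookkeeping cheap.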
(In reply to comment #3)
> For offline store checking we could leverage the fact that offline messages
> should not overlap each other in the store.  Of course, this only works if the 
> store was never compacted.  Once the compaction happens we just have absurdly
> large messages... and I think we don't escape From in the offline store so we
> can't super easily detect from the contents what has gone wrong?

I think that's right.
> 
> For overlap detection the naive implementation would be to sort the messages by
> offset and then make sure that 1) no one else has the same offset and 2) the
> size does not take us beyond the start of the next message.  In the worst case,
> that could get expensive, but that could be mitigated by a data-structure like
> we use for newsrc ranges or selection that fuses adjacent ranges together.

Once compaction happens, I think we're horked, like you say. We could have a heuristic like: a message can't have an offline size more than 3x the server-reported message size. Even Exchange doesn't get it that wrong...
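The 3x heuristic above is simple enough to state as code. A minimal sketch, assuming the hypothetical name offlineSizeLooksCorrupt and that both sizes are available in bytes:

```javascript
// Flag a message whose offline store size exceeds 3x the size the
// server reported. The 3x slack is to tolerate servers (e.g. Exchange)
// that report wildly inaccurate sizes.
function offlineSizeLooksCorrupt(offlineSize, serverSize) {
  return offlineSize > 3 * serverSize;
}
```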
Severity: normal → major
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
blocking-thunderbird3.1: --- → rc1+
Target Milestone: --- → Thunderbird 3.1rc1
I'm having a hard time convincing myself that this meets either of our current blocking criteria:

a) make the upgrade experience from TB2 very painful for a large number of users

or 

b) be a new, reproducible, severe quality issue (e.g. data loss, frequent crashes)

Am I missing something?
I think a non-trivial number of TB2 upgraders are going to have corrupt messages that end up disturbingly large, and that will have performance and result-quality fallout. (Memory usage during indexing, SQLite performance during querying, result-quality problems in terms of false positives, etc.)

But honestly, I was abusing the blocking flag as an important to-do list, since this is definitely on the easy end of the spectrum with serious potential to eliminate a whole set of performance problems. We can downgrade it to needed if you like (or to something else that is full of things we are really going to try to implement).
Whiteboard: [gloda key] → [eta Saturday May 8][gloda key]
Whiteboard: [eta Saturday May 8][gloda key] → [eta Monday May 10][gloda key]
Whiteboard: [eta Monday May 10][gloda key] → [eta Wed May 12][gloda key]
I have a fix with tests but unfortunately the tests are revealing a subtle glitch in the imap online-to-offline case that previous tests did not catch.  Fix by tomorrow afternoon.
Whiteboard: [eta Wed May 12][gloda key] → [final patch eta Thu May 13][gloda key]
Attached patch a body has limits (obsolete) — Splinter Review
Attachment #445258 - Flags: review?(bienvenu)
I had to punt on the more thorough performance impact analysis of this, but I'll run with my cheap systemtap script to sanity check things.
Whiteboard: [final patch eta Thu May 13][gloda key] → [has patch needs review bienvenu][gloda key]
Attached patch v2 a body has limits (obsolete) — Splinter Review
Whoops.  The bulk indexing I was doing for performance analysis revealed that the fix surfaced an underlying assumption violation going on...

We only wanted, and thought we only got, to stream the bodies of text/plain and text/html. It turns out that all text/* is getting streamed. In that case there is no body attribute because we don't care about it, but "+=" with a string on the right and null/undefined on the left just coerces the null/undefined into a string. This was observed on darcs patches, which end up as text/x-darcs-patch.

I just changed the logic inside the JS mime emitter to guard against this.
Attachment #445258 - Attachment is obsolete: true
Attachment #445266 - Flags: review?(bienvenu)
Attachment #445258 - Flags: review?(bienvenu)
Comment on attachment 445258 [details] [diff] [review]
a body has limits

punchline missing:

+   *  does; when indexing an already existing object, all mutations happen on
+   *  a clone of the existing object so that

xpcshell tests have been failing for me in gloda so it may be a bit tricky for me to test the tests.

20K is a pretty reasonable limit, IMO.
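Capping the indexed body at a fixed limit like the 20K mentioned above is a one-liner. A sketch under assumed names (MAX_INDEXED_BODY_CHARS and clampBody are illustrative, not the names in the patch):

```javascript
// Cap the amount of body text handed to the indexer, so a corrupt
// multi-megabyte "message" can't bloat the full-text index.
const MAX_INDEXED_BODY_CHARS = 20 * 1024;

function clampBody(body) {
  return body.length > MAX_INDEXED_BODY_CHARS
    ? body.slice(0, MAX_INDEXED_BODY_CHARS)
    : body;
}
```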
Attached file log of error
I tried just running test_mime_emitter.js and got a bunch of failures. This is on comm-central-1.9.2 and moz-central-1.9.2
I get the same failures w/o your patch, but that doesn't help me run the tests :-(
(In reply to comment #13)
> I get the same failures w/o your patch, but that doesn't help me run the tests
> :-(

er, scratch that - I don't see the failures w/o the patch.
It looks like _writeBody isn't getting set correctly (or not read correctly). If I comment out that part of the conditional, test_mime_emitter.js succeeds.

  writeBody: function mime_emitter_writeBody(aBuf, aSize, aOutAmountWritten) {
    if (/* this._writeBody && */
I'll try a windows build...
It was a cross-platform control-flow bone-headed move on my part. Sadly, with that fixed there's still an order-of-operations issue that is confusing me, since most of my emitter memory is now in long-term storage:

"Sad Llama In Your Car" is doing this:

*** CONTENT TYPE: multipart/related
***

*** CONTENT TYPE: text/html
***

*** CONTENT TYPE: image/png
***
&&& writeBody 0 bytes with write: false
&&& writeBody 6 bytes with write: false
&&& writeBody 0 bytes with write: false
&&& writeBody 6 bytes with write: false
&&& writeBody 0 bytes with write: false
&&& writeBody 7 bytes with write: false
&&& writeBody 0 bytes with write: false
&&& writeBody 6 bytes with write: false
&&& writeBody 16 bytes with write: false
&&& writeBody 7 bytes with write: false
&&& writeBody 0 bytes with write: false
&&& writeBody 7 bytes with write: false
&&& writeBody 1 bytes with write: false
&&& writeBody 1 bytes with write: false


I expect this is the normal multipart/related 'defer processing the body' logic, but I'm a bit confused as to how it ever worked if the image/png was clobbering...

I will research more.

Oh, and as you can tell from the debug, this suggests that the standard libmime function call costs are probably pretty high but the XPConnect crossing costs for HTML messages have to be off the chart.
ah, addHeaderField is fixing things up.  it needs to play a part in the writeBody game.
Sorry about the glitch on the previous patch; I must have done something dumb like run the tests without the patch applied.

My rationale for the _writeBody flag is that it's not really a surprise that a lot of calls get made to writeBody (although the 0-length calls in the excessively writey HTML logic are somewhat surprising) and I was trying to optimize that check.
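The shape of that optimization can be sketched like this. This is a simplified illustration following the snippet quoted earlier in the bug, not the real emitter; _curPart is an assumed field name:

```javascript
// Simplified emitter sketch: writeBody is called very frequently
// (including 0-length calls), so a cheap boolean flag check comes
// first and lets uninteresting calls bail out immediately.
const emitter = {
  _writeBody: false,          // set true only for parts we want indexed
  _curPart: { body: "" },     // body initialized, avoiding "+=" on undefined
  writeBody(aBuf) {
    if (this._writeBody &&
        this._curPart &&
        this._curPart.body !== undefined) {
      this._curPart.body += aBuf;
    }
  },
};
```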
Attachment #445266 - Attachment is obsolete: true
Attachment #445516 - Flags: review?(bienvenu)
Attachment #445266 - Flags: review?(bienvenu)
Comment on attachment 445516 [details] [diff] [review]
v3 a body has limits

much better :-)
Attachment #445516 - Flags: review?(bienvenu) → review+
pushed to comm-1.9.2:
http://hg.mozilla.org/releases/comm-1.9.2/rev/0ab2eb35d388

pushed to comm-central despite the tree being seriously messed up:
http://hg.mozilla.org/comm-central/rev/991adaa32680
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Whiteboard: [has patch needs review bienvenu][gloda key] → [gloda key]