Open Bug 632791 Opened 14 years ago Updated 5 years ago

Gloda indexes one message per minute with high CPU. Excessively sized global-messages-db.sqlite caused by buggered, excessively sized offline store.

Categories

(MailNews Core :: Database, defect)

1.9.2 Branch
Other
All
defect
Not set
major

Tracking

(Not tracked)

UNCONFIRMED

People

(Reporter: gaston, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: perf)

Attachments

(4 files)

User-Agent:       Midori/0.3 (X11; OpenBSD; U; fr-fr) WebKit/531.2+
Build Identifier: 

Maybe a followup of https://bugzilla.mozilla.org/show_bug.cgi?id=543737, as it's with the same profile offline store. I realized i still had gloda disabled, and decided to re-enable it.

It first indexed smth like 2000 messages on ~3300 in INBOX, then started being awfully slow (ie 1 msg/min in activity manager). At that time, i realised the gloda DB was 5Gb, but that was probably a leftover from the previous tries, so i removed it to start a fresh indexing.

It stopped it, and restarted TB, and now it seems it cant index more than one message a minute :

2011-02-09 14:58:11     gloda.index_msg DEBUG   *** Indexing message: 1297 : Cron <root@polegeom> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
2011-02-09 14:59:09     gloda.index_msg DEBUG     * Got message, subject Cron <root@polegeom> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
....
2011-02-09 14:59:09     gloda.index_msg DEBUG   *** Indexing message: 1296 : Debian package updates on polegeom
2011-02-09 15:00:07     gloda.index_msg DEBUG     * Got message, subject Debian package updates on polegeom

That time, gloda db size looks more 'normal' :

4.4G    .mozilla-thunderbird/7yh39hls.Test/ImapMail/

-rw-r--r--  1 landry  users  13530112 Feb  9 15:00 .mozilla-thunderbird/7yh39hls.Test/global-messages-db.sqlite

(with 2137 messages indexed so far)

That's with Tb 3.1.7

Reproducible: Always
Oh, and as a sidenote, thunderbird takes 100% cpu if gloda is indexing. If i disable gloda in prefs, it doesn't stop the indexing, i have to kill thunderbird and restart it to get it back to regular cpu usage.
Version: unspecified → 3.1
Component: Folder and Message Lists → Database
Product: Thunderbird → MailNews Core
Version: 3.1 → Trunk
Forgot to mention this is with Tb built against systemwide sqlite 3.7.4 with SQLITE_ENABLE_COLUMN_METADATA, SQLITE_HAVE_ISNAN, SQLITE_SOUNDEX, SQLITE_ENABLE_RTREE and SQLITE_ENABLE_FTS3 set.
QA Contact: folders-message-lists → database
(In reply to comment #1)
> If i disable gloda in prefs, it doesn't stop the indexing, i have to kill
> thunderbird and restart it to get it back to regular cpu usage.

bug 550770
Severity: normal → major
Keywords: perf
Is this a DEBUG build?
No, but i can make one. I only enabled mailnews.database.global.logging.dump.
(In reply to comment #5)
> No, but i can make one. I only enabled mailnews.database.global.logging.dump.

No, Andrew was hoping it was a debug build, because there's a known slowness in debug builds (though even that's not as slow as one message per minute, at least on my machine)
It's important to note that gloda's rate of indexing is adaptive.  If a non-gloda part of Thunderbird is using most of your CPU, gloda will reduce its rate of indexing way way way down to increase responsiveness and let that other code finish up quickly.  For example, auto-sync downloading IMAP messages for offline storage can potentially result in non-trivial load.

It would be good to get some profiling data, even if it is just poor man's gdb profiling.  If you could attach gdb periodically while gloda is indexing slowly and Thunderbird is eating 100% of your CPU, that could be invaluable.  Please do a "thread apply all bt" to see what threads are actively doing things.  (Inactive threads will be waiting on semaphores and/or sockets.)  Of particular interest are: what is happening on the first thread, and whether the asynchronous db thread is deep in a sqlite stack.

You may also be able to get some good profiling results using dtrace or other BSD mechanisms I'm not up on.  If using a profiling mechanism capable of polling at a rate faster than human control-c profiling with gdb, please keep in mind that if you only have 1 or 2 processing cores it is possible for thread scheduling to leave a thread in the same inactive state in the same stack frame which could make it seem like that thread is more busy than it really is.

I'm assuming you built this yourself and so have an unstripped binary.  If the frame pointer is being omitted backtraces may cut out early and you may need to do one of the following in your mozconfig (with ac_enable_*):
a) --enable-profiling to tell us to stop omitting the frame pointer
b) --enable-debug-symbols=yes to get us to generate dwarf debug info and generate the unwind data required to be able to unwind the stack.  This will also let gdb tell us the arguments functions were invoked with.  Likely option 'a' is sufficient.  Certainly your builds will go much faster that way.
Attached file trace
my debug builds are done with --enable-debug --disable-install-strip, and -g is passed to CFLAGS. Here's a first trace, i'm not sure there's enough info in it.
I'll add --enable-profiling and --enable-debug-symbols=yes to my next debug builds.
As to profiling, the only tool we have on OpenBSD is ktrace, and its output is.. quite hard to parse.

In fact, the activity manager doesn't show gloda indexing yet, it looks like it's waiting forever on 'determining which messages to index in INBOX'.
To elaborate on bienvenu's reply, it is indeed far preferable that you do not build with "--enable-debug".  (Although the specific problem I am thinking of will not be an issue if you build/run with the recently released SQLite 3.7.5 which has the fixes for that problem.)

Ideally, you would also want to run without "mailnews.database.global.logging.dump" as it does increase gloda's overhead non-trivially.

Your backtraces look surprisingly gibberishy for having been built with --enable-debug and having -g passed...  (That's all --enable-debug-symbols=yes should be netting out to anyways, -g.)

The log mentions you are using gdb 6.3 which is from 2004.  While hopefully your gcc toolchain is more recent than gdb, I'm going to have to invoke "unsupported platform shenanigans" and bow out from involvement in further analysis until you can reproduce on a tier 1 thunderbird platform (OS X/linux/windows) or track down a specific smoking gun.  Let me emphasize that I do appreciate your willingness to look into this and really hope you do keep looking into it.

One potentially platform related tidbit is that gloda uses nsStopwatch to help figure out what the rest of Thunderbird is up to:
http://mxr.mozilla.org/comm-central/source/mailnews/base/util/nsStopwatch.cpp

I assume OpenBSD is falling into the XP_UNIX case and all is well, but if not, that could explain gloda backing off, but not why something else would take 100% of the cpu.

And a useful gloda tidbit is that if your message corpus includes a ton of duplicate messages (all having the same message-id) or a ridiculously long message thread (in terms of distance from the root to the last message), gloda performance will suffer while indexing them.
(In reply to comment #9)
> To elaborate on bienvenu's reply, it is indeed far preferable that you do not
> build with "--enable-debug".  (Although the specific problem I am thinking of
> will not be an issue if you build/run with the recently released SQLite 3.7.5
> which has the fixes for that problem.)

Do you mean that sqlite 3.7.5 might fix my issue, or it only fixes an unrelated issue with builds with --enable-debug ? I can try building it if that helps.

> Ideally, you would also want to run without
> "mailnews.database.global.logging.dump" as it does increase gloda's overhead
> non-trivially.
> 
> Your backtraces look surprisingly gibberishy for having been built with
> --enable-debug and having -g passed...  (That's all --enable-debug-symbols=yes
> should be netting out to anyways, -g.)

So, the official way of building a debug-enabled version is --enable-profiling --enable-debug-symbols=yes ? noted.

> The log mentions you are using gdb 6.3 which is from 2004.  While hopefully
> your gcc toolchain is more recent than gdb, I'm going to have to invoke
> "unsupported platform shenanigans" and bow out from involvement in further
> analysis until you can reproduce on a tier 1 thunderbird platform (OS
> X/linux/windows) or track down a specific smoking gun.  Let me emphasize that I
> do appreciate your willingness to look into this and really hope you do keep
> looking into it.

Yes; gdb 6.3. We might update to 6.6 at some point, last gplv2 version.
Gcc is 4.2.1 since a year or so, was 3.3.5 before. And of course i know the supported platforms list.. oh well, i hope i'll get
a better trace with a correct debug build.

> One potentially platform related tidbit is that gloda uses nsStopwatch to help
> figure out what the rest of Thunderbird is up to:
> http://mxr.mozilla.org/comm-central/source/mailnews/base/util/nsStopwatch.cpp
> 
> I assume OpenBSD is falling into the XP_UNIX case and all is well, but if not,
> that could explain gloda backing off, but not why something else would take
> 100% of the cpu.

Yes, it's in the XP_UNIX codepath.

> And a useful gloda tidbit is that if your message corpus includes a ton of
> duplicate messages (all having the same message-id) or a ridiculously long
> message thread (in terms of distance from the root to the last message), gloda
> performance will suffer while indexing them.

My message corpus is 'normal' i'd say, no huge threads, and no dupes (although corpus might have been corrupted by a fs crash). 
I don't know if the fact that it's slow indexing my admin folder (containing a ton of similar messages in content from cronjobs) has something to do with it. I'll also compare with indexing of a 'normal' folder.
(In reply to comment #10)
> Do you mean that sqlite 3.7.5 might fix my issue, or it only fixes an unrelated
> issue with builds with --enable-debug ? I can try building it if that helps.

3.7.5 fixes an issue where SQLite statement tracing, which is enabled by default on DEBUG builds, is ridiculously inefficient for FTS3 tables such as those used by gloda.
 
> So, the official way of building a debug-enabled version is --enable-profiling
> --enable-debug-symbols=yes ? noted.

"debug" is a very overloaded word.  If someone says "debug build" they probably mean -DDEBUG build, which turns on all kinds of expensive paranoia checking throughout the system.  That is definitely not what you want if you are trying to deal with real-world performance because of the comprehensive slowdowns that result.

You definitely do not want a DEBUG build for your investigation.  You do want accurate stack traces, though, which should ideally come just from --enable-profiling.
 
> My message corpus is 'normal' i'd say, no huge threads, and no dupes (although
> corpus might have been corrupted by a fs crash). 
> I don't know if the fact that it's slow indexing my admin folder (containing a
> ton of similar messages in content from cronjobs) has something to do with it.
> I'll also compare with indexing of a 'normal' folder.

Our streaming process will stream the entire message even though gloda only cares about bits of the body.  If these are huge messages, that could explain things.  This should be very obvious if it is the problem.

Since you are resigned to using gdb, you might want to check out the script here which could be useful:
http://poormansprofiler.org/

Note that when counting all thread stacks, we will expect and overwhelming number of stack frames for the non-main threads to be idle blocking on additional tasks.
New trace with --enable-profiling. Strange thing is that thread apply all bt doesn't show all threads, and thread 1 and 4 seems to be the same. Content analyzed is rsync backup output sent in an admin mail, if that matters.
Nothing jumps out at me :(
Another data point :

created a fresh new profile, with only my admin account configured, tb fetched 3400 messages, Gloda indexed them in 120 secs. The ImapMail folder in profile is 60Mb (mainly INBOX file), and on my 'regular profile', the same folder/file for the admin account (with supposedly the same messages) is ~350Mb. Emptied trash, same thing. And when i try to compact the INBOX folder, tb says that i can't because 'another operation is in progress', whereas nothing is running in activity manager (maybe compacting folders should be shown in activity manager ?). Quit Tb, rerun it, i can compact the folder, but it's still 350Mb.

I'm pretty sure if i set a totally new profile with the same accounts configured, gloda won't choke on it....
Looking more, i think my profile got corrupted at some point, for this folder tb says '3392 messages', and :

$grep -ci ^Message-Id: .thunderbird/7yh39hls.Test/ImapMail/mail.craig.fr/INBOX             
13676

$grep -i ^Message-Id: .thunderbird/7yh39hls.Test/ImapMail/mail.craig.fr/INBOX | sort | uniq | wc -l
    3473

Here', 3473 doesn't even match 3392.

And on my fresh new profile:
$grep ^Message-Id .thunderbird/3yfidt7e.Admin/ImapMail/craig.fr/INBOX  | sort | uniq  | wc -l
    3344
$grep -c ^Message-Id .thunderbird/3yfidt7e.Admin/ImapMail/craig.fr/INBOX                      
3344

Is there a procedure in thunderbird to fix/purge the profile files from dupes, between 'expunge', 'compact' and 'empty trash', or it's better in my case to start from a new profile ?

I ran a script checking msg id count vs unique msg ids, and lots of folders file seem to contain duplicated messages. Sometimes more than 4x, sometimes a bit more than 2x.. how can that happen ?

.thunderbird/7yh39hls.Test/ImapMail/mail.craig.fr/INBOX
13676   3473
.thunderbird/7yh39hls.Test/ImapMail/mail.craig-1.fr/INBOX
13036   3465
.thunderbird/7yh39hls.Test/ImapMail/mail.craig-1.fr/Sent-1
729   421
.thunderbird/7yh39hls.Test/ImapMail/mail.craig-1.fr/INBOX.sbd/geonetwork-users-fr
431   411
.thunderbird/7yh39hls.Test/ImapMail/mail.craig-1.fr/INBOX.sbd/drupal-security
204   197
.thunderbird/7yh39hls.Test/ImapMail/mail.craig-1.fr/INBOX.sbd/ial
332   129
.thunderbird/7yh39hls.Test/ImapMail/mail.craig-1.fr/INBOX.sbd/geonetwork-devel
541   528
.thunderbird/7yh39hls.Test/ImapMail/mail.craig-1.fr/INBOX.sbd/geonetwork-users
542   539
.thunderbird/7yh39hls.Test/ImapMail/mail.craig-1.fr/mapfish-users
406   398
.thunderbird/7yh39hls.Test/ImapMail/mail.craig-2.fr/INBOX
3400   2106

<not pasting the whole list>

Here's the script:

find .thunderbird/7yh39hls.Test/ImapMail/ -name *.msf | while read i ;
do
        b=`echo "$i" | sed -e 's/.msf$//'`;
        if [ -f "$b" ] ; then
        n=`grep -ci ^Message-Id: "$b"`
        n2=`grep -i ^Message-Id: "$b" | sort | uniq | wc -l`
        if [ $n -ne $n2 ] ; then
                echo "$b"
                echo $n " " $n2 ;
        fi
        fi
done
Btw, i've looked at http://kb.mozillazine.org/Duplicate_messages_received and mail.server.default.dup_action setting, but i doubt this is the same, as i don't get dupes in the message list, only in the storage file on the profile.

And i'm not sure setting dup_action to 1 will have a retroactive effect on the messages already stored in profile.
file | compact folders, or right click, compact folder compacts the offline store, and removes the messages that are no longer in the folder from the offline store, i.e., the deleted messages. If some other url is adding stuff to the offline store (e.g., autosync), then compact will fail with the error you saw. I'd suggest trying again.
Ok, something went really bad. I used the Xpunge extension to compact all folders, at some point it crashed (tb had opened too much files) and my profile went from 5Go to 50Go. Own account INBOX file was 2gb, it's now 30gb, and another account INBOX file went from 2gb to 18.4gb.

I'm giving up on that profile, that's just insane. I can keep it for post-mortem analysis, but to get job done i'll start from a fresh one.
We had offline IMAP store corruption for a while (but also fixed a while ago at this point...) which could potentially result in Thunderbird thinking a message's offline store was the size of the entire offline store.  When compacting, you could then end up with the kind of explosion you are talking about.

It sounds like gloda was probably going slow because it had to stream a huge amount of data every time...?
The 30Gb file contains 3470 unique Message-Ids, for a total of 249885
Message-Ids. Oops ?

So, yes i think gloda was so slow because of the huge amount of duplicates. If only Tb stored its offline store as a standard Maildir with a file/mail.. but that's another discussion :)
Landry, asuth, does the modified summary now fully capture the symptoms?  

I'd imagine the underlying cause(s) would need to be fixed. However, is there a sanity check that makes sense to avoid this situation?
Summary: Gloda indexes one message per minute → Gloda indexes one message per minute. Excessively sized global-messages-db.sqlite caused by buggered offline store.
Summary: Gloda indexes one message per minute. Excessively sized global-messages-db.sqlite caused by buggered offline store. → Gloda indexes one message per minute with high CPU. Excessively sized global-messages-db.sqlite caused by buggered offline store.
Summary is good for me..
as for the cause of a buggered offline store.. besides deleting obvious dupes by Message-Ids when recompacting folders, i'm note sure one can make assumptions on 'lets add a sanity check, noone can have a 30Gb mailbox with only 3500 distinct mails'..
protz, andrew, what direction/disposition do you think this should go?
Landry, have you ever experienced this problem again?


(In reply to Landry Breuil (:gaston) from comment #24)
> besides deleting obvious  dupes by Message-Ids when recompacting folders, i'm not sure one can make
> assumptions on 'lets add a sanity check, noone can have a 30Gb mailbox with
> only 3500 distinct mails'..

yeah, we tend to not do/add such checks and attacking "symptoms". preferring instead to catch the causes when then occur, and fix them
Flags: needinfo?(landry)
OS: OpenBSD → All
Summary: Gloda indexes one message per minute with high CPU. Excessively sized global-messages-db.sqlite caused by buggered offline store. → Gloda indexes one message per minute with high CPU. Excessively sized global-messages-db.sqlite caused by buggered, excessively sized offline store.
Version: Trunk → 1.9.2 Branch
The profile i'm running now since two years or so doesnt exhibit the problem with 17.0.6, and its size is 'regular' (ie 4G). Not much duplicates in msf files (at least, not 4x time the real # of msgs..)

I still have the 50Gb profile around if needed for post-mortem analysis. Oh, and i'm running with gloda enabled of course.
Flags: needinfo?(landry)
See Also: → 895516
bug 895516 got duped to bug 484646

(In reply to Landry Breuil (:gaston) from comment #27)
> The profile i'm running now since two years or so doesnt exhibit the problem
> with 17.0.6, and its size is 'regular' (ie 4G). Not much duplicates in msf
> files (at least, not 4x time the real # of msgs..)
> 
> I still have the 50Gb profile around if needed for post-mortem analysis. Oh,
> and i'm running with gloda enabled of course.

Thanks for the offer. Is that 50Gb without the global-messages-db.sqlite?
zipped?
Flags: needinfo?(landry)
See Also: → 484646
$du -s 7yh39hls.Test/* |sort -rn
53386278        7yh39hls.Test/ImapMail
13232   7yh39hls.Test/global-messages-db.sqlite
7148    7yh39hls.Test/extensions

The main size issue was that mails were duplicated dozens of times in the INBOX files....

$du -s 7yh39hls.Test/ImapMail/*/* |sort -rn
31532576        7yh39hls.Test/ImapMail/mail.craig-1.fr/INBOX
18129808        7yh39hls.Test/ImapMail/mail.craig.fr/nstmp-1
814064  7yh39hls.Test/ImapMail/mail.craig-2.fr/Archivage.sbd
738944  7yh39hls.Test/ImapMail/mail.craig.fr/nstmp
721152  7yh39hls.Test/ImapMail/mail.craig-1.fr/nstmp
581800  7yh39hls.Test/ImapMail/mail.craig-2.fr/INBOX.sbd
409088  7yh39hls.Test/ImapMail/mail.craig.fr/INBOX
165696  7yh39hls.Test/ImapMail/mail.craig-2.fr/Sent
161376  7yh39hls.Test/ImapMail/mail.craig-2.fr/INBOX
Flags: needinfo?(landry)
Blocks: 1023000
See Also: → 1406653
See Also: → 585429
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: