Open Bug 226730 Opened 21 years ago Updated 2 years ago

Slow threading when all messages have the same subject, with mail.strict_threading

Categories

(MailNews Core :: Database, defect)

x86
All
defect

Tracking

(Not tracked)

REOPENED

People

(Reporter: aleksander.adamowski, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: perf)

Attachments

(1 file)

This is a follow-up to bug 172786.

Steps to reproduce:
1) Fill a folder with numerous messages with the same subject
2) Turn on threading for this folder

Mozilla starts to consume  all available CPU.

Tracing its process on Linux shows this:

kill(5529, SIGRTMIN)                    = 0
gettimeofday({1069773704, 813034}, NULL) = 0
gettimeofday({1069773704, 814891}, NULL) = 0
gettimeofday({1069773704, 817263}, NULL) = 0
gettimeofday({1069773704, 819174}, NULL) = 0
gettimeofday({1069773704, 820994}, NULL) = 0
gettimeofday({1069773704, 822855}, NULL) = 0
gettimeofday({1069773704, 824665}, NULL) = 0
gettimeofday({1069773704, 826472}, NULL) = 0
gettimeofday({1069773704, 828276}, NULL) = 0
gettimeofday({1069773704, 830056}, NULL) = 0
gettimeofday({1069773704, 831863}, NULL) = 0
gettimeofday({1069773704, 833675}, NULL) = 0
gettimeofday({1069773704, 835475}, NULL) = 0
gettimeofday({1069773704, 837280}, NULL) = 0
gettimeofday({1069773704, 839084}, NULL) = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
write(5, "\372", 1)                     = 1
ioctl(3, FIONREAD, [32])                = 0
read(3, "~\2\t\21ZXA\1\1\20\0\0\20\0\0\0\0\0\0\20\20\20\20\20\0"..., 32) = 32
poll([{fd=3, events=POLLIN}], 1, 0)     = 0
write(3, "\230\4\2\0\0\1\0\1", 8)       = 8
read(3, "\1\1\n\21\0\0\0\0\20\0\0\20\0\0\0\0\0\0\20\0\0\0\0\0\0"..., 32) = 32
ioctl(3, FIONREAD, [0])                 = 0
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN, revents=POLLIN}], 2, 0) = 1
gettimeofday({1069773710, 224690}, NULL) = 0
gettimeofday({1069773710, 224729}, NULL) = 0
gettimeofday({1069773710, 224861}, NULL) = 0
kill(5529, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
kill(5530, SIGRTMIN)                    = 0
This problem occurs on a folder where I have a large number of bounce messages
generated by a mail server that I admin.

Currently the size of that folder is 162 MB, but the problem was significant
even when the folder had ~300 messages (that would amount to a couple of MBs
only) on a relatively fast hardware (Pentium III, 512MB RAM, UDMA 66 hard drive).

See my comments in bug 172786 for details.
I don't know whether this will be useful (the binaries are stripped), but here's
the result of next-stepping and back-tracing the running mozilla-bin process
(the one consuming CPU). I attached to the process by "gdb - PID":

#0  0x4285f900 in NSGetModule () from /usr/lib/mozilla-1.4/components/libmsgdb.so
#1  0x40a8bf96 in PL_DHashTableRawRemove () from /usr/lib/mozilla-1.4/libxpcom.so
(gdb) next
Single stepping until exit from function NSGetModule,
which has no line number information.
0x40a8bf96 in PL_DHashTableRawRemove () from /usr/lib/mozilla-1.4/libxpcom.so
(gdb) bt
#0  0x40a8bf96 in PL_DHashTableRawRemove () from /usr/lib/mozilla-1.4/libxpcom.so
#1  0x09cebd28 in ?? ()
(gdb) next
Single stepping until exit from function PL_DHashTableRawRemove,
which has no line number information.
0x40a8bf0a in PL_DHashTableOperate () from /usr/lib/mozilla-1.4/libxpcom.so
(gdb) bt
#0  0x40a8bf0a in PL_DHashTableOperate () from /usr/lib/mozilla-1.4/libxpcom.so
#1  0x09cebd28 in ?? ()
(gdb) next
Single stepping until exit from function PL_DHashTableOperate,
which has no line number information.
0x4285faf5 in NSGetModule () from /usr/lib/mozilla-1.4/components/libmsgdb.so
(gdb) bt
#0  0x4285faf5 in NSGetModule () from /usr/lib/mozilla-1.4/components/libmsgdb.so
#1  0x09cebd28 in ?? ()
(gdb) next
Single stepping until exit from function NSGetModule,
which has no line number information.
0x40a8c063 in PL_DHashTableEnumerate () from /usr/lib/mozilla-1.4/libxpcom.so
(gdb) bt
#0  0x40a8c063 in PL_DHashTableEnumerate () from /usr/lib/mozilla-1.4/libxpcom.so
#1  0x09cbb498 in ?? ()
(gdb) next
Single stepping until exit from function PL_DHashTableEnumerate,
which has no line number information.
0x4285f550 in NSGetModule () from /usr/lib/mozilla-1.4/components/libmsgdb.so
(gdb) bt
#0  0x4285f550 in NSGetModule () from /usr/lib/mozilla-1.4/components/libmsgdb.so
#1  0x40a8c063 in PL_DHashTableEnumerate () from /usr/lib/mozilla-1.4/libxpcom.so
(gdb) next
Single stepping until exit from function NSGetModule,
which has no line number information.
0x424d9de0 in NSGetModule () from /usr/lib/mozilla-1.4/components/libmork.so
(gdb) bt
#0  0x424d9de0 in NSGetModule () from /usr/lib/mozilla-1.4/components/libmork.so
#1  0x4286e64a in NSGetModule () from /usr/lib/mozilla-1.4/components/libmsgdb.so
thanks, I know exactly what's going on, but it's hard to fix - turning off
threading by subject will speed it up for you.

user_pref("mail.thread_without_re", false);
Assignee: sspitzer → bienvenu
Component: Mail Window Front End → Mail Database
I've set "mail.thread_without_re" to "false" in about:config, but it didn't help
much...

It's Mozilla Mozilla 1.4
More specifically, Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030630.

The change of pref was effective:
user_pref("mail.thread_without_re", false);

But still entering the 162 MB folder caused Mozilla to use all CPU for a few
hours (Athlon 1800 MHz) and I had to leave it obernight to finish that stuff.

Then today in the morning I've selected about 4/5 of messages and pressed DEL.

Mozilla started consuming all CPU again. After about 1 hour I've made a strace,
then closed Mozilla safely.

Strace looked like this:

gettimeofday({1069929043, 591423}, NULL) = 0
gettimeofday({1069929043, 593281}, NULL) = 0
lseek(93, 0, SEEK_CUR)                  = 87017914
lseek(93, 0, SEEK_CUR)                  = 87017914
lseek(93, 0, SEEK_END)                  = 87017914
lseek(93, 87017914, SEEK_SET)           = 87017914
lseek(93, 87017914, SEEK_SET)           = 87017914
gettimeofday({1069929043, 604744}, NULL) = 0
lseek(93, 0, SEEK_CUR)                  = 87017914
lseek(93, 0, SEEK_CUR)                  = 87017914
lseek(93, 0, SEEK_END)                  = 87017914
lseek(93, 87017914, SEEK_SET)           = 87017914
lseek(93, 87017914, SEEK_SET)           = 87017914
write(93, "From - Mon Jul 14 10:11:13 2003\n"..., 2641) = 2641
lseek(93, 0, SEEK_CUR)                  = 87020555
lseek(93, 0, SEEK_CUR)                  = 87020555
lseek(93, 0, SEEK_END)                  = 87020555
lseek(93, 87020555, SEEK_SET)           = 87020555
lseek(93, 87020555, SEEK_SET)           = 87020555
lseek(93, 0, SEEK_CUR)                  = 87020555
kill(16905, SIGRTMIN)                   = 0
write(10, "\372", 1)                    = -1 EAGAIN (Resource temporarily
unavailable)
ioctl(3, FIONREAD, [32])                = 0
read(3, "\34\366\256\353\20\0\300\1\23\1\0\0\227\210\247\5\0\0\0"..., 32) = 32
poll([{fd=3, events=POLLIN, revents=POLLIN}], 1, 0) = 1
ioctl(3, FIONREAD, [32])                = 0
read(3, "\34\0\256\353\20\0\300\1\23\1\0\0\311\210\247\5\1\0\0\0"..., 32) = 32
poll([{fd=3, events=POLLIN, revents=POLLIN}], 1, 0) = 1
ioctl(3, FIONREAD, [32])                = 0
read(3, "\34\366\256\353\20\0\300\1\23\1\0\0\312\210\247\5\0\0\0"..., 32) = 32
poll([{fd=3, events=POLLIN, revents=POLLIN}], 1, 0) = 1
ioctl(3, FIONREAD, [32])                = 0
read(3, "\34\0\256\353\20\0\300\1\23\1\0\0\373\210\247\5\1\0\0\0"..., 32) = 32
poll([{fd=3, events=POLLIN}], 1, 0)     = 0
ioctl(3, FIONREAD, [0])                 = 0
poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN, revents=POLLIN}], 2, 0) = 1
gettimeofday({1069929043, 931590}, NULL) = 0
gettimeofday({1069929043, 933517}, NULL) = 0

Sorry for previous comment, I tested with an olde release...

Fix to bug 172786 changed the behaviour significantly.

Here's what it looks like with the latest nightly build from Nov 25:

Browsing the folder runs acceptably (I do have threading without "Re" set to
false), and when I delete half of messages, Mozilla initially behaves well:

It doesn't stop responding; It shows "Moving messages to Trash" message in the
status bar;

But when the move operation gets finished, it enters an unresponsive state
again. Consumes all CPU, and strace with "-r" (relative timestamps) shows this:


     0.001829 write(42, "X-Mozilla-Status: 0008", 22) = 22
     0.001876 lseek(42, 0, SEEK_CUR)    = 1583644
     0.001815 lseek(42, 0, SEEK_CUR)    = 1583644
     0.001816 lseek(42, 0, SEEK_END)    = 169382599
     0.001806 lseek(42, 1583644, SEEK_SET) = 1583644
     0.001813 lseek(42, 1583623, SEEK_SET) = 1583623
     0.001825 lseek(42, 0, SEEK_CUR)    = 1583623
     0.001777 read(42, "-Mozilla-Status: 0008\nX-Mozilla-"..., 49) = 49
     0.001945 lseek(42, 0, SEEK_CUR)    = 1583672
     0.001818 lseek(42, 0, SEEK_CUR)    = 1583672
     0.001793 lseek(42, 0, SEEK_END)    = 169382599
     0.001808 lseek(42, 1583672, SEEK_SET) = 1583672
     0.001812 lseek(42, 1583645, SEEK_SET) = 1583645
     1.228290 lseek(42, 0, SEEK_CUR)    = 1583645
     0.002306 lseek(42, 0, SEEK_CUR)    = 1583645
     0.001817 lseek(42, 0, SEEK_END)    = 169382599
     0.001817 lseek(42, 1583645, SEEK_SET) = 1583645
     0.001814 lseek(42, 1585981, SEEK_SET) = 1585981
     0.001796 lseek(42, 0, SEEK_CUR)    = 1585981
     0.001809 read(42, "X-Mozilla-Status: 0000\nX-Mozilla"..., 49) = 49
     0.001945 lseek(42, 0, SEEK_CUR)    = 1586030
     0.002010 lseek(42, 0, SEEK_CUR)    = 1586030
     0.001801 lseek(42, 0, SEEK_END)    = 169382599
     0.001809 lseek(42, 1586030, SEEK_SET) = 1586030
     0.001818 lseek(42, 1586004, SEEK_SET) = 1586004
     0.001815 lseek(42, 0, SEEK_CUR)    = 1586004
     0.002219 lseek(42, 0, SEEK_CUR)    = 1586004
     0.002014 lseek(42, 0, SEEK_END)    = 169382599
     0.001812 lseek(42, 1586004, SEEK_SET) = 1586004
     0.001815 lseek(42, 1585981, SEEK_SET) = 1585981
     0.001841 lseek(42, 0, SEEK_CUR)    = 1585981
     0.002065 write(42, "X-Mozilla-Status: 0008", 22) = 22
     0.002042 lseek(42, 0, SEEK_CUR)    = 1586003
     0.001825 lseek(42, 0, SEEK_CUR)    = 1586003


As visible, occasional lseek()'s take over 1 second to complete.
when you delete messages from a local folder, we have to go mark the messages
deleted in the local mail folder, adjusting the x-mozilla-status line of each
message to indicate that it has been deleted. This involves seeking to each
message and rewriting the x-mozilla-status line, as you can tell from your log.
It's not really mozilla's fault if an individual seek take a second, but I'm not
sure that it's really the seek that's taking a second. I suspect what's
happening is that for every 300 messages deleted, we're committing the .msf file
and that's taking close to a second. I can't tell how many messages you're
deleting, but if it's 1000, for example, you'll have 3 db commits. We were
trying to keep the db in sync, but in this particular example, it's the wrong
thing to do, and I'll try to attach a patch to fix it.
Status: NEW → ASSIGNED
The long seeks occured more frequently. Something around every 5-th - 10th
write, a 1 second pause.

But this might really be the fault of operating system, since when the folder
size dropped after successful delete (I've let it finish its work...) now there
are no pauses. It has 120 MB and about 70000 messages and strace doesn't show
delays when I further delete large amount of messages. Consumes all CPU, but
finishes in a justifiable amount of time.

Maybe I should mark this bug as depending on bug 58508...
On the other hand, this bug is really about slow threading by same subject, not
deletion.

So ignore the stuff about slow deletion. 
I've just discovered one more performance problem with the big folder, this time
related to marking messages as read. Filed a new bug 227011.
May be important:
I still have this pref set to false:

user_pref("mail.thread_without_re", false);

And all messages in the folder have tha same subject, but it doesn't start with
"Re:".

So Mozilla doesn't thread them (or at least isn't supposed to).
The fix for bug 227011 improved responsiveness an performance of operations on
large quantities of messages considerably.

However, this issue (threading large quantities of messages) is still present.

Testing with build 2003120209, I've discovered that when threading without "Re"
is off, building summary for a folder works fine, progressing at a constant rate.

When threading is on, strace shows that a certain system call gets longer and
longer to execute, its time of execution grows together with the number of
messages for which summary waas already built.

The syscall looks like this:

write(5, "\372", 1)       = 1

I've checked file descriptor number 5 of that mozilla process with "lsof -d 5 |
grep moz" - it is some sort of a FIFO pipe.

When building of summary file is at the beginning, this syscall takes about 0.1
second to complete.

When summary for 2000 messages is completed, this syscall takes 0.7 second to
complete.

When summary for 4000 messages is completed, this syscall takes 1.2 second to
complete.

When summary for 6000 messages is completed, this syscall takes 1.7 second to
complete.

When summary for 8000 messages is completed, this syscall takes 2.1 second to
complete.

When summary for 10000 messages is completed, this syscall takes 2.7 second to
complete.

This is exactly linear growth of execution time of that syscall proportional to
number of messages summarized so far.
This is a strace of Mozilla process (latest build 2003120209 with fix for bug
227011) doing the rebuild of summary file for big folder.

Strace was started before entering that folder and interrupted when rebuilding
started to progress very slowly.

It's compressed with bzip2.
Comment on attachment 136719 [details]
Strace of Mozilla process

Threading was _on_ during this strace.
Blocks: 236849
Bug 159660 and bug 90452 might be related.
Product: MailNews → Core
*** Bug 262843 has been marked as a duplicate of this bug. ***
> Do you still see this problem given that bug 90452 is fixed ?

Aleksander on 11/7/2006: ... threading a folder containing 4300 messages with the same subject took about 30 seconds on a 2.80 GHz P4, so it's still quite slow. 



QA Contact: esther → database
Reporter is gone. Perhaps fixed by bug 90452 2005-07-27
=> incomplete
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → INCOMPLETE
I'm not gone.

The problem is still present on Thunderbird 2.0.0.14 (20080505).

On a folder with 5941 messages, 5829 of which have the same subject ("NOTICE: mail delivery status."), when I click the thread button, Thunderbird starts consuming all available CPU for about 2 minutes. The CPU is a 1.73GHz Intel Pentium M. There are 2GB of RAM available. There's no excessive RAM, disk or network usage - only CPU.

During that time, Thunderbird's window doesn't repaint, and the application appears frozen - a typical user would likely kill it before it finishes its work.

When it finishes the folder view looks normal, and the 5829-long thread is rendered correctly.

Collapsing the thread works instantly, but expanding it back repeats the process of 2-minute freeze.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
that problem is about expanding a large thread, and I don't think it has to do with the subjects being the same at all - I think it's simply a problem expanding huge threads.
Product: Core → MailNews Core
(In reply to comment #19)
> that problem is about expanding a large thread, and I don't think it has to do
> with the subjects being the same at all - I think it's simply a problem
> expanding huge threads.

Is there a different bug for "expanding large thread" performance?  (there may be, but I'm not finding one)
Keywords: perf
(In reply to comment #1)
> This problem occurs on a folder where I have a large number of bounce messages
> generated by a mail server that I admin.
> 
> Currently the size of that folder is 162 MB, but the problem was significant
> even when the folder had ~300 messages (that would amount to a couple of MBs
> only) on a relatively fast hardware (Pentium III, 512MB RAM, UDMA 66 hard drive).
> 
> See my comments in bug 172786 for details.

not resolved?
Presently, I have no idea. I've reported the bug over 6 years ago, lots have changed. I'm unable to test it currently, since I have a different job and don't use IMAP mail anymore. If someone's still interested in fixing it (if not done already), the method to reproduce is to flood the mailbox with lots of identical small messages, then some amount of other messages, then trying to thread the message list.

It there's noone interested, I have nothing against closing it with WORKSFORME (an additional resolution code: "OLD" might be handy in Bugzilla...).
Problem cited in bug summary can be observed with local mail folder too.
See bug 452221 comment #0 and description on attached test mails.
However, it may depends on user's preference choice. As default prefs.js setting around threading was changed by Tb 3(mail.strict_threading,  mail.thread_without_re, ...), prefs.js setting change may be needed to observe phenomenon with test mails attached to bug 452221.
Assignee: dbienvenu → nobody
Summary: Slow threading when all messages have the same subject → Slow threading when all messages have the same subject, with mail.strict_threading
this wouldn't be linux only
OS: Linux → All
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: