Closed Bug 562593 Opened 14 years ago Closed 14 years ago

Thunderbird very slow to index, using a whole core of CPU.

Categories

(MailNews Core :: Database, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: marty, Unassigned)

Details

(Keywords: perf)

Attachments

(1 file, 5 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-GB; rv:1.9.2.3) Gecko/20100401 SUSE/3.6.3-1.2 Firefox/3.6.3
Build Identifier: Mozilla/5.0 (X11; U; Linux x86_64; en-GB; rv:1.9.1.9) Gecko/20100317 SUSE/3.0.4-2.3 Thunderbird/3.0.4

After a weekend or when I've been away Thunderbird seems to choke on downloading, filtering and indexing the backlog of emails.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7382 marty     20   0  803m 323m  25m R   93  8.5  19:30.59 thunderbird-bin

Most of the time it will complete if left long enough. Sometimes the UI hangs completely even though it still appears to be indexing in the background.

I'll attach an strace of the process.

Reproducible: Always

Steps to Reproduce:
1. Open Thunderbird when there is a large backlog of mail to download, index and filter.
2. Thunderbird will report "Indexing # of $ messages in Inbox" and make extremely slow progress.
3. Indexing will eventually complete if Thunderbird does not crash in the meantime.
Actual Results:  
Indexing will eventually complete if Thunderbird does not crash in the meantime.

Expected Results:  
Complete indexing and filtering in a reasonable time frame.
Component: General → Database
Keywords: perf
Product: Thunderbird → MailNews Core
QA Contact: general → database
Andrew do you need something else besides a strace log ?
Please try a 3.1 series nightly / beta 2 build; can't do anything for 3.0.x.
This is a work machine so I am trying to stick with OpenSUSE repos.

The latest build in mozilla:beta is 3.1b1 which I am now using. If the problem persists in b1 I'll install and try b2.
That's fine, but the fixes I am thinking of are in beta 2, so this might need to get marked WFM until you are on current code.
Hi Andrew

I upgraded to 3.1b2 late last week.

Mozilla/5.0 (X11; U; Linux x86_64; en-GB; rv:1.9.2.5pre) Gecko/20100430 SUSE/3.1b2-7.1 Lightning/1.0pre Thunderbird/3.1b2

I had about 530 new messages to filter and index, which took three and a half minutes chewing up an entire core.

I have an strace which I'll attach. Let me know anything else you need.
(In reply to comment #6)
> Hi Andrew
> 
> I upgraded to 3.1b2 late last week.
> 
> Mozilla/5.0 (X11; U; Linux x86_64; en-GB; rv:1.9.2.5pre) Gecko/20100430
> SUSE/3.1b2-7.1 Lightning/1.0pre Thunderbird/3.1b2
> 
> I had about 530 new messages to filter and index, which took three and a half
> minutes chewing up an entire core.

This is certainly within an order of magnitude of what I would consider respectable performance without strace attached.  I am assuming this is a ptrace-based strace, which is on the pricey end of the performance spectrum, especially when it's catching futexes.  (Systemtap's strace-alike would be cheaper.)

> I have an strace which I'll attach. Let me know anything else you need.

Honestly, strace is not particularly useful for this class of problem since there is so little that makes it down to the system call level.  Using something like 'perf' to get call-stacks assuming debug symbols are available to translate the call-stacks would be most useful, but...

The most useful thing for me to know is whether Thunderbird was generally responsive during this processing.  The indexer's goal is to use as much CPU as possible without interfering with the UI or other apps running on the system and never to lock up the UI.  If you have multiple cores that are not fully taxed, it will probably never decide to back off due to contention with other applications.
The UI remains responsive. I can interact with it and even switch folders (although the results of the filtering are not visible till right at the end).

This is a dual core machine, the other core remains mostly idle. All other applications remain responsive and usable.
Ok, ran perf with a weekend's worth of email backlog. 470 messages to download, filter and index.

I ran 'time perf record -f -- thunderbird' and closed thunderbird as soon as the status bar cleared of any filtering or indexing messages. This coincided with it no longer using a full core.
Attachment #442340 - Attachment is obsolete: true
Attachment #444361 - Attachment is obsolete: true
Dang, I just realized you used a Suse build after figuring out how to actually get the symbols off our symbol server for our own linux builds.  I can confidently tell you that if you had been using an official 32-bit linux build that your problem would have been the spell checker*.  (Which would actually make sense, because that sucker is inefficient.)

You'll need to one of the following for me to follow this up more:

a) Grab the debug symbols for your build and figure out what 0x0000000079f667 corresponds to in thunderbird-bin and tell me.

b) Install the debug symbols package for your build of thunderbird and re-run.  perf should pick up on the debug symbols and be able to translate the symbols when it does its output.

c) Use an official mozilla build for which our symbol server has symbols.

Unfortunately I have no idea how you install debug symbols on Suse.  On fedora there's a 'debuginfo-install' helper you can use which is straightforward and easy (and gdb will even tell you about it!); on Ubuntu you have to add a special debug package repo and look for a package suffixed with -dbginfo or something like that.

Thanks for your persistence in trying to provide useful performance feedback!

* Actually, that address was in the middle of an opcode, so it made no sense, but hey.
Heh, sorry for wasting your time like that.

Have installed the -debuginfo package and re-run the report.
Attachment #445652 - Attachment is obsolete: true
What you attached is missing the top two samples from your version that didn't have symbols (39.72% and 6.26%), any idea where those went?
No idea. It's using the same perf.data file. Perhaps I accidentally ran record again?!?

BTW, any guidance on what's the best options to use against 'perf record' and 'perf report'?

I used 'time perf record -f -- thunderbird' and 'perf report --sort comm,dso,symbol | head -30'
Ok, manully looked up.

Reading symbols from /usr/lib/debug/usr/lib64/thunderbird/thunderbird-bin.debug...done.
(gdb) info symbol 0x0000000079f667
nsDocumentEncoder::EncodeToString(nsAString_internal&) + 397 in section .text


Reading symbols from /usr/lib/debug/usr/lib64/thunderbird/libxpcom_core.so.debug...done.
(gdb) info symbol 0x0000000006a607
nsManifestLineReader::NextLine() + 31 in section .text
The number of samples used in the report is the same though, which I would expect to be a unique value with unless it's just the number of records that fit in the buffer's default size.

It's possible that when it doesn't have debug symbols it just attributes things to the preceding public symbol and so lots of different samples got binned into one public symbol that precedes a huge block of code with no public symbols.  Then when it has the detailed symbols, it is able to properly resolve things.  This seems pretty reasonable, especially as there is no way that 39.72% was all on one single instruction...

I would drop the "head -30" for this report, it probably turns out things are smeared all over the place.

For future perf runs, you probably want to add "-g" to both options.  This will gather a call-stack (and graph it) as opposed to just the single currently active instruction.  The downside is it is not super cheap and if Suse builds with -fomit-frame-pointer, it might not work.
I'm deeply confused by what your manual lookups turned up, but I am impressed by your technical prowess!
I'll attach the full 'perf report' output but unless it says something interesting leave the ball in my court and I'll get you another report with the -g flag.

Oh, and make no mistake I am way out of my depth as a sysadmin / network op! Google is my friend. :-)
Attachment #445661 - Attachment is obsolete: true
Ok, this mornings backlog. 282 messages to download, filter and index.

real	2m15.024s
user	0m0.018s
sys	0m0.337s

Funnily enough the status bar reported moving over 480 messages from Inbox to a sub-folder which I find hard to believe seeing as there are only 181 messages in there now.

I've noticed that the graph output in this report has a lot of unresolved symbols. I tried installing the debuginfo for the kernel and glibc but it doesn't appear to have fixed the issue.
Attachment #445672 - Attachment is obsolete: true
(In reply to comment #22)
> Funnily enough the status bar reported moving over 480 messages from Inbox to
> a sub-folder which I find hard to believe seeing as there are only 181 messages in there now.

If you enable Glada's debugging function, you can see Gloda's activity by log for "when/which mail is indexed" etc.
> https://wiki.mozilla.org/Thunderbird:Debugging_Gloda
Log example attached to bug 559022 : attachment id=442634
If you install add-on of GlodaQuilla, you can add ondisk, gloda id, gloda dirty columns to thread pane.
> https://addons.mozilla.org/en-US/thunderbird/addon/9873/
Just got my attention drawn back to this by a cc event...

I'm going to mark this WFM as the original report had two issues reported:
a) UI locks up while indexing
b) uses one CPU core a lot

'a' is a serious problem, but likely was caused by 3.0 running with non-truly-async mozStorage and is no longer apparently an issue.  'b' is not really considered a problem unless the indexing is super ridiculously slow or aggressively interferes with other processes on the machine, and based on the numbers provided, that doesn't really seem to be the case.  (A lower bound of 3.76 messages indexed per second with perf active is not horrible.)  I think I got slightly off track when perf entered the scene; based on the most recent log, I don't think further use of perf would gain any interesting new facts.  Something like my systemtap work would be required for better prioritization but the payoff for trying to use it is not there.

Thank you for all your effort with perf though!  We have not given up on improving the performance, but we are now much closer to being able to run indexing JITting the JavaScript in releases that are not all that far off, which changes the efficiency hot spots considerably.  (Although there is still the issue of measuring them effectively...)

If 'a' is still happening, please do reopen/request reopening.
Status: UNCONFIRMED → RESOLVED
Closed: 14 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: