Open Bug 585429 Opened 14 years ago Updated 1 year ago

Global indexing slows because of high system I/O rate, slow I/O, or lack of multicore support (antivirus, slow disk, disk errors, etc) [was Significantly speed up global indexing (gloda)]

Categories

(MailNews Core :: Database, defect)

defect

Tracking

(Not tracked)

People

(Reporter: michel, Unassigned, NeedInfo)

References

(Blocks 1 open bug)

Details

(Keywords: perf, reproducible, Whiteboard: [patchlove][ref comment 7, comment 27])

User Story

(aceman impacted)

worth noting the blockers to bug 1023000 - https://mzl.la/2C5Wr3w

Attachments

(1 file)

User-Agent:       Opera/9.80 (Windows NT 6.0; U; de) Presto/2.6.30 Version/10.60
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.0; de; rv:1.9.2.8) Gecko/20100802 Thunderbird/3.1.2

The global indexer seems to become very, very slow when indexing a large number of messages (i.e. more than 10.000). It even becomes slower and slower the more messages it has indexed.

I tracked this problem down to sqlite transactions. I have about 36.000 messages in my folders. After indexing, my sqlite database is about 350 MB large. Indexing takes several days on my dual core computer!

The farther the indexing went on, the less messages it could index at a time and so the more sqlite transactions it had to make. For a database of several hundred megabytes, committing the database takes almost more than one second. This explains why indexing is so slow.

I found out that the global indexer in "indexer.js" has a maximum number of tokens to index at a time "_CPU_MAX_TOKENS_PER_BATCH", but it has no minimum value. Due to some glitch in the algorithm (I could not find out why) the number of tokens decreases the more messages you want to index. So, I propose to introduce a minimum number of tokens as well ("_CPU_MIN_TOKENS_PER_BATCH").

I will add a patch to this bug in a few seconds. I tested this patch on my system and indexing became about 10 times faster! I hope that it will be included in the next release because I believe that a lot of people are suffering from this problem.

Reproducible: Always

Steps to Reproduce:
1. Have more than 10.000 messages
2. Start Thunderbird and wait for it to index all messages :-)
Actual Results:  
Indexing is very very slow (several days)

Expected Results:  
Indexing is fast (several hours)
Assignee: nobody → michel
Component: General → Search
QA Contact: general → search
Attachment #463912 - Flags: review?(bugmail)
Severity: enhancement → normal
Keywords: perf
To get some idea what these values mean in practical terms, why is the
optimum range considered to be within MIN=100/MAX=200 and some other range?
Does this batch size relate to the syncing rate of index changes to disk,
or some other activity?
> and some other range?
"and not some other range?"
I think your patch doesn't work because you have defined a minimum _indexTokens, but because your have doubled the maximum.
Component: Search → Database
Product: Thunderbird → MailNews Core
QA Contact: search → database
Version: unspecified → Trunk
Comment on attachment 463912 [details] [diff] [review]
Introduces a minum token count to speed up indexing.

Making the old maximum the new minimum is going to speed things up, but we back off on indexing for good reasons.

If every SQLite transaction is taking several seconds, something may be wrong with your I/O subsystem.  A really slow hard disk (possibly impacted by fragmentation), failing hard disk, or ridiculously overzealous and inefficient virus checker are among the more likely options that come to mind.

Slow I/O could cause us to decrease the tokens we use per unit time slice.  Since our SQLite commits are time-based rather than token based, I could see how this would net out to really slow indexing.

If you could provide info on what anti-virus/anti-malware software you have active as well as general info on the hard disk (RPMs, drive utilization), that would be useful.  If you have a SMART tool that can tell you if the drive or the communications with the system are having issues, that might also be of interest.  (Although SMART is frequently confusing, especially in the ECC error stuff because manufacturers frequently pack two counters into a single integer so it looks like everything is on fire and exploding when in fact everything is fine.)
Attachment #463912 - Flags: review?(bugmail) → review-
(In reply to comment #5)
> If every SQLite transaction is taking several seconds, something may be wrong
> with your I/O subsystem.  A really slow hard disk (possibly impacted by
> fragmentation), failing hard disk, or ridiculously overzealous and inefficient
> virus checker are among the more likely options that come to mind.

I did some performance tests and you're right. The main problem is my virus scanner (Microsoft Security Essentials). Here are my results for indexing 1835 messages on my laptop (Intel Core 2 Duo T7500, 2.2 GHz, 2 GB RAM, Vista x86):

Without patch:
With antivirus:    111s
Without antivirus:  47s

With patch (min: 100 / max: 200):
With antivirus:     39s
Without antivirus:  33s

You can see that the patch increases performance significantly when the scanner is enabled. Even without the scanner the performance is slightly better.

(In reply to comment #4)
> I think your patch doesn't work because you have defined a minimum
> _indexTokens, but because your have doubled the maximum.

Fair comment. I did another test with lower values:

With patch (min: 50 / max: 100):
With antivirus:     42s
Without antivirus:  42s

This did not increase time very much but it is still faster than the original indexer.

From these tests I would conclude that a minimum number of tokens is really necessary. You can never know what antivirus software the average user has installed or how much his hard drive is fragmented or something. The minimum number of tokens leads to a more stable performance. As I said before, on my other system with 36,000 messages the number of tokens dropped down to 2 per transaction. This would be even too low if I disabled my virus checker. And apart from that, I think a minimum number of 50 tokens per transaction is a reasonable value on an average system.
(In reply to comment #6)
> I did some performance tests and you're right. The main problem is my virus
> scanner (Microsoft Security Essentials). Here are my results for indexing 1835
> messages on my laptop (Intel Core 2 Duo T7500, 2.2 GHz, 2 GB RAM, Vista x86):

That is likely to be a pretty common virus checker, but...

> Without patch:
> With antivirus:    111s
> Without antivirus:  47s

111s for 1835 messages is entirely reasonable.  That's 16.5/second.
 
I realize you're saying things actually go downhill from here, but I just want to be clear that this is considered an acceptable rate of indexing.  I also recognize that it sounds like on your system we could be a lot more aggressive about indexing and/or better about not triggering the anti-virus without interfering with your computer and Thunderbird usage.

> apart from that, I think a minimum number of 50 tokens per transaction is a
> reasonable value on an average system.

Performance on less powerful systems than yours (ex: single core) was unacceptable before the adaptive logic was added.  That the algorithm decides to drop from 16.5/sec to 2/sec is a problem, but forcing an indexing rate of 43.7 messages/second is not a practical solution.

I really appreciate your effort at looking into this.  If you are interested in looking at it more, that would be fantastic, as it will be some time before I am able to focus a lot of attention on this.  I do have time to help with your investigation if you are willing, though.

What I think would be most useful to see is the change in the indexing rate over time and the inputs that go into it.  If you look at the code, you will see we attempt to detect load by sleeping and seeing when we are actually woken up again with the attempt of detecting when there is other code on the system that is probably more important than us.  Logging:
- how long we think we slept for over time as well as the indexing rate
- how long it took for the database commit to happen when we triggered it
seems like a good first step and should hopefully illustrate the rate at which things go south and perhaps why they go south.

I should note that we definitely do not detect when people have multicore systems and we likely could change our behaviour to be more aggressive in that case.  When the code was authored we did not have jsctypes available; that might be one way to let us 'juice' our indexing rate on more powerful systems.  Having said that, I'd still really like to first fix the deficiency that is leading our algorithm to back off too much before improving the rate since the multicore detection might hide the problem on your system.
Thanks for your comments!

I completely agree that 16/sec is a reasonable value, but the situation on my system leads to a lower value over time and so the number of transaction commits increases. You say that before you implemented the adaptive algorithm, Thunderbird was unusable on single core systems. I can report that this large number of transaction commits even makes Thunderbird on my dual core system become completely unresposive (due to IO activity I guess).

I believe there is a break-even point where one transaction commit takes more time than message indexing which then makes the algorithm use a even lower value which then leads to more transaction commits... So, the snake's biting its own tail.

There should definitely be a minimum value (let's say 10/sec) to avoid this problem. Maybe the algorithm can even be adapted to find that break-even point and hence the optimal number of tokens per second. Therefore it would also have to consider the time elapsed during transaction commits.

What do you think? Can this easily be implemented?
(In reply to comment #8)
> I can report that this large
> number of transaction commits even makes Thunderbird on my dual core system
> become completely unresponsive (due to IO activity I guess).

Is this indicated in the MS-Windows UI?  
Do you see "(not responding)" in the window title?
(In reply to comment #8)
> Therefore it would also have
> to consider the time elapsed during transaction commits.
Nevermind. I just figured that transactions will only be committed every 5 seconds. So, where else can the virus scanner interfere?
(In reply to comment #9)
> Is this indicated in the MS-Windows UI?  
> Do you see "(not responding)" in the window title?
No. It's only that after I select a message it takes several seconds for it to show the body. Or after I pressed the delete button it also takes several seconds until the message disappears. After indexing is finished, everything runs fine.
Maybe the virus checker is forcing all of our process's I/O access to be serialized while it is being crazy, causing Thunderbird's main thread to block on I/O.

One possible explanation for things getting constantly slower is that the virus checker scans the global-message-db.sqlite file every time we commit a transaction, and that causes it to take more time each and every time.

Perhaps you could use Process Monitor (the sysinternals utility that replaced filemon) to see what thunderbird and the virus checker get up to?

http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx
Thanks for your help. I checked process activity with Process Monitor, but the virus checker only checks "global-messages-db.sqlite" and "global-messages-db.sqlite.journal". If I exclude either one of these files from the scan process, then the indexer runs a lot faster.

Meanwhile I'm back at my machine with 36,000 messages and I made some interesting observations: as I said above, Thunderbird becomes completely unresponsive on that machine during indexing (that means, that opening a mail takes several seconds, the program reacts very very slow). I found out, that if I disable the pause in indexer.js (I commented out the for loop) then everything runs blazingly fast and even mails can be opened within milliseconds.

I realize that this loop has been added to improve performance on some systems, but on my system it slows down everything. It seems like the loop also pauses the UI thread! I believe that your idea to make the algorithm detect dual core systems is a good approach.
I assume you are talking about the loop that has "yield this.kWorkPause" in it.  It would be deeply disturbing if that somehow caused the UI thread to block; that it slows down indexing is not surprising as that is the heart of the throttling mechanism.

I wonder if the virus checker's activity is being billed to our process which triggers us to repeatedly pause even though our pause is returning promptly.
(In reply to comment #14)
> 
> I wonder if the virus checker's activity is being billed to our process which
> triggers us to repeatedly pause even though our pause is returning promptly.

Bienvenu ^ thoughts?

I believe this is more than possible. (I seem to have read some reputable example)
(In reply to comment #15)

> 
> Bienvenu ^ thoughts?

Yes, I believe that's pretty likely.
do we need to hit up the Microsoft Security Essentials folks?
Probably not; the next step would likely be for someone to run Xperf on windows with security essentials installed to figure out exactly what is going on and to establish a baseline.  This is not on my short term radar, but performance, especially with MS security essentials active, is definitely on the long term radar.  Anyone stepping up in the meantime would be greatly appreciated.
michel, are you able to do the Xperf per Andrew's thinking?

> ... the next step would likely be for someone to run Xperf on windows
> with security essentials installed to figure out exactly what is going on and
> to establish a baseline.
Sorry, but I disabled this feature since it made my Thunderbird completely unusable. I don't think this is due to the virus checker only. As I said above, there also seems to be a problem with the algorithm that adapts itself to the system's speed. It does not work very well on multi-core systems.

Michel
It seems I have hit this too. I have an artificially created test folder with 600 000 messages comprising 2,1GB. When indexing it all it gradually slows down, I am now down at 6/msgs per progressbar in Activity manager refresh.
I am on linux, with no antivirus.

I'll try the patch.
aceman, did you have any success?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Thanks for reminding me. I tried it once but can't say if it helped. I think it updated the progress in bigger chunks (more messages) but it also took longer between updates. I'll need to consolidate my test system (I am exchanging HW) and then I can test this more. I need to find out the msg/sec ratio and of it impacts responsiveness of TB in any way.
Blocks: 1023000
(In reply to :aceman from comment #21)
> It seems I have hit this too. I have an artificially created test folder
> with 600 000 messages comprising 2,1GB. When indexing it all it gradually
> slows down, I am now down at 6/msgs per progressbar in Activity manager
> refresh.
> I am on linux, with no antivirus.
> 
> I'll try the patch.

Do you still see this slowdown of 6msgs/iteration (I see 60)
And what were you able to determine relative to comment #23?
Flags: needinfo?(acelists)
Summary: Significantly speed up global indexing (gloda), patch included → Global indexing slows down over time as more messages are indexed [was Significantly speed up global indexing (gloda), patch included]
(In reply to Michel Krämer from comment #0)
> The global indexer seems to become very, very slow when indexing a large number of messages 
> (i.e. more than 10.000). It even becomes slower and slower the more messages it has indexed.
> ... I have about 36.000 messages in my folders. After indexing, my sqlite database is about 
> 350 MB large. Indexing takes several days on my dual core computer!

Michel, 
I failed to notice a 350MB index for only 36k messages seems excessive. My index is 366mb for a hundred thousand messages.  Perhaps this is a factor in your performance issues.  

Also note
1. (along the lines of my questions above) we have learned that emails that contain things like server logfiles are known to cause problems
2. Since you filed this bug, other serious performance issues have been diagnosed, like https://bugzilla.mozilla.org/show_bug.cgi?id=1116055 - affecting pop accounts and network file shares
3. External factors like Windows Search enabled in Thunderbird can cause high overhead. 

Do you have many messages that are not "normal correspondence"?  (#1 above)
Are your accounts imap or pop?  (#2)
Do you fit any other items?


(In reply to Michel Krämer from comment #20)
> Sorry, but I disabled this feature since it made my Thunderbird completely
> unusable. I don't think this is due to the virus checker only. As I said
> above, there also seems to be a problem with the algorithm that adapts
> itself to the system's speed. It does not work very well on multi-core
> systems.

Given my points above, you may be correct about AV not being the only factor. But I doubt having a multicore system contributes to the issue. And we consistently see badly behaving AV systems having serious impacts for Thunderbird.

Have you uncovered additional useful information since your comment?


Regardless of all of the above - a patch for better performance that accounts for newer multicore systems would be great. I'm not sure what that patch would look like. Maybe increase the min and max by 1.5x?  Are you currently running a patch?
Flags: needinfo?(michel)
> When the code was authored we did not have jsctypes available; that might be one way to let us 'juice' our indexing rate on more powerful systems. 

Andrew, can you elaborate?
Flags: needinfo?(bugmail)
(In reply to Wayne Mery (:wsmwk, use Needinfo for questions) from comment #26)
> > When the code was authored we did not have jsctypes available; that might be one way to let us 'juice' our indexing rate on more powerful systems. 
> 
> Andrew, can you elaborate?

I meant we could use jsctypes to call platform APIs without having to add new C++ code.

But it turns out there's already a platform way to get at this info: Services.sysinfo from Services.jsm.  There's some info at https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interface/NsIPropertyBag and https://developer.mozilla.org/en-US/docs/Mozilla/JavaScript_code_modules/Services.jsm.

But it looks like one really needs to consult the source at https://dxr.mozilla.org/mozilla-central/source/xpcom/base/nsSystemInfo.cpp to find cool strings

For example, if I do these things from the browser console, I get these results on my linux machine, and these are right answers:
- Services.sysinfo.getProperty("cpucount") => 8
- Services.sysinfo.getProperty("cpucores") => 4
Flags: needinfo?(bugmail)
(In reply to Wayne Mery (:wsmwk, use Needinfo for questions) from comment #25)
> (In reply to Michel Krämer from comment #0)
> > The global indexer seems to become very, very slow when indexing a large number of messages 
> > (i.e. more than 10.000). It even becomes slower and slower the more messages it has indexed.
> > ... I have about 36.000 messages in my folders. After indexing, my sqlite database is about 
> > 350 MB large. Indexing takes several days on my dual core computer!
> 
> Michel, 
> I failed to notice a 350MB index for only 36k messages seems excessive. My
> index is 366mb for a hundred thousand messages.  Perhaps this is a factor in
> your performance issues.  
> 
> Also note
> 1. (along the lines of my questions above) we have learned that emails that
> contain things like server logfiles are known to cause problems
> 2. Since you filed this bug, other serious performance issues have been
> diagnosed, like https://bugzilla.mozilla.org/show_bug.cgi?id=1116055 -
> affecting pop accounts and network file shares
> 3. External factors like Windows Search enabled in Thunderbird can cause
> high overhead. 
> 
> Do you have many messages that are not "normal correspondence"?  (#1 above)
> Are your accounts imap or pop?  (#2)
> Do you fit any other items?
> 
> 
> (In reply to Michel Krämer from comment #20)
> > Sorry, but I disabled this feature since it made my Thunderbird completely
> > unusable. I don't think this is due to the virus checker only. As I said
> > above, there also seems to be a problem with the algorithm that adapts
> > itself to the system's speed. It does not work very well on multi-core
> > systems.
> 
> Given my points above, you may be correct about AV not being the only
> factor. But I doubt having a multicore system contributes to the issue. And
> we consistently see badly behaving AV systems having serious impacts for
> Thunderbird.
> 
> Have you uncovered additional useful information since your comment?
> 
> 
> Regardless of all of the above - a patch for better performance that
> accounts for newer multicore systems would be great. I'm not sure what that
> patch would look like. Maybe increase the min and max by 1.5x?  Are you
> currently running a patch?

Wayne,

I turned off global indexing years ago. I have learned to live without it ever-since. The quick filter feature is even more precise so I don't need the global index anymore.

Cheers,
Michel
Flags: needinfo?(michel)
(In reply to Michel Krämer from comment #28)
> ...
> I turned off global indexing years ago. I have learned to live without it
> ever-since. The quick filter feature is even more precise so I don't need
> the global index anymore.

Fair enough. And thanks for the reply.

Can you still answer these few questions please so we can at least compare other known problem situations to yours? (I'd like to know whether your situation is a different animal than those we already know about.)

From the time period of your report...
Were there many messages that are not "normal correspondence", like server log files?  (#1 above)
How many imap accounts?  And how many pop?  (#2)
Is (or was it ever) Windows Search enabled in Thunderbird at tools | options | advanced | general?
Are/were your thunderbird profile and account directories on local disk, and in the default locations?

Thanks
Flags: needinfo?(michel)
Michel no longer uses Thunderbird
Assignee: michel → nobody
User Story: (updated)
Flags: needinfo?(michel)
Whiteboard: [patchlove]
Summary: Global indexing slows down over time as more messages are indexed [was Significantly speed up global indexing (gloda), patch included] → Global indexing slows because of high system I/O rate, slow I/O, or lack of multicore support (antivirus, slow disk, disk errors, etc) [was Significantly speed up global indexing (gloda)]
Whiteboard: [patchlove] → [patchlove][ref comment 7, comment 27]

(In reply to Wayne Mery (:wsmwk) from comment #24)

Do you still see this slowdown of 6msgs/iteration (I see 60)
And what were you able to determine relative to comment #23?

Yes, it is glacially slow (3 msgs/sec) even when indexing other folders (like 10000 msgs) while the gloda.sqlite file is only 130MB.
There is no I/O to be mentioned and TB is using 100% of 1 CPU core.

The patch only causes the work to be updated in the Activity manager in larger chunks, but the overall speed isn't visibly changed.

Flags: needinfo?(acelists)
Keywords: reproducible
See Also: → 632791, 518465
Severity: normal → S3

(In reply to :aceman from comment #31)

(In reply to Wayne Mery (:wsmwk) from comment #24)
...
The patch only causes the work to be updated in the Activity manager in larger chunks, but the overall speed isn't visibly changed.

Not great, but it would still be a win if overall CPU is reduced. Do we know if that happens?

Can you create a performance profile? https://support.mozilla.org/en-US/kb/profiling-thunderbird-performance

Flags: needinfo?(acelists)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: