Open Bug 1553439 Opened 5 years ago Updated 2 years ago

Gloda indexing options/settings to select specific fields (ex. exclude message body)

Categories

(MailNews Core :: Database, enhancement)

Tracking

(Not tracked)

UNCONFIRMED

People

(Reporter: cra3yk, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: perf)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36

Steps to reproduce:

Request for options (GUI) or config settings to choose which fields GLODA indexes.

GLODA currently indexes everything (headers such as from/to and subject, plus the message body). For some users with a large amount of mail, I need to limit GLODA to indexing only the from, to and subject fields, to speed up the indexing process and to make global-messages-db.sqlite smaller.

Gloda already limits itself to the first 20k of each message, and excludes quoted material. I doubt we would allow user control in the areas you suggest. But it sounds like simple filtering would do what you want, in which case you could disable gloda completely.
Also, what you have written suggests you perceive a performance issue. Is that correct? And if so, please describe this in detail, plus how big is global-messages-db.sqlite ?

Type: defect → enhancement
Flags: needinfo?(cra3yk)

Hello

About 350 megabytes. Simple filtering works on only one folder at a time (you select a folder and enter a query in the form field), but users want to find something in several folders quickly (based on subject/from/to only), so the ability to select what GLODA indexes would be a very useful feature.

PS: With such a 350 megabyte database, when someone moves a mail message from one folder to another, GLODA needs a long time to reflect the change (it varies, from 10 minutes to several hours).

Flags: needinfo?(cra3yk)

(In reply to cra3y from comment #2)

Hello

About 350 megabytes. Simple filtering works on only one folder at a time (you select a folder and enter a query in the form field), but users want to find something in several folders quickly (based on subject/from/to only), so the ability to select what GLODA indexes would be a very useful feature.

350MB is not overly large. Mine for example is 1 GB, and there are users who are bigger.

How many folders and accounts do you want to search?
Are all of them gmail accounts?

PS: With such a 350 megabyte database, when someone moves a mail message from one folder to another, GLODA needs a long time to reflect the change (it varies, from 10 minutes to several hours).

I very much doubt the slowness is caused by gloda. I can comment better after you describe your accounts, your hardware, and what antivirus and malware software you have installed.

Flags: needinfo?(cra3yk)
Keywords: perf

Dell vosto 3580, 8GB ram, i5, 256GB SSD - not slow at all.

Their mailboxes have a lot of small emails (confirmations, orders, etc.).

Account is IMAP based (dovecot in Ubuntu) with berkeley background on thunderbird side.

There are about 100-120 folders per account (a tree, not flat).

Flags: needinfo?(cra3yk)

Dell Vostro

It is not about slow searching/querying.
Once mails are indexed, search is instant.

It is about how quickly GLODA indexes new emails or reflects changes in the email/folder structure.
With a database of 350 MB or more, it takes minutes to hours to reflect changes.
For example: when a user moves an email to a different folder, the GLODA search result still points to the old location, and when the user clicks on this email (in the search results), it shows nothing (blank).

It is about how quickly the GLODA database is built and how quickly changes in the email structure are reflected in global-messages-db.sqlite.

We (users) have no control over how often the GLODA crawler does its job, how many threads/instances of the crawler are running, or how to periodically clean this database of non-existent emails (without deleting global-messages-db.sqlite) when a user, for example, deletes entire old 2010-2014 folders from Archive. (I know what to do in this situation: delete global-messages-db.sqlite and let TB and GLODA rebuild it. A casual user doesn't.)
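For anyone who wants to look at the index file before deleting it, the following is a minimal sketch using only generic SQLite pragmas. It makes no assumptions about Gloda's table layout; the function name `db_stats` is made up for this example, and the path has to be supplied by the reader. It is probably best run while Thunderbird is closed, since the file may be locked otherwise.

```python
import sqlite3
from pathlib import Path


def db_stats(db_path):
    """Return basic size figures for an SQLite file without modifying it."""
    # Open read-only through a file: URI so nothing is written to the database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        page_count = conn.execute("PRAGMA page_count").fetchone()[0]
        free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
    finally:
        conn.close()
    return {
        # Total size of the database as SQLite sees it.
        "size_bytes": page_size * page_count,
        # Space held by unused pages (reclaimable only by compacting the file).
        "free_bytes": page_size * free_pages,
    }
```

Calling `db_stats()` with the location of global-messages-db.sqlite in the profile folder returns the two figures in bytes. A large `free_bytes` value would only suggest that the file contains unused space; it says nothing about why indexing is slow.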

Thanks for continuing to provide details. With that information, I'll continue along the ideas started in comment 3...

A. Your search requirement is equivalent to filtering a virtual folder that walks ALL folders. I suggest you create a virtual folder (aka Saved Search) - in properties pick "Choose", and in the Select Folders dialog enable all folders. To enable all folders - select the first one, hold the shift key, page down to the last folder, hit spacebar, (all folders should now be checked) then OK. Now you have a "folder" on which you can filter on subject, to/from, etc

B. As for performance, (and speaking from experience) a fast computer doesn't make one immune from performance issues. And again you are making assumptions to blame gloda without technical background, although the symptoms would seem to point in this direction. Two things to check

B1. Check Tools > Activity Manager: when a message is moved, what exactly do you see there with respect to folders or indexing?

B2. A quick test...

Start Windows' safe mode with networking enabled - win10 https://support.microsoft.com/en-us/help/12376/windows-10-start-your-pc-in-safe-mode
Still in Windows safe mode, start Thunderbird in safe mode - https://support.mozilla.org/kb/safe-mode-thunderbird
Does the problem go away?

  • If no, then cause is either: bug in Thunderbird, something (eg a setting, file or folder) in your Thunderbird profile, your mail provider. Please post into topic the contents of Help | Troubleshooting | copy text to clipboard

  • If yes, then (still in Windows safe mode) ... start Thunderbird normally
    -- If problem is still gone, then cause is a program loaded during windows startup. Possibilities include: antivirus SW, virus/malware, background downloads such as program updates
    -- If the problem comes back, then the cause is likely a Thunderbird add-on - eliminate them by disabling each one at a time in Tools | addons | extensions and restarting

  • If results are unclear ... possibilities include temporary conditions such as contention from other running programs, downloads related to windows update, ...

And only for my curiosity, laptop or desktop?

OK. The problem is not on one computer but on 20-30 of them: a few laptops, mostly desktops (in 3 departments that mainly work with customers and have a 300 MB+ index file plus lots of folders with emails, about 50-60 GB of them).

I ran quick tests on three of them using your suggestions (antivirus uninstalled completely, Windows Defender disabled, safe mode, etc.). Nothing helped at all. These TB installations have only the extension that is installed by default, Lightning.

ad B1 - only the info that the message has been moved to the destination folder, nothing more.

ad B2 - no difference at all

PS: Most of these computers run only these programs alongside TB: SAP GUI (7.40 or 7.50) and MS Excel (2010/13/16), nothing more :-)

PPS: It doesn't depend on the Windows version, because a few of these 30 hosts run W7 but most run W10 (different builds), and the indexing problem is similar.

PPPS: On one of these hosts I installed Outlook 2016 and configured the same account (as for TB), and none of the indexing problems happened (but I hate Outlook for other reasons, so this is not a solution for me).

Here is the copy/paste from Troubleshooting Information:
https://ufile.io/l1znwrsa

This problem is virtually nonexistent on mailboxes of up to 7-10 GB (with a 20-50 megabyte global-messages-db.sqlite index file). I have over 200 hosts with mail accounts of that size and all work well.

I want to add this info:
When I delete the global-messages-db.sqlite file and it is completely rebuilt, the problem is gone for some 1-3 weeks, then it appears again (randomly) and I have to delete global-messages-db.sqlite again.

Thanks for all the information. Just to be clear, you used safe mode exactly as described? Because just disabling antivirus individually is not a sufficient test - but also not necessary because safe mode does that for you.

So this is quite interesting to have a problem involving the gloda file. And strange. But not unheard of. Sorry, a lot more questions ...

Both trash and junk folders are empty?
Thunderbird profile is in the default location, in %appdata% ?
No server side filters, and no setting of non-inbox Thunderbird folders to check for new mail (either on a per folder basis or with mail.check_all_imap_folders_for_new)?
What is an example size of the target folder to which you are moving messages - both number of messages and size?
When moving messages - are you using the Archive function?
Are messages being moved to a local folder on the PC, or to an imap folder?
How much CPU and memory is being used by thunderbird.exe process during the message move delay?
How much CPU and memory is being used by Thunderbird when it has been up for a while but relatively idle?
What do you use on the PC for backup?

Account is IMAP based (dovecot in Ubuntu) with berkeley background on thunderbird side.

Change both hidden preferences using config editor (not editing prefs.js directly) per http://kb.mozillazine.org/Modify_Thunderbird_settings

  • mail.db.max_open to 100
  • mail.db.idle_limit to 300000 (5 zeros)
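To confirm afterwards that both values were saved, one read-only option is to look for them in the profile's prefs.js, which stores user-changed preferences as `user_pref("name", value);` lines. The sketch below only parses text and never writes to the file. The function name `read_prefs` and the sample text are made up for illustration, and a preference left at its default would normally not appear in prefs.js at all.

```python
import re

# Matches lines of the form: user_pref("name", value);
PREF_LINE = re.compile(r'^user_pref\("([^"]+)",\s*(.+)\);\s*$')


def read_prefs(prefs_text, wanted):
    """Return {name: raw value string} for the wanted preference names."""
    found = {}
    for line in prefs_text.splitlines():
        match = PREF_LINE.match(line.strip())
        if match and match.group(1) in wanted:
            found[match.group(1)] = match.group(2)
    return found


# Hypothetical prefs.js excerpt; "example.other.pref" is a placeholder name.
sample = '''user_pref("mail.db.idle_limit", 300000);
user_pref("mail.db.max_open", 100);
user_pref("example.other.pref", false);
'''
print(read_prefs(sample, {"mail.db.max_open", "mail.db.idle_limit"}))
```

In practice the text would come from reading prefs.js in the profile folder while Thunderbird is closed; the changes themselves should still be made through the config editor as described above.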

If that doesn't help, can you email me a sample message, with all the mail headers?

Flags: needinfo?(cra3yk)

Trash folder is empty.

No Junk mail (because we don't use this feature).

The only server-side filter lives in Postfix (to kill viruses); there are no filters in Dovecot (the IMAP daemon).

We do not use mail.check_all_imap_folders_for_new.

Example size: 5497 messages, 6 GB.

Moving is done using drag and drop.

All file/folder operations are in IMAP folders (backed locally by TB).

Messages move instantly (depending on how many there are) and they appear instantly in the new folders (Activity Manager shows that TB is downloading message bodies from the new folder). During this process TB uses almost one CPU core.

When idle, the TB process sometimes goes to full utilization of 1 core (on a 4-core CPU it shows 25%) for 2-5 minutes.

No backup software. Everything is server-based (documents, mail, etc.), so there is no need to back up workstations (easily replaceable in case of disaster).

PS: I experimented months ago with these parameters (mail.db.max_open and mail.db.idle_limit). On this particular workstation it helps a bit (the GLODA mishap happens more rarely than with the default values), but it still exists.
Almost all (99%) of the other workstations have default settings (I'm only experimenting on 2-3 of them).

Yes. I will send.

Flags: needinfo?(cra3yk)

You know, I can live with this inconvenience (the way indexing works in TB), but it would be nice to find out why it behaves so weirdly (from time to time) and fix it (if possible).

It is not about how quickly TB moves folders and messages from one folder to another (it does that quickly), but how quickly global indexing reflects the changes, and IF it reflects them at all (sometimes it does not, and points to an empty message).

The bigger the TB mailbox folder (and message count), the more often it happens.

Improvements in bug 1023000 might help.
I still think some issue on your machines is making the search process slow. We have examples like Bug 1566958 - file attribute of global-messages-db.sqlite should prevent Windows indexation

Depends on: 1023000
Summary: Gloda indexing options/settings → Gloda indexing options/settings to select specific fields (ex. exclude message body)

It is not Windows indexation, because we disabled that by group policy (because no one was using it).

Severity: normal → S3
Duplicate of this bug: 1763729

I seriously doubt this would be implemented, because indexing performance can be improved without creating user settings.
https://mzl.la/3Y0nAz1 lists some of them, and also some bugs.

Do you still experience your performance issues?
If you do, then a performance profile would be helpful https://support.mozilla.org/en-US/kb/profiling-thunderbird-performance

Flags: needinfo?(cra3yk)

Bug 585429 comment 5 points out some possible causes of slow indexing.
And the reporter of that bug later wrote that Microsoft Security Essentials was a big cause of their problems.

(In reply to Wayne Mery (:wsmwk) from bug 1763729 comment #4 (duplicate))

Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2

The machine is a Lenovo Thinkpad T500 with SSD and 8 GB RAM.

This in combination with a non-SSD disk and 14gb profile is indeed going to be slow for a long time, although 5 days seems excessive.
Does 5 days include downloading the messages? Or just rebuilding from deleting global-messages-db.sqlite?

It does not include downloading. With 5 days I mean 5 x 24 h, no low power or sleep mode. And the bottleneck is not the disk IO, because the HDD LED only flashes rarely. It's only a CPU issue, it consumes 40 - 50 % for this long time. Here are some performance profiles: bug 1772673
Additionally I think, 2 GB index for 4 GB IMAP and 9 GB local messages (15 %) is a bad ratio. I believe, it could be enhanced significantly by only indexing the headers but not the bodies of the messages.

But I seriously doubt that will be implemented, and thus may be closed.

Please don't.

If you want better performance today, your options are

  • reduce the size of your profile
  • set some folders to not synchronize (folder/account settings)
  • set some folders to not be indexed (account settings)

These 3 points are not possible, as this would make global search useless and/or corrupt my workflow with TB.

(In reply to Wayne Mery (:wsmwk) from comment #21)

Bug 585429 comment 5 points out some possible causes of slow indexing.
And the reporter of that bug later wrote that Microsoft Security Essentials was a big cause of their problems.

To me this is not applicable, I'm on Ubuntu Linux.

(In reply to Wayne Mery (:wsmwk) from comment #20)

Do you still experience your performance issues?
If you do, then a performance profile would be helpful https://support.mozilla.org/en-US/kb/profiling-thunderbird-performance

Indexing in TB is now faster because ...... today's computers are faster than 4 years ago ....

I'm still in favor of allowing users to choose what "Gloda" indexes (via "about:config" settings), for example: from, to, subject, body (how many words of the body), attachment names.

Flags: needinfo?(cra3yk)

Thanks for the added information.

(In reply to Ulf Zibis from comment #22)

(In reply to Wayne Mery (:wsmwk) from bug 1763729 comment #4 (duplicate))

Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2

The machine is a Lenovo Thinkpad T500 with SSD and 8 GB RAM.

This in combination with a non-SSD disk and 14gb profile is indeed going to be slow for a long time, although 5 days seems excessive.
Does 5 days include downloading the messages? Or just rebuilding from deleting global-messages-db.sqlite?

It does not include downloading. With 5 days I mean 5 x 24 h, no low power or sleep mode. And the bottleneck is not the disk IO, because the HDD LED only flashes rarely. It's only a CPU issue, it consumes 40 - 50 % for this long time. Here are some performance profiles: bug 1772673

Unfortunately it turns out those profiles are not helpful. According to developers "The profile is misleading because it doesn't include CPU delta information, so threads look busy even when they're not. I think the main problem is the long fseek calls (or the high number of them, unclear) on the main thread. With "Main Thread File IO" markers we might be able to get a better idea of what's going on, but we'd need to capture a new profile with that setting, which I assume is a challenge."

So, determining why your indexing is slow will require https://support.mozilla.org/en-US/kb/profiling-thunderbird-performance#w_step-1-preparing-performance-recording-in-thunderbird plus main thread IO enabled, using version 102. (These instructions have been greatly updated compared to when you captured your profile 6 months ago.)

Also, Ubuntu must upload the thunderbird symbols for the ppa build to mozilla. Your previous profile was missing those symbols. To avoid this problem entirely you can use the Thunderbird provided installs from https://www.thunderbird.net/.

Additionally I think, 2 GB index for 4 GB IMAP and 9 GB local messages (15 %) is a bad ratio. I believe, it could be enhanced significantly by only indexing the headers but not the bodies of the messages.

You may consider 15% to be bad, but 10-20% is typical, and considered to be absolutely normal. (my own is toward 20%) Within this range, search speed is expected to be good. Also, a smaller index will not necessarily improve your search speed nor the speed of indexing new messages.
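The figures quoted above can be checked with a short calculation. This is only an illustrative sketch; the function name `index_ratio` is made up here, and the sizes are the ones given in this thread (2 GB index, 4 GB IMAP plus 9 GB local messages).

```python
def index_ratio(index_gb, *store_gb):
    """Index size as a fraction of the total message store size."""
    return index_gb / sum(store_gb)


ratio = index_ratio(2, 4, 9)      # 2 GB index, 4 GB IMAP + 9 GB local
print(f"{ratio:.1%}")             # 15.4%
print(0.10 <= ratio <= 0.20)      # True: inside the 10-20% range
```

So the reported profile sits in the middle of the range described as typical.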

If you don't want Thunderbird to index message bodies because you want fewer search hits or a smaller index, then you can do that today by turning off the automatic IMAP synchronization function to avoid indexing message bodies. (not a 100% fix, but it will help)

(In reply to Ulf Zibis from comment #22)

(In reply to Wayne Mery (:wsmwk) from bug 1763729 comment #4 (duplicate))

Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2

The machine is a Lenovo Thinkpad T500 with SSD and 8 GB RAM.

This in combination with a non-SSD disk and 14gb profile is indeed going to be slow for a long time, although 5 days seems excessive.
Does 5 days include downloading the messages? Or just rebuilding from deleting global-messages-db.sqlite?

It does not include downloading. With 5 days I mean 5 x 24 h, no low power or sleep mode. And the bottleneck is not the disk IO, because the HDD LED only flashes rarely. It's only a CPU issue, it consumes 40 - 50 % for this long time.

However, with these comments I have effectively hijacked cra3y's bug report. If you still want to pursue the indexing speed issue, please create a new bug report or reopen one of your bugs that you consider to not be resolved, and we can handle the investigation there.

(In reply to cra3y from comment #24)

(In reply to Wayne Mery (:wsmwk) from comment #20)

Do you still experience your performance issues?
If you do, then a performance profile would be helpful https://support.mozilla.org/en-US/kb/profiling-thunderbird-performance

Indexing in TB is now faster because ...... today's computers are faster than 4 years ago ....

I'm still in favor of allowing users to choose what "Gloda" indexes (via "about:config" settings), for example: from, to, subject, body (how many words of the body), attachment names.

For what purpose?

Note, making the index smaller is not a goal which would make this attractive enough to product planners.

(In reply to Wayne Mery (:wsmwk) from comment #27)

I'm still in favor of allowing users to choose what "Gloda" indexes (via "about:config" settings), for example: from, to, subject, body (how many words of the body), attachment names.

For what purpose?

To index what the user needs (for example, some want to index subject and from/to, others from/to/subject and the first 50 words of the body), and to index faster for those who do not "check" all of the options mentioned above.

To make Gloda indexing more flexible and suitable for user needs.

(In reply to cra3y from comment #28)

To index what the user needs (for example, some want to index subject and from/to, others from/to/subject and the first 50 words of the body), and to index faster for those who do not "check" all of the options mentioned above.

To make Gloda indexing more flexible and suitable for user needs.

+1
At least have the choice to disable indexing the body. This could decrease the size of the index and speed up the build significantly.

(In reply to Wayne Mery (:wsmwk) from comment #25)

You may consider 15% to be bad, but 10-20% is typical, and considered to be absolutely normal. (my own is toward 20%) Within this range, search speed is expected to be good. Also, a smaller index will not necessarily improve your search speed nor the speed of indexing new messages.

Well, on my Ubuntu machine I have ~ 400 GB of data, but the Linux index db is only 8 MB. This is what I expect as "normal".
But there too was a problem some time ago: https://bugs.launchpad.net/bugs/1946561

(In reply to Wayne Mery (:wsmwk) from comment #26)

However, with these comments I have effectively hijacked cra3y's bug report. If you still want to pursue the indexing speed issue, please create a new bug report or reopen one of your bugs that you consider to not be resolved, and we can handle the investigation there.

Hm, as you made bug 585429 a duplicate of this, I don't understand this. So what to do ... reopen bug 585429 again?

(In reply to Ulf Zibis from comment #31)

(In reply to Wayne Mery (:wsmwk) from comment #26)

However, with these comments I have effectively hijacked cra3y's bug report. If you still want to pursue the indexing speed issue, please create a new bug report or reopen one of your bugs that you consider to not be resolved, and we can handle the investigation there.

Hm, as you made bug 585429 a duplicate of this, I don't understand this. So what to do ... reopen bug 585429 again?

Bug 1763729 was duplicated to this (not 585429) because 1763729 requested the same types of controls as this bug. So that's not a useful bug report to analyze why your indexing is slow.

Also, I suggest your issue not be analyzed in someone else's bug report.

(In reply to Wayne Mery (:wsmwk) from comment #32)

Bug 1763729 was duplicated to this (not 585429) ...

Sorry for my typo.
