Closed Bug 295088 Opened 20 years ago Closed 19 years ago

Filter/Search term evaluator is extremely sub-optimal

Categories

(MailNews Core :: Filters, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hyc, Assigned: hyc)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: fixed1.8.1, perf)

Attachments

(1 file, 4 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8b2) Gecko/20050516
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8b2) Gecko/20050516

The term evaluator builds an an expression tree for each filter and then
destroys it again on a per-message basis. So even if you have a very simple set
of filters, if you're running them on a large folder there is a bunch of time
wasted rebuilding and destroying the same expression structure over and over again.

Instead, the expression tree should be constructed once, the first time the
filter is used, and re-used for the entire duration of the in-progress
operation, then destroyed in the Filter's destructor.

For example, I have a filter "Age in days" "is greater than" "30" / Action is
Delete, which I run in my trash folder periodically. I keep deleted messages
just in case I might need them again. Typically I have around 2000 messages in
each Trash folder on my two main email accounts. Expiring brings them down to
around 1500, but it takes at least a minute to run this single filter. It
shouldn't take this long.

Reproducible: Always
Another problem - the complete expression tree is always constructed, and values
are assigned to all of its nodes, before the AND/OR evaluation occurs. It should
not work in this order; it should only assign values to nodes that it actually
visits, so that an AND/OR result that prunes a branch doesn't require any
additional work. This is most critical if bug #223591 is fixed, because Junk
classification will be one of these filter nodes and you don't want to start
that up if you don't need to.


Blocks: 223591
Attached patch Rearrange expressionTree code (obsolete) — Splinter Review
This patch keeps a single expressionTree hanging around for the various
search/filter callers, to avoid unnecessayl malloc churn. It also fixes the
term evaluation to allow shortcuts.
Since there's a patch in the works, might as well confirm this.
Assignee: nobody → hyc
Status: UNCONFIRMED → NEW
Ever confirmed: true
Attached patch Cleanup prev patch (obsolete) — Splinter Review
Since the m_evalValue field of the nsMsgSearchBoolExpression was no longer
being used, I deleted that and consolidated everything that was referenced by
it. Also stuck my name in...
Attachment #184244 - Attachment is obsolete: true
Attached patch Cleanup one more loose end (obsolete) — Splinter Review
Missed cleaning up a call to AddSearchTerm...
Attachment #184280 - Attachment is obsolete: true
Attachment #184281 - Flags: review?(bienvenu)
Comment on attachment 184281 [details] [diff] [review]
Cleanup one more loose end

thx, Howard - this looks good.	There's an open bug about search evaluation not
 being lazy, which this fixes, right?
Attachment #184281 - Flags: superreview?(mscott)
Attachment #184281 - Flags: review?(bienvenu)
Attachment #184281 - Flags: review+
(In reply to comment #6)
> (From update of attachment 184281 [details] [diff] [review] [edit])
> thx, Howard - this looks good.	There's an open bug about search evaluation not
>  being lazy, which this fixes, right?

Right. I only searched for MailNews:Filter bugs before, so I missed any :Search
bugs. I think you're talking about bug #166502 right?
yeah, that's the one, thx. When this gets checked in, could mark that one fixed
as well? Dup'ing or marking it dependent on this doesn't seem quite right...
Blocks: 166502
(In reply to comment #8)
> yeah, that's the one, thx. When this gets checked in, could mark that one fixed
> as well? Dup'ing or marking it dependent on this doesn't seem quite right...

I don't have general bug editing privs; if you want to assign that bug to me
then I can track it.
Status: NEW → ASSIGNED
(In reply to comment #9)
> I don't have general bug editing privs;
 
You do now :).
Doesn't yet Solve (second half of bug #166502) ordering clauses so most
expensive (body text) operations are done last (so summary op's have chance to
short-circuit it)? User merely must define/correct their clauses to be in
optimal order to start with, now that there is an optimal order thankfully.

Nor the "pie in the sky" common subexpression elimination when exact same
clauses appear (reused) in multiple filters (mentioned in bug #154867)?
(In reply to comment #11)
> Doesn't yet Solve (second half of bug #166502) ordering clauses so most
> expensive (body text) operations are done last (so summary op's have chance to
> short-circuit it)? User merely must define/correct their clauses to be in
> optimal order to start with, now that there is an optimal order thankfully.

Right. I don't believe the code should automatically re-order things behind the
users' back. We should just document that the order makes a difference.

> Nor the "pie in the sky" common subexpression elimination when exact same
> clauses appear (reused) in multiple filters (mentioned in bug #154867)?
 
Right again. That would require walking through all of the filter terms to see
if any of them are identical, then linking the terms in the expression trees.
Finding the identical terms will be an inefficient process, but I guess it will
be trivial in comparison to the filter execution time.
>> Doesn't yet Solve (second half of bug #166502) ordering clauses so most
>> expensive (body text) operations are done last (so summary op's have chance to
>> short-circuit it)? User merely must define/correct their clauses to be in
>> optimal order to start with, now that there is an optimal order thankfully.

>Right. I don't believe the code should automatically re-order things behind the
>users' back. We should just document that the order makes a difference.

If you're going to mark this as fixing #166502 (and it's dupe #154867), then
shouldn't actually fix that bug?

in what 'match all' search cases could the non-short-circuited verson produce a
different result from the short-circuited version? It seems like a very
unintuitive 'feature' to have to document, given how much trouble humans have
with boolean logic at the best of times.

I always used to try to combine subject and body searches to speed things up
until I realised it was pointless, but I can't imagine a situation where I'd want

'from contains public@whitelabel.org' AND
'body contains bienvenu'

to return different results from 

'body contains bienvenu' AND
'from contains public@whitelabel.org'

Especially in a world where a body search is an order of magnitude slower than
summary.

or have I got the smelly end of the stick here?
(In reply to comment #13)
> >Right. I don't believe the code should automatically re-order things behind the
> >users' back. We should just document that the order makes a difference.

> If you're going to mark this as fixing #166502 (and it's dupe #154867), then
> shouldn't actually fix that bug?

As far as I'm concerned, the *bug* is that it always fully evaluated the terms
instead of doing lazy evaluation, and that bug is fixed in this patch. The fact
that the code executes filters in the order that you ask it to is not a bug,
IMO. That is correct behavior, and anything other than the order the user
specifies is incorrect behavior.

Anyway, if you disagree, I'm probably not the person you need to convince, so
I'm going to shut up.
Great that this bug has solved precisely what it was opened for. I'd say at
least keep one of the related bugs open for the body-last issue. So far as I
know order of expression evaluation can't produce different results (except
variations in elapsed time to when the current date is fetched, if the age
reference date isn't already snapshot-ed once upfront). 

IMO, body-last just "does the right thing" transparently to the user, giving
more users a more optimal runtime without teaching them how to use or how to
optimize their existing filters. Still give a user-pref by which to disable it:
When headers are longer than body, matches on headers take longer than body;
some users might have such a special purpose need to configure for: Use exactly
my ordering.

Ideal to: A one-time reordering of the expression tree, that only needs to be
done once, then the static tree can be evaluated for all messages. But is it
easier to implement up to two passes over the tree per message: first to only
evaluate non-body clauses, second for remainder? Skip the first pass if
user-pref said to use exactly my ordering. Would cost per message of the 2nd
tree walk and 2nd expression type tests, be considered significant enough to
kill this easier implementation?
(In reply to comment #15)
> Great that this bug has solved precisely what it was opened for. I'd say at
> least keep one of the related bugs open for the body-last issue. So far as I
> know order of expression evaluation can't produce different results (except
> variations in elapsed time to when the current date is fetched, if the age
> reference date isn't already snapshot-ed once upfront). 

See, even without spending any time in the code you've already run into an
exception. I wonder how many others there are, waiting to be discovered.

> IMO, body-last just "does the right thing" transparently to the user, giving
> more users a more optimal runtime without teaching them how to use or how to
> optimize their existing filters. Still give a user-pref by which to disable it:
> When headers are longer than body, matches on headers take longer than body;
> some users might have such a special purpose need to configure for: Use exactly
> my ordering.

Lulling users into a false sense of security, relying on automagical processes
to Do The Right Thing is a bad policy, especially when the code cannot determine
what The Right Thing is in every case. You're looking at things from an
extremely simplistic view of what the current filter UI allows. But the filter
engine allows arbitrarily nested trees of boolean combinations, and at some
point someone is going to fix the UI to allow users to create these complex
filters. (Notice bug #59821) When people start using these complex filters, the
simple reordering you're talking about here is going to be ineffective in the
general case, and figuring out how to restructure a complex expression is not
(IMO) worth the effort. The most important consideration in "fixing a bug" is to
make sure that you actually fix something - that means that the result is
always, consistently correct. If you introduce inconsistent behavior, then you
have created a problem, not a solution. Trying to automagically restructure a
users' input is guaranteed to meet with varying degrees of success. It will add
an inordinate amount of complexity to the code, making it harder to understand
and maintain, for an extremely small potential benefit that frequently will not
materialize.

Of course, if you have the code that implements what you ask for, and it works,
feel free to post it here.
I see your point Howard, and perhaps it is not unreasonable to leave it as is.
However, in the "real world" most people will not care about complex nested
expressions (and in fact a UI which allows it will probably be more confusing).
Most people just want a simple search like gmail allows which searches
everything as quickly as possible.

Is it possible to detect that this is a "simple" expression (ie. single level)
and reorder for speed only in that case? 
Assignee: hyc → nobody
Status: ASSIGNED → NEW
Whoops, sorry: accidentally clicked " Reassign bug to owner and QA contact of
selected component " before Commit. 
(In reply to comment #18)
> Whoops, sorry: accidentally clicked " Reassign bug to owner and QA contact of
> selected component " before Commit. 

Fixing.
Assignee: nobody → hyc
(In reply to comment #17)
> 
> Is it possible to detect that this is a "simple" expression (ie. single level)
> and reorder for speed only in that case? 

That sounds reasonable. Unfortunately I've used up all the spare time I have for
now, so I won't be able to pursue this for a while.

> I see your point Howard, and perhaps it is not unreasonable to leave it as is.
> However, in the "real world" most people will not care about complex nested
> expressions (and in fact a UI which allows it will probably be more confusing).

It seems that not a lot of people want it, but things like bug 274192 indicate
that *some* people want it. Personally, I'd like to have a full UI for this.

> Most people just want a simple search like gmail allows which searches
> everything as quickly as possible.

Adding AND/OR and Grouping to the UI won't prevent people from defining simple
filters. They can just ignore the Grouping mechanism. (But I guess we can
discuss the UI on 274192 or elsewhere.)
Status: NEW → ASSIGNED
Comment on attachment 184281 [details] [diff] [review]
Cleanup one more loose end

sr=dmose
Attachment #184281 - Flags: superreview?(mscott) → superreview+
Comment on attachment 184281 [details] [diff] [review]
Cleanup one more loose end

Howard, the patch has bitrotted enough since May that it doesn't apply anymore.
Could you please attach an updated version?

$ patch -p0 < ~/moz/patches/fremde/295088.diff
patching file public/nsMsgSearchBoolExpression.h
patching file src/nsMsgFilter.cpp
Hunk #2 succeeded at 171 (offset 15 lines).
Hunk #3 succeeded at 182 (offset 15 lines).
Hunk #4 succeeded at 246 (offset 15 lines).
Hunk #5 succeeded at 395 (offset 15 lines).
Hunk #6 succeeded at 516 (offset 16 lines).
patching file src/nsMsgFilter.h
Hunk #3 succeeded at 122 (offset 2 lines).
patching file src/nsMsgLocalSearch.cpp
patching file src/nsMsgLocalSearch.h
patching file src/nsMsgSearchAdapter.cpp
Hunk #2 succeeded at 787 (offset -15 lines).
patching file src/nsMsgSearchSession.cpp
Hunk #2 FAILED at 63.
Hunk #3 succeeded at 68 (offset -4 lines).
Hunk #4 succeeded at 93 (offset -6 lines).
Hunk #5 succeeded at 103 (offset -6 lines).
Hunk #6 succeeded at 647 (offset 3 lines).
Hunk #7 succeeded at 753 (offset 11 lines).
1 out of 7 hunks FAILED -- saving rejects to file
src/nsMsgSearchSession.cpp.rej
patching file src/nsMsgSearchSession.h
Hunk #2 FAILED at 108.
1 out of 2 hunks FAILED -- saving rejects to file src/nsMsgSearchSession.h.rej
Attached patch refreshed patch (obsolete) — Splinter Review
OK, this is a fresh diff from a current tree.
Attachment #184281 - Attachment is obsolete: true
Attachment #200249 - Attachment is obsolete: true
Attachment #200258 - Attachment description: refreshed, with missing header diff → refreshed, with missing header diff [checked in]
So nsMsgSearchSession.cpp lacked an #include nsMsgSearchBoolExpression.h and
some trees got burned down, but no other harm was done. ;-)
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
This bug here might have caused Bug 313357, maybe the patch author and/or
reviewers can take a look at it.
Blocks: 314027
*** Bug 330816 has been marked as a duplicate of this bug. ***
Keywords: fixed1.8.1
*** Bug 357083 has been marked as a duplicate of this bug. ***
Product: Core → MailNews Core
Depends on: 193612
Keywords: perf
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: