Closed Bug 484639 Opened 16 years ago Closed 15 years ago

gloda should not index junk messages

Categories

(MailNews Core :: Database, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.0rc1

People

(Reporter: asuth, Assigned: asuth)

Details

(Whiteboard: [no l10n impact][gloda key])

From newsgroup discussion on this a bit back, it was determined that it's best to never index junk messages. We can fall back to non-gloda methods of search when the user opts to include junk in the results.
Whiteboard: [b3ux][m4]
That assertion was certainly made, but I'm still not entirely sold. In particular, why not index junk messages but simply have a constraint in the default SQL search that masks them out? That would presumably provide a better user experience for the use case where a user is trying to find a message that someone claims they sent but that the user doesn't remember seeing.
For some people (esp. without gateway-level spam filtering), the volume of junk far outweighs the volume of ham. For some of those, the indexing load of that mail might warrant turning that off. Agreed that the default search should exclude spam, and that searching through spam folders would be much easier w/ gloda.
The main reason not to index spam is the performance impact on the full-text search. The term lists have a list of all the documents with the term and the offsets within the documents. In order for us to know that the document in question is a spam document and filter it out, we need to retrieve the document row. This is not without cost, and it means every non-spam search pays for the existence of the spam (which may grow at a higher rate). This can be mitigated by creating a separate full-text search table for spam messages. That seems like a waste, and could quickly become more complex if we decide to give it its own database so it's a separate file on disk.
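To make the cost argument concrete, here is a minimal sketch of a toy inverted index. It is not gloda's or SQLite's actual full-text implementation; the class name `TinyIndex` and the `row_fetches` counter are invented for illustration. The point it tries to show is that the posting list alone does not say whether a document is junk, so excluding junk forces one row read per matching document, junk included.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: term -> list of (doc_id, offset) postings."""

    def __init__(self):
        self.postings = defaultdict(list)
        self.rows = {}        # doc_id -> {"junk": bool}, stands in for the document row
        self.row_fetches = 0  # counts simulated document-row reads

    def add(self, doc_id, text, junk=False):
        self.rows[doc_id] = {"junk": junk}
        for offset, term in enumerate(text.lower().split()):
            self.postings[term].append((doc_id, offset))

    def search(self, term, include_junk=False):
        hits, seen = [], set()
        for doc_id, _offset in self.postings.get(term.lower(), []):
            if doc_id in seen:
                continue
            seen.add(doc_id)
            if not include_junk:
                # The posting carries no junk flag, so the row must be read.
                self.row_fetches += 1
                if self.rows[doc_id]["junk"]:
                    continue
            hits.append(doc_id)
        return hits


index = TinyIndex()
index.add(1, "meeting notes for the release")
for spam_id in range(100, 110):
    index.add(spam_id, "cheap release pills", junk=True)

ham_hits = index.search("release")
print(ham_hits, index.row_fetches)
```

In this toy setup a search that returns a single ham message still performs a row read for each of the ten junk messages that share the term, which is the overhead described above.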
It may be helpful to frame the default question this way:
* what default is likely to work best for 80% of our target users?
* how bad is it for the other 20%, and how hard is it for them to ameliorate that badness?
Can you frame a user scenario that meets your question? If our users' primary concern is searching their spam, our search design seems wrong to hide the spam from them on the first pass. Non-gloda search is a pretty good performer as long as we don't punt to the IMAP server for body search (quick search is dumb about this). Since gloda would only index the message body if we have it offline anyways, this means we would not need to punt to the IMAP server. This should be a more than adequate fallback search for the user, especially if we perform a subject/sender search by default.
I don't believe their primary concern is searching their spam, but I don't have a good feel for how their various concerns balance against each other. The problem is that picking the best default is really not about framing a single user scenario, but having a good feel for as many of the important user scenarios as possible and having some intuition about how to balance them against each other. I'm hoping Bryan can provide us with some guidance here. I'm not personally a big fan of a design hiding the spam on the first pass; Mail.app doesn't do it, and when I've spent time living in Mail.app, that's felt like a good design choice to me. That said, this is Bryan's call to make, not mine. Assuming that we do go ahead with heading the spam on the first pass, what does the proposed interaction look like for the user who decides they do actually want to include their spam in the search?
Er, "hiding the spam" is what I meant, of course.
I think I understand better now what you're getting at. There are a couple of things here that I think we can approach separately. In search results we shouldn't be defaulting to including spam in the results. I think this is pretty obvious because spam results are a mix of keywords such that on full-text indexing the odds are very slim that we'll match something correctly and very likely they will bloat the real results. However there is a good case that I think Dan is getting at about finding messages from people you know in the spam folder. This makes me wonder if we could do something to search spam for email addresses in the headers only. What I mean is that when I search for asuth@example.com in my messages (I believe) it is highly unlikely that I've received spam with the email address of a contact that was also sent to me. This searching for mail from people you know and might be missing also happens to be the prime spam search use case. So I'm wondering if we could create a solution that when a user searches gloda for an email address we also search spam message headers for that email address. I'm not sure how that works out tech wise and I hope asuth can answer if it's too far off, but I think it would help this use case that is pretty common. I'm also not sure exactly how we float this information to the interface, much of it depends on what is possible. The 'preferred' path for searching for messages that may have been wrongly marked spam is to go into the spam folder and then search for the messages specifically. The subject/sender search in the spam folder is likely to return the best results even when compared to a body text search.
I'm taking this to force myself to learn something about gloda.
Assignee: bugmail → kent
(In reply to comment #8)
> In search results we shouldn't be defaulting to including spam in the results. I think this is pretty obvious because spam results are a mix of keywords such that on full-text indexing the odds are very slim that we'll match something correctly and very likely they will bloat the real results.
While the reasoning there sounds sensible, I don't recall ever noticing that as a problem during my time spent living in Mail.app. To be fair, while I know that Mail.app has found things in my Junk folder for me, I haven't done the testing to find out whether it searches the bodies there or not.
> However there is a good case that I think Dan is getting at about finding messages from people you know in the spam folder.
I haven't done a good job of being clear about what feels to me like an important use case; I'll give it another shot: what happens to me from time to time (and what I think is likely to be common among folks that get non-trivial amounts of email) is this: someone pings me via whatever medium (phone, email, chat) and starts asking about some message that they sent me. I don't remember that message. There's a reasonable chance that I've simply forgotten it, and done one of a number of things: a) let it sit in my Inbox, b) filed it into a folder of stuff to process later, c) archived it. It might also have been marked as spam, and I never saw it. Note that while it's often people I know, it's sometimes people I don't who are contacting me for the first time.
> So I'm wondering if we could create a solution that when a user searches gloda for an email address we also search spam message headers for that email address.
I think something like that could indeed work. It does feel like it's placing an unnecessary bet on the user typing a full email address rather than a partial one, so I'd be more inclined to search (at least) Subject & From for whatever the string is.
> I'm also not sure exactly how we float this information to the interface, much of it depends on what is possible.
Which information? The details of what sorts of searches are going on under the hood?
(In reply to comment #10)
> I think something like that could indeed work. It does feel like it's placing an unnecessary bet on the user typing a full email address rather than a partial one, so I'd be more inclined to search (at least) Subject & From for whatever the string is.
We can do this easily using non-gloda search on spam folders. The Subject & From are already in the msf, and the non-indexed nature of the search is well suited to substring searches.
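As a rough illustration of the non-indexed fallback, the sketch below does a linear, case-insensitive substring scan over Subject and From. The dictionaries stand in for header records read from a folder summary; the field names and the function `header_substring_search` are assumptions made for this example, not Thunderbird's search API.

```python
def header_substring_search(headers, needle):
    """Return headers whose subject or sender contains needle (case-insensitive)."""
    needle = needle.casefold()
    return [
        hdr for hdr in headers
        if needle in hdr["subject"].casefold() or needle in hdr["from"].casefold()
    ]


# Hypothetical contents of a junk folder.
junk_folder = [
    {"subject": "Cheap pills", "from": "promo@spam.example"},
    {"subject": "Re: patch review", "from": "Andrew <asuth@example.com>"},
    {"subject": "You have won", "from": "lottery@spam.example"},
]

matches = header_substring_search(junk_folder, "asuth")
print([hdr["subject"] for hdr in matches])
```

Because nothing is tokenized, a partial address such as "asuth" matches just as well as the full one, which is the property the comment above relies on.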
I was looking for a first gloda bug and thought this might be it. But I see now that the issue here is more what to do rather than just doing it, so I'm going to unassign myself, as those issues are more concerned with the overall strategy for gloda search rather than the details of interacting with the junk system.
Assignee: kent → nobody
Status: ASSIGNED → NEW
Sorry for leaving this ambiguously hanging, Kent. I appreciate your previously volunteering to help out with this bug; I need to learn to be proactive about bug triage where I'm potentially getting something for free :)
I have logged bug 486849 to track an enhancement to bug 474701 to track dmose's use case. Note that the use case does not imply any specific technical implementation and can be implemented without gloda, whereas having gloda index spam has several ramifications on resource usage, performance, and data utility. (Namely, auto-creation of contacts for bogus spam addresses 'spams' our contact database.)
Here is what we are doing on this bug:
1) gloda will not index a message until it is clear the junk detector has made a determination one way or the other.
2) gloda will not index messages that are marked as junk/spam.
3) gloda will treat messages that become marked as junk/spam as if they had been deleted. Gloda's treatment of deleted messages avoids pathological behavior should the user sit there toggling a single message's junk status.
Kent, if you want to take the bug again, I won't complain...
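The three rules can be sketched as a small state machine. This is only an illustration under stated assumptions: `ToyIndexer`, `JunkState`, and the method names are hypothetical, and the real gloda indexer (including its deferred handling of deletions) is more involved.

```python
from enum import Enum


class JunkState(Enum):
    UNKNOWN = "unknown"  # the junk detector has not reached a determination yet
    HAM = "ham"
    JUNK = "junk"


class ToyIndexer:
    """Hypothetical indexer that follows the three rules listed above."""

    def __init__(self):
        self.pending = set()  # seen, but not yet classified
        self.indexed = set()

    def on_message_added(self, msg_id):
        # Rule 1: do not index until a determination has been made.
        self.pending.add(msg_id)

    def on_classified(self, msg_id, state):
        if state is JunkState.UNKNOWN:
            return
        self.pending.discard(msg_id)
        if state is JunkState.JUNK:
            # Rules 2 and 3: never index junk; becoming junk acts like a deletion.
            self.indexed.discard(msg_id)
        else:
            self.indexed.add(msg_id)


indexer = ToyIndexer()
indexer.on_message_added("msg-1")
indexer.on_classified("msg-1", JunkState.HAM)   # now indexed
indexer.on_classified("msg-1", JunkState.JUNK)  # user marks it junk later
print(indexer.indexed)
```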
Assignee: nobody → bugmail
Whiteboard: [b3ux][m4] → [b3ux][m6]
Whiteboard: [b3ux][m6] → [b3ux][m6][gloda key]
I am planning to work this for rc1; not sure if it should actually block.
Status: NEW → ASSIGNED
Flags: blocking-thunderbird3?
Whiteboard: [b3ux][m6][gloda key] → [gloda key]
Target Milestone: --- → Thunderbird 3.0rc1
Whiteboard: [gloda key] → [no l10n impact][gloda key]
(In reply to comment #14)
> I am planning to work this for rc1; not sure if it should actually block.
Not blocking as this doesn't seem a significant issue - worst case junk mail is typically moved into one folder that could be filtered out in a faceted search. Would be nice to have for TB 3.
Flags: wanted-thunderbird3+
Flags: blocking-thunderbird3?
Flags: blocking-thunderbird3-
rkent, bienvenu, sid0, any thoughts on adding an explicit notification mechanism for when messages have been junk classified or it has been determined that junk classification has not been attempted? and/or for after a message has had all filters applied?
I am trying to implement the #1 case where gloda doesn't index messages until they have been classified if they are going to be classified, and there is not an easy way to detect this case as things are currently implemented. The most useful indicator is a folder's processing flags, but they are not set until nsMsgDBFolder::CallFilterPlugins is invoked. That definitely happens after the add notification, but how long after can vary. In the longest case, when there is a filter that needs the message body, the call is deferred until all the bodies have been fetched. There aren't particularly any notifications associated with such events; one would need to sign up as a url-ish listener on urls that never get explicitly exposed to the world.
There's the additional complication where if the user has never marked any messages, junk processing may not be run at all. So even if there were a simple way to listen for the SetStringProperty calls for "junkscore" and friends, we have no guarantee such an event would ever occur. (The string property setting only generates db change listener notifications. While gloda could register as a db change listener once it sees an add event, the frequency of notifications, lack of specificity, and potentially large number of listeners that might have to be registered would be concerning.)
So, I would propose that the notification would be generated at:
1) The classification point: http://mxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgDBFolder.cpp#2210
2) The not-gonna-classify point: http://mxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgDBFolder.cpp#2371
It might make more sense to pose the notification as "after junk classification and filters have run". Gloda would certainly benefit from not doing redundant work because it indexed a message before a filter did something to it, and I certainly don't see any benefits to gloda getting a whack at a message before filters have run.
Biff has a similar issue - we don't want to pop up the biff alert after having fetched a bunch of new messages until we've done all the junk analysis we're going to do; otherwise, we run the risk of putting up the new mail alert when only junk mail has arrived. I'm not sure we're getting this 100% right, but it's something we should get right. So we could go at this in a different way - a notification on a folder/server when we start fetching new hdrs, and a notification when all junk analysis is done for that fetch.
> any thoughts on adding an explicit notification mechanism for when messages have been junk classified or it has been determined that junk classification has not been attempted?
I think that would be an excellent idea. Would this be an nsIFolderListener or nsIMsgFolderListener? The classification operations are typically done in batches, so would the listener receive an array of message keys and a folder reference? I think that listener needs to understand the specific messages that have been processed, not simply a "we're done now" without specific references.
Related issues:
1) I am trying to generalize this code so that the generalized trait can do whatever junk processing does, so you want the notification to be after all classifications have been done, not simply after all junk classification. This would be identical unless someone extends to include a trait.
2) The concept of "don't index" is important, so it should be possible to affect this in a message filter. Rather than having gloda decide that it does not want to index junk messages, there should be a permanent message attribute set (either a new flag or a property) that says don't index this message, then set that flag during junk processing. I guess then we also need some way to deindex a message if the flag is set later, for example if you manually classify a message as junk.
(In reply to comment #18)
> I think that would be an excellent idea. Would this be an nsIFolderListener or nsIMsgFolderListener?
nsIMsgFolderListener.
> The classification operations are typically done in batches, so would the listener receive an array of message keys and a folder reference? I think that listener needs to understand the specific messages that have been processed, not simply a "we're done now" without specific references.
I think an array of nsIMsgDBHdr instances is ideal.
> 1) I am trying to generalize this code so that the generalized trait can do whatever junk processing does, so you want the notification to be after all classifications have been done, not simply after all junk classification. This would be identical unless someone extends to include a trait.
I concur.
> 2) The concept of "don't index" is important, so it should be possible to affect this in a message filter. Rather than having gloda decide that it does not want to index junk messages, there should be a permanent message attribute set (either a new flag or a property) that says don't index this message, then set that flag during junk processing. I guess then we also need some way to deindex a message if the flag is set later, for example if you manually classify a message as junk.
I agree that it makes sense to let users indicate they don't want a message indexed. From a usability/understandability perspective, I think it may make more sense to make that happen on a per-folder basis. I would have clarkbw and davida render opinions on that separate from this bug.
Junk mail is a sufficiently important concept that I think we can address it specifically and not have it use a more generic mechanism and risk conflating things. That way if we did adopt the per-message flag then the act of toggling a message's junk state would not lose an explicit "do not index this". I would much prefer handling junk specifically rather than having to create a more complex per-message flag mechanism to handle all possible cases involving junk.
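The shape of the notification being discussed (one callback per batch, carrying the specific headers that were processed) might look roughly like the sketch below. Everything here is a stand-in: `Header` is not nsIMsgDBHdr, and `Folder.classification_done` and `add_classified_listener` are hypothetical names rather than methods of nsIMsgFolderListener.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Header:
    """Stand-in for a message header; not the real nsIMsgDBHdr."""
    key: int
    junk: bool = False


@dataclass
class Folder:
    """Hypothetical folder that announces each finished classification batch."""
    listeners: List[Callable[[List[Header]], None]] = field(default_factory=list)

    def add_classified_listener(self, callback):
        self.listeners.append(callback)

    def classification_done(self, headers):
        # Fired once per batch, whether the classifier ran or decided not to run.
        for callback in self.listeners:
            callback(list(headers))


indexed: Set[int] = set()


def index_non_junk(headers):
    """Listener: index only the headers that were not classified as junk."""
    for hdr in headers:
        if not hdr.junk:
            indexed.add(hdr.key)


folder = Folder()
folder.add_classified_listener(index_non_junk)
folder.classification_done([Header(1), Header(2, junk=True), Header(3)])
print(sorted(indexed))
```

Passing the batch of headers, rather than a bare "done" signal, lets the listener act on exactly the messages that were processed without re-scanning the folder.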
From a feedback POV, I've seen b4 users request the ability to turn off indexing for folders, but haven't gotten that request yet per message. The specific use cases mentioned were "automated messages that i don't want to search", "old archives", that sort of thing -- and they seemed all to fit in a per-folder model. From a UI POV, we have a folder properties dialog which would accommodate a "don't index" checkbox easily enough, and we don't have anything but the threadpane to set state per message AFAICT, so I would agree that the per-folder model is easier to accommodate with the current model. Kent -- out of curiosity, apart from junk, can you describe more scenarios where avoiding indexing at all would be helpful? I can come up with one, which would be trying to avoid polluting the FTS index w/ "noise" data. But I could see that being done _in_ gloda, as opposed to in the msgdbhdr (meaning that we could find those messages in gloda-dependent views, but they wouldn't come up in full-text search results).
(In reply to comment #20)
> From a feedback POV, I've seen b4 users request the ability to turn off indexing for folders, but haven't gotten that request yet per message.
Part of that reason is that the TB UI strongly focuses on real folders rather than alternate ways of organizing messages. I'm a strong believer philosophically that folders are a storage mechanism, and should not be used as a method to categorize messages. That is what message metadata, such as tags, is useful for. Whenever we add some new feature, and make it only available on a per-folder basis, we dig a deeper hole that forces us into the folder-as-category work model.
But using folders to categorize is so ingrained already in the UI, that I agree that a "don't index" checkbox on the folder properties dialog is useful. I would just encourage you to consider using an inherited folder property, so that it would be possible to shut off indexing at the account, folder, or tree-of-folders level.
> Kent -- out of curiosity, apart from junk, can you describe more scenarios where avoiding indexing at all would be helpful?
As a personal but strange case, I run both IMAP and POP versions of my main email for testing purposes, and would like to only index the POP version so that I don't see duplicates. As a contrived but probably more realistic example, I can see someone having a number of high-volume email lists that they subscribe to but are not interested in seeing in search, since they show false positives that are not of interest. This is virtually the same as your:
> I can come up with one, which would be trying to avoid polluting the FTS index w/ "noise" data. But I could see that being done _in_ gloda
gloda still needs a source of data on dontIndex. If you try to add this directly to gloda, then you have subtly changed gloda to be a permanent source of message (or folder) metadata, which is a line I encouraged in my m.d.a.t posting that we not cross without serious thought as to the long-term architectural consequences. It was tagging and junk status that moved the existing message summary files into permanent message stores without serious thought.
As a pre-TB3 activity, it would be nice to include now some of these hooks directly into gloda, so that an extension (such as GlodaQuilla) could allow more closely controlling indexing without breaking the TB3 string freeze. I think the hooks will be fairly simple and risk-free to add.
(In reply to comment #21)
> > I can come up with one, which would be trying to avoid polluting the FTS index w/ "noise" data. But I could see that being done _in_ gloda
>
> gloda still needs a source of data on dontIndex. If you try to add this directly to gloda, then you have subtly changed gloda to be a permanent source
My interpretation of davida's response is that this is something that faceting or other advanced heuristics can deal with. For example, with faceting we can say "eliminate all the messages sent to mailing lists" in a single stroke from the result set. Blinding gloda to the messages entirely is a more extreme measure. I agree with your concerns about the danger of violating the "gloda is an index" axiom.
> As a pre-TB3 activity, it would be nice to include now some of these hooks directly into gloda, so that an extension (such as GlodaQuilla) could allow more closely controlling indexing without breaking the TB3 string freeze. I think the hooks will be fairly simple and risk-free to add.
Gloda currently has an incomplete mechanism to mark a folder as not to be indexed. Each GlodaFolder has a priority. Only sweep indexing pays attention to it. As a forward-thinking move, gloda persists the priority, but there is no way to actually manipulate the priority. If you wanted to implement a patch to make gloda pay more attention to the priority, base its priority decision on properties set on the folders (using inheritance), and be clever enough to update the gloda priorities when changes are made on the folder, I'd be game for that.
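The "inherited folder property" idea from this exchange can be sketched as a lookup that walks up the folder tree until some ancestor sets the property. `FolderNode` and its `inherited` method are hypothetical and do not reflect the actual folder or GlodaFolder classes; the property name `dontIndex` is taken from the discussion and used purely as an example.

```python
class FolderNode:
    """Hypothetical folder with optional properties and a parent link."""

    def __init__(self, name, parent=None, **props):
        self.name = name
        self.parent = parent
        self.props = props

    def inherited(self, prop, default=None):
        """Return the nearest value of prop, searching this folder then its ancestors."""
        node = self
        while node is not None:
            if prop in node.props:
                return node.props[prop]
            node = node.parent
        return default


account = FolderNode("account")
lists = FolderNode("Lists", parent=account, dontIndex=True)       # whole subtree opted out
dev = FolderNode("dev-apps", parent=lists)                        # inherits from Lists
announce = FolderNode("announce", parent=lists, dontIndex=False)  # explicit override
inbox = FolderNode("Inbox", parent=account)                       # nothing set anywhere

for folder in (dev, announce, inbox):
    print(folder.name, folder.inherited("dontIndex", False))
```

With a lookup like this, a single setting on an account or parent folder covers the whole subtree, while an individual folder can still opt back in.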
Fixed in gloda correctness patch tracked on bug 465618. trunk commit: http://hg.mozilla.org/comm-central/rev/413b2018349c Tested by base_index_junk.js and variants.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Flags: in-testsuite+
Resolution: --- → FIXED