Open Bug 168905 Opened 22 years ago Updated 2 years ago

intelligent mail classification

Categories

(MailNews Core :: Backend, enhancement)

enhancement

Tracking

(Not tracked)

People

(Reporter: Henry.Jia, Unassigned)

References

Details

Attachments

(2 files)

Handling mail efficiently is a real problem because we receive so much mail
every day. An agent that intelligently classifies mail into folders for us
would help a lot, and it should require no extra work from the user.
Example-based learning is a good approach.

Filters provide only static mail classification. What I mean here is to have
Mozilla classify the user's mail intelligently by learning from the user's
actions, such as dragging a message to a folder.

Assigning to Fangfang for further handling.
Blocks: 168902
Reporter, you need to be much more specific. Please clarify what you mean and
how this would be implemented.

Severity -> Enhancement.
Severity: normal → enhancement
I mean to let Mozilla learn the user's intention and then give suggestions for
mail classification. After Mozilla observes the user dragging messages to
folders, it infers the user's intention by analyzing the similarity of those
messages (using the mail headers and bodies). Then, when new mail arrives,
Mozilla suggests which folder each message should be filed in. Fangfang will
give a more detailed description a little later.
The idea of this "Personal Email Agent" is to help users organize the large
number of emails in their inbox. It is somewhat like a filter, but an agent is
more intelligent and less rigid. The functionality should include at least
intelligent email classification, retrieval of similar emails, and incremental
learning from user feedback.

The basic scenarios are as follows:
Scenario 1: When the agent is set up for the first time:
The agent will scan all the folders, fetch each message, and record the
information of interest, such as phrases, frequencies, etc.
Scenario 2: When the user changes the structure of his folders, for example by
dragging a message from one folder to another:
The agent will adjust the information it has recorded to stay consistent with
the DB, and will also try to learn the user’s mail sorting preferences through
certain query refinement strategies.
Scenario 3: When a new message comes in:
The agent will fetch the body of the message, decide which category it falls
into, and label it.

One of the key ideas in this approach is extracting both the high-level semantic
features (e.g., concept information) from the body text and other low-level
email features (e.g., sender, time, importance, etc.) from the entire email
message for similarity assessment based on the standard Information Retrieval
(IR) approach. Since IMAP does not fetch the message body by default, the agent
will listen to the DB changes and is always ready to fetch the body of a changed
message. 
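
The three scenarios above can be sketched roughly as follows. This is only an
illustration and not the code in the attached patch: the class name MailAgent,
its method names, and the plain word-overlap score are all assumptions made for
the example.

```python
from collections import Counter


class MailAgent:
    """Hypothetical agent that keeps per-message word counts, grouped by folder."""

    def __init__(self):
        # folder name -> {message id -> Counter of words in the body}
        self.index = {}

    def scan(self, folders):
        # Scenario 1: record word frequencies for every existing message.
        for folder, messages in folders.items():
            for msg_id, body in messages.items():
                words = Counter(body.lower().split())
                self.index.setdefault(folder, {})[msg_id] = words

    def on_move(self, msg_id, src, dst):
        # Scenario 2: keep the index in step with the user's folder changes.
        self.index.setdefault(dst, {})[msg_id] = self.index[src].pop(msg_id)

    def on_new_message(self, body):
        # Scenario 3: label the message with the folder sharing the most words.
        words = Counter(body.lower().split())

        def overlap(folder):
            return sum(sum((words & seen).values())
                       for seen in self.index[folder].values())

        return max(self.index, key=overlap) if self.index else None
```

Moving a message changes later suggestions, which is the simplest possible form
of learning from the user's actions; a real agent would weight features rather
than count shared words.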
Status: NEW → ASSIGNED
Towards a Personalized Agent for Email Management

I. Semi-automatic email classification based on user organized folders. 

Functionality
   A new button named "classify" is added to the primary toolbar. The first
click on this button invokes the agent to perform a thorough scan of the
messages in the user-maintained folders; during this process a virtual category
is formed from each folder and named after it. A second click then generates
the classification results for the emails in INBOX; each message is assigned a
suggested category together with a relevance score. The user may accept or
ignore the classification result, or press the 'Archive' button in the dialog
to move the selected emails to their destination folders. The same dialog also
pops up whenever new messages arrive.

Notes 
1. Only the emails on the first IMAP server are processed by the current agent.
2. Only the user folders in the same directory as INBOX are regarded as
user-defined examples. The Trash, Drafts, and Sent folders are ignored.
3. Only the 10 most recent emails in each example folder are scanned for
training purposes. The corresponding threshold is AGENT_FOLDER_SIZE in
"nsImapAgent.h".
4. The initial scanning step may take a long time, depending on the number of
folders and emails that the user currently has. A brief progress percentage is
printed in the console window. During this process, any clicks on the
"classify" button are ignored.
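
As a rough illustration of notes 2 and 3, the selection of training examples
might look like the sketch below. The data layout (folder name mapped to a list
of (timestamp, message) pairs), the function name, and the exclusion of INBOX
itself are assumptions made for the example; only AGENT_FOLDER_SIZE and the
ignored folder names come from the notes.

```python
AGENT_FOLDER_SIZE = 10  # threshold named in note 3 (value as stated there)
IGNORED_FOLDERS = {"Trash", "Drafts", "Sent"}  # folders named in note 2


def training_examples(folders):
    """Return {folder: newest AGENT_FOLDER_SIZE messages}, skipping special folders.

    `folders` maps a folder name to a list of (timestamp, message) pairs;
    that layout is an assumption for this sketch.
    """
    selected = {}
    for name, messages in folders.items():
        # INBOX holds the mail to classify, so it is assumed not to be an example.
        if name in IGNORED_FOLDERS or name == "INBOX":
            continue
        newest = sorted(messages, key=lambda pair: pair[0], reverse=True)
        selected[name] = [msg for _, msg in newest[:AGENT_FOLDER_SIZE]]
    return selected
```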

Algorithms
1. Similarity measurement
   The frequency of each word in a message is stored in a hash table during
the initial scan or on the arrival of a new email. After sorting, the words
with the highest frequencies are recorded in the form of a vector. The
similarity of two emails can then be calculated by comparing their vectors. To
assess the similarity between a folder and an individual message, the
k-Nearest-Neighbor method is used.
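
A minimal sketch of this kind of similarity measurement is given below. The
description does not say how the vectors are compared, so cosine similarity is
used here as an assumption, and the function names, the vector size n, and the
way the relevance value is averaged are likewise illustrative only.

```python
import math
from collections import Counter


def top_terms(text, n=50):
    """Keep the n most frequent words of a message as a sparse vector."""
    return dict(Counter(text.lower().split()).most_common(n))


def cosine(a, b):
    """Cosine similarity of two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def classify(message, examples, k=3):
    """examples: list of (folder, text) pairs. Returns (folder, relevance)."""
    vec = top_terms(message)
    # Rank all example messages by similarity and keep the k nearest.
    nearest = sorted(((cosine(vec, top_terms(text)), folder)
                      for folder, text in examples), reverse=True)[:k]
    if not nearest:
        return None, 0.0
    totals, counts = Counter(), Counter()
    for score, folder in nearest:
        totals[folder] += score
        counts[folder] += 1
    # The folder with the largest summed similarity among the neighbors wins.
    best = max(totals, key=totals.get)
    return best, totals[best] / counts[best]
```

Summing similarities instead of counting votes lets one very close neighbor
outweigh several distant ones, which matters when only a few examples per
folder are scanned.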

2. Customized hash table
   A customized hash table class is implemented. The hash function (Bob
Jenkins, 1996) hashes a variable-length string key into a 32-bit value. It is a
fairly fast hashing algorithm that produces few collisions (approximate
collision probability for two distinct keys: one in 2^32). The class also
implements member functions that dump the keys and values into an array, which
facilitates sorting words by frequency.
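
For illustration, a word-count table of this kind could be sketched as below.
The description does not name the exact Jenkins function, so Jenkins's
one-at-a-time hash is used here as a stand-in; the class name WordTable, its
methods, and the bucket count are assumptions and do not reflect the patch.

```python
def one_at_a_time(key: bytes) -> int:
    """Bob Jenkins's one-at-a-time hash, reduced to 32 bits at every step."""
    h = 0
    for byte in key:
        h = (h + byte) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h


class WordTable:
    """Tiny chained hash table mapping word -> count (illustrative only)."""

    def __init__(self, buckets=1024):
        self.buckets = [[] for _ in range(buckets)]

    def add(self, word):
        chain = self.buckets[one_at_a_time(word.encode()) % len(self.buckets)]
        for entry in chain:
            if entry[0] == word:
                entry[1] += 1
                return
        chain.append([word, 1])

    def dump_sorted(self):
        # Dump all (word, count) pairs into one array, most frequent first.
        items = [(word, count) for chain in self.buckets for word, count in chain]
        return sorted(items, key=lambda item: (-item[1], item[0]))
```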

3. Query refinement strategies
   The agent listens to the changes of the message DB and makes adjustments
accordingly. 

Performance
The agent has been tested on simple cases. Using the filter classification
result (around 20 rules) as ground truth, the agent reaches an accuracy of up
to 90%. Since the agent currently uses only the text feature as its
classification criterion, these promising preliminary results motivate further
work. In practice, the filters should take precedence over the agent and, to
achieve good accuracy, the number of training emails in each example folder
should be raised to as many as 100. Serialization is necessary both to ensure
classification efficiency and to achieve personalization (one agent per
account). Multiple-feature extraction and the corresponding methods for
similarity assessment and for learning user preferences are also being
devised.
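
The accuracy figure above is presumably the fraction of messages for which the
agent's suggestion matches the folder chosen by the filters. A minimal sketch
of that measurement, with a hypothetical function name and data layout
(message id mapped to folder name), could be:

```python
def accuracy(predicted, ground_truth):
    """Fraction of messages whose predicted folder matches the filter's folder.

    Both arguments map a message id to a folder name (assumed layout).
    """
    if not ground_truth:
        return 0.0
    hits = sum(1 for msg_id, folder in ground_truth.items()
               if predicted.get(msg_id) == folder)
    return hits / len(ground_truth)
```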
Initial comment - all this stuff is going to want to go into an extension (e.g.,
mailnews/extensions/mailclassify), not into the core imap directory.
I'd like to get your feedback on the basic UI, sketched below in text form.
The intelligent interface is always a hard problem in agent systems. I am
looking forward to your comments.
__________________________________________________
|  Email  |  Category  |  Relevance  |  Archive  |
|------------------------------------------------|
|   #1    |   Folder4  |    0.870    |     Y     |
|------------------------------------------------|
|   #2    |   Folder2  |    0.902    |     Y     |
|------------------------------------------------|
|   #3    |   Folder3  |    0.622    |     N     |
|------------------------------------------------|
|   ...   |     ...    |     ...     |    ...    |
|------------------------------------------------|
|                                                |
|                       OK    Cancel     Archive |
|________________________________________________|
QA Contact: huang → gchan
Yes, a confirmation dialog before anything is moved seems necessary.

This dialog (because it is, in a way, a nuisance) should allow VERY rapid
changing of the pre-selections.

1. rename "Archive" to "Move?" 
   (it is less unclear - I had to read this entire bug to figure it out)

2. Add a column for "Move to directory:"

3. Make "move to directory" items selectable (like the "File" icon)

4. Change "email" column to "Subject"

5. Add "Sender" column

6. Remove "Relevance" column (user will make decision based on subj & sender)

7. one-click changing between Yes and No

8. The target area to click should be as LARGE as possible.

9. CTRL-click and SHIFT-click to select and then EASILY change "Yes <-> No".

10. CTRL-A to select and then EASILY change all "Yes <-> No".

11. OK button should be the default selection 
   (hitting RETURN or SPACE activates it).

12. If user places focus elsewhere, TAB should go *immediately* to OK button.

13. Window should remember user-selected size and position!

So the dialog could look like this:

+-----------------------------------------------------------------------------+
| Subject                    |  Sender     | Destination Directory    | Move? |
|-----------------------------------------------------------------------------|
| Hi Peter, wanna visit      |  Charly Chan| Local Folder / Pers    \/|   Y   |
|-----------------------------------------------------------------------------|
| Proposal for bridge cons.. |  Mr. Big    | Peter@Lairo.com / Work \/|   Y   |
|-----------------------------------------------------------------------------|
| Money, Free, Sex, Pyramid, |  idiot@yahoo| Peter@Lairo.com / Junk \/|   Y   |
|-----------------------------------------------------------------------------|
| Some vague subject line    |  Joe Shmo   | Peter@Lairo.com / Comp \/|   N   |
|-----------------------------------------------------------------------------|
|                                                                             |
|                                                       [ OK ]   [ CANCEL ]   |
|                                                                             |
+-----------------------------------------------------------------------------+

Note that the fourth row had a low "relevance". Therefore its default "Move?"
selection is "No".
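
The rule for the default "Move?" value could be sketched as follows. The
cut-off of 0.7 is a made-up number chosen only for illustration, and the class
and field names are hypothetical.

```python
from dataclasses import dataclass

MOVE_THRESHOLD = 0.7  # hypothetical cut-off, not taken from the patch


@dataclass
class Suggestion:
    subject: str
    sender: str
    destination: str
    relevance: float

    @property
    def move_by_default(self) -> bool:
        # Pre-select "Y" only when the agent is reasonably sure.
        return self.relevance >= MOVE_THRESHOLD


def dialog_rows(suggestions):
    """Build (subject, sender, destination, 'Y' or 'N') rows for the dialog."""
    return [(s.subject, s.sender, s.destination,
             "Y" if s.move_by_default else "N") for s in suggestions]
```

The relevance value stays in the data even though the column is hidden, so the
dialog can still use it to pick the default.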
To what extent does the work here duplicate what can be done with the Bayesian 
spam filters and/or can it be integrated with that?
I assumed that this was riding on the shoulders of the existing Bayesian
classification code, but from browsing the prototype patch it looks like
everything has been re-implemented.  I am not familiar with the existing
spam-filtering implementation so I can't say how reusable that code is... but
I'd be surprised if it couldn't be put to this general-classification use.
I don't see any reason why there should have to be two backends (for this and
the Bayesian spam classification system).  I don't think we should need two
frontends either; I would think a single "intelligent filter" system that could
filter messages into an arbitrary number of folders (one of which would probably
be "spam" in most people's usage) would be much less confusing to the end users,
and simpler to have to maintain in the tree.  Secondly, I was under the
impression that the Bayesian filters were accurate enough that a confirmation
dialog was unnecessary, at least once the filters had enough messages in each
folder to get good probability numbers.
RE comment #10 "no need for confirmation": For junkmail the user only has to
"confirm" one folder, but if the filter starts spreading mail to n folders...
First, the current spam filtering is supposed to be almost perfectly reliable,
although I don't have any raw data to back that up.  In any case, I don't see
why, if the filter determines that a certain message has a 98% chance of
belonging in my "PrefBar" folder, and a less than 10% chance of belonging in any
other folder, that I should have to confirm the choice before it gets moved. 
Yes, I might get one message a month put in the wrong place, but if we don't
confirm for spam (where a false positive could have serious negative side
effects) I don't see why we need to confirm for generic folders, where I'm going
to read them all anyway and can move the one message in a month that gets
misdirected to the correct folder when I read it.  Or maybe I'm missing the
point somehow.  Feel free to correct me (or politely tell me to shut up) if this
makes no sense to anyone.
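
To make the rule I have in mind concrete, here is a small sketch. The function
name and the dictionary of per-folder probabilities are assumptions; the 98%
and 10% figures are the ones from my example above.

```python
def should_auto_move(probabilities, high=0.98, low=0.10):
    """Return the folder to move to without confirmation, or None.

    `probabilities` maps a folder name to the classifier's estimated
    probability for that folder (assumed layout).
    """
    if not probabilities:
        return None
    best = max(probabilities, key=probabilities.get)
    others = [p for folder, p in probabilities.items() if folder != best]
    # Move silently only if one folder is a clear winner over all others.
    if probabilities[best] >= high and all(p < low for p in others):
        return best
    return None
```

Anything that fails the test would stay in the inbox or go to a confirmation
dialog.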
This is, of course, very similar to bug 181866. Should that one be attached to this?

The only difference is that this one proposes a more complex user interface,
doesn't propose reusing the Bayesian code, and includes a patch.

-Billy
Comment on attachment 114645 [details]
Description of the algothrim and code

Thanks for the comments. I will adjust the UI according to Peter Lairo's
advice.
Attachment #114645 - Attachment description: Description of the argothrim and code → Description of the algothrim and code
Reading comment 3, I think this is awfully close to bug 181866. Is the Bayesian
filtering code being reused here, or are you guys reinventing the wheel?
I think that what POPFile does is exactly that, apart from the
learning-by-monitoring-GUI part. It classifies mail into many types, which it
learns to recognize from the user's corrections. Perhaps the people at POPFile
would be willing to help incorporate their code into Mozilla.
Just a thought.
*** Bug 218692 has been marked as a duplicate of this bug. ***
Blocks: majorbugs
You guys are overcomplicating this.
I think most people do the first layer of mail classification by creating
different mail accounts: one for work in the office, one for work on weekends,
one for the family, etc.
Yes, it may be practical to sort mail among several subfolders of the inbox
(e.g. by project), but I would not do this automatically unless I'm
absolutely sure.
So what is really needed is a way to quickly and efficiently process mail from
one or several inboxes.
I suggest a very simple and effective solution.

Cool feature request:

“Deferred filter”

I'd like to define filters saying where to put mail from the inboxes after I've
read it and am ready to let it go.

The UI design may look like a generalization of the Junk column, but the icons
should already be present in the column.
The move action should be fired when the icon is clicked.
Icons should be customizable on the filter definition form.
The default icon should look like a small picture with the number of the filter
inside it.
Mozilla applies filters in the order they appear in the filters list.
One filter may consist of several rules.
The first rule that matches wins and gives its icon to the message.
If a message wasn't caught by any rule, I should be able to right-click in the
column and select an icon from a list.
For better usability, the filter definition form should have a check box for
an “apply immediately after manual selection” option. As the name suggests, it
forces a message to go to the folder defined in a filter immediately after the
corresponding icon is manually selected for an uncaught message.
We may also define a “default filter” for the double-click event. This could be
the “good old junk filter set” out of the box.
Any message goes to the folder defined in a filter after its “release filter”
icon is clicked. This is very similar to the existing feature of sending a
message to the junk folder by clicking on the junk column, but much more
functional.
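
A rough sketch of the deferred filter idea follows. All names here
(DeferredFilter, assign_icon, release) are made up for illustration, and a
message is modeled as a plain dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DeferredFilter:
    icon: str                        # icon shown in the message's column
    folder: str                      # destination used when the icon is clicked
    matches: Callable[[dict], bool]  # rule deciding whether a message is caught


def assign_icon(message: dict, filters: list) -> Optional[DeferredFilter]:
    """Filters are tried in list order; the first one that matches wins."""
    for candidate in filters:
        if candidate.matches(message):
            return candidate
    return None  # uncaught: the user picks an icon by hand


def release(message: dict, chosen: DeferredFilter, folders: dict) -> None:
    """Clicking the icon moves the message to the filter's folder."""
    folders.setdefault(chosen.folder, []).append(message)
```

Nothing is moved when the icon is assigned; the move happens only on release,
which is what makes the filter "deferred".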

To make this a “killer” feature, Mozilla could generate rules when I move an
iconized message to a destination folder, exactly as it does when I apply the
Junk action to a message or create a rule from a message, except that with junk
the destination is known in advance, whereas here it has to be determined after
I put a message there.

I would also like to see the junk filter in my filter list, so that I can make
it smarter than it is now.
*** Bug 243608 has been marked as a duplicate of this bug. ***
Here is my RFE (from bug 243608):

----

It would be useful to have more than one learnable filter.


Ex: you are on multiple mailing lists and want to sort the mail by topic (not
by list)

Ex: with filter management (import, export, signing (PGP, for trusted filters),
etc.), companies/persons can create their own filters to remove viruses, worms,
and hoaxes. You just need to import those filters and you easily get rid of
those **** without needing to train your filter.

----

I'm not exactly sure that I'm in the right place here ...
Product: MailNews → Core
No longer blocks: majorbugs
Blocks: 297108
Blocks: 66425
Looks interesting, but I doubt anyone's working on this anymore.
Assignee: fangfang.xia → nobody
Status: ASSIGNED → NEW
Component: MailNews: Networking → MailNews: Backend
QA Contact: grylchan → backend
Product: Core → MailNews Core
Severity: normal → S3