Open Bug 353839 Opened 14 years ago Updated 9 years ago

Add support for third-party spam filters (SpamSieve)

Categories

(Thunderbird :: General, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

People

(Reporter: sheppy, Unassigned)

References

(Blocks 1 open bug)

Details

I like to use SpamSieve (http://c-command.com/spamsieve/) for my spam filtering since it works well for me. The author says he'd like to add support for Thunderbird, but Thunderbird doesn't offer the hooks needed to connect a third-party spam filter application to it.
Did the author say what kind of hooks he wanted? I'd be happy to work with him to add hooks...
I've emailed the author to ask him to post his needs here.
OS: Mac OS X 10.4 → All
Hardware: Macintosh → All
Summary: Add support for third-party spam filters → Add support for third-party spam filters (SpamSieve)
Version: 1.5 → Trunk
I'm the SpamSieve developer, and I'd like to help you get Thunderbird
and SpamSieve integrated.

Communication with SpamSieve is via three main Apple events, which are
documented in its scripting dictionary. The "score" command sends
SpamSieve an RFC-822 message and returns an integer indicating how
spammy it thinks the message is (0-49 means not spam, 50-100 means
spam). The "add spam" and "add good" commands each send SpamSieve a
message and train it that the message is spam or not spam. Getting the two
programs integrated is mainly a matter of getting Thunderbird to send
SpamSieve the events at the proper times.
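
As a rough illustration of the score semantics just described, here is a minimal sketch in Python. The names `Verdict`, `SPAM_THRESHOLD` and `verdict_from_score` are made up for this example and are not part of SpamSieve's scripting dictionary; only the 0-49 / 50-100 split comes from the description above.

```python
from enum import Enum


class Verdict(Enum):
    GOOD = "good"
    SPAM = "spam"


# From the description above: 0-49 means not spam, 50-100 means spam.
SPAM_THRESHOLD = 50


def verdict_from_score(score: int) -> Verdict:
    """Map a "score" result (0-100) to a junk verdict (hypothetical helper)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return Verdict.SPAM if score >= SPAM_THRESHOLD else Verdict.GOOD
```

The mail client would only need this one comparison on the returned integer; everything else about the classification stays inside the filter.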

I'm not sure what you have in mind for how this would work. I think it
would be simplest from the user's point of view if it were possible to
enable SpamSieve in the Adaptive Filter tab of the Junk Mail Controls
window--that is, if there were a switch for choosing between the
built-in engine and SpamSieve (or another filter). If the latter were
selected, the rest of the Thunderbird user interface would stay the
same, but the junk commands would be routed to SpamSieve. When a message
arrives, Thunderbird would send the "score" command to SpamSieve and
then set the message's junk status and (depending on the score and the
Handling setting) move it to the Junk folder. When the user clicks the
Junk or Not Junk button (or uses a menu command), Thunderbird would send
the "add spam" or "add good" command.
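
The routing described above could be sketched roughly as follows. This is Python pseudostructure rather than Thunderbird code: `JunkFilter`, `Message` and `JunkController` are hypothetical names, and `move_junk_to_folder` merely stands in for the Handling setting.

```python
from dataclasses import dataclass
from typing import Protocol


class JunkFilter(Protocol):
    """Hypothetical backend: the built-in engine or an external filter."""

    def score(self, raw_message: bytes) -> int: ...
    def add_spam(self, raw_message: bytes) -> None: ...
    def add_good(self, raw_message: bytes) -> None: ...


@dataclass
class Message:
    raw: bytes
    is_junk: bool = False
    folder: str = "Inbox"


@dataclass
class JunkController:
    backend: JunkFilter
    move_junk_to_folder: bool = True  # stand-in for the Handling setting

    def on_message_arrived(self, msg: Message) -> None:
        # Score the new message, set its junk status, and optionally move it.
        msg.is_junk = self.backend.score(msg.raw) >= 50
        if msg.is_junk and self.move_junk_to_folder:
            msg.folder = "Junk"

    def on_user_marked(self, msg: Message, junk: bool) -> None:
        # Junk / Not Junk button: update the status and train the backend.
        msg.is_junk = junk
        if junk:
            self.backend.add_spam(msg.raw)
        else:
            self.backend.add_good(msg.raw)
```

Because the controller only talks to the `JunkFilter` protocol, switching between the built-in engine and another filter would amount to passing a different `backend`; the rest of the user interface would not need to know which one is active.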

Ideally, from my perspective, the code to launch SpamSieve and send it
the three commands would be built into Thunderbird. (I could supply
sample code if you want.) Then, if the two applications were both
installed, they would automatically work together, without the user
having to install anything. This is how SpamSieve works with Mailsmith
and GyazMail.

Another option would be for you to build a more general system where
Thunderbird lets the user choose AppleScripts or command-line tools to
be run when messages need to be scored or marked as Junk or Not Junk.
Then I would maintain these scripts/tools as part of SpamSieve. If you
go this route, I would strongly recommend having an automated way to
tell Thunderbird which scripts/tools to use. The user shouldn't have to
select anything in the filesystem; rather, he should be able to choose a
menu item in SpamSieve and have everything hooked up, e.g. by SpamSieve
writing the locations into a Thunderbird config file or
copying/symlinking them into a known location in a Thunderbird support
folder.
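
To make the symlinking idea concrete, here is a small Python sketch of what such a one-click install step might do. The hook names and the `JunkHooks` subfolder are invented for illustration; Thunderbird defines no such location today.

```python
from pathlib import Path
from typing import List

# Hypothetical hook names; nothing in Thunderbird defines these.
HOOK_NAMES = ("score", "add-spam", "add-good")


def install_hooks(tool_dir: Path, support_dir: Path) -> List[Path]:
    """Symlink each hook tool from tool_dir into support_dir/JunkHooks."""
    hooks_dir = support_dir / "JunkHooks"
    hooks_dir.mkdir(parents=True, exist_ok=True)
    installed = []
    for name in HOOK_NAMES:
        source = tool_dir / name
        if not source.is_file():
            raise FileNotFoundError(source)
        link = hooks_dir / name
        if link.is_symlink() or link.exists():
            link.unlink()  # replace a stale entry from an earlier install
        link.symlink_to(source.resolve())
        installed.append(link)
    return installed
```

Running the step twice simply refreshes the links, so the filter application could redo it after every update without the user touching the filesystem.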
It sounds to me like a Thunderbird extension is the natural way to go here. I think what this extension should do is replace Thunderbird's built-in bayesian filter with SpamSieve. We already have to notify our bayesian filter when the user marks a message as junk/non junk, and we already ask it to classify messages as junk or non-junk.  Our bayesian filter sits behind an interface, nsIJunkMailPlugin - http://lxr.mozilla.org/mozilla/source/mailnews/base/search/public/nsIMsgFilterPlugin.idl#81

and the SpamSieve extension would implement the same interface. That would make it pretty seamless as far as the user is concerned. It would need to do much the same things as our existing Bayesian filter. For example, when asked to classify messages, it would need to stream the messages, one after the other, get their text, and classify them. See http://lxr.mozilla.org/mozilla/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#1225

and http://lxr.mozilla.org/mozilla/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#1032

And then asynchronously notify the core code about the results, for each request:

http://lxr.mozilla.org/mozilla/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#1097

What I'm not sure about is how easy it is to override an existing contract ID and provide your own module that implements that contract ID... I think it should just work, but if worst comes to worst, we could add a pref, set by the extension, that controls which contract ID the core code uses to find the spam filter to run.
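
The pref fallback could look something like this sketch. It is plain Python, not XPCOM: `BUILTIN_ID` and `PREF_NAME` are placeholders rather than real identifiers, and `registry` is a stand-in for the component manager.

```python
from typing import Callable, Dict

# Placeholder identifiers; the real contract ID and pref name are not given here.
BUILTIN_ID = "example/builtin-junk-filter"
PREF_NAME = "example.junk.filter_contract_id"


def resolve_filter(prefs: Dict[str, str],
                   registry: Dict[str, Callable[[], object]]) -> object:
    """Instantiate the filter named by the pref, else the built-in one."""
    contract_id = prefs.get(PREF_NAME, BUILTIN_ID)
    factory = registry.get(contract_id)
    if factory is None:
        # Extension missing or not registered: fall back to the built-in
        # filter rather than leaving the user with no filtering at all.
        factory = registry[BUILTIN_ID]
    return factory()
```

The fallback branch is a design choice of this sketch: if the extension is uninstalled but the pref lingers, junk filtering keeps working.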

So you'd have to write a bit of glue code to stream the messages, and either combine it into one big buffer, or feed parts of the message to your analysis code - the nice thing about streaming messages is that for the caller, it doesn't matter if it's imap or pop3 - you just get called back as data is available.
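
Roughly, the "one big buffer" variant of that glue could be shaped like the following Python sketch. The method names only loosely mirror a stream-listener callback style and are not the actual XPCOM interface; `classify` and `on_result` are hypothetical callbacks.

```python
from typing import Callable, List


class BufferingListener:
    """Collects streamed chunks of one message, then classifies the whole text."""

    def __init__(self, message_uri: str,
                 classify: Callable[[bytes], int],
                 on_result: Callable[[str, int], None]) -> None:
        self._uri = message_uri
        self._classify = classify
        self._on_result = on_result
        self._chunks: List[bytes] = []

    def on_data_available(self, chunk: bytes) -> None:
        # Called repeatedly as data arrives; the source (IMAP, POP3, local)
        # does not matter to the listener.
        self._chunks.append(chunk)

    def on_stop(self) -> None:
        # Stream finished: join the chunks and report the score for this URI.
        raw = b"".join(self._chunks)
        self._chunks.clear()
        self._on_result(self._uri, self._classify(raw))
```

Collecting chunks in a list and joining once at the end avoids repeatedly copying a growing buffer, which matters for messages with large attachments.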

Do you do your whitelisting by parsing the message headers as they go by? We actually have access to the sender without needing to stream the whole message - you'd get this by getting the message header from the URI and asking the header for the author...


I don't see any reason to run both bayesian filters, do you? I can imagine other scenarios where we want to run multiple plugins, but in this case, it just sounds like duplication.

Does this sound like an approach that can work for you? It does involve writing an extension, but it's by far the most natural way to do it from Thunderbird's point of view, and doesn't require any changes to the core TB code.
I don't mind writing the glue code for the extension, but would it be possible for someone to help me figure out how to compile it? The Mozilla build system baffles me. The extension examples that I found aren't for extensions with C code. And the bayesian-spam-filter extension seems to be compiled directly into the huge thunderbird-bin. What I'd like to know is, how do I get from my nsSpamSieveBayesianFilter.cpp to a .xpi file that will override the code in nsBayesianFilter.cpp?

> For example, when asked to classify messages, it would need to stream the messages,one after the other, get their text, and classify them.

Does TokenStreamListener::OnDataAvailable receive all of the message source, or just the body? The reason I ask is that I also see ProcessHeaders, HandleAttachment, etc. I don't want Thunderbird to do the parsing; I want to feed the entire message to SpamSieve's parser.

> Do you do your whitelisting by parsing the message headers as they go by?

No, it all happens after receiving the entire message text. Whitelist rules can refer to any part of the message.

> I don't see any reason to run both bayesian filters, do you?

No. Plus, I think it would make training very confusing.
Assignee: mscott → nobody
Duplicate of this bug: 360984
Blocks: junktracker
The current situation is that SpamSieve has a plug-in for Thunderbird 1.5.x and 2.x. It works as described in David's message. We weren't able to get at the original message source, so it tries to get various pieces and glue them together. It doesn't have access to all of the data for the attachments, so the filtering is not as accurate as with other mail clients, but overall it works pretty well.

Unfortunately, I was never able to get Thunderbird 3.x to load the plug-in, even after adapting it to the interface changes and recompiling it against the new source base. Various Mozilla people tried to help me but were stumped.
(In reply to comment #7)
> Unfortunately, I was never able to get Thunderbird 3.x to load the plug-in,
> even after adapting it to the interface changes and recompiling it against the
> new source base. Various Mozilla people tried to help me but were stumped.

Can you elaborate on this?  Are you saying the binary extension would not even get loaded into Thunderbird's address space, or that once it was there that nothing interesting happened?  Did you step through whatever platform code was doing the loading?  (Not being accusatory here, just would like to know what was tried and did not work out :)
Er, and it may be best to have this discussion in the dev-apps-thunderbird newsgroup. There are apparently a number of third-party binary Thunderbird add-on developers out there, and they may be able to lend a hand if they can see the thread.
(In reply to comment #8)
> Can you elaborate on this?  Are you saying the binary extension would not even
> get loaded into Thunderbird's address space, or that once it was there that
> nothing interesting happened?

I verified (with some print statements) that the component registration code in my binary extension was executing, but nothing else happened. Thunderbird never called my code when new messages arrived or the user marked messages as junk.

We also looked at compreg.dat. With 2.x (where the extension worked) it shows my dylib registered under the contract ID (f1070bfa-d539-11d6-90ca-00039310a47a). However, with 3.x the compreg.dat is identical before and after installing my extension.
Michael, any progress?

Also, is any aspect of Bug 493309 needed?
Wayne: No progress. If someone wants to help, I think the best thing would be to create a small example demonstrating how to build an extension that overrides the built-in filter. That is, use the existing Bayesian filter code and just show how to do the extension part (Makefile, install.rdf, etc.). If this is possible, it should be easy for someone who understands Thunderbird. I think that makes more sense than going back and forth explaining what I did that didn't work.

I don't think I need anything in Bug 493309.
(In reply to comment #12)
> Wayne: No progress. If someone wants to help, I think the best thing would be
> to create a small example demonstrating how to build an extension that
> overrides the built-in filter. That is, use the existing Bayesian filter code
> and just show how to do the extension part (Makefile, install.rdf, etc.). If
> this is possible, it should be easy for someone who understands Thunderbird. I
> think that makes more sense than going back and forth explaining what I did
> that didn't work.

Kent, do you have an extension like this already?
I don't have such an extension. I don't think that it would be terribly difficult to throw one together though. The main issue for me is that my existing EWS extension connects in mostly through existing hooks that look for components with specific names based on a server type (which for me is "ews" instead of "imap") but the spam filter replacement needs to override an existing component. Not difficult in theory, but it's different magic than I normally use.

Is the main issue here a binary C++ component, or JavaScript?
Kent, I'm assuming that it would need to be written in C++, and that's what I did before, but JavaScript would be fine, too, if Thunderbird allows calling out from JavaScript (e.g. to a shell tool).
You could probably get away with jsctypes if you needed to connect to a binary library. It depends on how the filter tool would be activated--if it's just calling a shell command, that is easily doable with nsIProcess (I think).
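
For the shell-command route, the general shape would be to pipe the raw message to a tool and read back an integer. The sketch below shows that shape in Python with `subprocess` rather than with nsIProcess, purely as an illustration; `score_with_tool` and `FAKE_TOOL` are hypothetical, and the stand-in tool is just a one-line Python script, not SpamSieve.

```python
import subprocess
import sys
from typing import Sequence


def score_with_tool(command: Sequence[str], raw_message: bytes,
                    timeout: float = 30.0) -> int:
    """Pipe a raw message to an external tool and parse its integer score."""
    result = subprocess.run(command, input=raw_message,
                            capture_output=True, timeout=timeout, check=True)
    score = int(result.stdout.decode("ascii").strip())
    if not 0 <= score <= 100:
        raise ValueError(f"tool returned out-of-range score: {score}")
    return score


# Stand-in for a real filter tool: reads stdin, prints a score on stdout.
FAKE_TOOL = [sys.executable, "-c",
             "import sys; data = sys.stdin.buffer.read(); "
             "print(90 if b'lottery' in data.lower() else 5)"]
```

With the stand-in tool, `score_with_tool(FAKE_TOOL, b"You won the LOTTERY")` returns 90. Passing the message on stdin keeps arbitrary message bytes out of the command line, and the timeout keeps a hung tool from blocking mail processing indefinitely.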