Closed Bug 231873 Opened 21 years ago Closed 21 years ago

libmime should strip out html tags for bayesian spam filter

Categories

(MailNews Core :: Filters, defect)

Platform: x86
OS: Windows XP
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mscott, Assigned: mscott)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

According to a lot of the research done by the spambayes folks, we don't want to be tokenizing HTML tags in the message body when trying to determine if the message is junk. Stripping out HTML tokens will allow us to properly catch things like Via<asd>gra</asd> or Via<!-- nothing -->gra, which confuse the tokenizer today. It looks like we can do this very easily by leveraging BenB's work and forcing plain text conversion on the message if libmime is processing it for the bayesian engine. One thing I did notice, though, is that the plain text mode for libmime does some basic HTML substitution with plain text equivalents, which ideally we would not want; i.e. Hel<i>lo</i> becomes Hel/lo/. If there were an easy way to turn that part off, even better! Regardless, just this small change should help the tokenizer quite a bit. Patch coming up.
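To make the failure mode concrete, here is a small standalone C++ sketch -- not the libmime or bayes code; the helper names stripHtml and tokenize are made up for illustration -- showing what a tokenizer that splits on <, > and whitespace sees for Via<asd>gra</asd> before and after tag stripping:

  // Standalone illustration only; not the actual Mozilla code paths.
  #include <cctype>
  #include <iostream>
  #include <string>
  #include <vector>

  // Split on '<', '>' and whitespace, roughly what the current tokenizer does.
  static std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
      if (c == '<' || c == '>' || std::isspace(static_cast<unsigned char>(c))) {
        if (!current.empty()) {
          tokens.push_back(current);
          current.clear();
        }
      } else {
        current += c;
      }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
  }

  // Drop everything between '<' and '>' (tags and comments alike).
  static std::string stripHtml(const std::string& text) {
    std::string out;
    bool inTag = false;
    for (char c : text) {
      if (c == '<') inTag = true;
      else if (c == '>') inTag = false;
      else if (!inTag) out += c;
    }
    return out;
  }

  int main() {
    const std::string body = "Via<asd>gra</asd> and Via<!-- nothing -->gra";

    std::cout << "raw tokens:";
    for (const auto& t : tokenize(body)) std::cout << ' ' << t;
    std::cout << "\nstripped tokens:";
    for (const auto& t : tokenize(stripHtml(body))) std::cout << ' ' << t;
    std::cout << '\n';  // raw yields Via, asd, gra, ...; stripped yields Viagra
  }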
Blocks: 230093
Status: NEW → ASSIGNED
Attached patch: the fix (Splinter Review)
The only downside of letting libmime do the work of stripping HTML tags is that spambayes claims there is a lot of value in being able to tokenize embedded URLs (such as an img src URL), and they do tokenize those. With this solution, we won't see the embedded image source URLs anymore. But I think that negative is outweighed by the benefit of no longer tokenizing HTML tags.
Please forgive my ignorance on the issue, but wouldn't it be ideal to parse message content (subject, body, etc.) for embedded URLs? I watch a SpamAssassin instance parse spam messages and see various iterations of embedded URLs that signify spam. It would be unfortunate to lose this resolution. Any chance the embedded URL scan could be a preliminary step for spam analysis before the run through libmime? Thx -GA
Scott, I'm pretty sure that the tokenizer is already stripping out html tags. I don't know where they get stripped out, but when I've looked through the debug logs there are no html tags. Also, if you do a 'strings' on the training.dat file you won't see any html tags.
Interesting... when I sat there in the debugger I saw Tokenizer::add get called with values of html, body, div, etc., so we do generate tokens for them. Are you thinking that someone else knows to ignore these tokens? Note: we tokenize off of < and >, so you won't see <html>; you would see html in your training.dat.
When I scan through my training.dat, I do see words like font, div, body, etc. Maybe they got there because they were actual words in a message and not part of HTML tags. But I see a difference in the tokens we generate for HTML elements before and after this patch.
I'm wrong, I just went through and I see a bunch of html stuff. I'm not sure what it was that I was thinking about. Never mind me, continue on.
Miguel, do you agree with me that we should do this? I'd like to put together a single patch with some of these token changes + your algorithm changes and have you run through your test set again so we can see how we compare to:
- the existing app
- app + your algorithm change
- app + algo change + start of tokenizer changes
I've set up the SpamAssassin public corpus for testing on my mail server. I'll test against each of the changes individually and then all of them together. There are some complicated training ideas at http://www.entrian.com/sbwiki/TrainingIdeas, but I'm just going to keep it simple and train on half the emails and rate the other half. I'll try to get this done tonight.
Scott, the HTML->TXT converter definitely can leave the /italics/ markers etc. out, if you give it the right flags. Here's a quote from mimethpl.cpp, the "As Plaintext" libmime class:

  PRUint32 flags = nsIDocumentEncoder::OutputFormatted |
                   nsIDocumentEncoder::OutputWrap |
                   nsIDocumentEncoder::OutputFormatFlowed |
                   nsIDocumentEncoder::OutputLFLineBreak |
                   nsIDocumentEncoder::OutputNoScriptContent |
                   nsIDocumentEncoder::OutputNoFramesContent |
                   nsIDocumentEncoder::OutputBodyOnly;
  HTML2Plaintext(cb, asPlaintext, flags, 80);

HTML2Plaintext() is a helper function in mimemoz2.*. The possible flags should be documented in nsIDocumentEncoder.*. Note that the "As Plaintext" libmime class is quite slow. It does an HTML (input) to TXT conversion, then a TXT->HTML conversion again for display. Unless you use some uncommon source flags for libmime, your output is HTML (different from the input), not plaintext. I think it makes more sense to just call HTML2Plaintext() directly in/for your spam filter.
Thanks for the pointers, Ben. I was looking at the flags for nsIDocumentEncoder and the only one I see that comes close to what we want is:

  // Plaintext output: Convert html to plaintext that looks like the html.
  // Implies wrap (except inside <pre>), since html wraps.
  // HTML output: always do prettyprinting, ignoring existing formatting.
  // (Probably not well tested for HTML output.)
  OutputFormatted = 2,

But that still marks up the plain text, which is what we are hoping to avoid. Or are you saying that even though I'll still get some markup, it will be more efficient to call this directly instead of going through the PlainText converter, because that skips the extra text-to-HTML translation?
If you don't want /italics/ and * list items, you should *not* set |OutputFormatted|.

> Or are you saying that even though I'll still get some markup, it will be more
> efficient to call this directly instead of going through the PlainText converter,
> because that skips the extra text-to-HTML translation?

Yes, that and buffers and other stuff you'll really dislike :). Because the "HTML as Plaintext" class does HTML2Plaintext plus other things, HTML2Plaintext alone is guaranteed to be faster.
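Putting Ben's suggestion together, a rough sketch of what a direct HTML2Plaintext() call for the spam filter could look like, modelled on the mimethpl.cpp snippet quoted above -- the exact signature and the htmlBody/plainBody variables are assumptions here, not checked against mimemoz2.h:

  PRUint32 flags = nsIDocumentEncoder::OutputLFLineBreak |
                   nsIDocumentEncoder::OutputNoScriptContent |
                   nsIDocumentEncoder::OutputNoFramesContent |
                   nsIDocumentEncoder::OutputBodyOnly;
  // OutputFormatted / OutputFormatFlowed / OutputWrap deliberately left out so
  // the converter does not add /italics/, * bullets, or wrapping -- the
  // tokenizer only needs the bare words.
  nsresult rv = HTML2Plaintext(htmlBody, plainBody, flags, 80);
  if (NS_SUCCEEDED(rv)) {
    // hand plainBody to the bayesian tokenizer instead of the raw HTML source
  }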
*** Bug 213614 has been marked as a duplicate of this bug. ***
This was fixed as part of bug 230093, where the junk mail tokenizer runs the MIME text through the serializer.
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Product: MailNews → Core
Product: Core → MailNews Core