Closed Bug 231873 Opened 21 years ago Closed 21 years ago

libmime should strip out html tags for bayesian spam filter

Categories

(MailNews Core :: Filters, defect)

Platform: x86
OS: Windows XP
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mscott, Assigned: mscott)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

According to a lot of the research done by the spambayes folks, we don't want to be tokenizing HTML tags in the message body when trying to determine if the message is junk. Stripping out HTML tokens will allow us to properly catch things like Via<asd>gra</asd> or Via<!-- nothing -->gra, which confuse the tokenizer today. It looks like we can do this very easily by leveraging BenB's work and forcing plain text conversion on the message if libmime is processing it for the bayesian engine. One thing I did notice, though, is that the plain text mode for libmime does some basic HTML substitution with plain text equivalents, which ideally we would not want; i.e. Hel<i>lo</i> becomes Hel/lo/. If there were an easy way to turn that part off, even better! Regardless, just this small change should help the tokenizer quite a bit. Patch coming up.
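To make the failure mode concrete, here is a small standalone C++ sketch -- not the libmime or bayes code; the helper names stripHtml and tokenize are made up for illustration -- showing what a tokenizer that splits on <, > and whitespace sees for Via<asd>gra</asd> before and after tag stripping:

  // Standalone illustration only; not the actual Mozilla code paths.
  #include <cctype>
  #include <iostream>
  #include <string>
  #include <vector>

  // Split on '<', '>' and whitespace, roughly what the current tokenizer does.
  static std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
      if (c == '<' || c == '>' || std::isspace(static_cast<unsigned char>(c))) {
        if (!current.empty()) {
          tokens.push_back(current);
          current.clear();
        }
      } else {
        current += c;
      }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
  }

  // Drop everything between '<' and '>' (tags and comments alike).
  static std::string stripHtml(const std::string& text) {
    std::string out;
    bool inTag = false;
    for (char c : text) {
      if (c == '<') inTag = true;
      else if (c == '>') inTag = false;
      else if (!inTag) out += c;
    }
    return out;
  }

  int main() {
    const std::string body = "Via<asd>gra</asd> and Via<!-- nothing -->gra";

    std::cout << "raw tokens:";
    for (const auto& t : tokenize(body)) std::cout << ' ' << t;
    std::cout << "\nstripped tokens:";
    for (const auto& t : tokenize(stripHtml(body))) std::cout << ' ' << t;
    std::cout << '\n';  // raw yields Via, asd, gra, ...; stripped yields Viagra
  }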
Blocks: 230093
Status: NEW → ASSIGNED
Attached patch: the fix (Splinter Review)
The only downside of letting libmime do the work of stripping HTML tags is that spambayes claims there is a lot of value in being able to tokenize embedded URLs (such as an img src URL), and they do tokenize those. With this solution, we won't see the embedded image source URLs anymore. But I think that negative is outweighed by the benefit of no longer tokenizing HTML tags.
Please forgive my ignorance on the issue, but wouldn't it be ideal to parse message content (subject, body, etc.) for embedded URLs? I watch a SpamAssassin instance parse spam messages and see various iterations of embedded URLs that signify spam. It would be unfortunate to lose this resolution. Any chance the embedded URL scan could be a preliminary step for spam analysis before the run through libmime? Thx -GA
Scott, I'm pretty sure that the tokenizer is already stripping out html tags. I don't know where they get stripped out, but when I've looked through the debug logs there are no html tags. Also, if you do a 'strings' on the training.dat file you won't see any html tags.
Interesting... when I sat there in the debugger I saw Tokenizer::add get called with values of html, body, div, etc., so we do generate tokens for them. Are you thinking that someone else knows to ignore these tokens? Note: we tokenize off of < and >, so you won't see <html>; you would see html in your training.dat.
When I scan through my training.dat, I do see words like font, div, body, etc. Maybe they got there because they were actual words in a message and not part of HTML tags. But I see a difference in the tokens we generate for HTML elements before and after this patch.
I'm wrong, I just went through and I see a bunch of html stuff. I'm not sure what it was that I was thinking about. Never mind me, continue on.
Miguel, do you agree with me that we should do this? I'd like to put together a single patch with some of these token changes + your algorithm changes and have you run through your test set again so we can see how we compare to:
- the existing app
- app + your algorithm change
- app + algo change + start of tokenizer changes
I've set up the SpamAssassin public corpus for testing on my mail server. I'll test against each of the changes individually and then all of them together. There are some complicated training ideas at http://www.entrian.com/sbwiki/TrainingIdeas, but I'm just going to keep it simple and train on half the emails and rate the other half. I'll try to get this done tonight.
Scott, the HTML->TXT converter definitely can leave the /italics/ markers etc. out, if you give it the right flags. Here's a quote from mimethpl.cpp, the "As Plaintext" libmime class:

  PRUint32 flags = nsIDocumentEncoder::OutputFormatted |
                   nsIDocumentEncoder::OutputWrap |
                   nsIDocumentEncoder::OutputFormatFlowed |
                   nsIDocumentEncoder::OutputLFLineBreak |
                   nsIDocumentEncoder::OutputNoScriptContent |
                   nsIDocumentEncoder::OutputNoFramesContent |
                   nsIDocumentEncoder::OutputBodyOnly;
  HTML2Plaintext(cb, asPlaintext, flags, 80);

HTML2Plaintext() is a helper function in mimemoz2.*. The possible flags should be documented in nsIDocumentEncoder.*. Note that the "As Plaintext" libmime class is quite slow. It does an HTML (input) to TXT conversion, then a TXT->HTML conversion again for display. Unless you use some uncommon source flags for libmime, your output is HTML (different from the input), not plaintext. I think it makes more sense to just call HTML2Plaintext() directly in/for your spam filter.
Thanks for the pointers, Ben. I was looking at the flags for nsIDocumentEncoder and the only one I see that comes close to what we want is:

  // Plaintext output: Convert html to plaintext that looks like the html.
  // Implies wrap (except inside <pre>), since html wraps.
  // HTML output: always do prettyprinting, ignoring existing formatting.
  // (Probably not well tested for HTML output.)
  OutputFormatted = 2,

But that still marks up the plain text, which is what we are hoping to avoid. Or are you saying that even though I'll still get some markup, it will be more efficient to call this directly instead of going through the PlainText converter, because that skips the extra text-to-HTML translation?
If you don't want /italics/ and * list items, you should *not* set |OutputFormatted|.

> Or are you saying that even though I'll still get some markup, it will be more
> efficient to call this directly instead of going through the PlainText converter,
> because that skips the extra text-to-HTML translation?

Yes, that and buffers and other stuff you'll really dislike :). Because the "HTML as Plaintext" class does HTML2Plaintext plus other things, HTML2Plaintext alone is guaranteed to be faster.
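Putting Ben's suggestion together, a rough sketch of what a direct HTML2Plaintext() call for the spam filter could look like, modelled on the mimethpl.cpp snippet quoted above -- the exact signature and the htmlBody/plainBody variables are assumptions here, not checked against mimemoz2.h:

  PRUint32 flags = nsIDocumentEncoder::OutputLFLineBreak |
                   nsIDocumentEncoder::OutputNoScriptContent |
                   nsIDocumentEncoder::OutputNoFramesContent |
                   nsIDocumentEncoder::OutputBodyOnly;
  // OutputFormatted / OutputFormatFlowed / OutputWrap deliberately left out so
  // the converter does not add /italics/, * bullets, or wrapping -- the
  // tokenizer only needs the bare words.
  nsresult rv = HTML2Plaintext(htmlBody, plainBody, flags, 80);
  if (NS_SUCCEEDED(rv)) {
    // hand plainBody to the bayesian tokenizer instead of the raw HTML source
  }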
*** Bug 213614 has been marked as a duplicate of this bug. ***
This was fixed as part of bug 230093, where the junk mail tokenizer runs the MIME text through the serializer.
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Product: MailNews → Core
Product: Core → MailNews Core