Closed
Bug 231873
Opened 21 years ago
Closed 21 years ago
libmime should strip out html tags for bayesian spam filter
Categories
(MailNews Core :: Filters, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mscott, Assigned: mscott)
References
(Blocks 1 open bug)
Attachments
(1 file: patch, 1.23 KB)
According to a lot of the research done by the spambayes folks, we don't want to
be tokenizing html tags in the message body when trying to determine if the
message is junk.
Stripping out HTML tokens will allow us to properly catch things like:
Via<asd>gra</asd>
or
Via<!-- nothing -->gra
which confuse the tokenizer today.
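As a rough illustration of why stripping markup helps, here is a minimal sketch that drops everything between `<` and the next `>` so the split word is rejoined before tokenizing. `StripTags` is a hypothetical helper written for this example; it is not the libmime code path and it does not handle a `>` inside a comment or attribute value.

```cpp
#include <string>

// Hypothetical helper for illustration only (not part of libmime).
// Removes everything from '<' up to the next '>' (tags and comments alike)
// and joins the surrounding text with no separator.
std::string StripTags(const std::string& html) {
  std::string out;
  bool inTag = false;
  for (char c : html) {
    if (c == '<') {
      inTag = true;            // start skipping markup
    } else if (c == '>' && inTag) {
      inTag = false;           // markup ended, resume copying
    } else if (!inTag) {
      out += c;                // ordinary text (a stray '>' is kept)
    }
  }
  return out;
}
```

With this, both obfuscated examples above collapse to the same single word, which is what the tokenizer would then see.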
It looks like we can do this very easily by leveraging BenB's work and forcing
plain text conversion on the message if libmime is processing it for the
bayesian engine. One thing I did notice though is that the plain text mode for
libmime does some basic html substitution with plain text equivalents which
ideally we would not want. i.e.
Hel<i>lo</i>
becomes
Hel/lo/
if there was an easy way to turn that part of even better!
Regardless just this small change should help the tokenizer quite a bit.
Patch coming up
Comment 1•21 years ago (Assignee)
Comment 2•21 years ago (Assignee)
The only downside of letting libmime do the work of stripping html tags is that
spambayes claims there is a lot of value in being able to tokenize embedded
urls (such as an img src url) and they do tokenize those. With this solution, we
won't see the embedded image source urls anymore.
But I think that negative is outweighed by the benefit of us no longer
tokenizing html tags.
Comment 3•21 years ago
Please forgive my ignorance on this issue, but wouldn't it be ideal to parse
message content (subject, body, etc.) for embedded URLs?
I watch a SpamAssassin instance parse spam messages and see various iterations
of embedded URLs that signify spam. It would be unfortunate to lose this
resolution.
Any chance an embedded URL scan could be a "preliminary step for spam analysis"
before the run through libmime? Thx -GA
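One way to picture the "preliminary step" suggested here is to collect URL attribute values from the raw HTML before the markup is stripped, so they can still be handed to the tokenizer. The sketch below is only an assumption about how that could look: `ExtractAttrUrls` is a hypothetical name, and it only recognizes lowercase, double-quoted `src` and `href` values.

```cpp
#include <string>
#include <vector>

// Hypothetical pre-pass for illustration only: gather the values of
// double-quoted src="..." and href="..." attributes from raw HTML.
// All src values are returned first, followed by all href values.
std::vector<std::string> ExtractAttrUrls(const std::string& html) {
  static const char* const kAttrs[] = {"src=\"", "href=\""};
  std::vector<std::string> urls;
  for (const char* attr : kAttrs) {
    const std::string needle(attr);
    std::string::size_type pos = html.find(needle);
    while (pos != std::string::npos) {
      const std::string::size_type begin = pos + needle.size();
      const std::string::size_type end = html.find('"', begin);
      if (end == std::string::npos) break;  // unterminated attribute value
      urls.push_back(html.substr(begin, end - begin));
      pos = html.find(needle, end + 1);     // continue after the closing quote
    }
  }
  return urls;
}
```

The collected strings could then be tokenized alongside the plain text, which would keep the URL signal while still dropping the tag names themselves.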
Comment 4•21 years ago
Scott, I'm pretty sure that the tokenizer is already stripping out html tags. I
don't know where they get stripped out, but when I've looked through the debug
logs there are no html tags. Also, if you do a 'strings' on the training.dat
file you won't see any html tags.
Comment 5•21 years ago (Assignee)
Interesting...when I sat there in the debugger I saw:
Tokenizer::add
get called with values of html, body, div, etc. so we do generate tokens for
them. Are you thinking that someone else knows to ignore these tokens?
Note: we tokenize off of < and >, so you won't see <html>; you would see html in
your training.dat.
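To make the effect concrete, here is a small sketch of a delimiter-based splitter. The delimiter set (whitespace plus `<`, `>` and `/`) is an assumption chosen for this example, and `SplitTokens` is a hypothetical stand-in, not the real Tokenizer code.

```cpp
#include <string>
#include <vector>

// Hypothetical splitter for illustration only. Any character in `delims`
// separates tokens, so the tag name inside <...> comes out as a bare word.
std::vector<std::string> SplitTokens(const std::string& text,
                                     const std::string& delims = " \t\r\n<>/") {
  std::vector<std::string> tokens;
  std::string::size_type start = text.find_first_not_of(delims);
  while (start != std::string::npos) {
    const std::string::size_type end = text.find_first_of(delims, start);
    // substr clamps the count, so end == npos takes the rest of the string.
    tokens.push_back(text.substr(start, end - start));
    start = text.find_first_not_of(delims, end);
  }
  return tokens;
}
```

Feeding it `<html><body>Hello</body></html>` yields the words html, body, Hello, body, html, so the tag names end up as tokens next to the real text.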
Comment 6•21 years ago (Assignee)
When I scan through my training.dat, I do see words like font, div, body, etc.
Maybe they got there because they were actual words in a message and not part of
html tags. But I see a difference in the tokens we generate with html elements
before and after this patch.
Comment 7•21 years ago
I'm wrong, I just went through and I see a bunch of html stuff. I'm not sure
what it was that I was thinking about.
Never mind me, continue on.
Comment 8•21 years ago (Assignee)
Miguel, do you agree with me that we should do this?
I'd like to put together a single patch with some of these token changes + your
algorithm changes and have you run through your test set again so we can see how
we compare to:
existing app
app + your algorithm change
app + algo change + start of tokenizer changes
Comment 9•21 years ago
I've set up the SpamAssassin public corpus for testing on my mail server. I'll
test against each of the changes individually and then all of them together.
There are some complicated training ideas at
http://www.entrian.com/sbwiki/TrainingIdeas, but I'm just going to keep it
simple and train on half the emails and rate the other half. I'll try to get
this done tonight.
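The simple "train on half, rate the other half" plan could be sketched as below. How the corpus was actually divided is not stated in this bug, so the alternating split and the `SplitCorpus` name are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper for illustration only: even-indexed messages go to the
// training half (first), odd-indexed messages to the rating half (second).
std::pair<std::vector<std::string>, std::vector<std::string>>
SplitCorpus(const std::vector<std::string>& messages) {
  std::pair<std::vector<std::string>, std::vector<std::string>> halves;
  for (std::size_t i = 0; i < messages.size(); ++i) {
    if (i % 2 == 0) {
      halves.first.push_back(messages[i]);   // train on this one
    } else {
      halves.second.push_back(messages[i]);  // rate this one
    }
  }
  return halves;
}
```

Alternating by index keeps the two halves close in size and avoids putting all of one mailbox's older mail in a single half.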
Comment 10•21 years ago
Scott, the HTML->TXT converter definitely can leave the /italics/ markers etc.
out, if you give it the right flags.
Here's a quote from mimethpl.cpp, the "As Plaintext" libmime class:
  PRUint32 flags = nsIDocumentEncoder::OutputFormatted
                 | nsIDocumentEncoder::OutputWrap
                 | nsIDocumentEncoder::OutputFormatFlowed
                 | nsIDocumentEncoder::OutputLFLineBreak
                 | nsIDocumentEncoder::OutputNoScriptContent
                 | nsIDocumentEncoder::OutputNoFramesContent
                 | nsIDocumentEncoder::OutputBodyOnly;
  HTML2Plaintext(cb, asPlaintext, flags, 80);
HTML2Plaintext() is a helper function in mimemoz2.*. The possible flags should
be documented in nsIDocumentEncoder.*.
Note that the "As Plaintext" libmime class is quite slow. It does an HTML (input)
to TXT conversion, then a TXT->HTML conversion again for display. Unless you use
some uncommon source flags for libmime, your output is HTML (different from the
input), not plaintext.
I think it makes more sense to just call HTML2Plaintext() directly in/for your
spam filter.
Comment 11•21 years ago (Assignee)
Thanks for the pointers, Ben. I was looking at the flags for nsIDocumentEncoder
and the only one I see that will come close to what we want is:
  // Plaintext output: Convert html to plaintext that looks like the html.
  // Implies wrap (except inside <pre>), since html wraps.
  // HTML output: always do prettyprinting, ignoring existing formatting.
  // (Probably not well tested for HTML output.)
  OutputFormatted = 2,
But that still marks up the plain text which is what we are hoping to avoid.
Or are you saying even though I'll still get some mark up, it will be more
efficient calling this directly instead of going through the PlainText converter
because it makes that extra text back to HTML translation?
Comment 12•21 years ago
If you don't want /italics/ and * list items, you should *not* set
|OutputFormatted|.
> Or are you saying even though I'll still get some mark up, it will be more
> efficient calling this directly instead of going through the PlainText converter
> because it makes that extra text back to HTML translation?
Yes, that and buffers and other stuff you'll really dislike :). Because the
"HTML as Plaintext" class does HTML2Plaintext plus other things, HTML2Plaintext
alone is guaranteed to be faster.
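The advice to leave out OutputFormatted amounts to clearing one bit from the flag mask before the conversion call. The sketch below uses stand-in constants so it compiles on its own: only OutputFormatted = 2 is quoted in this bug, and the other numeric values are assumptions that should be checked against nsIDocumentEncoder. `WithoutFormatting` is a hypothetical helper name.

```cpp
#include <cstdint>

// Stand-in constants for this sketch. Only kOutputFormatted = 2 is taken from
// the flag definition quoted in this bug; the remaining values are assumed
// for illustration and are not copied from nsIDocumentEncoder.
constexpr std::uint32_t kOutputFormatted = 2;
constexpr std::uint32_t kOutputBodyOnly = 8;            // assumed value
constexpr std::uint32_t kOutputLFLineBreak = 1u << 10;  // assumed value

// Hypothetical helper: return `flags` with the formatting bit cleared, so the
// plain text output carries no /italics/ or "* " list markers.
constexpr std::uint32_t WithoutFormatting(std::uint32_t flags) {
  return flags & ~kOutputFormatted;
}
```

In real code the resulting mask would be what gets passed as the flags argument of HTML2Plaintext(), in place of the set quoted from mimethpl.cpp.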
Comment 13•21 years ago
*** Bug 213614 has been marked as a duplicate of this bug. ***
Comment 14•21 years ago (Assignee)
This was fixed as part of Bug #230093, where the junk mail tokenizer runs the
mime text through the serializer.
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Updated•20 years ago
Product: MailNews → Core
Updated•17 years ago
Product: Core → MailNews Core