Bug 213614
Opened 22 years ago
Closed 21 years ago
HTML comments should be stripped before spam filtering
(MailNews Core :: Filters, enhancement)
MailNews Core
(Not tracked)
Duplicate of bug 231873
(Reporter: cedilla, Assigned: sspitzer)
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) Gecko/20030624
Spammers seem to be sending messages with HTML comments breaking up words, to
avoid Bayesian filters. Mozilla should strip out all HTML comments before sending
the message to the spam filter. Looking through my training.dat file, I see lots of
things like "--s9aq3ibk7n--" which I found in a message as:
The spam filtering did not catch the word "premature" as it should have.
Reproducible: Always
Steps to Reproduce:
1. Receive HTML spam
Actual Results:
Mozilla ran the spam filter on the raw HTML, in which comments can break up the
words of the message.
Expected Results:
Stripped the message of HTML comments, then ran the spam filter.
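For illustration only, here is a minimal sketch of the expected behaviour in Python (Mozilla's filter is not written in Python, and this is not its code). The function name strip_html_comments and the sample string are hypothetical; the sample simply combines the token and the word mentioned above. A regular expression is a simplification of what a real HTML parser would do.

```python
import re

# Matches an HTML comment, including one that spans several lines.
# Assumption: a regex is good enough for a sketch; a real parser
# handles more edge cases.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(html: str) -> str:
    """Return html with every <!-- ... --> comment removed."""
    return COMMENT_RE.sub("", html)


# Hypothetical sample: a comment inserted in the middle of a word.
sample = "pre<!--s9aq3ibk7n-->mature"
print(strip_html_comments(sample))
```

With the comment removed, the two halves of the word are adjacent again, so a tokenizer that runs afterwards sees one word instead of two fragments and a random token.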
Reporter | ||
Comment 1•22 years ago
I also just noticed a base64-encoded spam message (presumably for the same
reason of thwarting unintelligent filters). The Mozilla filter did not catch it,
but I don't see the nasty base64 garbage in my training.dat file.
As I said before, spam filtering should be done last in line (or close to last;
I'm not sure exactly what hoops Mozilla puts my mail through).
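To make the ordering concrete for the base64 case: the transfer encoding has to be undone before the text is tokenized. The sketch below uses Python's standard email package, not Mozilla's code; decoded_body and the sample message are hypothetical, and only single-part messages are handled.

```python
import base64
from email import message_from_string

# Hypothetical single-part message whose body is base64-encoded.
raw = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.b64encode(b"Free Cable TV").decode("ascii")
    + "\n"
)


def decoded_body(raw_message: str) -> str:
    """Return the body text with the transfer encoding undone."""
    msg = message_from_string(raw_message)
    # decode=True reverses base64 / quoted-printable and returns bytes.
    # For multipart messages it returns None; this sketch ignores those.
    payload = msg.get_payload(decode=True)
    if payload is None:
        return ""
    charset = msg.get_content_charset() or "utf-8"
    return payload.decode(charset, errors="replace")


print(decoded_body(raw))
```

Only the decoded text would then be handed to the spam filter.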
Actually, it should probably strip all html tags. I have seen numerous spams
using tables to split words:
<tr><td>V</td><td>i</td><td>a</td><td>g</td><td>r</td><td>a</td></tr>. If they
set the borders to none, you won't even notice this visually.
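A rough sketch of stripping every tag, written in Python with the standard html.parser module purely as an illustration (the class and function names are hypothetical, and this is not how Mozilla tokenizes mail):

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data and ignore every tag and comment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(html: str) -> str:
    """Return the text content of html with all markup dropped."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


row = "<tr><td>V</td><td>i</td><td>a</td><td>g</td><td>r</td><td>a</td></tr>"
print(visible_text(row))
```

One caveat of joining the pieces with no separator: text from adjacent block elements, such as two paragraphs, would also be glued together, so a real implementation would need to decide which tags imply whitespace.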
You should probably change OS to All.
And I am not sure about the component. I don't know where junk mail
classification belongs, but it does not meet the description of the FrontEnd component.
What about Filters?
Reporter | ||
Comment 4•21 years ago
I had set the OS to WinXP, simply because that's what I had.
The 'front end' does seem inappropriate, and I do agree that the spam filter is
a filter. I didn't set it as that because Mozilla thinks of 'filters' and 'junk
mail controls' as two separate things. You're probably right, though. It's more
of a 'filter' than a 'front end'.
I've been thinking about this more, and it would probably be easiest to send a
copy of the message through the HTML sanitizer(*), analyze that copy, then
attach the result to the un-sanitized copy. This would base the spam
filtering on what you see, but still allow you to keep HTML formatting if
you want.
Mozilla could at least do the HTML sanitizing step before spam filtering, if you
have HTML sanitizer on.
(*) The HTML sanitizer is the "View > Message Body As" option. It's basically an
HTML->text converter.
Component: Mail Window Front End → Filters
OS: Windows XP → All
Hardware: PC → All
Comment 5•21 years ago
Another variant of this seems to be inserting completely bogus HTML tags, like
this "He<kmnsbclbnulync>re<kwwrrvuawruymd> To
L<kpdnqncysknxvdo>ear<kptqsgfdzow>n M<klkgkffrxdl>or<kfsgvxpdbgjcyl>e" into a
message. I noticed that these tags actually end up in training.dat, so another
reason for implementing this suggestion would be to keep the number of tokens
down that have to be processed.
An argument could be made, though, for keeping the URLs of links and external
images as part of the tokens, because they may be more difficult to disguise and
still keep functional (if you want to promote a website, you end up having to
provide some kind of link there).
Comment 6•21 years ago
*** Bug 212656 has been marked as a duplicate of this bug. ***
Comment 7•21 years ago
Looks to me as if the HTML sanitizer will wipe out not only links and external
images, which may be useful, but also other potentially useful tokens that are
sandwiched in between spurious HTML tags.
For example
The Spam message: Free Cable TV
message source : <p> Fr</buxtehud>ee Ca</broaden>ble TV</p>
Does not appear once it is HTML sanitized.
This is unacceptable for the spam filter since it would allow spammers to
slip entire messages by which appear blank to the filter.
As I see it, we have two options to resolve this bug.
1) Rip out all the tags from message source and feed that to the bayesian filter
2) Rip out the tags with the exception of <a> and <img> tags, which we
replace with their href and src attribute values. This shouldn't be terribly hard,
although it is slightly more complicated than option 1.
I think I could do this, fairly easily. Anyone else have some comments and/or
would like to point me to the tokenizer in the Mozilla source?
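A sketch of what option 2 could look like, in Python with the standard html.parser module rather than in the Mozilla source. The names TextAndLinks and filter_input, and the example.com URLs, are made up for illustration.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Drop all tags, but remember link and image URLs."""

    # Which attribute to keep for which tag (option 2 above).
    KEEP = {"a": "href", "img": "src"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = []
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.KEEP.get(tag)
        if wanted is None:
            return  # any other tag, bogus or not, is simply ignored
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)

    def handle_data(self, data):
        self.text.append(data)


def filter_input(html: str):
    """Return (visible text, list of URLs) to feed to the tokenizer."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    return "".join(parser.text), parser.urls


source = (
    '<p> Fr</buxtehud>ee Ca</broaden>ble TV '
    '<a href="http://example.com/offer">here</a>'
    '<img src="http://example.com/pixel.gif"></p>'
)
text, urls = filter_input(source)
print(text)
print(urls)
```

Option 1 is the same sketch with the handle_starttag method removed.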
Comment 8•21 years ago
I think simply ripping out the tags could make the Bayesian filter *less*
effective. Reordering the text to move but preserve the tags might, possibly,
make sense, since the Bayes filter could use those odd tags (which may well be
repeated) as tokens to identify junk.
See also bug 204322.
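A minimal sketch of the "move but preserve the tags" idea, in Python for illustration only. The function name text_then_tags is hypothetical, and the simple regular expression treats anything between "<" and the next ">" as a tag, which would mishandle comments that contain ">".

```python
import re

# Assumption: anything from "<" to the next ">" counts as a tag.
TAG_RE = re.compile(r"<[^>]*>")


def text_then_tags(html: str):
    """Return (text with tags removed, list of the removed tags)."""
    tags = TAG_RE.findall(html)
    text = TAG_RE.sub("", html)
    return text, tags


text, tags = text_then_tags("He<kmnsbclbnulync>re<kwwrrvuawruymd> To")
print(text)
print(tags)
```

The filter could then tokenize the rejoined text first and the collected tags afterwards, so that neither kind of evidence is thrown away.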
Reporter | ||
Comment 9•21 years ago
OK, so running the message through the sanitizer is probably overkill, as it
would remove potentially incriminating evidence.
I was thinking about removing only the comments and invalid tags (like
"enlar<qptovkr>gement"), but then the spammers could do something like
"enlar<b></b>gement" and get around it (I assume).
I do like Mike's idea of simply moving the tags to the end (or wherever) which
would catch the text and tags. This wouldn't help with the huge amount of
completely random text that is currently in my training.dat file, but would be
quite good for spam-blocking purposes.
I don't, however, see that it would catch spam based on the faked html tags, as
my training.dat file doesn't have a single repeated random comment.
Quoth paul miller:
> For example
> The Spam message: Free Cable TV
> message source : <p> Fr</buxtehud>ee Ca</broaden>ble TV</p>
> Does not appear once it is HTML sanitized.
I'm not quite sure what you're saying. Do you mean that the entire line will be
erased when putting it through the sanitizer, or just the tags? (I'm currently
back home for the holidays, so I can't do any testing to see what the sanitizer
actually does to that message.) If the entire line is deleted, I would say
that's an issue with the sanitizer, as it should be printed "Free Cable TV". If,
however, the tags are all that's deleted, then there's no reason to worry about
spammers slipping messages through the filter in tags, because the user
wouldn't see the tags in the first place.
Or am I completely missing something?
Comment 10•21 years ago
The special MSIE comment code <!--[IF IE5] --> (not sure of the exact syntax)
should not be stripped, though.
Comment 11•21 years ago
I found the following syntax for IE-specific conditional code:
<!--[If IE5]>
IE 5 _only_ will get this code (including any amount of markup)
Anyway, from this I couldn't see a reason why this shouldn't be stripped out as
any other comment for the purpose of spam filtering - this is supposed to look
like a comment for any browser except Internet Explorer 5, so Mozilla would
display nothing in its place. If this were left in the code, it could still be
used to disguise tokens, as in "via<!--[If IE5]><![ENDIF]-->gra", couldn't it?
Or am I missing something here?
Reporter | ||
Comment 12•21 years ago
Although I have never seen those IE-specific tags before, what you describe is
correct. If left alone, they could be used in the same way that HTML comments or
bogus HTML tags are used to split up spammy words.
I agree that they should be treated the same as HTML comments.
Comment 13•21 years ago
Looks like there's a patch in progress; duping.
*** This bug has been marked as a duplicate of 231873 ***
Closed: 21 years ago
Resolution: --- → DUPLICATE
Updated•20 years ago
Product: MailNews → Core
Updated•17 years ago
Product: Core → MailNews Core