Closed Bug 213614 Opened 22 years ago Closed 21 years ago

HTML comments should be stripped before spam filtering

Categories

(MailNews Core :: Filters, enhancement)


Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 231873

People

(Reporter: cedilla, Assigned: sspitzer)

References

Details

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) Gecko/20030624

Spammers seem to be sending messages with HTML comments breaking up words in order to evade Bayesian filters. Mozilla should strip out all HTML comments before handing a message to the spam filter. Looking through my training.dat file, I see lots of tokens like "--s9aq3ibk7n--", which I traced back to a message containing:

pre<!--s9aq3ibk7n-->ma<!--e1xfh61b7k4t53-->ture

The spam filtering did not catch the word "premature" as it should have.

Reproducible: Always

Steps to Reproduce:
1. Receive HTML spam.

Actual Results:
Mozilla ran the spam filter on the raw HTML, so the words of the message could be broken up by comments.

Expected Results:
Mozilla should have stripped the HTML comments from the message, then run the spam filter.
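For illustration, here is a rough sketch (Python, purely hypothetical; not the actual Mozilla tokenizer code) of what stripping comments before tokenizing could look like:

import re

# Non-greedy so each comment is removed separately instead of eating
# everything between the first "<!--" and the last "-->".
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)

def strip_comments(html):
    """Remove HTML comments so that split-up words rejoin before filtering."""
    return COMMENT_RE.sub('', html)

print(strip_comments("pre<!--s9aq3ibk7n-->ma<!--e1xfh61b7k4t53-->ture"))
# -> "premature", which the Bayesian filter can then score normally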
I also just noticed a base64-encoded spam message (presumably for the same reason: thwarting unintelligent filters). The Mozilla filter did not catch it, but I don't see the nasty base64 garbage in my training.dat file. As I said before, spam filtering should be done last in line (or close to last; I'm not sure exactly what hoops Mozilla puts my mail through).
This seems reasonable. I also get many spams broken up with comments.
Actually, it should probably strip all HTML tags. I have seen numerous spams using tables to split words:

<tr><td>V</td><td>i</td><td>a</td><td>g</td><td>r</td><td>a</td></tr>

If they set the borders to none, you won't even notice this visually. You should probably change the OS to All. I am also not sure about the component. I don't know where junk mail classifying belongs, but it does not match the description of the Front End component. What about Filters?
I had set the OS to WinXP simply because that's what I have. 'Front End' does seem inappropriate, and I do agree that the spam filter is a filter. I didn't file it there because Mozilla treats 'filters' and 'junk mail controls' as two separate things, but you're probably right: it's more of a 'filter' than a 'front end'.

I've been thinking about this more, and it would probably be easiest to send a copy of the message through the HTML sanitizer(*), analyze that, and then attach the result to the un-sanitized copy. This would base the spam filtering on what you actually see, while still letting you keep HTML formatting if you want. At the very least, Mozilla could do the HTML sanitizing step before spam filtering if you have the HTML sanitizer turned on.

(*) The HTML sanitizer is the "View > Message Body As" option. It's basically an HTML->text converter.
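Following up on the table-splitting example above, a minimal sketch (Python, hypothetical; the real sanitizer is Mozilla's HTML->text converter, not this regex) of what ripping out every tag and comment would do to that kind of spam:

import re

# Remove comments first (non-greedy), then any remaining tag.
TAG_OR_COMMENT_RE = re.compile(r'<!--.*?-->|<[^>]*>', re.DOTALL)

def to_plain_text(html):
    """Crude HTML-to-text pass: keep only the visible character data."""
    return TAG_OR_COMMENT_RE.sub('', html)

row = "<tr><td>V</td><td>i</td><td>a</td><td>g</td><td>r</td><td>a</td></tr>"
print(to_plain_text(row))  # -> "Viagra", so the split word rejoins for the filter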
Component: Mail Window Front End → Filters
OS: Windows XP → All
Hardware: PC → All
Another variant of this seems to be inserting completely bogus HTML tags, like "He<kmnsbclbnulync>re<kwwrrvuawruymd> To L<kpdnqncysknxvdo>ear<kptqsgfdzow>n M<klkgkffrxdl>or<kfsgvxpdbgjcyl>e", into a message. I noticed that these tags actually end up in training.dat, so another reason for implementing this suggestion would be to keep down the number of tokens that have to be processed. An argument could be made, though, for keeping the URLs of links and external images as part of the tokens, because they may be more difficult to disguise while still keeping them functional (if you want to promote a website, you end up having to provide some kind of link to it).
*** Bug 212656 has been marked as a duplicate of this bug. ***
It looks to me as if the HTML sanitizer will wipe out not only links and external images, which may be useful, but also other potentially useful tokens that are sandwiched in between spurious HTML tags. For example, the spam message "Free Cable TV", with the message source:

<p> Fr</buxtehud>ee Ca</broaden>ble TV</p>

does not appear at all once it is HTML sanitized. That is unacceptable for the spam filter, since it would allow spammers to slip by entire messages that appear blank to the filter.

As I see it, we have two options to resolve this bug:
1) Rip out all the tags from the message source and feed that to the Bayesian filter.
2) Rip out the tags with the exception of <a and <img tags, which we replace with their href and src arguments. This shouldn't be terribly hard, although it's slightly more complicated than option 1.

I think I could do this fairly easily. Does anyone else have comments, and/or would someone like to point me to the tokenizer in the Mozilla source?

Miller
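A minimal sketch of option 2 above (Python, with hypothetical helper names and a placeholder URL; not the actual tokenizer): splice the markup out without inserting whitespace so split words rejoin, and keep href/src URLs as extra tokens, since working links are harder to disguise:

from html.parser import HTMLParser

class TokenSource(HTMLParser):
    """Collect visible text and href/src URLs, discard all other markup."""
    def __init__(self):
        super().__init__()
        self.text = []
        self.urls = []
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ('href', 'src') and value:
                self.urls.append(value)   # keep the URL itself as a token
    def handle_data(self, data):
        self.text.append(data)            # visible text passes through untouched

def extract_filter_input(html):
    parser = TokenSource()
    parser.feed(html)
    # Tags vanish without adding whitespace, so "Fr" + "ee" rejoins;
    # the collected URLs are appended for the Bayesian filter to chew on.
    return ''.join(parser.text).strip() + ' ' + ' '.join(parser.urls)

spam = '<p> Fr</buxtehud>ee Ca</broaden>ble TV <a href="http://example.invalid/cable">here</a></p>'
print(extract_filter_input(spam))
# -> "Free Cable TV here http://example.invalid/cable"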
I think simply ripping out the tags could make the Bayesian filter *less* effective. Reordering the text so that the tags are moved aside but preserved might, possibly, make sense, since the Bayes filter could use those odd tags (which may well be repeated) as tokens to identify junk. See also bug 204322.
OK, so running the message through the sanitizer is probably overkill, as it would remove potentially incriminating evidence. I was thinking about removing only the comments and invalid tags (as in "enlar<qptovkr>gement"), but then the spammers could do something like "enlar<b></b>gement" and get around it (I assume). I do like Mike's idea of simply moving the tags to the end (or wherever), which would catch both the text and the tags. This wouldn't help with the huge amount of completely random text that is currently in my training.dat file, but it would be quite good for spam-blocking purposes. I don't, however, see that it would catch spam based on the faked HTML tags, as my training.dat file doesn't have a single repeated random comment.

Quoth paul miller:
> For example, the spam message "Free Cable TV", with the message source:
> <p> Fr</buxtehud>ee Ca</broaden>ble TV</p>
> does not appear at all once it is HTML sanitized.

I'm not quite sure what you're saying. Do you mean that the entire line will be erased when putting it through the sanitizer, or just the tags? (I'm currently back home for the holidays, so I can't do any testing to see what the sanitizer actually does to that message.) If the entire line is deleted, I would say that's an issue with the sanitizer, as it should come out as "Free Cable TV". If, however, the tags are all that's deleted, then there's no reason to worry about spammers slipping messages through the filter in tags, because the user wouldn't see the tags in the first place. ... Or am I completely missing something?
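To make the "move the tags to the end" idea concrete, here is a regex-based sketch (Python, hypothetical; not a proposed implementation, just the general shape):

import re

# Pull comments and tags out of the text stream so split words rejoin,
# but append them afterwards so the Bayesian filter can still learn
# from suspicious markup (which spammers may well reuse).
MARKUP_RE = re.compile(r'<!--.*?-->|<[^>]*>', re.DOTALL)

def reorder_markup(html):
    tags = MARKUP_RE.findall(html)
    text = MARKUP_RE.sub('', html)
    return text + ' ' + ' '.join(tags)

print(reorder_markup("enlar<qptovkr>gement"))
# -> "enlargement <qptovkr>": both the rejoined word and the bogus tag
#    become tokens the filter can score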
The special MSIE comment code <!--[IF IE5] --> (not sure of the exact syntax) should not be stripped, though.
I found the following syntax for IE-specific conditional code:

<!--[If IE5]> IE 5 _only_ will get this code (including any amount of markup) <![ENDIF]-->

From this I can't see a reason why it shouldn't be stripped out like any other comment for the purpose of spam filtering: it is supposed to look like a comment to any browser except Internet Explorer 5, so Mozilla would display nothing in its place. If it were left in, it could still be used to disguise tokens, as in "via<!--[If IE5]><![ENDIF]-->gra", couldn't it? Or am I missing something here?
@Marcus: Although I have never seen those IE-specific tags before, what you describe is correct. If left alone, they could be used in the same way that HTML comments or bogus HTML tags are used to split up spammy words. I agree that they should be treated the same as HTML comments.
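For what it's worth, the non-greedy comment-stripping sketch earlier in this bug would already handle that case, since the whole conditional block starts with <!-- and ends with -->:

import re

COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)
print(COMMENT_RE.sub('', "via<!--[If IE5]><![ENDIF]-->gra"))  # -> "viagra"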
Looks like there's a patch in progress; duping. *** This bug has been marked as a duplicate of 231873 ***
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → DUPLICATE
Product: MailNews → Core
Product: Core → MailNews Core