Open
Bug 280716
Opened 20 years ago
Updated 2 years ago
Determine a way to effectively handle image spam in the Baysian junk filter
Categories
(MailNews Core :: Filters, enhancement)
Tracking
(Not tracked)
NEW
People
(Reporter: htworze, Unassigned)
References
(Blocks 1 open bug, )
Details
(Whiteboard: [sg:nse])
Attachments
(1 file: 22.32 KB, message/rfc822)
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Here's a spam email that the junk control filter has trouble with. I keep getting messages like the one copied below (same content, same HTML structure) and mark each one manually as junk, at least a dozen times in the last week, but the junk filter doesn't learn to recognize it. It could be that spammers have found some weakness in the adaptive filter.

Subject: Re:
MIME-Version: 1.0
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Mailer: Ximian Evolution 1.4.3 (1.4.3-2)
X-pstn-version: pmps:sps_solaris_1_1_0c0 pase:2.8
X-pstn-levels: (C:78.1961 P:95.9108 R:95.9108 S: 3.2295 )
X-pstn-settings: 4 (0.2500:0.7500) p:14 m:14 C:14 r:14
X-pstn-addresses: from <pjym@o2.pl>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <HTML><HEAD> <META http-equiv=Content-Type content="text/html; charset=iso-8859-1"> <META content="MSHTML 6.00.2800.1479" name=GENERATOR> <STYLE></STYLE> </HEAD> <BODY bgColor=#ffffff> <DIV><FONT face=Verdana size=2>Hello,</FONT></DIV> <DIV><FONT face=Verdana></FONT> </DIV> <DIV><FONT face=Verdana size=2>I am contacting you about the only true way you will be able to really start saving.</FONT></DIV> <DIV><FONT face=Verdana></FONT> </DIV> <DIV><FONT face=Verdana size=2>Please review the information below. </FONT></DIV> <DIV><FONT face=Verdana></FONT> </DIV> <DIV><FONT face=Verdana size=2>Upon completion of the following process by this time next month you will have at least $300 to $400 Extra Cash in your pocket per month.</FONT></DIV> <DIV><FONT face=Verdana></FONT> </DIV> <DIV><FONT face=Verdana size=2>Your application is waiting to be received,</FONT></DIV> <DIV><FONT face=Verdana></FONT> </DIV> <DIV><FONT face=Verdana size=2>We are ready to give you a loan. 
</FONT></DIV> <DIV><FONT face=Verdana></FONT> </DIV> <DIV><FONT face=Verdana size=2>Approval process will take 1 minute.</FONT></DIV> <DIV><FONT face=Verdana></FONT> </DIV> <DIV><FONT face=Verdana size=2>CIick <A href="http://www.lowest-mort.com/?d37">here</A> and fill the quick form.</FONT></DIV> <DIV><FONT face=Verdana size=2></FONT> </DIV> <DIV><FONT face=Verdana size=2>If you would prefer not to go <A href="http://www.lowest-mort.com/rem.php">here.</A></FONT></DIV></BODY></HTML> Reproducible: Always
Comment 1•20 years ago
Clearing security flag: spam is not a security hole and most bugs have a better chance of getting fixed with more eyes on the problem.
Group: security
Whiteboard: [sg:nse]
Comment 2•20 years ago
(In reply to comment #0)

As you can see, the spam mail does not contain many "dangerous" words, that is, words often used by spam. I think "$", "cash", and "loan" are candidates, but these appear in normal mail too, and it is hard for an adaptive filter to say "this is spam" from the presence of these three words alone, even after being trained several times. "From:" usually cannot be used as an indicator of spam either, because spammers forge it.

My approach for such spam is:
- Create filters which do "Mark as Junk" & "Move to Junk"

Although generic spam detection is impossible, spam often shares common characteristics when it is sent by the same spammer, or by spammers who use the same spamming software or data, during the same period. In your case it is probably the URL "http://www.lowest-mort.com", because the spammer's objective is to get the user to click the link. So if a condition of "If body contains www.lowest-mort.com" is set, the spam mail can be moved to the Junk folder. I think this approach can be used for phishing mail too, because phishing usually cannot succeed without a clicked link.

"X-Mailer: Ximian Evolution" can also be used, since many spammers use the same spamming software. A condition of "X-Mailer: header contains 'Ximian Evolution'" will be effective if you, your family, your friends, and your company don't use Ximian Evolution.

Blacklisting of URLs by the adaptive filter is probably very hard, but may be implemented in the near future by phishing enhancements such as:
> Bug 254913 : add "phishing" detection to mail client
> Bug 278490 : Make link href clear in mail window to thwart phishing attacks
> Bug 279191 : Add Phishing Detection Support to Thunderbird

I think this bug can be closed as WONTFIX (impossible), or as a DUP of Bug 254913 / Bug 279191.
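The two conditions suggested in comment 2 ("body contains" and "X-Mailer: header contains") can be sketched outside the mail client as follows. This is only an illustration using Python's standard `email` module; the function names `header_contains` and `body_contains` are hypothetical and are not part of any Thunderbird API.

```python
from email import message_from_string


def header_contains(raw_message, header, needle):
    """True if the named header exists and contains needle (case-insensitive)."""
    msg = message_from_string(raw_message)
    value = msg.get(header, "")
    return needle.lower() in value.lower()


def body_contains(raw_message, needle):
    """True if any text/* part of the message contains needle (case-insensitive)."""
    msg = message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers, images, etc.
        payload = part.get_payload(decode=True) or b""
        # The needle is plain ASCII here, so a lossy ASCII decode is enough.
        text = payload.decode("ascii", errors="replace")
        if needle.lower() in text.lower():
            return True
    return False
```

A message would then be moved to Junk when, for example, `body_contains(raw, "www.lowest-mort.com")` or `header_contains(raw, "X-Mailer", "Ximian Evolution")` is true.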
Comment 3•19 years ago
(In reply to comment #2)
> (In reply to comment #0)
> As you see, there are not so many "dangerous words"/"often used words by spam"
> in the spam mail.

Hello,

I'd like to add another kind of spam mail I keep receiving in Thunderbird (currently the latest 1.0.6, and observed with the previous version 1.0.2 as well), even though I have clicked at least 5 times on different mails of this kind to mark them as spam. I suspect the 'Subject' is the troublemaker (private data and non-relevant data edited out with <...>):

-- Begin of spam email --
Return-Path: <info@djdeep.ch>
Received: from smtp.messaging.ch (exsmtp02.agrinet.ch [81.221.252.201]) by <...> (Postfix) with ESMTP id 845011CC012 for <...>; Mon, 8 Aug 2005 19:32:48 +0200 (MEST)
Received: from neo-kdbug5b04ws ([80.219.207.194]) by smtp.messaging.ch with Microsoft SMTPSVC(6.0.3790.211); Mon, 8 Aug 2005 19:33:23 +0200
Message-ID: <4145-22005818173519515@neo-kdbug5b04ws>
Errors-To: unscribe@djdeep.ch
From: <info@djdeep.ch>
To: <...>
Subject: [SPAM?] PLAY IT LOUD @ LE BAL
Date: Mon, 8 Aug 2005 19:35:19 +0200
MIME-Version: 1.0
Content-Type: multipart/related; boundary="----=_NextPart_94915C5ABAF209EF376268C8"
X-OriginalArrivalTime: 08 Aug 2005 17:33:23.0682 (UTC) FILETIME=[46EE4420:01C59C3F]

This is a multi-part message in MIME format. 
------=_NextPart_94915C5ABAF209EF376268C8 Content-Type: multipart/alternative; boundary="----=_NextPart_84815C5ABAF209EF376268C8" ------=_NextPart_84815C5ABAF209EF376268C8 Content-type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable PLAY IT LOUD @ Le Bal=20 Mittwoch, 10=2E August '05 23=2E00 - 04=2E00 Uhr Play it Loud Part=2E 2=2E Nach der ersten coolen Party im Le Bal setzten w= ir unsere Serie fort=2E Diesmal haben wir f=FCr Euch ein spezielles Line-U= p zusammengestellt=2E DJ Tremendo (Subliminal, NYC, Banditz) und DJ Deep (= Banditz) droenen Euch die Ohren mit den Schaerfsten Elektrobeats zu=2E Int= roset by DJ Rena Moreno, der Euch so richtig auf Stimmung bringen wird=2E = Also nicht verpassen!!!=20 Der Eintritt ist Gratis!! =20 NEWSLETTER ABMELDEN =20 ------=_NextPart_84815C5ABAF209EF376268C8 Content-Type: text/html; charset=US-ASCII Content-Transfer-Encoding: quoted-printable <html> <head> <meta http-equiv=3D"Content-Language" content=3D"de-ch"> <meta name=3D"GENERATOR" content=3D"Microsoft FrontPage 5=2E0"> <meta name=3D"ProgId" content=3D"FrontPage=2EEditor=2EDocument"> <meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dwindows-= 1252"> <title>FRESH</title> <style> <!-- <...> --> </style> </head> <...> </html> ------=_NextPart_84815C5ABAF209EF376268C8-- ------=_NextPart_94915C5ABAF209EF376268C8 Content-Type: image/jpeg; name="playitloud.jpg" Content-Transfer-Encoding: base64 Content-Description: playitloud.jpg Content-Id: <314640-2200581817335282897@neo-kdbug5b04ws> <... some image data ...> -- End of spam email -- I have edited out the HTML content since it contains the same content more or less as the ASCII part, and the content changes from email to email anyway (some party announcements etc.). I _think_ the bugger could be the subject line, since this is the one which ALWAYS contains [SPAM?] in it: Subject: [SPAM?] 
PLAY IT LOUD @ LE BAL

That's the only thing that remains constant (the actual subject changes of course, e.g. "[SPAM?] SOME OTHER PARTY"), apart from the sender's address (From: <info@djdeep.ch>), which is also always the same. I would expect it to be fairly easy for the spam filter to filter on the same sender's address, but obviously it gets tricked somehow over and over again (by the subject line?)! Observed on both Windows and Linux Thunderbird, if that matters...

Thanks, Oliver
Cross-ref: Bug 339617 which appears to describe the same issue reported here.
Comment 5•19 years ago
*** Bug 339617 has been marked as a duplicate of this bug. ***
Updated•19 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 6•18 years ago
I've been getting tonnes and tonnes of these coming through over the past two weeks. I have gone from one or two spams per day getting through to around thirty (all base64 spam).
Re: comment 6. I've been getting a huge number of those lately as well. It seems to me that the problem isn't necessarily anything to do with the encoding, or even anything really wrong with the pseudo-Bayesian filter, but that it's a new and dangerous escalation of spam tactics. Basically what they've done is slap the spam into an image at the top of an HTML mail, and then filled in the rest of the HTML (and an alternative text block at the top, I think) with semi-random plausible sentences that *don't* look like spam for most people. Even if we could manage to train the filter on that text, it wouldn't matter, because it's random and from what little I can tell it appears to be chosen from a corpus of actual non-spam emails. The only general thing I can think of that might solve this would be converting over to a true Bayesian filter that tracks multi-word conditional probabilities, because most of the crap in the random text block is incoherent. But that would exponentiate the size of the training set and make it really slow. Also, it's not clear that it would solve the underlying problem, which is that none of the words in the messages have anything to do with the spam message... choosing to attach actual non-spam emails in their entirety rather than just snippets would fool even a true Bayesian filter. Perhaps someone can think of some hack to deal with this tactic, though... Hard to imagine what except to treat initial large images with suspicion... I guess theoretically such images could be OCR'd and included in the spam filtering... ugh.
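To make the size concern in the comment above concrete, here is a minimal sketch of single-word versus two-word (bigram) tokenization. It is not the Thunderbird tokenizer; the helper names are hypothetical. With a vocabulary of V distinct words, the number of possible n-word tokens is bounded by V to the power n, which is why tracking multi-word tokens inflates the training data so quickly.

```python
def tokens(text):
    """Split text into lowercase single-word tokens."""
    return text.lower().split()


def bigrams(text):
    """Return adjacent word pairs, the simplest multi-word token."""
    words = tokens(text)
    return list(zip(words, words[1:]))


def max_token_space(vocab_size, n):
    """Upper bound on distinct n-word tokens for a given vocabulary size."""
    return vocab_size ** n
```

For a 1,000-word vocabulary, `max_token_space(1000, 2)` is one million possible pairs, against 1,000 single-word tokens.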
Comment 8•18 years ago
I've been seeing a lot of spam just like that described in comment 7 and there's nothing there for our filter to pick up on because the actual text words are random. We're actively looking for ideas on how to catch this if folks have them. Maybe we could try to add a token for messages with an image at the very top of a message...but that could of course get fooled too.
You know, I *was* kidding originally, but it occurs to me that computers are fast enough these days that we actually *could* OCR the images... perhaps at low priority or something... I imagine there probably are good enough open-source OCR libraries...
Comment 10•18 years ago
Another question is why image-based spam is displayed when I have HTML display turned off and remote images blocked (I know it's not remote)... :/
Comment 11•18 years ago
I'm going to hijack this bug to focus on the new kind of image spam talked about in comment 7 and comment 8. Neil Turner just posted a tip he and some others have been using: http://www.neilturner.me.uk/2006/Aug/06/stopping_image_spam_in_th.html based on: http://www.tuaw.com/2006/08/04/a-mail-app-rule-for-catching-image-spam/1#comments I'd recommend modifying the suggested filter to skip e-mail from addresses in your address book, though. In one of the comments in the tuaw link, someone suggested filtering messages from people you don't know which have an attachment name that ends in .gif. I don't think our current filters can do that, but maybe we can try to expand them to include a rule like 'contains image'. Not sure how easily that would work for IMAP, though, since we don't fetch the body.
Summary: spam that evades junk control filter → image spam evades bayesian junk control filter
Target Milestone: --- → Thunderbird2.0
Comment 12•18 years ago
cc'ing Neil since I referenced his blog post.
Updated•18 years ago
OS: Windows 2000 → All
Hardware: PC → All
Comment 13•18 years ago
(In reply to comment #11) That tip is nice but is nothing more than a workaround. It's also annoying to apply if you have several email accounts - a rule must be created (and "maintained"?) for each account. Are you planning to add actual OCR support to the junk detection algorithms? By the way, if you haven't noticed, the spammers are already using noise and different colors in their spam images so that OCR can't be used easily. IIRC, there's a commercial implementation of OCR-capable spam filter out there - they're probably targeting that.
Comment 14•18 years ago
This is in response to the recent animated GIF mails I've been receiving. It's pretty easy to read the metadata a GIF is built on; by metadata I mean the number of frames, their display times, etc. Normal animated GIF images have the same display time for each and every frame, usually because they are continuous animations (and keep looping). The spam messages I've been getting have a bunch of short empty/lined frames to avoid OCR filters, and then display the actual text for a far longer period. Now I think it should be possible to store these characteristics in the Bayesian filter without marking legit animations as spam. It might also help to store the color palette (per frame?) in the filter.
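The frame-timing idea above can be sketched with nothing but the standard library: walk the GIF block structure, read the delay from each Graphic Control Extension, and flag animations where one frame is held far longer than the rest. The parser below only skips over image data and does not decode it; the function names and the `ratio=10` threshold are assumptions made for illustration, not measured values.

```python
import struct


def _skip_sub_blocks(data, pos):
    """Advance past a chain of length-prefixed sub-blocks ending in a 0 byte."""
    while data[pos] != 0:
        pos += data[pos] + 1
    return pos + 1


def gif_frame_delays(data):
    """Return the display delay of each frame, in hundredths of a second."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF")
    packed = data[10]
    pos = 13  # 6-byte signature + 7-byte logical screen descriptor
    if packed & 0x80:  # global color table present
        pos += 3 * (2 << (packed & 0x07))
    delays, pending = [], 0
    while pos < len(data):
        block = data[pos]
        pos += 1
        if block == 0x3B:  # trailer
            break
        if block == 0x21:  # extension block
            label = data[pos]
            pos += 1
            if label == 0xF9:  # graphic control extension: size, flags, delay
                pending = struct.unpack_from("<H", data, pos + 2)[0]
            pos = _skip_sub_blocks(data, pos)
        elif block == 0x2C:  # image descriptor starts a frame
            local_packed = data[pos + 8]
            pos += 9
            if local_packed & 0x80:  # local color table present
                pos += 3 * (2 << (local_packed & 0x07))
            pos += 1  # LZW minimum code size
            pos = _skip_sub_blocks(data, pos)
            delays.append(pending)
            pending = 0
        else:
            raise ValueError("unexpected block type")
    return delays


def has_long_hold_frame(delays, ratio=10):
    """True if the longest frame lasts at least `ratio` times a typical one."""
    if len(delays) < 2:
        return False
    rest = sorted(delays)[:-1]
    typical = max(rest[len(rest) // 2], 1)
    return max(delays) >= ratio * typical
```

The boolean result (or the raw delay list) could be fed to the classifier as an extra token alongside the text tokens, which is one way to read "store these characteristics in the Bayesian filter".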
Comment 15•18 years ago
I'm going to have to sit down and grok these postings and the two suggestions linked to, but I just wanted to say that I got here because I was looking to see if a request to detect image spam had already been made. I am using my own tbird mail filters, because Cox was throwing away real email if I let it handle spam. It still throws away some stuff, it's clear, even though I selected mark it as spam and send it to me. I even caught it thinking I was a spammer - I sent a print screen image to myself as a reminder, and it never arrived. So I retried with a little text and that got thru, but after I had experimented awhile I could never receive any email containing that image nor could anyone else, regardless of the text in the email. Of course, Cox being Cox, they never told me they had not sent the email on, so now I think some unknown amount of my email with an image in it just goes into the bit bucket. I mention this as a caution for anyone designing a trap for image spam. I have the filters set up to toss anything they aren't sure of into a "look at this maybe its spam maybe not" filter. Image spam winds up there, as does email from new real senders. I look at the messages in there to see if I can improve my filters. It's bad enough seeing 10 emails touting the same stupid stock, but 10 emails with a drawing of a penis is just incredibly aggravating. Just venting, and hoping someone has a solution to this. Maybe we should chip in and hire the Russian mafia to blow a few spammers away.
Updated•18 years ago
QA Contact: general
Comment 16•18 years ago
I've (anecdotally) found that by just deleting the jpeg/picture spam and not marking it as junk, tbird's marking ratio of all my other spam seems to have gone up. I guess the classifier is not getting as 'gunked up' (technical term) with the non-spam-like content of these jpeg-spam emails. Of course, this doesn't really help solve the problem.
Comment 17•18 years ago
Some FYI: http://csoonline.com/read/040107/fea_spam_by_the_numbers.html from a Bruce Schneier post: http://www.schneier.com/blog/archives/2007/05/image_spam.html Idea from a comment in the post: how about a filter that marks as spam any msg that has an image and the From: is not in the personal address book/whitelist? Maybe we could at least enable this functionality through the filters, and ship a standard one?
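The rule proposed above (message has an image and the From: address is not in the address book) is simple enough to sketch. This is an illustration with Python's standard `email` module, not existing filter code; `is_suspect` and the plain list used as the address book are hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr


def has_image_part(msg):
    """True if any MIME part of the message has an image/* content type."""
    return any(part.get_content_maintype() == "image" for part in msg.walk())


def is_suspect(raw_message, address_book):
    """Flag messages that carry an image and come from an unknown sender."""
    msg = message_from_string(raw_message)
    _, sender = parseaddr(msg.get("From", ""))
    known = {address.lower() for address in address_book}
    return has_image_part(msg) and sender.lower() not in known
```

Because From: is easily forged, a rule like this only reduces noise from strangers; it does not protect against spam that spoofs a known contact.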
Comment 18•17 years ago
The latest wonderful spammer variation is a PDF with the text or image inside it (rather than an outright attached graphic image).
Comment 19•16 years ago
Bug 333501 wants the same thing (image spam detection) but asks for it as a filter-criteria RFE.
Updated•16 years ago
Assignee: mscott → nobody
Updated•15 years ago
Target Milestone: Thunderbird2.0 → ---
Comment 22•15 years ago
I am getting a lot of these lately, and clicking on the "Junk" button every time but not having Thunderbird remember that it is junk is really frustrating. I get the exact same image every time. Why can't the junk filter just remember the actual image and throw out any message that attaches that image? I know, eventually the spammers will just start changing the image slightly with each send, but at least it will make their job a bit more difficult, and when they do that, we'll figure out a new way to see through the deception. Thanks.
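A minimal sketch of the "remember the actual image" request: hash the decoded bytes of every image part when a message is marked as junk, and treat any later message carrying an identical image as junk. The class name `ImageBlocklist` is hypothetical. As the comment already anticipates, an exact hash stops matching as soon as a single byte of the image changes.

```python
import hashlib
from email import message_from_string


def image_digests(raw_message):
    """Return the SHA-256 digests of all image/* parts in a message."""
    msg = message_from_string(raw_message)
    digests = set()
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            payload = part.get_payload(decode=True) or b""
            digests.add(hashlib.sha256(payload).hexdigest())
    return digests


class ImageBlocklist:
    """Remembers images seen in messages the user marked as junk."""

    def __init__(self):
        self._known = set()

    def mark_junk(self, raw_message):
        self._known |= image_digests(raw_message)

    def is_junk(self, raw_message):
        return bool(self._known & image_digests(raw_message))
```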
Comment 23•15 years ago
(In reply to comment #22) It's not necessary to just remember the image as-is. There are ways of comparing 2 images to see how similar they are. I remember using some image dupe finder programs even several years back. First, both versions are resized to a small standard size and colors are reduced to a small set (e.g. 16 colors). Then the algorithm checks how close they are. If they're similar, there'll be a high percentage score of how closely they match.
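The comparison described above (shrink to a small standard size, reduce to about 16 levels, then score how closely the results match) might look like the sketch below. To stay self-contained it works on images that are already decoded into a 2D list of 0-255 grayscale values; decoding JPEG or GIF data is out of scope here. The function names, the 8x8 grid, and the one-level tolerance are assumptions.

```python
def downscale(pixels, size=8):
    """Average a 2D grid of grayscale values down to size x size cells."""
    height, width = len(pixels), len(pixels[0])
    out = []
    for by in range(size):
        y0 = by * height // size
        y1 = max((by + 1) * height // size, y0 + 1)
        row = []
        for bx in range(size):
            x0 = bx * width // size
            x1 = max((bx + 1) * width // size, x0 + 1)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) // len(block))
        out.append(row)
    return out


def signature(pixels, size=8, levels=16):
    """Downscale, then quantize each cell to one of `levels` gray levels."""
    return [[v * levels // 256 for v in row] for row in downscale(pixels, size)]


def similarity(a, b, tolerance=1):
    """Fraction of cells whose quantized levels differ by at most `tolerance`."""
    sig_a, sig_b = signature(a), signature(b)
    cells = [(x, y) for row_a, row_b in zip(sig_a, sig_b) for x, y in zip(row_a, row_b)]
    close = sum(1 for x, y in cells if abs(x - y) <= tolerance)
    return close / len(cells)
```

A score near 1.0 means the two images are nearly the same after shrinking; the cut-off at which a message is treated as a repeat would have to be tuned on real mail.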
Updated•15 years ago
Blocks: junktracker
Comment 26•14 years ago
Hmm, five and a half years and no action on this. This one will be old enough to wear long trousers soon. Spammers have been shifting to avoid filters but this basic problem remains. Far from trying to keep ahead, Tbird is still on the starting blocks on this one. From the above dupe: Using customised headers is a good option and allows matching of X-Mozilla-Status2 to detect an attachment and Content-Type: multipart/mixed. But this is still too general. Whatever solution is found here will need to be flexible since the target will continue to move. I think the simple lexical "contains/begins with/ends with" filters are too limited. For example in filtering X-Mozilla-Status2 , arithmetic or bitwise boolean would be useful rather than lexical tests. Probably a mechanism to call some outside script or executable filter would offer a means to add OCR or image comparison and to adapt local filtering as spam evolves.
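To illustrate why a bitwise test on X-Mozilla-Status2 is more robust than a lexical one, here is a small sketch. It assumes the header value is a hexadecimal string and that the attachment bit is 0x10000000; that constant is an assumption for this example and should be checked against the Thunderbird message-flag definitions before use.

```python
# Assumed value of the "has attachment" bit; verify against Thunderbird sources.
ATTACHMENT_FLAG = 0x10000000


def has_flag(status2_value, flag):
    """Parse a hex X-Mozilla-Status2 value and test a single flag bit."""
    return (int(status2_value.strip(), 16) & flag) != 0
```

A lexical rule such as "begins with 1" stops matching once another high bit is set as well (for example a value of `30000000`), while the bitwise test still matches.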
Comment 27•13 years ago
Elaborating on my previous comment: combining several filter conditions would be a lot more powerful and probably not hard to code (just a bit more parsing), e.g. (subject [contains] HP) OR (subject [contains] xerox) AND (subject [contains] scanner). Currently each individual filter line gets combined with OR to make the total filter. To implement this, an extra button could be added at the start of each line to select AND / OR / XOR, along with some means of bracketing (grouping) these logical operations.
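The grouped conditions asked for above amount to a small expression tree. A sketch, assuming a message is represented as a plain dict of header fields and reading the example as ((HP OR xerox) AND scanner); the helper names `contains`, `any_of`, and `all_of` are hypothetical:

```python
def contains(field, needle):
    """Leaf condition: the field contains needle, ignoring case."""
    return lambda msg: needle.lower() in msg.get(field, "").lower()


def any_of(*conditions):
    """OR group: true if at least one sub-condition matches."""
    return lambda msg: any(cond(msg) for cond in conditions)


def all_of(*conditions):
    """AND group: true only if every sub-condition matches."""
    return lambda msg: all(cond(msg) for cond in conditions)


# (subject contains HP OR subject contains xerox) AND subject contains scanner
rule = all_of(
    any_of(contains("subject", "HP"), contains("subject", "xerox")),
    contains("subject", "scanner"),
)
```

Nesting `any_of` and `all_of` plays the role of the brackets: the grouping is explicit in the structure, so no operator-precedence rule is needed.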
Updated•12 years ago
Component: General → Filters
Product: Thunderbird → MailNews Core
Comment 28•10 years ago
Ideas on how to avoid image spam: http://kb.mozillazine.org/Junk_Mail_Controls#Image_spam http://computing.physics.harvard.edu/imagespam Thunderbird, and Bayes in general, were historically designed to act on text only, and several references, including http://en.wikipedia.org/wiki/Email_spam#Image_spam, indicate that determining spam based on image contents can be unreliable.
Keywords: sec-other
Type: defect → enhancement
Summary: image spam evades bayesian junk control filter → Determine a way to effectively handle image spam in the Baysian junk filter
Updated•2 years ago
Severity: normal → S3