Open Bug 280716 Opened 20 years ago Updated 2 years ago

Determine a way to effectively handle image spam in the Baysian junk filter

Categories

(MailNews Core :: Filters, enhancement)

Tracking

(Not tracked)

People

(Reporter: htworze, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [sg:nse])

Attachments

(1 file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Here's a spam email that the junk control filter has trouble with. I keep getting
messages like the one copied below (same content, same HTML structure) and mark each
one manually as junk, at least a dozen times in the last week, but the junk filter
doesn't learn to recognize it. It could be that spammers have found some weakness in
the adaptive filter.

Subject: Re: 
MIME-Version: 1.0
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Mailer: Ximian Evolution 1.4.3 (1.4.3-2)
X-pstn-version: pmps:sps_solaris_1_1_0c0 pase:2.8
X-pstn-levels:     (C:78.1961 P:95.9108 R:95.9108 S: 3.2295 )
X-pstn-settings: 4 (0.2500:0.7500) p:14 m:14 C:14 r:14
X-pstn-addresses: from <pjym@o2.pl> 

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.2800.1479" name=GENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=#ffffff>
<DIV><FONT face=Verdana size=2>Hello,</FONT></DIV>

<DIV><FONT face=Verdana></FONT>&nbsp;</DIV>
<DIV><FONT face=Verdana size=2>I am contacting you about the only true way you 
will be able to really start saving.</FONT></DIV>
<DIV><FONT face=Verdana></FONT>&nbsp;</DIV>
<DIV><FONT face=Verdana size=2>Please review the information below. 
</FONT></DIV>

<DIV><FONT face=Verdana></FONT>&nbsp;</DIV>
<DIV><FONT face=Verdana size=2>Upon completion of the following process by this 
time next month you will have at least $300 to $400 Extra Cash in your pocket 
per month.</FONT></DIV>
<DIV><FONT face=Verdana></FONT>&nbsp;</DIV>
<DIV><FONT face=Verdana size=2>Your application is waiting to be 
received,</FONT></DIV>

<DIV><FONT face=Verdana></FONT>&nbsp;</DIV>
<DIV><FONT face=Verdana size=2>We are ready to give you a loan. </FONT></DIV>
<DIV><FONT face=Verdana></FONT>&nbsp;</DIV>
<DIV><FONT face=Verdana size=2>Approval process will take 1 minute.</FONT></DIV>

<DIV><FONT face=Verdana></FONT>&nbsp;</DIV>
<DIV><FONT face=Verdana size=2>CIick <A
href="http://www.lowest-mort.com/?d37">here</A> and fill the quick 
form.</FONT></DIV>
<DIV><FONT face=Verdana size=2></FONT>&nbsp;</DIV>
<DIV><FONT face=Verdana size=2>If you would prefer not to go <A 
href="http://www.lowest-mort.com/rem.php">here.</A></FONT></DIV></BODY></HTML>



Reproducible: Always
Clearing security flag: spam is not a security hole and most bugs have a better
chance of getting fixed with more eyes on the problem.
Group: security
Whiteboard: [sg:nse]
(In reply to comment #0)
As you see, there are not so many "dangerous words"/"often used words by spam"
in the spam mail.
I think "$", "cash" and "loan" are candidates for such words, but these are used
in normal mail too, and it's hard for an adaptive filter to say "this is spam mail"
based only on the presence of these three words, even after it has been trained
several times.
And "From:" cannot be used as an indicator of spam in many cases, because spammers
lie in "From:".

My approach for such spam is:
 - Create filters which do "Mark as Junk" & "Move to Junk"
Although generic spam detection is impossible, spam often shares common
characteristics when it is sent by the same spammer, or by spammers who use the same
spamming software/data, during the same period.

In your case, it is probably the URL "http://www.lowest-mort.com", because the
spammer's objective is to get the user to click the link.
So if a condition of "If body contains www.lowest-mort.com" is set, the spam mail can
be moved to the Junk folder.
I think this approach can be used for "phishing" mail too, because phishing usually
cannot succeed without a link being clicked.

"X-Mailer: Ximian Evolution" can also be used, since many spammers use the same
spamming software.
A condition of "X-Mailer: header contains 'Ximian Evolution'" will be effective
if you, your family, your friends and your company don't use 'Ximian Evolution'.
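
For illustration, the two manual conditions described above could be sketched like this in Python. This is only a rough model of what such a filter checks, not how Thunderbird's filter engine is implemented; the function name and the two rule lists are made up for the example, and the values are taken from the sample message in comment #0.

```python
from email import message_from_string

# Hypothetical rule values, copied from the sample spam in comment #0.
BLOCKED_BODY_SUBSTRINGS = ["www.lowest-mort.com"]
BLOCKED_MAILERS = ["Ximian Evolution"]


def matches_manual_rule(raw_message: str) -> bool:
    """Return True if the message hits the X-Mailer rule or the body-URL rule."""
    msg = message_from_string(raw_message)

    # Rule 1: "X-Mailer: header contains ..."
    mailer = msg.get("X-Mailer", "").lower()
    if any(name.lower() in mailer for name in BLOCKED_MAILERS):
        return True

    # Rule 2: "body contains ..." checked against every text/* part.
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            chunks.append(payload.decode("latin-1", errors="replace"))
    body = "\n".join(chunks).lower()
    return any(needle.lower() in body for needle in BLOCKED_BODY_SUBSTRINGS)
```

As noted above, the X-Mailer rule is only safe if none of your real correspondents use that mail client.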

Blacklisting of URLs by the adaptive filter is probably very hard, but it may
be implemented in the near future by phishing enhancements such as
> Bug 254913 : add "phishing" detection to mail client
> Bug 278490 : Make link href clear in mail window to thwart phishing attacks
> Bug 279191 : Add Phishing Detection Support to Thunderbird
I think this bug can be closed as WONTFIX (impossible), or as a DUP of Bug 254913 /
Bug 279191.
(In reply to comment #2)
> (In reply to comment #0)
> As you see, there are not so many "dangerous words"/"often used words by spam"
> in the spam mail.

Hello,

I'd like to add another kind of spam mail I keep receiving in Thunderbird
(currently the latest 1.0.6, and observed with the previous version 1.0.2 as well),
even though I have clicked at least 5 times on different mails of this kind to mark
them as spam. I suspect the 'Subject' is the troublemaker (private data and
non-relevant data edited out with <...>):

-- Begin of spam email --

Return-Path: <info@djdeep.ch>
Received: from smtp.messaging.ch (exsmtp02.agrinet.ch [81.221.252.201])
	by <...> (Postfix) with ESMTP id 845011CC012
	for <...>; Mon,  8 Aug 2005 19:32:48 +0200 (MEST)
Received: from neo-kdbug5b04ws ([80.219.207.194]) by smtp.messaging.ch with
Microsoft SMTPSVC(6.0.3790.211);
	 Mon, 8 Aug 2005 19:33:23 +0200
Message-ID: <4145-22005818173519515@neo-kdbug5b04ws>
Errors-To: unscribe@djdeep.ch
From: <info@djdeep.ch>
To: <...>
Subject: [SPAM?] PLAY IT LOUD @ LE BAL
Date: Mon, 8 Aug 2005 19:35:19 +0200
MIME-Version: 1.0
Content-Type: multipart/related; 
	boundary="----=_NextPart_94915C5ABAF209EF376268C8"
X-OriginalArrivalTime: 08 Aug 2005 17:33:23.0682 (UTC) FILETIME=[46EE4420:01C59C3F]

This is a multi-part message in MIME format.

------=_NextPart_94915C5ABAF209EF376268C8
Content-Type: multipart/alternative; 
	boundary="----=_NextPart_84815C5ABAF209EF376268C8"

------=_NextPart_84815C5ABAF209EF376268C8
Content-type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

PLAY IT LOUD @ Le Bal=20
Mittwoch, 10=2E August '05

23=2E00 - 04=2E00 Uhr



Play it Loud Part=2E 2=2E Nach der ersten coolen Party im Le Bal setzten w=
ir unsere Serie fort=2E Diesmal haben wir f=FCr Euch ein spezielles Line-U=
p zusammengestellt=2E DJ Tremendo (Subliminal, NYC, Banditz) und DJ Deep (=
Banditz) droenen Euch die Ohren mit den Schaerfsten Elektrobeats zu=2E Int=
roset by DJ Rena Moreno, der Euch so richtig auf Stimmung bringen wird=2E =
Also nicht verpassen!!!=20

Der Eintritt ist Gratis!!

=20

NEWSLETTER ABMELDEN

=20

------=_NextPart_84815C5ABAF209EF376268C8
Content-Type: text/html; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

<html>

<head>
<meta http-equiv=3D"Content-Language" content=3D"de-ch">
<meta name=3D"GENERATOR" content=3D"Microsoft FrontPage 5=2E0">
<meta name=3D"ProgId" content=3D"FrontPage=2EEditor=2EDocument">
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dwindows-=
1252">
<title>FRESH</title>
<style>
<!--
 <...>
-->
</style>
</head>

<...>

</html>
------=_NextPart_84815C5ABAF209EF376268C8--

------=_NextPart_94915C5ABAF209EF376268C8
Content-Type: image/jpeg; name="playitloud.jpg"
Content-Transfer-Encoding: base64
Content-Description: playitloud.jpg
Content-Id: <314640-2200581817335282897@neo-kdbug5b04ws>

<... some image data ...>

-- End of spam email --

I have edited out the HTML content since it contains the same content more or
less as the ASCII part, and the content changes from email to email anyway (some
party announcements etc.).

I _think_ the bugger could be the subject line, since this is the one which
ALWAYS contains [SPAM?] in it:

Subject: [SPAM?] PLAY IT LOUD @ LE BAL

That's the only thing which remains constant (the actual subject changes of
course, e.g. "[SPAM?] SOME OTHER PARTY"), apart from the sender's address (From:
<info@djdeep.ch>) which is also always the same. 

I would expect that it should be fairly easy for the spam filter to filter out
the same sender's address, but obviously it gets tricked somehow over and over
again (by the subject line?)!

Observed both on Windows and Linux Thunderbird, if that matters...


Thanks, Oliver
Cross-ref: Bug 339617 which appears to describe the same issue reported here.
*** Bug 339617 has been marked as a duplicate of this bug. ***
Status: UNCONFIRMED → NEW
Ever confirmed: true
I've been getting tonnes and tonnes of these over the past two weeks. I have gone from one or two spams per day getting through to around thirty (all base64 spam).
Re: comment 6. I've been getting a huge number of those lately as well. 

It seems to me that the problem isn't necessarily anything to do with the encoding, or even anything really wrong with the pseudo-Bayesian filter, but that it's a new and dangerous escalation of spam tactics.

Basically what they've done is slap the spam into an image at the top of an HTML mail, and then filled in the rest of the HTML (and an alternative text block at the top, I think) with semi-random plausible sentences that *don't* look like spam for most people.

Even if we could manage to train the filter on that text, it wouldn't matter, because it's random and from what little I can tell it appears to be chosen from a corpus of actual non-spam emails. 

The only general thing I can think of that might solve this would be converting over to a true Bayesian filter that tracks multi-word conditional probabilities, because most of the crap in the random text block is incoherent. But that would exponentiate the size of the training set and make it really slow. Also, it's not clear that it would solve the underlying problem, which is that none of the words in the messages have anything to do with the spam message... choosing to attach actual non-spam emails in their entirety rather than just snippets would fool even a true Bayesian filter. 
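
A cheap approximation of the multi-word idea above is to feed word pairs (bigrams) to a token-based classifier alongside the single words. The sketch below is not a true Bayesian model and not Thunderbird's tokenizer; the `tokens` function and its regex are assumptions made up for illustration. It mainly shows why the token table grows so quickly.

```python
import re


def tokens(text, use_bigrams=False):
    """Split text into lowercase word tokens, optionally adding adjacent word pairs."""
    words = re.findall(r"[a-z0-9$']+", text.lower())
    if not use_bigrams:
        return words
    # Each adjacent pair becomes one extra token, e.g. "extra cash".
    pairs = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + pairs
```

With a vocabulary of N distinct words there are up to N*N distinct pairs, which is the training-set growth mentioned above. And as also noted above, pairs taken from whole genuine emails would still look like ham.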

Perhaps someone can think of some hack to deal with this tactic, though... Hard to imagine what except to treat initial large images with suspicion... I guess theoretically such images could be OCR'd and included in the spam filtering... ugh.
I've been seeing a lot of spam just like that described in comment 7 and there's nothing there for our filter to pick up on because the actual text words are random. We're actively looking for ideas on how to catch this if folks have them. Maybe we could try to add a token for messages with an image at the very top of a message...but that could of course get fooled too.
You know, I *was* kidding originally, but it occurs to me that computers are fast enough these days that we actually *could* OCR the images... perhaps at low priority or something... I imagine there probably are good enough open-source OCR libraries...
Another question is why image-based spam is displayed when I have HTML display turned off and remote images blocked (I know it's not remote)... :/
I'm going to hijack this bug to focus on the new kind of image spam talked about in comment 7 and comment 8.

Neil Turner just posted a tip he and some others have been using:

http://www.neilturner.me.uk/2006/Aug/06/stopping_image_spam_in_th.html
based on:
http://www.tuaw.com/2006/08/04/a-mail-app-rule-for-catching-image-spam/1#comments

I'd recommend modifying the suggested filter to skip e-mail from addresses in your address book, though. In one of the comments in the tuaw link, someone suggested filtering messages from people you don't know which have an attachment name that ends in .gif. I don't think our current filters can do that, but maybe we can try to expand them to include a rule like 'contains image'. Not sure how easily that would work for IMAP though, since we don't fetch the body.
Summary: spam that evades junk control filter → image spam evades bayesian junk control filter
Target Milestone: --- → Thunderbird2.0
cc'ing Neil since I referenced his blog post.
OS: Windows 2000 → All
Hardware: PC → All
(In reply to comment #11)

That tip is nice but is nothing more than a workaround. It's also annoying to apply if you have several email accounts - a rule must be created (and "maintained"?) for each account.

Are you planning to add actual OCR support to the junk detection algorithms?

By the way, if you haven't noticed, the spammers are already using noise and different colors in their spam images so that OCR can't be used easily. IIRC, there's a commercial implementation of OCR-capable spam filter out there - they're probably targeting that.
In response to the recent animated GIF mails I've been receiving:

It's pretty easy to figure out the metadata of a GIF; by metadata I mean the number of frames, their display times, etc.

Normal animated GIF images have the same display time for each and every frame, usually because they are continuous animations (and keep looping).

The spam messages I've been getting have a bunch of short empty/lined frames to avoid OCR filters, and then display the actual text for a far longer period.

Now I think it should be possible to store these characteristics in the Bayesian filter without marking legit animations as spam.

It might also help to store the color palette (per frame?) in the filter.
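
A minimal sketch of the frame-timing idea above, in Python. It assumes the per-frame display times (in milliseconds) have already been read from the GIF by some image library, which is not shown here. The token names and the 10x threshold are invented for the example; the point is only that timing can be turned into tokens a Bayesian filter can train on like words.

```python
def gif_timing_tokens(frame_durations_ms):
    """Turn per-frame display times into coarse tokens for a token-based filter."""
    if len(frame_durations_ms) < 2:
        return ["gif:static"]

    longest = max(frame_durations_ms)
    shortest = min(frame_durations_ms)
    result = ["gif:animated"]

    # Hypothetical threshold: some frame is shown at least 10x longer than another.
    uneven = longest > 0 and (shortest == 0 or longest / shortest >= 10)
    result.append("gif:uneven-timing" if uneven else "gif:even-timing")
    return result
```

Because these are just tokens, the filter would still decide from training whether uneven timing is actually spammy for a given user, rather than hard-coding that assumption.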
I'm going to have to sit down and grok these postings and the two suggestions linked to, but I just wanted to say that I got here because I was looking to see if a request to detect image spam had already been made.

I am using my own tbird mail filters, because Cox was throwing away real email if I let it handle spam.  It still throws away some stuff, it's clear, even though I selected mark it as spam and send it to me.  I even caught it thinking I was a spammer - I sent a print screen image to myself as a reminder, and it never arrived.  So I retried with a little text and that got thru, but after I had experimented awhile I could never receive any email containing that image nor could anyone else, regardless of the text in the email.  Of course, Cox being Cox, they never told me they had not sent the email on, so now I think some unknown amount of my email with an image in it just goes into the bit bucket.  I mention this as a caution for anyone designing a trap for image spam.

I have the filters set up to toss anything they aren't sure of into a "look at this maybe its spam maybe not" filter.  Image spam winds up there, as does email from new real senders.  I look at the messages in there to see if I can improve my filters.  It's bad enough seeing 10 emails touting the same stupid stock, but 10 emails with a drawing of a penis is just incredibly aggravating.

Just venting, and hoping someone has a solution to this.  Maybe we should chip in and hire the Russian mafia to blow a few spammers away.
QA Contact: general
I've (anecdotally) found that by just deleting the jpeg/picture spam and not marking it as junk, tbird's marking ratio of all my other spam seems to have gone up. I guess the classifier is not getting as 'gunked up' (technical term) with the non-spam-like content of these jpeg-spam emails. Of course, this doesn't really help solve the problem.
Some FYI:
http://csoonline.com/read/040107/fea_spam_by_the_numbers.html
from a Bruce Schneier post:
http://www.schneier.com/blog/archives/2007/05/image_spam.html

Idea from a comment in the post: how about a filter that marks as spam any msg that has an image and the From: is not in the personal address book/whitelist? Maybe we could at least enable this functionality through the filters, and ship a standard one?
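
A rough Python sketch of that rule, using only the standard `email` package. It is an illustration of the logic, not Thunderbird filter code; the function name and the `address_book` parameter (a set of lowercase addresses) are assumptions for the example.

```python
from email import message_from_string
from email.utils import parseaddr


def is_image_from_stranger(raw_message, address_book):
    """True if the sender is not in address_book and any MIME part is an image."""
    msg = message_from_string(raw_message)
    _, sender = parseaddr(msg.get("From", ""))
    if sender.lower() in address_book:
        return False
    # walk() visits the message itself and every nested MIME part.
    return any(part.get_content_maintype() == "image" for part in msg.walk())
```

Note that this needs the full MIME structure of the message, so it runs into the same IMAP concern raised earlier about not fetching the body.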
The latest wonderful spammer variation is a PDF with the text or image inside the PDF (rather than attaching an outright graphic image).
Bug 333501 wants the same thing (image spam detection) but asks for it as a filter criteria RFE.
Assignee: mscott → nobody
Target Milestone: Thunderbird2.0 → ---
I am getting a lot of these lately, and clicking on the "Junk" button every time but not having Thunderbird remember that it is junk is really frustrating.

I get the exact same image every time. Why can't the junk filter just remember the actual image and throw out any message that attaches that image?

I know, eventually the spammers will just start changing the image slightly with each send, but at least it will make their job a bit more difficult, and when they do that, we'll figure out a new way to see through the deception.
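
To make the "remember the actual image" idea concrete, here is a minimal sketch using a SHA-256 digest of the decoded attachment bytes. The function names and the in-memory set are hypothetical; a real implementation would need to persist the digests somewhere.

```python
import hashlib

# Digests of images the user has marked as junk (in-memory for this sketch).
known_junk_images = set()


def image_digest(image_bytes):
    """Return a hex digest that identifies these exact image bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


def remember_junk_image(image_bytes):
    known_junk_images.add(image_digest(image_bytes))


def is_known_junk_image(image_bytes):
    return image_digest(image_bytes) in known_junk_images
```

Changing even one byte of the image changes the digest, so this only catches exact resends, which is the limitation already acknowledged above.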

Thanks.
(In reply to comment #22)

It's not necessary to just remember the image as-is. There are ways of comparing 2 images to see how similar they are. I remember using some image dupe finder programs even several years back.
First, both versions are resized to a small standard size and colors are reduced to a small set (e.g. 16 colors). Then the algorithm checks how close they are. If they're similar, there'll be a high percentage score of how closely they match.
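
The procedure described above could be sketched as follows. This is a simplified illustration, not any particular dupe-finder's algorithm: it assumes each image has already been decoded into a 2-D list of grey values (0-255), and the 8x8 size and 16 levels are arbitrary choices for the example.

```python
def shrink(pixels, size=8):
    """Downscale a 2-D grid of 0-255 grey values to size x size by sampling."""
    height, width = len(pixels), len(pixels[0])
    return [
        [pixels[y * height // size][x * width // size] for x in range(size)]
        for y in range(size)
    ]


def signature(pixels, size=8, levels=16):
    """Shrink the image, then reduce each value to one of `levels` buckets."""
    step = 256 // levels
    return [value // step for row in shrink(pixels, size) for value in row]


def similarity(image_a, image_b):
    """Fraction of signature cells that match, from 0.0 to 1.0."""
    sig_a, sig_b = signature(image_a), signature(image_b)
    matching = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matching / len(sig_a)
```

A score close to 1.0 would mean "probably the same image with small changes". Where to put the cut-off is a tuning question this sketch does not answer.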
Blocks: junktracker
Hmm, five and a half years and no action on this. This one will be old enough to wear long trousers soon. 

Spammers have been shifting to avoid filters but this basic problem remains. Far from trying to keep ahead, Tbird is still on the starting blocks on this one. 

From the above dupe:

Using customised headers is a good option and allows matching of X-Mozilla-Status2 to detect an attachment and Content-Type: multipart/mixed. But this is still too general.

Whatever solution is found here will need to be flexible since the target will continue to move. I think the simple lexical "contains/begins with/ends with" filters are too limited.

For example in filtering X-Mozilla-Status2 , arithmetic or bitwise boolean would be useful rather than lexical tests. 
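
As an illustration of a bitwise test versus a lexical one, here is a small Python sketch. It assumes the X-Mozilla-Status2 value is a hexadecimal string and that 0x10000000 is the "has attachment" bit; both should be checked against the actual flag definitions before relying on them.

```python
# Assumed value of the "has attachment" bit; verify against the real flag definitions.
ATTACHMENT_FLAG = 0x10000000


def has_flag(status2_header, flag):
    """Parse the header value as hex and test whether the given bit is set."""
    value = int(status2_header.strip(), 16)
    return (value & flag) != 0
```

A lexical "begins with 1" test would miss a value like 30000000, where the same bit is set together with another one; the bitwise test does not care which other bits are present.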

Probably a mechanism to call some outside script or executable filter would offer a means to add OCR or image comparison and to adapt local filtering as spam evolves.
Elaborating on my previous comment, combining several filter conditions would be a lot more powerful and probably not hard to code (just a bit more parsing).

eg. 

(subject [contains] HP) OR (subject [contains] xerox ) AND (subject [contains] scanner)

Currently each individual filter line gets combined with OR to make the total filter. To implement this, an extra button could be added at the start of each line to select AND / OR / XOR, along with some means of bracketing (grouping) these logical operations.
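
To show what grouped conditions might look like once parsed, here is a small Python sketch. The nested-tuple rule format and the `evaluate` function are invented for illustration and are not an existing Thunderbird structure. The example reads the expression above as (HP OR xerox) AND scanner, which is one possible grouping.

```python
def evaluate(rule, headers):
    """Evaluate a nested rule such as ("and", rule1, rule2) against header values."""
    op = rule[0]
    if op == "contains":
        _, field, needle = rule
        return needle.lower() in headers.get(field, "").lower()

    # Any other operator combines the results of its sub-rules.
    results = [evaluate(sub, headers) for sub in rule[1:]]
    if op == "and":
        return all(results)
    if op == "or":
        return any(results)
    if op == "xor":
        return sum(results) % 2 == 1
    raise ValueError(f"unknown operator: {op}")


# (subject contains HP OR subject contains xerox) AND subject contains scanner
EXAMPLE_RULE = (
    "and",
    ("or", ("contains", "subject", "HP"), ("contains", "subject", "xerox")),
    ("contains", "subject", "scanner"),
)
```

The explicit nesting is what the bracketing would provide in the UI: without it, the same three lines could just as well mean HP OR (xerox AND scanner).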
Component: General → Filters
Product: Thunderbird → MailNews Core
ideas of how to avoid image spam:
http://kb.mozillazine.org/Junk_Mail_Controls#Image_spam
http://computing.physics.harvard.edu/imagespam

Thunderbird's junk filter, and Bayesian filtering in general, was historically designed to act on text only. And several references, including http://en.wikipedia.org/wiki/Email_spam#Image_spam, indicate that determining spam based on image contents can be unreliable.
Blocks: 401568
Type: defect → enhancement
Summary: image spam evades bayesian junk control filter → Determine a way to effectively handle image spam in the Baysian junk filter
Severity: normal → S3