Closed Bug 401568 Opened 17 years ago Closed 9 years ago

Junk filter does not detect particular mails (sender: VIAGRA (R))

Tracking

(Not tracked)

Status:

RESOLVED INCOMPLETE

People

(Reporter: cjoanidis, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: testcase)

Attachments

(2 files, 1 obsolete file)

Don't include <style> in plain text, v.1 (17 years ago, Phil Ringnalda (:philor), 1.48 KB, patch)
mbox with a couple of examples (17 years ago, Phil Ringnalda (:philor), 10.01 KB, application/mbox)
NSPR log running junk controls on that folder (17 years ago, Phil Ringnalda (:philor), 44.75 KB, text/plain)

Christian Joanidis

Reporter

Description

•

17 years ago

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.0; es-ES; rv:1.8.1.8) Gecko/20071008 Firefox/2.0.0.8
Build Identifier: version 2.0.0.6 (20070728)

Some mails arrive in my mailbox and are not detected as spam, even though I have already taught Thunderbird that these mails should be considered spam. The sender's name is Viagra (R), but the sender's address is my own address. The subject is something like "October 76% off".

Reproducible: Always


Jo Hermans

Comment 1

•

17 years ago

Thunderbird runs a pattern-based analysis on each message. If a message wasn't recognized, that's because it didn't contain enough "bad patterns" (words that appeared in previous junk messages). But the junk analysis can *learn* from your spam mail (it's trainable): if you mark a mail as junk, its content will be used in the analysis too. That's why there's a Junk button on your toolbar; otherwise a Delete button would suffice.

Just mark them as junk; the next time, they should be recognized.
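To make "trainable" concrete, here is a minimal sketch of a token-counting junk filter. It is a toy written for illustration, not Thunderbird's implementation: the class name `TinyJunkFilter`, the whitespace tokenizer, and the 0.5 value for never-seen tokens are all assumptions.

```python
from collections import Counter


class TinyJunkFilter:
    """Toy trainable token filter (hypothetical, not Thunderbird's code)."""

    def __init__(self):
        self.junk_tokens = Counter()   # messages marked junk that contain each token
        self.good_tokens = Counter()   # messages marked good that contain each token
        self.junk_messages = 0
        self.good_messages = 0

    @staticmethod
    def tokenize(text):
        # Assumed tokenizer: lower-case, split on whitespace, one count per message.
        return set(text.lower().split())

    def train(self, text, is_junk):
        """Record the tokens of one message the user has classified."""
        tokens = self.tokenize(text)
        if is_junk:
            self.junk_tokens.update(tokens)
            self.junk_messages += 1
        else:
            self.good_tokens.update(tokens)
            self.good_messages += 1

    def token_junk_probability(self, token):
        """Share of the token's occurrence rate that comes from junk mail."""
        junk_rate = self.junk_tokens[token] / max(self.junk_messages, 1)
        good_rate = self.good_tokens[token] / max(self.good_messages, 1)
        if junk_rate + good_rate == 0:
            return 0.5  # assumed neutral value for a token never seen in training
        return junk_rate / (junk_rate + good_rate)


junk_filter = TinyJunkFilter()
junk_filter.train("October 76% off", is_junk=True)
junk_filter.train("meeting notes for October", is_junk=False)
print(junk_filter.token_junk_probability("off"))
print(junk_filter.token_junk_probability("october"))
```

Each click on the Junk button corresponds to one `train(..., is_junk=True)` call here: tokens that only ever show up in junk drift toward 1, while tokens shared with good mail stay near the middle.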

Status: UNCONFIRMED → RESOLVED

Closed: 17 years ago

Resolution: --- → WORKSFORME

Phil Ringnalda (:philor)

Comment 2

•

17 years ago

Apparently Jo isn't on the list for those Viagra(R) spams, which consist of a couple of large sets of randomness wrapped in <style></style> surrounding a small obvious payload. I haven't been keeping mine like I should have, but I looked at a couple, and I suspect they are exploiting two of our bugs:

The most obvious one is that the plaintext serializer passes through the content of <style> elements in the body of an HTML message, as you can see by composing an HTML message with "What is the frequency, <style><em>Kenneth</em></style>" and sending HTML+text - the text part will be "What is the frequency, <em>Kenneth</em>". I tried sending myself just the payload of the ones I looked at, and it alone was caught by the filter, so stripping the content of style elements, rather than just passing it through, ought to be a quick win.
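As a rough illustration of the intended behaviour (drop the content of <style> instead of passing it through), here is a sketch using Python's standard `html.parser`. It is not the Mozilla plaintext serializer; the names `PlainTextExtractor` and `html_to_text` are made up for this example.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collect text content, skipping anything inside <style> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._style_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._style_depth += 1

    def handle_endtag(self, tag):
        if tag == "style" and self._style_depth > 0:
            self._style_depth -= 1

    def handle_data(self, data):
        # Text is kept only when we are not inside a <style> element.
        if self._style_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_text(html):
    """Hypothetical helper: HTML in, plain text without <style> content out."""
    extractor = PlainTextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()


print(html_to_text("What is the frequency, <style><em>Kenneth</em></style>"))
```

With the test message from the paragraph above, the chaff inside <style> never reaches the text that would be handed to the junk tokenizer.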

The more subtle one is that poisoning like that isn't actually supposed to work. I don't understand Bayesian filters as well as I should, but all the literature about poisoning seems to say that adding a bunch of random words (in the case of the one I remember, a bunch of random Spanish words, which ought to be nearly all unknown in my corpus) shouldn't work, either to get the message through or to gradually poison the filter when you train on it.
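The intuition from that literature can be sketched as follows. This assumes a Graham-style scheme in which only the most "interesting" tokens (those furthest from 0.5) are combined and never-seen tokens get a neutral 0.5; the token probabilities, the cap of 15, and the function name `junk_score` are invented for the example, and Thunderbird's actual scoring may differ.

```python
import math

NEUTRAL = 0.5          # assumed probability for a token never seen in training
MOST_INTERESTING = 15  # assumed cap on how many tokens are combined


def junk_score(message_tokens, known_probabilities):
    """Combine per-token junk probabilities into one score between 0 and 1."""
    probs = [known_probabilities.get(t, NEUTRAL) for t in set(message_tokens)]
    # Keep only the tokens whose probability is furthest from neutral.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    log_odds = sum(math.log(p / (1.0 - p)) for p in probs[:MOST_INTERESTING])
    return 1.0 / (1.0 + math.exp(-log_odds))


# Made-up training results for the payload words.
known = {"viagra": 0.99, "buy": 0.9, "click": 0.8}
payload = ["click", "buy", "viagra"]
chaff = [f"palabra{i}" for i in range(1000)]  # random words unknown to the corpus

print(junk_score(payload, known))
print(junk_score(payload + chaff, known))
```

Under these assumptions the two scores are the same: unknown words contribute log-odds of zero, so a thousand of them neither hide the payload nor, by themselves, move the result.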

(The fact that the sender is "Viagra", while rubbing our nose in it, I don't think is a bug - as the comment in the code says, sender probably is too strong of a signal to use, given the high percentage of spams that are from "you" or someone you know who was nearby where they harvested your address.)

Status: RESOLVED → UNCONFIRMED

Resolution: WORKSFORME → ---

Phil Ringnalda (:philor)

Updated

•

17 years ago

Status: UNCONFIRMED → NEW

Component: General → MailNews: Filters

Ever confirmed: true

OS: Windows Vista → All

Product: Thunderbird → Core

QA Contact: general → filters

Hardware: PC → All

Version: unspecified → Trunk

Phil Ringnalda (:philor)

Comment 3

•

17 years ago

Hmm, is passing through <style> a trunk-only regression from bug 308145? Before that, the parser would relocate them to the head, but I don't actually see where the plaintext serializer does anything with the nsIDocumentEncoder::OutputBodyOnly flag we pass to it.

Severity: minor → normal

Phil Ringnalda (:philor)

Comment 4

•

17 years ago

Attached patch: Don't include <style> in plain text, v.1 (obsolete)

Go, me. I can't remember the last time I was right the first time in a bug. It *is* a bug that the plaintext serializer includes the content of <style>, and this patch fixes that, near as I can tell, but since that's not a problem on the 1.8 branch, where I've been hearing from people that Tbird's being sullenly untrainable about the October 76% off spam, that can't be the real problem with them.

Attachment #286640 - Flags: review?(mrbkap)

Blake Kaplan (:mrbkap) (inactive)

Comment 5

•

17 years ago

Comment on attachment 286640
Don't include <style> in plain text, v.1

This probably was a regression from bug 272702, but was probably broken for XHTML documents already.

Attachment #286640 - Flags: review?(mrbkap) → review+

Reed Loden [:reed]

Updated

•

17 years ago

Assignee: nobody → philringnalda

Phil Ringnalda (:philor)

Comment 6

•

17 years ago

Comment on attachment 286640
Don't include <style> in plain text, v.1

On second thought, a sane person would put this patch, which will require approval and affect Fx's copy-paste, in a bug in a component that gets approval triage.

Attachment #286640 - Attachment is patch: false

Attachment #286640 - Flags: review+

Reed Loden [:reed]

Updated

•

17 years ago

Attachment #286640 - Attachment is patch: true

Phil Ringnalda (:philor)

Updated

•

17 years ago

Depends on: 401662

Phil Ringnalda (:philor)

Comment 7

•

17 years ago

Comment on attachment 286640
Don't include <style> in plain text, v.1

Moved to bug 401662.

Attachment #286640 - Attachment is obsolete: true

Phil Ringnalda (:philor)

Comment 8

•

17 years ago

And I'm *so* not taking this bug: the <style>-poison was my only idea, and unless everyone on the branch having trouble is having it because they don't train, and only trunk people were training with too much chaff, that can't be it.

Assignee: philringnalda → nobody

Thomas Wilson

Comment 9

•

17 years ago

I'm just a TB user who is very happy with TB: in 4 years of always checking what TB has thrown away, I have found fewer than 10 messages that should not have been tossed. The filter works great, with the exception of this single email. Getting rid of this nuisance would be excellent.

I have received this email HUNDREDS of times, have dutifully marked every one as spam, and TB has never been able to handle it. The fact that mousing over the "viagra@official site" sender shows our own email address means that it is very hard to figure out where it comes from.

What I would like someone to do is to tell us where it originates. I have no skills in this area, but unless the message was generated by my own computer (very unlikely, since my computer is armed like a fortress), there have to be tracks it left as it arrived. Isn't that always how the FBI finds those MSoft messages that confirm what MS has firmly denied?

This shouldn't be difficult for someone who specializes in email science. Some progress against this would be greatly appreciated.

Thanks

Phil Ringnalda (:philor)

Comment 10

•

17 years ago

Attached file: mbox with a couple of examples

The last two I've gotten, in an mbox for experimental fun.

Phil Ringnalda (:philor)

Comment 11

•

17 years ago

Attached file: NSPR log running junk controls on that folder

A few things that stand out to me, from logging what the junk filter sees from those two messages:

* Even with my fix from bug 401662, we're still getting garbage in the "stripped html" - <center><a><img><style> apparently isn't getting to the right place to realize it should drop the contents of the <style>

* If we weren't getting garbage, we'd be getting nothing - in an earlier example I found on the web, the payload between the style blocks was "<p><font color="#FF0000"><a href="http://www.preparenine.com"><b><font size="+2">Click to buy Viagra for as low as $1.53 <b></a></font></p>" but in both of my current ones, it's an open link and an image with no alt or title, or in plaintext terms, ""

* the busted encoding in the from header (where we're adding the token "from:viagra ¬Æ official site <webmaster@philringnalda.com>" from an iso-8859-1 mail) worries me a bit, since it's very nearly all we have to work with. Given proper HTML stripping, the mail boils down to "subject:{current month}" "subject:7x%" "subject:off" and the From: header, so if we screw up the comparison between ¬Æ and ® then we've got nothing.
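A small sketch of why the encoding matters for the From: token. One way a pair like ¬Æ can arise is when the UTF-8 bytes of ® are read as Mac OS Roman; whether that is what happened in the attached log is an assumption, as are the `from_token` helper and the example address.

```python
REGISTERED = "\u00ae"  # the (R) sign used in the sender's display name

latin1_bytes = REGISTERED.encode("iso-8859-1")  # one byte: 0xAE
utf8_bytes = REGISTERED.encode("utf-8")         # two bytes: 0xC2 0xAE

# Reading the UTF-8 bytes with the wrong charset yields two unrelated characters.
misread = utf8_bytes.decode("mac_roman")


def from_token(display_name, address):
    """Build a hypothetical 'from:' token in the style shown in the log."""
    return f"from:{display_name} <{address}>"


trained = from_token(f"viagra {REGISTERED} official site", "user@example.com")
seen = from_token(f"viagra {misread} official site", "user@example.com")

print(latin1_bytes, utf8_bytes, repr(misread))
print(trained == seen)
```

If the token stored at training time and the token produced at classification time go through different decodings, they compare unequal, and the one strong signal left in these messages is lost.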

Phil Ringnalda (:philor)

Comment 12

•

17 years ago

Well, at least I make a lot of noise: that would be a log of running the junk filter in a build that _didn't_ have my fix from bug 401662. With the fix, it's a more reasonable

-1610559488[1907ac0]: tokenize: 
<meta http-equiv="Context-Type" content="text/html; charset=iso-8859-1">
<html><body>
 
<a href="http://www.ropepopulate.com"><img>



</a>
-1610559488[1907ac0]: tokenize stripped html:  
-1610559488[1907ac0]: tokenize: </body></html>
-1610559488[1907ac0]: tokenize stripped html:

Nobody; OK to take it and work on it

Assignee

Updated

•

16 years ago

Product: Core → MailNews Core

Wayne Mery (:wsmwk)

Comment 13

•

14 years ago

philor/others,
is this still evident today? (Sorry, I can't test well because of my SpamAssassin front end.) And is the primary cause poisoning? How much does this correlate with bug 280716?

Wayne Mery (:wsmwk)

Updated

•

9 years ago

Depends on: 280716

Summary: Spam filter does not detect particular mails (sender: VIAGRA (R)) → Junk filter does not detect particular mails (sender: VIAGRA (R))

Wayne Mery (:wsmwk)

Comment 14

•

9 years ago

Christian's email address is dead, but we have his testcase.

Is this entirely/mostly dependent on serializer issues (except item 2 of comment 11, which presumably is bug 280716)?

Flags: needinfo?(philringnalda)

Keywords: testcase

Phil Ringnalda (:philor)

Comment 15

•

9 years ago

It's unclear to me what information you might think I possess that you need, 8 years after I last looked at our spam filtering, so let's fix the bug that I look like someone to ask. If some hypothetical person comes along wanting to work on catching the chaff-filled spam I got this weekend about home warranties, they'll file a new bug rather than looking at one from 2007 anyway.

Status: NEW → RESOLVED

Closed: 17 years ago → 9 years ago

Flags: needinfo?(philringnalda)

Resolution: --- → INCOMPLETE
