Closed Bug 401568 Opened 17 years ago Closed 9 years ago

Junk filter does not detect particular mails (sender: VIAGRA (R))

Categories

(MailNews Core :: Filters, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: cjoanidis, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: testcase)

Attachments

(2 files, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.0; es-ES; rv:1.8.1.8) Gecko/20071008 Firefox/2.0.0.8
Build Identifier: version 2.0.0.6 (20070728)

Some mails arrive in my mailbox and are not detected as spam, even though I have already taught Thunderbird that these mails should be considered spam. The sender's name is Viagra (R), but the sender's address is my own address. The subject is something like "October 76% off".

Reproducible: Always

Steps to Reproduce:
1.
2.
3.
Thunderbird uses a pattern-based analysis on the message - if a mail wasn't recognized, it was because it didn't contain enough "bad patterns" (words that were used in previous junk messages). But the junk analysis can *learn* from your spam mail (it's trainable) - if you mark the mail as junk, then its content will be used in the analysis too. That's why there's a Junk button on your toolbar; otherwise a Delete button would suffice.

Just mark them as junk, the next time they should be recognized.
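
As a rough illustration of what "trainable" means here, the following is a minimal sketch of a word-count trainer. It is not Thunderbird's actual code: the class name TinyJunkTrainer, the whitespace tokenizer, and the add-one smoothing are all assumptions made for the example.

```python
from collections import Counter


class TinyJunkTrainer:
    """Hypothetical toy trainer: counts in how many junk/good mails a word appears."""

    def __init__(self):
        self.junk_counts = Counter()
        self.good_counts = Counter()
        self.n_junk = 0
        self.n_good = 0

    def train(self, text, is_junk):
        # Each distinct word counts once per message (assumed tokenizer: split on whitespace).
        tokens = set(text.lower().split())
        if is_junk:
            self.junk_counts.update(tokens)
            self.n_junk += 1
        else:
            self.good_counts.update(tokens)
            self.n_good += 1

    def token_spamminess(self, token):
        # Add-one smoothing so an unseen word scores a neutral 0.5.
        j = (self.junk_counts[token] + 1) / (self.n_junk + 2)
        g = (self.good_counts[token] + 1) / (self.n_good + 2)
        return j / (j + g)


trainer = TinyJunkTrainer()
untrained = trainer.token_spamminess("viagra")
trainer.train("Buy Viagra now", is_junk=True)       # the user pressed Junk
trainer.train("lunch on friday", is_junk=False)     # the user left this one alone
```

After the two training calls, "viagra" scores above the neutral value and "lunch" below it, which is the effect of pressing the Junk button rather than Delete.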
Status: UNCONFIRMED → RESOLVED
Closed: 17 years ago
Resolution: --- → WORKSFORME
Apparently Jo isn't on the list for those Viagra(R) spams, which consist of a couple of large sets of randomness wrapped in <style></style> surrounding a small obvious payload. I haven't been keeping mine like I should have, but I looked at a couple, and I suspect they are exploiting two of our bugs:

The most obvious one is that the plaintext serializer passes through the content of <style> elements in the body of an HTML message, as you can see by composing an HTML message with "What is the frequency, <style><em>Kenneth</em></style>" and sending HTML+text - the text part will be "What is the frequency, <em>Kenneth</em>". I tried sending myself just the payload of the ones I looked at, and it alone was caught by the filter, so stripping the content of style elements, rather than just passing it through, ought to be a quick win.
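
To make the difference concrete, here is a small sketch of an HTML-to-text conversion that either passes <style> content through or drops it. It uses Python's standard html.parser purely for illustration and says nothing about how the Gecko plaintext serializer is implemented; TextExtractor, html_to_text and the drop_style flag are names made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical converter that collects text and can skip <style> content."""

    def __init__(self, drop_style=True):
        super().__init__(convert_charrefs=True)
        self.drop_style = drop_style
        self.style_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.style_depth += 1

    def handle_endtag(self, tag):
        if tag == "style" and self.style_depth > 0:
            self.style_depth -= 1

    def handle_data(self, data):
        # Inside <style>, the parser reports the raw content as data;
        # drop it unless we are imitating the pass-through behaviour.
        if self.drop_style and self.style_depth > 0:
            return
        self.parts.append(data)


def html_to_text(html, drop_style=True):
    parser = TextExtractor(drop_style)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = "What is the frequency, <style><em>Kenneth</em></style>"
passed_through = html_to_text(sample, drop_style=False)
stripped = html_to_text(sample, drop_style=True)
```

With drop_style=False the chaff inside <style> ends up in the text handed to the junk filter; with drop_style=True only the visible text remains.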

The more subtle one is that poisoning like that isn't actually supposed to work - I don't understand bayesian filters as well as I should, but all the literature about poisoning seems to be saying that adding a bunch of random words (in the case of the one I remember, a bunch of random Spanish words, which ought to be nearly all unknown in my corpus) shouldn't work either to pass through, or to gradually poison the filter when you train it.
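
For what it's worth, here is a sketch of why the literature expects random chaff to fail, under one commonly described scheme: score only the handful of tokens furthest from neutral, and give never-seen tokens a near-neutral score. This is an assumption-laden toy, not the algorithm Thunderbird uses; UNKNOWN, TOP_N, spam_score and the per-token probabilities are all invented for the example.

```python
from math import prod

UNKNOWN = 0.4   # assumed score for a token the filter has never seen
TOP_N = 15      # assumed number of "most interesting" tokens to combine


def spam_score(tokens, token_probs):
    """Combine per-token probabilities, using only the tokens furthest from 0.5."""
    probs = [token_probs.get(t, UNKNOWN) for t in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:TOP_N]
    spam = prod(top)
    ham = prod(1 - p for p in top)
    return spam / (spam + ham)


# Made-up probabilities for three payload words learned from earlier junk.
token_probs = {"viagra": 0.99, "buy": 0.9, "click": 0.9}
payload = ["click", "buy", "viagra"]
chaff = [f"palabra{i}" for i in range(200)]   # 200 unknown "random words"

score_payload = spam_score(payload, token_probs)
score_with_chaff = spam_score(payload + chaff, token_probs)
```

In this toy, 200 unknown words pull the score down only slightly, because the strongly spammy payload tokens are always among the ones that get combined. That matches the expectation in the literature, so whatever is letting these mails through presumably lies elsewhere.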

(The fact that the sender is "Viagra", while rubbing our nose in it, I don't think is a bug - as the comment in the code says, sender probably is too strong of a signal to use, given the high percentage of spams that are from "you" or someone you know who was nearby where they harvested your address.)
Status: RESOLVED → UNCONFIRMED
Resolution: WORKSFORME → ---
Status: UNCONFIRMED → NEW
Component: General → MailNews: Filters
Ever confirmed: true
OS: Windows Vista → All
Product: Thunderbird → Core
QA Contact: general → filters
Hardware: PC → All
Version: unspecified → Trunk
Hmm, is passing through <style> a trunk-only regression from bug 308145? Before that, the parser would relocate them to the head, but I don't actually see where the plaintext serializer does anything with the nsIDocumentEncoder::OutputBodyOnly flag we pass to it.
Severity: minor → normal
Go, me. I can't remember the last time I was right the first time in a bug. It *is* a bug that the plaintext serializer includes the content of <style>, and this patch fixes that, near as I can tell, but since that's not a problem on the 1.8 branch, where I've been hearing from people that Tbird's being sullenly untrainable about the October 76% off spam, that can't be the real problem with them.
Attachment #286640 - Flags: review?(mrbkap)
Comment on attachment 286640 [details] [diff] [review]
Don't include <style> in plain text, v.1

This probably was a regression from bug 272702, but was probably broken for XHTML documents already.
Attachment #286640 - Flags: review?(mrbkap) → review+
Assignee: nobody → philringnalda
Comment on attachment 286640 [details] [diff] [review]
Don't include <style> in plain text, v.1

On second thought, a sane person would put this patch which will require approval, and affect Fx's copy-paste, in a bug in a component that gets approval triage.
Attachment #286640 - Attachment is patch: false
Attachment #286640 - Flags: review+
Attachment #286640 - Attachment is patch: true
Depends on: 401662
Comment on attachment 286640 [details] [diff] [review]
Don't include <style> in plain text, v.1

Moved to bug 401662.
Attachment #286640 - Attachment is obsolete: true
And I'm *so* not taking this bug: the <style>-poison was my only idea, and unless everyone on the branch having trouble is having it because they don't train, and only trunk people were training with too much chaff, that can't be it.
Assignee: philringnalda → nobody
I'm just a TB user who is very happy with TB - in 4 years of always checking what TB has thrown away, I have found less than 10 messages that should not have been tossed. The filter works great, with the exception of this single email. Getting rid of this nuisance would be excellent. 

I have received this email HUNDREDS of times, have dutifully marked every one as spam, and TB has never been able to handle it. The fact that if we mouse over the "viagra@official site" it shows our own email address, means that it is very hard to figure out where it comes from. 

What I would like someone to do is to tell us where it originates. I have no skills in this area, but unless the message was generated by my own computer (very unlikely since my computer is armed like a fortress), there have to be tracks it left as it arrived. Isn't that always how the FBI finds those MSoft messages that confirm what MS has firmly denied? 

This shouldn't be difficult for someone who specializes in email science. Some progress against this would be greatly appreciated.

Thanks



 
The last two I've gotten, in an mbox for experimental fun.
A few things that stand out to me, from logging what the junk filter sees from those two messages:

* Even with my fix from bug 401662, we're still getting garbage in the "stripped html" - <center><a><img><style> apparently isn't getting to the right place to realize it should drop the contents of the <style>

* If we weren't getting garbage, we'd be getting nothing - in an earlier example I found on the web, the payload between the style blocks was "<p><font color="#FF0000"><a href="http://www.preparenine.com"><b><font size="+2">Click to buy Viagra for as low as $1.53 <b></a></font></p>" but in both of my current ones, it's an open link and an image with no alt or title, or in plaintext terms, ""

* the busted encoding in the from header (where we're adding the token "from:viagra ¬Æ official site <webmaster@philringnalda.com>" from an iso-8859-1 mail) worries me a bit, since it's very nearly all we have to work with. Given proper HTML stripping, the mail boils down to "subject:{current month}" "subject:7x%" "subject:off" and the From: header, so if we screw up the comparison between ¬Æ and ® then we've got nothing.
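
A sketch of why the encoding matters for that last point: if the same sender name is decoded with two different charsets, the resulting "from:" tokens are different strings and will never match each other in the training data. The from_token helper and its token format are hypothetical, and the choice of Mac Roman as the "wrong" charset is only an assumption that happens to reproduce the two characters shown above.

```python
def from_token(raw_header: bytes, charset: str) -> str:
    """Hypothetical token builder: lowercase the ASCII bytes, then decode."""
    return "from:" + raw_header.lower().decode(charset)


# The registered sign is the single byte 0xAE in ISO-8859-1.
raw_latin1 = b"Viagra \xae Official Site"
good = from_token(raw_latin1, "iso-8859-1")

# Assumed failure mode: the sign is stored as UTF-8 (bytes C2 AE) somewhere
# along the way and then re-read as a single-byte charset.
raw_utf8 = "Viagra ® Official Site".encode("utf-8")
mangled = from_token(raw_utf8, "mac_roman")

tokens_match = good == mangled
```

Here good is the token with a real registered sign and mangled is the two-character form, so tokens_match is False: a filter trained on one spelling learns nothing about the other.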
Well, at least I make a lot of noise - that would be a log of running the junk filter in a build that _didn't_ have my fix from bug 401662 - with the fix, it's a more reasonable

-1610559488[1907ac0]: tokenize: 
<meta http-equiv="Context-Type" content="text/html; charset=iso-8859-1">
<html><body>
 
<a href="http://www.ropepopulate.com"><img>



</a>
-1610559488[1907ac0]: tokenize stripped html:  
-1610559488[1907ac0]: tokenize: </body></html>
-1610559488[1907ac0]: tokenize stripped html: 
Product: Core → MailNews Core
philor/others,
 is this still evident today?  (sorry I can't test well due to spamassassin front end)  And the primary cause is poisoning?  How much does this correlate to bug 280716?
Depends on: 280716
Summary: Spam filter does not detect particular mails (sender: VIAGRA (R)) → Junk filter does not detect particular mails (sender: VIAGRA (R))
Christian's email address is dead.  But we have his testcase.

Is this entirely/mostly dependent on serializer issues?   (except item 2 of comment 11, which presumably is bug 280716)
Flags: needinfo?(philringnalda)
Keywords: testcase
It's unclear to me what information you might think I possess that you need, 8 years after I last looked at our spam filtering, so let's fix the bug that I look like someone to ask. If some hypothetical person comes along wanting to work on catching the chaff-filled spam I got this weekend about home warranties, they'll file a new bug rather than looking at one from 2007 anyway.
Status: NEW → RESOLVED
Closed: 17 years ago → 9 years ago
Flags: needinfo?(philringnalda)
Resolution: --- → INCOMPLETE