Closed Bug 136055 Opened 22 years ago Closed 18 years ago

Filter/Search on Body erroneously applied to encoded binary attachments

Categories

(MailNews Core :: Filters, defect)

x86
Windows 98
defect
Not set
major

Tracking

(Not tracked)

VERIFIED DUPLICATE of bug 37031

People

(Reporter: dmitry, Assigned: sspitzer)

References

Details

If I have a filter rule 

Body contains "porn"

then a legitimate message with a binary attachment gets filtered if the
attachment happens to contain 

...
1jQNPorNpERB6fvelBUi1+XqGieb7gKwd8asCQNRAO7Uf6AINv/A/+DgH3DW/gJXyejoLLjVRSpI
...
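The match can be reproduced outside the mail client. This is a minimal sketch, assuming the filter does a case-insensitive substring test on the raw, still-encoded message text; the helper name `naive_body_contains` is hypothetical and is not the client's actual code.

```python
# Line of base64 text copied from the attachment quoted in the report.
ENCODED_LINE = (
    "1jQNPorNpERB6fvelBUi1+XqGieb7gKwd8asCQNRAO7Uf6AINv/A/+DgH3DW/gJXyejoLLjVRSpI"
)


def naive_body_contains(raw_body: str, needle: str) -> bool:
    """Case-insensitive substring test on the raw text, encoded parts included."""
    return needle.lower() in raw_body.lower()


# The encoded data contains "PorN", so the filter word matches by accident.
hit = naive_body_contains(ENCODED_LINE, "porn")
print(hit)
```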
related: bug 67421
Severity: critical → major
also related: bug 98141
*** Bug 153973 has been marked as a duplicate of this bug. ***
On the other hand, if the message is multipart/mixed and its html part is
encoded as base64, filters of the form "body contains string" do not apply to
it. I think the html part of a message should be considered its "body" for
filtering purposes, to be decoded if necessary and fed through the "body" filters.
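A rough sketch of the decoding step described here, using Python's standard `email` package rather than anything from the mail client. The sample message and the search string "cheap pills" are made up for illustration, and `decoded_text_parts` is a hypothetical helper name.

```python
import base64
from email import message_from_string

html = b"<p>cheap pills</p>"  # invented sample content
RAW = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="sep"',
    "",
    "--sep",
    "Content-Type: text/html; charset=utf-8",
    "Content-Transfer-Encoding: base64",
    "",
    base64.b64encode(html).decode("ascii"),
    "--sep--",
    "",
])


def decoded_text_parts(msg):
    """Yield each text/* part as a string, with its transfer encoding undone."""
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        data = part.get_payload(decode=True)  # reverses base64 / quoted-printable
        charset = part.get_content_charset() or "utf-8"
        yield data.decode(charset, errors="replace")


msg = message_from_string(RAW)
raw_hit = "cheap pills" in RAW  # what a search over the raw text sees
decoded_hit = any("cheap pills" in text for text in decoded_text_parts(msg))
print(raw_hit, decoded_hit)
```

The raw search misses the phrase because it only exists in encoded form; the decoded search finds it.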
*** Bug 166573 has been marked as a duplicate of this bug. ***
*** Bug 159645 has been marked as a duplicate of this bug. ***
*** Bug 181418 has been marked as a duplicate of this bug. ***
Confirmed.  Voted for.

Suggest changing OS to all.

This is quite annoying to me also.  I have a filter called "porn mail" that
checks to see if the body contains "sex", "porn", "farm", etc.  I use match "any
of these words" because "match all words" doesn't work well for this
application.  There is therefore no way to set up filters like these without
making each filter only 1 word + "and only if message doesn't have attachment"
(because I would have to use match all).

Filters would be greatly improved if this bug were to be fixed.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Wouldn't a rule "body doesn't contain 'Content-Transfer-Encoding: base64'" (or any
other way to detect attachments; this was the best I could think of just now)
solve this?
Well, try it yourself.  I don't think that will work though, because you would
have to use "match all", instead of "match any of the below rules".  Using match
all would work, but it doesn't solve the problem.  It's just a very awkward
workaround.
You cannot stop the filters from being applied to a message's attachments. 
Using match all severely limits the usefulness of filters, unless you
want to make one filter per word to match.  Correct me if I'm wrong.
Well, I have a rule set up that moves Klez to a Viruses folder using a "Body
contains <first line of encoded Klez>" filter, so if this bug were fixed that
would stop working.
mass re-assign.
Assignee: naving → sspitzer
Maybe offer both a "Body" and a "text-only body" criterion, where the latter
covers only the parts with "Content-type: text/anything". For example, if the
reporter of this bug set the rule:
"text-only body" match any porn or xxx
it would not trigger the filter on a message that has a gif with
...
1jQNPorNpERB6fvelBUi1+XqGieb7gKwd8asCQNRAO7Uf6AINv/A/+DgH3DW/gJXyejoLLjVRSpI
...
because it would look only at text/plain, text/html, and other text/* parts.
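The two criteria could behave roughly like this. It is only a sketch built on Python's standard `email` package, with a made-up message around the encoded line from the report; `raw_body`, `text_only_body` and `matches_any` are hypothetical names.

```python
from email import message_from_string

RAW = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="sep"',
    "",
    "--sep",
    "Content-Type: text/plain; charset=us-ascii",
    "",
    "Here is the picture you asked for.",
    "--sep",
    "Content-Type: image/gif",
    "Content-Transfer-Encoding: base64",
    "",
    "1jQNPorNpERB6fvelBUi1+XqGieb7gKwd8asCQNRAO7Uf6AINv/A/+DgH3DW/gJXyejoLLjVRSpI",
    "--sep--",
    "",
])


def raw_body(raw: str) -> str:
    """"Body" scope: everything after the header block, left undecoded."""
    return raw.split("\n\n", 1)[1]


def text_only_body(raw: str) -> str:
    """"Text-only body" scope: only the text/* parts, decoded."""
    chunks = []
    for part in message_from_string(raw).walk():
        if part.get_content_maintype() == "text":
            data = part.get_payload(decode=True)
            charset = part.get_content_charset() or "ascii"
            chunks.append(data.decode(charset, errors="replace"))
    return "\n".join(chunks)


def matches_any(text: str, words) -> bool:
    """Case-insensitive "match any of these words"."""
    lowered = text.lower()
    return any(word.lower() in lowered for word in words)


body_hit = matches_any(raw_body(RAW), ["porn", "xxx"])
text_hit = matches_any(text_only_body(RAW), ["porn", "xxx"])
print(body_hit, text_hit)
```

Under these assumptions the "body" rule still fires on the encoded gif, while the "text-only body" rule does not.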

But Justin Kerk from comment #11 would still be able to set
"body" match <first line of encoded Klez>
A workaround that worked for me and a few others: I get probably 30 or 40 junk
emails a day, and I have maybe 30 or 40 people in my address book, those that I
confer with daily. Since most of them are on the same mail domain, I set up
something like so...
if sender does not contain <specified domain> then move to trash
if sender does not contain <specified email> or <specified email>, etc.

I find that filtering for the emails I do want instead of those I don't want
works faster and is much less work.

I would still like an attachment filter though. I am currently using 3 rules
explicitly stating different parts of the header info; using the AND feature I
can get 90% of them, but a custom criterion named "attachment" would be nice.
This bug is still present in current builds, e.g. Thunderbird 0.7.*. I would
really appreciate it if it were fixed before 1.0, because it has been bugging me
since Netscape 6.0.
I have a filter set up to move all emails that contain the word "yps" to a
separate folder, but it works so badly that almost any mail with an attachment
is being moved.
*** Bug 267230 has been marked as a duplicate of this bug. ***
That duplicate notes that the problem exists for UUencoded attachments, as well 
as MIME ones.
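UUencoded data sits inline in the body between a "begin <mode> <name>" line and an "end" line, so MIME headers cannot be used to recognize it. One possible way to skip such blocks before searching is sketched below; `strip_uuencoded` is a hypothetical helper and the sample body is invented.

```python
import binascii
import re

# A uuencoded block opens with "begin <octal mode> <filename>".
UU_BEGIN = re.compile(r"^begin [0-7]{3,4} \S")


def strip_uuencoded(body: str) -> str:
    """Return the body with every begin...end block removed."""
    kept, inside = [], False
    for line in body.splitlines():
        if not inside and UU_BEGIN.match(line):
            inside = True       # drop the "begin" line itself
        elif inside:
            if line.strip() == "end":
                inside = False  # drop the "end" line, resume keeping text
        else:
            kept.append(line)
    return "\n".join(kept)


encoded = binascii.b2a_uu(b"\x00\x01binary bytes\xff").decode("ascii")
body = "See attached.\nbegin 644 photo.gif\n" + encoded + "`\nend\nRegards\n"
stripped = strip_uuencoded(body)
print(stripped)
```

Note that this sketch drops everything after an unterminated "begin" line, which may or may not be the behavior one wants.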
Product: MailNews → Core
*** Bug 272042 has been marked as a duplicate of this bug. ***
Summary: Filters erroneously apply to encoded binary attachments → Filter/Search on Body erroneously applied to encoded binary attachments
This is a big problem, and while one result of the bug is talked about 
frequently (the fact that the wrong messages are matched with a 
search/filter), another problem is mostly overlooked, but arguably even more 
important: a full body search takes extremely long to complete in folders with 
many attachments, since those attachments can easily be megabytes in size, 
while the text in the message (that should be searched) is perhaps only a few 
percent of that. So, the search could easily be made 95% quicker or so in most 
situations, where people send a few pictures or other documents every now and 
then. I really don't understand why it takes years to fix this bug: I thought 
open-source actually meant that things get fixed quickly, but this really 
makes me lose confidence in this process and this product.
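How much of a message is actually text can be measured rather than guessed. The following is an illustrative sketch with a synthetic message (random bytes standing in for a picture), not a benchmark of the mail client; the real ratio depends entirely on the mail in the folder.

```python
import os
from email.message import EmailMessage

# Synthetic message: a short note plus a 500 kB stand-in for a photo.
msg = EmailMessage()
msg["Subject"] = "holiday pictures"
msg.set_content("A few pictures from the trip.\n")
msg.add_attachment(
    os.urandom(500_000), maintype="image", subtype="jpeg", filename="trip.jpg"
)

# Bytes a raw search has to scan versus bytes that belong to text parts.
total_bytes = len(msg.as_bytes())
text_bytes = sum(
    len(part.get_payload(decode=True))
    for part in msg.walk()
    if part.get_content_maintype() == "text"
)
share = text_bytes / total_bytes
print(f"text is {share:.4%} of the message")
```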
*** Bug 282682 has been marked as a duplicate of this bug. ***
I think this is a dup of bug #132340
(In reply to comment #21)
> I think this is a dup of bug #132340

Not exactly -- there we *do* want to search the body after decoding; here, we 
not only don't want to search within (binary) attachments, we don't even want to 
decode them in the first place (during search).
(In reply to comment #22)
> (In reply to comment #21)
> > I think this is a dup of bug #132340
> 
> Not exactly -- there we *do* want to search the body after decoding; here, we 
> not only don't want to search within (binary) attachments, we don't even want to 
> decode them in the first place (during search).

Ah, ok, that makes sense. But in light of my efforts to unify the junk filter
and regular filters, I think we need to make this difference explicit in the UI.
E.g., there should be separate criteria:
    Body Text <contains/etc>  <-- that is, only search the plaintext
    Attachment <contains/etc>  <-- only search the attachments, decode as necessary
    Body <contains/etc>   <-- everything

and some other bug reports have also requested
    All Headers
and
    Entire Message

as criteria scopes. That would probably cover all the bases.
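One way to picture these scopes is as a function that returns the text a given criterion would search. This is a sketch only: the `Scope` enum and `text_for_scope` are hypothetical, "attachment" is assumed to mean a part with `Content-Disposition: attachment`, and "Entire Message" is read as headers plus every body part.

```python
from email.message import EmailMessage
from enum import Enum


class Scope(Enum):
    BODY_TEXT = "Body Text"            # inline text parts only
    ATTACHMENT = "Attachment"          # attachments only, decoded
    BODY = "Body"                      # every body part
    ALL_HEADERS = "All Headers"
    ENTIRE_MESSAGE = "Entire Message"  # headers plus every body part


def _decoded(part):
    data = part.get_payload(decode=True) or b""
    # Fall back to latin-1 when no charset is declared (e.g. binary parts).
    return data.decode(part.get_content_charset() or "latin-1", errors="replace")


def text_for_scope(msg, scope):
    inline_text, attachments, every_part = [], [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        text = _decoded(part)
        every_part.append(text)
        if part.get_content_disposition() == "attachment":
            attachments.append(text)
        elif part.get_content_maintype() == "text":
            inline_text.append(text)
    headers = "\n".join(f"{name}: {value}" for name, value in msg.items())
    pieces = {
        Scope.BODY_TEXT: inline_text,
        Scope.ATTACHMENT: attachments,
        Scope.BODY: every_part,
        Scope.ALL_HEADERS: [headers],
        Scope.ENTIRE_MESSAGE: [headers] + every_part,
    }
    return "\n".join(pieces[scope])


msg = EmailMessage()
msg["Subject"] = "Invoice"
msg.set_content("Please see the attached report.\n")
msg.add_attachment("quarterly totals\n", filename="notes.txt")

body_text = text_for_scope(msg, Scope.BODY_TEXT)
attachment_text = text_for_scope(msg, Scope.ATTACHMENT)
```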
(In reply to comment #23)

> E.g., there should be separate criteria:
>     Body Text <contains/etc>  <-- that is, only search the plaintext
>     Attachment <contains/etc>  <-- only search the attachments, decode as necessary
>     Body <contains/etc>   <-- everything

After re-reading some more, that doesn't seem to really cover it completely. It
helps to be able to decide which portions of the message to filter. If the
portion you choose is encoded, it should always be decoded. (And the spam filter
will always operate on the whole message, I didn't need to mention that here.)
(In reply to comment #24)
> If the portion you choose is encoded, it should always be decoded. (And the
> spam filter will always operate on the whole message, I didn't need to
> mention that here.)

I think that's all fine; but I wonder if it's ever necessary to perform a text 
search within a binary attachment.

As an example, suppose you regularly get messages with text/html attachments, 
and others with (large) image/jpeg.  If you're searching "attachment body" for a 
string that you expect in some html files, you don't need to spend the time 
decoding the JPEG and then performing a probably-fruitless search in that file's 
data; even if you *did* get a match (on a short string, presumably, like that in 
the original bug report here), it would probably be a false positive.

Generally, I would prefer that "attachment body" searching was limited to text 
attachments (including message/rfc822 attachments).  On the other hand, there 
are Word and PDF and Postscript docs which are in fact mostly text that might 
well be a good target for filtering.  But there are many, many JPEGs, MP3s and 
the like being mailed around these days which do not seem like good targets for 
text filtering.
Can anyone explain why it takes > 3 years to fix something simple like this? 
I'm still getting the wrong search results, and my searches take way too long 
because it's searching through attachments unnecessarily, as I explained 
earlier. It shouldn't take one programmer more than a day or so to make sure 
that attachments aren't searched. Or should it?

LOL, if it was that easy, yes, it would be done. Body search has no idea about
the mime structure of a message, and teaching it about mime would be non-trivial.
But surely it shouldn't have to take over 3 years?! I mean, MIME isn't rocket 
science.. just follow the specs from the corresponding RFC document. You don't 
even have to decode anything, just ignore the binary parts. Have a look 
at 'view source' in an email message: all that basically needs to be done is 
search through everything below "Content-Type: text/plain", and "Content-Type: 
text/html", and ignore the rest. It's rather trivial.

If the reason that this isn't done is that Thunderbird basically isn't being 
supported anymore by the developer community, then so be it, but in that case 
it would be nice to put some sort of note on the main page - like: "this 
product is no longer being supported or improved", so potential users don't 
make the mistake of downloading this program expecting broken things to be fixed 
within a reasonable time span. 

Or am I basically just expecting too much? How do other users / developers 
feel about this? Do most people think it's normal to wait 3 years for simple 
bugs to be fixed? Also, there doesn't really seem to be any (visible) 
progress: it would be nice if the people managing this part of the program 
would post something on this bug page, like "yeah.. we're working on it: it'll 
be fixed in release 1.xx". That's the way it's done for instance by Sun on the 
Java bug parade. Right now, I wouldn't be surprised if it's still not fixed 
in another 3 years.
(In reply to comment #27)
> LOL, if it was that easy, yes, it would be done. Body search has no idea about
> the mime structure of a message, and teaching it about mime would be non-trivial.

Just because we're not working on your favorite bug doesn't mean we've done
nothing over the last three years.

>"Content-Type: text/plain", and "Content-Type: 
>text/html", and ignore the rest. It's rather trivial.

What about messages with nested attachments?

Anyway, we're just going to have to agree to disagree about the relative
importance and difficulty of fixing this bug. Any help you want to offer in
coding up a fix for this would be greatly appreciated...
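To illustrate the nesting question: a scan that only looks at the top-level parts misses text sitting inside a nested multipart, while a recursive walk finds it. The sketch below uses Python's standard `email` package on an invented message; `top_level_text` and `all_text` are hypothetical helpers and say nothing about how the client's body search is implemented.

```python
from email import message_from_string

RAW = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="outer"',
    "",
    "--outer",
    'Content-Type: multipart/alternative; boundary="inner"',
    "",
    "--inner",
    "Content-Type: text/plain",
    "",
    "plain version",
    "--inner",
    "Content-Type: text/html",
    "",
    "<p>html version</p>",
    "--inner--",
    "--outer",
    "Content-Type: image/jpeg",
    "Content-Transfer-Encoding: base64",
    "",
    "/9j/4AAQ",
    "--outer--",
    "",
])


def top_level_text(msg):
    """Looks only at direct children, so nested text parts are missed."""
    parts = msg.get_payload() if msg.is_multipart() else [msg]
    return [p.get_payload() for p in parts if p.get_content_maintype() == "text"]


def all_text(msg):
    """walk() descends into nested multipart containers."""
    return [p.get_payload() for p in msg.walk() if p.get_content_maintype() == "text"]


msg = message_from_string(RAW)
shallow = top_level_text(msg)
deep = all_text(msg)
print(len(shallow), len(deep))
```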

*** This bug has been marked as a duplicate of 37031 ***
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → DUPLICATE
Status: RESOLVED → VERIFIED
Product: Core → MailNews Core