Open Bug 1211128 Opened 10 years ago Updated 1 year ago

Apply filter to URLs in message body

Categories

(Thunderbird :: Filters, enhancement)

38 Branch
x86_64
Windows 7
enhancement

Tracking

(Not tracked)

People

(Reporter: giovanni.gozzi, Unassigned)

References

(Blocks 1 open bug)

Details

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0
Build ID: 20150826023504

Steps to reproduce:

Good morning, I'm receiving spam from particular "companies". I can recognize them because all URLs redirect to their domain, then probably to the final sponsor's URL. They are very smart, as they continuously change the bodies and subjects, so formal filters work poorly. I would ask you to introduce a filter that is able to look inside hypertext and recognize the URL; if it contains the bad domain, I can trash the message automatically. Yes, I tried all the spam-killer apps, etc., but they don't work, as I guessed. Thank you for adding it. I think it's not much work and doesn't depend on IMAP/POP; it just has to filter messages into a folder.

Actual results:

I wrote all in first message

Expected results:

I wrote all in first message
OS: Unspecified → Windows 7
Hardware: Unspecified → x86_64
There are other bugs that complain about the fact that a text string in a message body is not detected or not detected reliably when this string is inside a link.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Filter URLs in body message → Apply filter to URLs in message body
Bug 1230815 comment #1 has a reproducible example. Bug 1245157 reports that "://t.co/" is not filtered in a link in a text.
Blocks: 1245157
After a little reading it becomes clear that the rules don't match on tag content, for example <a href="http://spammer-site.com">. See http://mxr.mozilla.org/comm-central/source/mailnews/base/search/src/nsMsgBodyHandler.cpp#310. Without looking into too much detail, the processing appears to be line based, so a tag split across two lines is not stripped, as was observed in bug 1230815. nsMsgBodyHandler::StripHtml() has been there from the beginning of MXR, so should we change that? We could look into the href part of a link. Opinions?
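To illustrate the line-based behaviour described above, here is a toy model in Python. It is not the actual nsMsgBodyHandler::StripHtml() code; the function names and the sample message are made up for illustration. It only shows why a stripper that works one line at a time leaves a tag's contents searchable when the tag is split across two lines, while one that carries state across lines does not.

```python
import re

# Matches a tag only if it opens and closes within the same string.
TAG_RE = re.compile(r"<[^>]*>")


def strip_per_line(lines):
    """Toy line-based stripper: each line is processed in isolation."""
    return [TAG_RE.sub("", line) for line in lines]


def strip_stateful(lines):
    """Toy stripper that remembers across lines whether it is inside a tag.

    Simplification: a '>' inside a quoted attribute value would end the tag early.
    """
    out, in_tag = [], False
    for line in lines:
        kept = []
        for ch in line:
            if in_tag:
                if ch == ">":
                    in_tag = False
            elif ch == "<":
                in_tag = True
            else:
                kept.append(ch)
        out.append("".join(kept))
    return out


# Hypothetical message bodies: the same link, unbroken and broken across lines.
one_line = ['Click <a href="http://spammer-site.com">here</a>']
split = ['Click <a', 'href="http://spammer-site.com">here</a>']

print(strip_per_line(one_line))  # tag removed, URL not searchable
print(strip_per_line(split))     # tag survives, URL leaks into the searched text
print(strip_stateful(split))     # tag removed even though it spans two lines
```

In this model a "Body contains spammer-site.com" rule would match the split message but not the unbroken one, which is the inconsistency reported in bug 1230815.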
Flags: needinfo?(rkent)
Flags: needinfo?(mkmelin+mozilla)
Added Mark to the CC list of the bug since some action is likely to happen here.
The body filter has always searched only the message text and not tags, and we would be unlikely to change the default behavior. If you would like alternate behavior, you would need to propose some method of allowing that without adding UI complexity, for example a hidden option that could be set by a filter management extension (such as FiltaQuilla). The body filter is not intended as a spam filter, and most interest in looking at tags is for that purpose, like your <a href="http://spammer-site.com"> example. I doubt you could justify making this a core option. Ideally, for an extension you would want to be able to have a custom search term "Search Full Message" or something (that might also include any header?). It is not clear how you would do that with a custom search term as currently defined.
Flags: needinfo?(rkent)
So this would be a WONTFIX? We just make sure that tags broken across lines also get ignored (see bug 1230815 for an example).
I think we need to discuss the requirements a little more before WONTFIX. The original hope was that a custom search term could be defined that used the body, and you could add custom search over the body in an addon. But that is not actually working. So I think that trying to match "<a href="http://spammer-site.com">" is a legitimate request, and there ought to be some pathway that allows that. Now it is not possible either in core, or in an addon. We could at least investigate how an addon might do it, which will need some changes to core to support.
Hi, I'm the reporter of the spammer-website example in the duplicate #1230815. I'm not a professional in this, just an "advanced user". But from a user's point of view, a filter that is named "Body contains" should search all the message source for the search word, and not sometimes in HTML tags and sometimes only outside the HTML tags. An addon solution that works reliably would also be great, but a built-in filter would be nicer. I think we can agree that nobody relies on this strange behaviour of the filter as it is right now, I mean that the broken lines are searched and the unbroken lines are not. Therefore, I think it would be no problem for anyone if the filter were changed to always search everything in the message source. Most people's filters would then work even better. If you don't want to touch it, I suggest including a new filter option in the drop-down menu below "Body", named "Full Body" or "Message source", to search the full message source including all HTML tags and commands. Thanks for your work. Bye (eagerly waiting for the TB update with this function)
Yes, I agree, there is need for action here. Two different body searches could be useful.

(In reply to ski from comment #10)
> Bye (eagerly waiting for the TB update with this function)

Don't hold your breath. TB 45 ESR will be released in March, and it won't have a fix; the next ESR release 52 is in December.
Yeah !!! That would be a decent X-mas present then :-)
"But from a user's point of view, a filter that is named "Body contains" should search all the message source for the search word and not sometimes in HTML-tags and sometimes only outside the HTML-tags." I don't see how that follows. "Body" is the text of the message. Most users know nothing of HTML tags or message headers and such. The bugmail for your reply sent to me had a message header "Authentication-Results" in it. If I search for Body with "Authentication" wouldn't I be surprised to match this bugmail, when nothing in my system shows "Authentication" at all (short of viewing the message text?) I'm very cautious about recommending changes to existing behavior that are not obvious failures of the design intent. So "sometimes matches, sometimes doesn't" is a legitimate need to fix, "I wish it also searched headers" is a significant change.
It was suggested in comment #10 that the "Body contains" can/should be left as it is, but a new option "Body source contains" should be implemented. This could be useful. However, thinking about the immense amount of trash MS Outlook sends in the embedded stylesheets (see bug 1219928 comment #8 for an example (that tripped up the spell checker)) the body source could deliver some surprising results; all messages originating from Outlook will match "font", "normal" and "swiss".
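The concern about a "Body source contains" option can be sketched as follows. The HTML sample is invented to resemble a message with an embedded stylesheet; it is not taken from a real Outlook mail or from bug 1219928. The VisibleText class and visible_text() helper are hypothetical names, built on Python's standard html.parser module.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text content, skipping the contents of <style> and <script>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def visible_text(html_body):
    parser = VisibleText()
    parser.feed(html_body)
    parser.close()
    return "".join(parser.parts)


# Made-up message with an embedded stylesheet.
sample = (
    '<html><head><style>p.MsoNormal {font-family: "Calibri"; '
    'font-weight: normal;}</style></head>'
    '<body><p class="MsoNormal">See you on Monday.</p></body></html>'
)

for term in ("font", "normal", "Monday"):
    print(term, "source:", term in sample, "text:", term in visible_text(sample))
```

A search over the raw source matches "font" and "normal" even though neither word appears in what the recipient reads, which is the kind of surprising result described above.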
Actually, the bug summary requests search in URLs, so the href part, nothing else.
Severity: normal → enhancement
Thanks Jorg, exactly. I could agree with Kent that "Body" does not include all the invisible HTML coding, but actually my intent is to filter link URLs that are behind a highlighted text which does not contain the URL visibly (see my example at Bug #1230815). I often get spam HTML mails which have a highlighted text like "Click me" or "Watch the video here", and I see the URL they link to in the status bar when hovering over them with the mouse. Hence, these hidden links are not really invisible to the user, as they become visible when you put the mouse over them. These URLs are what I want to use as filter search words. All the other HTML stuff, which is really invisible to the user, I don't care about. Do you agree that these URLs are parts of the HTML that are visible to the user? And it is really annoying when you try to get rid of this spam, which is not caught by the junk filter (even after marking the mails as junk) because the visible text changes so strongly that it cannot adapt. The only constant in those mails is the link URLs. And I see them when hovering with the mouse, but TB does not remove them, although it would be SO EASY if it just scanned the HTML commands, at least the ones for links.
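What the request amounts to can be sketched in a few lines of Python. This is only an illustration of the idea, not a proposal for how Thunderbird would implement it; HrefCollector and links_to_domain are hypothetical names, and the sketch assumes the HTML body has already been decoded to a string. It collects the href values of <a> tags and compares the host of each URL against a blocked domain.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_to_domain(html_body, bad_domain):
    """Return True if any link points at bad_domain or one of its subdomains."""
    collector = HrefCollector()
    collector.feed(html_body)
    collector.close()
    bad = bad_domain.lower()
    for href in collector.hrefs:
        try:
            host = (urlsplit(href).hostname or "").lower()
        except ValueError:
            continue  # malformed URL, ignore it in this sketch
        if host == bad or host.endswith("." + bad):
            return True
    return False


# Hypothetical spam body: the visible text never shows the URL,
# and the tag is broken across two lines.
body = '<p>Watch the video <a\nhref="http://www.spammer-site.com/v?id=1">here</a></p>'
print(links_to_domain(body, "spammer-site.com"))
```

Matching on the parsed host rather than on a raw substring avoids hits on unrelated domains that merely contain the blocked name, and because the parser works on the whole body the result does not depend on where line breaks fall.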
Agreed there could be some use of it, but it's hard to balance what to really expose to end-users.
Flags: needinfo?(mkmelin+mozilla)
Whiteboard: Good day all! Thank you on your efforts about this, filtering http is essential to fight SPAM, they usually put in links the same domains, even if the sender/purpose is different. You may leave all as it is, add a field HTTP link into the new filter rule…
Good day all! Thank you for your efforts on this. Filtering HTTP links is essential to fight spam; spammers usually put the same domains in their links, even if the sender/purpose is different. You may leave everything as it is and add a field "HTTP link" to the new filter rules window; when it is selected, the filter will search within the href strings. Thank you! PS: I would recommend fixing a big old Thunderbird issue: when forwarding an email containing visible pictures, Thunderbird hangs on "attaching". You have to paste the body with the pictures into Word and copy/paste it again into a new message, which is annoying. It's related to broken HTML links, but it's a bit silly that they count as broken when they are clearly visible in the draft, which means they aren't broken at all. Thank you
Thanks for the feedback. We're considering adding the href part of a link to the search. We are aware of the "attaching" issue.

Any progress here?

Severity: normal → S3
See Also: → 453385
See Also: → 1906516
Duplicate of this bug: 1913216