Message filter with "Body contains" does not ignore tag content when tag is broken across lines.

RESOLVED FIXED in Thunderbird 65.0

Status

defect
RESOLVED FIXED
4 years ago
8 months ago

People

(Reporter: ski, Assigned: jorgk)

Tracking

(Blocks 2 bugs)

38 Branch
Thunderbird 65.0
x86
Windows Vista
Dependency tree / graph

Thunderbird Tracking Flags

(thunderbird_esr6064+ fixed, thunderbird64 fixed, thunderbird65 fixed)

Details

Attachments

(2 attachments, 7 obsolete attachments)

User Agent: Mozilla/5.0 (Windows NT 6.0; rv:42.0) Gecko/20100101 Firefox/42.0
Build ID: 20151029151421

Steps to reproduce:

I can reproduce the bug always in the following way. It need s an POP-account in order to use the "Body contains" filter option.

1.) Create the message filter:
 In the Menu go to Tools/Message Filters.
 Create a new filter via the button [New...].
 Give it some name. I did not change the hooks, so the hook on "Manually Run" and on "Getting New Mail: Filter before Junk Classification" are set. The next two hooks are unset. I set the radio button to "Match any of the following", although this is not necessary for one entry.
 For the filter options I choose "Body" "contains" "spamerwebsjte".
 (The "j" is just to create a new word.)
 Under "Performs these actions:" choose "Move Message to" "Trash on GMX"
 Press [OK]

2.) Switch to sending pure HTML-mails without plain text in the following way:
 In the Menu go to Tools/Options/Composition/[Send Options...].
 In the drop-down-menu choose "Send the message in HTML anyway".
 Press [OK] [OK].

3.) Send the following HTML-mail to yourself:
  Create a new message, address it to any POP-account you can read. A senseful subject could be "Test with short line".
  Click the message body and then in the Menu choose Insert/HTML... . 
  Paste the following content:

<html>
  <body>
  Dear Reader,<br> 
  <a href="http://www.some.spamerwebsjte.tat.i.want.to.filter.out.com">Click Me !</a>
  </body>
</html>

3.) Click [Insert]. Before sending you might like to save the message as template for further tests.
 Send the message.

4.) Get new messages from the POP account you have send the mail to.


Actual results:

When receiving the message, it is not filtered i.e. moved to the Trash but it stays in the Inbox. In the message source there is no line break before the "href"-command:

[...]
Dear Reader,<br>
    <a href="http://www.some.spamerwebsjte.tat.i.want.to.filter.out.com">Click

      Me !</a>
[...]


Expected results:

The word "spamerwebsjte" should have been found by the filter and the message should have been moved to the trash.

But the filter works, if you make the line with the link a bit longer in the test mail: use this test message, where only one additional letter is introduced in "that" instead of "tat":
 (a senseful subject would be "Test with long line")

<html>
  <body>
  Dear Reader,<br> 
  <a href="http://www.some.spamerwebsjte.that.i.want.to.filter.out.com">Click Me !</a>
  </body>
</html>

After sending and receiving this message, it is moved to the trash and the longer line is broken in the source before the "href"-command:

[...]
Dear Reader,<br>
    <a
      href="http://www.some.spamerwebsjte.that.i.want.to.filter.out.com">Click

      Me !</a>
[...]

Obviously, this additional line break, that appears automatically, makes the filter to work correctly.
But the filter must find the search word in both test mails.
OS: Unspecified → Windows Vista
Hardware: Unspecified → x86
Confirmed.

Adding:
  <body>
  Dear Reader,<br> 
  <a href="http://www.some.spamerwebsjte.tat.i.want.to.filter.out.com">Click Me !</a>
  </body>
the link line is broken before the href and the message is filtered, regardless of filter configuration
(Filter before/after Junk Classification).

Adding:
  <body bgcolor="#FFFFFF" text="#000000">
    <p> Dear Reader,<br>
      <a href="http://www.some.spamerwebsjte">Click Me !</a> </p>
  </body>
results in the link line not being broken and the message is NOT filtered correctly.

I wonder whether filtering should happen only on visible text or on all text in the message.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1211128
OK, after some discussion in bug 1211128 we concluded that text in HTML tags should be ignored in the search. That this doesn't happen when the tag is broken across lines is a bug.
Processing is here:
http://mxr.mozilla.org/comm-central/source/mailnews/base/search/src/nsMsgBodyHandler.cpp#310

So I am re-opening the bug for this and correcting the summary.

The discussion of whether to search inside the tags can happen in the other bug 1211128.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Summary: Message filter with "Body contains" fails in some HTML-mails depending on line length → Message filter with "Body contains" does not ignore tag content when tag is broken across lines.
Status: REOPENED → NEW
Duplicate of this bug: 1498496
Something like this might do it. Sadly causes test failures in
mach xpcshell-test comm/mailnews/base/test/unit/test_searchBody.js

So something isn't right yet. Also I'd need to add a test case where a HTML tag is split across lines.
Assignee: nobody → jorgk
All tests in
mach xpcshell-test comm/mailnews/base/test/unit/test_searchBody.js
pass now. I'll add one for a tag broken between lines.
Attachment #9016862 - Attachment is obsolete: true
Attachment #9016872 - Flags: review?(acelists)
Comment on attachment 9016872 [details] [diff] [review]
1230815-html-search.patch (v2)

Well, firstly it still doesn't quite work, as the test shows, and secondly, StripHtml() is a little simplistic since it doesn't look at strings, so

<tag attr1="xx>" attr2="xxx"> would only remove up to the first > :-(
Attachment #9016872 - Flags: review?(acelists)
Sadly it does find the strings it shouldn't find :-(
OK, the code works for a multipart message, but doesn't work for a single part message yet.
Attachment #9016875 - Attachment is obsolete: true
Comment on attachment 9016880 [details] [diff] [review]
1230815-html-search-test.patch (v2)

The test is fine and passes with the next patch.
Attachment #9016880 - Flags: review?(acelists)
This works now on simple and multipart messages.

We still might to improve StripHtml() while we're here.
Attachment #9016872 - Attachment is obsolete: true
Attachment #9016884 - Flags: review?(acelists)
I noticed that when accumulating lines of HTML I need to insert a space between them so the last word of a line and the first word of the next line aren't glued together producing false positives.
Attachment #9016884 - Attachment is obsolete: true
Attachment #9016884 - Flags: review?(acelists)
Attachment #9016921 - Flags: review?(acelists)
Oops, the m_stripHtml removal hunks were missing.
Attachment #9016921 - Attachment is obsolete: true
Attachment #9016921 - Flags: review?(acelists)
Attachment #9016922 - Flags: review?(acelists)
Added another test that we find stuff again once the tag is closed and that we produce false positives at the end of the line.
Attachment #9016880 - Attachment is obsolete: true
Attachment #9016880 - Flags: review?(acelists)
Attachment #9016925 - Flags: review?(acelists)
Comment on attachment 9016922 [details] [diff] [review]
1230815-html-search.patch (v3c)

Review of attachment 9016922 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, this seems to work for me. Strips the value in the tag attributes and even makes the split "Click   Me" findable via search.
Attachment #9016922 - Flags: review?(acelists) → review+
Comment on attachment 9016925 [details] [diff] [review]
1230815-html-search-test.patch (v2b)

Review of attachment 9016925 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/test/data/HTML-with-split-tag2.eml
@@ +24,5 @@
> +<body>
> +HTML part. Now comes the tag
> +
> +<tag attr1=ShouldNotFindThis
> +attr2=ShouldntFindThisEither>

Please quote one attribute so that we see proper syntax also works :)

And please make another tag that is properly closed to check also that case.

And also test searching the string inside tags that is wrapped, if we already do not have a test for it.
As in the report.
<a
href="xx">Click
Me</a>
... and search for "Click Me".
Attachment #9016925 - Flags: review?(acelists) → review+
(In reply to :aceman from comment #16)
> Please quote one attribute so that we see proper syntax also works :)
attr=xxx is proper syntax, no quotes required, but I can add them.

> And please make another tag that is properly closed to check also that case.
I don't understand. The tag is both properly closed. What would you like me to do exactly?

> <a
> href="xx">Click
> Me</a>
> ... and search for "Click Me".
Can do.
(In reply to Jorg K (GMT+2) from comment #17)
> > And please make another tag that is properly closed to check also that case.
> I don't understand. The tag is both properly closed. What would you like me
> to do exactly?

Tag <tag> in your test is unknown so it is not sure a real HTML parser would guess if it has to look for it's closing tag </tag> or not. Anyway, if you add the <a href>text </a> case, and search within the 'href' and the 'text' (where only 'text' will be found), you should have this covered.
More tests.
Attachment #9016925 - Attachment is obsolete: true
Attachment #9019281 - Flags: review+
Attachment #9019281 - Flags: approval-comm-esr60?
Attachment #9019281 - Flags: approval-comm-beta+
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/8072f03fc22d
remove unused SetStripHtml() and accumulate HTML part to correct tag stripping. r=aceman
https://hg.mozilla.org/comm-central/rev/617e07ac0498
remove unused SetStripHtml() and accumulate HTML part to correct tag stripping, test. r=aceman
Status: NEW → RESOLVED
Closed: 4 years ago9 months ago
Resolution: --- → FIXED
Attachment #9016922 - Flags: approval-comm-esr60?
Attachment #9016922 - Flags: approval-comm-beta+
Comment on attachment 9019281 [details] [diff] [review]
1230815-html-search-test.patch (v3)

Review of attachment 9019281 [details] [diff] [review]:
-----------------------------------------------------------------

Great, thank.
Attachment #9019281 - Flags: review+
Blocks: 1501358
Target Milestone: --- → Thunderbird 65.0
Comment on attachment 9016922 [details] [diff] [review]
1230815-html-search.patch (v3c)

Good for 60.4.
Attachment #9016922 - Flags: approval-comm-esr60? → approval-comm-esr60+
Attachment #9019281 - Flags: approval-comm-esr60? → approval-comm-esr60+
You need to log in before you can comment on or make changes to this bug.