Closed Bug 1498496 Opened 6 years ago Closed 6 years ago

"Search" function in IMAP folder, against message body (HTML format), produces wrong results

Categories

(Thunderbird :: Search, defect)

52 Branch
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1230815

People

(Reporter: konstantin, Unassigned)

Details

(Whiteboard: [dupeme])

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0

Steps to reproduce:

Opened "Search" over IMAP folder.

Entered terms (must match both):

Body contains "Payment"
Body contains ":PO"

(I was looking for an invoice where the payment type contains "PO", i.e. purchase order.)


Actual results:

HTML message matched, containing the following lines:

<p style=3D"margin: 0em 0; font-size: 0.8em;" id=3D"pa=
ymentType">Payment Type: Credit Card</p> (this one matched first term)

<td id=3D"action-btn" bgcolor=3D"#12=
6AFB" style=3D"background: #126AFB; background: -moz-linear-gradient(top, #=
126AFB 0%, #1DAEFC 100%); background: -webkit-gradient(linear, left top, r =
bottom, color-stop(0%,#126AFB), color-stop(100%,#1DAEFC)); background: -web=
kit-linear-gradient(top, #126AFB 0%,#1DAEFC 100%); background: -o-linear-gr=
adient(top, #126AFB 0%,#1DAEFC 100%); background: -ms-linear-gradient(top, =
#126AFB 0%,#1DAEFC 100%); background: linear-gradient(top, #126AFB 0%,#1DAE=
FC 100%); padding:15px 30px; margin: 10px 20px; -moz-border-radius:10px; -w=
ebkit-border-radius:10px; -khtml-border-radius:10px; border-radius:10px; cu=
rsor:pointer;  box-shadow: -3px 3px 5px #84B9FD;" align=3D"center"> (this one matched second term)


Expected results:

The above message should not be included in the search results, since the second term matched an attribute inside an HTML tag.

When I search for text, I search for human-readable text, not for HTML markup fragments. Alternatively, there should be a separate option "Match in raw HTML".
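
For illustration, the expected behavior can be sketched in Python as below. This is a minimal sketch and not Thunderbird's code. It assumes the part is quoted-printable encoded and the match is case-insensitive, in which case ":PO" presumably hit "cursor:pointer" in the style attribute. The names `TextOnly`, `visible_text` and `body_contains` are hypothetical, the sample is shortened from the lines above, and the "Pay now" cell text is invented.

```python
import quopri
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects character data only; tags and attributes are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(raw_part: bytes, charset: str = "utf-8") -> str:
    # Undo quoted-printable first: "=3D" becomes "=", and soft line
    # breaks ("=" at end of line) are joined back together.
    markup = quopri.decodestring(raw_part).decode(charset, errors="replace")
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


def body_contains(raw_part: bytes, term: str) -> bool:
    # Case-insensitive match against human-readable text only.
    return term.lower() in visible_text(raw_part).lower()


# Shortened from the report; "Pay now" is made-up cell content.
sample = (
    b'<p style=3D"margin: 0em 0; font-size: 0.8em;" id=3D"pa=\n'
    b'ymentType">Payment Type: Credit Card</p>\n'
    b'<td id=3D"action-btn" style=3D"border-radius:10px; cu=\n'
    b'rsor:pointer;" align=3D"center">Pay now</td>\n'
)

print(body_contains(sample, "Payment"))  # term is in the visible text
print(body_contains(sample, ":PO"))      # term is only inside an attribute
print(b":po" in sample.lower())          # what a raw search would see
```

With this approach the first term still matches, while the second matches only when the raw markup is searched.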
I don't think your issue is related to HTML. Global search indexes "strings" of 3 characters or more. Strings are delimited by punctuation. Therefore, ":PO" isn't indexed and cannot be searched on.
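
The rule described in this comment can be sketched roughly as follows. This only illustrates the stated rule (split on punctuation, keep strings of 3 or more characters). It is not the actual Gloda tokenizer, and `index_tokens` and `MIN_TOKEN_LEN` are hypothetical names.

```python
import re

MIN_TOKEN_LEN = 3  # assumption taken from the comment above


def index_tokens(text: str) -> set[str]:
    # Split on anything that is not a letter or digit,
    # then drop pieces that are too short to be indexed.
    pieces = re.split(r"[\W_]+", text)
    return {p.lower() for p in pieces if len(p) >= MIN_TOKEN_LEN}


tokens = index_tokens("Payment Type :PO 12345")
print(tokens)
print("po" in tokens)  # "PO" is only 2 characters long
```

Under that rule, "PO" never reaches the index, so a query for ":PO" has nothing to match.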
Whiteboard: [dupeme]
Let's clarify whether this is a Global search (Gloda, Ctrl+K) which has its own indexing rules, as Wayne stated, or a "normal" search, "Search Messages" (Ctrl+Shift+F, or right-click menu). To me it looks like the latter.

Normal search is pretty poor when it comes to HTML. It tries to skip tags, but it has trouble when a tag is split across multiple lines (which can also lead to false negatives); see bug 1230815. Also related: bug 1211128.

If it is "normal" search, it's a straight duplicate of bug 1230815.
(In reply to Jorg K (GMT+2) from comment #2)
> Let's clarify whether this is a Global search (Gloda, Ctrl+K) which has its
> own indexing rules, as Wayne stated, or a "normal" search, "Search Messages"

The latter, "Search Messages" from right-click menu.

If this is a duplicate of 1230815, and that, in turn, is a duplicate of 1211128 (3 years old, still unassigned), then I assume I shouldn't expect anyone to fix it in the foreseeable future.

Whom shall I contact to offer, at least, to provide "strip HTML tags first, search next" approach for such a situation?

HTML messages are the de facto standard these days. The fact that Thunderbird's search over them is broken, and is likely to remain broken, isn't reassuring.
Body search sadly is a neglected feature. Until very recently, you couldn't even find anything in base64 encoded bodies, it would search the encoded body instead of decoding it first.

Processing HTML properly is somewhat harder. You also want to find München even if the HTML has M&uuml;nchen, right? That's bug 521649 which is ancient.
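
As a small illustration of the entity problem, assuming the tags have already been removed: Python's standard `html.unescape` resolves character references before comparison. The helper name `matches` is hypothetical, and this is not how Thunderbird does it.

```python
import html


def matches(term: str, html_text: str) -> bool:
    # Resolve character references such as &uuml; or &#252; first,
    # then compare case-insensitively.
    return term.casefold() in html.unescape(html_text).casefold()


raw = "Greetings from M&uuml;nchen"
print("München" in raw)                       # raw comparison
print(matches("München", raw))                # after unescaping
print(matches("münchen", "M&#252;nchen"))     # numeric reference
```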

We'll be looking at addressing some technical debt in Thunderbird in the coming years, and proper filtering is one of the areas. So don't hold your breath. Under some circumstances searching in IMAP folders is even worse, see bug 1245532.

> Whom shall I contact to offer, at least, to provide "strip HTML tags first,
> search next" approach for such a situation?
If you want to have a go at fixing it, attach a patch to bug 1230815. Bear in mind that body search is truly horrible: it processes the body line by line, which of course is a big design flaw that doesn't fly :-(
Status: UNCONFIRMED → RESOLVED
Closed: 6 years ago
Resolution: --- → DUPLICATE
(In reply to Jorg K (GMT+2) from comment #4)
> Body search sadly is a neglected feature. Until very recently, you couldn't
> even find anything in base64 encoded bodies, it would search the encoded
> body instead of decoding it first. [...]

"DUPLICATE" I understand, but "RESOLVED" looks like hypocrisy (I know that's just a flag, no offense meant).

So full-body IMAP search is a long-neglected feature, and it's unlikely anyone will attend to it in the foreseeable future. You're right, lookup over HTML entities is also an obvious feature, since not all of us speak and write English.

I'll look at the bug you suggested I write a patch for, but it looks like an attempt to "cure the dead with a poultice", to cite a Russian saying. Full-body search, as I see it, should be rewritten from scratch.
Let's separate IMAP search from general body search. Body search has its own message parsing here, that's where I repaired the base64 stuff:
https://dxr.mozilla.org/comm-central/source/comm/mailnews/base/search/src/nsMsgBodyHandler.cpp

Looking at it now, I think the HTML search could be repaired fairly easily. Currently it works on individual lines:
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/base/search/src/nsMsgBodyHandler.cpp#335

when for HTML it should accumulate the entire message part into a buffer (like it does for base64) and then strip the tags at the end, once the entire HTML part has been aggregated.
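
The difference between the two strategies can be sketched in Python. This is an illustration of the idea only, not the logic in nsMsgBodyHandler.cpp. The function names are hypothetical, and the lines are assumed to be already transfer-decoded pieces of one tag that was split across lines.

```python
import re

TAG = re.compile(r"<[^>]*>")


def strip_per_line(lines):
    # Each line is treated on its own, so a tag that starts on one
    # line and ends on another is never recognized as a tag.
    return [TAG.sub("", line) for line in lines]


def strip_buffered(lines):
    # Accumulate the whole part first, then strip tags once.
    return TAG.sub("", "".join(lines))


lines = [
    '<td style="border-radius:10px; cu',
    'rsor:pointer;" align="center">',
    "Total</td>",
]

per_line = "".join(strip_per_line(lines))
buffered = strip_buffered(lines)

print(":po" in per_line.lower())  # attribute text leaks through
print(":po" in buffered.lower())  # only "Total" is left
```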

Long term we intend to run everything through our MIME parser.
(In reply to Jorg K (GMT+2) from comment #6)
> Let's separate IMAP search from general body search. Body search has its own
> message parsing here, that's where I repaired the base64 stuff:
> https://dxr.mozilla.org/comm-central/source/comm/mailnews/base/search/src/
> nsMsgBodyHandler.cpp

I looked at that. Two thoughts:

a. It's not a quick fix; the entire processing logic would have to change.

b. Importing the entire message first isn't a solution. Messages can be arbitrarily large. The smarter approach is a "sliding window": read another line whenever the previous one hasn't reached the tag's terminating character yet, and output whatever is ready. But that's definitely *not* a quick fix.
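
One way to picture the "sliding window" idea is a streaming stripper that carries a single piece of state from line to line: whether it is currently inside a tag. This is a deliberately simplified sketch with a hypothetical function name. It ignores ">" inside quoted attribute values, comments and script content, and a search term that spans two output chunks would still need some overlap handling.

```python
from typing import Iterable, Iterator


def strip_tags_streaming(lines: Iterable[str]) -> Iterator[str]:
    in_tag = False  # carried across lines: are we inside <...>?
    for line in lines:
        out = []
        for ch in line:
            if in_tag:
                if ch == ">":
                    in_tag = False  # tag closed, text resumes
            elif ch == "<":
                in_tag = True
            else:
                out.append(ch)
        if out:
            # Emit whatever text is ready; nothing else is buffered.
            yield "".join(out)


chunks = list(strip_tags_streaming([
    '<td style="cu',
    'rsor:pointer;">Total',
    " due</td>",
]))
print(chunks)
```

Memory use stays bounded by the longest line, at the cost of a less robust notion of what counts as a tag.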

Marked that to look at later, in case I get tired of adjusting my IMAP-searching script.
We read entire MIME parts all the time. Fix is in bug 1230815. You inspired me ;-)
(In reply to Jorg K (GMT+2) from comment #8)

I'm glad to be of use. Hope to contribute some time myself...
Life is all about synchronicity ;-) - I joined the project in 2015. After gathering enough knowledge, I was able to fix the base64 issue (bug 1259534) in late 2017, and while doing so, got familiar with nsMsgBodyHandler.cpp. You prompted me to look at it again in comment #6, and from that comment to actually fixing it only took two hours.
(In reply to Jorg K (GMT+2) from comment #10)
> Life is all about synchronicity ;-) 

Thanks. I assume the bug's comments aren't the best place to chat, but still I am impressed.