1498496 - "Search" function in IMAP folder, against message body (HTML format), produces wrong results

Reporter

Description

7 years ago

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0

Steps to reproduce:

Opened "Search" over an IMAP folder. Entered terms (must match both):

Body contains "Payment"
Body contains ":PO"

(I was looking for an invoice where the payment type has "PO" in it, i.e. purchase order.)

Actual results:

An HTML message matched, containing the following lines:

<p style=3D"margin: 0em 0; font-size: 0.8em;" id=3D"pa= ymentType">Payment Type: Credit Card</p>

(this one matched the first term)

<td id=3D"action-btn" bgcolor=3D"#12= 6AFB" style=3D"background: #126AFB; background: -moz-linear-gradient(top, #= 126AFB 0%, #1DAEFC 100%); background: -webkit-gradient(linear, left top, r = bottom, color-stop(0%,#126AFB), color-stop(100%,#1DAEFC)); background: -web= kit-linear-gradient(top, #126AFB 0%,#1DAEFC 100%); background: -o-linear-gr= adient(top, #126AFB 0%,#1DAEFC 100%); background: -ms-linear-gradient(top, = #126AFB 0%,#1DAEFC 100%); background: linear-gradient(top, #126AFB 0%,#1DAE= FC 100%); padding:15px 30px; margin: 10px 20px; -moz-border-radius:10px; -w= ebkit-border-radius:10px; -khtml-border-radius:10px; border-radius:10px; cu= rsor:pointer; box-shadow: -3px 3px 5px #84B9FD;" align=3D"center">

(this one matched the second term)

Expected results:

The above message should not be included in the search results, since the second term matched an attribute inside an HTML tag. When I search for text, I search for human-readable text, not for HTML markup fragments. Alternatively, there should be a separate option "Match in raw HTML".

Wayne Mery (:wsmwk)

Comment 1

7 years ago

I don't think your issue is related to html. Global search indexes "strings" of 3 characters or more. Strings are delineated by punctuation. Therefore, ":PO" isn't indexed and cannot be searched on.

Whiteboard: [dupeme]

Jorg K (CEST = GMT+2)

Comment 2

7 years ago

Let's clarify whether this is a Global search (Gloda, Ctrl+K), which has its own indexing rules, as Wayne stated, or a "normal" search, "Search Messages" (Ctrl+Shift+F, or right-click menu). To me it looks like the latter.

Normal search is pretty poor when it comes to HTML: it tries to skip tags, but it has trouble when a tag is split across multiple lines (which can also lead to false negatives), see bug 1230815. Also related: bug 1211128.

If it is "normal" search, it's a straight duplicate of bug 1230815.
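The failure mode described here can be reproduced with a toy model. The following Python sketch is not Thunderbird's C++ code; `strip_tags_per_line` and `matches_line_by_line` are hypothetical names, and the sample body is a shortened, already decoded variant of the markup from the description. It assumes a stripper that only removes tags which open and close on the same line.

```python
import re

# Matches a tag only if "<" and ">" are both present in the text it is given.
TAG = re.compile(r"<[^>]*>")


def strip_tags_per_line(line: str) -> str:
    """Remove tags that start and end on this line; partial tags survive."""
    return TAG.sub("", line)


def matches_line_by_line(body: str, term: str) -> bool:
    """Case-insensitive search that strips tags one line at a time."""
    needle = term.lower()
    return any(
        needle in strip_tags_per_line(line).lower()
        for line in body.splitlines()
    )


# The <td> tag is split across two lines, so neither half looks like a tag.
body = (
    '<td id="action-btn" style="padding:15px 30px;\n'
    'cursor:pointer;" align="center">\n'
    "Pay now</td>"
)

# ":PO" is found inside "cursor:pointer", which is markup, not visible text.
print(matches_line_by_line(body, ":PO"))
```

The second line of the sample has no `<`, so the per-line pattern leaves `cursor:pointer;" align="center">` in place and the search term matches inside the attribute value.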

Konstantin Boyandin

Reporter

Comment 3

7 years ago

(In reply to Jorg K (GMT+2) from comment #2)
> Let's clarify whether this is a Global search (Gloda, Ctrl+K) which has its
> own indexing rules, as Wayne stated, or a "normal" search, "Search Messages"

The latter, "Search Messages" from the right-click menu.

If this is a duplicate of 1230815, and that, in turn, is a duplicate of 1211128 (3 years old, still unassigned), then I assume I shouldn't expect anyone to fix it in the foreseeable future.

Whom shall I contact to offer, at least, to provide a "strip HTML tags first, search next" approach for such a situation?

HTML messages are the de facto standard these days. The fact that Thunderbird search over them is broken, and is about to remain broken, isn't reassuring.

Jorg K (CEST = GMT+2)

Comment 4

7 years ago

Body search sadly is a neglected feature. Until very recently, you couldn't even find anything in base64 encoded bodies: it would search the encoded body instead of decoding it first. Processing HTML properly is somewhat harder. You also want to find München even if the HTML has M&uuml;nchen, right? That's bug 521649, which is ancient.

We'll be looking at addressing some technical debt in Thunderbird in the coming years, and proper filtering is one of the areas. So don't hold your breath. Under some circumstances searching in IMAP folders is even worse, see bug 1245532.

> Whom shall I contact to offer, at least, to provide "strip HTML tags first,
> search next" approach for such a situation?

If you want to have a go at fixing it, attach a patch to bug 1230815. Bear in mind that body search is truly horrible: it processes the body line by line, which of course is a big design flaw that doesn't fly :-(
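The München case above comes down to decoding character references before comparing. A minimal Python sketch, assuming the HTML source spells the umlaut as the entity `&uuml;` (or the numeric reference `&#252;`); `body_text_contains` is a hypothetical helper, not anything from the Thunderbird code base, and it deliberately does no tag stripping.

```python
import html


def body_text_contains(html_text: str, term: str) -> bool:
    """Decode character references, then do a case-insensitive search."""
    decoded = html.unescape(html_text)
    return term.casefold() in decoded.casefold()


source = "<p>Gr&uuml;&szlig;e aus M&uuml;nchen</p>"

# A naive search over the raw source misses the word.
print("München" in source)
# Decoding the entities first finds it.
print(body_text_contains(source, "München"))
```

`html.unescape` from the Python standard library handles both named and numeric references, so `M&#252;nchen` is covered by the same call.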

Status: UNCONFIRMED → RESOLVED

Closed: 7 years ago

Resolution: --- → DUPLICATE

Konstantin Boyandin

Reporter

Comment 5

7 years ago

(In reply to Jorg K (GMT+2) from comment #4)
> Body search sadly is a neglected feature. Until very recently, you couldn't
> even find anything in base64 encoded bodies, it would search the encoded
> body instead of decoding it first.

[...]

"DUPLICATE" I understand, but "RESOLVED" looks like hypocrisy (I know that's just a flag, no offense meant).

So full-body IMAP search is a long-neglected feature, and it's unlikely anyone will attend to it in the foreseeable future. You're right, lookup over HTML entities is also an obvious feature, since not all of us speak and write English.

I'll look at the bug you proposed to make a patch for, but it looks like an attempt to "cure the dead with a poultice", to cite a Russian saying. Full-body search, as I see it, should be completely rewritten from scratch.

Jorg K (CEST = GMT+2)

Comment 6

7 years ago

Let's separate IMAP search from general body search. Body search has its own message parsing here, that's where I repaired the base64 stuff:
https://dxr.mozilla.org/comm-central/source/comm/mailnews/base/search/src/nsMsgBodyHandler.cpp

Looking at it now, I think the HTML search could be repaired fairly easily. Currently it works on individual lines:
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/base/search/src/nsMsgBodyHandler.cpp#335
when for HTML it should accumulate the entire message part into a buffer (like it does for base64) and then strip the tags at the end, when the entire HTML part has been aggregated.

Long term we intend to run everything through our MIME parser.
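The "accumulate the part, then strip" idea can be sketched in Python as follows. This is an illustration only, not the code in nsMsgBodyHandler.cpp: `HtmlPartBuffer` and `part_contains` are hypothetical names, the input is assumed to be already transfer-decoded, and the standard library's `html.parser.HTMLParser` stands in for whatever tag stripping the real handler uses.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the character data between tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


class HtmlPartBuffer:
    """Accumulates the lines of one HTML part and strips tags at the end."""

    def __init__(self) -> None:
        self._lines: list[str] = []

    def add_line(self, line: str) -> None:
        self._lines.append(line)

    def visible_text(self) -> str:
        parser = _TextExtractor()
        parser.feed("\n".join(self._lines))
        parser.close()
        return "".join(parser.chunks)


def part_contains(lines: list[str], term: str) -> bool:
    """Search the visible text of a whole part, case-insensitively."""
    buffer = HtmlPartBuffer()
    for line in lines:
        buffer.add_line(line)
    return term.lower() in buffer.visible_text().lower()


lines = [
    '<p id="paymentType">Payment Type: Credit Card</p>',
    '<td id="action-btn" style="padding:15px 30px;',
    'cursor:pointer;" align="center">',
    "Pay now</td>",
]
print(part_contains(lines, "Payment"))
print(part_contains(lines, ":PO"))
```

Because the parser sees the whole part at once, the `<td>` tag that spans two lines is recognised as a single tag and its attributes never reach the search.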

Konstantin Boyandin

Reporter

Comment 7

7 years ago

(In reply to Jorg K (GMT+2) from comment #6)
> Let's separate IMAP search from general body search. Body search has its own
> message parsing here, that's where I repaired the base64 stuff:
> https://dxr.mozilla.org/comm-central/source/comm/mailnews/base/search/src/
> nsMsgBodyHandler.cpp

I looked at that. Two thoughts:

a. It's not a quick fix; the entire processing logic should be changed.
b. Importing the entire message first isn't a solution. Messages can be arbitrarily large. The smarter approach is a "sliding window": read a new line if the previous one has no tag-terminating character yet, and output whatever is ready.

But that's definitely *not* a quick fix. Marked that to look at later, in case I get tired of adjusting my IMAP-searching script.
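One way to read the "sliding window" suggestion is a streaming stripper that remembers, across lines, whether it is still inside a tag. The Python sketch below is a simplification under stated assumptions: `visible_text_stream` and `stream_contains` are hypothetical names, the only state carried between lines is a single in-tag flag rather than a buffered window, a `>` inside a quoted attribute value would end the tag early, and a search term that straddles a line break would be missed.

```python
from typing import Iterable, Iterator


def visible_text_stream(lines: Iterable[str]) -> Iterator[str]:
    """Yield the visible text of each line, tracking open tags across lines."""
    in_tag = False
    for line in lines:
        out: list[str] = []
        for ch in line:
            if in_tag:
                # Skip everything until the tag is closed, even on a later line.
                if ch == ">":
                    in_tag = False
            elif ch == "<":
                in_tag = True
            else:
                out.append(ch)
        if out:
            yield "".join(out)


def stream_contains(lines: Iterable[str], term: str) -> bool:
    """Case-insensitive search over the streamed visible text."""
    needle = term.lower()
    return any(needle in chunk.lower() for chunk in visible_text_stream(lines))


lines = [
    '<p id="paymentType">Payment Type: Credit Card</p>',
    '<td id="action-btn" style="padding:15px 30px;',
    'cursor:pointer;" align="center">',
    "Pay now</td>",
]
print(list(visible_text_stream(lines)))
print(stream_contains(lines, ":PO"))
```

Memory use stays constant regardless of message size, which is the point of the suggestion; the trade-off is that anything needing the whole part (entity decoding across chunks, terms spanning lines) has to be handled separately.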

Jorg K (CEST = GMT+2)

Comment 8

7 years ago

We read entire MIME parts all the time. Fix is in bug 1230815. You inspired me ;-)

Konstantin Boyandin

Reporter

Comment 9

7 years ago

(In reply to Jorg K (GMT+2) from comment #8)

I'm glad to be of use. Hope to contribute some time myself...

Jorg K (CEST = GMT+2)

Comment 10

7 years ago

Life is all about synchronicity ;-) - I joined the project in 2015. After gathering enough knowledge, I was able to fix the base64 issue (bug 1259534) in late 2017, and while doing so, got familiar with nsMsgBodyHandler.cpp. You prompted me to look at it again in comment #6, and from that comment to actually fixing it only took two hours.

Konstantin Boyandin

Reporter

Comment 11

7 years ago

(In reply to Jorg K (GMT+2) from comment #10)
> Life is all about synchronicity ;-)

Thanks. I assume the bug's comments aren't the best place to chat, but still I am impressed.

Categories

(Thunderbird :: Search, defect)

Tracking

(Not tracked)

People

(Reporter: konstantin, Unassigned)

References

Details

(Whiteboard: [dupeme])

Crash Data

Security

(public)
