For certain MIME structures (see comment #22) search for a string in message body fails to find message if message parts are base64 encoded (searches the undecoded content and also related images)
Categories
(MailNews Core :: Search, defect)
Tracking
(thunderbird_esr5258+ fixed, thunderbird58 fixed, thunderbird59 fixed)
People
(Reporter: tom.phinney, Assigned: jorgk-bmo)
References
(Blocks 1 open bug)
Details
(Whiteboard: TB 52.6 ESR)
Attachments
(4 files, 7 obsolete files)
74.73 KB,
text/plain
|
Details | |
70.71 KB,
text/plain
|
Details | |
1.81 KB,
patch
|
Details | Diff | Splinter Review | |
29.87 KB,
patch
|
jorgk-bmo
:
review+
jorgk-bmo
:
approval-comm-beta+
jorgk-bmo
:
approval-comm-esr52+
|
Details | Diff | Splinter Review |
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0 Build ID: 20160315153207 Steps to reproduce: Right-click folder in left pane, select "Search Messages...", match all of following: Date is after <3 months ago>; Content contains <string, e.g., 65C/828> Actual results: Search failed to find message in folder that met the search criteria, e.g., received within the last three months and containing the specified string, even though that message is within the folder. Expected results: Search should have found the message(s) that met the criteria. This has happened multiple times over the past few months, though such searches using the identical procedure used to work. Thus a faulty modification in the T-bird source is probably to blame. Example from today's failed search attached. (Content not sensitive and will become semi-public, so may be disclosed to others.) T-bird version is 38.7.0.
Reporter | ||
Updated•8 years ago
|
Assignee | ||
Updated•8 years ago
|
Assignee | ||
Comment 1•8 years ago
|
||
> Content contains <string, e.g., 65C/828> You mean: Body contains? Is the message you're looking for in an IMAP folder or a local folder. If it's an IMAP folder you're reporting a duplicate of bug 1245532.
Reporter | ||
Comment 2•8 years ago
|
||
(In reply to Jorg K (GMT+1) from comment #1) > > Content contains <string, e.g., 65C/828> > You mean: Body contains? > > Is the message you're looking for in an IMAP folder or a local folder. If > it's an IMAP folder you're reporting a duplicate of bug 1245532. Yes, I do mean "Body contains". Sorry for the misstatement. The specific folder for this example was a local T-bird Mail (not ImapMail) folder on a POP account. I recall having seen similar search failures for "> Local Folders > subfolder" where I store e-mail of this type after it's about six months old. (I do that to get it into a different .spd file.)
Assignee | ||
Comment 3•8 years ago
|
||
OK, I loaded the message into a local folder and can't find 65C although it's clearly in the message. I think it's a known problem with multipart messages. I'll look for a pre-existing bug later. (You can look at the message source to see its internal structure.)
Assignee | ||
Comment 4•8 years ago
|
||
I was wrong. We've seen the problem for an IMAP stored message (bug 1245532), but not for a local message. So your sample is valuable test data.
Assignee | ||
Updated•8 years ago
|
Assignee | ||
Comment 5•8 years ago
|
||
This is the same message, but I decoded the plaintext and HTML parts. This time the string 65C is found. Looks to me that the problem is caused by the base64 encoding. I tried looking for SGkgVG9tLiAg which is present in the base64, and lo and behold, the string is found!!
Assignee | ||
Comment 6•8 years ago
|
||
Kent, here we have another "search doesn't work" bug. This time the cause is clear. The search doesn't decode the body before looking. That's not very useful really. I can find the message looking for NDEgQU0NClRvOiBTQzY1Q0BJU0EtT05MSU5FLk9SRw0KU3ViamVjdDogT25lIElFQyBTQzY1QyBk
Assignee | ||
Comment 7•8 years ago
|
||
I've just tried a filter. Same behaviour. It filters on the raw message, not the decoded message.
Updated•8 years ago
|
Comment 8•8 years ago
|
||
(In reply to Jorg K (GMT+1) from comment #7) > I've just tried a filter. Same behaviour. It filters on the raw message, not > the decoded message. Should be easy to fix then. Care to try?
Assignee | ||
Comment 9•8 years ago
|
||
(In reply to Kent James (:rkent) from comment #8) > Should be easy to fix then. Care to try? Care to try to deal with some editor bugs (see bug 1248971)? You're an editor specialist, right, see bug 780908 ;-) If it's easy, just do it, I'll watch the master at work.
Comment 10•8 years ago
|
||
NI myself since Jorg continues to encourage me to look at this, and I might be able to extract a future favor from him if I do :)
Assignee | ||
Comment 11•8 years ago
|
||
Don't you think it's pretty poor that you can't find content just because its base64 encoded?
Updated•8 years ago
|
Assignee | ||
Comment 13•8 years ago
|
||
Kent, while looking at this, could you also glance at bug 1230815 comment #3. That bug is about not stripping HTML tags well in a body search. The processing here http://mxr.mozilla.org/comm-central/source/mailnews/base/search/src/nsMsgBodyHandler.cpp#310 is just a total misdesign since it processed the message body line by line and tries to strip tags in <>. That of course fails if the < and the > are not on the same line. So if you'll be working in nsMsgBodyHandler.cpp, perhaps you can kill more than one bird with one stone.
Comment 14•8 years ago
|
||
I won't give up on looking at this yet, but really this bug is part of a whole class of issues where filters and search (and this is really a search bug) make unreasonably compromises, or fail to include reasonable features, because filter processing can only be done sync, but many needed operations are async. Body search is a good example. The whole routine is a massive kludge. At some point incoming message acquisition is an async operation, and there ought to be a way to insert MIME processing into that operation so that we can do things like correctly process the body, get correct attachment existence, search by attachment name and content type, etc. Rather than kludge the kludge, I think that our time would be better spent trying to figure out how to do an async search (which would include MIME processing) as part of the incoming message processing.
Comment 15•8 years ago
|
||
45.4.0 I can not search base64 encoded messages as well. Here is sample message: http://pastebin.com/WAFDYi2c and try search for example JPME1.
Assignee | ||
Comment 16•8 years ago
|
||
We know it doesn't work. Unless you see activity in this bug it will remain broken since it won't get fixed magically.
Assignee | ||
Comment 19•6 years ago
|
||
Strangely there was bug 132340 where something like this got fixed(?). Good reading in bug 519202.
Comment 20•6 years ago
|
||
(In reply to Kent James (:rkent) from comment #14) > I won't give up on looking at this yet, >... trying to > figure out how to do an async search (which would include MIME processing) > as part of the incoming message processing. Have you given up? BTW, simple HTML entities in source (like ü) also make non-gloda body search fail, just tested on release 52.5.2 (32-bit).
Comment 21•6 years ago
|
||
(In reply to Thomas D. from comment #20) > BTW, simple HTML entities in source (like ü) also make non-gloda body > search fail, just tested on release 52.5.2 (32-bit). That's bug 521649 which I filed in 2009.
Assignee | ||
Comment 22•6 years ago
|
||
I'm surveying the code and tests added in bug 132340. Tests: https://dxr.mozilla.org/comm-central/rev/4469c475c6b473927956f509a1ef78da8db5e48c/mailnews/base/test/unit/test_searchBody.js#34 https://dxr.mozilla.org/comm-central/source/mailnews/test/data/base64-1 - single part, search works. https://dxr.mozilla.org/comm-central/source/mailnews/test/data/basic1 - not base64 encoded, huh? https://dxr.mozilla.org/comm-central/source/mailnews/test/data/multipart-base64-2 - multipart/mixed, search works https://dxr.mozilla.org/comm-central/source/mailnews/test/data/bug132340 - multipart/alternative, two text/plain text parts, the second one is shown (later, thus higher priority) and search works there. So let's look at a few MIME structures and see what works, also keeping bug 1263783 and bug 1420796 (although that's for a filter which may be different) where the search string is found in raw base64 of image data. 1) text/plain: Search works. 2) multipart/mixed: text/plain + attachment: Search works, attachment data not matched. 3) text/html: Search works. 4) multipart/mixed: text/html + attachment: Search works, attachment data not matched. 5) multipart/related: text/html + image: Search works, image data not matched. 6) multipart/alternative: text/plain + text/html: Search works, content found in both plain text and HTML parts. 7) multipart/alternative: text/plain + multipart/related (text/html + image): Search only works on plaintext part, string not found in HTML part, raw base64 of HTML and image data matched. 8) multipart/mixed: multipart/alternative (text/plain + text/html) + attachment (like in attachment 8734465 [details]) String not found in plaintext and HTML part, raw base64 of plaintext and HTML matched, attachment data not matched. 9) multipart/mixed: multipart/related (text/html + image)) + attachment String not found in HTML part, raw base64 HTML matched, image data matched, attachment data not matched. 10) multipart/mixed: multipart/alternative (text/plain + multipart/related (text/html + image)) + attachment String not found in plaintext and HTML part, raw base64 of plaintext and HTML and image data matched, attachment data not matched. So not working: 7, 8, 9 and 10. There's a pattern to the madness ;-) - if string not found in a part, then the part's raw data matches (7, 8, 9, 10) - if raw HTML data is matched, raw image data is also matched - attachment data never matched.
Assignee | ||
Comment 25•6 years ago
|
||
Assignee | ||
Comment 26•6 years ago
|
||
Ouch, cases 7-10 are what you could call "nested multipart" and we read here: https://dxr.mozilla.org/comm-central/rev/4469c475c6b473927956f509a1ef78da8db5e48c/mailnews/base/search/src/nsMsgBodyHandler.cpp#384 // TODO: make this work for nested multiparts (requires some redesign) And that comment has been there since the import into HG :-(
Assignee | ||
Comment 27•6 years ago
|
||
This integrates my 10 test cases into mailnews/base/test/unit/test_searchBody.js. The expected results are tweaked so that the test passed overall with the current code base. We see that 2+4 strings are not found and a lot of times the raw base64 is matched. So in the next few days I'll fix the code so that we get the correct expected results. One error I noticed is that when a MIME part terminates, m_partIsText is set to true, leading to the following embedded image used as text part (I think).
Assignee | ||
Comment 28•6 years ago
|
||
A few tweaks to the algorithm already reduce the false positives greatly. So two false positives remain and 2+4 strings still aren't found in nested multipart messages.
Assignee | ||
Comment 29•6 years ago
|
||
Correcting further fatal logic errors (reading much more body than belongs to the message!) fixes the two false positives. Now left to do is to correct the nested part processing to fix not finding stuff.
Assignee | ||
Comment 30•6 years ago
|
||
This works, all tests pass. No false positives, no false negatives. Nested multipart is made to work by collecting all boundaries in an array which works like a stack. We add to the end and search from the end, if we find our entry, later ones can go. Not so easy to review, I looked at the faulty algorithm for three days and did lots of debugging dumping out buffers and indices, etc. :-(
Comment 31•6 years ago
|
||
This does not add too much new code but it looks like it adds some parsing of message parts. Shouldn't we have ready made tools for this in the mime library?
Assignee | ||
Comment 32•6 years ago
|
||
Here's a patch to make the reviewer's life easier. With this patch applied you can see what the algorithm does. You could also re-enable the commented-out line // ApplyTransformations(buf, buf.Length(), eatThisLine, buf); to check that is leads to duplication, which is inconsequential, since if the body contains a match, so does the doubled up body. If might lead to false positives thought when a body line is ABBA and you're searching for AA. Not found in ABBA, but found in ABBAABBA. As a concrete example, import Joshua's message 'bug132340' and you'll see === buf |Semi case-insensitivity sucks. Match on "base 64 text"Semi case-insensitivity sucks. Match on "base 64 text"| in the debug. And you get a match on |text"Semi| :-(
Assignee | ||
Comment 33•6 years ago
|
||
(In reply to :aceman from comment #31) > This does not add too much new code but it looks like it adds some parsing > of message parts. Shouldn't we have ready made tools for this in the mime > library? Yes, the code in nsMsgBodyHandler.cpp is a mini-MIME-parser. With this in mind, the bug was never fixed since people were waiting for the all-happy-making new MIME implementation (which has never arrived). Quoting chat: {"date":"2017-12-29T21:50:08.000Z","who":"jcranmer","text":"that was one of the things that convinced me we needed a better mime library","flags":["incoming"]} Meanwhile TB extremely embarrassingly spews out false positives and false negatives which really makes *unfit* for message storage and search. The same, BTW, happened to the "Asian Crisis" I fixed for TB 45. I was told not to, since the new MIME parser would fix it. Well, I'm glad I didn't listen to the advice and I'm glad I didn't listen to the advice this time either. Here is your fix with 10 test cases taken from real life!
Assignee | ||
Comment 34•6 years ago
|
||
Minor simplification, we don't need m_partIsTextDefault and can use 'false' always.
Assignee | ||
Comment 35•6 years ago
|
||
Sorry about the noise, but it didn't compile on Mac. Now it should.
Assignee | ||
Comment 36•6 years ago
|
||
BTW, I have a "green" try run for this, the failures are due to other issues: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=46679cbb6d3b64949ce78a6acc03de5e2f729e7c
Comment 37•6 years ago
|
||
Comment on attachment 8939322 [details] [diff] [review] 1259534-base64-body-search.patch (v4c) Review of attachment 8939322 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for diving into this. It is a shame we do not have an universal mime parser that we could just ask for the relevant and decoded message parts. Looks like the mime rewrite is in order. I can't really confirm whether the mime parsing here is correct or not, but if the existing tests we had so far still pass, at least the new code shouldn't regress anything. Thanks for adding new tests. I have also manually tested the reporter's message that now works with your patch. You may want to also put example.com in all the Message-ID in the test messages instead of a real domain. ::: mailnews/base/search/public/nsMsgBodyHandler.h @@ +90,5 @@ > // With the exception of m_isMultipart, these all apply to the various parts > bool m_stripHeaders; // true if we're supposed to strip of message headers > bool m_stripHtml; // true if we're supposed to strip off HTML tags > + bool m_pastMsgHeaders; // true if we've already skipped over the message headers > + bool m_pastPartHeaders; // true if we've already skipped over the part headers I think you can align the comment on this one line. ::: mailnews/base/search/src/nsMsgBodyHandler.cpp @@ +276,5 @@ > else > { > + // It is wrong to call ApplyTransformations() here since this will > + // lead to the buffer being doubled-up at |buf.Append(line.get());| below. > + // ApplyTransformations(buf, buf.Length(), eatThisLine, buf); OK, you say it doubles the buffer, but why can we skip all the work the function does? Why is the call not needed now? Shouldn't we solve the doubling in a different way?
Assignee | ||
Comment 38•6 years ago
|
||
(In reply to :aceman from comment #37) > I have also manually tested the reporter's message ... Sure that works, it's case 08. > You may want to also put example.com in all the Message-ID in the test > messages instead of a real domain. I did now. However, I will be eternalised in the CIDs ;-) > I think you can align the comment on this one line. Done. > > + // It is wrong to call ApplyTransformations() here since this will > > + // lead to the buffer being doubled-up at |buf.Append(line.get());| below. > > + // ApplyTransformations(buf, buf.Length(), eatThisLine, buf); > OK, you say it doubles the buffer, but why can we skip all the work the > function does? Why is the call not needed now? Shouldn't we solve the > doubling in a different way? Good question. When we get there, we just matched a boundary and decoded the accumulated buffer from base64. The removed call wouldn't have done anything other than running through if (m_base64part) { // We need to keep track of all lines to parse base64encoded... buf.Append(line.get()); eatThisLine = true; return buf.Length(); } All the other if-blocks weren't entered: if (!m_pastPartHeaders) // line is a line from the part headers Nope, we were past headers already. if (matchedBoundary) Nope, the accumulated net content didn't match a boundary. if (!m_partIsText) Nope, the part was text. So you can see that this was a 100% useless call.
Comment 39•6 years ago
|
||
Pushed by mozilla@jorgk.com: https://hg.mozilla.org/comm-central/rev/13a022fef0c8 fix body search of base64 encoded bodies (incl. 10 new test cases). r=aceman
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Updated•6 years ago
|
Comment 40•6 years ago
|
||
Pushed by mozilla@jorgk.com: https://hg.mozilla.org/comm-central/rev/759973c6bcae correct comment (insert 'not'). rs=comment-only DONTBUILD
Assignee | ||
Comment 41•6 years ago
|
||
Comment on attachment 8939303 [details] [diff] [review] 1259534-debug.patch Review of attachment 8939303 [details] [diff] [review]: ----------------------------------------------------------------- ::: mailnews/test/resources/searchTestUtils.js @@ +28,5 @@ > aArbitraryHeader, aHdrProperty) > { > var searchListener = > { > + onSearchHit: function(dbHdr, folder) { Damn, lost the hitCount++; :-( So this debug patch destroys the test.
Comment 42•6 years ago
|
||
Jörg, this is awesome. You ROCK! Thanks a lot.
Assignee | ||
Comment 43•6 years ago
|
||
Wait until bug 1427124 lands :-)
Assignee | ||
Comment 44•6 years ago
|
||
Beta (TB 58): https://hg.mozilla.org/releases/comm-beta/rev/cad3757238b0 https://hg.mozilla.org/releases/comm-beta/rev/17f03d87dc73
Comment 45•6 years ago
|
||
Cheers Jörg appreciate the work you've put into this.
Comment 46•6 years ago
|
||
really excellent work.
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Comment 48•6 years ago
|
||
TB 52.6 ESR: https://hg.mozilla.org/releases/comm-esr52/rev/a73bf68de26cb71ee54d91a002da3a9e628e9b23 (white-space fix) https://hg.mozilla.org/releases/comm-esr52/rev/45b228d3f8b51d0454dd3ea954a037ab1cc2c91c (typo-fix changeset already integrated into main changeset)
Comment 50•4 years ago
|
||
Emails with these headers
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
won't show up when searching for a string in message body. Only Global Search find them, Quick Filter or right click on folder->Search Messages both don't work. I'm using Thunderbird 68.2.2 on Ubuntu 18.04.3.
I first commented in bug 753307#1 but I think it's more appropriate here.
These emails are sent by mailman-bounces, the header says: X-Mailman-Version: 2.1.20
Assignee | ||
Comment 51•4 years ago
|
||
That's hard to believe since we even have test cases covering this, like:
https://searchfox.org/comm-central/rev/5cc3282b22456cd4a35497fe8ae3572ca2ee4b40/mailnews/base/test/unit/test_searchBody.js#39
https://searchfox.org/comm-central/rev/5cc3282b22456cd4a35497fe8ae3572ca2ee4b40/mailnews/base/test/unit/test_searchBody.js#52
https://searchfox.org/comm-central/rev/5cc3282b22456cd4a35497fe8ae3572ca2ee4b40/mailnews/base/test/unit/test_searchBody.js#64
and
https://searchfox.org/comm-central/source/mailnews/test/data/01-plaintext.eml
https://searchfox.org/comm-central/source/mailnews/test/data/11-plaintext.eml
https://searchfox.org/comm-central/source/mailnews/test/data/21-plaintext.eml
I wrote that stuff myself. Please file another bug supplying a sample and CC me.
Description
•