Closed Bug 1259534 Opened 8 years ago Closed 6 years ago

For certain MIME structures (see comment #22) search for a string in message body fails to find message if message parts are base64 encoded (searches the undecoded content and also related images)

Categories

(MailNews Core :: Search, defect)

x86_64
macOS
defect
Not set
normal

Tracking

(thunderbird_esr5258+ fixed, thunderbird58 fixed, thunderbird59 fixed)

RESOLVED FIXED
Thunderbird 59.0
Tracking Status
thunderbird_esr52 58+ fixed
thunderbird58 --- fixed
thunderbird59 --- fixed

People

(Reporter: tom.phinney, Assigned: jorgk-bmo)

References

(Blocks 1 open bug)

Details

(Whiteboard: TB 52.6 ESR)

Attachments

(4 files, 7 obsolete files)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0
Build ID: 20160315153207

Steps to reproduce:

Right-click folder in left pane, select "Search Messages...", match all of following: Date is after <3 months ago>; Content contains <string, e.g., 65C/828>


Actual results:

Search failed to find message in folder that met the search criteria, e.g., received within the last three months and containing the specified string, even though that message is within the folder.


Expected results:

Search should have found the message(s) that met the criteria. This has happened multiple times over the past few months, though such searches using the identical procedure used to work. Thus a faulty modification in the T-bird source is probably to blame. Example from today's failed search attached. (Content not sensitive and will become semi-public, so may be disclosed to others.) T-bird version is 38.7.0.
OS: Unspecified → Mac OS X
Hardware: Unspecified → x86_64
Attachment #8734465 - Attachment mime type: message/rfc822 → text/plain
> Content contains <string, e.g., 65C/828>
You mean: Body contains?

Is the message you're looking for in an IMAP folder or a local folder. If it's an IMAP folder you're reporting a duplicate of bug 1245532.
Component: Untriaged → Search
(In reply to Jorg K (GMT+1) from comment #1)
> > Content contains <string, e.g., 65C/828>
> You mean: Body contains?
> 
> Is the message you're looking for in an IMAP folder or a local folder. If
> it's an IMAP folder you're reporting a duplicate of bug 1245532.

Yes, I do mean "Body contains". Sorry for the misstatement.

The specific folder for this example was a local T-bird Mail (not ImapMail) folder on a POP account. I recall having seen similar search failures for "> Local Folders > subfolder" where I store e-mail of this type after it's about six months old. (I do that to get it into a different .spd file.)
OK, I loaded the message into a local folder and can't find 65C although it's clearly in the message. I think it's a known problem with multipart messages. I'll look for a pre-existing bug later. (You can look at the message source to see its internal structure.)
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Folder search for a string in message content fails to retrieve messages containing that string → Folder search for a string in message content fails to retrieve messages containing that string (multipart message)
I was wrong. We've seen the problem for an IMAP stored message (bug 1245532), but not for a local message. So your sample is valuable test data.
See Also: → 1245532
Summary: Folder search for a string in message content fails to retrieve messages containing that string (multipart message) → Folder search for a string in message content fails to retrieve messages containing that string (multipart message, plaintext+HTML, both base64 encoded)
This is the same message, but I decoded the plaintext and HTML parts. This time the string 65C is found. Looks to me that the problem is caused by the base64 encoding.

I tried looking for SGkgVG9tLiAg which is present in the base64, and lo and behold, the string is found!!
Kent, here we have another "search doesn't work" bug. This time the cause is clear. The search doesn't decode the body before looking. That's not very useful really. I can find the message looking for
NDEgQU0NClRvOiBTQzY1Q0BJU0EtT05MSU5FLk9SRw0KU3ViamVjdDogT25lIElFQyBTQzY1QyBk
Summary: Folder search for a string in message content fails to retrieve messages containing that string (multipart message, plaintext+HTML, both base64 encoded) → Search for a string in message body fails to find message if message parts are base64 encoded. Searches the undecoded content.
I've just tried a filter. Same behaviour. It filters on the raw message, not the decoded message.
(In reply to Jorg K (GMT+1) from comment #7)
> I've just tried a filter. Same behaviour. It filters on the raw message, not
> the decoded message.

Should be easy to fix then. Care to try?
(In reply to Kent James (:rkent) from comment #8)
> Should be easy to fix then. Care to try?
Care to try to deal with some editor bugs (see bug 1248971)? You're an editor specialist, right, see bug 780908 ;-)

If it's easy, just do it, I'll watch the master at work.
See Also: → 1263783
NI myself since Jorg continues to encourage me to look at this, and I might be able to extract a future favor from him if I do :)
Flags: needinfo?(rkent)
Don't you think it's pretty poor that you can't find content just because its base64 encoded?
Flags: needinfo?(rkent)
Product: Thunderbird → MailNews Core
Version: 38 Branch → Trunk
didn't mean to un-NI myself
Flags: needinfo?(rkent)
Kent, while looking at this, could you also glance at bug 1230815 comment #3. That bug is about not stripping HTML tags well in a body search.

The processing here
http://mxr.mozilla.org/comm-central/source/mailnews/base/search/src/nsMsgBodyHandler.cpp#310
is just a total misdesign since it processed the message body line by line and tries to strip tags in <>. That of course fails if the < and the > are not on the same line.

So if you'll be working in nsMsgBodyHandler.cpp, perhaps you can kill more than one bird with one stone.
I won't give up on looking at this yet, but really this bug is part of a whole class of issues where filters and search (and this is really a search bug) make unreasonably compromises, or fail to include reasonable features, because filter processing can only be done sync, but many needed operations are async.

Body search is a good example. The whole routine is a massive kludge.

At some point incoming message acquisition is an async operation, and there ought to be a way to insert MIME processing into that operation so that we can do things like correctly process the body, get correct attachment existence, search by attachment name and content type, etc. Rather than kludge the kludge, I think that our time would be better spent trying to figure out how to do an async search (which would include MIME processing) as part of the incoming message processing.
45.4.0 I can not search base64 encoded messages as well. Here is sample message: http://pastebin.com/WAFDYi2c and try search for example JPME1.
We know it doesn't work. Unless you see activity in this bug it will remain broken since it won't get fixed magically.
Strangely there was bug 132340 where something like this got fixed(?). Good reading in bug 519202.
(In reply to Kent James (:rkent) from comment #14)
> I won't give up on looking at this yet, 
>... trying to
> figure out how to do an async search (which would include MIME processing)
> as part of the incoming message processing.

Have you given up?

BTW, simple HTML entities in source (like &uuml;) also make non-gloda body search fail, just tested on release 52.5.2 (32-bit).
(In reply to Thomas D. from comment #20)
> BTW, simple HTML entities in source (like &uuml;) also make non-gloda body
> search fail, just tested on release 52.5.2 (32-bit).

That's bug 521649 which I filed in 2009.
I'm surveying the code and tests added in bug 132340.

Tests:
https://dxr.mozilla.org/comm-central/rev/4469c475c6b473927956f509a1ef78da8db5e48c/mailnews/base/test/unit/test_searchBody.js#34
https://dxr.mozilla.org/comm-central/source/mailnews/test/data/base64-1 - single part, search works.
https://dxr.mozilla.org/comm-central/source/mailnews/test/data/basic1 - not base64 encoded, huh?
https://dxr.mozilla.org/comm-central/source/mailnews/test/data/multipart-base64-2 - multipart/mixed, search works
https://dxr.mozilla.org/comm-central/source/mailnews/test/data/bug132340 - multipart/alternative, two text/plain text parts,
the second one is shown (later, thus higher priority) and search works there.

So let's look at a few MIME structures and see what works, also keeping bug 1263783 and bug 1420796 (although that's for a filter which may be different) where the search string is found in raw base64 of image data.

1) text/plain: Search works.
2) multipart/mixed: text/plain + attachment: Search works, attachment data not matched.
3) text/html: Search works.
4) multipart/mixed: text/html + attachment: Search works, attachment data not matched.

5) multipart/related: text/html + image: Search works, image data not matched.
6) multipart/alternative: text/plain + text/html: Search works, content found in both plain text and HTML parts.
7) multipart/alternative: text/plain + multipart/related (text/html + image):
   Search only works on plaintext part, string not found in HTML part,
   raw base64 of HTML and image data matched.

8) multipart/mixed: multipart/alternative (text/plain + text/html) + attachment (like in attachment 8734465 [details])
   String not found in plaintext and HTML part,
   raw base64 of plaintext and HTML matched, attachment data not matched.
9) multipart/mixed: multipart/related (text/html + image)) + attachment
   String not found in HTML part,
   raw base64 HTML matched, image data matched, attachment data not matched.
10) multipart/mixed: multipart/alternative (text/plain +  multipart/related (text/html + image)) + attachment
   String not found in plaintext and HTML part,
   raw base64 of plaintext and HTML and image data matched, attachment data not matched.

So not working: 7, 8, 9 and 10. There's a pattern to the madness ;-)
- if string not found in a part, then the part's raw data matches (7, 8, 9, 10)
- if raw HTML data is matched, raw image data is also matched
- attachment data never matched.
Flags: needinfo?(rkent)
Summary: Search for a string in message body fails to find message if message parts are base64 encoded. Searches the undecoded content. → For certain MIME structures (see comment #22) search for a string in message body fails to find message if message parts are base64 encoded (searches the undecoded content and also related images)
Ouch, cases 7-10 are what you could call "nested multipart" and we read here:
https://dxr.mozilla.org/comm-central/rev/4469c475c6b473927956f509a1ef78da8db5e48c/mailnews/base/search/src/nsMsgBodyHandler.cpp#384
  // TODO: make this work for nested multiparts (requires some redesign)

And that comment has been there since the import into HG :-(
This integrates my 10 test cases into mailnews/base/test/unit/test_searchBody.js.

The expected results are tweaked so that the test passed overall with the current code base.

We see that 2+4 strings are not found and a lot of times the raw base64 is matched.

So in the next few days I'll fix the code so that we get the correct expected results.

One error I noticed is that when a MIME part terminates, m_partIsText is set to true, leading to the following embedded image used as text part (I think).
Assignee: nobody → jorgk
Attachment #8939168 - Attachment is obsolete: true
Status: NEW → ASSIGNED
A few tweaks to the algorithm already reduce the false positives greatly. So two false positives remain and 2+4 strings still aren't found in nested multipart messages.
Attachment #8939256 - Attachment is obsolete: true
Correcting further fatal logic errors (reading much more body than belongs to the message!) fixes the two false positives.

Now left to do is to correct the nested part processing to fix not finding stuff.
Attachment #8939260 - Attachment is obsolete: true
This works, all tests pass. No false positives, no false negatives.

Nested multipart is made to work by collecting all boundaries in an array which works like a stack. We add to the end and search from the end, if we find our entry, later ones can go.

Not so easy to review, I looked at the faulty algorithm for three days and did lots of debugging dumping out buffers and indices, etc. :-(
Attachment #8939289 - Attachment is obsolete: true
Attachment #8939299 - Flags: review?(acelists)
This does not add too much new code but it looks like it adds some parsing of message parts. Shouldn't we have ready made tools for this in the mime library?
Here's a patch to make the reviewer's life easier. With this patch applied you can see what the algorithm does.

You could also re-enable the commented-out line
// ApplyTransformations(buf, buf.Length(), eatThisLine, buf);
to check that is leads to duplication, which is inconsequential, since if the body contains a match, so does the doubled up body.

If might lead to false positives thought when a body line is
ABBA and you're searching for AA. Not found in ABBA, but found in ABBAABBA.

As a concrete example, import Joshua's message 'bug132340' and you'll see
=== buf |Semi case-insensitivity sucks. Match on "base 64 text"Semi case-insensitivity sucks. Match on "base 64 text"|
in the debug.

And you get a match on |text"Semi| :-(
(In reply to :aceman from comment #31)
> This does not add too much new code but it looks like it adds some parsing
> of message parts. Shouldn't we have ready made tools for this in the mime
> library?
Yes, the code in nsMsgBodyHandler.cpp is a mini-MIME-parser. With this in mind, the bug was never fixed since people were waiting for the all-happy-making new MIME implementation (which has never arrived). Quoting chat:
{"date":"2017-12-29T21:50:08.000Z","who":"jcranmer","text":"that was one of the things that convinced me we needed a better mime library","flags":["incoming"]}

Meanwhile TB extremely embarrassingly spews out false positives and false negatives which really makes *unfit* for message storage and search.

The same, BTW, happened to the "Asian Crisis" I fixed for TB 45. I was told not to, since the new MIME parser would fix it. Well, I'm glad I didn't listen to the advice and I'm glad I didn't listen to the advice this time either.

Here is your fix with 10 test cases taken from real life!
Minor simplification, we don't need m_partIsTextDefault and can use 'false' always.
Attachment #8939299 - Attachment is obsolete: true
Attachment #8939299 - Flags: review?(acelists)
Attachment #8939305 - Flags: review?(acelists)
Sorry about the noise, but it didn't compile on Mac. Now it should.
Attachment #8939305 - Attachment is obsolete: true
Attachment #8939305 - Flags: review?(acelists)
Attachment #8939322 - Flags: review?(acelists)
BTW, I have a "green" try run for this, the failures are due to other issues:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=46679cbb6d3b64949ce78a6acc03de5e2f729e7c
Comment on attachment 8939322 [details] [diff] [review]
1259534-base64-body-search.patch (v4c)

Review of attachment 8939322 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for diving into this. It is a shame we do not have an universal mime parser that we could just ask for the relevant and decoded message parts. Looks like the mime rewrite is in order.

I can't really confirm whether the mime parsing here is correct or not, but if the existing tests we had so far still pass, at least the new code shouldn't regress anything.

Thanks for adding new tests. I have also manually tested the reporter's message that now works with your patch.
You may want to also put example.com in all the Message-ID in the test messages instead of a real domain.

::: mailnews/base/search/public/nsMsgBodyHandler.h
@@ +90,5 @@
>    // With the exception of m_isMultipart, these all apply to the various parts
>    bool m_stripHeaders;   // true if we're supposed to strip of message headers
>    bool m_stripHtml;      // true if we're supposed to strip off HTML tags
> +  bool m_pastMsgHeaders; // true if we've already skipped over the message headers
> +  bool m_pastPartHeaders; // true if we've already skipped over the part headers

I think you can align the comment on this one line.

::: mailnews/base/search/src/nsMsgBodyHandler.cpp
@@ +276,5 @@
>        else
>        {
> +        // It is wrong to call ApplyTransformations() here since this will
> +        // lead to the buffer being doubled-up at |buf.Append(line.get());| below.
> +        // ApplyTransformations(buf, buf.Length(), eatThisLine, buf);

OK, you say it doubles the buffer, but why can we skip all the work the function does? Why is the call not needed now? Shouldn't we solve the doubling in a different way?
Attachment #8939322 - Flags: review?(acelists) → review+
(In reply to :aceman from comment #37)
> I have also manually tested the reporter's message ...
Sure that works, it's case 08.

> You may want to also put example.com in all the Message-ID in the test
> messages instead of a real domain.
I did now. However, I will be eternalised in the CIDs ;-)

> I think you can align the comment on this one line.
Done.

> > +        // It is wrong to call ApplyTransformations() here since this will
> > +        // lead to the buffer being doubled-up at |buf.Append(line.get());| below.
> > +        // ApplyTransformations(buf, buf.Length(), eatThisLine, buf);
> OK, you say it doubles the buffer, but why can we skip all the work the
> function does? Why is the call not needed now? Shouldn't we solve the
> doubling in a different way?
Good question. When we get there, we just matched a boundary and decoded the accumulated buffer from base64. The removed call wouldn't have done anything other than running through
  if (m_base64part)
  {
    // We need to keep track of all lines to parse base64encoded...
    buf.Append(line.get());
    eatThisLine = true;
    return buf.Length();
  }

All the other if-blocks weren't entered:
  if (!m_pastPartHeaders)  // line is a line from the part headers
Nope, we were past headers already.

  if (matchedBoundary)
Nope, the accumulated net content didn't match a boundary.

  if (!m_partIsText)
Nope, the part was text.

So you can see that this was a 100% useless call.
Attachment #8939322 - Attachment is obsolete: true
Attachment #8939340 - Flags: review+
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/13a022fef0c8
fix body search of base64 encoded bodies (incl. 10 new test cases). r=aceman
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Attachment #8939340 - Flags: approval-comm-esr52?
Attachment #8939340 - Flags: approval-comm-beta+
Target Milestone: --- → Thunderbird 59.0
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/759973c6bcae
correct comment (insert 'not'). rs=comment-only DONTBUILD
Comment on attachment 8939303 [details] [diff] [review]
1259534-debug.patch

Review of attachment 8939303 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/test/resources/searchTestUtils.js
@@ +28,5 @@
>                      aArbitraryHeader, aHdrProperty)
>  {
>    var searchListener =
>    { 
> +    onSearchHit: function(dbHdr, folder) {

Damn, lost the hitCount++; :-(
So this debug patch destroys the test.
Jörg, this is awesome. You ROCK! Thanks a lot.
Wait until bug 1427124 lands :-)
Cheers Jörg appreciate the work you've put into this.
really excellent work.
Attachment #8939340 - Flags: approval-comm-esr52? → approval-comm-esr52+
Depends on: 1434020

Emails with these headers

Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

won't show up when searching for a string in message body. Only Global Search find them, Quick Filter or right click on folder->Search Messages both don't work. I'm using Thunderbird 68.2.2 on Ubuntu 18.04.3.

I first commented in bug 753307#1 but I think it's more appropriate here.
These emails are sent by mailman-bounces, the header says: X-Mailman-Version: 2.1.20

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: