For certain MIME structures (see comment #22) search for a string in message body fails to find message if message parts are base64 encoded (searches the undecoded content and also related images)

RESOLVED FIXED in Thunderbird 59.0

Status

defect
RESOLVED FIXED
3 years ago
a year ago

People

(Reporter: tom.phinney, Assigned: jorgk)

Tracking

(Blocks 1 bug)

Trunk
Thunderbird 59.0
x86_64
macOS
Dependency tree / graph

Thunderbird Tracking Flags

(thunderbird_esr5258+ fixed, thunderbird58 fixed, thunderbird59 fixed)

Details

(Whiteboard: TB 52.6 ESR)

Attachments

(4 attachments, 7 obsolete attachments)

(Reporter)

Description

3 years ago
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0
Build ID: 20160315153207

Steps to reproduce:

Right-click folder in left pane, select "Search Messages...", match all of following: Date is after <3 months ago>; Content contains <string, e.g., 65C/828>


Actual results:

Search failed to find message in folder that met the search criteria, e.g., received within the last three months and containing the specified string, even though that message is within the folder.


Expected results:

Search should have found the message(s) that met the criteria. This has happened multiple times over the past few months, though such searches using the identical procedure used to work. Thus a faulty modification in the T-bird source is probably to blame. Example from today's failed search attached. (Content not sensitive and will become semi-public, so may be disclosed to others.) T-bird version is 38.7.0.
(Reporter)

Updated

3 years ago
OS: Unspecified → Mac OS X
Hardware: Unspecified → x86_64
(Assignee)

Updated

3 years ago
Attachment #8734465 - Attachment mime type: message/rfc822 → text/plain
(Assignee)

Comment 1

3 years ago
> Content contains <string, e.g., 65C/828>
You mean: Body contains?

Is the message you're looking for in an IMAP folder or a local folder. If it's an IMAP folder you're reporting a duplicate of bug 1245532.
Component: Untriaged → Search
(Reporter)

Comment 2

3 years ago
(In reply to Jorg K (GMT+1) from comment #1)
> > Content contains <string, e.g., 65C/828>
> You mean: Body contains?
> 
> Is the message you're looking for in an IMAP folder or a local folder. If
> it's an IMAP folder you're reporting a duplicate of bug 1245532.

Yes, I do mean "Body contains". Sorry for the misstatement.

The specific folder for this example was a local T-bird Mail (not ImapMail) folder on a POP account. I recall having seen similar search failures for "> Local Folders > subfolder" where I store e-mail of this type after it's about six months old. (I do that to get it into a different .spd file.)
(Assignee)

Comment 3

3 years ago
OK, I loaded the message into a local folder and can't find 65C although it's clearly in the message. I think it's a known problem with multipart messages. I'll look for a pre-existing bug later. (You can look at the message source to see its internal structure.)
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Folder search for a string in message content fails to retrieve messages containing that string → Folder search for a string in message content fails to retrieve messages containing that string (multipart message)
(Assignee)

Comment 4

3 years ago
I was wrong. We've seen the problem for an IMAP stored message (bug 1245532), but not for a local message. So your sample is valuable test data.
See Also: → 1245532
(Assignee)

Updated

3 years ago
Summary: Folder search for a string in message content fails to retrieve messages containing that string (multipart message) → Folder search for a string in message content fails to retrieve messages containing that string (multipart message, plaintext+HTML, both base64 encoded)
(Assignee)

Comment 5

3 years ago
This is the same message, but I decoded the plaintext and HTML parts. This time the string 65C is found. Looks to me that the problem is caused by the base64 encoding.

I tried looking for SGkgVG9tLiAg which is present in the base64, and lo and behold, the string is found!!
(Assignee)

Comment 6

3 years ago
Kent, here we have another "search doesn't work" bug. This time the cause is clear. The search doesn't decode the body before looking. That's not very useful really. I can find the message looking for
NDEgQU0NClRvOiBTQzY1Q0BJU0EtT05MSU5FLk9SRw0KU3ViamVjdDogT25lIElFQyBTQzY1QyBk
Summary: Folder search for a string in message content fails to retrieve messages containing that string (multipart message, plaintext+HTML, both base64 encoded) → Search for a string in message body fails to find message if message parts are base64 encoded. Searches the undecoded content.
(Assignee)

Comment 7

3 years ago
I've just tried a filter. Same behaviour. It filters on the raw message, not the decoded message.
(In reply to Jorg K (GMT+1) from comment #7)
> I've just tried a filter. Same behaviour. It filters on the raw message, not
> the decoded message.

Should be easy to fix then. Care to try?
(Assignee)

Comment 9

3 years ago
(In reply to Kent James (:rkent) from comment #8)
> Should be easy to fix then. Care to try?
Care to try to deal with some editor bugs (see bug 1248971)? You're an editor specialist, right, see bug 780908 ;-)

If it's easy, just do it, I'll watch the master at work.
(Assignee)

Updated

3 years ago
See Also: → 1263783
NI myself since Jorg continues to encourage me to look at this, and I might be able to extract a future favor from him if I do :)
Flags: needinfo?(rkent)
(Assignee)

Comment 11

3 years ago
Don't you think it's pretty poor that you can't find content just because its base64 encoded?

Updated

3 years ago
Component: Search → Search
Flags: needinfo?(rkent)
Product: Thunderbird → MailNews Core
Version: 38 Branch → Trunk
didn't mean to un-NI myself
Flags: needinfo?(rkent)
(Assignee)

Comment 13

3 years ago
Kent, while looking at this, could you also glance at bug 1230815 comment #3. That bug is about not stripping HTML tags well in a body search.

The processing here
http://mxr.mozilla.org/comm-central/source/mailnews/base/search/src/nsMsgBodyHandler.cpp#310
is just a total misdesign since it processed the message body line by line and tries to strip tags in <>. That of course fails if the < and the > are not on the same line.

So if you'll be working in nsMsgBodyHandler.cpp, perhaps you can kill more than one bird with one stone.
I won't give up on looking at this yet, but really this bug is part of a whole class of issues where filters and search (and this is really a search bug) make unreasonably compromises, or fail to include reasonable features, because filter processing can only be done sync, but many needed operations are async.

Body search is a good example. The whole routine is a massive kludge.

At some point incoming message acquisition is an async operation, and there ought to be a way to insert MIME processing into that operation so that we can do things like correctly process the body, get correct attachment existence, search by attachment name and content type, etc. Rather than kludge the kludge, I think that our time would be better spent trying to figure out how to do an async search (which would include MIME processing) as part of the incoming message processing.

Comment 15

3 years ago
45.4.0 I can not search base64 encoded messages as well. Here is sample message: http://pastebin.com/WAFDYi2c and try search for example JPME1.
(Assignee)

Comment 16

3 years ago
We know it doesn't work. Unless you see activity in this bug it will remain broken since it won't get fixed magically.
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1350061
(Assignee)

Updated

a year ago
Duplicate of this bug: 1420796
(Assignee)

Comment 19

a year ago
Strangely there was bug 132340 where something like this got fixed(?). Good reading in bug 519202.

Comment 20

a year ago
(In reply to Kent James (:rkent) from comment #14)
> I won't give up on looking at this yet, 
>... trying to
> figure out how to do an async search (which would include MIME processing)
> as part of the incoming message processing.

Have you given up?

BTW, simple HTML entities in source (like &uuml;) also make non-gloda body search fail, just tested on release 52.5.2 (32-bit).

Comment 21

a year ago
(In reply to Thomas D. from comment #20)
> BTW, simple HTML entities in source (like &uuml;) also make non-gloda body
> search fail, just tested on release 52.5.2 (32-bit).

That's bug 521649 which I filed in 2009.
(Assignee)

Comment 22

a year ago
I'm surveying the code and tests added in bug 132340.

Tests:
https://dxr.mozilla.org/comm-central/rev/4469c475c6b473927956f509a1ef78da8db5e48c/mailnews/base/test/unit/test_searchBody.js#34
https://dxr.mozilla.org/comm-central/source/mailnews/test/data/base64-1 - single part, search works.
https://dxr.mozilla.org/comm-central/source/mailnews/test/data/basic1 - not base64 encoded, huh?
https://dxr.mozilla.org/comm-central/source/mailnews/test/data/multipart-base64-2 - multipart/mixed, search works
https://dxr.mozilla.org/comm-central/source/mailnews/test/data/bug132340 - multipart/alternative, two text/plain text parts,
the second one is shown (later, thus higher priority) and search works there.

So let's look at a few MIME structures and see what works, also keeping bug 1263783 and bug 1420796 (although that's for a filter which may be different) where the search string is found in raw base64 of image data.

1) text/plain: Search works.
2) multipart/mixed: text/plain + attachment: Search works, attachment data not matched.
3) text/html: Search works.
4) multipart/mixed: text/html + attachment: Search works, attachment data not matched.

5) multipart/related: text/html + image: Search works, image data not matched.
6) multipart/alternative: text/plain + text/html: Search works, content found in both plain text and HTML parts.
7) multipart/alternative: text/plain + multipart/related (text/html + image):
   Search only works on plaintext part, string not found in HTML part,
   raw base64 of HTML and image data matched.

8) multipart/mixed: multipart/alternative (text/plain + text/html) + attachment (like in attachment 8734465 [details])
   String not found in plaintext and HTML part,
   raw base64 of plaintext and HTML matched, attachment data not matched.
9) multipart/mixed: multipart/related (text/html + image)) + attachment
   String not found in HTML part,
   raw base64 HTML matched, image data matched, attachment data not matched.
10) multipart/mixed: multipart/alternative (text/plain +  multipart/related (text/html + image)) + attachment
   String not found in plaintext and HTML part,
   raw base64 of plaintext and HTML and image data matched, attachment data not matched.

So not working: 7, 8, 9 and 10. There's a pattern to the madness ;-)
- if string not found in a part, then the part's raw data matches (7, 8, 9, 10)
- if raw HTML data is matched, raw image data is also matched
- attachment data never matched.
Flags: needinfo?(rkent)
Summary: Search for a string in message body fails to find message if message parts are base64 encoded. Searches the undecoded content. → For certain MIME structures (see comment #22) search for a string in message body fails to find message if message parts are base64 encoded (searches the undecoded content and also related images)
(Assignee)

Updated

a year ago
Duplicate of this bug: 1263783
(Assignee)

Updated

a year ago
Duplicate of this bug: 1420796
(Assignee)

Comment 25

a year ago
(Assignee)

Comment 26

a year ago
Ouch, cases 7-10 are what you could call "nested multipart" and we read here:
https://dxr.mozilla.org/comm-central/rev/4469c475c6b473927956f509a1ef78da8db5e48c/mailnews/base/search/src/nsMsgBodyHandler.cpp#384
  // TODO: make this work for nested multiparts (requires some redesign)

And that comment has been there since the import into HG :-(
(Assignee)

Comment 27

a year ago
This integrates my 10 test cases into mailnews/base/test/unit/test_searchBody.js.

The expected results are tweaked so that the test passed overall with the current code base.

We see that 2+4 strings are not found and a lot of times the raw base64 is matched.

So in the next few days I'll fix the code so that we get the correct expected results.

One error I noticed is that when a MIME part terminates, m_partIsText is set to true, leading to the following embedded image used as text part (I think).
Assignee: nobody → jorgk
Attachment #8939168 - Attachment is obsolete: true
Status: NEW → ASSIGNED
(Assignee)

Comment 28

a year ago
A few tweaks to the algorithm already reduce the false positives greatly. So two false positives remain and 2+4 strings still aren't found in nested multipart messages.
Attachment #8939256 - Attachment is obsolete: true
(Assignee)

Comment 29

a year ago
Correcting further fatal logic errors (reading much more body than belongs to the message!) fixes the two false positives.

Now left to do is to correct the nested part processing to fix not finding stuff.
Attachment #8939260 - Attachment is obsolete: true
(Assignee)

Comment 30

a year ago
This works, all tests pass. No false positives, no false negatives.

Nested multipart is made to work by collecting all boundaries in an array which works like a stack. We add to the end and search from the end, if we find our entry, later ones can go.

Not so easy to review, I looked at the faulty algorithm for three days and did lots of debugging dumping out buffers and indices, etc. :-(
Attachment #8939289 - Attachment is obsolete: true
Attachment #8939299 - Flags: review?(acelists)

Comment 31

a year ago
This does not add too much new code but it looks like it adds some parsing of message parts. Shouldn't we have ready made tools for this in the mime library?
(Assignee)

Comment 32

a year ago
Here's a patch to make the reviewer's life easier. With this patch applied you can see what the algorithm does.

You could also re-enable the commented-out line
// ApplyTransformations(buf, buf.Length(), eatThisLine, buf);
to check that is leads to duplication, which is inconsequential, since if the body contains a match, so does the doubled up body.

If might lead to false positives thought when a body line is
ABBA and you're searching for AA. Not found in ABBA, but found in ABBAABBA.

As a concrete example, import Joshua's message 'bug132340' and you'll see
=== buf |Semi case-insensitivity sucks. Match on "base 64 text"Semi case-insensitivity sucks. Match on "base 64 text"|
in the debug.

And you get a match on |text"Semi| :-(
(Assignee)

Comment 33

a year ago
(In reply to :aceman from comment #31)
> This does not add too much new code but it looks like it adds some parsing
> of message parts. Shouldn't we have ready made tools for this in the mime
> library?
Yes, the code in nsMsgBodyHandler.cpp is a mini-MIME-parser. With this in mind, the bug was never fixed since people were waiting for the all-happy-making new MIME implementation (which has never arrived). Quoting chat:
{"date":"2017-12-29T21:50:08.000Z","who":"jcranmer","text":"that was one of the things that convinced me we needed a better mime library","flags":["incoming"]}

Meanwhile TB extremely embarrassingly spews out false positives and false negatives which really makes *unfit* for message storage and search.

The same, BTW, happened to the "Asian Crisis" I fixed for TB 45. I was told not to, since the new MIME parser would fix it. Well, I'm glad I didn't listen to the advice and I'm glad I didn't listen to the advice this time either.

Here is your fix with 10 test cases taken from real life!
(Assignee)

Comment 34

a year ago
Minor simplification, we don't need m_partIsTextDefault and can use 'false' always.
Attachment #8939299 - Attachment is obsolete: true
Attachment #8939299 - Flags: review?(acelists)
Attachment #8939305 - Flags: review?(acelists)
(Assignee)

Comment 35

a year ago
Sorry about the noise, but it didn't compile on Mac. Now it should.
Attachment #8939305 - Attachment is obsolete: true
Attachment #8939305 - Flags: review?(acelists)
Attachment #8939322 - Flags: review?(acelists)
(Assignee)

Comment 36

a year ago
BTW, I have a "green" try run for this, the failures are due to other issues:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=46679cbb6d3b64949ce78a6acc03de5e2f729e7c

Comment 37

a year ago
Comment on attachment 8939322 [details] [diff] [review]
1259534-base64-body-search.patch (v4c)

Review of attachment 8939322 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for diving into this. It is a shame we do not have an universal mime parser that we could just ask for the relevant and decoded message parts. Looks like the mime rewrite is in order.

I can't really confirm whether the mime parsing here is correct or not, but if the existing tests we had so far still pass, at least the new code shouldn't regress anything.

Thanks for adding new tests. I have also manually tested the reporter's message that now works with your patch.
You may want to also put example.com in all the Message-ID in the test messages instead of a real domain.

::: mailnews/base/search/public/nsMsgBodyHandler.h
@@ +90,5 @@
>    // With the exception of m_isMultipart, these all apply to the various parts
>    bool m_stripHeaders;   // true if we're supposed to strip of message headers
>    bool m_stripHtml;      // true if we're supposed to strip off HTML tags
> +  bool m_pastMsgHeaders; // true if we've already skipped over the message headers
> +  bool m_pastPartHeaders; // true if we've already skipped over the part headers

I think you can align the comment on this one line.

::: mailnews/base/search/src/nsMsgBodyHandler.cpp
@@ +276,5 @@
>        else
>        {
> +        // It is wrong to call ApplyTransformations() here since this will
> +        // lead to the buffer being doubled-up at |buf.Append(line.get());| below.
> +        // ApplyTransformations(buf, buf.Length(), eatThisLine, buf);

OK, you say it doubles the buffer, but why can we skip all the work the function does? Why is the call not needed now? Shouldn't we solve the doubling in a different way?
Attachment #8939322 - Flags: review?(acelists) → review+
(Assignee)

Comment 38

a year ago
(In reply to :aceman from comment #37)
> I have also manually tested the reporter's message ...
Sure that works, it's case 08.

> You may want to also put example.com in all the Message-ID in the test
> messages instead of a real domain.
I did now. However, I will be eternalised in the CIDs ;-)

> I think you can align the comment on this one line.
Done.

> > +        // It is wrong to call ApplyTransformations() here since this will
> > +        // lead to the buffer being doubled-up at |buf.Append(line.get());| below.
> > +        // ApplyTransformations(buf, buf.Length(), eatThisLine, buf);
> OK, you say it doubles the buffer, but why can we skip all the work the
> function does? Why is the call not needed now? Shouldn't we solve the
> doubling in a different way?
Good question. When we get there, we just matched a boundary and decoded the accumulated buffer from base64. The removed call wouldn't have done anything other than running through
  if (m_base64part)
  {
    // We need to keep track of all lines to parse base64encoded...
    buf.Append(line.get());
    eatThisLine = true;
    return buf.Length();
  }

All the other if-blocks weren't entered:
  if (!m_pastPartHeaders)  // line is a line from the part headers
Nope, we were past headers already.

  if (matchedBoundary)
Nope, the accumulated net content didn't match a boundary.

  if (!m_partIsText)
Nope, the part was text.

So you can see that this was a 100% useless call.
Attachment #8939322 - Attachment is obsolete: true
Attachment #8939340 - Flags: review+

Comment 39

a year ago
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/13a022fef0c8
fix body search of base64 encoded bodies (incl. 10 new test cases). r=aceman
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
(Assignee)

Updated

a year ago
Attachment #8939340 - Flags: approval-comm-esr52?
Attachment #8939340 - Flags: approval-comm-beta+
(Assignee)

Updated

a year ago
Target Milestone: --- → Thunderbird 59.0

Comment 40

a year ago
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/759973c6bcae
correct comment (insert 'not'). rs=comment-only DONTBUILD
(Assignee)

Comment 41

a year ago
Comment on attachment 8939303 [details] [diff] [review]
1259534-debug.patch

Review of attachment 8939303 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/test/resources/searchTestUtils.js
@@ +28,5 @@
>                      aArbitraryHeader, aHdrProperty)
>  {
>    var searchListener =
>    { 
> +    onSearchHit: function(dbHdr, folder) {

Damn, lost the hitCount++; :-(
So this debug patch destroys the test.

Comment 42

a year ago
Jörg, this is awesome. You ROCK! Thanks a lot.
(Assignee)

Comment 43

a year ago
Wait until bug 1427124 lands :-)
Cheers Jörg appreciate the work you've put into this.

Comment 46

a year ago
really excellent work.
Duplicate of this bug: 1059230
(Assignee)

Updated

a year ago
Attachment #8939340 - Flags: approval-comm-esr52? → approval-comm-esr52+
(Assignee)

Updated

a year ago
Duplicate of this bug: 1105937
(Assignee)

Updated

a year ago
Depends on: 1434020
You need to log in before you can comment on or make changes to this bug.