Closed Bug 1427124 Opened 6 years ago Closed 6 years ago

Quick Filter doesn't find non-ASCII text in message body if the message is multipart (only plain text and HTML with no attachments work)

Categories

(Thunderbird :: Search, defect)

52 Branch
defect
Not set
normal

Tracking

(thunderbird_esr5258+ fixed, thunderbird58 fixed, thunderbird59 fixed)

RESOLVED FIXED
Thunderbird 59.0
Tracking Status
thunderbird_esr52 58+ fixed
thunderbird58 --- fixed
thunderbird59 --- fixed

People

(Reporter: regspam, Assigned: jorgk-bmo)

References

(Blocks 1 open bug)

Details

(Keywords: intl, Whiteboard: TB 52.6 ESR)

Attachments

(4 files, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 6.1; rv:56.0) Gecko/20100101 Firefox/56.0
Build ID: 20171024165158

Steps to reproduce:

1. Open  Tools -> Options -> Composition -> Send Options...  and select "Send the message in both plain text and HTML" from the drop-down menu.

2. Start a new e-mail message, type some non-English text and apply some HTML formatting to it. For example, let the text be "русский текст" with the "текст" part in bold. (This is Russian; I also tried it with Hebrew with the same result).
Send the message to one of your addresses so that you could continue with step 3.

3. Download the message from the server, select the folder it's in and enable Quick Filter. Type "русский" in the filter input field, enable "Body" in the "Filter messages by:" list if it's disabled. The red "No results" text appears left to the input field, the message isn't found.

If I send the message in either plain text or HTML only, Quick Filter does find the message.
Also, in any case the message is found via the Global Search function (Ctrl-K). 

Thunderbird 52.5.2 (32-bit),
Win7.
Component: Untriaged → Search
Nice comprehensive report. Can you please save/drag your test message to the desktop or folder and then attach the .eml file here using "Attach File". Please remove any personal information from the e-mail headers before attaching the file.

Body search isn't working very well, there are many bugs in this area.
Attached file test message.eml
Sure, here's the message.
Attached file test message.eml.zip
Sorry, for some reason only the HTML-part of the message is displayed. I'm reattaching it in the zip-packed form, hope it works out this time.
(Turns out an extension caused the attached eml file to be displayed, not downloaded. Sorry again).
Attachment #8939084 - Attachment mime type: message/rfc822 → text/plain
See Also: → 521649
Thanks for adding the sample. As you stated, there is a plaintext and a HTML part and any non-ASCII content isn't font.

I changed the message to:

Plaintext: русский *текст* Heute Ändern
HTML:      русский <b>текст</b><br>Heute Ändern

and the "Heute" ("today" in German) is found, but the "Ändern" ("to change") isn't. It's basically an ancient bug 521649 from 2009 where someone complains about drafts which also always have plaintext and HTML parts. Most likely the search is done on the raw UTF-8, so only ASCII text is found.

This will most likely be fixed of we ever get around to fixing bug 1259534.
Status: UNCONFIRMED → NEW
Ever confirmed: true
(In reply to Jorg K (GMT+1) from comment #5)
> isn't. It's basically an ancient bug 521649 from 2009 where someone
> complains about drafts which also always have plaintext and HTML parts.

Someone is me, and Jörg knows me :) I wasn't complaining, just to my best knowledge and correctly reporting what I saw at the time. Finding the exact causes is not the duty of the reporter.
(In reply to Jorg K (GMT+1) from comment #5)
> It's basically an ancient bug 521649 from 2009 where someone
> complains about drafts which also always have plaintext and HTML parts. Most
> likely the search is done on the raw UTF-8, so only ASCII text is found.
Oops, that was wrong, the draft test case in bug 521649 is different, no multipart involved.

> This will most likely be fixed of we ever get around to fixing bug 1259534.
However, the root cause of all body search problems is the processing which is faulty be design. We search some raw data instead of decoding it first, see bug 1259534 comment #14.
I fixed bug 1259534 now, but this problem persists. Bug 1259534 was about base64 encoded content. The problem here is most likely that we don't pay attention to the UTF-8 charset. Needs more investigation.
Here are 10 new test cases with non-ASCII characters. Sadly, only two cases work: plaintext (01) and HTML (03) without embedded images or attachments.

So the problem is quite clear: Multipart doesn't work *at all* since the charset is only ever taken from the message headers and never from the part headers.
Summary: Quick Filter doesn't find non-English text in message body if the message was sent in both plain text and HTML → Quick Filter doesn't find non-ASCII text in message body if the message is multipart (only plain text and HTML with no attachments work)
This fixes the problem. I added 10 new tests. We have a variation of UTF-8, windows-1252 and ISO-8859-7 (Greek) encodings and also some UTF-8 in base64 encoded parts thrown it. It all works and should improve TB's body search capability greatly.
Assignee: nobody → jorgk
Attachment #8939364 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #8939406 - Flags: review?(acelists)
Comment on attachment 8939406 [details] [diff] [review]
1427124-non-ascii-body-search.patch (v2)

Review of attachment 8939406 [details] [diff] [review]:
-----------------------------------------------------------------

Splinter Review gets confused here with the various charsets. Most stuff is in UTF-8, there is some ANSI/windows-1252 and some Greek.

It looks much better in the "Details" view where UTF-8 is used. Only the ISO-8859-7 and windows-1252 messages look bad there.

::: mailnews/base/test/unit/test_searchBody.js
@@ +91,5 @@
>    // and we don't want to find that.
>    { value: "U2VhcmNoIGZ", op: Contains, count: 0 },
> +
> +  // Messages 11 and 13 to 20 contain "hühü" once.
> +  { value: "hühü", op: Contains, count: 9 },

Looks like Splinter Review is using ANSI here. This is hühü.

@@ +93,5 @@
> +
> +  // Messages 11 and 13 to 20 contain "hühü" once.
> +  { value: "hühü", op: Contains, count: 9 },
> +  // Message 12 contains ÎαληÏÏέÏα (good evening in Greek).
> +  { value: "ÎαληÏÏέÏα", op: Contains, count: 1 },

And this is Greek: Καλησπέρα

@@ +96,5 @@
> +  // Message 12 contains ÎαληÏÏέÏα (good evening in Greek).
> +  { value: "ÎαληÏÏέÏα", op: Contains, count: 1 },
> +
> +  // Messages 16, 17, 18, 20 contain "hïhï" in the plaintext part.
> +  { value: "hïhï", op: Contains, count: 4 },

And this is hïhï.
Comment on attachment 8939406 [details] [diff] [review]
1427124-non-ascii-body-search.patch (v2)

Review of attachment 8939406 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for diving into these problems Jorg. This should improve the search tremendously for a lot of people.

I have created my own messages with different charsets in a single folder and search worked on them with the patch. Without the patch, nothing was found.

Assuming all the test cases pass, this is great and you have my r+.

::: mailnews/base/search/public/nsMsgBodyHandler.h
@@ +32,5 @@
>      uint32_t headersSize, bool ForFilters);
>  
>    virtual ~nsMsgBodyHandler();
>  
> +  // returns next message line in buf and the applicable charset, if found.

This actually has 3 return values (also the int32_t function return value), please document the int too.

::: mailnews/base/search/src/nsMsgBodyHandler.cpp
@@ +128,5 @@
>      // And reapply our transformations...
>      outLength = ApplyTransformations(buf, buf.Length(), eatThisLine, buf);
>    }
>  
> +  charset = m_partCharset;

So where do you use charset in this function? Or is this meant as an out argument?

@@ +415,1 @@
>      start += 9;

Please comment that this is the length of "boundary=", at least I assume that is the relation.

@@ +431,5 @@
>  
> +  if (m_isMultipart &&
> +      (start = lowerCaseLine.Find("charset=", /* ignoreCase = */ true)) != -1)
> +  {
> +    start += 8;

Please comment that this is the length of "charset=", at least I assume that is the relation.

@@ +436,5 @@
> +    int32_t end = line.RFindChar(';');
> +    if (end == -1)
> +      end = line.Length();
> +
> +    m_partCharset.Assign(Substring(line,start,end-start));

Spaces after commas.
Attachment #8939406 - Flags: review?(acelists) → review+
quite a few bugs of the intl variety https://mzl.la/2CHR6jL
Keywords: intl
(In reply to :aceman from comment #12)
> > +  charset = m_partCharset;
> So where do you use charset in this function? Or is this meant as an out
> argument?
Yes:
int32_t nsMsgBodyHandler::GetNextLine (nsCString &buf, nsCString &charset)
Since the arguments already didn't follow the conventions, I didn't do the right thing here either.

I'll add the comments and spaces when pushing this.
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/25e260b99581
fix body search for non-ASCII bodies (incl. 10 new test cases). r=aceman
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Attachment #8939406 - Flags: approval-comm-esr52?
Attachment #8939406 - Flags: approval-comm-beta+
Target Milestone: --- → Thunderbird 59.0
Sadly there's a follow-up. Turns out that the charset can be quoted, too, and that RFind won't work since the line can be:
Content-Type: text/plain; charset="utf-8"; format=flowed; delsp=yes;
Attachment #8939427 - Flags: review?(acelists)
Comment on attachment 8939427 [details] [diff] [review]
1427124-follow-up.patch

Review of attachment 8939427 [details] [diff] [review]:
-----------------------------------------------------------------

Open-coding ugliness :(
Attachment #8939427 - Flags: review?(acelists) → review+
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/863d0cd8bb53
Follow-up: charset can be quoted. r=aceman
Attachment #8939406 - Flags: approval-comm-esr52? → approval-comm-esr52+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: