Quick Filter doesn't find non-ASCII text in message body if the message is multipart (only plain text and HTML with no attachments work)

RESOLVED FIXED in Thunderbird 59.0

Status

defect
RESOLVED FIXED
a year ago
a year ago

People

(Reporter: regspam, Assigned: jorgk)

Tracking

(Blocks 1 bug, {intl})

52 Branch
Thunderbird 59.0

Thunderbird Tracking Flags

(thunderbird_esr5258+ fixed, thunderbird58 fixed, thunderbird59 fixed)

Details

(Whiteboard: TB 52.6 ESR)

Attachments

(4 attachments, 1 obsolete attachment)

(Reporter)

Description

a year ago
User Agent: Mozilla/5.0 (Windows NT 6.1; rv:56.0) Gecko/20100101 Firefox/56.0
Build ID: 20171024165158

Steps to reproduce:

1. Open  Tools -> Options -> Composition -> Send Options...  and select "Send the message in both plain text and HTML" from the drop-down menu.

2. Start a new e-mail message, type some non-English text and apply some HTML formatting to it. For example, let the text be "русский текст" with the "текст" part in bold. (This is Russian; I also tried it with Hebrew with the same result).
Send the message to one of your addresses so that you could continue with step 3.

3. Download the message from the server, select the folder it's in and enable Quick Filter. Type "русский" in the filter input field, enable "Body" in the "Filter messages by:" list if it's disabled. The red "No results" text appears left to the input field, the message isn't found.

If I send the message in either plain text or HTML only, Quick Filter does find the message.
Also, in any case the message is found via the Global Search function (Ctrl-K). 

Thunderbird 52.5.2 (32-bit),
Win7.
(Reporter)

Updated

a year ago
Component: Untriaged → Search
(Assignee)

Comment 1

a year ago
Nice comprehensive report. Can you please save/drag your test message to the desktop or folder and then attach the .eml file here using "Attach File". Please remove any personal information from the e-mail headers before attaching the file.

Body search isn't working very well, there are many bugs in this area.
(Reporter)

Comment 2

a year ago
Posted file test message.eml
Sure, here's the message.
(Reporter)

Comment 3

a year ago
Sorry, for some reason only the HTML-part of the message is displayed. I'm reattaching it in the zip-packed form, hope it works out this time.
(Reporter)

Comment 4

a year ago
(Turns out an extension caused the attached eml file to be displayed, not downloaded. Sorry again).
(Assignee)

Updated

a year ago
Attachment #8939084 - Attachment mime type: message/rfc822 → text/plain
(Assignee)

Updated

a year ago
See Also: → 521649
(Assignee)

Comment 5

a year ago
Thanks for adding the sample. As you stated, there is a plaintext and a HTML part and any non-ASCII content isn't font.

I changed the message to:

Plaintext: русский *текст* Heute Ändern
HTML:      русский <b>текст</b><br>Heute Ändern

and the "Heute" ("today" in German) is found, but the "Ändern" ("to change") isn't. It's basically an ancient bug 521649 from 2009 where someone complains about drafts which also always have plaintext and HTML parts. Most likely the search is done on the raw UTF-8, so only ASCII text is found.

This will most likely be fixed of we ever get around to fixing bug 1259534.
Status: UNCONFIRMED → NEW
Ever confirmed: true

Comment 6

a year ago
(In reply to Jorg K (GMT+1) from comment #5)
> isn't. It's basically an ancient bug 521649 from 2009 where someone
> complains about drafts which also always have plaintext and HTML parts.

Someone is me, and Jörg knows me :) I wasn't complaining, just to my best knowledge and correctly reporting what I saw at the time. Finding the exact causes is not the duty of the reporter.
(Assignee)

Comment 7

a year ago
(In reply to Jorg K (GMT+1) from comment #5)
> It's basically an ancient bug 521649 from 2009 where someone
> complains about drafts which also always have plaintext and HTML parts. Most
> likely the search is done on the raw UTF-8, so only ASCII text is found.
Oops, that was wrong, the draft test case in bug 521649 is different, no multipart involved.

> This will most likely be fixed of we ever get around to fixing bug 1259534.
However, the root cause of all body search problems is the processing which is faulty be design. We search some raw data instead of decoding it first, see bug 1259534 comment #14.
(Assignee)

Comment 8

a year ago
I fixed bug 1259534 now, but this problem persists. Bug 1259534 was about base64 encoded content. The problem here is most likely that we don't pay attention to the UTF-8 charset. Needs more investigation.
(Assignee)

Comment 9

a year ago
Here are 10 new test cases with non-ASCII characters. Sadly, only two cases work: plaintext (01) and HTML (03) without embedded images or attachments.

So the problem is quite clear: Multipart doesn't work *at all* since the charset is only ever taken from the message headers and never from the part headers.
(Assignee)

Updated

a year ago
Summary: Quick Filter doesn't find non-English text in message body if the message was sent in both plain text and HTML → Quick Filter doesn't find non-ASCII text in message body if the message is multipart (only plain text and HTML with no attachments work)
(Assignee)

Comment 10

a year ago
This fixes the problem. I added 10 new tests. We have a variation of UTF-8, windows-1252 and ISO-8859-7 (Greek) encodings and also some UTF-8 in base64 encoded parts thrown it. It all works and should improve TB's body search capability greatly.
Assignee: nobody → jorgk
Attachment #8939364 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #8939406 - Flags: review?(acelists)
(Assignee)

Comment 11

a year ago
Comment on attachment 8939406 [details] [diff] [review]
1427124-non-ascii-body-search.patch (v2)

Review of attachment 8939406 [details] [diff] [review]:
-----------------------------------------------------------------

Splinter Review gets confused here with the various charsets. Most stuff is in UTF-8, there is some ANSI/windows-1252 and some Greek.

It looks much better in the "Details" view where UTF-8 is used. Only the ISO-8859-7 and windows-1252 messages look bad there.

::: mailnews/base/test/unit/test_searchBody.js
@@ +91,5 @@
>    // and we don't want to find that.
>    { value: "U2VhcmNoIGZ", op: Contains, count: 0 },
> +
> +  // Messages 11 and 13 to 20 contain "hühü" once.
> +  { value: "hühü", op: Contains, count: 9 },

Looks like Splinter Review is using ANSI here. This is hühü.

@@ +93,5 @@
> +
> +  // Messages 11 and 13 to 20 contain "hühü" once.
> +  { value: "hühü", op: Contains, count: 9 },
> +  // Message 12 contains Καλησπέρα (good evening in Greek).
> +  { value: "Καλησπέρα", op: Contains, count: 1 },

And this is Greek: Καλησπέρα

@@ +96,5 @@
> +  // Message 12 contains Καλησπέρα (good evening in Greek).
> +  { value: "Καλησπέρα", op: Contains, count: 1 },
> +
> +  // Messages 16, 17, 18, 20 contain "hïhï" in the plaintext part.
> +  { value: "hïhï", op: Contains, count: 4 },

And this is hïhï.

Comment 12

a year ago
Comment on attachment 8939406 [details] [diff] [review]
1427124-non-ascii-body-search.patch (v2)

Review of attachment 8939406 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for diving into these problems Jorg. This should improve the search tremendously for a lot of people.

I have created my own messages with different charsets in a single folder and search worked on them with the patch. Without the patch, nothing was found.

Assuming all the test cases pass, this is great and you have my r+.

::: mailnews/base/search/public/nsMsgBodyHandler.h
@@ +32,5 @@
>      uint32_t headersSize, bool ForFilters);
>  
>    virtual ~nsMsgBodyHandler();
>  
> +  // returns next message line in buf and the applicable charset, if found.

This actually has 3 return values (also the int32_t function return value), please document the int too.

::: mailnews/base/search/src/nsMsgBodyHandler.cpp
@@ +128,5 @@
>      // And reapply our transformations...
>      outLength = ApplyTransformations(buf, buf.Length(), eatThisLine, buf);
>    }
>  
> +  charset = m_partCharset;

So where do you use charset in this function? Or is this meant as an out argument?

@@ +415,1 @@
>      start += 9;

Please comment that this is the length of "boundary=", at least I assume that is the relation.

@@ +431,5 @@
>  
> +  if (m_isMultipart &&
> +      (start = lowerCaseLine.Find("charset=", /* ignoreCase = */ true)) != -1)
> +  {
> +    start += 8;

Please comment that this is the length of "charset=", at least I assume that is the relation.

@@ +436,5 @@
> +    int32_t end = line.RFindChar(';');
> +    if (end == -1)
> +      end = line.Length();
> +
> +    m_partCharset.Assign(Substring(line,start,end-start));

Spaces after commas.
Attachment #8939406 - Flags: review?(acelists) → review+
quite a few bugs of the intl variety https://mzl.la/2CHR6jL
Keywords: intl
(Assignee)

Comment 14

a year ago
(In reply to :aceman from comment #12)
> > +  charset = m_partCharset;
> So where do you use charset in this function? Or is this meant as an out
> argument?
Yes:
int32_t nsMsgBodyHandler::GetNextLine (nsCString &buf, nsCString &charset)
Since the arguments already didn't follow the conventions, I didn't do the right thing here either.

I'll add the comments and spaces when pushing this.

Comment 15

a year ago
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/25e260b99581
fix body search for non-ASCII bodies (incl. 10 new test cases). r=aceman
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
(Assignee)

Updated

a year ago
Attachment #8939406 - Flags: approval-comm-esr52?
Attachment #8939406 - Flags: approval-comm-beta+
(Assignee)

Updated

a year ago
Target Milestone: --- → Thunderbird 59.0
(Assignee)

Comment 16

a year ago
Sadly there's a follow-up. Turns out that the charset can be quoted, too, and that RFind won't work since the line can be:
Content-Type: text/plain; charset="utf-8"; format=flowed; delsp=yes;
Attachment #8939427 - Flags: review?(acelists)

Comment 17

a year ago
Comment on attachment 8939427 [details] [diff] [review]
1427124-follow-up.patch

Review of attachment 8939427 [details] [diff] [review]:
-----------------------------------------------------------------

Open-coding ugliness :(
Attachment #8939427 - Flags: review?(acelists) → review+

Comment 18

a year ago
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/863d0cd8bb53
Follow-up: charset can be quoted. r=aceman
(Assignee)

Updated

a year ago
Duplicate of this bug: 1059230
(Assignee)

Updated

a year ago
Attachment #8939406 - Flags: approval-comm-esr52? → approval-comm-esr52+
You need to log in before you can comment on or make changes to this bug.