Closed Bug 1427124 Opened 6 years ago Closed 6 years ago

Quick Filter doesn't find non-ASCII text in message body if the message is multipart (only plain text and HTML with no attachments work)

Tracking

(thunderbird_esr5258+ fixed, thunderbird58 fixed, thunderbird59 fixed)

Status:

RESOLVED FIXED

Milestone:

Thunderbird 59.0

Tracking Flags:

Tracking

Status

thunderbird_esr52

58+

fixed

thunderbird58

---

fixed

thunderbird59

---

fixed

People

(Reporter: regspam, Assigned: jorgk-bmo)

References

(Blocks 1 open bug)

Details

(Keywords: intl, Whiteboard: TB 52.6 ESR)

Attachments

(4 files, 1 obsolete file)

test message.eml 6 years ago el8 1.68 KB, text/plain		Details
test message.eml.zip 6 years ago el8 1.09 KB, application/x-zip-compressed		Details
1427124-non-ascii-body-search.patch (v1) WIP 6 years ago Jorg K (CEST = GMT+2) 18.09 KB, patch		Details \| Diff \| Splinter Review
1427124-non-ascii-body-search.patch (v2) 6 years ago Jorg K (CEST = GMT+2) 27.48 KB, patch	aceman : review+ jorgk-bmo : approval-comm-beta+ jorgk-bmo : approval-comm-esr52+	Details \| Diff \| Splinter Review
1427124-follow-up.patch 6 years ago Jorg K (CEST = GMT+2) 1.11 KB, patch	aceman : review+	Details \| Diff \| Splinter Review

el8

Reporter

Description

•

6 years ago

User Agent: Mozilla/5.0 (Windows NT 6.1; rv:56.0) Gecko/20100101 Firefox/56.0
Build ID: 20171024165158

Steps to reproduce:

1. Open  Tools -> Options -> Composition -> Send Options...  and select "Send the message in both plain text and HTML" from the drop-down menu.

2. Start a new e-mail message, type some non-English text and apply some HTML formatting to it. For example, let the text be "русский текст" with the "текст" part in bold. (This is Russian; I also tried it with Hebrew with the same result).
Send the message to one of your addresses so that you could continue with step 3.

3. Download the message from the server, select the folder it's in and enable Quick Filter. Type "русский" in the filter input field, enable "Body" in the "Filter messages by:" list if it's disabled. The red "No results" text appears left to the input field, the message isn't found.

If I send the message in either plain text or HTML only, Quick Filter does find the message.
Also, in any case the message is found via the Global Search function (Ctrl-K). 

Thunderbird 52.5.2 (32-bit),
Win7.

el8

Reporter

Updated

•

6 years ago

Component: Untriaged → Search

Jorg K (CEST = GMT+2)

Assignee

Comment 1

•

6 years ago

Nice comprehensive report. Can you please save/drag your test message to the desktop or folder and then attach the .eml file here using "Attach File". Please remove any personal information from the e-mail headers before attaching the file.

Body search isn't working very well, there are many bugs in this area.

el8

Reporter

Comment 2

•

6 years ago

Attached file test message.eml — Details

Sure, here's the message.

el8

Reporter

Comment 3

•

6 years ago

Attached file test message.eml.zip — Details

Sorry, for some reason only the HTML-part of the message is displayed. I'm reattaching it in the zip-packed form, hope it works out this time.

el8

Reporter

Comment 4

•

6 years ago

(Turns out an extension caused the attached eml file to be displayed, not downloaded. Sorry again).

Jorg K (CEST = GMT+2)

Assignee

Updated

•

6 years ago

Attachment #8939084 - Attachment mime type: message/rfc822 → text/plain

Jorg K (CEST = GMT+2)

Assignee

Updated

•

6 years ago

Comment 5

•

6 years ago

Thanks for adding the sample. As you stated, there is a plaintext and a HTML part and any non-ASCII content isn't font.

I changed the message to:

Plaintext: русский *текст* Heute Ändern
HTML:      русский <b>текст</b><br>Heute Ändern

and the "Heute" ("today" in German) is found, but the "Ändern" ("to change") isn't. It's basically an ancient bug 521649 from 2009 where someone complains about drafts which also always have plaintext and HTML parts. Most likely the search is done on the raw UTF-8, so only ASCII text is found.

This will most likely be fixed of we ever get around to fixing bug 1259534.

Status: UNCONFIRMED → NEW

Ever confirmed: true

Thomas D. (:thomas8)

Comment 6

•

6 years ago

(In reply to Jorg K (GMT+1) from comment #5)
> isn't. It's basically an ancient bug 521649 from 2009 where someone
> complains about drafts which also always have plaintext and HTML parts.

Someone is me, and Jörg knows me :) I wasn't complaining, just to my best knowledge and correctly reporting what I saw at the time. Finding the exact causes is not the duty of the reporter.

Jorg K (CEST = GMT+2)

Assignee

Comment 7

•

6 years ago

(In reply to Jorg K (GMT+1) from comment #5)
> It's basically an ancient bug 521649 from 2009 where someone
> complains about drafts which also always have plaintext and HTML parts. Most
> likely the search is done on the raw UTF-8, so only ASCII text is found.
Oops, that was wrong, the draft test case in bug 521649 is different, no multipart involved.

> This will most likely be fixed of we ever get around to fixing bug 1259534.
However, the root cause of all body search problems is the processing which is faulty be design. We search some raw data instead of decoding it first, see bug 1259534 comment #14.

Blocks: qfasfailtracker

Jorg K (CEST = GMT+2)

Assignee

Comment 8

•

6 years ago

I fixed bug 1259534 now, but this problem persists. Bug 1259534 was about base64 encoded content. The problem here is most likely that we don't pay attention to the UTF-8 charset. Needs more investigation.

Jorg K (CEST = GMT+2)

Assignee

Comment 9

•

6 years ago

Attached patch 1427124-non-ascii-body-search.patch (v1) WIP (obsolete) — Details — Splinter Review

Here are 10 new test cases with non-ASCII characters. Sadly, only two cases work: plaintext (01) and HTML (03) without embedded images or attachments.

So the problem is quite clear: Multipart doesn't work *at all* since the charset is only ever taken from the message headers and never from the part headers.

Jorg K (CEST = GMT+2)

Assignee

Updated

•

6 years ago

Summary: Quick Filter doesn't find non-English text in message body if the message was sent in both plain text and HTML → Quick Filter doesn't find non-ASCII text in message body if the message is multipart (only plain text and HTML with no attachments work)

Jorg K (CEST = GMT+2)

Assignee

Comment 10

•

6 years ago

Attached patch 1427124-non-ascii-body-search.patch (v2) — Details — Splinter Review

This fixes the problem. I added 10 new tests. We have a variation of UTF-8, windows-1252 and ISO-8859-7 (Greek) encodings and also some UTF-8 in base64 encoded parts thrown it. It all works and should improve TB's body search capability greatly.

Assignee: nobody → jorgk

Attachment #8939364 - Attachment is obsolete: true

Status: NEW → ASSIGNED

Attachment #8939406 - Flags: review?(acelists)

Jorg K (CEST = GMT+2)

Assignee

Comment 11

•

6 years ago

Comment on attachment 8939406 [details] [diff] [review]
1427124-non-ascii-body-search.patch (v2)

Review of attachment 8939406 [details] [diff] [review]:
-----------------------------------------------------------------

Splinter Review gets confused here with the various charsets. Most stuff is in UTF-8, there is some ANSI/windows-1252 and some Greek.

It looks much better in the "Details" view where UTF-8 is used. Only the ISO-8859-7 and windows-1252 messages look bad there.

::: mailnews/base/test/unit/test_searchBody.js
@@ +91,5 @@
>    // and we don't want to find that.
>    { value: "U2VhcmNoIGZ", op: Contains, count: 0 },
> +
> +  // Messages 11 and 13 to 20 contain "hÃ¼hÃ¼" once.
> +  { value: "hÃ¼hÃ¼", op: Contains, count: 9 },

Looks like Splinter Review is using ANSI here. This is hühü.

@@ +93,5 @@
> +
> +  // Messages 11 and 13 to 20 contain "hÃ¼hÃ¼" once.
> +  { value: "hÃ¼hÃ¼", op: Contains, count: 9 },
> +  // Message 12 contains ÎÎ±Î»Î·ÏÏÎÏÎ± (good evening in Greek).
> +  { value: "ÎÎ±Î»Î·ÏÏÎÏÎ±", op: Contains, count: 1 },

And this is Greek: Καλησπέρα

@@ +96,5 @@
> +  // Message 12 contains ÎÎ±Î»Î·ÏÏÎÏÎ± (good evening in Greek).
> +  { value: "ÎÎ±Î»Î·ÏÏÎÏÎ±", op: Contains, count: 1 },
> +
> +  // Messages 16, 17, 18, 20 contain "hÃ¯hÃ¯" in the plaintext part.
> +  { value: "hÃ¯hÃ¯", op: Contains, count: 4 },

And this is hïhï.

:aceman

Comment 12

•

6 years ago

Comment on attachment 8939406 [details] [diff] [review]
1427124-non-ascii-body-search.patch (v2)

Review of attachment 8939406 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for diving into these problems Jorg. This should improve the search tremendously for a lot of people.

I have created my own messages with different charsets in a single folder and search worked on them with the patch. Without the patch, nothing was found.

Assuming all the test cases pass, this is great and you have my r+.

::: mailnews/base/search/public/nsMsgBodyHandler.h
@@ +32,5 @@
>      uint32_t headersSize, bool ForFilters);
>  
>    virtual ~nsMsgBodyHandler();
>  
> +  // returns next message line in buf and the applicable charset, if found.

This actually has 3 return values (also the int32_t function return value), please document the int too.

::: mailnews/base/search/src/nsMsgBodyHandler.cpp
@@ +128,5 @@
>      // And reapply our transformations...
>      outLength = ApplyTransformations(buf, buf.Length(), eatThisLine, buf);
>    }
>  
> +  charset = m_partCharset;

So where do you use charset in this function? Or is this meant as an out argument?

@@ +415,1 @@
>      start += 9;

Please comment that this is the length of "boundary=", at least I assume that is the relation.

@@ +431,5 @@
>  
> +  if (m_isMultipart &&
> +      (start = lowerCaseLine.Find("charset=", /* ignoreCase = */ true)) != -1)
> +  {
> +    start += 8;

Please comment that this is the length of "charset=", at least I assume that is the relation.

@@ +436,5 @@
> +    int32_t end = line.RFindChar(';');
> +    if (end == -1)
> +      end = line.Length();
> +
> +    m_partCharset.Assign(Substring(line,start,end-start));

Spaces after commas.

Attachment #8939406 - Flags: review?(acelists) → review+

Wayne Mery (:wsmwk)

Comment 13

•

6 years ago

quite a few bugs of the intl variety https://mzl.la/2CHR6jL

Keywords: intl

Jorg K (CEST = GMT+2)

Assignee

Comment 14

•

6 years ago

(In reply to :aceman from comment #12)
> > +  charset = m_partCharset;
> So where do you use charset in this function? Or is this meant as an out
> argument?
Yes:
int32_t nsMsgBodyHandler::GetNextLine (nsCString &buf, nsCString &charset)
Since the arguments already didn't follow the conventions, I didn't do the right thing here either.

I'll add the comments and spaces when pushing this.

Pulsebot

Comment 15

•

6 years ago

Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/25e260b99581
fix body search for non-ASCII bodies (incl. 10 new test cases). r=aceman

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED

Jorg K (CEST = GMT+2)

Assignee

Updated

•

6 years ago

Attachment #8939406 - Flags: approval-comm-esr52?

Attachment #8939406 - Flags: approval-comm-beta+

Jorg K (CEST = GMT+2)

Assignee

Updated

•

6 years ago

Target Milestone: --- → Thunderbird 59.0

Jorg K (CEST = GMT+2)

Assignee

Comment 16

•

6 years ago

Attached patch 1427124-follow-up.patch — Details — Splinter Review

Sadly there's a follow-up. Turns out that the charset can be quoted, too, and that RFind won't work since the line can be:
Content-Type: text/plain; charset="utf-8"; format=flowed; delsp=yes;

Attachment #8939427 - Flags: review?(acelists)

:aceman

Comment 17

•

6 years ago

Comment on attachment 8939427 [details] [diff] [review]
1427124-follow-up.patch

Review of attachment 8939427 [details] [diff] [review]:
-----------------------------------------------------------------

Open-coding ugliness :(

Attachment #8939427 - Flags: review?(acelists) → review+

Pulsebot

Comment 18

•

6 years ago

Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/863d0cd8bb53
Follow-up: charset can be quoted. r=aceman

Jorg K (CEST = GMT+2)

Assignee

Comment 19

•

6 years ago

Beta (TB 58):
https://hg.mozilla.org/releases/comm-beta/rev/4bdc680ec79b
https://hg.mozilla.org/releases/comm-beta/rev/86501ecd46e6

status-thunderbird58: --- → fixed

status-thunderbird59: --- → fixed

status-thunderbird_esr52: --- → affected

Jorg K (CEST = GMT+2)

Assignee

Updated

•

6 years ago

Attachment #8939406 - Flags: approval-comm-esr52? → approval-comm-esr52+

Jorg K (CEST = GMT+2)

Assignee

Comment 21

•

6 years ago

TB 52.6 ESR:
https://hg.mozilla.org/releases/comm-esr52/rev/505cb876ef79f19e39bd073e4943c8909ff5e044
https://hg.mozilla.org/releases/comm-esr52/rev/c4f66e88dd7eca6a15335f409c6ac09d9e1f2fce

status-thunderbird_esr52: affected → fixed

tracking-thunderbird_esr52: --- → 58+

Whiteboard: TB 52.6 ESR

You need to log in before you can comment on or make changes to this bug.