Closed Bug 1478828 Opened 7 years ago Closed 7 years ago

Quick Filter and Search Messages by Body fail to find messages if message is encoded in an invalid charset

Categories

(Thunderbird :: Search, defect)

52 Branch
defect
Not set
normal

Tracking

(thunderbird_esr60 fixed, thunderbird62 affected, thunderbird63 fixed)

RESOLVED FIXED
Thunderbird 63.0
Tracking Status
thunderbird_esr60 --- fixed
thunderbird62 --- affected
thunderbird63 --- fixed

People

(Reporter: jmdyck, Assigned: jorgk-bmo)

References

Details

Attachments

(5 files, 2 obsolete files)

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0 Build ID: 20180704192850 Steps to reproduce: 1. Went to a folder containing lots of messages (spam) that begin "We considered your resume to be very attractive". 2. Do a 'Quick Filter" (filtering messages by Body) on "resume" (or "attractive"). 3. Do a 'Search Messages' on the folder for "Body contains resume" (or ".. attractive"). Actual results: Either way, I got only 5 hits, all cases where I was getting a bounce-report that included the text of the original spam. Expected results: I should have got hundreds of hits.
(I got the bounce report because the spammer used my address as the From on the original spam.)
Attachment #8995355 - Attachment mime type: text/plain → message/rfc822
Version: 59 → 52 Branch
This is Thunderbird 52.9.1 (Ubuntu 18.04's newest version).
See Also: → 1434020
Attachment #8995349 - Attachment mime type: message/rfc822 → text/plain
Attachment #8995355 - Attachment mime type: message/rfc822 → text/plain
Interesting ;-) The "false negative" message is encoded in a charset cp-850 I've never heard of. If you export the message, change that to, say, windows-1252, your message is found when you import it again. The rejection notice doesn't specify a charset and quotes the original message verbatim. So the string is and should be found there. I can correct the charset handling, so if an unknown charset is detected, we default to something else. The cause is bad error handling here: https://dxr.mozilla.org/comm-central/rev/e5e1510b8d914bfa8439b21ba3f73e4f2e83e957/mailnews/base/search/src/nsMsgSearchTerm.cpp#1121 nsMsgI18NConvertToUnicode() will fail on a bad charset and we pretend everything went fine.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Quick Filter and Search Messages by Body fail to find messages → Quick Filter and Search Messages by Body fail to find messages if message is encoded in an invalid charset
Attached patch 1478828-bad-charset.patch (obsolete) — Splinter Review
This does the trick.
Assignee: nobody → jorgk
Status: NEW → ASSIGNED
Attachment #8995708 - Flags: review?(acelists)
According to https://www.iana.org/assignments/character-sets/character-sets.xhtml, "cp850" is an alias for "IBM850", which is a valid character set, registered with IANA. (Not that I care about my spam being correctly decoded, but there might be people legitimately using the charset.) Of course, if Thunderbird doesn't grok IBM850, there's not much else you can do.
Sorry, we use Mozilla core decoding software, if that doesn't recognise the charset, there's little I can do. That said, the e-mail doesn't contain "cp850" but "cp-850".
Before I hear "please add a test", here's one.
Attachment #8995708 - Attachment is obsolete: true
Attachment #8995708 - Flags: review?(acelists)
Attachment #8995739 - Flags: review?(acelists)
Comment on attachment 8995739 [details] [diff] [review] 1478828-bad-charset.patch - with test Review of attachment 8995739 [details] [diff] [review]: ----------------------------------------------------------------- So what is the idea in this patch? In case of unknown charset you try again with UTF8 so that maybe something useful comes out?
Attachment #8995739 - Flags: review?(acelists) → review+
(In reply to :aceman from comment #8) > In case of unknown charset you try again with UTF8 so that maybe something useful comes out? Yes. The existing code tried UTF-8 if no charset was given. So I'm extending this to the case where the charset is bad. With luck, the message as ASCII, like in the bug here, and that will do something useful. Worst case, we don't find anything.
Attached patch 1478828-ws.patchSplinter Review
There's an awful lot of badly formatted and ugly code in that file. Here I'm fixed the worst bits I saw.
Pushed by mozilla@jorgk.com: https://hg.mozilla.org/comm-central/rev/dc35d06acdd4 try matching string with UTF-8 if conversion fails. r=aceman https://hg.mozilla.org/comm-central/rev/95eb75ce038f some white-space clean-up in nsMsgSearchTerm.cpp. rs=white-space-only
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
My absolute favourite was this code which I removed: -#ifdef DO_I18N - if(conv) - INTL_DestroyCharCodeConverter(conv); -#endif 'DO_I18N' which is thankfully never defined anywhere calling 'INTL_DestroyCharCodeConverter()' which doesn't exist :-(
Target Milestone: --- → Thunderbird 63.0
Attached patch 1478828-follow-up.patch (obsolete) — Splinter Review
I've been thinking about this. nsMsgI18NConvertToUnicode() treats an empty charset as ASCII/windows-1252, so basically a single-byte encoding. Here failure to convert uses ASCII/windows-1252 as well: https://dxr.mozilla.org/comm-central/rev/e5e1510b8d914bfa8439b21ba3f73e4f2e83e957/mailnews/mime/src/mimedrft.cpp#1553 So trying UTF-8 unconditionally is maybe a bit ambitious. This one will let me sleep better ;-) Note that since the arrival of encoding_rs, ISO-8859-1 is just a label for windows-1252.
Attachment #8995806 - Flags: review?(acelists)
Attachment #8995806 - Flags: review?(acelists) → review+
Tweaked the test so it exercises the UTF-8 and the windows-1252 code paths.
Attachment #8995806 - Attachment is obsolete: true
Attachment #8995823 - Flags: review+
Pushed by mozilla@jorgk.com: https://hg.mozilla.org/comm-central/rev/9dc3759e20aa If no charset or conversion failed, check for UTF-8 and fallback to ASCII. r=aceman
Comment on attachment 8995739 [details] [diff] [review] 1478828-bad-charset.patch - with test Applies to all three patches here.
Attachment #8995739 - Flags: approval-comm-esr60+
Attachment #8995739 - Flags: approval-comm-beta+
Attachment #8995739 - Flags: approval-comm-beta+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: