News messages in KOI8-R shown incorrectly: TB91 regression
Categories
(MailNews Core :: MIME, defect)
Tracking
(thunderbird_esr91+ fixed, thunderbird94 wontfix)
People
(Reporter: cstef, Assigned: mkmelin)
References
()
Details
(Keywords: regression)
Attachments
(8 files)
30.91 KB,
image/png
|
Details | |
1.40 KB,
message/rfc822
|
Details | |
42.71 KB,
image/png
|
Details | |
28.56 KB,
image/png
|
Details | |
36.23 KB,
image/png
|
Details | |
36.72 KB,
image/png
|
Details | |
48 bytes,
text/x-phabricator-request
|
wsmwk
:
approval-comm-esr91+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
Details | Review |
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0
Steps to reproduce:
After upgrade to Thunderbird/91.1.2 some newsgroups in Russian in KOI8-R encoding are not shown correctly. The newsgroups are on ddt.demos.su news server, http://ddt.demos.su/index.html.en is info on it. I experiment with fido7.ru.fidonet.today group, but others show the same.
Actual results:
Messages and subjects in thread pane all are in <?> (see Message_pane.png in attachment). The message I try to see is in the attachment in next comment. It's weird name is how TB saved it.
When I try to view message source, it is shown correctly (Message_source.png screenshot).
I can use 'Charset Menu' addon (https://addons.thunderbird.net/en-US/thunderbird/addon/charset-menu/) and select KOI8-R encoding by hand, then the message is shown correctly (see Message_pane_charset_corrected.png), but:
- I have to do it for every message, which is very annoying (and do it again even after returning to the same message again).
- Thread pane is still not shown correctly.
Expected results:
The expected result is to show everything correctly, as it was in Thunderbird/78.14.0. before the upgqrade.
Reporter | ||
Comment 1•2 years ago
|
||
Reporter | ||
Comment 2•2 years ago
|
||
Reporter | ||
Comment 3•2 years ago
|
||
This is how ddt.demos.su / fido7.ru.fidonet.today looks like in TB 78 with a folder charset of ISO-8859-1 (Western). The subject isn't right, but the body for some reason is picked up as KOI8-R. View source even detects KOI8-U which is apparently much the same.
Here TB 78 with folder charset windows-1251. It looks Cyrillic, but correct would be "05 октября в истории Фидо". Sadly, KOI8-R/U aren't available as folder charset and they were also not available in the default text encoding in the news account's server settings. The folder charset was indeed removed in bug 1671880 part 2. Surprising that the server charset doesn't offer all available charsets.
Further experimentation in TB 78 and 91 shows that setting mail.server.serverX.charset to KOI8-R has no effect at all.
There seem to be two bugs here:
- The charset detection in the body no longer works, was working in 78, not in 91.
- News account's server settings charset has no effect, neither in 78 nor in 91.
Ping, is this working better in the new implementation? What is the charset option meant to do? That's not working in 78 and 91.
Alice, can you please look for the regression re. point 1.
STR: Set up a news account (add other account in the account manager) on server ddt.demos.su and subscribe to newsgroup fido7.ru.fidonet.today. Download the 500 headers it offers. Click on some messages. Do they only show the UTF-8 replacement character <?> or do they show Cyrillic text. in TB 78 it was Cyrillic text, now it's <?>.
Apparently used here, maybe only for the newsgroup folder name but not the message subject/body:
mailnews/news/public/nsINntpIncomingServer.idl
42 /** the server charset and it may be needed to display newsgroup folder
mailnews/news/src/nsNNTPProtocol.cpp
1431 if (NS_FAILED(m_nntpServer->GetCharset(charset)) ||
2497 if (NS_SUCCEEDED(m_nntpServer->GetCharset(charset)) &&
3650 if (NS_FAILED(m_nntpServer->GetCharset(charset) ||
mailnews/news/src/nsNewsFolder.cpp
1321 rv = nntpServer->GetCharset(dataCharset);
mailnews/news/src/nsNntpIncomingServer.cpp
257 nsNntpIncomingServer::SetCharset(const nsACString& aCharset) {
262 nsNntpIncomingServer::GetCharset(nsACString& aCharset) {
263 // first we get the per-server settings mail.server.<serverkey>.charset
![]() |
||
Comment 9•2 years ago
•
|
||
(In reply to newsfan from comment #7)
Ping, is this working better in the new implementation? What is the charset option meant to do? That's not working in 78 and 91.
Alice, can you please look for the regression re. point 1.
STR: Set up a news account (add other account in the account manager) on server ddt.demos.su and subscribe to newsgroup fido7.ru.fidonet.today. Download the 500 headers it offers. Click on some messages. Do they only show the UTF-8 replacement character <?> or do they show Cyrillic text. in TB 78 it was Cyrillic text, now it's <?>.
There are two regressions for the problem in point 1.
#1 regression (the message shown like "ðÒÉ×ÅÔ" instead of "Привет"):
https://hg.mozilla.org/comm-central/pushloghtml?fromchange=43b5533bb521ce912d1e6efd63ba1a8b85b4689e&tochange=3298f786eff21756ef963741631033117c7d8aeb
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f05a0084c5f20abf47fa332ab5ec5c953ba48eb1&tochange=3155ffead6aefd99f4cef141b19163b881c0e288
Suspect of #1 regression: Bug 1645307 - Port bug 1603712 to Thunderbird (remove intl.charset.detector.ng.enabled). r=pmorris CLOSED TREE
#2 regression (the message shown like "<?><?><?><?><?><?>" instead of "ðÒÉ×ÅÔ"):
https://hg.mozilla.org/comm-central/pushloghtml?fromchange=ea8e92388dfe64465640b7ee7e471083cf9bcfe2&tochange=a5ec7d54dd63c8c9d1e8ba34e32d5090eb8754a6
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=e581b02b5df54c1cdf194316dba74a5cf65145b6&tochange=68867f327c6267f971e6483443148a4025f94633
Suspect of #2 regression: Bug 1671880 - Part 1: Remove mailnews.view_default_charset pref. r=mkmelin
Comment 10•2 years ago
|
||
Fantastic work, much appreciated, thanks a lot! Yes, removing the charset detector would have contributed to the issue, and removing the default charset did the rest; it moved the charset to UTF-8 and all those highbit characters are invalid in UTF-8, hence triggering the <?> replacement.
Looks like proper detection needs to be implemented once again. As mentioned in comment #8, the server default charset is only used for folder names.
Comment 11•2 years ago
|
||
However, it would be quite useful, especially for newsgroups, to have an individual default. The server default charset seems to me the logical choice.
See also: Bug 1301058
There are servers where a Chinese Simpified (GBK) / Big5 default would make sense:
See Bug 1516774, Bug 1662350
Comment 12•2 years ago
|
||
(In reply to newsfan from comment #7)
Ping, is this working better in the new implementation? What is the charset option meant to do? That's not working in 78 and 91.
It's possible I can fix both problems in the new JS implemenation, I can use the server charset to decode title/body. But I have no idea how to fix it for 91, especially the message body problem.
Comment 13•2 years ago
|
||
The main regression was caused by the removal of this line here:
https://hg.mozilla.org/comm-central/diff/d31e5b627413057dd99c2ccd901a3a8b6261272d/mailnews/mime/src/mimetext.cpp#l1.19
Restoring the line fixes the body, but it doesn't fix the subject. Even in TB 78 the subject wasn't correct, it could be made to look "somewhat" Cyrillic by setting the folder charset or the default incoming mail charset, both of which have been removed between 78 and 91. The latter was localised, so chances are that it looked better in a Russian version. In TB 78 the subject can be made to look like "чПФУБР, ЙОУФБЗТБН, ЖЕКУВХЛ.. жйдпоеф!", but correct is "Вотсап, инстаграм, фейсбук.. ФИДОНЕТ!" (English: Whatsapp, Instagram, Facebook .. FIDONET!). The first string "чПФУБР, ..." is just Cyrillic looking gibberish.
Here's our fix for the body, not sure whether we should address the subject as well since that clearly infringes RFC 6532 that allows raw UTF-8 not not any other raw data, like KOI8-R in this case:
https://github.com/Betterbird/thunderbird-patches/blob/main/91/bugs/1734361-auto-detect.patch
Reporter | ||
Comment 14•2 years ago
•
|
||
(In reply to Rachel Martin from comment #13)
Restoring the line fixes the body, but it doesn't fix the subject. Even in TB 78 the subject wasn't correct, it could be made to look "somewhat"
Cyrillic by setting the folder charset or the default incoming mail charset, both of which have been removed between 78 and 91.
The subject definitely was correct in TB 78, and subjects in thread pane were correct as well. Yes, I had folder charset set to KOI8-R. All this together gave absolutely correct view despite RFC violation.
Reporter | ||
Comment 15•2 years ago
|
||
The subject definitely was correct in TB 78, and subjects in thread pane were correct as well. Yes, I had folder charset set to KOI8-R. All this together gave absolutely correct view despite RFC violation.
P.S. Sorry for the wrong formatting.
Comment 16•2 years ago
|
||
Sure, as we pointed out in comment #13, the folder charset or default charset would have corrected the issue, but both are gone now. That said, even at TB 78, KOI8-R was not a selectable charset:
https://searchfox.org/comm-esr78/rev/ba95b27c546ce348adadd41ce57d2a944df198ad/mailnews/base/content/menulist-charsetpicker.js#130
So how did you set the folder charset to KOI8-R?
Now we need to find a solution to use the servers default charset instead, we've already added more charsets to the list (see patch in comment #13).
Reporter | ||
Comment 17•2 years ago
|
||
(In reply to Rachel Martin from comment #16)
Sure, as we pointed out in comment #13, the folder charset or default charset would have corrected the issue, but both are gone now. That said, even at TB 78, KOI8-R was not a selectable charset:
https://searchfox.org/comm-esr78/rev/ba95b27c546ce348adadd41ce57d2a944df198ad/mailnews/base/content/menulist-charsetpicker.js#130
So how did you set the folder charset to KOI8-R?
Now we need to find a solution to use the servers default charset instead, we've already added more charsets to the list (see patch in comment #13).
To be honest, I don't remember how I set it. I use this profile for many years. I went to an old backup and found
(2426=KOI8-R) in .msf file for the newsgroup
and
user_pref("mail.server.server5.charset", "KOI8-R");
in pref.js
I'm not sure if this was done via interface or was edited by hand, it was too long ago.
But I'm definitely sure that everything worked and displayed correctly until upgrade to TB 91.
Comment 18•2 years ago
|
||
Full fix, including body display, subject display in header pane and thread pane, search and filtering (all broken by bug 1645307 and bug 1671880 part 2). Also adds missing charsets back to the menu:
https://github.com/Betterbird/thunderbird-patches/blob/main/91/bugs/1734361-fix-news-charset-issues.patch
Assignee | ||
Comment 19•2 years ago
|
||
I took a look. Looking at the history here, nobody has been able to set up a koi8-r newsgroup since many many years.
- de-xbl'd in bug 1524508 (for tb68)
- in tb 60 - could not do it - https://searchfox.org/comm-esr60/source/mailnews/base/content/charsetList.xml
- but was not possible at least back in tb 36 - xref bug 1003716
Maybe around 10yrs ago it was possible... It's been possible to continue using that group until now though.
There hasn't been other complaints since, so it's a very questionable feature to support at all.
Assignee | ||
Comment 20•2 years ago
|
||
- Restore erroneously removed 'text->inputAutodetect = true;'.
- Use news server default charset.
Updated•2 years ago
|
Assignee | ||
Comment 21•2 years ago
|
||
Depends on D128582
Updated•2 years ago
|
Comment 22•2 years ago
|
||
Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02
fix non-utf8 newsgroups. r=freaktechnik
https://hg.mozilla.org/comm-central/rev/f5354584e967
remove obsolete charset picking for outgoing mails. r=freaktechnik
Assignee | ||
Updated•2 years ago
|
Comment 23•2 years ago
|
||
Currently TB uses the server charset for all messages from newsgroups.
However, the server charset should be used (by default) only for those messages that do not have a valid charset in the Content-Type header.
I guess that was broken in this bug.
Comment 24•2 years ago
|
||
This is surprising. The idea of the patch was to use the server charset when previously the folder charset was used. This happens in four locations:
Old nsMsgDatabase::GetEffectiveCharset():
https://hg.mozilla.org/comm-central/rev/5696264d99e6#l11.9
New:
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l2.22
Old bridge_new_new_uri():
https://hg.mozilla.org/comm-central/rev/5696264d99e6#l17.12
New:
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l4.30
Old Search/Filter:
https://hg.mozilla.org/comm-central/rev/5696264d99e6#l20.12
https://hg.mozilla.org/comm-central/rev/5696264d99e6#l21.13
New:
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l9.30
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l10.30
The search/filter code is not relevant for message display. With the folder charset there also was a GetFolderCharsetOverride()
that determined whether to use the folder charset or not.
Can you see the logic error? What's a good news group to try it on? Could this like be incorrect?
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l4.38
Getting the server charset doesn't constitute a manual override. We'll debug it soon.
Comment 25•2 years ago
|
||
(In reply to Rachel Martin from comment #24)
What's a good news group to try it on?
news.mozilla.org
I posted two articles in "mozilla.test". Depending on the server charset, TB displays one or the other correctly or incorrectly.
Comment 26•2 years ago
|
||
This looks fishy:
Old:
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l4.29
*override_charset = false;
- *default_charset = strdup("UTF-8");
New:
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l4.35
+ rv = nntpURL->GetCharset(charset);
+ if (NS_SUCCEEDED(rv)) {
+ *default_charset = ToNewCString(charset);
+ *override_charset = true;
With override_charset = true;
it will force the default - right?
Comment 27•2 years ago
|
||
With override_charset = true; it will force the default - right?
Yes. Removing this line, https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l4.38, fixes the issue. The news.mozilla.org / mozilla.test UTF-8 message displays correctly although the server charset is "Western" in our config.
However, removing that line breaks ddt.demos.su / fido7.ru.fidonet.today again. There the override is required to show the correct subject in the header pane and thread pane since the subject is encoded in KOI8-R and the override takes care of that. For cases like this, the folder charset was ideal since you could override per folder, not per server. We'll have to think about this a bit more. Maybe something like a forced override based on the server, a checkbox next to the server charset. Open for suggestions.
Comment 28•2 years ago
|
||
Is it possible that the concept behind bool* override_charset
is outdated?
When is it needed?
I would imagine it was used when there was still the menu to select an explicit charset. Then maybe we could get rid of bool* override_charset
altogether.
For this, code like:
https://searchfox.org/comm-central/source/mailnews/mime/src/mimehdrs.cpp#478
if (opt->override_charset)
charset = PL_strdup(opt->default_charset);
... would have to be replaced with something similar to:
charset = MimeHeaders_get_parameter(contentType, HEADER_PARM_CHARSET,
nullptr, nullptr);
if(!charset)
charset = opt->default_charset;
Don't know if this will do well.
Comment 29•2 years ago
|
||
In general, an override signifies that the system is using a charset which is different to the charset it would normally determine and use. Yes, when there was still a charset menu, the code set the override which was later used for display, replay, forward, etc., here in TB 68 when that last worked:
https://searchfox.org/comm-esr68/rev/f89f3cf690ad0daa84414dee44605c328855eb8f/mail/base/content/mailWindow.js#848
In the line below, the code remembers for which message that charset is to be used so that the override is only used for the message it's intended for.
Even after repairing "Repair text encoding", the concept of the override is still there:
https://searchfox.org/comm-central/rev/1bd54a23a2b8a4af43c21ba8ac5e1602fb3d475f/mail/base/content/mailWindow.js#1022
only that the charset is not selected by the user but instead determined by the system in forceDetectDocumentCharset()
.
So the override mechanism in our view is essential for charset correction. As was noted in bug 1713786 comment #33, the new override mechanism only corrects the body, when especially for news without charset, we also need to correct the treatment of the subject header, also when retrieved from the database (which is used for the thread pane). And as we saw, this line *override_charset = true;
makes all the difference.
For the news case, which is what is being fixed here, setting *override_charset = true;
maybe - as you suggested - should be made dependent on whether the headers have a charset parameter. It's still a heuristic approach, ultimately the body could be in a "local" charset and the subject header in raw UTF-8 as per RFC 6532 (see comment #13).
Comment 30•2 years ago
|
||
Looking at that suggestion, if doesn't appear that the message headers are available at this point. That brings us back to the suggestion from the second last sentence in comment #27 (forced server override). Should all this be discussed in a separate bug or do you want to reopen this one?
Comment 31•2 years ago
|
||
(In reply to Rachel Martin from comment #27)
However, removing that line breaks ddt.demos.su / fido7.ru.fidonet.today again.
No it doesn't. We had forgotten to set the server charset to KOI8-R, so of course it doesn't work then. So please ignore comment #28 to comment #30. This line https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l4.38 can just be removed and everything is fine. Surely Magnus will take care of that.
Comment 32•2 years ago
|
||
The original patch didn't have that line:
https://github.com/Betterbird/thunderbird-patches/blob/main/91/bugs/1734361-fix-news-charset-issues.patch#L159
Comment 33•2 years ago
|
||
Pushed by mkmelin@iki.fi: https://hg.mozilla.org/comm-central/rev/e0e961dce406 followup - remove case a of incorrect forcing of charset override. rs=me
Comment 34•2 years ago
|
||
(In reply to Pulsebot from comment #33)
Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/e0e961dce406
followup - remove case a of incorrect forcing of charset override. rs=me
Thanks - Yes this fixes Comment 23.
Assignee | ||
Comment 35•2 years ago
|
||
Thanks for testing!
Assignee | ||
Comment 37•2 years ago
|
||
Comment on attachment 9246103 [details]
Bug 1734361 - fix non-utf8 news groups. r=freaktechnik
[Approval Request Comment]
Regression caused by (bug #): bug 1671880
User impact if declined: Can't read such newsgroups (which will be rare, but apparently exist)
Testing completed (on c-c, etc.): Have been on beta
Risk to taking this patch (and alternatives if risky): not very risky
Comment 38•2 years ago
|
||
Let's not forget the additional fix from comment #33.
Comment 39•2 years ago
|
||
Comment on attachment 9246103 [details]
Bug 1734361 - fix non-utf8 news groups. r=freaktechnik
[Triage Comment]
Approved for esr91
(In reply to Rachel Martin from comment #38)
Let's not forget the additional fix from comment #33.
Comment 40•2 years ago
|
||
bugherder uplift |
Updated•2 years ago
|
Description
•