Closed Bug 1734361 Opened 3 years ago Closed 3 years ago

News messages in KOI8-R shown incorrectly: TB91 regression

Categories

(MailNews Core :: MIME, defect)

Thunderbird 91
defect

Tracking

(thunderbird_esr91+ fixed, thunderbird94 wontfix)

RESOLVED FIXED
95 Branch
Tracking Status
thunderbird_esr91 + fixed
thunderbird94 --- wontfix

People

(Reporter: cstef, Assigned: mkmelin)

References

()

Details

(Keywords: regression)

Attachments

(8 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0

Steps to reproduce:

After upgrade to Thunderbird/91.1.2 some newsgroups in Russian in KOI8-R encoding are not shown correctly. The newsgroups are on ddt.demos.su news server, http://ddt.demos.su/index.html.en is info on it. I experiment with fido7.ru.fidonet.today group, but others show the same.

Actual results:

Messages and subjects in thread pane all are in <?> (see Message_pane.png in attachment). The message I try to see is in the attachment in next comment. It's weird name is how TB saved it.

When I try to view message source, it is shown correctly (Message_source.png screenshot).

I can use 'Charset Menu' addon (https://addons.thunderbird.net/en-US/thunderbird/addon/charset-menu/) and select KOI8-R encoding by hand, then the message is shown correctly (see Message_pane_charset_corrected.png), but:

  • I have to do it for every message, which is very annoying (and do it again even after returning to the same message again).
  • Thread pane is still not shown correctly.

Expected results:

The expected result is to show everything correctly, as it was in Thunderbird/78.14.0. before the upgqrade.

Blocks: tb91found

Added 'See also' for 2 bug that seem relevant.

Attached image news1.png

This is how ddt.demos.su / fido7.ru.fidonet.today looks like in TB 78 with a folder charset of ISO-8859-1 (Western). The subject isn't right, but the body for some reason is picked up as KOI8-R. View source even detects KOI8-U which is apparently much the same.

Attached image news2.png

Here TB 78 with folder charset windows-1251. It looks Cyrillic, but correct would be "05 октября в истории Фидо". Sadly, KOI8-R/U aren't available as folder charset and they were also not available in the default text encoding in the news account's server settings. The folder charset was indeed removed in bug 1671880 part 2. Surprising that the server charset doesn't offer all available charsets.

Further experimentation in TB 78 and 91 shows that setting mail.server.serverX.charset to KOI8-R has no effect at all.

There seem to be two bugs here:

  1. The charset detection in the body no longer works, was working in 78, not in 91.
  2. News account's server settings charset has no effect, neither in 78 nor in 91.

Ping, is this working better in the new implementation? What is the charset option meant to do? That's not working in 78 and 91.

Alice, can you please look for the regression re. point 1.

STR: Set up a news account (add other account in the account manager) on server ddt.demos.su and subscribe to newsgroup fido7.ru.fidonet.today. Download the 500 headers it offers. Click on some messages. Do they only show the UTF-8 replacement character <?> or do they show Cyrillic text. in TB 78 it was Cyrillic text, now it's <?>.

Flags: needinfo?(remotenonsense)
Flags: needinfo?(alice0775)

Apparently used here, maybe only for the newsgroup folder name but not the message subject/body:

	mailnews/news/public/nsINntpIncomingServer.idl
42	/** the server charset and it may be needed to display newsgroup folder
	mailnews/news/src/nsNNTPProtocol.cpp
1431	if (NS_FAILED(m_nntpServer->GetCharset(charset)) ||
2497	if (NS_SUCCEEDED(m_nntpServer->GetCharset(charset)) &&
3650	if (NS_FAILED(m_nntpServer->GetCharset(charset) ||
	mailnews/news/src/nsNewsFolder.cpp
1321	rv = nntpServer->GetCharset(dataCharset);
	mailnews/news/src/nsNntpIncomingServer.cpp
257	nsNntpIncomingServer::SetCharset(const nsACString& aCharset) {
262	nsNntpIncomingServer::GetCharset(nsACString& aCharset) {
263	// first we get the per-server settings mail.server.<serverkey>.charset

(In reply to newsfan from comment #7)

Ping, is this working better in the new implementation? What is the charset option meant to do? That's not working in 78 and 91.

Alice, can you please look for the regression re. point 1.

STR: Set up a news account (add other account in the account manager) on server ddt.demos.su and subscribe to newsgroup fido7.ru.fidonet.today. Download the 500 headers it offers. Click on some messages. Do they only show the UTF-8 replacement character <?> or do they show Cyrillic text. in TB 78 it was Cyrillic text, now it's <?>.

There are two regressions for the problem in point 1.
#1 regression (the message shown like "ðÒÉ×ÅÔ" instead of "Привет"):
https://hg.mozilla.org/comm-central/pushloghtml?fromchange=43b5533bb521ce912d1e6efd63ba1a8b85b4689e&tochange=3298f786eff21756ef963741631033117c7d8aeb
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f05a0084c5f20abf47fa332ab5ec5c953ba48eb1&tochange=3155ffead6aefd99f4cef141b19163b881c0e288

Suspect of #1 regression: Bug 1645307 - Port bug 1603712 to Thunderbird (remove intl.charset.detector.ng.enabled). r=pmorris CLOSED TREE

#2 regression (the message shown like "<?><?><?><?><?><?>" instead of "ðÒÉ×ÅÔ"):
https://hg.mozilla.org/comm-central/pushloghtml?fromchange=ea8e92388dfe64465640b7ee7e471083cf9bcfe2&tochange=a5ec7d54dd63c8c9d1e8ba34e32d5090eb8754a6
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=e581b02b5df54c1cdf194316dba74a5cf65145b6&tochange=68867f327c6267f971e6483443148a4025f94633

Suspect of #2 regression: Bug 1671880 - Part 1: Remove mailnews.view_default_charset pref. r=mkmelin

Flags: needinfo?(alice0775)

Fantastic work, much appreciated, thanks a lot! Yes, removing the charset detector would have contributed to the issue, and removing the default charset did the rest; it moved the charset to UTF-8 and all those highbit characters are invalid in UTF-8, hence triggering the <?> replacement.

Looks like proper detection needs to be implemented once again. As mentioned in comment #8, the server default charset is only used for folder names.

However, it would be quite useful, especially for newsgroups, to have an individual default. The server default charset seems to me the logical choice.

See also: Bug 1301058

There are servers where a Chinese Simpified (GBK) / Big5 default would make sense:
See Bug 1516774, Bug 1662350

(In reply to newsfan from comment #7)

Ping, is this working better in the new implementation? What is the charset option meant to do? That's not working in 78 and 91.

It's possible I can fix both problems in the new JS implemenation, I can use the server charset to decode title/body. But I have no idea how to fix it for 91, especially the message body problem.

Flags: needinfo?(remotenonsense)

The main regression was caused by the removal of this line here:
https://hg.mozilla.org/comm-central/diff/d31e5b627413057dd99c2ccd901a3a8b6261272d/mailnews/mime/src/mimetext.cpp#l1.19

Restoring the line fixes the body, but it doesn't fix the subject. Even in TB 78 the subject wasn't correct, it could be made to look "somewhat" Cyrillic by setting the folder charset or the default incoming mail charset, both of which have been removed between 78 and 91. The latter was localised, so chances are that it looked better in a Russian version. In TB 78 the subject can be made to look like "чПФУБР, ЙОУФБЗТБН, ЖЕКУВХЛ.. жйдпоеф!", but correct is "Вотсап, инстаграм, фейсбук.. ФИДОНЕТ!" (English: Whatsapp, Instagram, Facebook .. FIDONET!). The first string "чПФУБР, ..." is just Cyrillic looking gibberish.

Here's our fix for the body, not sure whether we should address the subject as well since that clearly infringes RFC 6532 that allows raw UTF-8 not not any other raw data, like KOI8-R in this case:
https://github.com/Betterbird/thunderbird-patches/blob/main/91/bugs/1734361-auto-detect.patch

(In reply to Rachel Martin from comment #13)

Restoring the line fixes the body, but it doesn't fix the subject. Even in TB 78 the subject wasn't correct, it could be made to look "somewhat"
Cyrillic by setting the folder charset or the default incoming mail charset, both of which have been removed between 78 and 91.

The subject definitely was correct in TB 78, and subjects in thread pane were correct as well. Yes, I had folder charset set to KOI8-R. All this together gave absolutely correct view despite RFC violation.

The subject definitely was correct in TB 78, and subjects in thread pane were correct as well. Yes, I had folder charset set to KOI8-R. All this together gave absolutely correct view despite RFC violation.

P.S. Sorry for the wrong formatting.

Sure, as we pointed out in comment #13, the folder charset or default charset would have corrected the issue, but both are gone now. That said, even at TB 78, KOI8-R was not a selectable charset:
https://searchfox.org/comm-esr78/rev/ba95b27c546ce348adadd41ce57d2a944df198ad/mailnews/base/content/menulist-charsetpicker.js#130
So how did you set the folder charset to KOI8-R?
Now we need to find a solution to use the servers default charset instead, we've already added more charsets to the list (see patch in comment #13).

(In reply to Rachel Martin from comment #16)

Sure, as we pointed out in comment #13, the folder charset or default charset would have corrected the issue, but both are gone now. That said, even at TB 78, KOI8-R was not a selectable charset:
https://searchfox.org/comm-esr78/rev/ba95b27c546ce348adadd41ce57d2a944df198ad/mailnews/base/content/menulist-charsetpicker.js#130
So how did you set the folder charset to KOI8-R?
Now we need to find a solution to use the servers default charset instead, we've already added more charsets to the list (see patch in comment #13).

To be honest, I don't remember how I set it. I use this profile for many years. I went to an old backup and found
(2426=KOI8-R) in .msf file for the newsgroup
and
user_pref("mail.server.server5.charset", "KOI8-R");
in pref.js

I'm not sure if this was done via interface or was edited by hand, it was too long ago.

But I'm definitely sure that everything worked and displayed correctly until upgrade to TB 91.

Full fix, including body display, subject display in header pane and thread pane, search and filtering (all broken by bug 1645307 and bug 1671880 part 2). Also adds missing charsets back to the menu:
https://github.com/Betterbird/thunderbird-patches/blob/main/91/bugs/1734361-fix-news-charset-issues.patch

I took a look. Looking at the history here, nobody has been able to set up a koi8-r newsgroup since many many years.

Maybe around 10yrs ago it was possible... It's been possible to continue using that group until now though.
There hasn't been other complaints since, so it's a very questionable feature to support at all.

Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: regression
  • Restore erroneously removed 'text->inputAutodetect = true;'.
  • Use news server default charset.

Based on https://github.com/Betterbird/thunderbird-patches/blob/main/91/bugs/1734361-fix-news-charset-issues.patch

Assignee: nobody → mkmelin+mozilla
Status: NEW → ASSIGNED
Component: Untriaged → MIME
Product: Thunderbird → MailNews Core

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02
fix non-utf8 newsgroups. r=freaktechnik
https://hg.mozilla.org/comm-central/rev/f5354584e967
remove obsolete charset picking for outgoing mails. r=freaktechnik

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 95 Branch

Currently TB uses the server charset for all messages from newsgroups.

However, the server charset should be used (by default) only for those messages that do not have a valid charset in the Content-Type header.

I guess that was broken in this bug.

This is surprising. The idea of the patch was to use the server charset when previously the folder charset was used. This happens in four locations:

Old nsMsgDatabase::GetEffectiveCharset():
https://hg.mozilla.org/comm-central/rev/5696264d99e6#l11.9
New:
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l2.22

Old bridge_new_new_uri():
https://hg.mozilla.org/comm-central/rev/5696264d99e6#l17.12
New:
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l4.30

Old Search/Filter:
https://hg.mozilla.org/comm-central/rev/5696264d99e6#l20.12
https://hg.mozilla.org/comm-central/rev/5696264d99e6#l21.13
New:
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l9.30
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l10.30

The search/filter code is not relevant for message display. With the folder charset there also was a GetFolderCharsetOverride() that determined whether to use the folder charset or not.

Can you see the logic error? What's a good news group to try it on? Could this like be incorrect?
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l4.38
Getting the server charset doesn't constitute a manual override. We'll debug it soon.

(In reply to Rachel Martin from comment #24)

What's a good news group to try it on?

news.mozilla.org

I posted two articles in "mozilla.test". Depending on the server charset, TB displays one or the other correctly or incorrectly.

This looks fishy:

Old:
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l4.29

              *override_charset = false;
 -            *default_charset = strdup("UTF-8");

New:
https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l4.35

+              rv = nntpURL->GetCharset(charset);
+              if (NS_SUCCEEDED(rv)) {
+                *default_charset = ToNewCString(charset);
+                *override_charset = true;

With override_charset = true; it will force the default - right?

With override_charset = true; it will force the default - right?

Yes. Removing this line, https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l4.38, fixes the issue. The news.mozilla.org / mozilla.test UTF-8 message displays correctly although the server charset is "Western" in our config.

However, removing that line breaks ddt.demos.su / fido7.ru.fidonet.today again. There the override is required to show the correct subject in the header pane and thread pane since the subject is encoded in KOI8-R and the override takes care of that. For cases like this, the folder charset was ideal since you could override per folder, not per server. We'll have to think about this a bit more. Maybe something like a forced override based on the server, a checkbox next to the server charset. Open for suggestions.

Is it possible that the concept behind bool* override_charset is outdated?
When is it needed?

I would imagine it was used when there was still the menu to select an explicit charset. Then maybe we could get rid of bool* override_charset altogether.

For this, code like:
https://searchfox.org/comm-central/source/mailnews/mime/src/mimehdrs.cpp#478

    if (opt->override_charset)
       charset = PL_strdup(opt->default_charset);

... would have to be replaced with something similar to:

   charset = MimeHeaders_get_parameter(contentType, HEADER_PARM_CHARSET,
                                            nullptr, nullptr);
   if(!charset)
       charset = opt->default_charset;

Don't know if this will do well.

In general, an override signifies that the system is using a charset which is different to the charset it would normally determine and use. Yes, when there was still a charset menu, the code set the override which was later used for display, replay, forward, etc., here in TB 68 when that last worked:
https://searchfox.org/comm-esr68/rev/f89f3cf690ad0daa84414dee44605c328855eb8f/mail/base/content/mailWindow.js#848
In the line below, the code remembers for which message that charset is to be used so that the override is only used for the message it's intended for.

Even after repairing "Repair text encoding", the concept of the override is still there:
https://searchfox.org/comm-central/rev/1bd54a23a2b8a4af43c21ba8ac5e1602fb3d475f/mail/base/content/mailWindow.js#1022
only that the charset is not selected by the user but instead determined by the system in forceDetectDocumentCharset().

So the override mechanism in our view is essential for charset correction. As was noted in bug 1713786 comment #33, the new override mechanism only corrects the body, when especially for news without charset, we also need to correct the treatment of the subject header, also when retrieved from the database (which is used for the thread pane). And as we saw, this line *override_charset = true; makes all the difference.

For the news case, which is what is being fixed here, setting *override_charset = true; maybe - as you suggested - should be made dependent on whether the headers have a charset parameter. It's still a heuristic approach, ultimately the body could be in a "local" charset and the subject header in raw UTF-8 as per RFC 6532 (see comment #13).

Looking at that suggestion, if doesn't appear that the message headers are available at this point. That brings us back to the suggestion from the second last sentence in comment #27 (forced server override). Should all this be discussed in a separate bug or do you want to reopen this one?

(In reply to Rachel Martin from comment #27)

However, removing that line breaks ddt.demos.su / fido7.ru.fidonet.today again.

No it doesn't. We had forgotten to set the server charset to KOI8-R, so of course it doesn't work then. So please ignore comment #28 to comment #30. This line https://hg.mozilla.org/comm-central/rev/d0af6cc5fe02#l4.38 can just be removed and everything is fine. Surely Magnus will take care of that.

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/e0e961dce406
followup - remove case a of incorrect forcing of charset override. rs=me

(In reply to Pulsebot from comment #33)

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/e0e961dce406
followup - remove case a of incorrect forcing of charset override. rs=me

Thanks - Yes this fixes Comment 23.

Thanks for testing!

Comment on attachment 9246103 [details]
Bug 1734361 - fix non-utf8 news groups. r=freaktechnik

[Approval Request Comment]
Regression caused by (bug #): bug 1671880
User impact if declined: Can't read such newsgroups (which will be rare, but apparently exist)
Testing completed (on c-c, etc.): Have been on beta
Risk to taking this patch (and alternatives if risky): not very risky

Attachment #9246103 - Flags: approval-comm-esr91?

Let's not forget the additional fix from comment #33.

Comment on attachment 9246103 [details]
Bug 1734361 - fix non-utf8 news groups. r=freaktechnik

[Triage Comment]
Approved for esr91

(In reply to Rachel Martin from comment #38)

Let's not forget the additional fix from comment #33.

Flags: needinfo?(rob)
Attachment #9246103 - Flags: approval-comm-esr91? → approval-comm-esr91+
Flags: needinfo?(rob)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: