Closed Bug 1704749 Opened 3 years ago Closed 2 years ago

Prepare for m-c removal of CharsetMenu.jsm

Categories

(MailNews Core :: Internationalization, task, P1)

Tracking

(thunderbird_esr78 unaffected)

RESOLVED FIXED
91 Branch
Tracking Status
thunderbird_esr78 --- unaffected

People

(Reporter: hsivonen, Assigned: freaktechnik)

References

Details

Attachments

(5 files, 1 obsolete file)

In bug 1687635, I'm planning to replace the Text Encoding submenu with a single menu item Override Text Encoding that performs the action currently performed by the item Automatic in the existing submenu.

If successful, this will result in the removal of CharsetMenu.jsm from m-c. I don't expect this to happen in the 89 cycle, but starting from 90, it could happen.

I also intend to remove the docshell and parser capabilities of dealing with overrides other than the automatic one, but my undestanding is that mailnews doesn't use that part.

I recommend that c-c replace usage of CharsetMenu.jsm with a single menu item that triggers the detector (or makes its own copy of CharsetMenu.jsm assuming that mailnews indeed handles the menu selections in mailnews code and doesn't propagate them to the docshell).

Depends on: 1687635
Priority: -- → P1

Martin, can you take this on.

Assignee: nobody → martin

What's our approach here? Do we want to follow and only offer auto or should I preserve the full charset selection?

Flags: needinfo?(mkmelin+mozilla)

I think only auto should be fine.

Flags: needinfo?(mkmelin+mozilla)
Status: NEW → ASSIGNED

Replaces all text encoding/character set menu items and buttons to only allow forcing the encoding.

Here's a test message with UTF-8 encoded Spanish text. The message is displayed incorrectly since the charset is incorrectly stated as windows-1252. With the patch, "View > Override Text Encoding" gives a console error. This is not a surprise since "Automatic" charset (without the patch) already doesn't work with the same console error. It would be good to fix the automatic/override detection before withdrawing the charset menu since otherwise there wouldn't be any way to correct an incorrect charset.

On the positive side, it all seems to work on the the message source.

Further to comment #5: Once you can correct the charset of the message, you need to test that reply/forward/"edit as new" will use the corrected charset.

(In reply to José M. Muñoz from comment #5)

With the patch, "View > Override Text Encoding"

On the Firefox side, the verb changed from "Override" to "Repair". See the comments on bug 1687635 for reasoning to decide if you want Thunderbird to match.

I deliberately didn't remove CharsetMenu.jsm from m-c, yet, but I intend to do so in a follow-up if bug 1687635 sticks.

Well, the main issue is that automatic/override/repair doesn't work in Thunderbird. This assignment msgWindow.mailCharacterSet = "_autodetect_all"; throws.

(In reply to José M. Muñoz from comment #8)

Well, the main issue is that automatic/override/repair doesn't work in Thunderbird. This assignment msgWindow.mailCharacterSet = "_autodetect_all"; throws.

It's probably better to indicate the request to detect in a way that doesn't use a magic encoding name if at all feasible.

Attached patch D116241.diffSplinter Review

How about this instead? Short-circuits the automatic handling inside the C++ code.

BTW, the patch needs clang-format.

Attachment #9224042 - Flags: feedback?(martin)

Comment on attachment 9224042 [details] [diff] [review]
D116241.diff

That's absolutely an option if the call order is not relevant (and the charset of the docshell is correct), which I was not able to assertain.

Attachment #9224042 - Flags: feedback?(martin) → feedback+

Martin, thanks for looking at the patch. Now that automatic detection works "in principle", I did the test the other way around. Encode the e-mail in windows-1252 and state that it's UTF-8. Sadly in this case the detection doesn't work, it's detected as UTF-8 although all the highbit characters are invalid UTF-8, so they are displayed as replacement characters <?>.

I thought you have pretty good heuristics (see for example bug 1602816 comment #34 and below), so how come some Spanish/Western text is not detected, Henri? Is there a bug somewhere?

Unless there is a bug somewhere, that leads me to believe that removing the charset menu is not a good idea since the detection doesn't allow the user to correct/repair the display.

(Forgot the NI.)
Firefox 78 ESR displays the attachment correctly and the charset menu says "Western". So maybe mDocShell->GetCharset(aResultingCharacterSet); isn't correct in the patch?

Flags: needinfo?(hsivonen)

Here's another example where detection doesn't work. This is in Shift_JIS. Unfortunately this is already broken in TB 78. That version already has Japanese auto-detection (between Shift_JIS, ISO-2022-JP and maybe others), and it looks like msgWindow.mailCharacterSet is set to "Japanese" which is not an encoding name and hence throws. Anyway, that will be fixed now.

There is test_detectAttachmentCharset.js and the sample data there could be turned into test messages since we know that the auto-detection works for that data, so it should also work for mail bodies.

As I understand it, mailnews always feeds UTF-8 to code living in m-c, so it's expected that a docshell reports UTF-8 as its encoding and it's expected that trying to ask a docshell to deal with _autodetect_all won't work.

Instead, the menu item should make mailnews code disregard the charset parameter from the message so that the mailnews-side invocation of chardetng happens. I'm not sure which one of these is the relevant code path that the UI should aim to activate:
https://searchfox.org/comm-central/rev/4e5a1cfe83ae0b7eefdbe553f5cb56304b0e7d25/mailnews/compose/src/nsMsgCompUtils.cpp#75
https://searchfox.org/comm-central/rev/4e5a1cfe83ae0b7eefdbe553f5cb56304b0e7d25/mailnews/mime/src/comi18n.cpp#37

Flags: needinfo?(hsivonen)

Right, Shift_JIS was also reported as UTF-8 with the existing patch. What you describe is not so straight forward to implement. So far, the menu sets the charset the user selected, and then displays the message again with that charset, which goes through some MIME processing and ultimately pumps the content into the "code living in m-c", maybe after a conversion to UTF-8 (or UTF-16, see below). We saw that this approach doesn't work for _autodetect_all or Japanese. So instead you're saying that the relevant content that goes into the docshell needs to be run through the detector first, which guesses the charset. Then the conversion needs to happen before pumping the content into the display. Maybe somewhere around here:
https://searchfox.org/comm-central/search?q=converttounicode&path=mime&case=false&regexp=false

Are you sure that UTF-8 is fed into M-C, wouldn't it just be an nsString and hence UTF-16?

(In reply to José M. Muñoz from comment #17)

What you describe is not so straight forward to implement. So far, the menu sets the charset the user selected, and then displays the message again with that charset, which goes through some MIME processing and ultimately pumps the content into the "code living in m-c", maybe after a conversion to UTF-8 (or UTF-16, see below). We saw that this approach doesn't work for _autodetect_all or Japanese.

In the case of Firefox, the encoding override feature is used the most in Japan and whenever there's a reduction in functionality, someone files a bug from a Japanese point of view. If you've shipped with "Japanese" not working for a while and no one has filed a bug, perhaps you could get away with not having this feature at all. After all, emails are generated in a rather different way than Web pages, so if there is a declared encoding, it's probably very rare for it to be wrong for non-spam email.

If you still want to have a "Repair Text Encoding" menu item, I suggest introducing a placeholder value (preferably null or mozilla::Nothing or, failing those, _autodetect_all) that can be set as the forced charset of an email and then propagating that value to the point where the message decoding code currently decides to use mozilla::EncodingDetector for messages that didn't have a charset to begin with.

Are you sure that UTF-8 is fed into M-C, wouldn't it just be an nsString and hence UTF-16?

I'm not 100% sure, but you said

Right, Shift_JIS was also reported as UTF-8 with the existing patch.

which matches my understanding that the HTML parser in m-c sees email content always as UTF-8 with the conversion having been performed on the c-c side.

Thanks for the comments. I have the impression that Japanese users file very few bugs unless they are part of the community. Besides, there is no complete triaging done for Thunderbird bugs. So the absence or non-detection of a bug should be an argument of removing a broken feature instead of fixing it. In my experience there is always some confusion with "Chinese" (and also Eastern European/Cyrillic) charsets and the user needs a way to correct that, and even if it were to make a SPAM mail readable. Changing the charset also affects the subject display, and subjects are sometimes badly encoded, meaning that some non-SPAM messages use raw windows-1252. Repairing a charset is not a common operation, but unless there is Telemetry to prove that the feature isn't used at all, it shouldn't be removed. If anything, the non-working automatic detection should be removed and the product should stick with the manual setting as it was present for a long time up to the current TB 78 (apart from broken Japanese).

Looking at the code for the "view source" window, that had:

      if (charset == "Japanese") {
        charset = "Shift_JIS";
      }
      gBrowser.characterSet = charset;

and now has gBrowser.characterSet = "_autodetect_all". So maybe a similar approach can be used for mail display without having to dig through the MIME code (or maybe that's wishful thinking).

OK, so the issue was reported in bug 1677104 but there was no follow-up :-(

(In reply to José M. Muñoz from comment #19)

Looking at the code for the "view source" window, that had:

      if (charset == "Japanese") {
        charset = "Shift_JIS";
      }
      gBrowser.characterSet = charset;

Note that the m-c-side code that made this work is going to go away as part of the dead code cleanup after bug 1687635.

and now has gBrowser.characterSet = "_autodetect_all". So maybe a similar approach can be used for mail display without having to dig through the MIME code (or maybe that's wishful thinking).

It's wishful thinking. If you use _autodetect_all as a placeholder value, you'll need to handle the placeholder value somewhere in MIME code. Moreover, any code in between that carries a mozilla::Encoding* instead of a string will be unable to represent _autodetect_all unless you map it to nullptr.

(m-c handles _autodetect_all in the docshell, but, as noted, that's too late for email and that code will go away as part of the cleanup after bug 1687635.)

If anything, the non-working automatic detection should be removed and the product should stick with the manual setting as it was present for a long time up to the current TB 78 (apart from broken Japanese).

That's, indeed, one option (replacing "Japanese" with the old three items for ISO-2022-JP, Shift_JIS, and EUC-JP).

In view of all the complication here, IMHO it would be best to fork the charset menu into C-C and fix bug 1677104 at the same time. Auto-detection can then be implemented in a follow-up bug without pressure.

(In reply to José M. Muñoz from comment #22)

In view of all the complication here, IMHO it would be best to fork the charset menu into C-C and fix bug 1677104 at the same time. Auto-detection can then be implemented in a follow-up bug without pressure.

Makes sense. The old access keys, etc., for reference:
https://hg.mozilla.org/mozilla-central/file/8b074e2a3b68a339928e5b9f1d52f712e4bfce44/toolkit/modules/CharsetMenu.jsm#l64
https://hg.mozilla.org/mozilla-central/file/8b074e2a3b68a339928e5b9f1d52f712e4bfce44/toolkit/locales/en-US/chrome/global/charsetMenu.properties#l107

I'm not quite convinced we need to go that far. I would be inclined to think that with one single report (maybe one or two hiding) the usefulness of this is extremely limited. I don't think there's telemetry to be had: the feature per definition is real edge case, and how many people will know about it + be knowledgeable to use it + actually bother to use it + have a real need to add it. You'll get super small numbers even if you figured out how to track that.

I think we could go with the patches we have which fixes it for the cases it's not explicitly wrongly declared, AIUI.

I think we could go with the patches we have ...

Please consider that as discussed in the preceding comments, for message display particularly the second patch is wrong (quote: "trying to ask a docshell to deal with _autodetect_all won't work") and fixes nothing, neither a wrong nor a missing charset. Did you try it during the review? It alo doesn't fix bug 1677104. Exercising the menu will force the display to UTF-8 which is already the default for a missing charset. The only case that will be accidentally repaired is a UTF-8 message with wrong charset. So basically you're presenting the user with a mostly useless menu item labelled "Repair".

BTW, isn't telemetry there to replace "I would be inclined to think" with hard facts? In general, it would be good to know what charsets are occurring in email and how often the user fixes something. There is also support for UTF-7 which might be dropped as some stage. Doesn't Firefox have telemetry on charsets of web pages? Why is this any different for email?

I think we could go with the patches we have which fixes it for the cases it's not explicitly wrongly declared, AIUI.

If it's not explicitly wrongly declared you don't need the feature in the first place. This is only for emails where the declared charset does not match the content charset. Using the "force" leads to some part in the mail loading pipeline to use sniffing instead of the encoding header to decide the source charset.

So maybe a similar approach can be used for mail display without having to dig through the MIME code (or maybe that's wishful thinking).

The issue isn't mail displaying, the issue is the composing of messages that contain excerpts or the full content of a wrongly declared message.

Regarding other discussion, yes, the display of mails is always in UTF-8. There is some component somewhere that leads to the charset conversion, however that component evidently understands _autodetect_all, so it must be using the m-c code, or at least depend on parts of it.

I would propose to change the primary patch to not throw for the auto detection while leaving the reply/forward encoding broken and filing a follow up for that (so discarding the second patch).

Any issues with the auto detection itself not finding the correct encoding should also be handled separately, and are actually an upstream issue.

(In reply to José M. Muñoz from comment #25)
I don't think Telemetry could help. Like I said, the numbers, whatever they are, will be super small. Since it's a menu, a small amount of users will click on it mistakenly or just to experiment. If say 0.001% of our users ever used it, how many did it make any good for? How many were mis-clicks. It's not possible to tell. Would it be a justified priority even if 0.1% ever used it (which would be shockingly high)?

(In reply to Martin Giger [:freaktechnik] from comment #26)

Any issues with the auto detection itself not finding the correct encoding should also be handled separately, and are actually an upstream issue.

Apparently there are no issues with auto-detection as long as it's called properly at the right point in the display pipeline. test_detectAttachmentCharset.js shows that Japanese, Russian charsets and windows-1252 are detected correctly.

The issue isn't mail displaying, ...

I don't see how it isn't a display problem. If the message is displayed correctly via repair/detect/manual/whatever, then any subsequent operation will use the override and hence work.

(In reply to José M. Muñoz from comment #25)

There is also support for UTF-7 which might be dropped as some stage.

Perhaps I'm too jaded, but my expectation is that you won't get to drop that one. Especially not if you want to keep Thunderbird capable of reading the user's old already-received emails. :-(

Why is this any different for email?

Web page authoring is often detached from server configuration, so the content and the HTTP headers come from very different places. In contrast, email content and MIME headers are typically generated by the same program.

(In reply to Martin Giger [:freaktechnik] from comment #26)

If it's not explicitly wrongly declared you don't need the feature in the first place. This is only for emails where the declared charset does not match the content charset.

Indeed.

The issue isn't mail displaying, the issue is the composing of messages that contain excerpts or the full content of a wrongly declared message.

If the email displayed correctly, shouldn't Thunderbird always take care of it getting quoted correctly?

Regarding other discussion, yes, the display of mails is always in UTF-8. There is some component somewhere that leads to the charset conversion, however that component evidently understands _autodetect_all, so it must be using the m-c code, or at least depend on parts of it.

The only place that understands _autodetect_all is the nsDocShell::SetCharset. It's too late for email and is going away in bug 1713627.

The options here are:

  1. Not having the feature at all.
  2. Forking CharsetMenu.jsm, restoring the manual items for ISO-2022-JP, Shift_JIS, and EUC-JP, and removing the detector-related items Automatic and Japanese.
  3. Doing the work, on c-c side, to propagate a "force autodetection" flag through the MIME code to the point where the MIME code currently decides whether to use the detector (and currently uses it only in the absence of a declared charset).

(In reply to Magnus Melin [:mkmelin] from comment #27)

I don't think Telemetry could help.

FWIW, in the case of Firefox, my inclination to keep some form of encoding override is mainly based on Japanese-context bug reports indicating that people still need it combined with the telemetry that the level of usage in Japan is higher than elsewhere and on a different magnitude compared to my frame of experience. I don't have a good way to judge the specific telemetry number, though.

If you were to add telemetry, I suggest two things for analyzing it:

  1. Instead of looking at the use count, look at one in how many distinct telemetry submitters used the feature at least once over some period of time (e.g. a month). For a low-usage trouble workaround feature this kind of analysis gives a better idea of how relevant it is for users than a use count.
  2. It's hard to have intuition of where to draw the line. It would be useful to benchmark the result against the usage of some other peripheral feature that you believe is worth keeping. (On the Firefox side, this analysis is missing, and instead, as noted, I'm going by the bug reports being indicative.)

For Firefox, the most recent numbers that have been cleared for publication are in bug 1702914 comment 14.

It's too late for email and is going away in bug 1713627.

If I understand if correctly, _autodetect_all will go away(?). Then this code in the first patch in viewSource.js

  onForceCharacterSet() {
    gBrowser.characterSet = "_autodetect_all";

will also stop working, right?

So in summary, to deliver anything working to the user, for "view source" and message display/reply/forward, etc., forking CharsetMenu.jsm appears to be the only semi-painless option.

(In reply to José M. Muñoz from comment #30)

It's too late for email and is going away in bug 1713627.

If I understand if correctly, _autodetect_all will go away(?).

Yes.

Then this code in the first patch in viewSource.js

  onForceCharacterSet() {
    gBrowser.characterSet = "_autodetect_all";

will also stop working, right?

Yes, but once _autodetect_all goes away there will be a new method on the docshell that View Source can call in this case.

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/a84d70e678b4
Only allow forcing auto charset detection. r=mkmelin

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Attachment #9224002 - Attachment is obsolete: true
Blocks: 1713786
Target Milestone: --- → 91 Branch

Excuse the question, but what was the point of committing something completely broken and dysfunctional?

I believe that a text encoding menu is necessary, but I do not think that telemetry is useful.
This menu was not working over half a year. People do not click on the menu that is known to not work.

Regressions: 1716059

(In reply to Magnus Melin [:mkmelin] from comment #3)

I think only auto should be fine.

As we saw, this is a bad choice since "auto" only works when the message has CTE 8bit. For CTE QP it can't auto-detect since it's all ASCII, see bug 1713786 comment #13 and bug 1716059 for details.

Apologies, ignore that comment. By the time the message is displayed, MIME will already have decoded CTE QP and presented 8bit to the display engine.

Sorry, I need to correct this again. Auto-detection doesn't work at all, see bug 1713786, and it's questionable how to fix that. In any case, QP needs to be decoded so the charset detector has a chance to pick a charset for the 8bit data. Architecturally it was a lot easier to let the user pick a charset and then just apply that, for better of for worse.

Even if auto-detection "worked", it would not obviate the need to see what character set is used, and even to use a different one, e.g. when a message has content in multiple character sets.

Also, "Repair" is misleading IMO, because (and correct me if I'm wrong) using that button doesn't actually change the message's headers to state a different charset, right? And again, in the multi-charset case, or in the case of messages with invalid octet sequences, there is never a full "repair". It sounds like the FF people decided to "dumb down" the UI some more.

See also my comment in bug 1687635, which I won't repeat

Ok, so, apparently I do need to repeat that comment, because that page is basically not about TB. So,

Thunderbird should have retained the charset/text encoding selection submenu; and in the current situation, should reintroduce it:

  • The removal prevents you from knowing which character set was used to display the message (although TBH this was already too vague with the existing UI)
  • The removal prevents you from setting a character set you believe you need (especially relevant for messages with content in multiple character sets, where there is no single valid setting). And yes, this is necessary even if "Repair" actually does something; you might not agree with its choices.
You need to log in before you can comment on or make changes to this bug.