Prepare for m-c removal of CharsetMenu.jsm
Categories
(MailNews Core :: Internationalization, task, P1)
Tracking
(thunderbird_esr78 unaffected)
People
(Reporter: hsivonen, Assigned: freaktechnik)
References
Details
Attachments
(5 files, 1 obsolete file)
In bug 1687635, I'm planning to replace the Text Encoding submenu with a single menu item Override Text Encoding that performs the action currently performed by the item Automatic in the existing submenu.
If successful, this will result in the removal of CharsetMenu.jsm from m-c. I don't expect this to happen in the 89 cycle, but starting from 90, it could happen.
I also intend to remove the docshell and parser capabilities for dealing with overrides other than the automatic one, but my understanding is that mailnews doesn't use that part.
I recommend that c-c replace usage of CharsetMenu.jsm with a single menu item that triggers the detector (or makes its own copy of CharsetMenu.jsm, assuming that mailnews indeed handles the menu selections in mailnews code and doesn't propagate them to the docshell).
Comment 2 • 2 years ago (Assignee)
What's our approach here? Do we want to follow and only offer auto or should I preserve the full charset selection?
Updated • 2 years ago (Assignee)
Comment 4 • 2 years ago (Assignee)
Replaces all text encoding/character set menu items and buttons to only allow forcing the encoding.
Comment 5 • 2 years ago
Here's a test message with UTF-8 encoded Spanish text. The message is displayed incorrectly since the charset is incorrectly stated as windows-1252. With the patch, "View > Override Text Encoding" gives a console error. This is not a surprise since "Automatic" charset (without the patch) already doesn't work with the same console error. It would be good to fix the automatic/override detection before withdrawing the charset menu since otherwise there wouldn't be any way to correct an incorrect charset.
On the positive side, it all seems to work on the message source.
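The failure mode in this test message can be reproduced outside Thunderbird. This is a minimal Python sketch (illustrative only, not Thunderbird code) of what happens when UTF-8 bytes are decoded under a wrongly declared windows-1252 charset:

```python
# UTF-8-encoded Spanish text, as in the attached test message
utf8_bytes = "año".encode("utf-8")   # b"a\xc3\xb1o"

# A reader that trusts the (wrong) windows-1252 declaration produces mojibake:
# each UTF-8 continuation byte is shown as a separate windows-1252 character.
mojibake = utf8_bytes.decode("windows-1252")
print(mojibake)                      # aÃ±o
```

This is why the message displays incorrectly yet without errors: every byte is a valid windows-1252 code point, so nothing in the decoding step fails.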
Comment 6 • 2 years ago
Further to comment #5: Once you can correct the charset of the message, you need to test that reply/forward/"edit as new" will use the corrected charset.
Comment 7 • 2 years ago (Reporter)
(In reply to José M. Muñoz from comment #5)
> With the patch, "View > Override Text Encoding"
On the Firefox side, the verb changed from "Override" to "Repair". See the comments on bug 1687635 for reasoning to decide if you want Thunderbird to match.
I deliberately didn't remove CharsetMenu.jsm from m-c yet, but I intend to do so in a follow-up if bug 1687635 sticks.
Comment 8 • 2 years ago
Well, the main issue is that automatic/override/repair doesn't work in Thunderbird. This assignment throws:
msgWindow.mailCharacterSet = "_autodetect_all";
Comment 9 • 2 years ago (Reporter)
(In reply to José M. Muñoz from comment #8)
> Well, the main issue is that automatic/override/repair doesn't work in Thunderbird. This assignment
> msgWindow.mailCharacterSet = "_autodetect_all";
> throws.
It's probably better to indicate the request to detect in a way that doesn't use a magic encoding name if at all feasible.
Comment 10 • 2 years ago (Assignee)
Depends on D115908
Comment 11 • 2 years ago
How about this instead? Short-circuits the automatic handling inside the C++ code.
BTW, the patch needs clang-format.
Comment 12 • 2 years ago (Assignee)
Comment on attachment 9224042 [details] [diff] [review]
D116241.diff
That's absolutely an option if the call order is not relevant (and the charset of the docshell is correct), which I was not able to ascertain.
Comment 13 • 2 years ago
Martin, thanks for looking at the patch. Now that automatic detection works "in principle", I did the test the other way around: encode the e-mail in windows-1252 and state that it's UTF-8. Sadly, in this case the detection doesn't work: it's detected as UTF-8 although all the high-bit characters are invalid UTF-8, so they are displayed as replacement characters <?>.
I thought you have pretty good heuristics (see for example bug 1602816 comment #34 and below), so how come some Spanish/Western text is not detected, Henri? Is there a bug somewhere?
Unless there is a bug somewhere, that leads me to believe that removing the charset menu is not a good idea since the detection doesn't allow the user to correct/repair the display.
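The reverse mislabeling described here can also be reproduced outside Thunderbird. A minimal Python sketch (illustrative only, not the detector code) shows why invalid UTF-8 should actually be an easy signal for a detector:

```python
# windows-1252-encoded Spanish text that is wrongly declared as UTF-8
w1252_bytes = "año".encode("windows-1252")   # b"a\xf1o"

# 0xF1 is a UTF-8 lead byte with no valid continuation byte after it, so a
# strict decoder rejects the stream; a lenient one substitutes U+FFFD, which
# renders as the <?> replacement character mentioned above.
decoded = w1252_bytes.decode("utf-8", errors="replace")
print(decoded)                               # a�o
```

Since the byte stream is provably invalid UTF-8, a detector that still reports UTF-8 for it is surprising, which supports the suspicion that the wrong data is being fed to the detector rather than the heuristics being bad.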
Comment 14 • 2 years ago
(Forgot the NI.)
Firefox 78 ESR displays the attachment correctly and the charset menu says "Western". So maybe mDocShell->GetCharset(aResultingCharacterSet); isn't correct in the patch?
Comment 15 • 2 years ago
Here's another example where detection doesn't work. This is in Shift_JIS. Unfortunately this is already broken in TB 78. That version already has Japanese auto-detection (between Shift_JIS, ISO-2022-JP and maybe others), and it looks like msgWindow.mailCharacterSet is set to "Japanese", which is not an encoding name and hence throws. Anyway, that will be fixed now.
There is test_detectAttachmentCharset.js and the sample data there could be turned into test messages since we know that the auto-detection works for that data, so it should also work for mail bodies.
Comment 16 • 2 years ago (Reporter)
As I understand it, mailnews always feeds UTF-8 to code living in m-c, so it's expected that a docshell reports UTF-8 as its encoding and that trying to ask a docshell to deal with _autodetect_all won't work.
Instead, the menu item should make mailnews code disregard the charset parameter from the message so that the mailnews-side invocation of chardetng happens. I'm not sure which one of these is the relevant code path that the UI should aim to activate:
https://searchfox.org/comm-central/rev/4e5a1cfe83ae0b7eefdbe553f5cb56304b0e7d25/mailnews/compose/src/nsMsgCompUtils.cpp#75
https://searchfox.org/comm-central/rev/4e5a1cfe83ae0b7eefdbe553f5cb56304b0e7d25/mailnews/mime/src/comi18n.cpp#37
Comment 17 • 2 years ago
Right, Shift_JIS was also reported as UTF-8 with the existing patch. What you describe is not so straightforward to implement. So far, the menu sets the charset the user selected, and then displays the message again with that charset, which goes through some MIME processing and ultimately pumps the content into the "code living in m-c", maybe after a conversion to UTF-8 (or UTF-16, see below). We saw that this approach doesn't work for _autodetect_all or Japanese. So instead you're saying that the relevant content that goes into the docshell needs to be run through the detector first, which guesses the charset. Then the conversion needs to happen before pumping the content into the display. Maybe somewhere around here:
https://searchfox.org/comm-central/search?q=converttounicode&path=mime&case=false&regexp=false
Are you sure that UTF-8 is fed into M-C, wouldn't it just be an nsString and hence UTF-16?
Comment 18 • 2 years ago (Reporter)
(In reply to José M. Muñoz from comment #17)
> What you describe is not so straightforward to implement. So far, the menu sets the charset the user selected, and then displays the message again with that charset, which goes through some MIME processing and ultimately pumps the content into the "code living in m-c", maybe after a conversion to UTF-8 (or UTF-16, see below). We saw that this approach doesn't work for _autodetect_all or Japanese.
In the case of Firefox, the encoding override feature is used the most in Japan and whenever there's a reduction in functionality, someone files a bug from a Japanese point of view. If you've shipped with "Japanese" not working for a while and no one has filed a bug, perhaps you could get away with not having this feature at all. After all, emails are generated in a rather different way than Web pages, so if there is a declared encoding, it's probably very rare for it to be wrong for non-spam email.
If you still want to have a "Repair Text Encoding" menu item, I suggest introducing a placeholder value (preferably null or mozilla::Nothing or, failing those, _autodetect_all) that can be set as the forced charset of an email and then propagating that value to the point where the message decoding code currently decides to use mozilla::EncodingDetector for messages that didn't have a charset to begin with.
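The placeholder idea can be sketched in Python. Everything here is a hypothetical stand-in (decode_body, detect_and_decode, and the trivial two-step detector are not Thunderbird's MIME code or mozilla::EncodingDetector); it only illustrates the flow of a sentinel value through the decoding decision:

```python
FORCE_DETECT = None  # placeholder value: "ignore the declared charset, detect"

def detect_and_decode(raw: bytes) -> str:
    # Crude stand-in for a real detector: accept valid UTF-8,
    # otherwise fall back to windows-1252.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("windows-1252", errors="replace")

def decode_body(raw: bytes, charset):
    # charset is the declared/forced charset; FORCE_DETECT is the placeholder
    # that a "Repair Text Encoding" menu item would set.
    if charset is FORCE_DETECT:
        return detect_and_decode(raw)
    return raw.decode(charset, errors="replace")

# A windows-1252 body wrongly declared as UTF-8 is repaired by the placeholder:
print(decode_body(b"a\xf1o", "utf-8"))       # broken display with U+FFFD
print(decode_body(b"a\xf1o", FORCE_DETECT))  # año
```

The point of using None rather than a magic encoding name is that the placeholder can never collide with, or be mistakenly passed to, code that expects a real encoding label.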
> Are you sure that UTF-8 is fed into M-C, wouldn't it just be an nsString and hence UTF-16?
I'm not 100% sure, but you said
> Right, Shift_JIS was also reported as UTF-8 with the existing patch.
which matches my understanding that the HTML parser in m-c sees email content always as UTF-8 with the conversion having been performed on the c-c side.
Comment 19 • 2 years ago
Thanks for the comments. I have the impression that Japanese users file very few bugs unless they are part of the community. Besides, there is no complete triaging done for Thunderbird bugs. So the absence or non-detection of a bug shouldn't be an argument for removing a broken feature instead of fixing it. In my experience there is always some confusion with "Chinese" (and also Eastern European/Cyrillic) charsets, and the user needs a way to correct that, even if it were only to make a spam mail readable. Changing the charset also affects the subject display, and subjects are sometimes badly encoded, meaning that some non-spam messages use raw windows-1252. Repairing a charset is not a common operation, but unless there is Telemetry to prove that the feature isn't used at all, it shouldn't be removed. If anything, the non-working automatic detection should be removed and the product should stick with the manual setting as it was present for a long time up to the current TB 78 (apart from broken Japanese).
Looking at the code for the "view source" window, that had:
if (charset == "Japanese") {
charset = "Shift_JIS";
}
gBrowser.characterSet = charset;
and now has gBrowser.characterSet = "_autodetect_all". So maybe a similar approach can be used for mail display without having to dig through the MIME code (or maybe that's wishful thinking).
Comment 20 • 2 years ago
OK, so the issue was reported in bug 1677104 but there was no follow-up :-(
Comment 21 • 2 years ago (Reporter)
(In reply to José M. Muñoz from comment #19)
> Looking at the code for the "view source" window, that had:
> if (charset == "Japanese") { charset = "Shift_JIS"; } gBrowser.characterSet = charset;
Note that the m-c-side code that made this work is going to go away as part of the dead code cleanup after bug 1687635.
> and now has gBrowser.characterSet = "_autodetect_all". So maybe a similar approach can be used for mail display without having to dig through the MIME code (or maybe that's wishful thinking).
It's wishful thinking. If you use _autodetect_all as a placeholder value, you'll need to handle the placeholder value somewhere in MIME code. Moreover, any code in between that carries a mozilla::Encoding* instead of a string will be unable to represent _autodetect_all unless you map it to nullptr.
(m-c handles _autodetect_all in the docshell, but, as noted, that's too late for email and that code will go away as part of the cleanup after bug 1687635.)
> If anything, the non-working automatic detection should be removed and the product should stick with the manual setting as it was present for a long time up to the current TB 78 (apart from broken Japanese).
That's, indeed, one option (replacing "Japanese" with the old three items for ISO-2022-JP, Shift_JIS, and EUC-JP).
Comment 22 • 2 years ago
In view of all the complication here, IMHO it would be best to fork the charset menu into C-C and fix bug 1677104 at the same time. Auto-detection can then be implemented in a follow-up bug without pressure.
Comment 23 • 2 years ago (Reporter)
(In reply to José M. Muñoz from comment #22)
> In view of all the complication here, IMHO it would be best to fork the charset menu into C-C and fix bug 1677104 at the same time. Auto-detection can then be implemented in a follow-up bug without pressure.
Makes sense. The old access keys, etc., for reference:
https://hg.mozilla.org/mozilla-central/file/8b074e2a3b68a339928e5b9f1d52f712e4bfce44/toolkit/modules/CharsetMenu.jsm#l64
https://hg.mozilla.org/mozilla-central/file/8b074e2a3b68a339928e5b9f1d52f712e4bfce44/toolkit/locales/en-US/chrome/global/charsetMenu.properties#l107
Comment 24 • 2 years ago
I'm not quite convinced we need to go that far. I would be inclined to think that with one single report (maybe one or two hiding) the usefulness of this is extremely limited. I don't think there's telemetry to be had: the feature is by definition a real edge case, and how many people will know about it, be knowledgeable enough to use it, actually bother to use it, and have a real need for it? You'll get super small numbers even if you figured out how to track that.
I think we could go with the patches we have, which fix it for the cases where it's not explicitly wrongly declared, AIUI.
Comment 25 • 2 years ago
> I think we could go with the patches we have ...
Please consider that, as discussed in the preceding comments, for message display in particular the second patch is wrong (quote: "trying to ask a docshell to deal with _autodetect_all won't work") and fixes nothing, neither a wrong nor a missing charset. Did you try it during the review? It also doesn't fix bug 1677104. Exercising the menu will force the display to UTF-8, which is already the default for a missing charset. The only case that will be accidentally repaired is a UTF-8 message with a wrong charset. So basically you're presenting the user with a mostly useless menu item labelled "Repair".
BTW, isn't telemetry there to replace "I would be inclined to think" with hard facts? In general, it would be good to know what charsets occur in email and how often the user fixes something. There is also support for UTF-7, which might be dropped at some stage. Doesn't Firefox have telemetry on charsets of web pages? Why is this any different for email?
Comment 26 • 2 years ago (Assignee)
> I think we could go with the patches we have which fixes it for the cases it's not explicitly wrongly declared, AIUI.
If it's not explicitly wrongly declared, you don't need the feature in the first place. This is only for emails where the declared charset does not match the content charset. Using the "force" leads some part of the mail loading pipeline to use sniffing instead of the encoding header to decide the source charset.
> So maybe a similar approach can be used for mail display without having to dig through the MIME code (or maybe that's wishful thinking).
The issue isn't mail displaying, the issue is the composing of messages that contain excerpts or the full content of a wrongly declared message.
Regarding other discussion, yes, the display of mails is always in UTF-8. There is some component somewhere that leads to the charset conversion; however, that component evidently understands _autodetect_all, so it must be using the m-c code, or at least depend on parts of it.
I would propose to change the primary patch to not throw for the auto-detection while leaving the reply/forward encoding broken and filing a follow-up for that (so discarding the second patch).
Any issues with the auto detection itself not finding the correct encoding should also be handled separately, and are actually an upstream issue.
Comment 27 • 2 years ago
(In reply to José M. Muñoz from comment #25)
I don't think Telemetry could help. Like I said, the numbers, whatever they are, will be super small. Since it's a menu, a small number of users will click on it mistakenly or just to experiment. If say 0.001% of our users ever used it, how many did it do any good for? How many were mis-clicks? It's not possible to tell. Would it be a justified priority even if 0.1% ever used it (which would be shockingly high)?
Comment 28 • 2 years ago
(In reply to Martin Giger [:freaktechnik] from comment #26)
> Any issues with the auto detection itself not finding the correct encoding should also be handled separately, and are actually an upstream issue.
Apparently there are no issues with auto-detection as long as it's called properly at the right point in the display pipeline. test_detectAttachmentCharset.js shows that Japanese, Russian charsets and windows-1252 are detected correctly.
> The issue isn't mail displaying, ...
I don't see how it isn't a display problem. If the message is displayed correctly via repair/detect/manual/whatever, then any subsequent operation will use the override and hence work.
Comment 29 • 2 years ago (Reporter)
(In reply to José M. Muñoz from comment #25)
> There is also support for UTF-7 which might be dropped at some stage.
Perhaps I'm too jaded, but my expectation is that you won't get to drop that one. Especially not if you want to keep Thunderbird capable of reading the user's old already-received emails. :-(
> Why is this any different for email?
Web page authoring is often detached from server configuration, so the content and the HTTP headers come from very different places. In contrast, email content and MIME headers are typically generated by the same program.
(In reply to Martin Giger [:freaktechnik] from comment #26)
> If it's not explicitly wrongly declared you don't need the feature in the first place. This is only for emails where the declared charset does not match the content charset.
Indeed.
> The issue isn't mail displaying, the issue is the composing of messages that contain excerpts or the full content of a wrongly declared message.
If the email displayed correctly, shouldn't Thunderbird always take care of it getting quoted correctly?
> Regarding other discussion, yes, the display of mails is always in UTF-8. There is some component somewhere that leads to the charset conversion, however that component evidently understands _autodetect_all, so it must be using the m-c code, or at least depend on parts of it.
The only place that understands _autodetect_all is nsDocShell::SetCharset. It's too late for email and is going away in bug 1713627.
The options here are:
- Not having the feature at all.
- Forking CharsetMenu.jsm, restoring the manual items for ISO-2022-JP, Shift_JIS, and EUC-JP, and removing the detector-related items Automatic and Japanese.
- Doing the work, on the c-c side, to propagate a "force autodetection" flag through the MIME code to the point where the MIME code currently decides whether to use the detector (and currently uses it only in the absence of a declared charset).
(In reply to Magnus Melin [:mkmelin] from comment #27)
> I don't think Telemetry could help.
FWIW, in the case of Firefox, my inclination to keep some form of encoding override is mainly based on Japanese-context bug reports indicating that people still need it combined with the telemetry that the level of usage in Japan is higher than elsewhere and on a different magnitude compared to my frame of experience. I don't have a good way to judge the specific telemetry number, though.
If you were to add telemetry, I suggest two things for analyzing it:
- Instead of looking at the use count, look at what fraction of distinct telemetry submitters used the feature at least once over some period of time (e.g. a month). For a low-usage trouble-workaround feature, this kind of analysis gives a better idea of how relevant it is for users than a use count.
- It's hard to have intuition of where to draw the line. It would be useful to benchmark the result against the usage of some other peripheral feature that you believe is worth keeping. (On the Firefox side, this analysis is missing, and instead, as noted, I'm going by the bug reports being indicative.)
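The distinct-submitters analysis suggested above can be illustrated with made-up numbers (the event data, the population size, and the resulting figures are all hypothetical):

```python
# Hypothetical telemetry events: one (submitter_id, month) pair per menu use.
events = [
    ("u1", "2021-05"), ("u1", "2021-05"), ("u1", "2021-05"),  # one heavy user
    ("u2", "2021-05"),
]
total_submitters = 1000  # made-up population reporting telemetry that month

use_count = len(events)                           # 4: inflated by the heavy user
distinct_users = len({uid for uid, _ in events})  # 2: the more meaningful figure
fraction = distinct_users / total_submitters
print(f"1 in {round(1 / fraction)} submitters used the feature at least once")
```

The raw use count overweights a single user who repeatedly exercises the menu; the distinct-user fraction is what the benchmark against another peripheral feature would compare.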
For Firefox, the most recent numbers that have been cleared for publication are in bug 1702914 comment 14.
Comment 30 • 2 years ago
> It's too late for email and is going away in bug 1713627.
If I understand it correctly, _autodetect_all will go away(?). Then this code in the first patch in viewSource.js
onForceCharacterSet() {
gBrowser.characterSet = "_autodetect_all";
will also stop working, right?
So in summary, to deliver anything working to the user, for "view source" and message display/reply/forward, etc., forking CharsetMenu.jsm appears to be the only semi-painless option.
Comment 31 • 2 years ago (Reporter)
(In reply to José M. Muñoz from comment #30)
> It's too late for email and is going away in bug 1713627.
> If I understand it correctly, _autodetect_all will go away(?).
Yes.
> Then this code in the first patch in viewSource.js
> onForceCharacterSet() { gBrowser.characterSet = "_autodetect_all";
> will also stop working, right?
Yes, but once _autodetect_all goes away there will be a new method on the docshell that View Source can call in this case.
Comment 32 • 2 years ago
Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/a84d70e678b4
Only allow forcing auto charset detection. r=mkmelin
Updated • 2 years ago
Comment 33 • 2 years ago
Excuse the question, but what was the point of committing something completely broken and dysfunctional?
Comment 34 • 2 years ago
I believe that a text encoding menu is necessary, but I do not think that telemetry is useful.
This menu was not working for over half a year. People do not click on a menu that is known not to work.
Comment 35 • 2 years ago
(In reply to Magnus Melin [:mkmelin] from comment #3)
> I think only auto should be fine.
As we saw, this is a bad choice since "auto" only works when the message has CTE 8bit. For CTE QP it can't auto-detect since it's all ASCII, see bug 1713786 comment #13 and bug 1716059 for details.
Comment 36 • 2 years ago
Apologies, ignore that comment. By the time the message is displayed, MIME will already have decoded CTE QP and presented 8bit to the display engine.
Comment 37 • 2 years ago
Sorry, I need to correct this again. Auto-detection doesn't work at all, see bug 1713786, and it's questionable how to fix that. In any case, QP needs to be decoded so the charset detector has a chance to pick a charset for the 8-bit data. Architecturally it was a lot easier to let the user pick a charset and then just apply that, for better or for worse.
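The quoted-printable point can be illustrated with Python's stdlib quopri module (illustrative only, not the actual MIME pipeline):

```python
import quopri

# "mañana" in windows-1252, quoted-printable encoded: pure ASCII on the wire,
# so a byte-frequency detector run on the raw body has nothing to work with.
qp_wire = b"ma=F1ana"
assert all(b < 0x80 for b in qp_wire)

# After QP decoding, the 8-bit data is back and a detector can classify it.
raw = quopri.decodestring(qp_wire)     # b"ma\xf1ana"
print(raw.decode("windows-1252"))      # mañana
```

This is why detection has to run after the Content-Transfer-Encoding layer is stripped, not on the wire form of the message.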
Comment 38 • 2 years ago
Even if auto-detection "worked", it would not obviate the need to see what character set is used, and even to use a different one, e.g. when a message has content in multiple character sets.
Also, "Repair" is misleading IMO, because (and correct me if I'm wrong) using that button doesn't actually change the message's headers to state a different charset, right? And again, in the multi-charset case, or in the case of messages with invalid octet sequences, there is never a full "repair". It sounds like the FF people decided to "dumb down" the UI some more.
See also my comment in bug 1687635, which I won't repeat.
Comment 39 • 2 years ago
Ok, so, apparently I do need to repeat that comment, because that page is basically not about TB. So,
Thunderbird should have retained the charset/text encoding selection submenu; and in the current situation, should reintroduce it:
- The removal prevents you from knowing which character set was used to display the message (although TBH this was already too vague with the existing UI)
- The removal prevents you from setting a character set you believe you need (especially relevant for messages with content in multiple character sets, where there is no single valid setting). And yes, this is necessary even if "Repair" actually does something; you might not agree with its choices.