Japanese auto-detect no longer works on some generic (i.e. language-neutral) TLDs

RESOLVED FIXED in Firefox 69

Status

RESOLVED FIXED (defect, P1, normal)
Reported: 4 months ago
Last modified: 9 days ago

People

(Reporter: emk, Assigned: hsivonen)

Tracking

(Regression)

Version: unspecified
Target Milestone: mozilla69
Points: ---
Bug Flags: in-testsuite +

Firefox Tracking Flags

(firefox-esr60 unaffected, firefox67 wontfix, firefox68 wontfix, firefox69 fixed)

Details

Attachments

(6 attachments)

Steps to reproduce:

  1. Open https://emk.name/test/enter.html. (Note that this bug depends on TLDs, so I can't attach the testcase on Bugzilla.)
  2. Select hamburger menu > [More] > [Text Encoding] > [Auto-Detect] > [Japanese].

Actual result:
Mojibake.

Expected result:
Shift_JIS should be detected.

I also got mojibake on another TLD, .fam.cx, which is a dynamic DNS service.

Regression range:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=140c4f3b6b137026f6305102d1e34725225352a9&tochange=a2df400cb88c79664cd8ba5fcf6b640373a85524

Flags: needinfo?(hsivonen)

Considering that this effect is intentional and expected, is this known to cause more problems than having predominantly non-Japanese TLDs be assumed to be Shift_JIS?

Flags: needinfo?(hsivonen) → needinfo?(VYV03354)

If a Western page is mis-detected as Shift_JIS, only a few accent characters and quotation marks will be garbled. But if a Japanese page is mis-detected as windows-1252, the page will be completely unreadable. So the latter is much worse than the former for us (Japanese people).
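The asymmetry is easy to demonstrate with a quick sketch using Python's standard codecs (`cp1252` is Python's name for windows-1252; this is an editorial illustration, not code from the bug):

```python
# A Japanese page mis-decoded as windows-1252: every Shift_JIS byte lands
# on some Latin punctuation or accented letter, so nothing is readable.
jp_bytes = "こんにちは".encode("shift_jis")
garbled = jp_bytes.decode("cp1252")
print(garbled)  # ‚±‚ñ‚É‚¿‚Í

# A Western page mis-decoded as Shift_JIS: ASCII bytes are unaffected,
# so only the occasional accented character is damaged.
legible = "SEÑOR".encode("cp1252").decode("shift_jis")
print(legible)  # SEﾑOR
```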

Flags: needinfo?(VYV03354)

(In reply to Masatoshi Kimura [:emk] from comment #2)

If a Western page is mis-detected as Shift_JIS, only a few accent characters and quotation marks will be garbled. But if a Japanese page is mis-detected as windows-1252, the page will be completely unreadable. So the latter is much worse than the former for us (Japanese people).

This also applies to Chinese, Korean, Hebrew, Arabic, and Greek legacy encodings (and to a large extent to Vietnamese), and for UI localizations affiliated with those encodings, .cx has not yielded the UI localization-affiliated encoding since Firefox 30. Yet, we haven't gotten a single bug report.

As a matter of where I think we should be aiming at, I think we should be aiming at eliminating the dependency on the UI localization. .com/.org/.net currently depend on the UI localization, because those domains are assumed to have been in global use in the 1990s when ways of actually declaring the encoding weren't well-known, and so far there hasn't been a viable idea of how to eliminate the UI localization dependency for .com/.org/.net. New generic TLDs do not fall back on the UI localization-affiliated encoding, because there's no 1990s legacy for them.
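The fallback scheme described above can be sketched roughly as follows. This is an illustrative toy, not Gecko's actual tables or code, and the specific ccTLD entries are example assumptions:

```python
# Hypothetical sketch of the "base guess" for unlabeled pages: ccTLDs
# carry their own fallback, the 1990s-era generic TLDs fall back to the
# UI localization (the dependency this bug wants eliminated), and newer
# generic TLDs get windows-1252 regardless of locale.
CCTLD_FALLBACK = {"jp": "Shift_JIS", "ru": "windows-1251", "gr": "ISO-8859-7"}
LEGACY_GENERIC = {"com", "org", "net"}

def base_guess(tld: str, ui_locale_fallback: str) -> str:
    if tld in CCTLD_FALLBACK:
        return CCTLD_FALLBACK[tld]
    if tld in LEGACY_GENERIC:
        return ui_locale_fallback   # locale-dependent 1990s legacy behavior
    return "windows-1252"           # new gTLDs, generically-used ccTLDs: no legacy
```

Under this sketch, a .cx page never gets Shift_JIS as the base guess, which is why the Japanese detector (gated on a Shift_JIS base guess) no longer runs there.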

.cx is a case of a ccTLD that was used generically relatively early. In that sense, it has potential for .com/.net/.org-style long-tail legacy. However, I'd like us not to take steps backward (from the goal of eliminating the dependency on the UI localization) from one data point for Japanese when the data for Chinese, Korean, Hebrew, Arabic, and Greek since Firefox 30 suggests that there isn't a legacy pattern warranting action. (For example, I think it's bad that we decided bug 620106 from one data point.)

(I'm talking about the UI localization dependency instead of the detector setting, because I'd like us to get to a point where the Cyrillic detectors are removed and the Japanese detector runs unconditionally, without UI for the Japanese detector, when the base guess [from the .jp TLD or from the ja UI locale] is Shift_JIS.)

Why is the UI localization relevant at all? See the STR in comment #0. I reproduced it on en-US-localized Firefox.

(In reply to Henri Sivonen (:hsivonen) from comment #3)

This also applies to Chinese, Korean, Hebrew, Arabic, and Greek legacy encodings (and to a large extent to Vietnamese),

Those locales have only one dominant legacy encoding. If users see garbled text, they can just choose the single matching menu entry from the Text Encoding menu. On the other hand, we (Japanese users) have to gamble among three encodings because Auto-Detect no longer works. It is really tedious.

(In reply to Masatoshi Kimura [:emk] from comment #4)

Why is the UI localization relevant at all?

It affects the base guess for old generic TLDs (.com/.net/.org) and the Japanese detector only runs if the base guess is Shift_JIS.

I reproduced it on en-US localized Firefox.

Do you have the "Text Encoding for Legacy Content" pref set to "Japanese"? That is, are you asking for .cx to be able to gain Shift_JIS as the base guess or are you asking for the detector to run even if the base guess isn't Shift_JIS?

(In reply to Henri Sivonen (:hsivonen) from comment #3)

This also applies to Chinese, Korean, Hebrew, Arabic, and Greek legacy encodings (and to a large extent to Vietnamese),

Those locales have only one dominant legacy encoding. If users see garbled text, they can just choose the single matching menu entry from the Text Encoding menu. On the other hand, we (Japanese users) have to gamble among three encodings

Two, considering how rare ISO-2022-JP is on the Web and how easy it is to tell if mojibake is due to ISO-2022-JP decoded as something else. Central European users also have two legacy encodings and no bug reports about .cx. Greek has two, too, though the difference in the Greek case may be mild enough that the user won't try again if the first guess is the wrong one of the two.
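The "easy to tell" claim holds because ISO-2022-JP keeps all its payload bytes in the ASCII range; mis-decoded with a single-byte encoding, it degrades into plain-ASCII gibberish bracketed by escape characters (an illustrative Python sketch, with `cp1252` standing in for windows-1252):

```python
# ISO-2022-JP mis-decoded as windows-1252: the payload between the ESC
# sequences is pure ASCII ("$3$s$K$A$O"), which looks nothing like the
# accented-Latin soup produced by mis-decoded Shift_JIS or EUC-JP, so
# the culprit is obvious at a glance.
iso = "こんにちは".encode("iso2022_jp")
mis = iso.decode("cp1252")
print(repr(mis))  # '\x1b$B$3$s$K$A$O\x1b(B'
```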

because Auto-Detect no longer works. It is really tedious.

To address this concern, it seems it would be better to have a menu item "Japanese (Detect)" as in IE instead of or in addition to "Japanese (Shift_JIS)" and "Japanese (EUC-JP)" than to make the Japanese autodetector run when the base guess isn't Shift_JIS. (The "instead of" option would be easier to implement than the "in addition to" option and probably more comprehensible to more users.)

Specifically, I'd prefer the following over reverting bug 977540:

  1. Make the Japanese detector run if at the start of the parse the encoding candidate is Shift_JIS and it doesn't come from an explicit label.
  2. Remove UI for the Japanese detector.
  3. Replace "Japanese (Shift_JIS)", "Japanese (EUC-JP)", and "Japanese (ISO-2022-JP)" in the menu with a single item "Japanese" that sets the encoding candidate to "Shift_JIS" (and per item 1 causes the detector to run).
  4. Replace the current detector with https://github.com/hsivonen/shift_or_euc to make it extremely improbable that the detector guesses wrong between Shift_JIS, EUC-JP, and ISO-2022-JP so as to remove the need for having UI to override the detector result.
Priority: -- → P1

emk, to confirm, does this look like a resolution you'd r+?

  1. Replace the current Japanese detector, which tries to gain confidence that the content looks EUC-JPish enough, with https://github.com/hsivonen/shift_or_euc which instead looks for "not Shift_JIS" signs and, as a result, decides extremely soon (typically from the first non-ASCII character) and extremely accurately. It decides when it has seen one full-width kana or one Level 1 kanji (or one of the decidable subset of Level 2 kanji). The two failure modes are: It misdecides if there is a half-width katakana character before a full-width kana or common kanji (half-width katakana after those is fine). It is undecided if the input consists entirely of the undecidable subset of JIS X 0208 Level 2 kanji. The latter case won't realistically happen when considering a full document. (When considering just the first 1024 bytes, it is plausible to see only an undecidable document title.) The former half-width katakana case seems plausible with old JIS X 0201-only text files (which would work as Shift_JIS but which the detector decides as EUC-JP), but doesn't seem like a real problem for Web pages. (If you looked at https://github.com/hsivonen/shift_or_euc 5 days ago, it was bogus then. It actually works now. Despite the name, it also detects ISO-2022-JP.)
  2. Make the Japanese detector run if the content is unlabeled and the fallback absent of detection would be Shift_JIS.
  3. Remove the UI for the Japanese detector.
  4. Replace "Japanese (Shift_JIS)", "Japanese (EUC-JP)", and "Japanese (ISO-2022-JP)" in the menu with a single item "Japanese".
  5. When the item "Japanese" is chosen, run the detector (and fall back to Shift_JIS in case of ASCII or the implausible case of the document having only undecidable kanji).
  6. If the encoding of the top-level document came from a label, don't show a checkmark next to the item "Japanese" to allow the user to trigger detection for mislabeled content. (This is consistent with how ISO-8859-15 doesn't show a checkmark and allows the user to choose "Western".)
  7. If the encoding came from the detector already, show a checkmark next to the item "Japanese".

I believe this would bring the user experience for Japanese legacy encodings to the same level of ease of use as for non-Latin writing systems that have a single legacy encoding, such as Korean and Thai. Furthermore, coupling the detection to the Japanese base guess instead of to configuration (whose default, i.e. whether the Japanese detector is on or off, is currently locale-dependent) would make unlabeled .jp pages work better in non-Japanese Firefox localizations by default.

(Note: I can think of UI that would handle manual override of the detector misdetecting files whose first Japanese character is half-width katakana, but I think the result would be more confusing to most users in the common case. That solution would be never to put a checkmark next to the item "Japanese" and to make the item flip to whichever of Shift_JIS and EUC-JP is not the current encoding if the current one already came from the detector.)
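The "look for not-Shift_JIS signs" idea can be illustrated with a toy sniffer. This is a deliberately simplified editorial sketch of the principle only; the real shift_or_euc crate is Rust and handles many more cases:

```python
def sniff_japanese(data: bytes) -> str:
    """Toy illustration: decide from the first non-ASCII bytes."""
    if 0x1B in data:
        # A raw ESC byte introduces ISO-2022-JP shift sequences and is
        # invalid in both Shift_JIS and EUC-JP text.
        return "ISO-2022-JP"
    for lead, trail in zip(data, data[1:]):
        # EUC-JP hiragana/katakana rows use lead 0xA4/0xA5. The same bytes
        # as Shift_JIS would be half-width katakana, which essentially
        # never begin real Japanese text -- this is exactly the half-width
        # katakana failure mode described above.
        if lead in (0xA4, 0xA5) and 0xA1 <= trail <= 0xF6:
            return "EUC-JP"
        # Shift_JIS hiragana/katakana use lead 0x82/0x83, which is not a
        # valid EUC-JP lead byte.
        if lead in (0x82, 0x83) and trail >= 0x40:
            return "Shift_JIS"
    return "Shift_JIS"  # all-ASCII or undecided: fall back to the base guess
```

A full-width kana decides the outcome from the very first byte pair, which is why the detector can commit "typically from the first non-ASCII character".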

Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Flags: needinfo?(VYV03354)

(In reply to Henri Sivonen (:hsivonen) from comment #7)

emk, to confirm, does this look like a resolution you'd r+?

It looks reasonable to me, yes.

Flags: needinfo?(VYV03354)

(The patch set is not yet complete. The patches here don't make the UI side make sense yet.)

Try run to confirm that the changes to make Coverity happy don't make any compiler unhappy:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=767709687873bd09c432a441c8b1f812a179830f

See Also: → 1510569

Jörg, can you confirm that comm-central isn't using the strings from charsetMenu.properties outside CharsetMenu.jsm, and that other places in comm-central use charsetTitles.properties? (See the string removals from charsetMenu.properties in Part 4 here.)

Flags: needinfo?(jorgk)

Looks like we don't use charsetMenu.properties:
https://searchfox.org/comm-central/search?q=charsetMenu.properties&case=false&regexp=false&path=
If we don't reference it, we can't use strings from it, right?

As you said, we use charsetTitles.properties:
https://searchfox.org/comm-central/search?q=charsetTitles.properties&case=false&regexp=false&path=
of which we have our own copy:
mail/locales/en-US/chrome/messenger/charsetTitles.properties

You're not removing the ability to select various Japanese encodings, like ISO-2022-JP or Shift_JIS, are you?

Flags: needinfo?(jorgk)

(In reply to Jorg K (GMT+2) from comment #22)

You're not removing the ability to select various Japanese encodings, like ISO-2022-JP or Shift_JIS, are you?

I am removing the ability to select a specific Japanese encoding and replacing the three options with one option that automatically selects one of the three possibilities, instead of the user having to try to guess (or to be able to diagnose from mojibake) which one of the old three options to pick. (See comment 7 for the design.)

OK, but we have a menu that allows selecting a single encoding for an outgoing message. I hope that's not affected. And sorry, I haven't studied comment #7 in detail so far.

(In reply to Jorg K (GMT+2) from comment #24)

OK, but we have a menu that allows selecting a single encoding for an outgoing message. I hope that's not affected. And sorry, I haven't studied comment #7 in detail so far.

For Japanese, that's always ISO-2022-JP with name from charsetTitles.properties, right?

I'd have to look at how we do the menu (no time right this minute), but there are these three items:
Japanese (ISO-2022-JP)
Japanese (Shift_JIS)
Japanese (EUC-JP).

OK, the popup is built here
https://searchfox.org/comm-central/rev/09b367315a57acf9b5c64fcd352b8faf284898db/mail/components/compose/content/messengercompose.xul#1629
using CharsetMenu.build(). That comes from CharsetMenu.jsm. If that will now only say "Japanese", then we have a problem.

We have other UI that lets users select a charset, for example here:
https://searchfox.org/comm-central/rev/09b367315a57acf9b5c64fcd352b8faf284898db/mail/components/preferences/fonts.xul#276
and the content of those lists is built here:
https://searchfox.org/comm-central/rev/09b367315a57acf9b5c64fcd352b8faf284898db/mailnews/base/content/menulist-charsetpicker.js#84

(In reply to Jorg K (GMT+2) from comment #27)

OK, the popup is built here
https://searchfox.org/comm-central/rev/09b367315a57acf9b5c64fcd352b8faf284898db/mail/components/compose/content/messengercompose.xul#1629
using CharsetMenu.build(). That comes from CharsetMenu.jsm. If that will now only say "Japanese", then we have a problem.

:-(

Is there actually a use case for sending email in Shift_JIS or EUC-JP? If there is, I guess CharsetMenu.jsm menu building code could take a flag that indicates that the menu is being built for message compose, but 1) is there really a use case for letting the user choose Shift_JIS or EUC-JP (as opposed to making the single "Japanese" item mean ISO-2022-JP in the email case) and 2) any idea when bug 862292 could be fixed and the menu removed from the message compose UI?

Flags: needinfo?(jorgk)

Hmm, I've been doing a lot of encoding work in recent years, but Magnus has been at the party much longer. He's also a "do it all in UTF-8" proponent, so his answer might be biased ;-)

Personally I have no insight in what Japanese users want to use. From working on some CJK bugs I know that people need ISO-2022-JP from some older(?) mail clients. I don't know what the use of Shift_JIS or EUC-JP is. Shift_JIS is the "native" Windows encoding in Japan, close to Windows-932, no?

Apparently the Japanese localisation switched to UTF-8 a while ago
https://hg.mozilla.org/l10n-central/ja/rev/11867217bdd2d495aefe5fd3257d30316dafda69
so most likely the Shift_JIS and EUC-JP can be removed from the choices. Magnus?

Flags: needinfo?(jorgk) → needinfo?(mkmelin+mozilla)

(In reply to Jorg K (GMT+2) from comment #29)

I don't know what the use of Shift_JIS or EUC-JP is. Shift_JIS is the "native" Windows encoding in Japan, close to Windows-932, no?

It's almost precisely Windows code page 932 for encoder purposes (precisely so, except for three or so non-Private Use Area characters), yes.
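One of those encoder-visible differences is the famous wave dash, which can be checked with Python's strict shift_jis codec versus cp932. Caveat: Python's shift_jis is the strict JIS mapping, whereas Gecko implements the WHATWG Encoding Standard's Shift_JIS, which follows the Windows mapping:

```python
# The byte pair 0x81 0x60 is WAVE DASH (U+301C) in strict Shift_JIS but
# FULLWIDTH TILDE (U+FF5E) in Windows code page 932.
assert "\u301c".encode("shift_jis") == b"\x81\x60"   # 〜 wave dash
assert "\uff5e".encode("cp932") == b"\x81\x60"       # ～ fullwidth tilde
# Decoding the same bytes under the two codecs yields different characters:
assert b"\x81\x60".decode("shift_jis") != b"\x81\x60".decode("cp932")
```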

From working on some CJK bugs I know that people need ISO-2022-JP from some older(?) mail clients.

Considering that the main concern for Japanese email has always been ISO-2022-JP, I find it highly implausible that there'd be recipients that could deal with EUC-JP and Shift_JIS but couldn't deal with ISO-2022-JP. (Thunderbird already does not allow setting EUC-JP or Shift_JIS as the permanent outgoing email encoding from the pref UI.)

emk, any comment on this?

Flags: needinfo?(VYV03354)

For what it's worth, I agree, it should be OK to remove Shift_JIS and EUC-JP from the menu. Most likely we used the M-C-generated menu for convenience and those two happened to be on it. As Henri pointed out, they are not supported as permanent outgoing encodings, so why support them there.

Outlook uses Japanese (auto-detect) for incoming mail by default (with the three encodings under a More submenu). So I think we can also use Japanese (auto-detect), considering the reliability of ISO-2022-JP detection.

Flags: needinfo?(VYV03354)

Agreed, I don't think we need to support setting Shift_JIS and EUC-JP.

Flags: needinfo?(mkmelin+mozilla)

Now that the soft freeze is over, what should my expectations be in terms of review and being able to land this near the beginning of the Firefox 69 cycle?

Flags: needinfo?(VYV03354)

FYI, some patches from bug 1510569 landed on autoland today that move the OnStateChange handler from WebProgressChild.jsm/RemoteWebProgress.jsm into BrowserChild/BrowserParent.

Phabricator should send a reminder mail, as Bugzilla review requests did.

Flags: needinfo?(VYV03354)
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1a0e7ced8e47
part 1 - Vendor shift_or_euc into m-c. r=emk
https://hg.mozilla.org/integration/autoland/rev/1cbaafb9373a
part 2 - Use mozilla::JapaneseDetector in the HTML parser. r=emk
https://hg.mozilla.org/integration/autoland/rev/4573c25b1ce0
part 3 - Remove the old Japanese detector from the tree. r=emk
https://hg.mozilla.org/integration/autoland/rev/ccc438262e29
part 4 - Have only one item for Japanese in the Text Encoding menu. r=emk,Gijs
https://hg.mozilla.org/integration/autoland/rev/25449ba8aceb
part 5 - Enable autodetect of ISO-2022-JP for local files when Fallback Encoding is set to Japanese. r=emk
https://hg.mozilla.org/integration/autoland/rev/f593045cc48f
part 6 - Tests for the new Japanese encoding override. r=emk

(In reply to Masatoshi Kimura [:emk] from comment #36)

Phabricator should send a reminder mail, as Bugzilla review requests did.

Thanks for the reviews!

(In reply to Barret Rennie [:brennie] from comment #35)

FYI, some patches from bug 1510569 landed on autoland today that move the OnStateChange handler from WebProgressChild.jsm/RemoteWebProgress.jsm into BrowserChild/BrowserParent.

Landed with this taken into account when rebasing. Thanks.

The feature itself appears to work on Mac and Windows. The new tests that fail on Mac and Windows succeed on Linux64 locally and on try, but the Test Verify mode manages to make them fail even on Linux64 on try. I find this very odd considering that the tests are variations of tests using an existing harness. That is, the tests are variations on this theme: https://searchfox.org/mozilla-central/source/docshell/test/browser/browser_bug234628-1.js

If anyone has insight into what's going wrong, I'm very curious to hear ideas.

Flags: needinfo?(hsivonen)

The current form of the test harness came from bug 967873, which Gijs reviewed, so needinfoing Gijs for ideas.

Flags: needinfo?(gijskruitbosch+bugs)

I don't know. Off-hand, a few points:

  1. we should probably not implicitly make the test depend on the textContent serialization of the HTML doc by asserting the exact index at which we expect to find certain Unicode characters. Use <str>.includes() instead.
  2. adding logging for the full text content should clarify what is actually happening when the current indexOf check is returning -1. Perhaps one of the browserLoaded checks resolves when about:blank is loaded at some point? Perhaps the subframe checks are somehow not working? Perhaps something else. But clearly some funny business is happening so it seems worth adding logging to figure out exactly what that is.
  3. you can run verify yourself with --verify.
  4. I don't understand from the comments here whether this always fails TV or not, and/or if there's any difference between what got pushed to try vs. what landed (like, say, ancestor csets). If so, that may be worth looking at.
Flags: needinfo?(gijskruitbosch+bugs)

(In reply to :Gijs (he/him) from comment #43)

I don't know. Off-hand, a few points:

  1. we should probably not implicitly make the test depend on the textContent serialization of the HTML doc by asserting the exact index at which we expect to find certain Unicode characters. Use <str>.includes() instead.

I think this bit is fine. It's a pretty serious bug if the indices aren't exact.

  2. adding logging for the full text content should clarify what is actually happening when the current indexOf check is returning -1. Perhaps one of the browserLoaded checks resolves when about:blank is loaded at some point? Perhaps the subframe checks are somehow not working? Perhaps something else. But clearly some funny business is happening so it seems worth adding logging to figure out exactly what that is.

Does the test framework make the test assertions and the content being tested run in the same process, or does stuff get proxied with some kind of cross-process wrappers? A recording of the failure suggests that the HTML parser is doing the right thing, yet the test JS sees something else.

  4. I don't understand from the comments here whether this always fails TV or not, and/or if there's any difference between what got pushed to try vs. what landed (like, say, ancestor csets). If so, that may be worth looking at.

TV does not always fail on Linux.

Looks like in the step where the Shift_JIS byte pair is supposed to be decoded as windows-1252, it gets decoded into two REPLACEMENT CHARACTERS.

(In reply to Henri Sivonen (:hsivonen) from comment #45)

Looks like in the step where the Shift_JIS byte pair is supposed to be decoded as windows-1252, it gets decoded into two REPLACEMENT CHARACTERS.

I think Lando mangled the non-UTF-8 parts of the patch and generated REPLACEMENT CHARACTERS as UTF-8 bytes where the Shift_JIS and EUC-JP bytes should have been.
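The mangling mechanism is easy to reproduce: a Shift_JIS byte pair is not valid UTF-8, so any tool that lossily round-trips a patch through UTF-8 replaces each byte with U+FFFD (an illustrative Python sketch):

```python
# Both bytes of the Shift_JIS pair for こ are bare UTF-8 continuation
# bytes, so a lossy UTF-8 decode turns each one into U+FFFD REPLACEMENT
# CHARACTER...
pair = "こ".encode("shift_jis")                  # b'\x82\xb1'
mangled = pair.decode("utf-8", errors="replace")
assert mangled == "\ufffd\ufffd"
# ...and re-encoding writes the UTF-8 bytes of U+FFFD into the file,
# which is why the landed test then saw two replacement characters
# instead of a windows-1252 decoding of the original pair.
assert mangled.encode("utf-8") == b"\xef\xbf\xbd" * 2
```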

I'm going to try relanding this by pushing to inbound.

(In reply to Henri Sivonen (:hsivonen) from comment #44)

  2. adding logging for the full text content should clarify what is actually happening when the current indexOf check is returning -1. Perhaps one of the browserLoaded checks resolves when about:blank is loaded at some point? Perhaps the subframe checks are somehow not working? Perhaps something else. But clearly some funny business is happening so it seems worth adding logging to figure out exactly what that is.

Does the test framework make the test assertions and the content being tested run in the same process, or does stuff get proxied with some kind of cross-process wrappers? A recording of the failure suggests that the HTML parser is doing the right thing, yet the test JS sees something else.

I don't understand the question. There are no longer any "cross process wrappers" (CPOWs) if that's what you mean. The tests run with e10s enabled, so the check functions in the test JS all run in the content process, see https://searchfox.org/mozilla-central/source/docshell/test/browser/head.js#79 .

(In reply to :Gijs (he/him) from comment #47)

There are no longer any "cross process wrappers" (CPOWs) if that's what you mean.

That's what I meant.

Anyway, it looks like this is a case of Lando landing different bytes than I meant to land. (Now the mystery, which I'm not going to pursue, is: Why didn't Linux mochitest-browser-chrome go perma-orange like Mac and Windows after the landing?)

Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/af8569c103d2
part 1 - Vendor shift_or_euc into m-c. r=emk.
https://hg.mozilla.org/integration/mozilla-inbound/rev/f6d04ade73b5
part 2 - Use mozilla::JapaneseDetector in the HTML parser. r=emk.
https://hg.mozilla.org/integration/mozilla-inbound/rev/2d4822860fe7
part 3 - Remove the old Japanese detector from the tree. r=emk.
https://hg.mozilla.org/integration/mozilla-inbound/rev/52b6ea9ab31e
part 4 - Have only one item for Japanese in the Text Encoding menu. r=Gijs,emk.
https://hg.mozilla.org/integration/mozilla-inbound/rev/35bf777b62f7
part 5 - Enable autodetect of ISO-2022-JP for local files when Fallback Encoding is set to Japanese. r=emk.
https://hg.mozilla.org/integration/mozilla-inbound/rev/76dbcd8529c4
part 6 - Tests for the new Japanese encoding override. r=emk.

(In reply to Henri Sivonen (:hsivonen) from comment #48)

Anyway, it looks like this is a case of Lando landing different bytes than I meant to land.

Filed as bug 1556381.

Pushed by csabou@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/beaa9e87cca2
Fix merge conflict bustage, add missing comma. r=bustage-fix CLOSED TREE
Depends on: 1556746
Regressions: 1565888