Open Bug 1585935 Opened 3 years ago Updated 4 months ago

Japanese UTF-8 site auto-detect fails

Categories

(Core :: Internationalization, defect, P3)

66 Branch
Desktop
Windows 10
defect

People

(Reporter: alice0775, Unassigned)

References

(Regression, )

Details

(Keywords: nightly-community, regression)

Attachments

(1 file)

Attached image image.png

Steps to reproduce:
0. Start a Firefox [Ja] localized build
(if not using a [Ja] build,
set intl.accept_languages to "ja, en-US, en"
set intl.charset.detector = "ja_parallel_state_machine"
)

  1. Open http://www.bug-ja.org/

Actual Results:
Mojibake.
Page info shows "windows-1252".
Alt -> View -> Text Encoding shows "Western".

Interestingly, the Network Monitor shows the content without mojibake. See the attached screenshot.

Expected Results:
Page info should show "UTF-8".
Alt -> View -> Text Encoding should show "Unicode".

This has regressed since Firefox 66.
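
For illustration, mojibake of this kind is what results when UTF-8 bytes are decoded as windows-1252. The following is a minimal Python sketch, not a reproduction of the page: the sample string and the helper name `decode_as` are assumptions, and Python's `cp1252` codec differs slightly from the browser's windows-1252 decoder for a few undefined bytes, hence `errors="replace"`.

```python
def decode_as(data: bytes, encoding: str) -> str:
    """Decode bytes with the given encoding, replacing undecodable bytes."""
    return data.decode(encoding, errors="replace")


# Hypothetical sample text; it is not taken from the site.
text = "日本語"
utf8_bytes = text.encode("utf-8")

# What the user sees when the page is treated as "Western" (windows-1252).
mojibake = decode_as(utf8_bytes, "windows-1252")

# What the user sees when the page is treated as "Unicode" (UTF-8).
correct = decode_as(utf8_bytes, "utf-8")

print(mojibake)
print(correct)
```

Each of the nine UTF-8 bytes becomes one unrelated Western character, which is why the page is unreadable until the encoding is switched to UTF-8.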

This change is intentional. The intent of the change was to make it so that going forward Web developers who use the Japanese localization of Firefox wouldn't create Web pages like this that are broken in other localizations of Firefox and in other browsers. (E.g. in Chrome the user has no recourse to fix this at the user's end.)

Obviously, this kind of change intended to avoid bad Web authoring going forward breaks bad authoring that has already happened, but I suggest dealing with this case by contacting the admin of the page to ask them to add <meta charset=utf-8>.

Does the Network monitor always decode as UTF-8 or is there more going on there?

Priority: -- → P2

I think this isn't a P2 as a Gecko bug. I think this should be WONTFIX as a Gecko bug. The site looks the same in Safari and Chrome as it now does in Firefox. The change in how it looks in Firefox is a foreseeable effect of an intentional health-of-the-Web change. If we kept guessing UTF-8 for Japanese, Chrome might decide to guess UTF-8 too, and not just for Japanese, which would make the Web more brittle overall, because there are languages for which UTF-8 guessing is less reliable than for Japanese. It's better for the Web to exclude UTF-8 from competitive guessing.

Let's treat this as a tech evangelism bug. This looks like a close-to-Mozilla site (which probably explains how the site has gotten away with being broken in Chrome and Safari), so it should be possible to contact the admin to get the site fixed. However, I have failed to locate an email address for an admin on the site.

Does anyone on the CC list happen to know how to contact the admin of this site?

OK, understood. Regarding http://www.bug-ja.org/, I have asked the site owner to add a meta element or to add a charset parameter to the Content-Type header.
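
As a rough way to check whether a page declares its encoding through either route, the sketch below looks for a charset parameter in a Content-Type value and for a meta declaration in the first 1024 bytes of the body. It is a simplified illustration only: the function name `declared_encoding` and the regular expressions are assumptions, and this is not the prescan algorithm that browsers actually implement.

```python
import re

# Simplified patterns (assumptions for illustration, not the HTML spec's prescan).
HEADER_CHARSET = re.compile(r"""charset\s*=\s*["']?([A-Za-z0-9_\-]+)""", re.IGNORECASE)
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-]+)""", re.IGNORECASE
)


def declared_encoding(content_type, body):
    """Return the declared encoding label in lowercase, or None if nothing is declared."""
    # 1. A charset parameter on the HTTP Content-Type header.
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # 2. A meta declaration within the first 1024 bytes of the document.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return None


print(declared_encoding("text/html", b"<head><title>x</title></head>"))
print(declared_encoding("text/html; charset=UTF-8", b""))
print(declared_encoding("text/html", b"<head><meta charset=utf-8></head>"))
```

A page for which this returns None is left to the browser's fallback or detector, which is the situation this bug describes.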

I don't think it's an i18n issue in Gecko, just an issue on the site (which is already mostly unmaintained...). - fixed by https://github.com/bug-ja/bugja-www-old/commit/aab76cb4556232c0fd910c8de513911496b4fa74

[W3C i18n WG hat on]
Personally, I find it strange to exclude UTF-8 from the list for competitive guessing, given that the spec states that documents must be in UTF-8 whether or not a character encoding declaration is present; although I understand that, for some encodings, it is quite difficult to distinguish them from UTF-8.

(In reply to A. Shimono [:himorin] from comment #5)

> I don't think it's an i18n issue in Gecko, just an issue on the site (which is already mostly unmaintained...). - fixed by https://github.com/bug-ja/bugja-www-old/commit/aab76cb4556232c0fd910c8de513911496b4fa74

Thank you!

> [W3C i18n WG hat on]
> Personally, I find it strange to exclude UTF-8 from the list for competitive guessing, given that the spec states that documents must be in UTF-8 whether or not a character encoding declaration is present; although I understand that, for some encodings, it is quite difficult to distinguish them from UTF-8.

The issue is not the difficulty of distinguishing UTF-8 from other encodings given the full content. When loading files from file URLs, Firefox does detect UTF-8.

For http/https content, though, incremental processing is important and starting over is bad. A Japanese site is likely to have non-ASCII right in the page title and to have the page title in the first 1024 bytes of the page (which we wait for anyway for the meta scan). However, languages like English, Dutch, Indonesian, Somali, and Swahili can easily have a very long ASCII prefix before any non-ASCII (to the point that the first non-ASCII may be a copyright sign in the page footer). Languages like German and Finnish tend to have non-ASCII on every page, but can easily have none in the first 1024 bytes if the page title is the only non-markup/non-stylesheet/non-JavaScript text within the first 1024 bytes.
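
To make the point about ASCII prefixes concrete, here is a small Python sketch that reports where the first non-ASCII byte occurs in a document. The two sample documents and the helper name `first_non_ascii_offset` are made up for illustration; only the 1024-byte figure comes from the discussion above.

```python
def first_non_ascii_offset(data: bytes):
    """Return the index of the first byte >= 0x80, or None if the data is pure ASCII."""
    for index, byte in enumerate(data):
        if byte >= 0x80:
            return index
    return None


PREFIX = 1024  # bytes waited for anyway for the meta scan

# Hypothetical Japanese page: non-ASCII appears right in the title.
japanese = "<!DOCTYPE html><title>日本語のページ</title>".encode("utf-8")

# Hypothetical English page: a long ASCII stylesheet, then a copyright sign in the footer.
english = (
    "<!DOCTYPE html><title>Example</title><style>"
    + "a{color:red}" * 200
    + "</style><p>Text</p><footer>© Example</footer>"
).encode("utf-8")

for name, page in (("japanese", japanese), ("english", english)):
    offset = first_non_ascii_offset(page)
    print(name, offset, offset is not None and offset < PREFIX)
```

In the Japanese sample the first non-ASCII byte falls well inside the 1024-byte prefix, while in the English sample it falls far beyond it, so a detector limited to the prefix would see only ASCII.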

Therefore, detecting UTF-8 for remote content world-wide would cause a situation where UTF-8 detection would work often enough that Web developers would start relying on it, but one of the following would apply:

  1. It would break incremental rendering for some languages if we buffered until finding non-ASCII (and developers with fast connections might notice)
  2. It would cause reloads (if we reloaded when discovering non-ASCII late; as happens in Gecko when Gecko has started with a Shift_JIS presumption and discovers EUC-JP)
  3. Pages would sometimes have mojibake later on the page (if we only considered the start of the page)
  4. Pages would have mojibake later on the page depending on network conditions (if we had a timeout for buffering or otherwise depended on time-correlated network buffer coalescing)
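
Option 3 in the list above can be sketched as follows. This is a hypothetical detector, not Gecko's: the name `guess_from_prefix`, the windows-1252 fallback, and the sample page are all assumptions chosen only to show how a prefix-only guess yields mojibake later on the page.

```python
import codecs


def guess_from_prefix(data: bytes, prefix_len: int = 1024) -> str:
    """Hypothetical detector that only inspects the first prefix_len bytes."""
    prefix = data[:prefix_len]
    if all(byte < 0x80 for byte in prefix):
        # Pure ASCII so far: nothing to detect, use the assumed legacy fallback.
        return "windows-1252"
    try:
        # final=False tolerates a multi-byte sequence cut off at the prefix boundary.
        codecs.getincrementaldecoder("utf-8")().decode(prefix, final=False)
    except UnicodeDecodeError:
        return "windows-1252"
    return "utf-8"


# Hypothetical page: more than 1024 bytes of ASCII, then a copyright sign in the footer.
page = (
    "<!DOCTYPE html><title>Example</title>"
    + "<p>plain ASCII text</p>" * 100
    + "<footer>© Example</footer>"
).encode("utf-8")

encoding = guess_from_prefix(page)
rendered = page.decode(encoding, errors="replace")

print(encoding)
print(rendered[-26:])
```

The guess is made before the copyright sign is seen, so the footer renders with a stray character in front of the sign even though the whole document is valid UTF-8.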

For legacy encodings, there is a greater need for autodetection. Also, since UTF-8 has won to the point that most Web developers try to use it for new development, legacy-encoding autodetection has a lesser chance of causing Web developers to fail to label newly-authored pages, which are presumably more performance-sensitive and have more JavaScript side effects than legacy pages. Autodetection restricted to legacy encodings therefore has a lesser chance of causing harm than UTF-8 autodetection.

AFAIK, right now no major browser autodetects UTF-8 for non-file URLs. <meta charset=utf-8> is just like <!DOCTYPE html> and <meta name="viewport" content="width=device-width, initial-scale=1">: Something that everyone wants for new pages, and, therefore, is annoyed about not getting by default, but that requires explicit opt-in because we can't expect legacy pages to be the ones opting out.

Severity: normal → S3
Priority: P2 → P3
Has Regression Range: --- → yes