Japanese auto-detect no longer works on some generic (i.e. language-neutral) TLDs

RESOLVED FIXED in Firefox 69

Status

RESOLVED FIXED (defect, P1, normal)
Reported: 4 months ago
Last modified: 9 days ago

People

(Reporter: emk, Assigned: hsivonen)

Tracking

(Regression)

Version: unspecified
Target Milestone: mozilla69
Points: ---
Bug Flags: in-testsuite +

Firefox Tracking Flags

(firefox-esr60 unaffected, firefox67 wontfix, firefox68 wontfix, firefox69 fixed)

Details

Attachments

(6 attachments)

Steps to reproduce:

  1. Open https://emk.name/test/enter.html. (Note that this bug depends on TLDs, so I can't attach the testcase on Bugzilla.)
  2. Select hamburger menu > [More] > [Text Encoding] > [Auto-Detect] > [Japanese].

Actual result:
Mojibake.

Expected result:
Shift_JIS should be detected.

I also got mojibake on another TLD, .fam.cx, which is a dynamic DNS service.

Regression range:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=140c4f3b6b137026f6305102d1e34725225352a9&tochange=a2df400cb88c79664cd8ba5fcf6b640373a85524

Flags: needinfo?(hsivonen)

Considering that this effect is intentional and expected, is this known to cause more problems than having predominantly non-Japanese TLDs be assumed to be Shift_JIS?

Flags: needinfo?(hsivonen) → needinfo?(VYV03354)

If a Western page is mis-detected as Shift_JIS, only a few accent characters and quotation marks will be garbled. But if a Japanese page is mis-detected as windows-1252, the page will be completely unreadable. So the latter is much worse than the former for us (Japanese people).
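The asymmetry is easy to demonstrate with a quick sketch using Python's standard codecs (`cp1252` is Python's name for windows-1252; this is an editorial illustration, not code from the bug):

```python
# A Japanese page mis-decoded as windows-1252: every Shift_JIS byte lands
# on some Latin punctuation or accented letter, so nothing is readable.
jp_bytes = "こんにちは".encode("shift_jis")
garbled = jp_bytes.decode("cp1252")
print(garbled)  # ‚±‚ñ‚É‚¿‚Í

# A Western page mis-decoded as Shift_JIS: ASCII bytes are unaffected,
# so only the occasional accented character is damaged.
legible = "SEÑOR".encode("cp1252").decode("shift_jis")
print(legible)  # SEﾑOR
```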

Flags: needinfo?(VYV03354)

(In reply to Masatoshi Kimura [:emk] from comment #2)

If a Western page is mis-detected as Shift_JIS, only a few accent characters and quotation marks will be garbled. But if a Japanese page is mis-detected as windows-1252, the page will be completely unreadable. So the latter is much worse than the former for us (Japanese people).

This also applies to Chinese, Korean, Hebrew, Arabic, and Greek legacy encodings (and to a large extent to Vietnamese), and for UI localizations affiliated with those encodings, .cx has not yielded the UI localization-affiliated encoding since Firefox 30. Yet, we haven't gotten a single bug report.

As a matter of where I think we should be aiming at, I think we should be aiming at eliminating the dependency on the UI localization. .com/.org/.net currently depend on the UI localization, because those domains are assumed to have been in global use in the 1990s when ways of actually declaring the encoding weren't well-known, and so far there hasn't been a viable idea of how to eliminate the UI localization dependency for .com/.org/.net. New generic TLDs do not fall back on the UI localization-affiliated encoding, because there's no 1990s legacy for them.
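The fallback scheme described above can be sketched roughly as follows. This is an illustrative toy, not Gecko's actual tables or code, and the specific ccTLD entries are example assumptions:

```python
# Hypothetical sketch of the "base guess" for unlabeled pages: ccTLDs
# carry their own fallback, the 1990s-era generic TLDs fall back to the
# UI localization (the dependency this bug wants eliminated), and newer
# generic TLDs get windows-1252 regardless of locale.
CCTLD_FALLBACK = {"jp": "Shift_JIS", "ru": "windows-1251", "gr": "ISO-8859-7"}
LEGACY_GENERIC = {"com", "org", "net"}

def base_guess(tld: str, ui_locale_fallback: str) -> str:
    if tld in CCTLD_FALLBACK:
        return CCTLD_FALLBACK[tld]
    if tld in LEGACY_GENERIC:
        return ui_locale_fallback   # locale-dependent 1990s legacy behavior
    return "windows-1252"           # new gTLDs, generically-used ccTLDs: no legacy
```

Under this sketch, a .cx page never gets Shift_JIS as the base guess, which is why the Japanese detector (gated on a Shift_JIS base guess) no longer runs there.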

.cx is a case of a ccTLD that was used generically relatively early. In that sense, it has potential for .com/.net/.org-style long-tail legacy. However, I'd like us not to take steps backward (from the goal of eliminating the dependency on the UI localization) from one data point for Japanese when the data for Chinese, Korean, Hebrew, Arabic, and Greek since Firefox 30 suggests that there isn't a legacy pattern warranting action. (For example, I think it's bad that we decided bug 620106 from one data point.)

(I'm talking about the UI localization dependency instead of the detector setting, because I'd like us to get to a point where the Cyrillic detectors are removed and the Japanese detector runs unconditionally, without UI for the Japanese detector, when the base guess [from the .jp TLD or from the ja UI locale] is Shift_JIS.)

Why is the UI localization relevant at all? See the STR in comment #0. I reproduced it on en-US-localized Firefox.

(In reply to Henri Sivonen (:hsivonen) from comment #3)

This also applies to Chinese, Korean, Hebrew, Arabic, and Greek legacy encodings (and to a large extent to Vietnamese),

Those locales have only one dominant legacy encoding. If users see garbled text, they can just choose the single matching menu entry from the Text Encoding menu. On the other hand, we (Japanese users) have to gamble among three encodings because Auto-Detect no longer works. It is really tedious.

(In reply to Masatoshi Kimura [:emk] from comment #4)

Why is the UI localization relevant at all?

It affects the base guess for old generic TLDs (.com/.net/.org) and the Japanese detector only runs if the base guess is Shift_JIS.

I reproduced it on en-US localized Firefox.

Do you have the "Text Encoding for Legacy Content" pref set to "Japanese"? That is, are you asking for .cx to be able to gain Shift_JIS as the base guess or are you asking for the detector to run even if the base guess isn't Shift_JIS?

(In reply to Henri Sivonen (:hsivonen) from comment #3)

This also applies to Chinese, Korean, Hebrew, Arabic, and Greek legacy encodings (and to a large extent to Vietnamese),

Those locales have only one dominant legacy encoding. If users see garbled text, they can just choose the single matching menu entry from the Text Encoding menu. On the other hand, we (Japanese users) have to gamble among three encodings

Two, considering how rare ISO-2022-JP is on the Web and how easy it is to tell if mojibake is due to ISO-2022-JP decoded as something else. Central European users also have two legacy encodings and no bug reports about .cx. Greek has two, too, though the difference in the Greek case may be mild enough that the user won't try again if the first guess is the wrong one of the two.
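The "easy to tell" claim holds because ISO-2022-JP keeps all its payload bytes in the ASCII range; mis-decoded with a single-byte encoding, it degrades into plain-ASCII gibberish bracketed by escape characters (an illustrative Python sketch, with `cp1252` standing in for windows-1252):

```python
# ISO-2022-JP mis-decoded as windows-1252: the payload between the ESC
# sequences is pure ASCII ("$3$s$K$A$O"), which looks nothing like the
# accented-Latin soup produced by mis-decoded Shift_JIS or EUC-JP, so
# the culprit is obvious at a glance.
iso = "こんにちは".encode("iso2022_jp")
mis = iso.decode("cp1252")
print(repr(mis))  # '\x1b$B$3$s$K$A$O\x1b(B'
```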

because Auto-Detect no longer works. It is really tedious.

To address this concern, it seems it would be better to have a menu item "Japanese (Detect)" as in IE instead of or in addition to "Japanese (Shift_JIS)" and "Japanese (EUC-JP)" than to make the Japanese autodetector run when the base guess isn't Shift_JIS. (The "instead of" option would be easier to implement than the "in addition to" option and probably more comprehensible to more users.)

Specifically, I'd prefer the following over reverting bug 977540:

  1. Make the Japanese detector run if at the start of the parse the encoding candidate is Shift_JIS and it doesn't come from an explicit label.
  2. Remove UI for the Japanese detector.
  3. Replace "Japanese (Shift_JIS)", "Japanese (EUC-JP)", and "Japanese (ISO-2022-JP)" in the menu with a single item "Japanese" that sets the encoding candidate to "Shift_JIS" (and per item 1 causes the detector to run).
  4. Replace the current detector with https://github.com/hsivonen/shift_or_euc to make it extremely improbable that the detector guesses wrong between Shift_JIS, EUC-JP, and ISO-2022-JP so as to remove the need for having UI to override the detector result.
Priority: -- → P1

emk, to confirm, does this look like a resolution you'd r+?

  1. Replace the current Japanese detector, which tries to gain confidence that the content looks EUC-JPish enough, with https://github.com/hsivonen/shift_or_euc which instead looks for "not Shift_JIS" signs and, as a result, decides extremely soon (typically from the first non-ASCII character) and extremely accurately. It decides when it has seen one full-width kana or one Level 1 kanji (or one of the decidable subset of Level 2 kanji). The two failure modes are: It misdecides if there is a half-width katakana character before a full-width kana or common kanji (half-width katakana after those is fine). It is undecided if the input consists entirely of the undecidable subset of JIS X 0208 Level 2 kanji. The latter case won't realistically happen when considering a full document. (When considering just the first 1024 bytes, it is plausible to see only an undecidable document title.) The former half-width katakana case seems plausible with old JIS X 0201-only text files (which would work as Shift_JIS but which the detector decides as EUC-JP), but doesn't seem like a real problem for Web pages. (If you looked at https://github.com/hsivonen/shift_or_euc 5 days ago, it was bogus then. It actually works now. Despite the name, it also detects ISO-2022-JP.)
  2. Make the Japanese detector run if the content is unlabeled and the fallback absent of detection would be Shift_JIS.
  3. Remove the UI for the Japanese detector.
  4. Replace "Japanese (Shift_JIS)", "Japanese (EUC-JP)", and "Japanese (ISO-2022-JP)" in the menu with a single item "Japanese".
  5. When the item "Japanese" is chosen, run the detector (and fall back to Shift_JIS in case of ASCII or the implausible case of the document having only undecidable kanji).
  6. If the encoding of the top-level document came from a label, don't show a checkmark next to the item "Japanese" to allow the user to trigger detection for mislabeled content. (This is consistent with how ISO-8859-15 doesn't show a checkmark and allows the user to choose "Western".)
  7. If the encoding came from the detector already, show a checkmark next to the item "Japanese".

I believe this would bring the user experience for Japanese legacy encodings to the same level of ease of use as for non-Latin writing systems that have a single legacy encoding, such as Korean and Thai. Furthermore, coupling the detection to the Japanese base guess instead of to configuration (whose default, i.e. whether the Japanese detector is on or off, is currently locale-dependent) would make unlabeled .jp pages work better in non-Japanese Firefox localizations by default.

(Note: I can think of UI that would handle manual override of the detector misdetecting files whose first Japanese character is half-width katakana, but I think the result would be more confusing to most users in the common case. That solution would be never to put a checkmark next to the item "Japanese" and to make the item flip to whichever of Shift_JIS and EUC-JP is not the current encoding if the current one already came from the detector.)
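The "look for not-Shift_JIS signs" idea can be illustrated with a toy sniffer. This is a deliberately simplified editorial sketch of the principle only; the real shift_or_euc crate is Rust and handles many more cases:

```python
def sniff_japanese(data: bytes) -> str:
    """Toy illustration: decide from the first non-ASCII bytes."""
    if 0x1B in data:
        # A raw ESC byte introduces ISO-2022-JP shift sequences and is
        # invalid in both Shift_JIS and EUC-JP text.
        return "ISO-2022-JP"
    for lead, trail in zip(data, data[1:]):
        # EUC-JP hiragana/katakana rows use lead 0xA4/0xA5. The same bytes
        # as Shift_JIS would be half-width katakana, which essentially
        # never begin real Japanese text -- this is exactly the half-width
        # katakana failure mode described above.
        if lead in (0xA4, 0xA5) and 0xA1 <= trail <= 0xF6:
            return "EUC-JP"
        # Shift_JIS hiragana/katakana use lead 0x82/0x83, which is not a
        # valid EUC-JP lead byte.
        if lead in (0x82, 0x83) and trail >= 0x40:
            return "Shift_JIS"
    return "Shift_JIS"  # all-ASCII or undecided: fall back to the base guess
```

A full-width kana decides the outcome from the very first byte pair, which is why the detector can commit "typically from the first non-ASCII character".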

Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Flags: needinfo?(VYV03354)

(In reply to Henri Sivonen (:hsivonen) from comment #7)

emk, to confirm, does this look like a resolution you'd r+?

It looks reasonable to me, yes.

Flags: needinfo?(VYV03354)

(The patch set is not yet complete. The patches here don't make the UI side make sense yet.)

Try run to confirm that the changes to make Coverity happy don't make any compiler unhappy:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=767709687873bd09c432a441c8b1f812a179830f

See Also: → 1510569

Jörg, can you confirm that comm-central isn't using the strings from charsetMenu.properties outside CharsetMenu.jsm, and that other places in comm-central use charsetTitles.properties? (See the string removals from charsetMenu.properties in Part 4 here.)

Flags: needinfo?(jorgk)

Looks like we don't use charsetMenu.properties:
https://searchfox.org/comm-central/search?q=charsetMenu.properties&case=false&regexp=false&path=
If we don't reference it, we can't use strings from it, right?

As you said, we use charsetTitles.properties:
https://searchfox.org/comm-central/search?q=charsetTitles.properties&case=false&regexp=false&path=
of which we have our own copy:
mail/locales/en-US/chrome/messenger/charsetTitles.properties

You're not removing the ability to select various Japanese encodings, like ISO-2022-JP or Shift_JIS, are you?

Flags: needinfo?(jorgk)

(In reply to Jorg K (GMT+2) from comment #22)

You're not removing the ability to select various Japanese encodings, like ISO-2022-JP or Shift_JIS, are you?

I am removing the ability to select a specific Japanese encoding and replacing the three options with one option that automatically selects one of the three possibilities, instead of the user having to try to guess (or to be able to diagnose from mojibake) which one of the old three options to pick. (See comment 7 for the design.)

OK, but we have a menu that allows selecting a single encoding for an outgoing message. I hope that's not affected. And sorry, I haven't studied comment #7 in detail so far.

(In reply to Jorg K (GMT+2) from comment #24)

OK, but we have a menu that allows selecting a single encoding for an outgoing message. I hope that's not affected. And sorry, I haven't studied comment #7 in detail so far.

For Japanese, that's always ISO-2022-JP with name from charsetTitles.properties, right?

I'd have to look at how we do the menu (no time right this minute), but there are these three items:
Japanese (ISO-2022-JP)
Japanese (Shift_JIS)
Japanese (EUC-JP).

OK, the popup is built here
https://searchfox.org/comm-central/rev/09b367315a57acf9b5c64fcd352b8faf284898db/mail/components/compose/content/messengercompose.xul#1629
using CharsetMenu.build(). That comes from CharsetMenu.jsm. If that will now only say "Japanese", then we have a problem.

We have other UI that lets users select a charset, for example here:
https://searchfox.org/comm-central/rev/09b367315a57acf9b5c64fcd352b8faf284898db/mail/components/preferences/fonts.xul#276
and the content of those lists is built here:
https://searchfox.org/comm-central/rev/09b367315a57acf9b5c64fcd352b8faf284898db/mailnews/base/content/menulist-charsetpicker.js#84

(In reply to Jorg K (GMT+2) from comment #27)

OK, the popup is built here
https://searchfox.org/comm-central/rev/09b367315a57acf9b5c64fcd352b8faf284898db/mail/components/compose/content/messengercompose.xul#1629
using CharsetMenu.build(). That comes from CharsetMenu.jsm. If that will now only say "Japanese", then we have a problem.

:-(

Is there actually a use case for sending email in Shift_JIS or EUC-JP? If there is, I guess CharsetMenu.jsm menu building code could take a flag that indicates that the menu is being built for message compose, but 1) is there really a use case for letting the user choose Shift_JIS or EUC-JP (as opposed to making the single "Japanese" item mean ISO-2022-JP in the email case) and 2) any idea when bug 862292 could be fixed and the menu removed from the message compose UI?

Flags: needinfo?(jorgk)

Hmm, I've been doing a lot of encoding work in recent years, but Magnus has been at the party much longer. He's also a "do it all in UTF-8" proponent, so his answer might be biased ;-)

Personally I have no insight in what Japanese users want to use. From working on some CJK bugs I know that people need ISO-2022-JP from some older(?) mail clients. I don't know what the use of Shift_JIS or EUC-JP is. Shift_JIS is the "native" Windows encoding in Japan, close to Windows-932, no?

Apparently the Japanese localisation switched to UTF-8 a while ago
https://hg.mozilla.org/l10n-central/ja/rev/11867217bdd2d495aefe5fd3257d30316dafda69
so most likely the Shift_JIS and EUC-JP can be removed from the choices. Magnus?

Flags: needinfo?(jorgk) → needinfo?(mkmelin+mozilla)

(In reply to Jorg K (GMT+2) from comment #29)

I don't know what the use of Shift_JIS or EUC-JP is. Shift_JIS is the "native" Windows encoding in Japan, close to Windows-932, no?

It's almost precisely Windows code page 932 for encoder purposes (precisely so, except for three or so non-Private Use Area characters), yes.
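One of those encoder-visible differences is the famous wave dash, which can be checked with Python's strict shift_jis codec versus cp932. Caveat: Python's shift_jis is the strict JIS mapping, whereas Gecko implements the WHATWG Encoding Standard's Shift_JIS, which follows the Windows mapping:

```python
# The byte pair 0x81 0x60 is WAVE DASH (U+301C) in strict Shift_JIS but
# FULLWIDTH TILDE (U+FF5E) in Windows code page 932.
assert "\u301c".encode("shift_jis") == b"\x81\x60"   # 〜 wave dash
assert "\uff5e".encode("cp932") == b"\x81\x60"       # ～ fullwidth tilde
# Decoding the same bytes under the two codecs yields different characters:
assert b"\x81\x60".decode("shift_jis") != b"\x81\x60".decode("cp932")
```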

From working on some CJK bugs I know that people need ISO-2022-JP from some older(?) mail clients.

Considering that the main concern for Japanese email has always been ISO-2022-JP, I find it highly implausible that there'd be recipients that could deal with EUC-JP and Shift_JIS but couldn't deal with ISO-2022-JP. (Thunderbird already does not allow setting EUC-JP or Shift_JIS as the permanent outgoing email encoding from the pref UI.)

emk, any comment on this?

Flags: needinfo?(VYV03354)

For what it's worth, I agree, it should be OK to remove Shift_JIS and EUC-JP from the menu. Most likely we used the M-C-generated menu for convenience and those two happened to be on it. As Henri pointed out, they are not supported as permanent outgoing encodings, so why support them there.

Outlook uses Japanese (auto-detect) for incoming mail by default (with the three encodings under a More submenu). So I think we can also use Japanese (auto-detect), considering the reliability of ISO-2022-JP detection.

Flags: needinfo?(VYV03354)

Agreed, I don't think we need to support setting Shift_JIS and EUC-JP.

Flags: needinfo?(mkmelin+mozilla)

Now that the soft freeze is over, what should my expectations be in terms of review and being able to land this near the beginning of the Firefox 69 cycle?

Flags: needinfo?(VYV03354)

FYI, some patches from bug 1510569 landed on autoland today that move the OnStateChange handler from WebProgressChild.jsm/RemoteWebProgress.jsm into BrowserChild/BrowserParent.

Phabricator should send a reminder mail, as Bugzilla review requests did.

Flags: needinfo?(VYV03354)
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1a0e7ced8e47
part 1 - Vendor shift_or_euc into m-c. r=emk
https://hg.mozilla.org/integration/autoland/rev/1cbaafb9373a
part 2 - Use mozilla::JapaneseDetector in the HTML parser. r=emk
https://hg.mozilla.org/integration/autoland/rev/4573c25b1ce0
part 3 - Remove the old Japanese detector from the tree. r=emk
https://hg.mozilla.org/integration/autoland/rev/ccc438262e29
part 4 - Have only one item for Japanese in the Text Encoding menu. r=emk,Gijs
https://hg.mozilla.org/integration/autoland/rev/25449ba8aceb
part 5 - Enable autodetect of ISO-2022-JP for local files when Fallback Encoding is set to Japanese. r=emk
https://hg.mozilla.org/integration/autoland/rev/f593045cc48f
part 6 - Tests for the new Japanese encoding override. r=emk

(In reply to Masatoshi Kimura [:emk] from comment #36)

Phabricator should send a reminder mail, as Bugzilla review requests did.

Thanks for the reviews!

(In reply to Barret Rennie [:brennie] from comment #35)

FYI, some patches from bug 1510569 landed on autoland today that move the OnStateChange handler from WebProgressChild.jsm/RemoteWebProgress.jsm into BrowserChild/BrowserParent.

Landed with this taken into account when rebasing. Thanks.

The feature itself appears to work on Mac and Windows. The new tests that fail on Mac and Windows succeed on Linux64 locally and on try, but the Test Verify mode manages to make them fail even on Linux64 on try. I find this very odd considering that the tests are variations of tests using an existing harness. That is, the tests are variations on this theme: https://searchfox.org/mozilla-central/source/docshell/test/browser/browser_bug234628-1.js

If anyone has insight into what's going wrong, I'm very curious to hear ideas.

Flags: needinfo?(hsivonen)

The current form of the test harness came from bug 967873, which Gijs reviewed, so needinfoing Gijs for ideas.

Flags: needinfo?(gijskruitbosch+bugs)

I don't know. Off-hand, a few points:

  1. we should probably not implicitly make the test depend on the textContent serialization of the HTML doc by asserting the exact index at which we expect to find certain Unicode characters. Use <str>.includes() instead.
  2. adding logging for the full text content should clarify what is actually happening when the current indexOf check is returning -1. Perhaps one of the browserLoaded checks resolves when about:blank is loaded at some point? Perhaps the subframe checks are somehow not working? Perhaps something else. But clearly some funny business is happening so it seems worth adding logging to figure out exactly what that is.
  3. you can run verify yourself with --verify.
  4. I don't understand from the comments here whether this always fails TV or not, and/or if there's any difference between what got pushed to try vs. what landed (like, say, ancestor csets). If so, that may be worth looking at.
Flags: needinfo?(gijskruitbosch+bugs)

(In reply to :Gijs (he/him) from comment #43)

I don't know. Off-hand, a few points:

  1. we should probably not implicitly make the test depend on the textContent serialization of the HTML doc by asserting the exact index at which we expect to find certain Unicode characters. Use <str>.includes() instead.

I think this bit is fine. It's a pretty serious bug if the indices aren't exact.

  2. adding logging for the full text content should clarify what is actually happening when the current indexOf check is returning -1. Perhaps one of the browserLoaded checks resolves when about:blank is loaded at some point? Perhaps the subframe checks are somehow not working? Perhaps something else. But clearly some funny business is happening so it seems worth adding logging to figure out exactly what that is.

Does the test framework make the test assertions and the content being tested run in the same process, or does stuff get proxied with some kind of cross-process wrappers? A recording of the failure suggests that the HTML parser is doing the right thing, yet the test JS sees something else.

  4. I don't understand from the comments here whether this always fails TV or not, and/or if there's any difference between what got pushed to try vs. what landed (like, say, ancestor csets). If so, that may be worth looking at.

TV does not always fail on Linux.

Looks like in the step where the Shift_JIS byte pair is supposed to be decoded as windows-1252, it gets decoded into two REPLACEMENT CHARACTERS.

(In reply to Henri Sivonen (:hsivonen) from comment #45)

Looks like in the step where the Shift_JIS byte pair is supposed to be decoded as windows-1252, it gets decoded into two REPLACEMENT CHARACTERS.

I think Lando mangled the non-UTF-8 parts of the patch and generated REPLACEMENT CHARACTERS as UTF-8 bytes where the Shift_JIS and EUC-JP bytes should have been.
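The mangling mechanism is easy to reproduce: a Shift_JIS byte pair is not valid UTF-8, so any tool that lossily round-trips a patch through UTF-8 replaces each byte with U+FFFD (an illustrative Python sketch):

```python
# Both bytes of the Shift_JIS pair for こ are bare UTF-8 continuation
# bytes, so a lossy UTF-8 decode turns each one into U+FFFD REPLACEMENT
# CHARACTER...
pair = "こ".encode("shift_jis")                  # b'\x82\xb1'
mangled = pair.decode("utf-8", errors="replace")
assert mangled == "\ufffd\ufffd"
# ...and re-encoding writes the UTF-8 bytes of U+FFFD into the file,
# which is why the landed test then saw two replacement characters
# instead of a windows-1252 decoding of the original pair.
assert mangled.encode("utf-8") == b"\xef\xbf\xbd" * 2
```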

I'm going to try relanding this by pushing to inbound.

(In reply to Henri Sivonen (:hsivonen) from comment #44)

  2. adding logging for the full text content should clarify what is actually happening when the current indexOf check is returning -1. Perhaps one of the browserLoaded checks resolves when about:blank is loaded at some point? Perhaps the subframe checks are somehow not working? Perhaps something else. But clearly some funny business is happening so it seems worth adding logging to figure out exactly what that is.

Does the test framework make the test assertions and the content being tested run in the same process, or does stuff get proxied with some kind of cross-process wrappers? A recording of the failure suggests that the HTML parser is doing the right thing, yet the test JS sees something else.

I don't understand the question. There are no longer any "cross process wrappers" (CPOWs) if that's what you mean. The tests run with e10s enabled, so the check functions in the test JS all run in the content process, see https://searchfox.org/mozilla-central/source/docshell/test/browser/head.js#79 .

(In reply to :Gijs (he/him) from comment #47)

There are no longer any "cross process wrappers" (CPOWs) if that's what you mean.

That's what I meant.

Anyway, it looks like this is a case of Lando landing different bytes than I meant to land. (Now the mystery, which I'm not going to pursue, is: Why didn't Linux mochitest-browser-chrome go perma-orange like Mac and Windows after the landing?)

Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/af8569c103d2
part 1 - Vendor shift_or_euc into m-c. r=emk.
https://hg.mozilla.org/integration/mozilla-inbound/rev/f6d04ade73b5
part 2 - Use mozilla::JapaneseDetector in the HTML parser. r=emk.
https://hg.mozilla.org/integration/mozilla-inbound/rev/2d4822860fe7
part 3 - Remove the old Japanese detector from the tree. r=emk.
https://hg.mozilla.org/integration/mozilla-inbound/rev/52b6ea9ab31e
part 4 - Have only one item for Japanese in the Text Encoding menu. r=Gijs,emk.
https://hg.mozilla.org/integration/mozilla-inbound/rev/35bf777b62f7
part 5 - Enable autodetect of ISO-2022-JP for local files when Fallback Encoding is set to Japanese. r=emk.
https://hg.mozilla.org/integration/mozilla-inbound/rev/76dbcd8529c4
part 6 - Tests for the new Japanese encoding override. r=emk.

(In reply to Henri Sivonen (:hsivonen) from comment #48)

Anyway, it looks like this is a case of Lando landing different bytes than I meant to land.

Filed as bug 1556381.

Pushed by csabou@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/beaa9e87cca2
Fix merge conflict bustage, add missing comma. r=bustage-fix CLOSED TREE
Depends on: 1556746
Regressions: 1565888