Closed Bug 431054 Opened 13 years ago Closed 13 years ago

Auto-Detect of character encoding is failed on some Japanese page

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

RESOLVED FIXED

People

(Reporter: masayuki, Assigned: smontagu)

Details

(Keywords: intl, jp-critical, regression)

Attachments

(3 files, 1 obsolete file)

With the default settings of the Japanese Fx build:
intl.charset.default = Shift_JIS
intl.charset.detector = ja_parallel_state_machine

The testcase ( http://bugzilla.mozilla.gr.jp/attachment.cgi?id=3762 ) is detected as "windows-1252".

On Fx2, detection succeeded.

If intl.charset.detector is set to Japanese, a Western encoding should not be chosen by Auto-Detect.

smontagu: Do you know which bug might have caused this one?
Flags: blocking1.9?
This is presumably a regression from bug 426271 or bug 424916.
Status: NEW → ASSIGNED
What's the user experience impact of this?
2008032504:
ja_parallel_state_machine works fine.

After Bug 424916 landed (from 2008032604 to 2008040804):
ja_parallel_state_machine broken.

After Bug 426271 landed (from 2008040904):
ja_parallel_state_machine works generally, but fails for some cases.

Since Bug 426271 landed, ja_parallel_state_machine is actually no longer used.
The universal detector with a language filter is used instead. (Bug 426271 Comment 8)

The universal detector sometimes misdetects Japanese as Western (ISO-8859-1 or Windows-1252).
This was not a big problem before, because the default setting of the Japanese l10n build is ja_parallel_state_machine, not the universal detector.
But now it is a problem for Japanese users, because the universal detector is used.

Ideally, the universal detector should be able to completely distinguish Japanese from Western, but that is too difficult.
How about removing Western from the language filter when auto-detecting Japanese?

(In reply to comment #2)
> What's the user experience impact of this?

For some Japanese web pages, Japanese Fx trunk with default settings shows a garbled page, even though Japanese Fx2 shows the page correctly.
To see the page correctly, the user must select the encoding manually.
Attached patch patch v1.0 (obsolete) — Splinter Review
This patch removes Latin1Prober when NON_CJK is not selected.

This patch works fine for me.
But I'm not sure this change is acceptable for all users.
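
For readers following along, the idea behind patch v1.0 can be sketched roughly as below. This is an illustration in Python, not the actual universalchardet C++ code: the LanguageFilter flags, the prober names, and select_probers are all hypothetical names chosen for the sketch. Only the rule itself (no Latin1 prober unless NON_CJK is selected) comes from the description above.

```python
from enum import Flag, auto


class LanguageFilter(Flag):
    """Hypothetical language-filter bits, loosely modelled on the bug discussion."""
    JAPANESE = auto()
    KOREAN = auto()
    CHINESE = auto()
    NON_CJK = auto()


def select_probers(lang_filter):
    """Return the names of the probers that would run for a given filter.

    The prober names are placeholders; the point is that the Latin1
    prober is only added when the NON_CJK bit is set.
    """
    probers = []
    if lang_filter & LanguageFilter.JAPANESE:
        probers += ["SJIS", "EUC-JP"]
    if lang_filter & LanguageFilter.KOREAN:
        probers.append("EUC-KR")
    if lang_filter & LanguageFilter.CHINESE:
        probers += ["GB2312", "Big5"]
    if lang_filter & LanguageFilter.NON_CJK:
        probers.append("Latin1")
    return probers


print(select_probers(LanguageFilter.JAPANESE))
print(select_probers(LanguageFilter.JAPANESE | LanguageFilter.NON_CJK))
```

With only JAPANESE selected, no Western candidate competes with the Japanese probers, which is also why a genuine Latin-1 page could then never be detected.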
I think the question is whether this regression occurs in real-world web pages, or only in minimized examples like the testcase here. Attachment 318361 [details] [diff] will mean that Latin-1 pages are never detected when "Japanese" is set as auto-detect option, which I think will also be a regression from Firefox 2 in some cases.

On the other hand, if this is a problem in real-world pages then we need to decide which regression is best for our users. As comment 3 says, the detector is never going to give perfect results in all cases.
Comment on attachment 318361 [details] [diff] [review]
patch v1.0

Simon, does this patch address/fix the problem?
Attachment #318361 - Flags: review?(smontagu)
I need an answer to the question in comment 5 before I can review this. Does the regression occur in real-world pages?
Steps to reproduce in English environment

1. Set View > Encoding > Auto-detect to Japanese
2. In Tools > Options > Content panel click Advanced in the Fonts group
3. Set the default charset to Shift_JIS
4. Click on the URL link

Result: user sees corrupted text

Shift_JIS ‚ÌƒeƒXƒg

Should be:

Shift_JIS のテスト
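
The corruption can be reproduced outside the browser. The following is a minimal Python sketch, assuming the testcase body is the string above encoded as Shift_JIS, and using Python's cp1252 codec as a stand-in for windows-1252. The variable names are only for illustration.

```python
# Text as the page author intended it.
intended = "Shift_JIS のテスト"

# Bytes as served: the page is Shift_JIS and has no charset declaration.
raw = intended.encode("shift_jis")

# What the user sees if the detector wrongly picks windows-1252.
garbled = raw.decode("cp1252")

# What the user sees once the encoding is chosen correctly.
recovered = raw.decode("shift_jis")

print(raw.hex())
print(garbled)
print(recovered)
```

The ASCII prefix survives either way, since both encodings agree on those bytes; only the Japanese part is damaged.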

The settings above reflect the default settings for the Japanese localization of Firefox, which means that any Japanese Firefox user will see corrupted text on any site that

(1) doesn't include a charset meta tag
(2) the first text on the page is Latin text

In Japan (1) is *very* common but (2) is probably not so.  However, this problem will affect lots of sites.

I'm not sure disabling Latin text detection is the best answer though.  

We definitely need to block on this.
Severity: normal → major
(In reply to comment #8)
> any Japanese Firefox user will see corrupted text
> on any site that
> 
> (1) doesn't include a charset meta tag
> (2) the first text on the page is Latin text

This isn't correct. After talking this over with John on IRC, I realized what is happening here: the GetConfidence methods in extensions/universalchardet/src/base/CharDistribution.cpp and extensions/universalchardet/src/base/JpCntx.cpp have a minimum threshold of 4 characters, so they just bail out on the testcase here. Adding one more hiragana character to the testcase makes it detected correctly.
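
To make the threshold behaviour concrete, here is a rough Python sketch. It is not the code in CharDistribution.cpp or JpCntx.cpp: the constant names, the two confidence values, and the way characters are counted are assumptions made for illustration. The only things taken from the comment above are the threshold of 4 and the observation that one extra hiragana character changes the outcome.

```python
MINIMUM_DATA_THRESHOLD = 4  # threshold quoted in the comment; name is assumed
SURE_NO = 0.01              # placeholder "not enough evidence" confidence
PLACEHOLDER_YES = 0.99      # placeholder for "enough data was seen"


def count_non_ascii(text):
    """Count characters outside ASCII (a stand-in for 'Japanese characters seen')."""
    return sum(1 for ch in text if ord(ch) > 0x7F)


def get_confidence(text):
    """Bail out with a low confidence unless more than the threshold was seen."""
    if count_non_ascii(text) <= MINIMUM_DATA_THRESHOLD:
        return SURE_NO
    # A real detector would compute a ratio here; this sketch only
    # distinguishes "too little data" from "enough data".
    return PLACEHOLDER_YES


print(get_confidence("Shift_JIS のテスト"))    # 4 Japanese characters
print(get_confidence("Shift_JIS のテストだ"))  # 5 Japanese characters
```

Under these assumptions the testcase, which has exactly four Japanese characters, never gets past the early return, so any other candidate with a higher confidence wins.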

> We definitely need to block on this.

Based on the above, I'm not convinced. We might experiment with reducing the threshold for Japanese when Japanese autodetection is selected, etc., but I would target that for the next point release.
Severity: major → normal
This testcase works for me.
Not blocking as per comment 9 ... Masayuki, is it common to have fewer than 4 characters on which to make an auto detection judgement? If so, renominate.
Flags: blocking1.9? → blocking1.9-
(In reply to comment #11)
> Not blocking as per comment 9 ... Masayuki, is it common to have fewer than 4
> characters on which to make an auto detection judgement? If so, renominate.

OK, this should not be a blocker.
Attached file Testcase for bad case.
This file is written in EUC-JP,
but Firefox recognizes it as Windows-1252.
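
One reason a Western guess is hard to rule out for files like this: the bytes of ordinary EUC-JP text are also valid windows-1252, so byte validity alone cannot reject the Western candidate. A small Python sketch of that point follows; the sample string and the helper name are made up for illustration, and Python's cp1252 codec stands in for windows-1252.

```python
def decodes_as_cp1252(raw):
    """Return True if the bytes decode under cp1252 without an error."""
    try:
        raw.decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample; any text made of JIS X 0208 characters behaves the same way.
sample = "日本語のテスト".encode("euc_jp")

# Every byte of these two-byte EUC-JP characters is 0xA1 or above.
print(all(b >= 0xA1 for b in sample))

# All of those byte values are assigned in cp1252, so decoding succeeds.
print(decodes_as_cp1252(sample))
```

The detector therefore has to rely on statistical confidence rather than on outright decoding failures to tell the two apart.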
Attachment #330561 - Attachment mime type: text/html → text/html; charset=
Attached patch Patch
Attachment #318361 - Attachment is obsolete: true
Attachment #345167 - Flags: review?(VYV03354)
Attachment #318361 - Flags: review?(smontagu)
Comment on attachment 345167 [details] [diff] [review]
Patch

r=me
This patch makes the detector better than Fx2 in some cases.
Attachment #345167 - Flags: review?(VYV03354) → review+
http://hg.mozilla.org/mozilla-central/rev/f1bb0f862b5b
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Flags: wanted1.9.0.x?
Flags: in-testsuite+
Flags: wanted1.9.0.x?