Closed Bug 620106 Opened 13 years ago Closed 13 years ago

Auto-detection of character encoding fails

Categories

(Core :: DOM: HTML Parser, defect)

x86
Windows 7
defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla2.0
Tracking Status
blocking2.0 --- final+

People

(Reporter: alice0775, Assigned: hsivonen)

References

Details

(Keywords: regression, Whiteboard: [softblocker])

Attachments

(5 files, 3 obsolete files)

Build Identifier:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b9pre) Gecko/20101217
Firefox/4.0b9pre ID:20101217030324

Auto-detection of character encoding fails if HTML5.enable=true (default).
Test case 1 (padded with a large number of ASCII characters): fails
Test case 2: works

Reproducible: Always

Steps to Reproduce:
1. Start Minefield with new profile
2. Alt > View > Character Encoding > Auto-Detect > Check Japanese or Universal
3. Open Testcase1

Actual Results:
Auto-detection of character encoding fails

Expected Results:
Auto-detection of character encoding works properly

Regression window:
Works:
http://hg.mozilla.org/mozilla-central/rev/5b4d16af6e6b
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.3a5pre) Gecko/20100502 Minefield/3.7a5pre ID:20100502160919
Fails:
http://hg.mozilla.org/mozilla-central/rev/3a7920df7580
Mozilla/5.0 (Windows; U; Windows NT 6.1; WOW64; en-US; rv:1.9.3a5pre) Gecko/20100503 Minefield/3.7a5pre ID:20100503105056
Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=5b4d16af6e6b&tochange=3a7920df7580
This irritates Japanese users.
And since this is a regression, I think it should block.
blocking2.0: --- → ?
Target Milestone: --- → mozilla2.0
Second regression: it fails regardless of the HTML5 parser setting.
Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=dda7c56c17b4&tochange=846890403c24
I guess the following changeset causes this problem:
4a6b283ec78c	Ehren Metcalfe — Remove dead code in intl. Bug 559489, r=smontagu
Blocks: 559489
Summary: Auto-detection of character encoding fails if HTML5.enable=true (default) → Auto-detection of character encoding fails
(In reply to comment #2)
> This irritates Japanese users.

Which sites does this break?

> And since this is a regression, I think it should block.

This is an intentional change. Chardet now sees only the first 1024 bytes. We've managed to have it this way for months and months and this is the first complaint.
(In reply to comment #4)
> (In reply to comment #2)
> > This irritates Japanese users.
> 
> Which sites does this break?
It is a blog, and whether it works depends on the content, which changes day by day. One day detection succeeds and on another day it fails.
http://planet.debian.or.jp/

> 
> > And since this is a regression, I think it should block.
> 
> This is an intentional change. Chardet now sees only the first 1024 bytes.
> We've managed to have it this way for months and months and this is the first
> complaint.
Which bug?
I think there are many pages that have scripts (mostly ASCII characters) at the top of the page.
In such cases Minefield fails auto-detection with high probability.

(In reply to comment #4)
> This is an intentional change. Chardet now sees only the first 1024 bytes.
This assumption (first 1024 bytes) is *not* reasonable.
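
For illustration, here is a minimal sketch of the kind of page this is about (a hypothetical example, not one of the attached testcases): the file is saved as Shift_JIS with no charset declaration, and the first 1024 bytes contain nothing but ASCII markup and script, so a detector limited to that window has nothing to go on.

<!DOCTYPE html>
<!-- Hypothetical sketch, not an attached testcase. The file is saved as
     Shift_JIS and no charset is declared anywhere (no HTTP header, no meta). -->
<html>
<head>
<title>Unlabeled page with a long ASCII prelude</title>
<script>
// Imagine well over 1024 bytes of plain-ASCII script here (analytics,
// ad code, framework boilerplate). A detector that only sees the first
// 1024 bytes finds nothing but ASCII and falls back to its default.
</script>
</head>
<body>
<p>ここから Shift_JIS の日本語本文が始まる。</p>
</body>
</html>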
Assignee: nobody → smontagu
Component: HTML: Parser → Internationalization
QA Contact: parser → i18n
(In reply to comment #5)
> It is a blog, and whether it works depends on the content, which changes day by
> day. One day detection succeeds and on another day it fails.
> http://planet.debian.or.jp/

It's sad that even Debian is running a site without an encoding label. I think some evangelism is needed here.

(In reply to comment #6)
> This assumption(first 1024)bytes is *not* reasonable.

The problem with not limiting chardet to the <meta> sniffing buffer is that enabling chardet could affect script execution, or alternatively pages would not render incrementally. It seems bad to have the chardet setting affect script execution. Wrecking incremental rendering seems even worse.

(The script execution scenario is that first a script runs and has side effects, then chardet decides to change the encoding and reloads the page, and then the script runs again.)
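
(As a concrete sketch of that scenario - hypothetical page and hypothetical URL, not a real site - consider an unlabeled document like the one below. If chardet later changes the encoding and reloads the page, the hit-counter request is sent twice.)

<!DOCTYPE html>
<!-- Hypothetical sketch: no charset declared, so chardet may trigger a reload. -->
<html>
<body>
<script>
  // Side effect: report a page view to the server. On a chardet-triggered
  // reload this script runs again, so the view is counted twice.
  new Image().src = "/hit-counter?page=home";
</script>
<p>以下に Shift_JIS の日本語本文が続く。</p>
</body>
</html>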
(In reply to comment #5)
> Which bug?

There's no bug number. This was part of integrating chardet into the new parser before the new parser was enabled.
(In reply to comment #8)
> (In reply to comment #5)
> > Which bug?
> 
> There's no bug number. This was part of integrating chardet into the new parser
> before the new parser was enabled.

I think that 1024 bytes is too short.
Minefield fails auto-detection:
                                 Minefield 4.0b9pre   Firefox 3.6.13 (also Chrome 8, IE8, Opera 11)
attachment 498505 [details]      NG                    OK
attachment 498506 [details]      OK                    OK
attachment 498725 [details]      NG                    OK
How many bytes do other browsers sniff? Or is the sniffing limit time-based? Do they have a limit? Do they start displaying the file before they commit to an encoding? If they do, how do they deal with scripts?
>How many bytes do other browsers sniff? Or is the sniffing limit time-based? 
>Do they have a limit?
Maybe they (Firefox 3.6.13, IE8, Chrome and Opera) sniff the whole text.

>Do they start displaying the file before they commit to an encoding?
Minefield 4.0b9pre and Firefox 3.6.13 do.
Opera, IE8 and Chrome do not.

>If they do, how do they deal with scripts?
Firefox 3.6.13: the script runs twice.
Minefield 4.0b9pre, Opera, IE8 and Chrome: it runs once.
(In reply to comment #12)
> >How many bytes do other browsers sniff? Or is the sniffing limit time-based? 
> >Do they have a limit?
> Maybe they (Firefox 3.6.13, IE8, Chrome and Opera) sniff the whole text.

It would be interesting to know for sure what IE8, Chrome and Opera do here.

> >Do they start displaying the file before they commit to an encoding?
> Minefield 4.0b9pre and Firefox 3.6.13 do.

How did you test this? Minefield is not supposed to start displaying before chardet has had its chance. Only a late <meta> can cause a reload due to an encoding commitment change.

> Opera, IE8 and Chrome do not.

Interesting.

What evidence is there besides http://planet.debian.or.jp/ that the current limit of 1024 is unsuitable for real-world usage? That is, is this bug report asking for a change because of one site or because of a substantially larger practical problem?
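
(For reference, the "late <meta>" case mentioned above is a page whose charset declaration sits beyond the 1024-byte prescan buffer. A hypothetical sketch, not a real testcase:)

<!DOCTYPE html>
<html>
<head>
<title>Late meta (hypothetical sketch)</title>
<script>
// Imagine more than 1024 bytes of ASCII script or comments here, pushing
// the meta element below the prescan window.
</script>
<!-- Encountered only during full parsing, after the parser has already
     committed tentatively to a fallback encoding, so honoring it means
     reloading the page. -->
<meta charset="Shift_JIS">
</head>
<body>日本語の本文</body>
</html>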
(In reply to comment #13)

> > >Do they start displaying the file before they commit to an encoding?
> > Minefield 4.0b9pre and Firefox 3.6.13 do.
> 
> How did you test this? Minefield is not supposed to start displaying before
> chardet has had its chance. Only a late <meta> can cause a reload due to an
> encoding commitment change.

Insert an onload attribute.
You can check how many alert boxes pop up and which charset is detected.
<body onload="alert(document.charset ? document.charset : document.characterSet);">
Attached file how to check?
Firefox 3.6.x: two alert boxes pop up. First: detection fails. Second: success.
Minefield 4.0.x: one alert box pops up, and detection fails.
Opera 11: two alert boxes pop up. First: detection fails. Second: success.
Chrome 10: one alert box pops up. Detection fails, but the page is displayed correctly.
IE8: one alert box pops up and detection succeeds.
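
(For anyone who wants to try this without the attachment, a minimal self-contained test page along these lines works. This is a hypothetical reconstruction, not the actual attached file: save it as Shift_JIS, declare no charset, pad the script with more than 1024 bytes of ASCII, then count the alerts and check the reported charset.)

<!DOCTYPE html>
<!-- Hypothetical reconstruction of such a test page, not the actual attachment.
     Save as Shift_JIS; no charset declaration anywhere. -->
<html>
<head>
<script>
// Pad this script with more than 1024 bytes of ASCII (comments are enough)
// so the sniffing window contains nothing but ASCII.
</script>
</head>
<body onload="alert(document.charset ? document.charset : document.characterSet);">
日本語のテキスト
</body>
</html>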
Assignee: smontagu → nobody
Component: Internationalization → HTML: Parser
QA Contact: i18n → parser
(In reply to comment #15)

> Firefox 3.6.x: two alert boxes pop up. First: detection fails. Second: success.

Basically confirmed. But is it possible that under some conditions, Firefox 3.6 gets it right on the first try? I seemed to get such results sometimes, but I'm not 100% sure.

> Minefield 4.0.x: one alert box pops up, and detection fails.

Confirmed. 
Another difference from Firefox 3.6 — and also a deviation from HTML5 — is that the Firefox 4 beta falls back to Windows-1252 instead of to Shift_JIS.

> Opera 11: two alert boxes pop up. First: detection fails. Second: success.

Partially confirmed: in version 10.5 on Mac OS X Leopard it only works that way when the active locale is a "non-Windows-1252-is-the-default" locale. (I suppose that versions 11 and 10.5 behave similarly.)

> Chrome 10: one alert box pops up. Detection fails, but the page is displayed correctly.

*Not* confirmed. (I tested with the Japanese localization on Mac OS X.) When using the Japanese localization, Chrome 10 falls back to Shift_JIS. I forgot to test without the Japanese locale, but I believe it then falls back to something else, usually Windows-1252.

(If I remember correctly, Chrome 8 always falls back to Windows-1252.)

> IE8: one alert box pops up and detection succeeds.

(Did not test IE in this round.)

Henri, note that according to HTML5, the default charset for the Japanese locale should be Shift_JIS (Windows-31J).
(In reply to comment #16)
Thanks for testing. For the record, there's an interesting result documented in
http://www.w3.org/mid/20110103193517457018.e08c2045@xn--mlform-iua.no

> But is it possible that under some conditions, Firefox 3.6
> gets it right on the first try?

Firefox remembers the encoding of the page in the history entry, so if your last visit to a page hasn't been purged from history, the initial tentative encoding may differ from a first-visit situation.

> Henri, note that according to HTML5, the default charset for the Japanese
> locale should be Shift_JIS (Windows-31J).

Indeed. That's what Firefox does:
http://mxr.mozilla.org/l10n-central/source/ja/toolkit/chrome/global/intl.properties#26
Component: HTML: Parser → Internationalization
This really needs to be in the HTML parser component, because that's where any fix would occur. jst, do you think this would block?
Component: Internationalization → HTML: Parser
Johnny, can we get a call on this - should it block?
Assignee: nobody → hsivonen
FWIW, I think this shouldn't block given the current level of evidence. The behavior change was a code simplification and, more importantly, a measure that makes page loading work more consistently and with less perf impact with chardet on vs. off.

It was to be expected that there'd be existence proof of a site that'd fail with the simplification. So far, there's reported (and assigned to the right bugzilla component) evidence of *one* such site. If the problem indeed isn't common, making this a blocker would be an overreaction.

Of course, it's *possible* that the problem is indeed more serious, but we just don't have enough nightly testing coverage by people who browse CJK and Cyrillic sites. I think that's unlikely, though. If the sniffing change had seriously broken e.g. the Japanese-language Web, I'd have expected multiple reports to show up within days of turning the HTML5 parser on by default even if we had only a handful of testers browsing Japanese sites. (As opposed to a report of one site turning up over 6 months later.)
Blocking on figuring out what to do here. Henri, it would be great if you could find me on IRC so we can figure things out. I'm currently missing too many details to be able to make a good decision.
blocking2.0: ? → final+
Whiteboard: [softblocker]
This patch makes http://planet.debian.or.jp/ readable albeit with a noticeable flash of misdecoded content on the first visit. (The flash can't be gotten rid of without breaking incremental rendering of sites that don't end up needing chardet. And in any case, whenever a site needs chardet, the site has been misauthored anyway.)
Note that reloading the page in mid-parse can cause problems with legacy Apache servers that implement Content-Encoding: gzip in a bogus way. Dreamhost is known to have such bogus Apache, but maybe Dreamhost isn't hosting too much unlabeled CJK or Cyrillic content for this to be an issue.
While trying to write a mochitest for this, I discovered two things:

 1) It's really hard to write a mochitest for this, because putting the test data in the main test file means it's too late to flip the pref to enable chardet from that file itself but putting the test data in an iframe seems to interfere with the late reload request.

 2) This patch reintroduces buffer boundary-dependent behavior when chardet is enabled, since chardet is fed the data in whatever chunks Necko delivers. That's pretty sad. Compared to the previous version of the patch, this version deliberately ignores some data in the case where a POST response is parsed in order to avoid buffer boundary-dependent results in that case.

sicking, ping? Did you get my email? With the email and this additional information, do you have a decision where to go with this? Shall we leave the code as-is? Shall we land the code in this patch? Or should I try to add complexity and data copying in order to hide the effects of buffer boundaries completely?
Attachment #509399 - Attachment is obsolete: true
(In reply to comment #25)
>  1) It's really hard to write a mochitest for this, because putting the test
> data in the main test file means it's too late to flip the pref to enable
> chardet from that file itself but putting the test data in an iframe seems to
> interfere with the late reload request.

Can you use the framework in extensions/universalchardet/tests? I never got round to writing a HOWTO for it, but it should be pretty self-explanatory.
(In reply to comment #26)
> Can you use the framework in extensions/universalchardet/tests?

Yes! Thank you.

sicking, in case we decide that this approach is what we want, the patch is ready for review. When can we make the decision if this is what we want?
Attachment #509712 - Attachment is obsolete: true
Attachment #510240 - Flags: review?(jonas)
The status here is that sicking and I discussed this and decided to proceed with landing this without further tweaking of the buffering.
Status: NEW → ASSIGNED
Comment on attachment 510240 [details] [diff] [review]
Make the new parser behave like the old one, exact timing of sniffing success depends on buffer boundaries :-(

Looks good. Only nit is change mDontFeedChardet to mFeedChardet to avoid all the double-negatives.

r=me with that change.
Attachment #510240 - Flags: review?(jonas) → review+
(In reply to comment #29)
> Comment on attachment 510240 [details] [diff] [review]
> Make the new parser behave like the old one, exact timing of sniffing success
> depends on buffer boundaries :-(
> 
> Looks good. Only nit is change mDontFeedChardet to mFeedChardet to avoid all
> the double-negatives.

It's that way because chardet's API wants a "don't feed" boolean out param. To flip it, I'd need to add another temporary boolean for the out param instead of passing the field's address.
That's ok. The readability increase everywhere else makes it worth it.
Landed with the field inverted.
http://hg.mozilla.org/mozilla-central/rev/584a654b4be6
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED