Closed Bug 742806 Opened 12 years ago Closed 12 years ago

Incorrect character encoding used even though meta tag is present

Product/Component: Core :: DOM: HTML Parser
Type: defect
Priority: Not set
Severity: normal
Status: RESOLVED DUPLICATE of bug 664743
Reporter: mark
Assignee: Unassigned
Keywords: regression

I tried to use Yahoo Babelfish with UTF-8 content (Cyrillic)
http://babelfish.yahoo.com/

The output was shown as raw UTF-8 HTML elements instead of Cyrillic.

Further investigation showed:
Meta tag in the page header: <meta http-equiv="content-type" content="text/html; charset=UTF-8">
Character encoding selected by Firefox: Western ISO-8859-1

The 9 branch still handled this correctly; the 11 branch does not.
I just looked a little closer at the headers there, apparently babelfish has TWO meta tags there for content-type:

meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"
and below that
meta http-equiv="content-type" content="text/html; charset=UTF-8"

So apparently in previous versions it would use the LAST one encountered (which seems logical?) and the current version stops checking after the FIRST one, and ignores anything after that.
http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#meta-charset-during-parse
says we should "change the encoding" when seeing the second one.

In this case, I guess that means step 5 here:
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#change-the-encoding

Is this related to bug 61363?
Severity: critical → normal
Component: General → HTML: Parser
OS: Windows 7 → All
Product: Firefox → Core
QA Contact: general → parser
Hardware: x86_64 → All
Version: 11 Branch → unspecified
Henri?
Regression window
Detected UTF-8:
http://hg.mozilla.org/mozilla-central/rev/a8f07cad55e2
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a1) Gecko/20110508 Firefox/6.0a1 ID:20110508030616
Detected ISO-8859-1:
http://hg.mozilla.org/mozilla-central/rev/b6a4eb4ec5d4
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a1) Gecko/20110507 Firefox/6.0a1 ID:20110508032326
Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=a8f07cad55e2&tochange=b6a4eb4ec5d4

Triggered by:
c3081b5db3d1	Ms2ger — Bug 572652 - Remove the Accept-Charset header from HTTP requests. r=bz

As far as I remember, Bug 572652 was backed out of 6, 7, 8 and 9.
So, this problem occurs in Firefox 10 and later.
Blocks: 572652
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: regression
(In reply to Mark Straver from comment #1)
> I just looked a little closer at the headers there, apparently babelfish has
> TWO meta tags there for content-type:
> 
> meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"
> and below that
> meta http-equiv="content-type" content="text/html; charset=UTF-8"
> 
> So apparently in previous versions it would use the LAST one encountered
> (which seems logical?) and the current version stops checking after the
> FIRST one, and ignores anything after that.

Firefox uses the first declaration that declares an encoding that Firefox recognizes. This is what the spec says. This is what makes sense for performance.

As I understand it, previously, Babelfish served different content to Firefox compared to what it serves now, because Babelfish is sensitive to the Accept-Charset header that Firefox stopped sending. The <meta> handling hasn't changed in the way you guess.

(In reply to Mats Palmgren [:mats] from comment #2)
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#meta-charset-during-parse
> says we should "change the encoding" when seeing the second one.

Only if confidence is "tentative". The first meta has made the parser confident already.
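
For illustration only, the "first recognized declaration wins" behavior described above can be modeled roughly as follows. This is a simplified sketch, not Gecko's code and not the full spec algorithm: the function name `pick_encoding`, the regular expression, the `RECOGNIZED` label set, and the fallback value are all assumptions made up for this example.

```python
import re

# Hypothetical, deliberately tiny set of labels this toy model "recognizes".
RECOGNIZED = {"utf-8", "iso-8859-1"}

# Rough pattern for a charset declaration inside a <meta> tag.
# A real HTML parser does not work with a regex; this is only a model.
META_CHARSET = re.compile(
    r'<meta\b[^>]*\bcharset\s*=\s*["\']?([A-Za-z0-9_-]+)', re.IGNORECASE
)

def pick_encoding(html, fallback="iso-8859-1"):
    """Return (encoding, confidence) under a first-recognized-meta-wins model."""
    encoding, confidence = fallback, "tentative"
    for match in META_CHARSET.finditer(html):
        label = match.group(1).lower()
        # Only a tentative encoding may still change; once a recognized
        # declaration has been seen, later declarations are ignored.
        if confidence == "tentative" and label in RECOGNIZED:
            encoding, confidence = label, "certain"
    return encoding, confidence

# The situation from this bug: two conflicting declarations in one page.
page = (
    '<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">'
    '<meta http-equiv="content-type" content="text/html; charset=UTF-8">'
)
print(pick_encoding(page))
```

In this model the second (UTF-8) declaration has no effect, because the first one already moved the confidence from "tentative" to "certain".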

The behavior here is not a Gecko bug. I'd move this over to evangelism, but we already have an open evangelism bug for this issue on Babelfish, so marking as duplicate.

> Is this related to bug 61363?

Only to the extent both bugs relate to encoding sniffing.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
> Babelfish is sensitive to the Accept-Charset header that Firefox stopped sending.

Maybe I'm missing something here, but why exactly did Firefox stop sending this header?
(In reply to Mark Straver from comment #6)
> > Babelfish is sensitive to the Accept-Charset header that Firefox stopped sending.
> 
> Maybe I'm missing something here, but why exactly did Firefox stop sending
> this header?

It was a waste of bytes except on a couple of Yahoo! properties. Now it's a waste of bytes everywhere except on Babelfish.

(We don't do what other browsers do and vary behavior for specific broken top sites.)
Haha! Perfectly clear. IOW: Babelfish/Yahoo needs to get their sites in order, is what you are saying.

It makes me wonder though: isn't this supposed to be part of the normal specification? If accept-charset is NOT sent, doesn't that make the encoding tentative by default, as in this case the server doesn't know what the browser accepts? I think in this case the babelfish server sends a default encoding unless it realizes that for the target language, UTF-8 (or a different encoding) is required, and assumes that, lacking anything conclusive in the headers received, the encoding will be interpreted as tentative by the browser and can therefore be changed on-the-fly?
(In reply to Mark Straver from comment #8)
> It makes me wonder though: isn't this supposed to be part of the normal
> specification? 

Sending Accept-Charset isn't mandatory per spec.

> If accept-charset is NOT sent, doesn't that make the encoding
> tentative by default, 

"Tentative" is a state in the rules that the browser uses for determining the encoding of an HTML document given the HTTP response headers and the payload (arriving over time). The request headers have nothing to do with "tentative" here.

> as in this case the server doesn't know what the
> browser accepts? 

Any reasonable server should expect the browser to accept UTF-8. Being able to vary the served encoding is a useless capability these days. Any server that has that capability should use it to always serve UTF-8. All browsers accept UTF-8 and have done so for years and years.
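
As a minimal sketch of the "always serve UTF-8" advice, assuming a server that builds its own response headers: the function name `build_response` and its parameters are hypothetical and not taken from Babelfish or any real framework. The request headers are accepted only to show that they are deliberately ignored.

```python
def build_response(html_text, request_headers=None):
    """Encode the page as UTF-8 and declare that encoding in Content-Type.

    request_headers (including any Accept-Charset) is intentionally unused:
    the served encoding does not vary per client in this sketch.
    """
    body = html_text.encode("utf-8")
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(body)),
    }
    return headers, body

# Cyrillic content is served the same way with or without Accept-Charset.
headers, body = build_response("<p>Привет</p>", {"Accept-Charset": "ISO-8859-1"})
print(headers["Content-Type"], len(body))
```

With the encoding declared once in the response header and the body encoded to match, the page has no need for conflicting <meta> declarations.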

> I think in this case the babelfish server sends a default
> encoding unless it realizes that for the target language, UTF-8 (or a
> different encoding) is required, and assumes that, lacking anything
> conclusive in the headers received, the encoding will be interpreted as
> tentative by the browser and can therefore be changed on-the-fly?

No, that's not how it works. The first meta makes the encoding no longer tentative.
(In reply to Henri Sivonen (:hsivonen) from comment #9)
> Sending Accept-Charset isn't mandatory per spec.
Check!
> "Tentative" is a state in the rules that the browser uses for determining
> the encoding of an HTML document given the HTTP response headers and the
> payload (arriving over time). The request headers have nothing to do with
> "tentative" here.
Awesome, crystal clear and thanks for taking the time to explain! :)

> No, that's not how it works. The first meta makes the encoding no longer
> tentative.
Yup, got it now. It's explicitly stated in the first meta tag, so Gecko takes that as a hard definition of the character set to use for the page. Babelfish should simply not send the first tag to make sure their page doesn't break.
Unfortunately, until they fix this, it DOES mean that I can't use my favorite browser to use their service. I guess I'll have to stick to an older fox or a different browser for translation work.
(In reply to Mark Straver from comment #10)
> I guess I'll have to stick to an
> older fox or a different browser for translation work.

Or you could use Google Translate.