Closed Bug 627231 Opened 13 years ago Closed 3 years ago

Inconsistent character encoding in different tabs when one of the tabs has been window.open'ed.

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED INVALID

People

(Reporter: kdevel, Assigned: kdevel)

Details

(Whiteboard: DUPEME)

Attachments

(2 files)

User-Agent:       
Build Identifier: Mozilla/5.0 (X11; Linux x86_64; rv:2.0b10pre) Gecko/20110114 Firefox/4.0b10pre

If an explicit charset is missing both in the HTTP response and in the HTML, one expects the page to be rendered assuming ISO-8859-1. After a tab has been created with window.open, FF tends to open any such page in that tab (unless it is already cached with a different encoding) using UTF-8 instead.

Reproducible: Always

Steps to Reproduce:
1. Start FF with new profile (and empty history/cache).
2. Open testcase
3. click on the second link.
4. click on "Weiter zum nächsten Highlight"
Actual Results:  
The page is assigned 'UTF-8'; it contains many instances of the Unicode replacement character (U+FFFD, �) instead of umlauts/'ß'.

Expected Results:  
Assign ISO-8859-1 to the page, render umlauts/'ß' correctly.

You may open additional pages in the first tab using the first link in the testcase.
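
The symptom described under "Actual Results" can be reproduced outside the browser. The following is a minimal Python sketch, assuming the page is delivered as ISO-8859-1 bytes without any charset declaration; the sample string is made up for illustration and is not taken from the testcase.

```python
# Hypothetical sample text; the real testcase content is not reproduced here.
text = "Grüße aus Köln"

# Bytes as a server would send them for an ISO-8859-1 page
# (one byte per character, umlauts and 'ß' are single high bytes).
raw = text.encode("iso-8859-1")

# Decoding with the intended encoding restores the text.
as_latin1 = raw.decode("iso-8859-1")

# A lone high byte is not valid UTF-8, so each one is replaced by U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_latin1)
print(as_utf8)
print(as_utf8.count("\ufffd"), "replacement characters")
```

Each umlaut and the 'ß' turn into one U+FFFD when the bytes are read as UTF-8, which matches what the report describes.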
Attached file testcase
Confirmed on
http://hg.mozilla.org/mozilla-central/rev/e807269acaa3
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b10pre) Gecko/20110119 Firefox/4.0b10pre ID:20110119030331
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → All
Assignee: nobody → smontagu
Component: General → Internationalization
Product: Firefox → Core
QA Contact: general → i18n
Hardware: x86_64 → All
> one expects the page to be rendered assuming ISO-8859-1

One does?  Why?  That really doesn't work for a large chunk of the web (say anything that's not in a Western European language), which is why, if we know who linked to or opened the page, we use the encoding of said opener as the fallback if no other encoding is specified.  There are various other heuristics involved here too, by the way.

The current spec draft on this is http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding and you're interested in steps 6 and 8.  And possibly in the text starting "This algorithm is a willful violation".
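
The order of the fallbacks discussed here can be sketched roughly as follows. This is a simplified illustration in Python, not the browser's actual code: `choose_encoding` and its parameter names are hypothetical, and the real algorithm in the draft has more steps than the three shown.

```python
from typing import Optional, Tuple


def choose_encoding(
    declared: Optional[str],  # charset from the HTTP response or the HTML, if any
    likely: Optional[str],    # e.g. encoding of the opener/linking page or of a previous visit
    default: str,             # user-specified or implementation-defined fallback
) -> Tuple[str, str]:
    """Return (encoding, source) using the fallback order described above."""
    if declared:
        return declared, "declared"
    if likely:
        return likely, "likely"    # roughly step 6 of the draft
    return default, "default"      # roughly step 8 of the draft


# Page opened directly: nothing is known, so the default applies.
print(choose_encoding(None, None, "ISO-8859-1"))
# Page opened from a UTF-8 page: the opener's encoding is the "likely" one.
print(choose_encoding(None, "UTF-8", "ISO-8859-1"))
```

With this ordering, the same undeclared page can end up with different encodings depending on how it was reached, which is the behavior under discussion.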
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID
This issue is about the character set chosen after STR step 4, not the one chosen after step 3. If you open the page of step 4 directly (via the first link) ISO-8859-1 is chosen as expected. If you perform the STR, it is UTF-8. This is the inconsistency.

| 6. If the user agent has information on the likely encoding for this page, e.g.
| based on the encoding of the page when it was last visited, then return that
| encoding, with the confidence tentative, and abort these steps.

The page visible after step 4 has not been visited before. I doubt that the assignment of UTF-8 to the page after step 4 is the effect of an explicit implementation (piece of code) of "likely".

I suppose that the specific tab has become "stateful" with regard to having a preference towards UTF-8. This statefulness is obviously caused by being window.open'ed from a UTF-8 encoded page.

| 8. Otherwise, return an implementation-defined or user-specified default character encoding, 

The user-specified default encoding is ISO-8859-1, not UTF-8 (take a look at Edit | Preferences | Advanced | Character encoding). Unless otherwise stated, I use the unlocalized version of FF, which for as long as I can remember has set the default to ISO-8859-1.

AFAICS it's also not an implementation-defined default, at least not in the classical meaning

    implementation-defined behavior
    unspecified behavior where each implementation documents how the choice is made

since there does not seem to be any such documentation of the observed behaviour.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
> If you open the page of step 4 directly (via the first link) ISO-8859-1
> is chosen as expected.

No, what's chosen is your default fallback charset (and you know this, based on the rest of your comment).  This is a preference you can set in Preferences > Content > Fonts & Colors, click the Advanced button, see the "Default Character Encoding" dropdown.  The default value of the preference depends on the localization; if you're in a Western European or en-US or whatever localization it'll be ISO-8859-1.  But if you're using the Japanese localization, say, it'll be Shift_JIS.

> The page visible after step 4 has not been visited before.

Yes, but that's just an example of where one would get information on the likely encoding.

In this case, we do have information on the likely encoding of the page: the same as the page that opened it or linked to it, if we have no other information.  There are tons of websites out there that rely on this heuristic and would break if it did not happen.

Now there _are_ proposals to not carry this information across origins.  If you prefer, you could find that and mark this bug as duplicate of the bug on that.
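
The proposal is only mentioned in passing and its details are not given in this bug, so the following Python sketch is an assumption about what "not carrying the encoding across origins" might look like. The function names `origin` and `inherited_encoding` and the example URLs are hypothetical.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Default ports, so that http://host/ and http://host:80/ compare equal.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Tuple[str, Optional[str], Optional[int]]:
    """Return the (scheme, host, port) triple of a URL."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def inherited_encoding(
    opener_url: str, opener_encoding: str, target_url: str
) -> Optional[str]:
    """Hand the opener's encoding to the target only if both share an origin."""
    if origin(opener_url) == origin(target_url):
        return opener_encoding
    return None  # cross-origin: fall through to the default instead


print(inherited_encoding("http://a.example/index.html", "UTF-8",
                         "http://a.example:80/next.html"))
print(inherited_encoding("http://a.example/index.html", "UTF-8",
                         "http://b.example/next.html"))
```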
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID
Boris! 

1. Yes. You may set your browser's (user's) default charset to any value != UTF-8 in order to reproduce this issue. Take for example "KOI8-R". If you take the first URL of the testcase you will (as expected) read "Was wirklich zДhlt" on screen. But if you take the route via the second link your screen reads "Was wirklich z�hlt:".

2. "likely encoding"

>> The page visible after step 4 has not been visited before.
> Yes, but that's just an example of where one would get information on the
> likely encoding.

And that is the point: This is as wrong as taking the CPU temperature or any other unrelated piece of information in order to control the encoding of the current page.

Exactly this is the bug! 

The inconsistency comes from the window.open which imposes the UTF-8 on every successive load in this specific tab. And it only happens if the tab has been window.opened.

Honestly, out there is not a single website which relies on this "heuristic". 

3. I could not find any bug which addresses this issue. Do you have a Bug id?
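
The two on-screen results quoted in point 1 can be checked with a short Python sketch. It assumes the page bytes are ISO-8859-1, which is consistent with "ä" showing up as "Д" under KOI8-R; the helper name `render` is hypothetical.

```python
# The heading as ISO-8859-1 bytes ('ä' is the single byte 0xE4).
raw = "Was wirklich zählt".encode("iso-8859-1")


def render(page_bytes: bytes, fallback: str) -> str:
    """Decode an undeclared page with whatever fallback encoding applies."""
    return page_bytes.decode(fallback, errors="replace")


# First link: the user default (KOI8-R in this example) is used.
direct = render(raw, "koi8_r")
# Second link: the tab was window.open'ed from a UTF-8 page.
via_opener = render(raw, "utf-8")

print(direct)
print(via_opener)
```

Byte 0xE4 maps to "Д" in KOI8-R and is invalid on its own in UTF-8, hence the replacement character in the second case.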
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
> unrelated piece of information

The thing is... in practice the encoding of the opening page is NOT unrelated information.  It's very highly correlated with the encoding of the page being opened.  Which is why the heuristic is there.

> And it only happens if the tab has been window.opened.

This is false.  You can get the same behavior with subframes, etc.  Again, it's quite purposeful.

> Honestly, out there is not a single website which relies on this "heuristic". 

Bunk.  Go read the bugs that added the behavior, please.  These sites are pretty common in non-Western locales.  Just because _you_ don't run into such sites doesn't mean you should impose your ethnocentrism on the rest of the world.

> Do you have a Bug id?

If I did, I would have marked this duplicate.  Please search.  You can do that as well as I can.
Assignee: smontagu → kdevel
Whiteboard: DUPEME
> These sites are pretty common in non-Western locales.

This issue is not bound to Western locales. As explained, you can reproduce it with any user-defined charset != UTF-8.
Yes, of course.  Please read what I said.
Attached patch change proposal

We no longer have a configurable fallback.

Status: REOPENED → RESOLVED
Closed: 3 years ago
Resolution: --- → INVALID