Closed Bug 846936 Opened 11 years ago Closed 11 years ago

Non-ASCII characters not displayed correctly in encoding IBM850 (and possibly others?)

Categories

(Core :: DOM: HTML Parser, defect)

19 Branch
defect
Not set
major

RESOLVED WONTFIX
Tracking Status
firefox19 --- affected
firefox20 - ---
firefox21 - ---
firefox22 - ---

People

(Reporter: mrwarper, Unassigned)

Details

(Keywords: dev-doc-needed, regression)

Attachments

(1 file)

A friend reported 'FireFox' does not correctly display non-ASCII characters in some pages. I checked in FireFox 18.0.2 and everything was OK. FF updated itself to v19 and it proceeded to fail as reported.

Try with any page you know is encoded with charset IBM850. I haven't checked other encodings yet.
Component: General → HTML: Parser
Product: Firefox → Core
Target Milestone: Firefox 19 → ---
Regression range:
m-c
good=2012-11-08
bad=2012-11-09
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=36e99ea02c05&tochange=90cea19e27e2

Suspected bug:
Bug 801402 - Use EncodingUtils::FindEncodingForLabel instead of nsCharsetAlias::GetPreferred from HTML5 parser and DOM APIs
Blocks: 801402
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: regression
Chrome doesn't display the URL correctly, either.
(In reply to Masatoshi Kimura [:emk] from comment #2)
> Chrome doesn't display the URL correctly, either.

Yes, I knew about that all along. I started to mention it but it got lost when switching from basic to advanced bug reporting and then I forgot -- after all this isn't «BugChromilla», is it?
(In reply to Alfredo Fernández Díaz from comment #3)
> (In reply to Masatoshi Kimura [:emk] from comment #2)
> > Chrome doesn't display the URL correctly, either.
> 
> Yes, I knew about that all along. I started to mention it but it got lost
> when switching from basic to advanced bug reporting and then I forgot --
> after all this isn't «BugChromilla», is it?

The relevance of the page not working in Chrome is that the page was already not working across browsers. In other words, the page was already broken regardless of Firefox removing support for IBM850.

The removal of IBM850 was intentional. Since Chrome has had market success without support, it was inferred that the Web doesn't depend on that particular encoding.

Clearly, there's now one counter example.

Note that IBM850 was a legacy code page even when the Web was introduced. Microsoft was already replacing it with windows-1252 at that time. It's highly unusual for a Web site to use IBM850.
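To illustrate what the reporter is seeing, here is a minimal sketch of the same bytes decoded with two different code pages. It uses Python's `cp850` and `cp1252` codecs as stand-ins for IBM850 and windows-1252. Which fallback decoder Firefox 19 actually picks for an unsupported label is an assumption here; windows-1252 is used only for illustration.

```python
# Bytes a server would send for "Díaz" when the page is saved as IBM850
# (Python codec name: "cp850").
raw = "Díaz".encode("cp850")

# Decoder matches the declared charset: text comes out intact.
as_ibm850 = raw.decode("cp850")

# Assumed fallback decoder (windows-1252): the accented letter is
# mapped to a different character, because the two code pages assign
# different characters to bytes above 0x7F.
as_windows_1252 = raw.decode("cp1252")

print(raw)
print(as_ibm850)
print(as_windows_1252)
```

ASCII characters are unaffected because both code pages agree on bytes 0x00 to 0x7F, which matches the report that only non-ASCII characters are broken.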

Reporter, is the site maintained by you?
In Safari, the encoding is supported but not listed in the menu.
For reference, we dropped these:

armscii-8
IBM850
IBM852
IBM855
IBM857
IBM862
IBM864
ISO-2022-CN
ISO-8859-12
ISO-IR-111
T.61-8bit
VISCII
x-euc-tw
x-johab
x-mac-arabic
x-mac-ce
x-mac-croatian
x-mac-devanagari
x-mac-farsi
x-mac-greek
x-mac-gujarati
x-mac-gurmukhi
x-mac-hebrew
x-mac-icelandic
x-mac-romanian
x-mac-turkish
x-viet-tcvn5712
x-viet-vps
Triage comment: will wait on response to the question in comment 4 - it's not clear how many people/sites would be impacted here but this looks like it is a tech evangelism and not a release blocking issue.
(In reply to Henri Sivonen (:hsivonen) from comment #4)
...
> The relevance of the page not working in Chrome is that the page was already
> not working across browsers. In other words, the page was already broken
> regardless of Firefox removing support for IBM850.
> 
> The removal of IBM850 was intentional. Since Chrome has had market success
> without support, it was inferred that the Web doesn't depend on that
> particular encoding.

In other words, Chrome trusts its own momentum not to support anything they don't feel like and they've gotten away with it so far, so we thought we'd do the same.

If a page isn't tweaked and re-tweaked so it looks the same across browsers, then it is broken? I don't think so, and I remember a time when people at Mozilla would have agreed. A page not being properly rendered when the right encoding is explicitly set doesn't mean the page is broken, it means *the browser is broken*, no matter what it's called or how you spin it.

As for how relevant a particular page is, the question misses the point -- removing support for something and not even mentioning it in the release notes (a tiny but key detail) is a matter of principles.
 
> Clearly, there's now one counter example.
> 
> Note that IBM850 was a legacy code page even when the Web was introduced.
> Microsoft was already replacing it with windows-1252 at that time. It's
> highly unusual for a Web site to use IBM850.

IBM850 and 437 are the defaults for pretty much any Windows text mode session.
 
> Reporter, is the site maintained by you?

Yes, so it's not a problem for me to re-encode the stuff there. However, it will be the first time in history I've tweaked a website to accommodate 'standards-compliant' browsers and not Internet Explorer. The customer and owner gave me an irate call complaining how it was possible that a 'standards advocate' had made something that only looked right in IE. Good for laughs :)
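For anyone else in the same position, re-encoding could look roughly like the sketch below. The function name `reencode_to_utf8` and its parameters are hypothetical, and it assumes the source file really is IBM850 (Python codec `cp850`).

```python
from pathlib import Path


def reencode_to_utf8(src: Path, dst: Path, source_encoding: str = "cp850") -> None:
    """Read src as source_encoding and write the same text to dst as UTF-8.

    Hypothetical helper for illustration; source_encoding must match how
    the file was actually saved, or the output will contain wrong characters.
    """
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))
```

The charset declared in the page itself (HTTP header or `<meta>` element) has to be updated separately so that it matches the new bytes.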
(In reply to Lukas Blakk [:lsblakk] from comment #7)
> Triage comment: will wait on response to the question in comment 4 - it's
> not clear how many people/sites would be impacted here but this looks like
> it is a tech evangelism and not a release blocking issue.

Certainly not, especially if we keep in mind the issue arose post-release.

No impact as far as I am concerned. Still, from here, it seems a bad move to drop previously working code -- but as I said I think it's a matter of principles, so if I were you I wouldn't wait for me either.

I'm not sure who's preaching what Gospel, but you're right to make a biblical reference, for I could say scales just fell from my eyes, and now I can see :)

P.S. I forgot to thank Henri for the reference of newly unsupported encodings.
Since the site in question can work around the problem and we have not seen much impact on other sites, we are not tracking this for release. If those circumstances change we can revisit.
(In reply to Alfredo Fernández Díaz from comment #8)
> Yes, so it's not a problem for me to re-encode the stuff there.
...
> No impact as far as I am concerned.

Thank you.

> Still, from here, it seems a bad move to
> drop previously working code

That depends on how much the code is in actual use. It is possible that you are the only (or almost only) person in the world who authored a Web page in an old DOS encoding and knew how to declare it.

I did a large number of Bugzilla searches for words like IBM, DOS, charset, character, encoding, code page, accented, umlaut, arabic, turkish, croatian, gujarati, gurmukhi, devanagari, hindi, indian, hebrew, VISCII, vietnamese, serbian, macedonian, bulgarian, french, german, spanish, armscii, armenian, etc. and found no other bugs filed about desupporting the encodings listed in comment 6.

> The customer and owner gave me an irate call complaining how it was possible 
> that a 'standards advocate' had made something that only looked right in IE. 
> Good for laughs :)

I'm sorry that this caused a problem with your customer. For standards advocacy, I recommend advocating the use of UTF-8.
Alfredo, for what it's worth, the standard Gecko is trying to follow here is http://encoding.spec.whatwg.org/ which is indeed not completely compatible with previous deployments of Gecko, but we believe following it will be better for the health of the web long term.

Now granted, in developing that document not all the trade-offs might have been correct, so any feedback you have is definitely appreciated.
(In reply to Anne (:annevk) from comment #12)
> Alfredo, for what it's worth, the standard Gecko is trying to follow here is
> http://encoding.spec.whatwg.org/ which is indeed not completely compatible
> with previous deployments of Gecko, but we believe following it will be
> better for the health of the web long term.
> 
> Now granted, in developing that document not all the trade offs might have
> been correct so any feedback you have is definitely appreciated.

Anne Van Kesteren?!?! Wow! And you're a Mozillian now. Double wow! :)

Unfortunately I don't have that much to add. I understand not every encoding can be supported, especially if they don't have much of a presence on the web. But given we're talking about encodings that were previously supported, I wonder what the removal really aims at.

I have been in positions where keeping support for stuff was causing headaches and it was far simpler and easier to just remove it (so there was a reason), but in such cases I always directed a warning to end users and set a reasonable phase-out period. Doing otherwise --and I know it for a fact-- would have only caused major trouble.

Now I may have missed both the reasons and such an announcement in this case (I'll be glad to be pointed to them, but I found none following the regular users' path), but my point is, if you do something like this without warning you'll be lucky if all you get is my reaction (as a tech-type I may be quicker finding who's to blame but that's it). Believe me, I really hope you're that lucky ;)

(In reply to Henri Sivonen (:hsivonen) from comment #11)
...
> It is possible that you
> are the only (or almost only) person in the world who authored a Web page in
> an old DOS encoding and knew how to declare it.

Oh, I know of at least two others, but of course they're mates of mine. It is possible we are (almost) the only people in the world who can do a lot of stuff right and still get punished for it. So sad... 
 
...
> I'm sorry that this caused a problem with your customer.

Oh, it was just a call, and I quickly pointed out the real cause, so they're not blaming me. As I said, good for laughs.

> For standards advocacy, I recommend advocating the use of UTF-8.

While I'd agree for the most part, I prefer not to waste space using multiple-byte or variable-length encodings whenever possible.
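To put a number on the space argument, here is a small sketch comparing encoded sizes. The sample strings and the helper name `encoded_sizes` are made up for illustration; real pages will have a different mix of markup (pure ASCII) and accented text.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (bytes as IBM850, bytes as UTF-8) for the given text."""
    return len(text.encode("cp850")), len(text.encode("utf-8"))


# Illustrative Spanish samples; every character exists in IBM850.
samples = ["Alfredo Fernández Díaz", "El pingüino comió piña"]

for sample in samples:
    single_byte, utf8 = encoded_sizes(sample)
    print(f"{sample!r}: cp850={single_byte} bytes, utf-8={utf8} bytes")
```

In UTF-8 each accented Latin letter here costs two bytes instead of one, while ASCII characters, including all HTML markup, stay at one byte.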
> IBM850 and 437 are the defaults for pretty much any Windows text mode session.
FYI, each default system locale on Windows has an OEMCP and an ACP. With an ACP of 1252, yes, an OEMCP of 437 or 850 is typical. Other ACPs typically have different OEMCPs. DBCS locales typically have ACP == OEMCP.
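For anyone who wants to check their own system, the two values can be queried through the Win32 functions `GetACP` and `GetOEMCP`. The sketch below calls them via `ctypes`; the wrapper name `windows_code_pages` is hypothetical, and it simply returns `None` on non-Windows platforms.

```python
import sys


def windows_code_pages():
    """Return (ACP, OEMCP) on Windows, or None on other platforms."""
    if sys.platform != "win32":
        return None
    import ctypes  # ctypes.windll only exists on Windows

    kernel32 = ctypes.windll.kernel32
    # Both functions take no arguments and return the code page number.
    return kernel32.GetACP(), kernel32.GetOEMCP()


print(windows_code_pages())
```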
(In reply to Yuhong Bao from comment #14)
> FYI, each default system locale on Windows has an OEMCP and an ACP. With a
...
> have different OEMCPs. DBCS locales typically have ACP == OEMCP.

Your point being *not all* Windows text boxes are set to CP850/437, or...?

Sure, CJK systems had standard CPs set long ago by industry and not as many backwards compatibility issues with DOS apps, so just for this once Microsoft spared us from coming up with yet even more code pages... thank God for that.
I searched for duplicates again and didn't find any. Looks like we can get away with not exposing these legacy encodings to the Web.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
I don't understand why it is so important to remove support for any code page. Is it too difficult to leave them in Firefox? In fact, the characters in the removed code pages are no more than a subset of Unicode. There are few pages in ibm850, but there are still some, and I think it would be nice to keep supporting them.

I repeat, it isn't that difficult.
dgimeno, the idea is to get all browsers aligned on a common standard. That standard does not include ibm850. If you have a list of pages this breaks that would help.
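As a rough sketch of what "not included" means in practice: the Encoding Standard maps a fixed set of labels to encodings, matching after stripping ASCII whitespace and ignoring ASCII case, and anything outside that set fails the lookup. The table below is heavily abridged and the function name `get_encoding` is illustrative; it is not Gecko's actual `EncodingUtils::FindEncodingForLabel` implementation.

```python
# Abridged label table, based on the Encoding Standard. The real table is
# much longer; it has no entry for "ibm850".
LABELS = {
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "unicode-1-1-utf-8": "UTF-8",
    "windows-1252": "windows-1252",
    "cp1252": "windows-1252",
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "ascii": "windows-1252",
    "ibm866": "IBM866",
    "cp866": "IBM866",
}

ASCII_WHITESPACE = "\t\n\x0c\r "


def ascii_lower(s: str) -> str:
    """Lowercase only A-Z, leaving non-ASCII characters untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def get_encoding(label: str):
    """Return the encoding name for a label, or None if it is not recognized."""
    return LABELS.get(ascii_lower(label.strip(ASCII_WHITESPACE)))


print(get_encoding(" UTF8 "))
print(get_encoding("latin1"))
print(get_encoding("IBM850"))
```

A label that returns no match is treated as if no encoding had been declared, which is why the declared `IBM850` pages no longer decode as intended.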
That's sugar-coating things a bit. The idea seems to be forcing all web pages to use one of a reduced set of encodings. Nothing against that per se, but it would have been nice to read about it in the release notes at the time (a year ago). Apparently it's never too late to document changes.
If that happened I'm sorry. It must have slipped through. We definitely want to document this.

teoli, any idea with regards to documentation around encoding support?
Keywords: dev-doc-needed
Flags: needinfo?(jypenator)
What further information is needed? Support for several character encodings was removed in the interim between FireFox v18.0.2 and v19, and it wasn't documented in a way visible to normal users (https://www.mozilla.org/en-US/firefox/19.0/releasenotes/), if at all. Maybe I didn't spell it out like this, but I mentioned it anyway a year ago -- comments #8 and #13.

If people need to read anything besides https://www.mozilla.org/en-US/firefox/no./releasenotes/ it would be nice to know. If I had read 'removed support for encodings X Y Z' there at the time I may not have liked it, but I wouldn't have filed a bug.
annevk, I think the common standard does include ibm850. http://www.iana.org/assignments/character-sets/character-sets.xhtml

This page is coded in ibm850.

Beyond this, I repeat that I think it is really odd to remove something that was already built. Simply that.
Sorry, the page I meant is this one: http://sima.cat/vcatcont.php
Flags: needinfo?(jypenator)