799917 - Make document.characterSet always lowercase

Aryeh Gregor (:ayg) (no longer with Mozilla)

Reporter

Description

•

13 years ago

This is what the specs require right now, and it seems to match IE/Opera. Gecko/WebKit use mixed-case, like "UTF-8" and "Shift_JIS" instead of "utf-8" and "shift_jis". Since IE does it, it's probably web-compatible, and Anne says he'd know if Opera ran into any issues. hsivonen said to ping smontagu and maybe bz, so consider yourselves pinged. I think this is worth trying, because there's no way we'll get interop here otherwise, but obviously we need to keep a close eye out for compat in case IE goes down different codepaths.

Aryeh Gregor (:ayg) (no longer with Mozilla)

Reporter

Comment 1

•

13 years ago

This causes us to fail tests in <http://w3c-test.org/webapps/DOMCore/tests/approved/Node-properties.html>, which should be imported into our repo relatively soon. They probably aren't comprehensive enough to mark this in-testsuite+, though.

Anne (:annevk)

Updated

•

13 years ago

URL: http://encoding.spec.whatwg.org/

Boris Zbarsky [:bzbarsky]

Comment 2

•

13 years ago

I don't think I'm opposed to this particularly, as long as you check the uses of this property in our chrome and extensions.

Boris Zbarsky [:bzbarsky]

Comment 3

•

13 years ago

And in our C++. And in particular, if you make this change, watch out for consumers that treat the value returned by .characterSet as a canonical charset name.

Aryeh Gregor (:ayg) (no longer with Mozilla)

Reporter

Comment 4

•

13 years ago

Per the spec, the canonical names *are* all lowercase. I haven't looked at the code at all, but it seems simplest to just change our internal canonical names to lowercase, no? And grep for references to the old ones in the codebase, and lowercase those too.

Boris Zbarsky [:bzbarsky]

Comment 5

•

13 years ago

> but it seems simplest to just change our internal canonical names to lowercase, no? Maybe. You'd have to check with Simon on that. For one thing, it's likely to affect how things display in the UI, which may not be desirable. In general, from the point of view of not causing regressions, that sounds like the most dangerous approach...

Henri Sivonen (:hsivonen) (not reading bugmail until 2026-07-27)

Comment 6

•

13 years ago

(In reply to Boris Zbarsky (:bz) from comment #5) > For one thing, it's likely > to affect how things display in the UI, which may not be desirable. How? AFAICT, the labels we show in the UI are not the Gecko-canonical names. (UI says “Windows-1252” but Gecko-canonical is “windows-1252”.)

Boris Zbarsky [:bzbarsky]

Comment 7

•

13 years ago

Ah, if the UI is not showing the Gecko-canonical ones, that makes things much simpler. Again, I'm not opposed to this change. I just want it to be done in such a way that nothing breaks in the process.

Masatoshi Kimura [:emk]

Comment 8

•

13 years ago

The safest approach would be using EncodingUtils::FindEncodingForLabel from HTML parser. It will always return a lowercased encoding name.

Masatoshi Kimura [:emk]

Comment 9

•

13 years ago

And only C++ callers would be affected because GetUnicode(En|De)coderRaw is not scriptable.

Aryeh Gregor (:ayg) (no longer with Mozilla)

Reporter

Comment 10

•

13 years ago

Attached patch Patch, just changes GetCharacterSet — Details — Splinter Review

So this is the easy way out -- it only changes .characterSet. This will also change .inputEncoding, which we apparently want to get rid of. And apparently nsIDocument::GetXMLDeclaration. Those are the only C++ callers, according to DXR. Should I change GetXMLDeclaration to use GetDocumentCharacterSet or something instead? I don't know what it's used for. There are probably more tests than this that will break, since I looked pretty superficially. Try run (Linux64 debug only): https://tbpl.mozilla.org/?tree=Try&rev=cdd1d87a0d57 Anything else you want me to do here, or is this okay and we'll see how it goes?

Assignee: nobody → ayg

Status: NEW → ASSIGNED

Attachment #670369 - Flags: review?(bzbarsky)

Attachment #670369 - Flags: feedback?(hsivonen)

Simon Montagu :smontagu

Comment 11

•

13 years ago

If we really have to do this, I *think* that the safest way would be to make our internal canonical names all-lowercase (and audit all callers of GetUnicodeEn/DecoderRaw), but the whole thing seems to me like a classic example of fixing something not broken. Comment 6 is correct that encoding names displayed in the UI come from a localizable properties file and won't be affected. (In reply to :Aryeh Gregor from comment #4) > Per the spec, the canonical names *are* all lowercase. Which spec? Our current canonical names come from http://www.iana.org/assignments/character-sets, which I consider *the* spec for these purposes.

Masatoshi Kimura [:emk]

Comment 12

•

13 years ago

(In reply to Simon Montagu from comment #11) > Which spec? Our current canonical names come from > http://www.iana.org/assignments/character-sets, which I consider *the* spec > for these purposes. Per Encoding Standard. http://encoding.spec.whatwg.org/#concept-encoding-get DOM is going to reduce the dependency on external specs as much as possible (especially when those specs are open-ended). All supported encodings are given by the fixed list and those names are lower cased.

Aryeh Gregor (:ayg) (no longer with Mozilla)

Reporter

Comment 13

•

13 years ago

(In reply to Simon Montagu from comment #11) > If we really have to do this, I *think* that the safest way would be to make > our internal canonical names all-lowercase (and audit all callers of > GetUnicodeEn/DecoderRaw), but the whole thing seems to me like a classic > example of fixing something not broken. I'm fine with either as long as browsers are willing to align. Specifically, this change would make us match IE and Opera. If they want to change to match us, that's fine by me too. > (In reply to :Aryeh Gregor from comment #4) > > Per the spec, the canonical names *are* all lowercase. > > Which spec? Our current canonical names come from > http://www.iana.org/assignments/character-sets, which I consider *the* spec > for these purposes. Well, this is a DOM attribute, so the spec is the DOM spec: "The characterSet attribute must return the name of the encoding." http://dom.spec.whatwg.org/#dom-document-characterset "name" is a link to: http://encoding.spec.whatwg.org/#name The names in that spec are all lowercase. Anne thinks the IANA spec is "pretty useless"; I gather it's not web-compatible at all. His spec is intended to supersede it for all web purposes. He feels that all-lowercase makes more sense than mixed-case, so Gecko/WebKit should change instead of IE/Opera. The IANA spec's casing is certainly inconsistent -- e.g., "windows-1252" instead of "Windows-1252", and a few other odd ones that are lowercase like "us-dk" and "latin-lap". Lowercase is more predictable. (Anne also points out that GBK and GB18030, at least, are uppercase in the IANA standard but lowercase in Gecko.) Basically, I don't think either way is particularly better than the other, but browsers do need to be consistent. So either we and WebKit have to change, or IE and Opera. The IANA spec is apparently not close to sufficient for browser interop on encodings, so Anne's Encodings spec is the one to look at, and he went with IE/Opera on this point. I don't see compelling advantages to either way, so I'd be in favor of going with whatever Anne wants to spec. I don't think we should push back just because it's more of a pain for us, or because we don't the other way is better than ours. *Someone* has to take the pain, and we're not going to get interop if everyone wants to push it off on everyone else.

Henri Sivonen (:hsivonen) (not reading bugmail until 2026-07-27)

Comment 14

•

13 years ago

Comment on attachment 670369 [details] [diff] [review] Patch, just changes GetCharacterSet feedback+ if you change ToLowerCase() to nsContentUtils::ASCIIToLower(). Our current canonical names come from > http://www.iana.org/assignments/character-sets, which I consider *the* spec > for these purposes. I really think we should treat http://encoding.spec.whatwg.org/ as *the* spec. The IANA does not even admit that ISO-8859-1 is an alias for Windows-1252.

Attachment #670369 - Flags: feedback?(hsivonen) → feedback+

Masatoshi Kimura [:emk]

Comment 15

•

13 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #14) > I really think we should treat http://encoding.spec.whatwg.org/ as *the* > spec. The IANA does not even admit that ISO-8859-1 is an alias for > Windows-1252. Our current decoders for ISO-8859-1 and windows-1252 are indeed different, although bug 562096 is present for them. I made EncodingUtils::FindEncodingForLabel as an interim workaround until bug 562096 is fixed.

Boris Zbarsky [:bzbarsky]

Comment 16

•

13 years ago

Comment on attachment 670369 [details] [diff] [review] Patch, just changes GetCharacterSet OK, so we're saved by the fact that necko uses GetUnicodeEncoder without Raw in URI code, right? Because that code certainly gets the output of .characterSet. In general, I skimmed the users of GetUnicodeEncoderRaw and GetUnicodeDecoderRaw and none look like they should end up with values from here, but it's worth double-checking that. GetXMLDeclaration is used when serializing XML documents. Having it lowercase is probably fine. r=me if you do the above double-checking.

Attachment #670369 - Flags: review?(bzbarsky) → review+

Aryeh Gregor (:ayg) (no longer with Mozilla)

Reporter

Comment 17

•

13 years ago

Comment on attachment 670369 [details] [diff] [review] Patch, just changes GetCharacterSet I fixed a bunch of test failures and did a new try push, mochitest-only but on multiple platforms, because I'm nervous about the possibility of platform-specific code here. Maybe I'm being overly cautious: https://tbpl.mozilla.org/?tree=Try&rev=003fa29f6e66 Simon, are you on board with the change at this point? (In reply to Boris Zbarsky (:bz) from comment #16) > In general, I skimmed the users of GetUnicodeEncoderRaw and > GetUnicodeDecoderRaw and none look like they should end up with values from > here, but it's worth double-checking that. > > GetXMLDeclaration is used when serializing XML documents. Having it > lowercase is probably fine. > > r=me if you do the above double-checking. I glanced at all of them, and none of them look like they could plausibly use .characterSet, but I'm not familiar enough with this code to make guarantees.

Attachment #670369 - Flags: feedback?(smontagu)

Simon Montagu :smontagu

Comment 18

•

13 years ago

Comment on attachment 670369 [details] [diff] [review] Patch, just changes GetCharacterSet I suppose I am "on board" in as much as I'm not going to be the sole hold-out, but I'm not happy with the current trend of rewriting specs to retroactively dictate interop on the behaviour of the implementers that are least likely to conform to the spec. Not much of an incentive for anyone to care about the spec, is it?

Attachment #670369 - Flags: feedback?(smontagu) → feedback+

Aryeh Gregor (:ayg) (no longer with Mozilla)

Reporter

Comment 19

•

13 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/ba1ece534b74 (In reply to Simon Montagu from comment #18) > I suppose I am "on board" in as much as I'm not going to be the sole > hold-out, but I'm not happy with the current trend of rewriting specs to > retroactively dictate interop on the behaviour of the implementers that are > least likely to conform to the spec. Not much of an incentive for anyone to > care about the spec, is it? To the contrary -- it makes the spec more likely to match the consensus of implementations, which makes it more useful, and makes it more likely that we'll get interop. If the spec is more realistic, that only increases people's incentives to care about it, because there are practical reasons as well as principled ones. It's annoying when it's not clear-cut whether the spec is right and we follow the spec and then it changes to match the people who didn't follow the spec, but it's better than not getting as much interop. If IE and/or WebKit aren't going to change to get interop, better that we do than no one does.

Flags: in-testsuite+

Masatoshi Kimura [:emk]

Comment 20

•

13 years ago

(In reply to :Aryeh Gregor from comment #19) > https://hg.mozilla.org/integration/mozilla-inbound/rev/ba1ece534b74 You didn't change ToLowerCase() to nsContentUtils::ASCIIToLower(). > 1.12 + ToLowerCase(aCharacterSet);

Aryeh Gregor (:ayg) (no longer with Mozilla)

Reporter

Comment 21

•

13 years ago

Oops. Thanks for pointing that out. Henri, would you like me to push a followup?

Ryan VanderMeulen [:RyanVM]

Comment 22

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/ba1ece534b74

Status: ASSIGNED → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla19

Henri Sivonen (:hsivonen) (not reading bugmail until 2026-07-27)

Comment 23

•

13 years ago

(In reply to :Aryeh Gregor from comment #21) > Oops. Thanks for pointing that out. Henri, would you like me to push a > followup? I guess ToLowerCase is good enough when the input to is guaranteed to be ASCII, which is the case here.

Boris Zbarsky [:bzbarsky]

Updated

•

13 years ago

Depends on: 803355

neil@parkwaycc.co.uk

Comment 24

•

13 years ago

This broke the character set menu, which assumes that document.characterSet will be canonical.

neil@parkwaycc.co.uk

Updated

•

13 years ago

Depends on: 804554

Boris Zbarsky [:bzbarsky]

Comment 25

•

13 years ago

Uh... didn't I ask for that to be tested in comment 5? Aryeh, what's the plan for fixing the regressions?

Boris Zbarsky [:bzbarsky]

Updated

•

13 years ago

Depends on: 805111

Aryeh Gregor (:ayg) (no longer with Mozilla)

Reporter

Comment 26

•

13 years ago

(In reply to Boris Zbarsky (:bz) from comment #25) > Uh... didn't I ask for that to be tested in comment 5? I thought you were only talking about the approach of changing our canonical case, not the approach I took of just changing .characterSet. > Aryeh, what's the plan for fixing the regressions? I don't have the time -- I'm working only a few hours a week at present and have security bugs and crashers being assigned to me. If no one else has time either, this will have to be backed out.

Boris Zbarsky [:bzbarsky]

Comment 27

•

13 years ago

In that case, I think we should back this out for now. This is very low priority, imo, so I have a hard time asking someone to drop anything else and work on the regressions.

Aryeh Gregor (:ayg) (no longer with Mozilla)

Reporter

Comment 28

•

13 years ago

Makes sense to me. Maybe when I next have a block of free time I can take this and tackle the regressions then.

Masatoshi Kimura [:emk]

Comment 29

•

13 years ago

Attached patch Backout patch — Details — Splinter Review

I want this backout because it interferes with bug 801402.

Masatoshi Kimura [:emk]

Comment 30

•

13 years ago

This is a simple backout, so review should not be required.

Keywords: checkin-needed

Ryan VanderMeulen [:RyanVM]

Comment 31

•

13 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/507e7ad0aa82

Status: RESOLVED → REOPENED

Flags: in-testsuite+

Keywords: checkin-needed

Resolution: FIXED → ---

Target Milestone: mozilla19 → ---

Ryan VanderMeulen [:RyanVM]

Comment 32

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/507e7ad0aa82

Aryeh Gregor (:ayg) (no longer with Mozilla)

Reporter

Updated

•

13 years ago

Assignee: ayg → nobody

Jens Stutte [:jstutte]

Comment 33

•

5 years ago

Bulk-downgrade of unassigned, untouched DOM/Storage bug's priority.

If you have reason to believe, this is wrong, please write a comment and ni :jstutte.

Severity: minor → S4

Priority: -- → P5

Henri Sivonen (:hsivonen) (not reading bugmail until 2026-07-27)

Comment 34

•

5 years ago

We decided that we don't want to do this and changed the spec accordingly. Not sure how the bug remained open.

Status: REOPENED → RESOLVED

Closed: 13 years ago → 5 years ago

Resolution: --- → WONTFIX

Patch, just changes GetCharacterSet 13 years ago Aryeh Gregor (:ayg) (no longer with Mozilla) 8.83 KB, patch	bzbarsky : review+ hsivonen : feedback+ smontagu : feedback+	Details \| Diff \| Splinter Review
Backout patch 13 years ago Masatoshi Kimura [:emk] 8.77 KB, patch		Details \| Diff \| Splinter Review