Make document.characterSet always lowercase

REOPENED
Unassigned

Status

()

Core
HTML: Parser
--
minor
REOPENED
6 years ago
5 years ago

People

(Reporter: ayg, Unassigned)

Tracking

(Depends on: 3 bugs)

Trunk
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(URL)

Attachments

(2 attachments)

This is what the specs require right now, and it seems to match IE/Opera.  Gecko/WebKit use mixed-case, like "UTF-8" and "Shift_JIS" instead of "utf-8" and "shift_jis".  Since IE does it, it's probably web-compatible, and Anne says he'd know if Opera ran into any issues.  hsivonen said to ping smontagu and maybe bz, so consider yourselves pinged.  I think this is worth trying, because there's no way we'll get interop here otherwise, but obviously we need to keep a close eye out for compat in case IE goes down different codepaths.
This causes us to fail tests in <http://w3c-test.org/webapps/DOMCore/tests/approved/Node-properties.html>, which should be imported into our repo relatively soon.  They probably aren't comprehensive enough to mark this in-testsuite+, though.
I don't think I'm opposed to this particularly, as long as you check the uses of this property in our chrome and extensions.
And in our C++.

And in particular, if you make this change, watch out for consumers that treat the value returned by .characterSet as a canonical charset name.
Per the spec, the canonical names *are* all lowercase.  I haven't looked at the code at all, but it seems simplest to just change our internal canonical names to lowercase, no?  And grep for references to the old ones in the codebase, and lowercase those too.
> but it seems simplest to just change our internal canonical names to lowercase, no? 

Maybe.  You'd have to check with Simon on that.  For one thing, it's likely to affect how things display in the UI, which may not be desirable.

In general, from the point of view of not causing regressions, that sounds like the most dangerous approach...
(In reply to Boris Zbarsky (:bz) from comment #5)
>  For one thing, it's likely
> to affect how things display in the UI, which may not be desirable.

How? AFAICT, the labels we show in the UI are not the Gecko-canonical names. (UI says “Windows-1252” but Gecko-canonical is “windows-1252”.)
Ah, if the UI is not showing the Gecko-canonical ones, that makes things much simpler.

Again, I'm not opposed to this change.  I just want it to be done in such a way that nothing breaks in the process.
The safest approach would be using EncodingUtils::FindEncodingForLabel from HTML parser. It will always return a lowercased encoding name.
And only C++ callers would be affected because GetUnicode(En|De)coderRaw is not scriptable.
Created attachment 670369 [details] [diff] [review]
Patch, just changes GetCharacterSet

So this is the easy way out -- it only changes .characterSet.  This will also change .inputEncoding, which we apparently want to get rid of.  And apparently nsIDocument::GetXMLDeclaration.  Those are the only C++ callers, according to DXR.  Should I change GetXMLDeclaration to use GetDocumentCharacterSet or something instead?  I don't know what it's used for.

There are probably more tests than this that will break, since I looked pretty superficially.  Try run (Linux64 debug only): https://tbpl.mozilla.org/?tree=Try&rev=cdd1d87a0d57  Anything else you want me to do here, or is this okay and we'll see how it goes?
Assignee: nobody → ayg
Status: NEW → ASSIGNED
Attachment #670369 - Flags: review?(bzbarsky)
Attachment #670369 - Flags: feedback?(hsivonen)
If we really have to do this, I *think* that the safest way would be to make our internal canonical names all-lowercase (and audit all callers of GetUnicodeEn/DecoderRaw), but the whole thing seems to me like a classic example of fixing something not broken.

Comment 6 is correct that encoding names displayed in the UI come from a localizable properties file and won't be affected.

(In reply to :Aryeh Gregor from comment #4)
> Per the spec, the canonical names *are* all lowercase.

Which spec? Our current canonical names come from http://www.iana.org/assignments/character-sets, which I consider *the* spec for these purposes.
(In reply to Simon Montagu from comment #11)
> Which spec? Our current canonical names come from
> http://www.iana.org/assignments/character-sets, which I consider *the* spec
> for these purposes.
Per Encoding Standard.
http://encoding.spec.whatwg.org/#concept-encoding-get
DOM is going to reduce the dependency on external specs as much as possible (especially when those specs are open-ended). All supported encodings are given by the fixed list and those names are lower cased.
(In reply to Simon Montagu from comment #11)
> If we really have to do this, I *think* that the safest way would be to make
> our internal canonical names all-lowercase (and audit all callers of
> GetUnicodeEn/DecoderRaw), but the whole thing seems to me like a classic
> example of fixing something not broken.

I'm fine with either as long as browsers are willing to align.  Specifically, this change would make us match IE and Opera.  If they want to change to match us, that's fine by me too.

> (In reply to :Aryeh Gregor from comment #4)
> > Per the spec, the canonical names *are* all lowercase.
> 
> Which spec? Our current canonical names come from
> http://www.iana.org/assignments/character-sets, which I consider *the* spec
> for these purposes.

Well, this is a DOM attribute, so the spec is the DOM spec:

"The characterSet attribute must return the name of the encoding."
http://dom.spec.whatwg.org/#dom-document-characterset

"name" is a link to:

http://encoding.spec.whatwg.org/#name

The names in that spec are all lowercase.  Anne thinks the IANA spec is "pretty useless"; I gather it's not web-compatible at all.  His spec is intended to supersede it for all web purposes.  He feels that all-lowercase makes more sense than mixed-case, so Gecko/WebKit should change instead of IE/Opera.  The IANA spec's casing is certainly inconsistent -- e.g., "windows-1252" instead of "Windows-1252", and a few other odd ones that are lowercase like "us-dk" and "latin-lap".  Lowercase is more predictable.

(Anne also points out that GBK and GB18030, at least, are uppercase in the IANA standard but lowercase in Gecko.)


Basically, I don't think either way is particularly better than the other, but browsers do need to be consistent.  So either we and WebKit have to change, or IE and Opera.  The IANA spec is apparently not close to sufficient for browser interop on encodings, so Anne's Encodings spec is the one to look at, and he went with IE/Opera on this point.  I don't see compelling advantages to either way, so I'd be in favor of going with whatever Anne wants to spec.

I don't think we should push back just because it's more of a pain for us, or because we don't the other way is better than ours.  *Someone* has to take the pain, and we're not going to get interop if everyone wants to push it off on everyone else.
Comment on attachment 670369 [details] [diff] [review]
Patch, just changes GetCharacterSet

feedback+ if you change ToLowerCase() to nsContentUtils::ASCIIToLower().

Our current canonical names come from
> http://www.iana.org/assignments/character-sets, which I consider *the* spec
> for these purposes.

I really think we should treat http://encoding.spec.whatwg.org/ as *the* spec. The IANA does not even admit that ISO-8859-1 is an alias for Windows-1252.
Attachment #670369 - Flags: feedback?(hsivonen) → feedback+
(In reply to Henri Sivonen (:hsivonen) from comment #14)
> I really think we should treat http://encoding.spec.whatwg.org/ as *the*
> spec. The IANA does not even admit that ISO-8859-1 is an alias for
> Windows-1252.
Our current decoders for ISO-8859-1 and windows-1252 are indeed different, although bug 562096 is present for them. I made EncodingUtils::FindEncodingForLabel as an interim workaround until bug 562096 is fixed.
Comment on attachment 670369 [details] [diff] [review]
Patch, just changes GetCharacterSet

OK, so we're saved by the fact that necko uses GetUnicodeEncoder without Raw in URI code, right?  Because that code certainly gets the output of .characterSet.

In general, I skimmed the users of GetUnicodeEncoderRaw and GetUnicodeDecoderRaw and none look like they should end up with values from here, but it's worth double-checking that.

GetXMLDeclaration is used when serializing XML documents.  Having it lowercase is probably fine.

r=me if you do the above double-checking.
Attachment #670369 - Flags: review?(bzbarsky) → review+
Comment on attachment 670369 [details] [diff] [review]
Patch, just changes GetCharacterSet

I fixed a bunch of test failures and did a new try push, mochitest-only but on multiple platforms, because I'm nervous about the possibility of platform-specific code here.  Maybe I'm being overly cautious:

https://tbpl.mozilla.org/?tree=Try&rev=003fa29f6e66

Simon, are you on board with the change at this point?

(In reply to Boris Zbarsky (:bz) from comment #16)
> In general, I skimmed the users of GetUnicodeEncoderRaw and
> GetUnicodeDecoderRaw and none look like they should end up with values from
> here, but it's worth double-checking that.
> 
> GetXMLDeclaration is used when serializing XML documents.  Having it
> lowercase is probably fine.
> 
> r=me if you do the above double-checking.

I glanced at all of them, and none of them look like they could plausibly use .characterSet, but I'm not familiar enough with this code to make guarantees.
Attachment #670369 - Flags: feedback?(smontagu)
Comment on attachment 670369 [details] [diff] [review]
Patch, just changes GetCharacterSet

I suppose I am "on board" in as much as I'm not going to be the sole hold-out, but I'm not happy with the current trend of rewriting specs to retroactively dictate interop on the behaviour of the implementers that are least likely to conform to the spec. Not much of an incentive for anyone to care about the spec, is it?
Attachment #670369 - Flags: feedback?(smontagu) → feedback+
https://hg.mozilla.org/integration/mozilla-inbound/rev/ba1ece534b74

(In reply to Simon Montagu from comment #18)
> I suppose I am "on board" in as much as I'm not going to be the sole
> hold-out, but I'm not happy with the current trend of rewriting specs to
> retroactively dictate interop on the behaviour of the implementers that are
> least likely to conform to the spec. Not much of an incentive for anyone to
> care about the spec, is it?

To the contrary -- it makes the spec more likely to match the consensus of implementations, which makes it more useful, and makes it more likely that we'll get interop.  If the spec is more realistic, that only increases people's incentives to care about it, because there are practical reasons as well as principled ones.

It's annoying when it's not clear-cut whether the spec is right and we follow the spec and then it changes to match the people who didn't follow the spec, but it's better than not getting as much interop.  If IE and/or WebKit aren't going to change to get interop, better that we do than no one does.
Flags: in-testsuite+
(In reply to :Aryeh Gregor from comment #19)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/ba1ece534b74
You didn't change ToLowerCase() to nsContentUtils::ASCIIToLower().
>    1.12 +  ToLowerCase(aCharacterSet);
Oops.  Thanks for pointing that out.  Henri, would you like me to push a followup?
https://hg.mozilla.org/mozilla-central/rev/ba1ece534b74
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
(In reply to :Aryeh Gregor from comment #21)
> Oops.  Thanks for pointing that out.  Henri, would you like me to push a
> followup?

I guess ToLowerCase is good enough when the input to is guaranteed to be ASCII, which is the case here.
This broke the character set menu, which assumes that document.characterSet will be canonical.

Updated

6 years ago
Depends on: 804554
Uh... didn't I ask for that to be tested in comment 5?

Aryeh, what's the plan for fixing the regressions?
(In reply to Boris Zbarsky (:bz) from comment #25)
> Uh... didn't I ask for that to be tested in comment 5?

I thought you were only talking about the approach of changing our canonical case, not the approach I took of just changing .characterSet.

> Aryeh, what's the plan for fixing the regressions?

I don't have the time -- I'm working only a few hours a week at present and have security bugs and crashers being assigned to me.  If no one else has time either, this will have to be backed out.
In that case, I think we should back this out for now.  This is very low priority, imo, so I have a hard time asking someone to drop anything else and work on the regressions.
Makes sense to me.  Maybe when I next have a block of free time I can take this and tackle the regressions then.
Created attachment 675931 [details] [diff] [review]
Backout patch

I want this backout because it interferes with bug 801402.
This is a simple backout, so review should not be required.
Keywords: checkin-needed
https://hg.mozilla.org/integration/mozilla-inbound/rev/507e7ad0aa82
Status: RESOLVED → REOPENED
Flags: in-testsuite+
Keywords: checkin-needed
Resolution: FIXED → ---
Target Milestone: mozilla19 → ---
Assignee: ayg → nobody
You need to log in before you can comment on or make changes to this bug.