Closed Bug 727346 Opened 12 years ago Closed 8 years ago

Investigate using compatibility caseless matching for radio button groups

Categories

(Core :: DOM: Core & HTML, defect)

RESOLVED INVALID

People

(Reporter: ehoogeveen, Unassigned)

Details

While investigating bug 492931 I found that the HTML5 spec requires "compatibility caseless" matching when grouping input elements into a radio button group [1]. While not unique, this is one of only a few mentions of compatibility caseless matching - the spec usually prefers ASCII case-insensitive matching.

The code currently refers to ToLowerCase(), which does do a form of Unicode case folding through nsCompressedMap::Map() - but it's not clear at first glance whether this matches compatibility caseless per the Unicode spec.

The relevant calls are in nsDocument::GetRadioGroup(), nsDocument::GetRequiredRadioCount() and nsDocument::GetValueMissingState() of content/base/src/nsDocument.cpp, with an additional unused but related call in nsStringCaseInsensitiveHashKey::HashKey() of content/html/content/src/nsHTMLFormElement.h.

For this bug I think two things need to happen:
1) ToLowerCase() and ToUpperCase() should be checked against the compatibility caseless matching algorithm specified in chapter 3 of the Unicode spec [2].
2) The underlying functions (nsCompressedMap::Map() and IS_NOCASE_CHAR()) should be clarified or rewritten to match the spec.

I'd like to take this bug - at worst, if the current code is particularly clever, I should be able to brute force test it against differences from the spec.

[1] http://dev.w3.org/html5/spec/Overview.html#radio-button-group
[2] http://www.unicode.org/versions/latest/
I think this was fixed with the landing of bug 210501 - Jonathan, can you confirm? (for posterity, I also don't know if I filed this in the right component)
No, bug 210501 doesn't affect this. It extended the case-mapping APIs to support the full Unicode repertoire, but *compatibility* matching involves applying Unicode compatibility decompositions to the strings, in addition to case folding. Our case-insensitive string comparison only deals with case folding, *not* with normalization.
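
To make the distinction concrete, here is a minimal Python sketch of compatibility caseless matching as I understand the definition in chapter 3 of the Unicode Standard: NFKD(toCasefold(NFKD(toCasefold(NFD(X))))). This is not Gecko code; the function names are made up for illustration, and Python's `str.casefold()` and `unicodedata.normalize()` stand in for the Unicode primitives.

```python
import unicodedata


def compat_caseless_key(s: str) -> str:
    """Return a key for compatibility caseless comparison (illustrative helper)."""
    # Step 1: canonical decomposition.
    nfd = unicodedata.normalize("NFD", s)
    # Step 2: case fold, then apply compatibility decomposition.
    folded = unicodedata.normalize("NFKD", nfd.casefold())
    # Step 3: fold and decompose again, since decomposition can expose
    # characters that themselves need folding.
    return unicodedata.normalize("NFKD", folded.casefold())


def compat_caseless_match(a: str, b: str) -> bool:
    """True if a and b are equal under the key above (illustrative helper)."""
    return compat_caseless_key(a) == compat_caseless_key(b)


def casefold_only_match(a: str, b: str) -> bool:
    """Case folding with no normalization, for comparison."""
    return a.casefold() == b.casefold()
```

With these helpers, U+0132 (LATIN CAPITAL LIGATURE IJ) and the two-character string "ij" compare equal under `compat_caseless_match` but not under `casefold_only_match`, which is the gap described above.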

TBH, I find it rather surprising that the spec would require this. I wonder how many browsers actually implement it?
FWIW, the spec mentions using compatibility caseless matching in three places:

http://www.w3.org/TR/html5/common-microsyntaxes.html#syntax-references
http://www.w3.org/TR/html5/the-map-element.html#the-map-element
http://www.w3.org/TR/html5/number-state.html#radio-button-state

I can try to write a test for these and check the various browsers. Can you give two code points that are a compatibility caseless match, but don't match with case folding alone?
This doesn't just affect individual codepoints: sometimes a compatibility decomposition will result in a single codepoint expanding to two or more. I think any of the following pairs would work as examples:

<00B5> MICRO SIGN
<039C> GREEK CAPITAL LETTER MU

<0149> LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
<02BC, 006E> MODIFIER LETTER APOSTROPHE, LATIN SMALL LETTER N

<0149> LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
<02BC, 004E> MODIFIER LETTER APOSTROPHE, LATIN CAPITAL LETTER N

<0132> LATIN CAPITAL LIGATURE IJ
<0069, 006A> LATIN SMALL LETTER I, LATIN SMALL LETTER J

<017F> LATIN SMALL LETTER LONG S
<0053> LATIN CAPITAL LETTER S

<02B0> MODIFIER LETTER SMALL H
<0048> LATIN CAPITAL LETTER H

...and many similar instances.
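
For anyone who wants to try these pairs outside a browser, the following Python sketch compares each pair under three strategies: exact comparison, case folding alone, and compatibility caseless matching. The helper names are hypothetical, and the case folding shown is Python's full case folding (`str.casefold()`), which is not necessarily what any browser's lowercasing routine does, so which pairs already match under folding alone may differ from engine to engine.

```python
import unicodedata

# Pairs taken from the list above, written as escaped strings.
PAIRS = [
    ("\u00b5", "\u039c"),   # MICRO SIGN / GREEK CAPITAL LETTER MU
    ("\u0149", "\u02bcn"),  # N PRECEDED BY APOSTROPHE / MODIFIER APOSTROPHE + n
    ("\u0149", "\u02bcN"),  # N PRECEDED BY APOSTROPHE / MODIFIER APOSTROPHE + N
    ("\u0132", "ij"),       # LATIN CAPITAL LIGATURE IJ / i + j
    ("\u017f", "S"),        # LATIN SMALL LETTER LONG S / S
    ("\u02b0", "H"),        # MODIFIER LETTER SMALL H / H
]


def exact(s: str) -> str:
    """Case-sensitive comparison key: the string itself."""
    return s


def folded(s: str) -> str:
    """Case folding only, no normalization."""
    return s.casefold()


def compat(s: str) -> str:
    """Compatibility caseless key: NFKD(fold(NFKD(fold(NFD(s)))))."""
    step = unicodedata.normalize("NFD", s).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKD", step)


def matches(key, a: str, b: str) -> bool:
    """Compare a and b after applying the given key function."""
    return key(a) == key(b)


def codepoints(s: str) -> str:
    """Render a string as space-separated hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


for a, b in PAIRS:
    results = [matches(key, a, b) for key in (exact, folded, compat)]
    print(f"<{codepoints(a)}> vs <{codepoints(b)}>: exact/folded/compat = {results}")
```

Every pair should match under `compat` and none under `exact`; the `folded` column shows which pairs depend on the decomposition step in this particular environment.
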
(In reply to Jonathan Kew (:jfkthame) from comment #2)

> TBH, I find it rather surprising that the spec would require this. I wonder
> how many browsers actually implement it?

The answer is that, for the most part, they don't! I put together a simple testcase:

http://people.mozilla.org/~jdaggett/tests/radiobuttonnamecase.html

WebKit does case-sensitive matching, Firefox and Opera do some ad hoc form of Unicode case-insensitive matching, and IE8/9 does *some* normalization and decomposition, but not in a way that matches any clearly defined Unicode algorithm.

I think the spec should probably be changed to use either case-sensitive matching or Unicode case-insensitive matching with no normalization. Given that this is defined as "legacy" behavior, I wonder why we can't go with case-sensitive matching. Things like radio groups are very localized, so it's hard to imagine a reason to use different casing in actual authoring situations.
Reopened the HTML5 bug on this:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=16970

(although that may not be the right one, there's another cloned one)
The spec has been changed to use case-sensitive matching. See bug 1312456, which tracks aligning our behavior with that.
Status: UNCONFIRMED → RESOLVED
Closed: 8 years ago
Resolution: --- → INVALID