Closed Bug 717529 Opened 12 years ago Closed 12 years ago

Support CSS escape sequences like `\d834\df06 ` (broken up in UTF-16 code units)

Categories

(Core :: CSS Parsing and Computation, defect)

x86
macOS
defect
Not set
normal


RESOLVED INVALID

People

(Reporter: mathias, Unassigned)


Details

Gecko is the only engine that doesn’t support CSS escape sequences of the form `\d834\df06 ` (broken up in UTF-16 code units). Even though I cannot find any mention of these in the spec (http://www.w3.org/TR/CSS21/syndata.html#characters), it would be better for interoperability if Gecko added support for these.

The example I’m using, `\d834\df06 `, should be identical to `\1d306 ` or `\01d306`, both of which are escape sequences for the “tetragram for centre” symbol (U+1D306).
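To make the equivalence concrete, here is a minimal sketch of the standard UTF-16 surrogate-pair arithmetic that maps the two code units D834 and DF06 to the single code point U+1D306. The helper name `CombineSurrogates` is hypothetical and is not taken from any browser's source.

```cpp
// Hypothetical helper (not from Gecko or any other engine): combines a
// UTF-16 high/low surrogate pair into the code point it encodes.
// Assumes high is in [0xD800, 0xDBFF] and low is in [0xDC00, 0xDFFF].
constexpr char32_t CombineSurrogates(char16_t high, char16_t low) {
  return 0x10000 +
         ((static_cast<char32_t>(high) - 0xD800) << 10) +
         (static_cast<char32_t>(low) - 0xDC00);
}

// (0xD834 - 0xD800) << 10 = 0xD000; 0xDF06 - 0xDC00 = 0x306;
// 0x10000 + 0xD000 + 0x306 = 0x1D306.
static_assert(CombineSurrogates(0xD834, 0xDF06) == 0x1D306,
              "D834 DF06 encodes U+1D306");
```

An engine that accepts the broken-up form would presumably apply this kind of combination after parsing two adjacent escapes; the spec-conforming form `\1d306 ` names the code point directly and needs no such step.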

Here’s a simple test case: http://jsfiddle.net/mathias/jY7ra/

Opera and IE8+ support both types of escape sequences.

Note that WebKit doesn’t support the standard CSS escape sequences for symbols outside the BMP; see https://bugs.webkit.org/show_bug.cgi?id=76152.
Given that a backslash-hexadecimal CSS escape sequence is defined as providing the code number of _an ISO 10646 character_, not the code number of a UTF-16 code unit, I don't think this should be supported. The correct way to represent a non-BMP character is to use a 5- or 6-hexdigit sequence.

Perhaps this could be clarified in http://www.w3.org/TR/css3-syntax; IMO, the surrogate range \d800 to \dfff should, if anything, be explicitly made invalid. cc'ing dbaron for any thoughts.
(In reply to Jonathan Kew (:jfkthame) from comment #1)
> Given that a backslash-hexadecimal CSS escape sequence is defined as
> providing the code number of _an ISO 10646 character_, not the code number
> of a UTF-16 code unit, I don't think this should be supported. The correct
> way to represent a non-BMP character is to use a 5- or 6-hexdigit sequence.

Don’t get me wrong, I definitely agree that is the only correct way according to the spec; I filed this “bug” purely because of interoperability concerns. Since Opera, WebKit and IE8+ support this non-standard syntax, it would be nice if Firefox could support it as well.

Either way, the spec could use some tweaks, be it by just clarifying what you suggested, or by changing it to reflect reality (i.e. defining the “broken-up UTF-16 code units” syntax that almost all browsers have implemented).
If we were to support this, we'd need a way to prevent unpaired surrogates from ending up in internal data structures -- that could be a security risk.

The current code does this with ENSURE_VALID_CHAR() inside nsCSSScanner::ParseAndAppendEscape (in layout/style/nsCSSScanner.cpp). ENSURE_VALID_CHAR is defined in xpcom/string/public/nsCharTraits.h and converts surrogates (U+D800 to U+DFFF) and code points greater than U+10FFFF to U+FFFD.
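The behaviour described above can be sketched roughly as follows. This illustrates the rule as stated in this comment; it is not the actual ENSURE_VALID_CHAR macro, and the names `IsSurrogate` and `SanitizeEscapedCodePoint` are hypothetical.

```cpp
// Illustrative sketch only: the names below are assumptions and do not
// reproduce the real macro in nsCharTraits.h.
constexpr char32_t kReplacementChar = 0xFFFD;

// True for any UTF-16 surrogate code point, high or low.
constexpr bool IsSurrogate(char32_t c) {
  return c >= 0xD800 && c <= 0xDFFF;
}

// Maps surrogates and values above U+10FFFF to U+FFFD, so that a parsed
// escape can never place an unpaired surrogate into a string.
constexpr char32_t SanitizeEscapedCodePoint(char32_t c) {
  return (IsSurrogate(c) || c > 0x10FFFF) ? kReplacementChar : c;
}

static_assert(SanitizeEscapedCodePoint(0xD834) == 0xFFFD,
              "a lone surrogate is replaced");
static_assert(SanitizeEscapedCodePoint(0x1D306) == 0x1D306,
              "a valid non-BMP code point passes through");
```

Under this rule, each half of `\d834\df06 ` is sanitized on its own, so the sequence would yield two U+FFFD characters rather than U+1D306.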
This bug is invalid as per http://lists.w3.org/Archives/Public/www-style/2012Feb/0006.html.
Status: UNCONFIRMED → RESOLVED
Closed: 12 years ago
Resolution: --- → INVALID