Last Comment Bug 820909 - nsCharTraits<PRUnichar>::ASCIIToLower special-cases some non-ASCII stuff
: nsCharTraits<PRUnichar>::ASCIIToLower special-cases some non-ASCII stuff
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: String (show other bugs)
: unspecified
: All All
: -- normal (vote)
: mozilla20
Assigned To: Boris Zbarsky [:bz] (still a bit busy)
:
: Nathan Froyd [:froydnj]
Mentors:
Depends on: 825191
Blocks:
  Show dependency treegraph
 
Reported: 2012-12-12 09:28 PST by Boris Zbarsky [:bz] (still a bit busy)
Modified: 2012-12-28 03:44 PST (History)
6 users (show)
bzbarsky: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Make LowerCaseEqualsASCII and LowerCaseEqualsLiteral actually do ASCII-case-insensitive matching instead of doing odd things with KELVIN SIGN and LATIN CAPITAL LETTER I WITH DOT ABOVE. s (3.40 KB, patch)
2012-12-15 21:04 PST, Boris Zbarsky [:bz] (still a bit busy)
no flags Details | Diff | Splinter Review
Now with some tests (7.26 KB, patch)
2012-12-17 16:42 PST, Boris Zbarsky [:bz] (still a bit busy)
benjamin: review+
dbaron: superreview+
Details | Diff | Splinter Review

Description Boris Zbarsky [:bz] (still a bit busy) 2012-12-12 09:28:16 PST
Specifically:

241         /**
242          * Convert c to its lower-case form, but only if the lower-case form is
243          * ASCII. Otherwise leave it alone.
244          *
245          * There are only two non-ASCII Unicode characters whose lowercase
246          * equivalents are ASCII: KELVIN SIGN and LATIN CAPITAL LETTER I WITH
247          * DOT ABOVE. So it's a simple matter to handle those explicitly.
248          */

What this means is that calling LowerCaseEqualsASCII or LowerCaseEqualsLiteral on a string does not actually do ASCII-case-insensitive compares.  This is somewhat unfortunate.  Should we fix the string code here, or remove all use of the above functions in DOM code?  :(

In particular, this came up because nsAttrValue::ParseEnumValue uses LowerCaseEqualsASCII so some non-ASCII values of @type work on <input> yet again...
Comment 1 Boris Zbarsky [:bz] (still a bit busy) 2012-12-12 09:30:23 PST
This also makes nsContentUtils::IsJavascriptMIMEType not do what's expected, and various other cases....
Comment 3 Jonas Sicking (:sicking) No longer reading bugmail consistently 2012-12-12 14:23:13 PST
This seems like surprising behavior, which is never good.

Do we have any callers that actually want this behavior? I would be surprised if that's the case.
Comment 4 Boris Zbarsky [:bz] (still a bit busy) 2012-12-14 00:54:32 PST
So the only consumers of nsCharTraits<T>::ASCIIToLower are nsCharTraits<T>::compareLowerCaseToASCII() and nsCharTraits<T>::compareLowerCaseToASCIINullTerminated().

These are both only called from nsTSubstring_CharT::LowerCaseEqualsASCII (the two different overrides of it).

LowerCaseEqualsASCII is called from LowerCaseEqualsLiteral and various other code.

I guess I'll go through and try to make a list of what all the consumers are here and what they might really want.
Comment 5 Jonas Sicking (:sicking) No longer reading bugmail consistently 2012-12-14 14:12:55 PST
Or we just change it and hope that we hear if it breaks anyone :)

The turkish I seems like the biggest risk. If we take this approach then maybe we can reach out to turkish community and ask them to hammer on an aurora build extra hard.
Comment 6 Boris Zbarsky [:bz] (still a bit busy) 2012-12-15 20:25:51 PST
OK, so callers of LowerCaseEqualsLiteral that pass a literal string containing 'k' in it are passing these strings: "-moz-keyframes", ".lnk", "_blank", "allow-pointer-lock", "backward", "desktop.ini", "keyboardevent", "keyevents", "keyframes", "popupblockedevents", "set-cookie", "track".

I don't think any of those care about matching Kelvin sign.

The same list for 'i' is: "-moz-fixed", "-moz-image-rect", "-moz-linear-gradient", "-moz-radial-gradient", "-moz-repeating-linear-gradient", "-moz-repeating-radial-gradient", "_content", "afterbegin", "allow-pointer-lock", "allow-same-origin", "allow-scripts", "allow-top-navigation", "animationevent", "application/x-www-form-urlencoded", "authorization", "beforebegin", "compositionevent", "content-disposition", "content-primary", "cubic-bezier", "cursive", "datacontainerevent", "datacontainerevents", "description", "desktop.ini", "devicemotionevent", "deviceorientationevent", "dialogheight", "dialogleft", "dialogtop", "dialogwidth", "domain", "dpi", "file", "filename", "gif", "hidden", "horizontal", "ibm862", "image/svg+xml", "import", "important", "inherit", "initial", "inline", "iso-8859-8", "javascript", "javascript1.0", "javascript1.1", "javascript1.2", "javascript1.3", "javascript1.4", "javascript1.5", "javascript1.6", "javascript1.7", "javascript1.8", "li", "line", "linear-gradient", "lineboundary", "livescript", "magic", "media", "mozaudioavailableevent", "multipart/form-data", "mutationevent", "mutationevents", "notifypaintevent", "pagetransition", "pluginspage", "portrait", "radial-gradient", "repeating-linear-gradient", "repeating-radial-gradient", "resizable", "right", "sameorigin", "sans-serif", "serif", "set-cookie", "sha1-digest", "sha1-digest-manifest", "simplegestureevent", "size", "text/plain", "timeevent", "timeevents", "title", "title*", "transitionevent", "uievent", "uievents", "url-prefix", "view-source", "window", "x-imap4-modified-utf7", "moz-nullprincipal", "expires", "application/octet-stream".

Again, I don't think any of them care about matching Turkish dotted I.

None of that is surprising.  Next to check the LowerCaseEqualsASCII callsites.
Comment 7 Boris Zbarsky [:bz] (still a bit busy) 2012-12-15 20:48:57 PST
For LowerCaseEqualsASCII, I'm ignoring all the ipc/chromium stuff because that uses its own version.

I'm also ignoring all the places that use it on literal strings not containing 'k' or 'i', and also on the following literal strings: ".ico", ".lnk", "set-cookie", "set-cookie2", "application/guess-from-ext" since I think it's clear those don't want to match Kelvin sign or Turkish dotted I.

That leaves the following callers:

1)  nsContentUtils::IsJavascriptMIMEType.  This definitely wants ASCII-CI matching.
2)  nsAttrValue::ParseEnumValue.  That's what led me to file this bug.
3)  nsXMLHttpRequest::GetResponseHeader when comparing to kCrossOriginSafeHeaders
    entries.  DEFINITELY wants ASCII-CI.  What we have now looks like a security bug.
4)  nsXMLHttpRequest::SetRequestHeader when comparing to kInvalidHeaders entries.  Wants
    ASCII-CI.
5)  nsXMLHttpRequest::SetRequestHeader when comparing to kCrossOriginSafeHeaders
    entries.  DEFINITELY wants ASCII-CI.  What we have now looks like a security bug.
6)  CSSParserImpl::ParseAttributeSelector when comparing to caseInsensitiveHTMLAttribute
    entries.  This wants ASCII-CI.
7)  CSSParserImpl::TranslateDimension comparing a string to length unit names.  I think
    this wants ASCII-CI.  It only matters for "in", "vmin", and "khz", and I doubt
    writing the first two with dotted I or the last with Kelvin sign is desirable.
8)  nsCSSProps::LookupProperty comparing the property name to our list of aliased
    properties.  I think we probably want ASCII-CI here.
9)  nsProtocolProxyService::NewProxyInfo comparing the proxy type to "socks", "socks4",
    "direct".  I think ASCII-CI is fine here.
10) nsHtml5Portability::lowerCaseLiteralEqualsIgnoreAsciiCaseString, which wants
    ASCII-CI per spec and is buggy as it stands.
11) nsExternalHelperAppService::ApplyDecodingForExtension comparing extension to "zip"
    and type to "application/zip" or "application/x-gip" or "application/x-compress".
    These are fine with ASCII-CI.
12) nsExternalHelperAppService::GetTypeFromExtension comparing extension to "gif" and
    "xpi".  I think ASCII-CI is fine here.
13) nsStaticCaseInsensitiveNameTable::Lookup.  The only consumers are CSS keyword
    parsing, CSS property name parsing, and named colors.  I think ASCII-CI is probably
    fine for all of these.

David, do you have any objections to items 6, 7, 8, 13 on that list becoming ASCII-CI instead of the hackery we have now?
Comment 8 Boris Zbarsky [:bz] (still a bit busy) 2012-12-15 21:04:41 PST
Created attachment 692690 [details] [diff] [review]
Make LowerCaseEqualsASCII and LowerCaseEqualsLiteral actually do ASCII-case-insensitive matching instead of doing odd things with KELVIN SIGN and LATIN CAPITAL LETTER I WITH DOT ABOVE.   s

I'm slightly tempted to also factor out all the common code between the two nsCharTraits specializations....  The copy/paste is bugging me.  Let me know if you think I shouldn't?
Comment 9 Boris Zbarsky [:bz] (still a bit busy) 2012-12-15 21:05:28 PST
Comment on attachment 692690 [details] [diff] [review]
Make LowerCaseEqualsASCII and LowerCaseEqualsLiteral actually do ASCII-case-insensitive matching instead of doing odd things with KELVIN SIGN and LATIN CAPITAL LETTER I WITH DOT ABOVE.   s

I still need to write some tests...
Comment 10 Boris Zbarsky [:bz] (still a bit busy) 2012-12-17 16:31:09 PST
Ah, looks like we're already ok on comment 7 item 13, because while the matchEntry callback uses this weird matcher the hashEntry just takes out 0x20 from each char, so you get the wrong hashcode if you're not actually ASCII.
Comment 11 Boris Zbarsky [:bz] (still a bit busy) 2012-12-17 16:42:54 PST
Created attachment 693181 [details] [diff] [review]
Now with some tests
Comment 12 David Baron :dbaron: ⌚️UTC-10 2012-12-26 12:48:33 PST
(In reply to Boris Zbarsky (:bz) from comment #7)
> 6)  CSSParserImpl::ParseAttributeSelector when comparing to caseInsensitiveHTMLAttribute
>     entries.  This wants ASCII-CI.
> 7)  CSSParserImpl::TranslateDimension comparing a string to length unit names.  I think
>     this wants ASCII-CI.  It only matters for "in", "vmin", and "khz", and I doubt
>     writing the first two with dotted I or the last with Kelvin sign is desirable.
> 8)  nsCSSProps::LookupProperty comparing the property name to our list of aliased
>     properties.  I think we probably want ASCII-CI here.
> 13) nsStaticCaseInsensitiveNameTable::Lookup.  The only consumers are CSS keyword
>     parsing, CSS property name parsing, and named colors.  I think ASCII-CI is probably
>     fine for all of these.
> 
> David, do you have any objections to items 6, 7, 8, 13 on that list becoming
> ASCII-CI instead of the hackery we have now?

Nope, this change will be correct; http://www.w3.org/TR/CSS21/syndata.html#characters describes CSS syntax as case-insensitive within the ASCII range.
Comment 13 David Baron :dbaron: ⌚️UTC-10 2012-12-26 12:50:21 PST
Comment on attachment 693181 [details] [diff] [review]
Now with some tests

sr=dbaron
Comment 14 Boris Zbarsky [:bz] (still a bit busy) 2012-12-26 13:46:36 PST
Perfect, thanks.

https://hg.mozilla.org/integration/mozilla-inbound/rev/e364512a658f
Comment 15 Daniel Holbert [:dholbert] 2012-12-27 14:02:13 PST
https://hg.mozilla.org/mozilla-central/rev/e364512a658f

Note You need to log in before you can comment on or make changes to this bug.