Closed Bug 492931 Opened 15 years ago Closed 12 years ago

Case change algorithm in DOM should use ASCII upper/lowercasing

Categories

(Core :: DOM: Core & HTML, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla13

People

(Reporter: hsivonen, Assigned: ehoogeveen)

References

(Blocks 1 open bug)

Details

(Keywords: html5, Whiteboard: [mentor=khuey][lang=c++])

Attachments

(6 files, 8 obsolete files)

1.10 KB, text/html
Details
8.35 KB, text/plain
Details
4.78 KB, patch
ehoogeveen
: review+
sicking
: review+
Details | Diff | Splinter Review
19.10 KB, patch
ehoogeveen
: review+
Details | Diff | Splinter Review
941 bytes, patch
Details | Diff | Splinter Review
23.64 KB, patch
Details | Diff | Splinter Review
ToLowerCase and ToUpperCase used in the DOM internals use upper/lowercasing algorithms that touch non-ASCII characters (but aren't apparently performing full Unicode upper/lowercasing--hmm).

Per HTML5, where Gecko DOM uses ToLowerCase and ToUpperCase, ASCII upper/lowercasing should be used instead (to match what the HTML tokenizer does--both the HTML5 tokenizer and the old tokenizer).
Attached file Testcase
This hits a couple of the web-visible issues this causes.
Whiteboard: [mentor=khuey][lang=c++]
In general, uses of ToLowerCase or similar functions in content/ should be replaced by nsContentUtils::ASCIIToLower.  It's hard to construct an MXR query for this, but if anyone is interested in picking this up I can point them at the right places.
I'd like to take this. Any legitimate non-ASCII uses I should watch out for?
In content/ and dom/?  Not that I can think of off hand.
Probably not, but it's probably best if you just check every place you change against the relevant spec...
Alright, I'm all set up, and I have a question: In content/base/src/nsWebSocket.cpp there are two instances where ToLowerCase() is called on obviously ASCII strings. However these are nsCStrings, and there's no overload of nsContentUtils::ASCIIToLower() for these types. Adding these overloads is trivial, and I've checked that Firefox compiles with the change. But is it overkill to add something like this? Should I just ignore instances where nsCStrings are used (since Firefox generally uses UTF-16, I believe)? I can't judge right now whether there are a lot of cases of this or just the one.
ToLowerCase on an nsCString already does ASCII lowercasing, no?  So no need to do anything with it.
Hmm.. Yes, it seems to use the implementation of ToLowerCase implemented in xpcom/string/src/nsReadableUtils.cpp - at least as far as I can tell, and that one does use ASCII lowercasing. I got confused because there are several definitions of ToLowerCase and I had been looking at the wrong one. This may be a stupid question, but is there a better way to tell which implementation will be used than simply breaking the code and seeing what the compiler complains about? All of these files have a lot of includes.
You can usually tell based on the argument types.
This is a WIP patch that only covers only the files in content/base/src; I'm attaching it with a feedback request to see if I'm on the right track, since this is my first patch. I'll also attach a text file with my reasoning for each of the changes, since I had to dig into the specs to figure them out.
Attachment #584383 - Flags: feedback?(bzbarsky)
This text file contains my reasoning for the changes in the WIP patch. It wasn't always immediately obvious which specs were involved, but I've tried to be thorough.

Notably, I decided not to change some lines in content/base/src/nsDocument.cpp because the spec appeared to say that a unicode "compatibility caseless match" comparison was required, which definitely doesn't match ASCII case conversion (but I don't know whether the ToUpperCase and ToLowerCase functions match it either).
Oh, and I remembered too late that the mentor keyword is set to khuey - I don't mean to pass you over! Feedback from either of you would be much appreciated :)
Attachment #584383 - Flags: feedback?(khuey)
Comment on attachment 584383 [details] [diff] [review]
WIP - Fix uses under content/base/src

The sanitizing serializer code is wrong, imo.  It should be using ASCIIToLower.  Otherwise it might treat attr names with dotted capital I as allowed, no?  Please fix this.

For the radio code, compatibility caseless matching doesn't mean the same thing as ToLowerCase.  That code might need fixing.  Followup bug?

The rest looks great.  Thank you for the very helpful analysis attachment!
Attachment #584383 - Flags: feedback?(bzbarsky) → feedback+
Thanks for the feedback Boris, I'll strive to have a patch up for review this weekend or shortly after.
This should do it, at least everything under content/.

Kyle, did you want me to look at dom/ as well (comment #5)? How do I try out your test from comment #2?

Boris, I didn't end up changing the sanitizing serializer code from the WIP patch, because NS_LossyConvertUTF16toASCII() already drops all the non-ASCII characters and spits out an nsCString, and ToLowerCase() on an nsCString implicitly performs ASCII lowercasing (this is a bit confusing really, since its nsString counterpart does some form of unicode case folding).
Attachment #584383 - Attachment is obsolete: true
Attachment #597218 - Flags: review?(khuey)
Attachment #584383 - Flags: feedback?(khuey)
(In reply to Emanuel Hoogeveen from comment #16)
> Created attachment 597218 [details] [diff] [review]
> Change to ASCII lowercasing in content/ where appropriate
> 
> This should do it, at least everything under content/.
> 
> Kyle, did you want me to look at dom/ as well (comment #5)? How do I try out
> your test from comment #2?

Add it as a mochitest.
Here's my reasoning for all the changes, updated for the finished patch. Note that not all the line numbers may be correct anymore as they were based on an earlier revision.

Most of the decisions ended up being fairly clear. I decided not to change anything where only a single character was being changed (since ASCIIToLower and ASCIIToUpper have no overload for single characters), where nsCStrings were being changed (since ToLowerCase and ToUpperCase are implicitly ASCII for nsCStrings) - and somewhat bafflingly, in the case of radio buttons, because the spec says so.
Attachment #584385 - Attachment is obsolete: true
Ah, I missed that there's the lossy convert to ASCII there...  That's even more broken, actually.  It'll give false positives for "allowed", which is likely a security issue.  Followup bug, please?
Kyle: I ran the test - and it failed. But is it testing what it should? Firstly, it says 'dIv' is a Unicode tag name and those should match nothing, but surely dIv is pure ASCII. Secondly it says 'id' and 'Id' should be distinct, but how would that change by switching to ASCII case conversion? The test seems to be saying there should be *no* case conversion of any kind going on in these situations.

Boris: I'll file bugs about the radio code and the sanitizing serializer code.
> but surely dIv is pure ASCII.

Presumably that part of the text should have used a dotted capital I?

Same for some of the other tests in that file....
Ah, you're right. I was getting confused with Ï which would presumably match ï, but you were talking about İ which has no lowercase equivalent. With İ inserted in three places, the tests pass in my personal build - but I haven't tested it (yet) against vanilla Firefox to see if they fail.
I've filed bug 727337 on the mozSanitizingHTMLSerializer issue, and bug 727346 on the radio button group issue.
Assignee: nobody → emanuel.hoogeveen
Status: NEW → ASSIGNED
I decided to split off the changes for bug 727337 as the major one already had review+ and the refactoring was trivial.
Attachment #597218 - Attachment is obsolete: true
Attachment #598485 - Flags: review?(khuey)
Attachment #597218 - Flags: review?(khuey)
Comment on attachment 598485 [details] [diff] [review]
Change to ASCII lowercasing in content/ where appropriate

Review of attachment 598485 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me, but I'd like somebody else to look at the XSLT/XBL stuff.

::: content/base/src/nsDOMAttribute.cpp
@@ +213,5 @@
>        aContent->IsInHTMLDocument() &&
>        aContent->IsHTML()) {
>      nsAutoString name;
> +    mNodeInfo->GetName(name);
> +    nsContentUtils::ASCIIToLower(name);

I think this would be more efficient, in terms of memory usage/heap allocations, if it were written:

nsString name;
mNodeInfo->GetName(name);
nsAutoString lowercaseName;
nsContentUtils::ASCIIToLower(name, lowercaseName);
nsCOMPtr<nsIAtom> nameAtom = do_GetAtom(lowercaseName);
Attachment #598485 - Flags: review?(khuey)
Attachment #598485 - Flags: review?(jonas)
Attachment #598485 - Flags: review+
Yes, I wasn't particularly sure about the XSLT/XBL uses. As I noted, the XSLT spec does specify that attributes and elements can contain Unicode characters - but I don't think it specifies the type of case conversion to use.

Having said that, I don't really know what the reason is for the case conversion in many of these functions. I haven't really found any documentation that says what each function does - and HG blame shows the code predates the switch to Mercurial. 

Since ASCII case conversion is 'safe', in the sense that it won't touch characters outside the ASCII range, I suppose the question is whether we care that, say, Greek characters are stored in a case insensitive manner. I figured it didn't make sense to exclude them, so I didn't change the relevant functions to ASCII case conversion (except the one where HTML is concerned, as specified by HTML5 [1][2]). However I should note that the existing ToLowerCase in nsString isn't particularly clearly documented either, so I don't know whether or not it matches Unicode's compatibility caseless (something I want to look into as part of bug 727346).

[1] http://www.w3.org/TR/html5/apis-in-html-documents.html#interactions-with-xpath-and-xslt
[2] http://www.w3.org/TR/html5/semantics.html#attr-meta-http-equiv
Comment on attachment 598485 [details] [diff] [review]
Change to ASCII lowercasing in content/ where appropriate

Review of attachment 598485 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with those changes and the change requested by Kyle.

::: content/html/content/src/nsGenericHTMLElement.cpp
@@ +2491,5 @@
>  nsresult
>  nsGenericHTMLElement::SetContentEditable(const nsAString& aContentEditable)
>  {
>    nsString contentEditable;
> +  nsContentUtils::ASCIIToLower(aContentEditable, contentEditable);

Hmm.. what we really should have here is a nsContentUtils::EqualsLiteralIgnoreASCIICase function. Feel free to write one and use that instead.

::: content/xslt/src/xslt/txMozillaXMLOutput.cpp
@@ +159,5 @@
>      nsCOMPtr<nsIAtom> lname;
>  
>      if (mOpenedElementIsHTML && aNsID == kNameSpaceID_None) {
>          nsAutoString lnameStr;
> +        TX_ToLowerCase(aLocalName, lnameStr);

Why not use nsContentUtils::ASCIIToLower here? In fact, all uses of TX_ToLowerCase should probably use nsContentUtils::ASCIIToLower for the same reason as the other changes you are making in this patch (TX_ToLowerCase is really the same thing as ToLowerCase)

@@ +498,5 @@
>      if (mOutputFormat.mMethod == eHTMLOutput && aNsID == kNameSpaceID_None) {
>          nsId = kNameSpaceID_XHTML;
>  
>          nsAutoString lnameStr;
> +        TX_ToLowerCase(aLocalName, lnameStr);

Same here.
The XSLT case conversions are all HTML related, so using ASCII conversion is the right thing to do.

I.e. we should get rid of TX_ToLowerCase in favor of nsContentUtils::ASCIIToLower and make TX_ToLowerCaseAtom use nsContentUtils::ASCIIToLower
Jonas, here's the function you asked for (used in part 1). I'm asking for review for two reasons:

1) It causes some code duplication that seems necessary unless we require wrapping the literals in NS_LITERAL_STRING() (which seems inconvenient at best).

2) It uses the template trick from nsTSubstring.h to avoid calculating the length of the literal at runtime. It can be turned off using the NS_DISABLE_LITERAL_TEMPLATE define, and this happens in the following environments:
if (defined(_MSC_VER) && (_MSC_VER < 1310)) || (defined(__SUNPRO_CC) && (__SUNPRO_CC < 0x560)) || (defined(__HP_aCC) && (__HP_aCC <= 012100))

Do we still care about these environments?
Attachment #601148 - Flags: review?(jonas)
This makes the change Kyle asked for, gets rid of TX_ToLowerCase in favor of nsContentUtils::ASCIIToLower, and refactors nsGenericHTMLElement::SetContentEditable to use the function from part 0. Jonas, could you give the refactoring a glance? It seemed like a waste to use the new function, then do the lowercasing anyway just to pass "true" or "false".
Attachment #598485 - Attachment is obsolete: true
Attachment #601151 - Flags: review?(jonas)
Could you attach an interdiff or otherwise describe the changes you've made since the last patch so that I don't have to re-review the whole thing.

In general if you're making changes to a large patch always provide an interdiff when making small changes to it since it's really hard to tell exactly what changed otherwise.

Another solution is always to just address review comments in the patch and do any further changes as a separate patch. We can always merge the patches at the end before checking in if we think that's the right thing to do.
Comment on attachment 601148 [details] [diff] [review]
Part 0: Add nsContentUtils::EqualsLiteralIgnoreASCIICase

Review of attachment 601148 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great.
Attachment #601148 - Flags: review?(jonas) → review+
Erk, I should have kept a better patch queue. Oh well, here's the original patch (rebased); carrying forward r+.
Attachment #601162 - Flags: review+
And here's the interdiff you asked for. There aren't really any conflicts so I'll just obsolete the previous full patch.
Attachment #601151 - Attachment is obsolete: true
Attachment #601163 - Flags: review?(jonas)
Attachment #601151 - Flags: review?(jonas)
Hmm, maybe you wanted to flag the full patch as r+.. In my defense, I should not have been up at 5 AM.
Comment on attachment 601163 [details] [diff] [review]
Part 1b: Change to ASCII case conversion in content/ where appropriate

Review of attachment 601163 [details] [diff] [review]:
-----------------------------------------------------------------

Yay! Thanks for doing that! Made reviewing a breeze.
Attachment #601163 - Flags: review?(jonas) → review+
Keywords: checkin-needed
Backed out due to crashes when running tests.
https://hg.mozilla.org/integration/mozilla-inbound/rev/1820c723a6ad

For crash logs:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=dab441ffa459

Please run this through Try before asking for checkin again.
Damn, I see. Thanks for backing out, I'll be more careful next time.
I finally had time to finish getting everything setup for debugging Mochitests. In the end the problem turned out to be braindead simple. I'm attaching both patches again because IIRC I've had to do at least one rebase.
Attachment #601148 - Attachment is obsolete: true
Attachment #604342 - Flags: review?(jonas)
Parts 1a and 1b qfolded together and rebased. Carrying forward r+ as there have been no changes.
Attachment #601162 - Attachment is obsolete: true
Attachment #601163 - Attachment is obsolete: true
Attachment #604343 - Flags: review+
Not meant for landing, since 0v2 already contains the change.

And here's the interdiff between 0v1 and 0v2 - you'll facepalm when you see it. I'm not sure why this even compiled before.

Jonas or anyone else who has time, could you run parts 0v2 and 1v2 through try?
Attachment #604344 - Flags: review?(jonas)
Attachment #604343 - Attachment description: 601163: Part 1v2: Change to ASCII case conversion in content/ where appropriate → Part 1v2: Change to ASCII case conversion in content/ where appropriate
Attachment #597222 - Attachment description: Reasoning for the changes in the patch → Reasoning for the changes in the patch (somewhat obsoleted by comment #27)
Emanuel: I appreciate your desire not to get everything reviewed. But as previously mentioned, for large patches like this, please always state exactly what changed as to not make all the time the reviewer spent reviewing previous versions wasted.

If you haven't made any changes and just rebased, then no re-review is needed as long as the previous version was reviewed.
Comment on attachment 604342 [details] [diff] [review]
Part 0v2: Add nsContentUtils::EqualsLiteralIgnoreASCIICase

removing requests. If it's was just a rebase just mark them as r+
Attachment #604342 - Flags: review?(jonas)
Comment on attachment 604344 [details] [diff] [review]
Interdiff between part 0v1 and part 0v2.

removing requests. If it's was just a rebase just mark them as r+
Attachment #604344 - Flags: review?(jonas)
Comment on attachment 604342 [details] [diff] [review]
Part 0v2: Add nsContentUtils::EqualsLiteralIgnoreASCIICase

Well, it was just a bustage fix - literally a one-liner. I could have carried over r+ but I figured it would only take a cursory glance since I attached the interdiff.. At any rate, I would still like this sent to try since it *is* a fairly large patch and landing it failed once before.
Attachment #604342 - Flags: review+
Attachment #604344 - Attachment is obsolete: true
Attachment #604344 - Attachment is obsolete: false
Comment on attachment 604342 [details] [diff] [review]
Part 0v2: Add nsContentUtils::EqualsLiteralIgnoreASCIICase

Doh! My bad, you did indeed make a change and supply an interdiff. So sorry. Changes look good.
Attachment #604342 - Flags: review+
Here's the folded patch as requested.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Flags: in-testsuite?
Depends on: 776075
Blocks: 614725
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: