Closed
Bug 492931
Opened 16 years ago
Closed 13 years ago
Case change algorithm in DOM should use ASCII upper/lowercasing
Categories
(Core :: DOM: Core & HTML, defect)
Core
DOM: Core & HTML
Tracking
()
RESOLVED
FIXED
mozilla13
People
(Reporter: hsivonen, Assigned: ehoogeveen)
References
(Blocks 1 open bug)
Details
(Keywords: html5, Whiteboard: [mentor=khuey][lang=c++])
Attachments
(6 files, 8 obsolete files)
1.10 KB,
text/html
|
Details | |
8.35 KB,
text/plain
|
Details | |
4.78 KB,
patch
|
ehoogeveen
:
review+
sicking
:
review+
|
Details | Diff | Splinter Review |
19.10 KB,
patch
|
ehoogeveen
:
review+
|
Details | Diff | Splinter Review |
941 bytes,
patch
|
Details | Diff | Splinter Review | |
23.64 KB,
patch
|
Details | Diff | Splinter Review |
ToLowerCase and ToUpperCase used in the DOM internals use upper/lowercasing algorithms that touch non-ASCII characters (but aren't apparently performing full Unicode upper/lowercasing--hmm).
Per HTML5, where Gecko DOM uses ToLowerCase and ToUpperCase, ASCII upper/lowercasing should be used instead (to match what the HTML tokenizer does--both the HTML5 tokenizer and the old tokenizer).
Blocks: domperf
Assignee: nobody → me
This hits a couple of the web-visible issues this causes.
Assignee: khuey → nobody
Whiteboard: [mentor=khuey][lang=c++]
In general, uses of ToLowerCase or similar functions in content/ should be replaced by nsContentUtils::ASCIIToLower. It's hard to construct an MXR query for this, but if anyone is interested in picking this up I can point them at the right places.
Assignee | ||
Comment 4•13 years ago
|
||
I'd like to take this. Any legitimate non-ASCII uses I should watch out for?
In content/ and dom/? Not that I can think of off hand.
Comment 6•13 years ago
|
||
Probably not, but it's probably best if you just check every place you change against the relevant spec...
Assignee | ||
Comment 7•13 years ago
|
||
Alright, I'm all set up, and I have a question: In content/base/src/nsWebSocket.cpp there are two instances where ToLowerCase() is called on obviously ASCII strings. However these are nsCStrings, and there's no overload of nsContentUtils::ASCIIToLower() for these types. Adding these overloads is trivial, and I've checked that Firefox compiles with the change. But is it overkill to add something like this? Should I just ignore instances where nsCStrings are used (since Firefox generally uses UTF-16, I believe)? I can't judge right now whether there are a lot of cases of this or just the one.
Comment 8•13 years ago
|
||
ToLowerCase on an nsCString already does ASCII lowercasing, no? So no need to do anything with it.
Assignee | ||
Comment 9•13 years ago
|
||
Hmm.. Yes, it seems to use the implementation of ToLowerCase implemented in xpcom/string/src/nsReadableUtils.cpp - at least as far as I can tell, and that one does use ASCII lowercasing. I got confused because there are several definitions of ToLowerCase and I had been looking at the wrong one. This may be a stupid question, but is there a better way to tell which implementation will be used than simply breaking the code and seeing what the compiler complains about? All of these files have a lot of includes.
Comment 10•13 years ago
|
||
You can usually tell based on the argument types.
Assignee | ||
Comment 11•13 years ago
|
||
This is a WIP patch that only covers only the files in content/base/src; I'm attaching it with a feedback request to see if I'm on the right track, since this is my first patch. I'll also attach a text file with my reasoning for each of the changes, since I had to dig into the specs to figure them out.
Attachment #584383 -
Flags: feedback?(bzbarsky)
Assignee | ||
Comment 12•13 years ago
|
||
This text file contains my reasoning for the changes in the WIP patch. It wasn't always immediately obvious which specs were involved, but I've tried to be thorough.
Notably, I decided not to change some lines in content/base/src/nsDocument.cpp because the spec appeared to say that a unicode "compatibility caseless match" comparison was required, which definitely doesn't match ASCII case conversion (but I don't know whether the ToUpperCase and ToLowerCase functions match it either).
Assignee | ||
Comment 13•13 years ago
|
||
Oh, and I remembered too late that the mentor keyword is set to khuey - I don't mean to pass you over! Feedback from either of you would be much appreciated :)
Updated•13 years ago
|
Attachment #584383 -
Flags: feedback?(khuey)
Comment 14•13 years ago
|
||
Comment on attachment 584383 [details] [diff] [review]
WIP - Fix uses under content/base/src
The sanitizing serializer code is wrong, imo. It should be using ASCIIToLower. Otherwise it might treat attr names with dotted capital I as allowed, no? Please fix this.
For the radio code, compatibility caseless matching doesn't mean the same thing as ToLowerCase. That code might need fixing. Followup bug?
The rest looks great. Thank you for the very helpful analysis attachment!
Attachment #584383 -
Flags: feedback?(bzbarsky) → feedback+
Assignee | ||
Comment 15•13 years ago
|
||
Thanks for the feedback Boris, I'll strive to have a patch up for review this weekend or shortly after.
Assignee | ||
Comment 16•13 years ago
|
||
This should do it, at least everything under content/.
Kyle, did you want me to look at dom/ as well (comment #5)? How do I try out your test from comment #2?
Boris, I didn't end up changing the sanitizing serializer code from the WIP patch, because NS_LossyConvertUTF16toASCII() already drops all the non-ASCII characters and spits out an nsCString, and ToLowerCase() on an nsCString implicitly performs ASCII lowercasing (this is a bit confusing really, since its nsString counterpart does some form of unicode case folding).
Attachment #584383 -
Attachment is obsolete: true
Attachment #597218 -
Flags: review?(khuey)
Attachment #584383 -
Flags: feedback?(khuey)
(In reply to Emanuel Hoogeveen from comment #16)
> Created attachment 597218 [details] [diff] [review]
> Change to ASCII lowercasing in content/ where appropriate
>
> This should do it, at least everything under content/.
>
> Kyle, did you want me to look at dom/ as well (comment #5)? How do I try out
> your test from comment #2?
Add it as a mochitest.
Assignee | ||
Comment 18•13 years ago
|
||
Here's my reasoning for all the changes, updated for the finished patch. Note that not all the line numbers may be correct anymore as they were based on an earlier revision.
Most of the decisions ended up being fairly clear. I decided not to change anything where only a single character was being changed (since ASCIIToLower and ASCIIToUpper have no overload for single characters), where nsCStrings were being changed (since ToLowerCase and ToUpperCase are implicitly ASCII for nsCStrings) - and somewhat bafflingly, in the case of radio buttons, because the spec says so.
Attachment #584385 -
Attachment is obsolete: true
Comment 19•13 years ago
|
||
Ah, I missed that there's the lossy convert to ASCII there... That's even more broken, actually. It'll give false positives for "allowed", which is likely a security issue. Followup bug, please?
Assignee | ||
Comment 20•13 years ago
|
||
Kyle: I ran the test - and it failed. But is it testing what it should? Firstly, it says 'dIv' is a Unicode tag name and those should match nothing, but surely dIv is pure ASCII. Secondly it says 'id' and 'Id' should be distinct, but how would that change by switching to ASCII case conversion? The test seems to be saying there should be *no* case conversion of any kind going on in these situations.
Boris: I'll file bugs about the radio code and the sanitizing serializer code.
Comment 21•13 years ago
|
||
> but surely dIv is pure ASCII.
Presumably that part of the text should have used a dotted capital I?
Same for some of the other tests in that file....
Assignee | ||
Comment 22•13 years ago
|
||
Ah, you're right. I was getting confused with Ï which would presumably match ï, but you were talking about İ which has no lowercase equivalent. With İ inserted in three places, the tests pass in my personal build - but I haven't tested it (yet) against vanilla Firefox to see if they fail.
Assignee | ||
Comment 23•13 years ago
|
||
I've filed bug 727337 on the mozSanitizingHTMLSerializer issue, and bug 727346 on the radio button group issue.
Updated•13 years ago
|
Assignee: nobody → emanuel.hoogeveen
Status: NEW → ASSIGNED
Assignee | ||
Comment 24•13 years ago
|
||
I decided to split off the changes for bug 727337 as the major one already had review+ and the refactoring was trivial.
Attachment #597218 -
Attachment is obsolete: true
Attachment #598485 -
Flags: review?(khuey)
Attachment #597218 -
Flags: review?(khuey)
Comment on attachment 598485 [details] [diff] [review]
Change to ASCII lowercasing in content/ where appropriate
Review of attachment 598485 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good to me, but I'd like somebody else to look at the XSLT/XBL stuff.
::: content/base/src/nsDOMAttribute.cpp
@@ +213,5 @@
> aContent->IsInHTMLDocument() &&
> aContent->IsHTML()) {
> nsAutoString name;
> + mNodeInfo->GetName(name);
> + nsContentUtils::ASCIIToLower(name);
I think this would be more efficient, in terms of memory usage/heap allocations, if it were written:
nsString name;
mNodeInfo->GetName(name);
nsAutoString lowercaseName;
nsContentUtils::ASCIIToLower(name, lowercaseName);
nsCOMPtr<nsIAtom> nameAtom = do_GetAtom(lowercaseName);
Attachment #598485 -
Flags: review?(khuey)
Attachment #598485 -
Flags: review?(jonas)
Attachment #598485 -
Flags: review+
Assignee | ||
Comment 26•13 years ago
|
||
Yes, I wasn't particularly sure about the XSLT/XBL uses. As I noted, the XSLT spec does specify that attributes and elements can contain Unicode characters - but I don't think it specifies the type of case conversion to use.
Having said that, I don't really know what the reason is for the case conversion in many of these functions. I haven't really found any documentation that says what each function does - and HG blame shows the code predates the switch to Mercurial.
Since ASCII case conversion is 'safe', in the sense that it won't touch characters outside the ASCII range, I suppose the question is whether we care that, say, Greek characters are stored in a case insensitive manner. I figured it didn't make sense to exclude them, so I didn't change the relevant functions to ASCII case conversion (except the one where HTML is concerned, as specified by HTML5 [1][2]). However I should note that the existing ToLowerCase in nsString isn't particularly clearly documented either, so I don't know whether or not it matches Unicode's compatibility caseless (something I want to look into as part of bug 727346).
[1] http://www.w3.org/TR/html5/apis-in-html-documents.html#interactions-with-xpath-and-xslt
[2] http://www.w3.org/TR/html5/semantics.html#attr-meta-http-equiv
Comment on attachment 598485 [details] [diff] [review]
Change to ASCII lowercasing in content/ where appropriate
Review of attachment 598485 [details] [diff] [review]:
-----------------------------------------------------------------
r=me with those changes and the change requested by Kyle.
::: content/html/content/src/nsGenericHTMLElement.cpp
@@ +2491,5 @@
> nsresult
> nsGenericHTMLElement::SetContentEditable(const nsAString& aContentEditable)
> {
> nsString contentEditable;
> + nsContentUtils::ASCIIToLower(aContentEditable, contentEditable);
Hmm.. what we really should have here is a nsContentUtils::EqualsLiteralIgnoreASCIICase function. Feel free to write one and use that instead.
::: content/xslt/src/xslt/txMozillaXMLOutput.cpp
@@ +159,5 @@
> nsCOMPtr<nsIAtom> lname;
>
> if (mOpenedElementIsHTML && aNsID == kNameSpaceID_None) {
> nsAutoString lnameStr;
> + TX_ToLowerCase(aLocalName, lnameStr);
Why not use nsContentUtils::ASCIIToLower here? In fact, all uses of TX_ToLowerCase should probably use nsContentUtils::ASCIIToLower for the same reason as the other changes you are making in this patch (TX_ToLowerCase is really the same thing as ToLowerCase)
@@ +498,5 @@
> if (mOutputFormat.mMethod == eHTMLOutput && aNsID == kNameSpaceID_None) {
> nsId = kNameSpaceID_XHTML;
>
> nsAutoString lnameStr;
> + TX_ToLowerCase(aLocalName, lnameStr);
Same here.
Attachment #598485 -
Flags: review?(jonas) → review+
The XSLT case conversions are all HTML related, so using ASCII conversion is the right thing to do.
I.e. we should get rid of TX_ToLowerCase in favor of nsContentUtils::ASCIIToLower and make TX_ToLowerCaseAtom use nsContentUtils::ASCIIToLower
Assignee | ||
Comment 29•13 years ago
|
||
Jonas, here's the function you asked for (used in part 1). I'm asking for review for two reasons:
1) It causes some code duplication that seems necessary unless we require wrapping the literals in NS_LITERAL_STRING() (which seems inconvenient at best).
2) It uses the template trick from nsTSubstring.h to avoid calculating the length of the literal at runtime. It can be turned off using the NS_DISABLE_LITERAL_TEMPLATE define, and this happens in the following environments:
if (defined(_MSC_VER) && (_MSC_VER < 1310)) || (defined(__SUNPRO_CC) && (__SUNPRO_CC < 0x560)) || (defined(__HP_aCC) && (__HP_aCC <= 012100))
Do we still care about these environments?
Attachment #601148 -
Flags: review?(jonas)
Assignee | ||
Comment 30•13 years ago
|
||
This makes the change Kyle asked for, gets rid of TX_ToLowerCase in favor of nsContentUtils::ASCIIToLower, and refactors nsGenericHTMLElement::SetContentEditable to use the function from part 0. Jonas, could you give the refactoring a glance? It seemed like a waste to use the new function, then do the lowercasing anyway just to pass "true" or "false".
Attachment #598485 -
Attachment is obsolete: true
Attachment #601151 -
Flags: review?(jonas)
Could you attach an interdiff or otherwise describe the changes you've made since the last patch so that I don't have to re-review the whole thing.
In general if you're making changes to a large patch always provide an interdiff when making small changes to it since it's really hard to tell exactly what changed otherwise.
Another solution is always to just address review comments in the patch and do any further changes as a separate patch. We can always merge the patches at the end before checking in if we think that's the right thing to do.
Comment on attachment 601148 [details] [diff] [review]
Part 0: Add nsContentUtils::EqualsLiteralIgnoreASCIICase
Review of attachment 601148 [details] [diff] [review]:
-----------------------------------------------------------------
Looks great.
Attachment #601148 -
Flags: review?(jonas) → review+
Assignee | ||
Comment 33•13 years ago
|
||
Erk, I should have kept a better patch queue. Oh well, here's the original patch (rebased); carrying forward r+.
Attachment #601162 -
Flags: review+
Assignee | ||
Comment 34•13 years ago
|
||
And here's the interdiff you asked for. There aren't really any conflicts so I'll just obsolete the previous full patch.
Attachment #601151 -
Attachment is obsolete: true
Attachment #601163 -
Flags: review?(jonas)
Attachment #601151 -
Flags: review?(jonas)
Assignee | ||
Comment 35•13 years ago
|
||
Hmm, maybe you wanted to flag the full patch as r+.. In my defense, I should not have been up at 5 AM.
Comment on attachment 601163 [details] [diff] [review]
Part 1b: Change to ASCII case conversion in content/ where appropriate
Review of attachment 601163 [details] [diff] [review]:
-----------------------------------------------------------------
Yay! Thanks for doing that! Made reviewing a breeze.
Attachment #601163 -
Flags: review?(jonas) → review+
Assignee | ||
Updated•13 years ago
|
Keywords: checkin-needed
Comment 37•13 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/e2354d606ab0
https://hg.mozilla.org/integration/mozilla-inbound/rev/db828397a288
https://hg.mozilla.org/integration/mozilla-inbound/rev/7891f39211d6
Keywords: checkin-needed
Comment 38•13 years ago
|
||
Backed out due to crashes when running tests.
https://hg.mozilla.org/integration/mozilla-inbound/rev/1820c723a6ad
For crash logs:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=dab441ffa459
Please run this through Try before asking for checkin again.
Assignee | ||
Comment 39•13 years ago
|
||
Damn, I see. Thanks for backing out, I'll be more careful next time.
Assignee | ||
Comment 40•13 years ago
|
||
I finally had time to finish getting everything setup for debugging Mochitests. In the end the problem turned out to be braindead simple. I'm attaching both patches again because IIRC I've had to do at least one rebase.
Attachment #601148 -
Attachment is obsolete: true
Attachment #604342 -
Flags: review?(jonas)
Assignee | ||
Comment 41•13 years ago
|
||
Parts 1a and 1b qfolded together and rebased. Carrying forward r+ as there have been no changes.
Attachment #601162 -
Attachment is obsolete: true
Attachment #601163 -
Attachment is obsolete: true
Attachment #604343 -
Flags: review+
Assignee | ||
Comment 42•13 years ago
|
||
Not meant for landing, since 0v2 already contains the change.
And here's the interdiff between 0v1 and 0v2 - you'll facepalm when you see it. I'm not sure why this even compiled before.
Jonas or anyone else who has time, could you run parts 0v2 and 1v2 through try?
Attachment #604344 -
Flags: review?(jonas)
Assignee | ||
Updated•13 years ago
|
Attachment #604343 -
Attachment description: 601163: Part 1v2: Change to ASCII case conversion in content/ where appropriate → Part 1v2: Change to ASCII case conversion in content/ where appropriate
Assignee | ||
Updated•13 years ago
|
Attachment #597222 -
Attachment description: Reasoning for the changes in the patch → Reasoning for the changes in the patch (somewhat obsoleted by comment #27)
Emanuel: I appreciate your desire not to get everything reviewed. But as previously mentioned, for large patches like this, please always state exactly what changed as to not make all the time the reviewer spent reviewing previous versions wasted.
If you haven't made any changes and just rebased, then no re-review is needed as long as the previous version was reviewed.
Comment on attachment 604342 [details] [diff] [review]
Part 0v2: Add nsContentUtils::EqualsLiteralIgnoreASCIICase
removing requests. If it's was just a rebase just mark them as r+
Attachment #604342 -
Flags: review?(jonas)
Comment on attachment 604344 [details] [diff] [review]
Interdiff between part 0v1 and part 0v2.
removing requests. If it's was just a rebase just mark them as r+
Attachment #604344 -
Flags: review?(jonas)
Assignee | ||
Comment 46•13 years ago
|
||
Comment on attachment 604342 [details] [diff] [review]
Part 0v2: Add nsContentUtils::EqualsLiteralIgnoreASCIICase
Well, it was just a bustage fix - literally a one-liner. I could have carried over r+ but I figured it would only take a cursory glance since I attached the interdiff.. At any rate, I would still like this sent to try since it *is* a fairly large patch and landing it failed once before.
Attachment #604342 -
Flags: review+
Assignee | ||
Updated•13 years ago
|
Attachment #604344 -
Attachment is obsolete: true
Assignee | ||
Updated•13 years ago
|
Attachment #604344 -
Attachment is obsolete: false
Comment on attachment 604342 [details] [diff] [review]
Part 0v2: Add nsContentUtils::EqualsLiteralIgnoreASCIICase
Doh! My bad, you did indeed make a change and supply an interdiff. So sorry. Changes look good.
Attachment #604342 -
Flags: review+
Assignee | ||
Comment 48•13 years ago
|
||
Here's the folded patch as requested.
Pushed to inbound: https://hg.mozilla.org/integration/mozilla-inbound/rev/d09b4e60bb09
Comment 50•13 years ago
|
||
Target Milestone: --- → mozilla13
Updated•13 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•13 years ago
|
Flags: in-testsuite?
You need to log in
before you can comment on or make changes to this bug.
Description
•