attributes are serialized with double quote even when value contains double quotes

RESOLVED FIXED

Status

()

Core
Serializers
RESOLVED FIXED
15 years ago
15 years ago

People

(Reporter: Brodie, Assigned: Brodie)

Tracking

Trunk
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 5 obsolete attachments)

(Assignee)

Description

15 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.3) Gecko/20030312
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.3) Gecko/20030312

In nsXMLContentSerializer::SerializeAttr, all attributes are output using double
quote delimiters, however in the HTML spec, both single quote and double quote
delimiters are valid. This simple output of all attributes surrounded by double
quotes results in incorrect HTML when the attribute value contains double quotes.

For instance, if the original source specified the attribute as:
    <a onclick='goToLocation("new");'>new</a>
When we run through the DOM tree and serialize this tag it will be written out as:
    <a onclick="goToLocation("new");">new</a>

Which is invalid HTML and will cause javascript errors and parsing errors in
anything that looks at this HTML in the future. 

Reproducible: Always

Steps to Reproduce:
(Assignee)

Comment 1

15 years ago
Created attachment 117188 [details] [diff] [review]
Required changes to fix the problem

I'm not sure that this is the best possible solution to the problem, but it is
infinitely better then the current state of the code and seems to fix the error
for all html that I've seen so far.

lxr link for this file is:
http://lxr.mozilla.org/seamonkey/source/content/base/src/nsXMLContentSerializer.cpp#403
(Assignee)

Comment 2

15 years ago
Created attachment 117189 [details] [diff] [review]
updated fix

Fixes error in original patch
What about an attribute whose value contains both types of quotes?  eg:

<input value="I'm not &quot;happy&quot;">

How about we do this right and just escape the '"' chars in the value?
(Assignee)

Comment 4

15 years ago
In your example, this value will be represented in memory as [I'm not "happy"]
right? When I wrote the fix I wasn't sure if we could have this because I knew
we couldn't have both types of unescaped quotes in the value in the HTML, I
forgot about escaped quotes. Escaping the quotes won't cause any problems in
javascript or elsewhere because the HTML parser should always unescape the
string before using them, right? 

We could scan the value string first to determine if we have only 1 type of
quote, and if so just pick the correct value delimiter, otherwise if we have
both types then go ahead and replace one of the types with the &quot;
representation. This would mean we save a bit of buffer manipulation if we only
have a single quote type in the value.

i.e.
scan value string, count all single & double quote chars
if single quote count = 0 or double quote count = 0
   set quote type to count = 0 type
else
   set quote type to double
   create new value string by escaping all double quotes
output value string

I can create a new patch like this if you agree.
> this value will be represented in memory as [I'm not "happy"] right?

It should be.  You could check (with your patch, of course).  ;)

Your approach sounds good.
(Assignee)

Comment 6

15 years ago
In the calling function nsHTMLContentSerializer::SerializeAttributes (see
http://lxr.mozilla.org/seamonkey/source/content/base/src/nsHTMLContentSerializer.cpp#406)
 the value of aDoEscapeEntities is set to PR_FALSE for attribute values
containing javascript. Any idea why this would be? Any reason why we can't just
set this to PR_TRUE for all attributes including JS?
If you click the "Blame" link in the top-right corner of that page, it gives you
a listing of the file with each line of code "blamed" to the checkin that last
touched it... In this case, see bug 68167 and bug 92271
(Assignee)

Comment 8

15 years ago
The problem discussed in bug 68167 seems to be more indesirable formatting than
an error in the HTML. It appears that this bug was that apostrophe entity
references were used unnecessarily. The following sample works perfectly in Moz 1.3:
<a href="  javascript: alert(&apos;href&apos;); alert('quoting problem');
alert(&quot;string&quot;); ">blah</a>

Bug 92271 isn't directly related to this problem as it is about URL escaping.

According to the W3 spec, it is recommended for attribute values containing
quotes and ampersands to have them turned into character entities references. 
http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.3.2.2

[partial quote]
When script or style data is the value of an attribute (either style or the
intrinsic event attributes), authors should escape occurrences of the delimiting
single or double quotation mark within the value according to the script or
style language convention. Authors should also escape occurrences of "&" if the
"&" is not meant to be the beginning of a character reference.
[endquote]

The main problem at the moment seems to be where to introduce the fix. Is it
valid to modify nsXMLContentSerializer::SerializeAttr and use a character entity
for a quote when necessary, thus ignoring the value of aDoEscapeEntities?
ccing brade, seeing as she added this code....
(Assignee)

Comment 10

15 years ago
Created attachment 117455 [details] [diff] [review]
patch for comments

I've attached an updated patch. This follows the following pseudocode...

if ( translating to entities ) {
  add unmodified string (use current code)
}
else {
  determine which quote characters exist in the string (apos, quot)
  select a value delimiter character (see table in attached patch)
  if both apos and quot are used then {
    translate quot chars to entities
    add modified string
  } 
  else {
    add unmodified string
  }
}

Comments on design?
Attachment #117189 - Attachment is obsolete: true
To Brodie, btw... ;)
Assignee: harishd → bugzilla
Status: UNCONFIRMED → NEW
Ever confirmed: true
The design seems pretty reasonable to me...  One comment on the code -- NS_LL is
not exactly portable as an nsAStrin; if you need literal strings use
NS_LITERAL_STRING, if you need to convert char to unichar use PRUnichar(char),
in the |case| statements you can just use chars.

Comment 13

15 years ago
my only suggestion is to do some testing in Composer and make sure things still
roundtrip correctly (see testcases in various bugs including bug 68167)
(Assignee)

Comment 14

15 years ago
Created attachment 117561 [details] [diff] [review]
updated patch (bugfix)

The line that adds the '=' sign had been mistakenly deleted at the time I made
the last patch. This fixes that problem.
(Assignee)

Updated

15 years ago
Attachment #117455 - Attachment is obsolete: true
(Assignee)

Comment 15

15 years ago
Created attachment 117562 [details]
composer testcase for roundtrip testing

Testcase to check this bug and the results of this patch. Using composer, open
the file and switch to html view. The changes to the html source as detailed in
the paragraph should have occurred.

Brade:
Note that this does not provide true roundtrip to the original source. Instead
the code is simplified and made consistent while still being valid attributes
according to the standard. For example, what won't roundtrip is:
    <a href="alert(&quot;Hello&quot;)">hello</a>
instead, you will get the simpler but still valid version:
    <a href='alert("Hello")'>hello</a>

The only way that I can see to get real roundtrip is to store the actual string
from the source file at parse time (along with the delimiter type).

Boris:
I've changed the code to eliminate NS_LL where possible. The only place that I
couldn't find a replacement was the call to nsString::ReplaceString(). This
requires an nsString or PRUnichar* parameter. NS_LITERAL_STRING doesn't provide
a compatible type for this. I'm not sure what you mean about the case
statements - it is comparisons between PRUnichar types here not char, I still
need to cast as far as I understand.
> This requires an nsString or PRUnichar* parameter.

Yeah, but NS_LL is neither of those, on Linux.  In fact, it's a no-op (unless
you're building with GCC 3.x and have --fshort-wchar enabled).  You can get a
PRUnichar* out of NS_LITERAL_STRING by using .get() like:

NS_LITERAL_STRING("\"").get()
(Assignee)

Comment 17

15 years ago
Created attachment 118387 [details] [diff] [review]
patch with requested changes

I have built and tested this patch in the composer on Windows 2000 and Redhat
Linux 2.2.16-22 without a problem (changes to code on round-trip as described
in the testcase). Could I get review and s-review please.
Attachment #117561 - Attachment is obsolete: true
(Assignee)

Updated

15 years ago
Attachment #118387 - Flags: superreview?
Attachment #118387 - Flags: review?

Updated

15 years ago
Attachment #118387 - Flags: superreview?(kin)
Attachment #118387 - Flags: superreview?
Attachment #118387 - Flags: review?(akkana)
Attachment #118387 - Flags: review?
(Assignee)

Comment 18

15 years ago
Since the 1.4 version of composer is going to have a few extra features, and 1.4
is trying to be a stable and fairly bugfree platform for development, it would
be good to get this fix into the 1.4 version as well since this is fixing
problems in the composer. Nominating this for 1.4b. This requires reviews and a
checkin.
Flags: blocking1.4b?
Comment on attachment 118387 [details] [diff] [review]
patch with requested changes

>+    // depending on if the attribute value contains quotes or apostrophes we

"depending on whether"

r=bzbarsky; please mail kin@netscape.com in person and prod lightly for review?
Attachment #118387 - Flags: review?(akkana) → review+

Updated

15 years ago
Attachment #118387 - Flags: superreview?(kin) → superreview?(heikki)
Comment on attachment 118387 [details] [diff] [review]
patch with requested changes

>Index: nsXMLContentSerializer.cpp
>===================================================================
>-  AppendToString(NS_LITERAL_STRING("\""), aStr);
>+    AppendToString(NS_LITERAL_STRING("\""), aStr);

There is a version of AppendToString that takes PRUnichar, so you could do:

+    AppendToString(PRUnichar('"'), aStr);

>+    // character entity references, ignoring the value of aDoEscapeEntities.
>+    // http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.3.2.2 for
>+    // the standard on character entity references in values etc.

Missing something above? Should the URL line start with "See "?

>+    // Delimiter and escaping is according to the following table
>+    //    bIncludesDouble     bIncludesSingle     Delimiter       Escape Quotes
>+    //    FALSE               FALSE               "               FALSE
>+    //    FALSE               TRUE                "               FALSE
>+    //    TRUE                FALSE               '               FALSE
>+    //    TRUE                TRUE                "               TRUE

Reading this comment I wondered why you escape both single and double quotes in
the last case, but reading the code that was not the case. So maybe you should
change the column heading to be "Escape Double Quotes". You might also want to
move this table further down, where the actual escaping happens.

>+    for (;iCurr != iEnd; ++iCurr) {
>+      switch (*iCurr) {
>+      case PRUnichar('\''):
>+        bIncludesSingle = PR_TRUE;
>+        if (bIncludesDouble) {
>+          // no point searching further if we have both
>+          iCurr = iEnd;
>+          --iCurr;    // for loop will increment before test
>+        }
>+        break;
>+      case PRUnichar('\"'):

Do you need to escape the double quote like that inside the apostrophes?

>+        bIncludesDouble = PR_TRUE;
>+        if (bIncludesSingle) {
>+          iCurr = iEnd;
>+          --iCurr;
>+        }
>+      }
>+    }

I would recomment you change the switch statement to if statement. Then you
could simply break when you have found that the string contains both single and
double quote, and save the manipulations with iCurr that is not used after that
point.

This is probably not perf critical piece of code but if it was that loop with
iterators could be written slightly more efficiently by following the string
guide:
http://www.mozilla.org/projects/xpcom/string-guide.html#Looping_Iterators

>+    PRUnichar cDelimiter = 
>+        (bIncludesDouble && !bIncludesSingle) ? PRUnichar('\'') : PRUnichar('\"');

Again is it necessary to escape the double quote inside single quotes?

Nothing that you absolutely must fix before checking in, but I would appreciate
it if you could make those changes.
Attachment #118387 - Flags: superreview?(heikki) → superreview+
(Assignee)

Comment 21

15 years ago
Created attachment 120967 [details] [diff] [review]
updated patch

I've updated the patch to include the comments from Heikki Toivonen. All of the
requested changes from comment 19 and comment 20 have been included. I have
built and tested this on Win32 and Linux with todays daily trunk.
Attachment #118387 - Attachment is obsolete: true
Brodie, please let me know if you need this checked in, ok?
(Assignee)

Comment 23

15 years ago
I don't have CVS checkin privs, so I will need it checked in for me. 

Because I changed the string iterator loop substantially to do the faster
looping, would you have a squiz at that before you check it in though. I am
confident that it is correct (I basically cut and pasted the sample), but I
would prefer someone else to have a look.
Checked in (yes, the iterator stuff looks correct).  Thanks for the patch and
your patience!
Status: NEW → RESOLVED
Last Resolved: 15 years ago
Resolution: --- → FIXED

Comment 25

15 years ago
What was the reason for not doing the simpler "Always use double-quote as
delimiter, escape double-quotes if they exist"? You'd save one pass over the
attribute string that way, your code becomes a little simpler, your output has
consistent attribute value delimiters[0], at the cost of slightly larger and a
little less human readable xml source.

[0] I don't have to look at the delimiter to know which of ' and " are safe to
use directly, and (to me) it looks a little prettier to have consistent delimiters
(Assignee)

Comment 26

15 years ago
There didn't seem any compelling reason to go in either direction. Granted it
probably is a bit over-engineered, but the primary advantage of doing it this
way is that we avoid shuffling the buffer around to insert the entity string if
it isn't necessary. 

Neither this way or always using quote delimiters is the best solution for
roundtrip editing. Like I said in comment 15, we can't roundtrip back to the
quoting method used by the user because we never store that information.
Consequently this best fit is as good as any.
clearing 1.4b blocking request, since marked as fixed.
Flags: blocking1.4b?
You need to log in before you can comment on or make changes to this bug.