Closed Bug 464314 Opened 16 years ago Closed 15 years ago

view-source link-browsing doesn't decode entities

Categories

(Core :: DOM: HTML Parser, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla1.9.2a1

People

(Reporter: dao, Assigned: cbartley)

References

(Blocks 1 open bug)

Details

(Keywords: fixed1.9.1)

Attachments

(3 files, 2 obsolete files)

If you view the source of http://example.com/ and it contains a link to "/?foo=1&bar=2", this will lead to view-source:http://example.com/?foo=1&bar=2 instead of view-source:http://example.com/?foo=1&bar=2.
OS: Windows XP → All
Hardware: PC → All
Assignee: nobody → cbartley
This bug pretty seriously impacts view source linkification so I think we should absolutely want to fix it for 3.1.  I have a patch that is very nearly ready for review.
Flags: blocking1.9.1?
I had two choices here:

1. Modify the parser code to return both raw source and source with entities decoded.

or

2. Duplicate some entity decoding logic in nsViewSourceHTML.cpp

Neither solution is ideal.  Duplicating logic is not good (DRY is your friend) but making already crufty parser code even more crufty doesn't seem like a great alternative.

I opted for duplicating the logic because I had more confidence that I could get solid and correct code that way, and I believed I could do it faster.  (Faster here being a relative term, but I digress.)  The new code could also be structured in a way that would make it easy to unit test, providing another level of confidence.

I in fact do have some simple unit tests for this new code, but I'm not including them in this patch because I was never able to modify the build system to build the unit tests reliably.  I may upload the unit tests as another patch for reference.
Attachment #352879 - Flags: review?(mrbkap)
Simple manual testcase for attachment 352879 [details] [diff] [review]: View Source linkification now correctly decodes entities
Unit tests for the patch "attachment 352879 [details] [diff] [review]: View Source linkification now correctly decodes entities".  Included only for reference, I'm not requesting review.  A complete patch would include makefile changes necessary to build the unit test.  These makefile changes have turned out to be much more difficult than I anticipated, by which I mean I still haven't figured them out.
Can we abstract out the entity decoding code so that it lives only in one place?  Or is it too entwined with the parser innards?
It's pretty entwined.  See

static nsresult
ConsumeEntity(nsScannerSharedSubstring& aString,
              nsScanner& aScanner,
              PRBool aIECompatible,
              PRInt32 aFlag)

in parser/htmlparser/src/nsHTMLTokens.cpp.
(In reply to comment #6)
> It's pretty entwined.  See ...

I think it may be helpful if I elaborate on my thinking.

Normally the parser generates tokens with entities decoded to characters for their text content.  However, when parsing for view source, the parser stores the raw text in the tokens, since that's what we want to see in the view source window.

With linkification, however, there are cases where we need both kinds of text -- when we are displaying the value of an HREF attribute on an anchor, we want to display the undecoded source text in the view source window, but we want to use the decoded text to build the link we're going to attach to that token.

There are a couple of approaches that could be used here.

1. Modify the parser to return both kinds of text when the view source flag is set.  This would really only need to be done on attribute value tokens.
2. Leave the parser as-is and decode any entities in the attribute values separately when building URLs in nsViewSourceHTML.cpp.

For approach 2, there are a couple of variations:
a) Re-use the parser logic by calling into it at the right place to simply decode the string containing the attribute value
b) Write duplicate parsing logic to identify entities in the attribute value, and then call into the parser logic at the right place to convert the entities to unicode
c) Duplicate the parsing logic *and* the numeric entity decoding, but rely on parser logic to decode symbolic entities.
d) Duplicate all the logic including numeric and symbolic entity decoding.

Now to be fair, there's no real reason to duplicate the symbolic entity decoding since that's already been nicely factored out into the nsHTMLEntities::EntityToUnicode() function, and there's no reason not to call it directly.  So we can exclude d) as an option.

Now to step back for a second -- approach 1 requires modifying the parser, and modifying it in a way that feels pretty hacky.  Since the parser already has special casing inside it specifically for view source, maybe that's not such a big deal.  Nevertheless modifying the parser is a riskier change and would require me to invest a fair amount of time in understanding some portion of the parser code so that I could confidently make the changes.  So this approach seemed like more risk, more work, and more hacky.  Feeling the heat of an impending deadline, I decided not to take this approach.

I discounted approaches 2a and 2b because after a cursory inspection of the parsing code that handles entities, it didn't look like there was an easy way to call into this code.  It would be necessary at the very least to set up a dummy scanner instance, and not only would this probably be ugly, but it would also introduce a dependency from outside into parser internals that looked like they weren't ever designed to be used that way.

So I ended up using approach 2c.

This approach involves duplicating the fairly simple entity parsing code and numeric entity conversion, but relies on the existing symbolic entity conversion capability provided by nsHTMLEntities::EntityToUnicode().

In an ideal situation I'd prefer not to duplicate logic like this at all.  If the existing parser logic had been better factored out or if I felt comfortable doing refactoring of the existing parser logic (scary!) to make it easier to call, then I'd be all for that.

However, given the constraints here it seemed like duplicating some logic that was actually fairly simple seemed like the lesser of evils.  One advantage of writing new logic in this case is it's probably a lot easier to code review since it's self-contained, pretty aggressively documented, and easy to test in isolation.

Also a note on if-statement style.  I usually write:

  if (condition) {
    printf("hello, world");
  }

but 

  if (condition) return "hello, world";

That is to say, if my if-statement is lacking braces, it is:
a) a return statement (or a break)
b) all on the same line

I won't ever write

  if (condition)
    return "hello, world";

If I felt the need to put the return on a separate line, I'd always use braces.

I use this one line return style because I think in many cases it helps readability (see the IsTokenValueTrimmableCharacter() function in attachment 346379 [details] [diff] [review] for an example) and usually doesn't hurt in other cases.  That said, I don't think it will make any real difference in this patch, so I'm not making an impassioned argument to keep the style here.  All I want to do is communicate that I put some thought into the issue, it's not just me being sloppy.

Anyway, I apologize if this comment comes off as being pedantic.  Based on comments here and on IRC (back before Christmas), it seems like there is some question about why I did the things the way I did.  I figured it wouldn't hurt to elaborate a priori rather than answer them as part of the code review dialog.
Blocks: 472846
Comment on attachment 352879 [details] [diff] [review]
View Source linkification now correctly decodes entities

I'm not happy with the duplication of logic here, but I agree that this is probably better than trying to shoehorn the parser stuff into being usable from anywhere.

>diff -r 055cb77c977c parser/htmlparser/src/nsViewSourceHTML.cpp
>@@ -995,18 +996,20 @@ PRBool CViewSourceHTML::IsUrlAttribute(c
>+    const nsAString& baseSpec = TrimTokenValue(attrValue);
>+    nsString expandedBaseSpec;

This wants to be an nsAutoString to avoid unnecessary mallocs.

>@@ -1071,18 +1074,21 @@ nsresult CViewSourceHTML::CreateViewSour
>+  nsString expandedLinkUrl;

Ditto.

>@@ -1200,8 +1206,154 @@ nsresult CViewSourceHTML::SetBaseURI(con
>+void CViewSourceHTML::ExpandEntities(const nsAString& textIn, nsString& textOut) {
>+  

Nits: Put the { on its own line for function declarations and nix the extra empty line here.

>+  nsAString::const_iterator iter, end;
>+  textIn.BeginReading(iter);
>+  textIn.EndReading(end);
>+
>+  // The outer loop treats the input as a sequence of pairs of runs.  The first
>+  // run of each pair is just a run of regular characters.  The second run is
>+  // something that looks like it might be an entity reference, e.g. "&".
>+  // Any regular run may be empty, and the entity run can be skipped at the end
>+  // of the text.  Apparent entities that can't be translated are copied
>+  // verbatim.  In particular this allows for raw ampersands in the input.
>+  // Special care is taken to handle the end of the text at any point inside
>+  // the loop.
>+  while (iter != end) {
>+    

Nit: nix the extra newline here.

>+    // Copy characters to textOut until but not including the first ampersand.
>+    for (; iter != end; ++iter) {
>+      PRUnichar ch = *iter;
>+      if (ch == kAmpersand) break;

Here and below: put the |break;| on its own line. Otherwise, it's too easy to miss it when reading the loop.

>+        }
>+      }      
>+    }
>+      

Nit: Nix this extra blank line.

>+  }
>+
>+}
>+
>+PRInt32 CViewSourceHTML::ExpandEntity(const nsAString::const_iterator& start, 
>+                            const nsAString::const_iterator& end) {

Same nit about indentation and brace here.

>+  // Decode the entity to a Unicode character.
>+  switch (entityType) {
>+  case TYPE_ID:
>+    return nsHTMLEntities::EntityToUnicode(entityBody);

Need to imitate ConsumeEntity here and only return entities < 256 or ending with ;.

r- on account of that. Looks pretty good otherwise, though.
Attachment #352879 - Flags: review?(mrbkap) → review-
Sounds like this is something that significantly limits the usability of the linkification we do in view source, which is one of the new features in 3.1 that's been talked about a fair bit, so we should probably fix this. The patch looks close to ready, and looks safe (due to the duplication of code rather than invasive changes to the parser). Blocking.
Flags: blocking1.9.1? → blocking1.9.1+
Priority: -- → P2
Target Milestone: --- → mozilla1.9.1
Note that I rewrote a substantial chunk of the code from the previous patch.  The new CopyPossibleEntity() function subsumes both the parsing and expansion of individual entities (or entity like things).  I think the code is cleaner now, and it allows for missing semicolons on eight-bit entities mirroring the parser code.

My point-by-point comments are below, surrounded by "***"s.

--------

(From update of attachment 352879 [details] [diff] [review])
I'm not happy with the duplication of logic here, but I agree that this is
probably better than trying to shoehorn the parser stuff into being usable from
anywhere.

>diff -r 055cb77c977c parser/htmlparser/src/nsViewSourceHTML.cpp
>@@ -995,18 +996,20 @@ PRBool CViewSourceHTML::IsUrlAttribute(c
>+    const nsAString& baseSpec = TrimTokenValue(attrValue);
>+    nsString expandedBaseSpec;

This wants to be an nsAutoString to avoid unnecessary mallocs.

*** done ***

>@@ -1071,18 +1074,21 @@ nsresult CViewSourceHTML::CreateViewSour
>+  nsString expandedLinkUrl;

Ditto.

*** done ***

>@@ -1200,8 +1206,154 @@ nsresult CViewSourceHTML::SetBaseURI(con
>+void CViewSourceHTML::ExpandEntities(const nsAString& textIn, nsString& textOut) {
>+  

Nits: Put the { on its own line for function declarations and nix the extra
empty line here.

*** done ***

>+  nsAString::const_iterator iter, end;
>+  textIn.BeginReading(iter);
>+  textIn.EndReading(end);
>+
>+  // The outer loop treats the input as a sequence of pairs of runs.  The first
>+  // run of each pair is just a run of regular characters.  The second run is
>+  // something that looks like it might be an entity reference, e.g. "&amp;".
>+  // Any regular run may be empty, and the entity run can be skipped at the end
>+  // of the text.  Apparent entities that can't be translated are copied
>+  // verbatim.  In particular this allows for raw ampersands in the input.
>+  // Special care is taken to handle the end of the text at any point inside
>+  // the loop.
>+  while (iter != end) {
>+    

Nit: nix the extra newline here.

*** done ***

>+    // Copy characters to textOut until but not including the first ampersand.
>+    for (; iter != end; ++iter) {
>+      PRUnichar ch = *iter;
>+      if (ch == kAmpersand) break;

Here and below: put the |break;| on its own line. Otherwise, it's too easy to
miss it when reading the loop.

*** done ***

>+        }
>+      }      
>+    }
>+      

Nit: Nix this extra blank line.

*** done *** 
>+  }
>+
>+}
>+
>+PRInt32 CViewSourceHTML::ExpandEntity(const nsAString::const_iterator& start, 
>+                            const nsAString::const_iterator& end) {

Same nit about indentation and brace here.

*** ExpandEntity() has been superceded by the new CopyPossiblEntity(). ***

>+  // Decode the entity to a Unicode character.
>+  switch (entityType) {
>+  case TYPE_ID:
>+    return nsHTMLEntities::EntityToUnicode(entityBody);

Need to imitate ConsumeEntity here and only return entities < 256 or ending
with ;.

*** Done ***

r- on account of that. Looks pretty good otherwise, though.
Attachment #352879 - Attachment is obsolete: true
Attachment #357198 - Flags: review?(mrbkap)
Attachment #357198 - Flags: superreview+
Attachment #357198 - Flags: review?(mrbkap)
Attachment #357198 - Flags: review+
Comment on attachment 357198 [details] [diff] [review]
View Source linkification now correctly decodes entities v2

My only comment is to go through and avoid |if (blah) return| all on one line.

Also, I like |// Empty| comments in empty loop bodies, but feel free to ignore that :-).
> (From update of attachment 357198 [details] [diff] [review])
> My only comment is to go through and avoid |if (blah) return| all on one line.

Done.

> Also, I like |// Empty| comments in empty loop bodies, but feel free to ignore
> that :-).

Done.

[Am I supposed to keep asking for review every time I upload a modified patch?]
Attachment #357198 - Attachment is obsolete: true
Attachment #357299 - Flags: review?(mrbkap)
Attachment #357299 - Flags: superreview+
Attachment #357299 - Flags: review?(mrbkap)
Attachment #357299 - Flags: review+
(In reply to comment #12)
> [Am I supposed to keep asking for review every time I upload a modified patch?]

Usually, if a reviewer stamps the patch, you don't have to request re-review.
Keywords: checkin-needed
http://hg.mozilla.org/mozilla-central/rev/6a1b8efcecc8
Status: NEW → RESOLVED
Closed: 15 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Whiteboard: baking
Target Milestone: mozilla1.9.1 → mozilla1.9.2a1
The patch applies cleanly to 1.9.1 as of a few minutes ago.
Keywords: checkin-needed
Whiteboard: baking → [needs 191 landing]
Keywords: checkin-needed
Whiteboard: [needs 191 landing]
Attachment #357299 - Flags: approval1.9.1?
Comment on attachment 357299 [details] [diff] [review]
View Source linkification now correctly decodes entities v3

This patch doesn't need approval, because it's fixing a blocker.
Keywords: checkin-needed
Whiteboard: [needs 191 landing]
You need to log in before you can comment on or make changes to this bug.