302133 - Fix parsing of Atom 1.0 titles for Live Bookmarks

Reporter

Description

•

19 years ago

Atom 1.0 allows the content of titles to be plain-text (@type="text" or no type
attribute), escaped HTML (@type="html"), or an inline XHTML div (@type="xhtml"). 

For @type="text" we're fine: we pass &lt;b>&amp;copy&lt;/b> through as-is,
correctly.

For @type="html" we fail to convert escaped character entity references and
escaped numeric character references to characters (&amp;copy; &amp;#xA9; should
display as two copyright symbol characters), and we fail to strip markup ("this
is &lt;b>bold&lt;/b>" should be displayed as "this is bold" rather than "this is
<b>bold</b>").

For @type="xhtml" we fail completely: FindTextInNode looks like it should find
the first descendant text node (when what's needed is actually concatenating all
the text nodes), but rather than that, finding "strong" in <title
type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><strong>strong</strong>
isn't better, it's different</div></title>, or the correct result, "strong isn't
better, it's different" we get nothing: the entry fails to load as if there was
no title.

Phil Ringnalda (:philor)

Reporter

Comment 1

•

19 years ago

Attached file Atom title testcase — Details

A variety-pack of Atom titles.

Robert Sayre

Assignee

Updated

•

19 years ago

Assignee: nobody → sayrer

Robert Sayre

Assignee

Updated

•

19 years ago

Status: NEW → ASSIGNED

Robert Sayre

Assignee

Comment 2

•

19 years ago

Attached patch First cut at handling escaped HTML and XML in Atom titles (obsolete) — Details — Splinter Review

Not sure if this is the optimal strategy, but it passes all of Phil's testcases.

Attachment #205503 - Flags: review?(vladimir)

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Comment 3

•

19 years ago

Comment on attachment 205503 [details] [diff] [review]
First cut at handling escaped HTML and XML in Atom titles


Thanks for the patch!  Some comments..

>-
>+    NS_METHOD FindTextInChildNodes (nsIDOMNode *aNode, nsAString &aString);
>+    NS_METHOD ParseHTMLFragment(nsAString &aFragString, nsIDocument* aTargetDocument, nsCOMPtr<nsIDOMNode> *outNode);

nsIDOMNode **outNode; don't use nsCOMPtr<> for params

> nsresult
>+nsFeedLoadListener::ParseHTMLFragment(nsAString &aFragString,  
>+                                      nsIDocument* aTargetDocument,
>+                                      nsCOMPtr<nsIDOMNode> *outNode)

nsIDOMNode **outNode

>+{
>+    NS_ENSURE_ARG(aTargetDocument);
>+    nsresult rv;
>+    nsCOMPtr<nsIParser> parser = do_CreateInstance(kCParserCID, &rv);
>+    if (NS_FAILED(rv)) return rv;
>+
>+    // create the html fragment sink
>+    nsCOMPtr<nsIContentSink> sink;
>+    sink = do_CreateInstance(NS_HTMLFRAGMENTSINK2_CONTRACTID);
>+    NS_ENSURE_TRUE(sink, NS_ERROR_FAILURE);
>+    nsCOMPtr<nsIFragmentContentSink> fragSink(do_QueryInterface(sink));
>+    NS_ENSURE_TRUE(fragSink, NS_ERROR_FAILURE);

I'm not sure offhand if we really want to be creating a new instance each time; not sure if we can reuse the parser or not.  If we can, it would be better to do so.


>+    *outNode = do_QueryInterface(contextfrag);  

CallQueryInterface (contextfrag, outNode);

>+    return rv;
>+}
>+
>+// find all of the text nodes below aNode at return them as a string
>+nsresult
>+nsFeedLoadListener::FindTextInChildNodes (nsIDOMNode *aNode, nsAString &aString)
>+{
>+    NS_ENSURE_ARG(aNode);
>+
>+    nsresult rv;
>+
>+    nsCOMPtr<nsIDOMDocument> aDoc;
>+    aNode->GetOwnerDocument(getter_AddRefs(aDoc));
>+    nsCOMPtr<nsIDOMDocumentTraversal> trav = do_QueryInterface(aDoc, &rv);
>+    if (NS_FAILED(rv)) return rv;
>+    
>+    nsCOMPtr<nsIDOMTreeWalker> walker;
>+    rv = trav->CreateTreeWalker(aNode, 
>+                                nsIDOMNodeFilter::SHOW_TEXT | nsIDOMNodeFilter::SHOW_CDATA_SECTION,
>+                                nsnull, PR_TRUE, getter_AddRefs(walker));
>+    if (NS_FAILED(rv)) return rv;
>+
>+    nsCOMPtr<nsIDOMNode> currentNode;
>+    walker->GetCurrentNode(getter_AddRefs(currentNode));
>+    nsCOMPtr<nsIDOMCharacterData> charTextNode;
>+    nsAutoString tempString;
>+    while (currentNode)
>+    {

{ on same line as while, i.e. "while (currentNode) {"

>+        charTextNode = do_QueryInterface(currentNode);
>+        if(charTextNode)
>+        { 

same line { again, space after "if"  (and in a bunch of other places -- please try to follow the coding style of whatever file you're in :)

>+            charTextNode->GetData(tempString);
>+            aString.Append(tempString);
>+        }
>+        walker->NextNode(getter_AddRefs(currentNode));
>+    }
>+
>+    if(aString.IsEmpty()){

space after "if", before {

>+        return NS_ERROR_FAILURE;
>+    }else{

space around "else"

>+        return NS_OK;
>+    }
>+}
>+
>+nsresult
> nsFeedLoadListener::HandleRDFItem (nsIRDFDataSource *aDS, nsIRDFResource *aItem,
>                                    nsIRDFResource *aLinkResource,
>                                    nsIRDFResource *aTitleResource)
> {
>     nsresult rv;
> 
>     /* We care about this item's link and title */
>     nsCOMPtr<nsIRDFNode> linkNode;
>@@ -665,17 +749,53 @@ nsFeedLoadListener::TryParseAsSimpleRSS 
>                     rv = childNode->GetNodeType(&childNtype);
>                     if (NS_FAILED(rv)) return rv;
> 
>                     if (childNtype == nsIDOMNode::ELEMENT_NODE) {
>                         nsAutoString childNname;
>                         rv = childNode->GetNodeName (childNname);
> 
>                         if (childNname.Equals(NS_LITERAL_STRING("title"))) {
>-                            rv = FindTextInNode (childNode, titleStr);
>+                            if(isAtom){
>+                                /* Atom titles can contain HTML, so we need to find and remove it */
>+                                nsCOMPtr<nsIDOMElement> titleElem = do_QueryInterface(childNode);
>+                                if (!titleElem) break; // out of while(childNode) loop
>+                                
>+                                nsAutoString titleMode; // Atom 0.3 only
>+                                nsAutoString titleType;
>+                                titleElem->GetAttribute(NS_LITERAL_STRING("type"), titleType);
>+                                titleElem->GetAttribute(NS_LITERAL_STRING("mode"), titleMode);
>+                                
>+                                /* Cover Atom 0.3 and RFC 4287 together */
>+                                if(titleType.Equals(NS_LITERAL_STRING("text")) ||

space after "if"; also use EqualsLiteral("text") -- I know a bunch of other places don't use it, but it wasn't available on the 1.0 branch (but is available on the trunk).  We should start using it where possible now.

>+                                   titleType.Equals(NS_LITERAL_STRING("text/plain")) ||
EqualsLiteral
>+                                   titleType.IsEmpty())
>+                                {
>+                                    rv = FindTextInNode(childNode, titleStr);
>+                                }else if(titleType.Equals(NS_LITERAL_STRING("text/html")) ||

space around "else if"; EqualsLiteral

>+                                         titleType.Equals(NS_LITERAL_STRING("html")))
>+                                {
>+                                    nsAutoString escapedHTMLStr;
>+                                    rv = FindTextInNode(childNode, escapedHTMLStr);
>+                                    if (NS_FAILED(rv)) break;  // out of while(childNode) loop
>+
>+                                    nsCOMPtr<nsIDOMNode> newNode;
>+                                    nsCOMPtr<nsIDocument> doc = do_QueryInterface(xmldoc);
>+                                    ParseHTMLFragment(escapedHTMLStr, doc, address_of(newNode));
>+                                    rv = FindTextInChildNodes(newNode, titleStr);                                    
>+                                }else if(titleType.Equals(NS_LITERAL_STRING("xhtml")) ||

spaces, EqualsLiteral

>+                                         titleType.Equals(NS_LITERAL_STRING("application/xhtml")))
>+                                {
>+                                    rv = FindTextInChildNodes(childNode, titleStr);
>+                                }else{

spaces

>+                                    break; // out of while(childNode) loop
>+                                }
>+                            }else{

ditto

>+                                rv = FindTextInNode(childNode, titleStr);
>+                            }
>                             if (NS_FAILED(rv)) break;
>                         } else if (childNname.Equals(NS_LITERAL_STRING("pubDate")) ||
>                                    childNname.Equals(NS_LITERAL_STRING("updated")))
>                         {
>                             rv = FindTextInNode (childNode, dateStr);
>                             if (NS_FAILED(rv)) break;
>                         } else if (!isAtom && childNname.Equals(NS_LITERAL_STRING("guid"))) {
>                             nsCOMPtr<nsIDOMElement> linkElem = do_QueryInterface(childNode);

Robert Sayre

Assignee

Comment 4

•

19 years ago

FindTextInNode seems to have problems wrt all FeedFormats. Here are some tests:
http://franklinmint.fm/2005/12/18/pathological-rss2.xml

All valid:
http://feedvalidator.org/check.cgi?url=http%3A%2F%2Ffranklinmint.fm%2F2005%2F12%2F18%2Fpathological-rss2.xml

Robert Sayre

Assignee

Updated

•

19 years ago

Attachment #205503 - Flags: review?(vladimir)

Robert Sayre

Assignee

Comment 5

•

19 years ago

Attached patch fix formatting, better text handling — Details — Splinter Review

This patch removes FindTextInNode in favor of FindTextInChildNodes and cleans up whitespace in resulting values.

Tests nasty but valid RSS2 variants: 
http://franklinmint.fm/2005/12/18/pathological-rss2.xml

Tests real-world Atom 0.3 escaped HTML titles:
http://compulsorybloggerblog.blogspot.com/atom.xml

Phil's character and stripTags tests:
http://franklinmint.fm/2005/12/09/Atom-title-conversion-testcase.xml

Attachment #206272 - Flags: review?(vladimir)

Robert Sayre

Assignee

Updated

•

19 years ago

Attachment #205503 - Attachment is obsolete: true

Robert Sayre

Assignee

Comment 6

•

19 years ago

(In reply to comment #3)
>
> >+    nsCOMPtr<nsIParser> parser = do_CreateInstance(kCParserCID, &rv);
> >+    if (NS_FAILED(rv)) return rv;
> 
> I'm not sure offhand if we really want to be creating a new instance each time;
> not sure if we can reuse the parser or not.  If we can, it would be better to
> do so.

I tried reusing it, and got seg faults. I also asked if there was a way to reuse it on IRC and was told "probably not".

Phil Ringnalda (:philor)

Reporter

Updated

•

19 years ago

Blocks: 311065

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Comment 7

•

19 years ago

Comment on attachment 206272 [details] [diff] [review]
fix formatting, better text handling

Looks fine to me.
r=vladimir

Attachment #206272 - Flags: review?(vladimir) → review+

Robert Sayre

Assignee

Updated

•

19 years ago

Whiteboard: [checkin needed]

Phil Ringnalda (:philor)

Reporter

Comment 8

•

19 years ago

*** Bug 321606 has been marked as a duplicate of this bug. ***

Phil Ringnalda (:philor)

Reporter

Comment 9

•

19 years ago

*** Bug 322951 has been marked as a duplicate of this bug. ***

Phil Ringnalda (:philor)

Reporter

Comment 10

•

19 years ago

*** Bug 253814 has been marked as a duplicate of this bug. ***

Bryan McQuade

Comment 11

•

19 years ago

i'm not sure that the desired behavior is ever to pass escaped content through and display it to the user. if someone publishes an rss feed with a title that has escaped html entities in it, they almost certainly want it unescaped. i think firefox should make this assumption. it makes the code much simpler.

so, i think the easy fix here is to make this assumption and always pass the bookmark's title through an unescape function. this is similar to what nsBookmarksService does when it loads the bookmarks file off of the disk.

it appears that someone hacked in an unescape function in nsBookmarksService.cpp (not sure why nsEscape.h isn't used here). i was able to make this function visible to the nsBookmarksFeedLoader and use it to preprocess every live bookmark title. this totally fixed the problem.

so i would not advise employing a complicated solution (the one below) to fix this simple but high visibility bug. just unescape all of the live bookmark titles, and you're done. it takes about 5 lines of code to get this working.

Bryan McQuade

Comment 12

•

19 years ago

I should clarify that I'm speaking specifically to the failure to unescape html in the titles. This is by far the most visible aspect of this bug. Fixing the other parts of this bug (parsing bold, italics, etc) is arguably less important, since those tags are much more rare in the live bookmark titles.

Robert Sayre

Assignee

Comment 13

•

19 years ago

I don't think we're going to fix this for 1.5, and FF2 will have a new parser. Not sure why we're talking about RSS2, but I'll gleefully note that the meaning of escaped markup in titles is undefined.

Status: ASSIGNED → RESOLVED

Closed: 19 years ago

Resolution: --- → WONTFIX

Phil Ringnalda (:philor)

Reporter

Comment 14

•

19 years ago

*** Bug 311065 has been marked as a duplicate of this bug. ***

No longer blocks: 311065

:Gavin Sharp [email: gavin@gavinsharp.com]

Updated

•

18 years ago

Whiteboard: [checkin needed]

Phil Ringnalda (:philor)

Reporter

Comment 15

•

18 years ago

Since it's looking like we'll be using the old parser after all for 2.0, can we please have this?

Status: RESOLVED → REOPENED

Resolution: WONTFIX → ---

Robert Sayre

Assignee

Comment 16

•

18 years ago

(In reply to comment #15)
> Since it's looking like we'll be using the old parser after all for 2.0, can we
> please have this?

Looks to me like this landed on trunk already:
http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&file=nsBookmarksFeedHandler.cpp&branch=&root=/cvsroot&subdir=mozilla/browser/components/bookmarks/src&command=DIFF_FRAMESET&rev1=1.10&rev2=1.11

Phil Ringnalda (:philor)

Reporter

Comment 17

•

18 years ago

Heh. So it did, though the checkin comment makes it look accidental. Okay, baked on the trunk for a month or so between accidental checkin and Places being on by default with no regressions reported, plus bug 321036 got it into the Places fork where it's also apparently working fine, so all we need is a1.8.1.

Status: REOPENED → RESOLVED

Closed: 19 years ago → 18 years ago

Resolution: --- → FIXED

Phil Ringnalda (:philor)

Reporter

Comment 18

•

18 years ago

Comment on attachment 206272 [details] [diff] [review]
fix formatting, better text handling

(Well, approval and for the branch to reopen after a2, that is...)

Attachment #206272 - Flags: approval-branch-1.8.1?(mconnor)

Mike Connor [:mconnor]

Updated

•

18 years ago

Attachment #206272 - Flags: approval-branch-1.8.1?(mconnor) → approval-branch-1.8.1+

Robert Sayre

Assignee

Comment 19

•

18 years ago

checked in on branch.

Checking in Makefile.in;
/cvsroot/mozilla/browser/components/bookmarks/src/Makefile.in,v  <--  Makefile.in
new revision: 1.16.8.1; previous revision: 1.16
done
Checking in nsBookmarksFeedHandler.cpp;
/cvsroot/mozilla/browser/components/bookmarks/src/nsBookmarksFeedHandler.cpp,v  <--  nsBookmarksFeedHandler.cpp
new revision: 1.8.2.2; previous revision: 1.8.2.1
done

Keywords: fixed1.8.1

Phil Ringnalda (:philor)

Reporter

Comment 20

•

18 years ago

*** Bug 347316 has been marked as a duplicate of this bug. ***

Atom title testcase 19 years ago Phil Ringnalda (:philor) 1.50 KB, application/atom+xml		Details
First cut at handling escaped HTML and XML in Atom titles 19 years ago Robert Sayre 9.37 KB, patch		Details \| Diff \| Splinter Review
fix formatting, better text handling 19 years ago Robert Sayre 13.88 KB, patch	vlad : review+ mconnor : approval-branch-1.8.1+	Details \| Diff \| Splinter Review