Fix parsing of Atom 1.0 titles for Live Bookmarks

RESOLVED FIXED

Status

()

Firefox
Bookmarks & History
RESOLVED FIXED
13 years ago
12 years ago

People

(Reporter: philor, Assigned: Robert Sayre)

Tracking

({fixed1.8.1})

Trunk
fixed1.8.1
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 1 obsolete attachment)

(Reporter)

Description

13 years ago
Atom 1.0 allows the content of titles to be plain-text (@type="text" or no type
attribute), escaped HTML (@type="html"), or an inline XHTML div (@type="xhtml"). 

For @type="text" we're fine: we pass <b>&copy</b> through as-is,
correctly.

For @type="html" we fail to convert escaped character entity references and
escaped numeric character references to characters (© © should
display as two copyright symbol characters), and we fail to strip markup ("this
is <b>bold</b>" should be displayed as "this is bold" rather than "this is
<b>bold</b>").

For @type="xhtml" we fail completely: FindTextInNode looks like it should find
the first descendant text node (when what's needed is actually concatenating all
the text nodes), but rather than that, finding "strong" in <title
type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><strong>strong</strong>
isn't better, it's different</div></title>, or the correct result, "strong isn't
better, it's different" we get nothing: the entry fails to load as if there was
no title.
(Reporter)

Comment 1

13 years ago
Created attachment 190502 [details]
Atom title testcase

A variety-pack of Atom titles.
(Assignee)

Updated

12 years ago
Assignee: nobody → sayrer
(Assignee)

Updated

12 years ago
Status: NEW → ASSIGNED
(Assignee)

Comment 2

12 years ago
Created attachment 205503 [details] [diff] [review]
First cut at handling escaped HTML and XML in Atom titles

Not sure if this is the optimal strategy, but it passes all of Phil's testcases.
Attachment #205503 - Flags: review?(vladimir)
Comment on attachment 205503 [details] [diff] [review]
First cut at handling escaped HTML and XML in Atom titles


Thanks for the patch!  Some comments..

>-
>+    NS_METHOD FindTextInChildNodes (nsIDOMNode *aNode, nsAString &aString);
>+    NS_METHOD ParseHTMLFragment(nsAString &aFragString, nsIDocument* aTargetDocument, nsCOMPtr<nsIDOMNode> *outNode);

nsIDOMNode **outNode; don't use nsCOMPtr<> for params

> nsresult
>+nsFeedLoadListener::ParseHTMLFragment(nsAString &aFragString,  
>+                                      nsIDocument* aTargetDocument,
>+                                      nsCOMPtr<nsIDOMNode> *outNode)

nsIDOMNode **outNode

>+{
>+    NS_ENSURE_ARG(aTargetDocument);
>+    nsresult rv;
>+    nsCOMPtr<nsIParser> parser = do_CreateInstance(kCParserCID, &rv);
>+    if (NS_FAILED(rv)) return rv;
>+
>+    // create the html fragment sink
>+    nsCOMPtr<nsIContentSink> sink;
>+    sink = do_CreateInstance(NS_HTMLFRAGMENTSINK2_CONTRACTID);
>+    NS_ENSURE_TRUE(sink, NS_ERROR_FAILURE);
>+    nsCOMPtr<nsIFragmentContentSink> fragSink(do_QueryInterface(sink));
>+    NS_ENSURE_TRUE(fragSink, NS_ERROR_FAILURE);

I'm not sure offhand if we really want to be creating a new instance each time; not sure if we can reuse the parser or not.  If we can, it would be better to do so.


>+    *outNode = do_QueryInterface(contextfrag);  

CallQueryInterface (contextfrag, outNode);

>+    return rv;
>+}
>+
>+// find all of the text nodes below aNode at return them as a string
>+nsresult
>+nsFeedLoadListener::FindTextInChildNodes (nsIDOMNode *aNode, nsAString &aString)
>+{
>+    NS_ENSURE_ARG(aNode);
>+
>+    nsresult rv;
>+
>+    nsCOMPtr<nsIDOMDocument> aDoc;
>+    aNode->GetOwnerDocument(getter_AddRefs(aDoc));
>+    nsCOMPtr<nsIDOMDocumentTraversal> trav = do_QueryInterface(aDoc, &rv);
>+    if (NS_FAILED(rv)) return rv;
>+    
>+    nsCOMPtr<nsIDOMTreeWalker> walker;
>+    rv = trav->CreateTreeWalker(aNode, 
>+                                nsIDOMNodeFilter::SHOW_TEXT | nsIDOMNodeFilter::SHOW_CDATA_SECTION,
>+                                nsnull, PR_TRUE, getter_AddRefs(walker));
>+    if (NS_FAILED(rv)) return rv;
>+
>+    nsCOMPtr<nsIDOMNode> currentNode;
>+    walker->GetCurrentNode(getter_AddRefs(currentNode));
>+    nsCOMPtr<nsIDOMCharacterData> charTextNode;
>+    nsAutoString tempString;
>+    while (currentNode)
>+    {

{ on same line as while, i.e. "while (currentNode) {"

>+        charTextNode = do_QueryInterface(currentNode);
>+        if(charTextNode)
>+        { 

same line { again, space after "if"  (and in a bunch of other places -- please try to follow the coding style of whatever file you're in :)

>+            charTextNode->GetData(tempString);
>+            aString.Append(tempString);
>+        }
>+        walker->NextNode(getter_AddRefs(currentNode));
>+    }
>+
>+    if(aString.IsEmpty()){

space after "if", before {

>+        return NS_ERROR_FAILURE;
>+    }else{

space around "else"

>+        return NS_OK;
>+    }
>+}
>+
>+nsresult
> nsFeedLoadListener::HandleRDFItem (nsIRDFDataSource *aDS, nsIRDFResource *aItem,
>                                    nsIRDFResource *aLinkResource,
>                                    nsIRDFResource *aTitleResource)
> {
>     nsresult rv;
> 
>     /* We care about this item's link and title */
>     nsCOMPtr<nsIRDFNode> linkNode;
>@@ -665,17 +749,53 @@ nsFeedLoadListener::TryParseAsSimpleRSS 
>                     rv = childNode->GetNodeType(&childNtype);
>                     if (NS_FAILED(rv)) return rv;
> 
>                     if (childNtype == nsIDOMNode::ELEMENT_NODE) {
>                         nsAutoString childNname;
>                         rv = childNode->GetNodeName (childNname);
> 
>                         if (childNname.Equals(NS_LITERAL_STRING("title"))) {
>-                            rv = FindTextInNode (childNode, titleStr);
>+                            if(isAtom){
>+                                /* Atom titles can contain HTML, so we need to find and remove it */
>+                                nsCOMPtr<nsIDOMElement> titleElem = do_QueryInterface(childNode);
>+                                if (!titleElem) break; // out of while(childNode) loop
>+                                
>+                                nsAutoString titleMode; // Atom 0.3 only
>+                                nsAutoString titleType;
>+                                titleElem->GetAttribute(NS_LITERAL_STRING("type"), titleType);
>+                                titleElem->GetAttribute(NS_LITERAL_STRING("mode"), titleMode);
>+                                
>+                                /* Cover Atom 0.3 and RFC 4287 together */
>+                                if(titleType.Equals(NS_LITERAL_STRING("text")) ||

space after "if"; also use EqualsLiteral("text") -- I know a bunch of other places don't use it, but it wasn't available on the 1.0 branch (but is available on the trunk).  We should start using it where possible now.

>+                                   titleType.Equals(NS_LITERAL_STRING("text/plain")) ||
EqualsLiteral
>+                                   titleType.IsEmpty())
>+                                {
>+                                    rv = FindTextInNode(childNode, titleStr);
>+                                }else if(titleType.Equals(NS_LITERAL_STRING("text/html")) ||

space around "else if"; EqualsLiteral

>+                                         titleType.Equals(NS_LITERAL_STRING("html")))
>+                                {
>+                                    nsAutoString escapedHTMLStr;
>+                                    rv = FindTextInNode(childNode, escapedHTMLStr);
>+                                    if (NS_FAILED(rv)) break;  // out of while(childNode) loop
>+
>+                                    nsCOMPtr<nsIDOMNode> newNode;
>+                                    nsCOMPtr<nsIDocument> doc = do_QueryInterface(xmldoc);
>+                                    ParseHTMLFragment(escapedHTMLStr, doc, address_of(newNode));
>+                                    rv = FindTextInChildNodes(newNode, titleStr);                                    
>+                                }else if(titleType.Equals(NS_LITERAL_STRING("xhtml")) ||

spaces, EqualsLiteral

>+                                         titleType.Equals(NS_LITERAL_STRING("application/xhtml")))
>+                                {
>+                                    rv = FindTextInChildNodes(childNode, titleStr);
>+                                }else{

spaces

>+                                    break; // out of while(childNode) loop
>+                                }
>+                            }else{

ditto

>+                                rv = FindTextInNode(childNode, titleStr);
>+                            }
>                             if (NS_FAILED(rv)) break;
>                         } else if (childNname.Equals(NS_LITERAL_STRING("pubDate")) ||
>                                    childNname.Equals(NS_LITERAL_STRING("updated")))
>                         {
>                             rv = FindTextInNode (childNode, dateStr);
>                             if (NS_FAILED(rv)) break;
>                         } else if (!isAtom && childNname.Equals(NS_LITERAL_STRING("guid"))) {
>                             nsCOMPtr<nsIDOMElement> linkElem = do_QueryInterface(childNode);
(Assignee)

Comment 4

12 years ago
FindTextInNode seems to have problems wrt all FeedFormats. Here are some tests:
http://franklinmint.fm/2005/12/18/pathological-rss2.xml

All valid:
http://feedvalidator.org/check.cgi?url=http%3A%2F%2Ffranklinmint.fm%2F2005%2F12%2F18%2Fpathological-rss2.xml
(Assignee)

Updated

12 years ago
Attachment #205503 - Flags: review?(vladimir)
(Assignee)

Comment 5

12 years ago
Created attachment 206272 [details] [diff] [review]
fix formatting, better text handling

This patch removes FindTextInNode in favor of FindTextInChildNodes and cleans up whitespace in resulting values.

Tests nasty but valid RSS2 variants: 
http://franklinmint.fm/2005/12/18/pathological-rss2.xml

Tests real-world Atom 0.3 escaped HTML titles:
http://compulsorybloggerblog.blogspot.com/atom.xml

Phil's character and stripTags tests:
http://franklinmint.fm/2005/12/09/Atom-title-conversion-testcase.xml
Attachment #206272 - Flags: review?(vladimir)
(Assignee)

Updated

12 years ago
Attachment #205503 - Attachment is obsolete: true
(Assignee)

Comment 6

12 years ago
(In reply to comment #3)
>
> >+    nsCOMPtr<nsIParser> parser = do_CreateInstance(kCParserCID, &rv);
> >+    if (NS_FAILED(rv)) return rv;
> 
> I'm not sure offhand if we really want to be creating a new instance each time;
> not sure if we can reuse the parser or not.  If we can, it would be better to
> do so.

I tried reusing it, and got seg faults. I also asked if there was a way to reuse it on IRC and was told "probably not".
(Reporter)

Updated

12 years ago
Blocks: 311065
Comment on attachment 206272 [details] [diff] [review]
fix formatting, better text handling

Looks fine to me.
r=vladimir
Attachment #206272 - Flags: review?(vladimir) → review+
(Assignee)

Updated

12 years ago
Whiteboard: [checkin needed]
(Reporter)

Comment 8

12 years ago
*** Bug 321606 has been marked as a duplicate of this bug. ***
(Reporter)

Comment 9

12 years ago
*** Bug 322951 has been marked as a duplicate of this bug. ***
(Reporter)

Comment 10

12 years ago
*** Bug 253814 has been marked as a duplicate of this bug. ***

Comment 11

12 years ago
i'm not sure that the desired behavior is ever to pass escaped content through and display it to the user. if someone publishes an rss feed with a title that has escaped html entities in it, they almost certainly want it unescaped. i think firefox should make this assumption. it makes the code much simpler.

so, i think the easy fix here is to make this assumption and always pass the bookmark's title through an unescape function. this is similar to what nsBookmarksService does when it loads the bookmarks file off of the disk.

it appears that someone hacked in an unescape function in nsBookmarksService.cpp (not sure why nsEscape.h isn't used here). i was able to make this function visible to the nsBookmarksFeedLoader and use it to preprocess every live bookmark title. this totally fixed the problem.

so i would not advise employing a complicated solution (the one below) to fix this simple but high visibility bug. just unescape all of the live bookmark titles, and you're done. it takes about 5 lines of code to get this working.

Comment 12

12 years ago
I should clarify that I'm speaking specifically to the failure to unescape html in the titles. This is by far the most visible aspect of this bug. Fixing the other parts of this bug (parsing bold, italics, etc) is arguably less important, since those tags are much more rare in the live bookmark titles.
(Assignee)

Comment 13

12 years ago
I don't think we're going to fix this for 1.5, and FF2 will have a new parser. Not sure why we're talking about RSS2, but I'll gleefully note that the meaning of escaped markup in titles is undefined.
Status: ASSIGNED → RESOLVED
Last Resolved: 12 years ago
Resolution: --- → WONTFIX
(Reporter)

Comment 14

12 years ago
*** Bug 311065 has been marked as a duplicate of this bug. ***
No longer blocks: 311065
Whiteboard: [checkin needed]
(Reporter)

Comment 15

12 years ago
Since it's looking like we'll be using the old parser after all for 2.0, can we please have this?
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
(Assignee)

Comment 16

12 years ago
(In reply to comment #15)
> Since it's looking like we'll be using the old parser after all for 2.0, can we
> please have this?

Looks to me like this landed on trunk already:
http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&file=nsBookmarksFeedHandler.cpp&branch=&root=/cvsroot&subdir=mozilla/browser/components/bookmarks/src&command=DIFF_FRAMESET&rev1=1.10&rev2=1.11
(Reporter)

Comment 17

12 years ago
Heh. So it did, though the checkin comment makes it look accidental. Okay, baked on the trunk for a month or so between accidental checkin and Places being on by default with no regressions reported, plus bug 321036 got it into the Places fork where it's also apparently working fine, so all we need is a1.8.1.
Status: REOPENED → RESOLVED
Last Resolved: 12 years ago12 years ago
Resolution: --- → FIXED
(Reporter)

Comment 18

12 years ago
Comment on attachment 206272 [details] [diff] [review]
fix formatting, better text handling

(Well, approval and for the branch to reopen after a2, that is...)
Attachment #206272 - Flags: approval-branch-1.8.1?(mconnor)

Updated

12 years ago
Attachment #206272 - Flags: approval-branch-1.8.1?(mconnor) → approval-branch-1.8.1+
(Assignee)

Comment 19

12 years ago
checked in on branch.

Checking in Makefile.in;
/cvsroot/mozilla/browser/components/bookmarks/src/Makefile.in,v  <--  Makefile.in
new revision: 1.16.8.1; previous revision: 1.16
done
Checking in nsBookmarksFeedHandler.cpp;
/cvsroot/mozilla/browser/components/bookmarks/src/nsBookmarksFeedHandler.cpp,v  <--  nsBookmarksFeedHandler.cpp
new revision: 1.8.2.2; previous revision: 1.8.2.1
done
Keywords: fixed1.8.1
(Reporter)

Comment 20

12 years ago
*** Bug 347316 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.