Closed Bug 302133 Opened 16 years ago Closed 15 years ago

Fix parsing of Atom 1.0 titles for Live Bookmarks

Categories

(Firefox :: Bookmarks & History, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: philor, Assigned: sayrer)

References

Details

(Keywords: fixed1.8.1)

Attachments

(2 files, 1 obsolete file)

Atom 1.0 allows the content of titles to be plain-text (@type="text" or no type
attribute), escaped HTML (@type="html"), or an inline XHTML div (@type="xhtml"). 

For @type="text" we're fine: we pass <b>&copy</b> through as-is,
correctly.

For @type="html" we fail to convert escaped character entity references and
escaped numeric character references to characters (© © should
display as two copyright symbol characters), and we fail to strip markup ("this
is <b>bold</b>" should be displayed as "this is bold" rather than "this is
<b>bold</b>").

For @type="xhtml" we fail completely: FindTextInNode looks like it should find
the first descendant text node (when what's needed is actually concatenating all
the text nodes), but rather than that, finding "strong" in <title
type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><strong>strong</strong>
isn't better, it's different</div></title>, or the correct result, "strong isn't
better, it's different" we get nothing: the entry fails to load as if there was
no title.
Attached file Atom title testcase
A variety-pack of Atom titles.
Assignee: nobody → sayrer
Status: NEW → ASSIGNED
Not sure if this is the optimal strategy, but it passes all of Phil's testcases.
Attachment #205503 - Flags: review?(vladimir)
Comment on attachment 205503 [details] [diff] [review]
First cut at handling escaped HTML and XML in Atom titles


Thanks for the patch!  Some comments..

>-
>+    NS_METHOD FindTextInChildNodes (nsIDOMNode *aNode, nsAString &aString);
>+    NS_METHOD ParseHTMLFragment(nsAString &aFragString, nsIDocument* aTargetDocument, nsCOMPtr<nsIDOMNode> *outNode);

nsIDOMNode **outNode; don't use nsCOMPtr<> for params

> nsresult
>+nsFeedLoadListener::ParseHTMLFragment(nsAString &aFragString,  
>+                                      nsIDocument* aTargetDocument,
>+                                      nsCOMPtr<nsIDOMNode> *outNode)

nsIDOMNode **outNode

>+{
>+    NS_ENSURE_ARG(aTargetDocument);
>+    nsresult rv;
>+    nsCOMPtr<nsIParser> parser = do_CreateInstance(kCParserCID, &rv);
>+    if (NS_FAILED(rv)) return rv;
>+
>+    // create the html fragment sink
>+    nsCOMPtr<nsIContentSink> sink;
>+    sink = do_CreateInstance(NS_HTMLFRAGMENTSINK2_CONTRACTID);
>+    NS_ENSURE_TRUE(sink, NS_ERROR_FAILURE);
>+    nsCOMPtr<nsIFragmentContentSink> fragSink(do_QueryInterface(sink));
>+    NS_ENSURE_TRUE(fragSink, NS_ERROR_FAILURE);

I'm not sure offhand if we really want to be creating a new instance each time; not sure if we can reuse the parser or not.  If we can, it would be better to do so.


>+    *outNode = do_QueryInterface(contextfrag);  

CallQueryInterface (contextfrag, outNode);

>+    return rv;
>+}
>+
>+// find all of the text nodes below aNode at return them as a string
>+nsresult
>+nsFeedLoadListener::FindTextInChildNodes (nsIDOMNode *aNode, nsAString &aString)
>+{
>+    NS_ENSURE_ARG(aNode);
>+
>+    nsresult rv;
>+
>+    nsCOMPtr<nsIDOMDocument> aDoc;
>+    aNode->GetOwnerDocument(getter_AddRefs(aDoc));
>+    nsCOMPtr<nsIDOMDocumentTraversal> trav = do_QueryInterface(aDoc, &rv);
>+    if (NS_FAILED(rv)) return rv;
>+    
>+    nsCOMPtr<nsIDOMTreeWalker> walker;
>+    rv = trav->CreateTreeWalker(aNode, 
>+                                nsIDOMNodeFilter::SHOW_TEXT | nsIDOMNodeFilter::SHOW_CDATA_SECTION,
>+                                nsnull, PR_TRUE, getter_AddRefs(walker));
>+    if (NS_FAILED(rv)) return rv;
>+
>+    nsCOMPtr<nsIDOMNode> currentNode;
>+    walker->GetCurrentNode(getter_AddRefs(currentNode));
>+    nsCOMPtr<nsIDOMCharacterData> charTextNode;
>+    nsAutoString tempString;
>+    while (currentNode)
>+    {

{ on same line as while, i.e. "while (currentNode) {"

>+        charTextNode = do_QueryInterface(currentNode);
>+        if(charTextNode)
>+        { 

same line { again, space after "if"  (and in a bunch of other places -- please try to follow the coding style of whatever file you're in :)

>+            charTextNode->GetData(tempString);
>+            aString.Append(tempString);
>+        }
>+        walker->NextNode(getter_AddRefs(currentNode));
>+    }
>+
>+    if(aString.IsEmpty()){

space after "if", before {

>+        return NS_ERROR_FAILURE;
>+    }else{

space around "else"

>+        return NS_OK;
>+    }
>+}
>+
>+nsresult
> nsFeedLoadListener::HandleRDFItem (nsIRDFDataSource *aDS, nsIRDFResource *aItem,
>                                    nsIRDFResource *aLinkResource,
>                                    nsIRDFResource *aTitleResource)
> {
>     nsresult rv;
> 
>     /* We care about this item's link and title */
>     nsCOMPtr<nsIRDFNode> linkNode;
>@@ -665,17 +749,53 @@ nsFeedLoadListener::TryParseAsSimpleRSS 
>                     rv = childNode->GetNodeType(&childNtype);
>                     if (NS_FAILED(rv)) return rv;
> 
>                     if (childNtype == nsIDOMNode::ELEMENT_NODE) {
>                         nsAutoString childNname;
>                         rv = childNode->GetNodeName (childNname);
> 
>                         if (childNname.Equals(NS_LITERAL_STRING("title"))) {
>-                            rv = FindTextInNode (childNode, titleStr);
>+                            if(isAtom){
>+                                /* Atom titles can contain HTML, so we need to find and remove it */
>+                                nsCOMPtr<nsIDOMElement> titleElem = do_QueryInterface(childNode);
>+                                if (!titleElem) break; // out of while(childNode) loop
>+                                
>+                                nsAutoString titleMode; // Atom 0.3 only
>+                                nsAutoString titleType;
>+                                titleElem->GetAttribute(NS_LITERAL_STRING("type"), titleType);
>+                                titleElem->GetAttribute(NS_LITERAL_STRING("mode"), titleMode);
>+                                
>+                                /* Cover Atom 0.3 and RFC 4287 together */
>+                                if(titleType.Equals(NS_LITERAL_STRING("text")) ||

space after "if"; also use EqualsLiteral("text") -- I know a bunch of other places don't use it, but it wasn't available on the 1.0 branch (but is available on the trunk).  We should start using it where possible now.

>+                                   titleType.Equals(NS_LITERAL_STRING("text/plain")) ||
EqualsLiteral
>+                                   titleType.IsEmpty())
>+                                {
>+                                    rv = FindTextInNode(childNode, titleStr);
>+                                }else if(titleType.Equals(NS_LITERAL_STRING("text/html")) ||

space around "else if"; EqualsLiteral

>+                                         titleType.Equals(NS_LITERAL_STRING("html")))
>+                                {
>+                                    nsAutoString escapedHTMLStr;
>+                                    rv = FindTextInNode(childNode, escapedHTMLStr);
>+                                    if (NS_FAILED(rv)) break;  // out of while(childNode) loop
>+
>+                                    nsCOMPtr<nsIDOMNode> newNode;
>+                                    nsCOMPtr<nsIDocument> doc = do_QueryInterface(xmldoc);
>+                                    ParseHTMLFragment(escapedHTMLStr, doc, address_of(newNode));
>+                                    rv = FindTextInChildNodes(newNode, titleStr);                                    
>+                                }else if(titleType.Equals(NS_LITERAL_STRING("xhtml")) ||

spaces, EqualsLiteral

>+                                         titleType.Equals(NS_LITERAL_STRING("application/xhtml")))
>+                                {
>+                                    rv = FindTextInChildNodes(childNode, titleStr);
>+                                }else{

spaces

>+                                    break; // out of while(childNode) loop
>+                                }
>+                            }else{

ditto

>+                                rv = FindTextInNode(childNode, titleStr);
>+                            }
>                             if (NS_FAILED(rv)) break;
>                         } else if (childNname.Equals(NS_LITERAL_STRING("pubDate")) ||
>                                    childNname.Equals(NS_LITERAL_STRING("updated")))
>                         {
>                             rv = FindTextInNode (childNode, dateStr);
>                             if (NS_FAILED(rv)) break;
>                         } else if (!isAtom && childNname.Equals(NS_LITERAL_STRING("guid"))) {
>                             nsCOMPtr<nsIDOMElement> linkElem = do_QueryInterface(childNode);
Attachment #205503 - Flags: review?(vladimir)
This patch removes FindTextInNode in favor of FindTextInChildNodes and cleans up whitespace in resulting values.

Tests nasty but valid RSS2 variants: 
http://franklinmint.fm/2005/12/18/pathological-rss2.xml

Tests real-world Atom 0.3 escaped HTML titles:
http://compulsorybloggerblog.blogspot.com/atom.xml

Phil's character and stripTags tests:
http://franklinmint.fm/2005/12/09/Atom-title-conversion-testcase.xml
Attachment #206272 - Flags: review?(vladimir)
Attachment #205503 - Attachment is obsolete: true
(In reply to comment #3)
>
> >+    nsCOMPtr<nsIParser> parser = do_CreateInstance(kCParserCID, &rv);
> >+    if (NS_FAILED(rv)) return rv;
> 
> I'm not sure offhand if we really want to be creating a new instance each time;
> not sure if we can reuse the parser or not.  If we can, it would be better to
> do so.

I tried reusing it, and got seg faults. I also asked if there was a way to reuse it on IRC and was told "probably not".
Blocks: 311065
Comment on attachment 206272 [details] [diff] [review]
fix formatting, better text handling

Looks fine to me.
r=vladimir
Attachment #206272 - Flags: review?(vladimir) → review+
Whiteboard: [checkin needed]
*** Bug 321606 has been marked as a duplicate of this bug. ***
*** Bug 322951 has been marked as a duplicate of this bug. ***
*** Bug 253814 has been marked as a duplicate of this bug. ***
i'm not sure that the desired behavior is ever to pass escaped content through and display it to the user. if someone publishes an rss feed with a title that has escaped html entities in it, they almost certainly want it unescaped. i think firefox should make this assumption. it makes the code much simpler.

so, i think the easy fix here is to make this assumption and always pass the bookmark's title through an unescape function. this is similar to what nsBookmarksService does when it loads the bookmarks file off of the disk.

it appears that someone hacked in an unescape function in nsBookmarksService.cpp (not sure why nsEscape.h isn't used here). i was able to make this function visible to the nsBookmarksFeedLoader and use it to preprocess every live bookmark title. this totally fixed the problem.

so i would not advise employing a complicated solution (the one below) to fix this simple but high visibility bug. just unescape all of the live bookmark titles, and you're done. it takes about 5 lines of code to get this working.
I should clarify that I'm speaking specifically to the failure to unescape html in the titles. This is by far the most visible aspect of this bug. Fixing the other parts of this bug (parsing bold, italics, etc) is arguably less important, since those tags are much more rare in the live bookmark titles.
I don't think we're going to fix this for 1.5, and FF2 will have a new parser. Not sure why we're talking about RSS2, but I'll gleefully note that the meaning of escaped markup in titles is undefined.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → WONTFIX
*** Bug 311065 has been marked as a duplicate of this bug. ***
No longer blocks: 311065
Whiteboard: [checkin needed]
Since it's looking like we'll be using the old parser after all for 2.0, can we please have this?
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
(In reply to comment #15)
> Since it's looking like we'll be using the old parser after all for 2.0, can we
> please have this?

Looks to me like this landed on trunk already:
http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&file=nsBookmarksFeedHandler.cpp&branch=&root=/cvsroot&subdir=mozilla/browser/components/bookmarks/src&command=DIFF_FRAMESET&rev1=1.10&rev2=1.11
Heh. So it did, though the checkin comment makes it look accidental. Okay, baked on the trunk for a month or so between accidental checkin and Places being on by default with no regressions reported, plus bug 321036 got it into the Places fork where it's also apparently working fine, so all we need is a1.8.1.
Status: REOPENED → RESOLVED
Closed: 15 years ago15 years ago
Resolution: --- → FIXED
Comment on attachment 206272 [details] [diff] [review]
fix formatting, better text handling

(Well, approval and for the branch to reopen after a2, that is...)
Attachment #206272 - Flags: approval-branch-1.8.1?(mconnor)
Attachment #206272 - Flags: approval-branch-1.8.1?(mconnor) → approval-branch-1.8.1+
checked in on branch.

Checking in Makefile.in;
/cvsroot/mozilla/browser/components/bookmarks/src/Makefile.in,v  <--  Makefile.in
new revision: 1.16.8.1; previous revision: 1.16
done
Checking in nsBookmarksFeedHandler.cpp;
/cvsroot/mozilla/browser/components/bookmarks/src/nsBookmarksFeedHandler.cpp,v  <--  nsBookmarksFeedHandler.cpp
new revision: 1.8.2.2; previous revision: 1.8.2.1
done
Keywords: fixed1.8.1
*** Bug 347316 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.