RSS feed detection detects non-feeds (and fails to detect some legitimate feeds)

RESOLVED FIXED

Status

Camino Graveyard
General
RESOLVED FIXED
11 years ago
10 years ago

People

(Reporter: bosspro, Assigned: Chris Lawson (gone))

Tracking

Details

(URL)

Attachments

(1 attachment, 4 obsolete attachments)

7.41 KB, patch
Mike Pinkerton (not reading bugmail)
: superreview+
Details | Diff | Splinter Review
(Reporter)

Description

11 years ago
User-Agent:       Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en; rv:1.8.1.4) Gecko/20070509 Camino/1.5
Build Identifier: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en; rv:1.8.1.4) Gecko/20070509 Camino/1.5

The RSS feed auto-detection feature of Camino detects links that are not feeds as feeds (for example, a P3P file).



Reproducible: Always

Steps to Reproduce:
1.  Go to a site that exports P3P (for example, http://home.gfc.org)
2.  Click on the feed icon
3.  Non-feeds are displayed along with the feeds

Actual Results:  
Non-feeds are displayed along with the feeds

Expected Results:  
Only real feeds are displayed.  NOTE:  Firefox implements this properly.
(Assignee)

Comment 1

11 years ago
Yeah, this definitely happens. Nick, can you take a quick look and see what the problem might be?
Status: UNCONFIRMED → NEW
Ever confirmed: true

Comment 2

11 years ago
This is because we don't "sniff" hard enough.

Any <link> tag that has a type="" of:

(courtesy of CHBrowserListener:)
// Check for feed type next
  inElement->GetAttribute(NS_LITERAL_STRING("type"), attribute);
  
  // kinda ugly, but if we don't match any of these strings return
  if (attribute.Equals(NS_LITERAL_STRING("application/rssxml")) ||
      attribute.Equals(NS_LITERAL_STRING("application/rss+xml")) ||
      attribute.Equals(NS_LITERAL_STRING("application/atomxml")) ||
      attribute.Equals(NS_LITERAL_STRING("application/atom+xml")) ||
      attribute.Equals(NS_LITERAL_STRING("text/xml")) ||
      attribute.Equals(NS_LITERAL_STRING("application/xml")) ||
      attribute.Equals(NS_LITERAL_STRING("application/rdfxml")))
  {
    return eFeedType;
  }
(Assignee)

Updated

11 years ago
Depends on: 383535
(Assignee)

Comment 3

11 years ago
This will be fixed when we move to the smarter sniffing (nsIFeedSniffer). Marking as dependent on that one for now.
Actually, it won't, not unless you have absolutely no interest whatsoever in pageload perf, and you're going to sniff every single URI in a <link> (which could well be hundreds of them, of infinite size), even if it takes ten minutes before you can free up enough connections to start loading the page images and stylesheets.

The feed sniffer is a content sniffer, which looks at documents as they are loading, and determines whether or not they are a feed - it's what drives Firefox's feed preview page, not its feed icon in the addressbar. This bit of code is supposed to be looking at the <link> elements in an HTML page which is loading, to tell whether or not they are saying that some other URL is a feed, which is the equivalent of DOMLinkHandler.onLinkAdded in browser/base/content/browser.js plus isValidFeed in browser/base/content/utilityOverlay.js.

For this, you don't need some secondhand code stuffed into toolkit, you just need to decide that you want to tighten up what you autodiscover, that the best way to do that is with the only standard for autodiscovery that's likely to ever get published, HTML5, and change from that too-broad set of mime-types not broadly enough tested to:

* if one of the space-separated words in the rel attribute is "feed" it's a feed

* if one of the space-separated words in the rel attribute is "alternate" and none of them are "stylesheet", then

  * if and only if the type (after trimming leading and trailing whitespace) is either "application/rss+xml" or "application/atom+xml" it's a feed

While it's somewhat lacking in grace, http://mxr.mozilla.org/comm-central/source/mozilla/browser/base/content/test/feed_discovery.html is a good start at what you're over- and under-discovering.
(Assignee)

Comment 5

10 years ago
Thanks, Phil!

We should detect 13 feeds on that page and not detect any of the "bogus" feeds. Right now, we fail to detect feeds 3, 8, and 10-13, and we detect as feeds the bogus feeds 1-5, 8, 9, and 11-13.

I can probably take a look at this sometime soon-ish. Assigning to me so it stays on my radar.
Assignee: nobody → cl-bugs-new
No longer depends on: 383535
Hardware: Macintosh → All
Summary: RSS feed detection detects non-feeds → RSS feed detection detects non-feeds (and fails to detect some legitimate feeds)
(Assignee)

Comment 6

10 years ago
Phil tells me

http://mxr.mozilla.org/mozilla/source/browser/base/content/test/feed_discovery.html

is the version of the test used to check Firefox 3 (the URL in comment 4 is used with Fx 3.1). On that page, we fail to detect 3, 8, 10-13, and 16, while we detect bogus feeds 1-5, and 8-10.
Right: the difference is bug 451329, where I backed out the stuff which would discover a feed if it had both " rss " in the title attribute and a type of "text/xml" or "application/xml" or "application/rdf+xml".

That came from bug 257247, because at that time WordPress had a default autodiscovery link like <link rel="alternate" type="text/xml" title="RSS 0.92"> that went to a feed which was just titles, links, and a plaintext excerpt in the description, perfect for Fx's Live Bookmarks (and useless for anything else), and RDF had just gotten its shiny new mime-type approved, so some people wanted to use it for their RSS 1.0 feeds, but there wasn't any other way of knowing that a particular /rdf+xml URL was a feed. Now, WordPress has dropped that link (which IE7/8 didn't discover), and HTML5 has rel="feed" which will let you use any (or no) type attribute while still saying that it's a feed (or rel="feed alternate" if you want to be discovered by both Fx3 and more modern things).
(Assignee)

Comment 8

10 years ago
Created attachment 341636 [details] [diff] [review]
fix v1.0

This implements behaviour such that we pass all the tests that Firefox 3.0 does (the URL in comment 6) and fixes this bug as originally filed (on http://home.gfc.org/ ).

This patch allows the "bogus" feeds 10, 11, and 12 in the Firefox 3.1 tests (the URL in the URL field) due to its tolerance for the obsolete WordPress (and, IIRC, Movable Type at one point) feeds, but I think we're OK with that. I separated out that check so it'll be easy to remove if we decide we no longer want to support those particular tags later.
Attachment #341636 - Flags: review?(nick.kreeger)

Comment 9

10 years ago
Comment on attachment 341636 [details] [diff] [review]
fix v1.0

>Index: camino/src/embedding/CHBrowserListener.mm
>===================================================================
>+  // Check for feed type next. This check is sort of complicated, but the basic logic is:
>+  // � if the "rel" attribute contains the word "feed", it's a feed
>+  // � else if the "rel" attribute contains "alternate" without containing "stylesheet", then
>+  //   � iff the "type" attribute is "application/rss+xml" or "application/atom+xml", it's a feed
>+

Looks like you're using a bad character here (maybe a bullet?), don't do that, just use * or something. Also check your spelling here ('iif'..). In fact, I'd like it if the comments we next to the if statemnt:
   // Comment here.
   if (<conditional>) {
     ..
   }
   // Other comment
   else if (<conditional>) {
   }
   ... etc...


>+  NSArray* relAttributes = [[[NSString stringWith_nsAString:relAttribute] lowercaseString] componentsSeparatedByString:@" "];
>+  if ([relAttributes containsObject:@"feed"]) {
>     return eFeedType;
>+  } else if ([relAttributes containsObject:@"alternate"] && ![relAttributes containsObject:@"stylesheet"]) {

No need to follow moz style here, please put the |else if| on a new line ;-)


>+    nsAutoString typeAttribute;
>+    inElement->GetAttribute(NS_LITERAL_STRING("type"), typeAttribute);
>+    typeAttribute.Trim(" \n\r\t"); // trim whitespace before checking
>+    if (typeAttribute.EqualsIgnoreCase("application/rss+xml") ||
>+        typeAttribute.EqualsIgnoreCase("application/atom+xml")) {
>+      return eFeedType;
>+    }

Also, no need to use a |nsAutoString|, while you're correct, |nsString| and |nsAutoString| are the same thing now. Also, put the '{' curly on a new line since it is a multi-line conditional.

>+    // We also want to allow a handful of obsolete feed tags that have a type of
>+    // application/rdf+xml, appplication/xml, and text/xml as long as they have the word
>+    // "RSS" somewhere in the title. Blogging software no longer creates feeds that look
>+    // like this, so we could consider dropping this check entirely in the future.
>+    if (typeAttribute.EqualsIgnoreCase("text/xml") ||
>+        typeAttribute.EqualsIgnoreCase("application/xml") ||
>+        typeAttribute.EqualsIgnoreCase("application/rdf+xml")) {
>+      nsAutoString titleAttribute;
>+      inElement->GetAttribute(NS_LITERAL_STRING("title"), titleAttribute);
>+      NSArray* titleComponents = [[[NSString stringWith_nsAString:titleAttribute] lowercaseString] componentsSeparatedByString:@" "];
>+      if ([titleComponents containsObject:@"rss"])
>+        return eFeedType;
>+    }
>   }
>-  
>+
>   return eOtherType;
> }


I'm not a big fan of the logic for finding the substring 'rss'. Is there a specific reason why you needed to loop through a token string array? If not, I don't see why we need to use the array searching logic (searching, allocation and conversion overheads). You should use the pre-defined functions on |nsString|, such as |Find(<#string type#>aStr, PRBool aIgnoreCase)|.

Also, new line for the '{' curly as above and use |nsString| rather than |nsAutoString|.

Let's give this guy another spin ;-)
Attachment #341636 - Flags: review?(nick.kreeger) → review-
Since bug 451505 is just the sort of thing we don't get around to fixing, you should be able to test your ability to discover useless feeds from ancient software with http://developer.mozilla.org/web-tech/ for some time to come.
(Assignee)

Comment 11

10 years ago
Created attachment 341719 [details] [diff] [review]
fix v1.1

I fixed the comments to avoid the bullets entirely.

I'm sticking with nsAutoString for three reasons:

  a) it doesn't make any difference, as you said, and we use it all over the place in that file, so continuing to use it helps with consistency;
  b) nsString bears a superficial similarity to NSString (and both will be hit in a case-insensitive search of our source code), and I'd rather not introduce that additional point of confusion just yet unless Mike, Mark, and/or Stuart thinks we should; and
  c) no one has explained (or attempted to explain) to me how there would be any concrete benefit to using nsString that overcomes the objections in (a) and (b).

The } else if (foo) { thing is all over the place in our codebase in both that single-line form and the supposedly "preferred" multi-line form, and I've seen several checkins recently that used the single-line form. CHBrowserListener does seem to use the multi-line form more often, though, so I've fixed this for consistency. (As an aside, we *really* need to make sure we pick a style and then enforce it rigorously from here on out, because this gets exceedingly confusing every time it comes up.)

As I explained to Nick on IRC, searching for "rss" or "feed" as a substring isn't (by itself) a viable option. As I note in the documentation for the |containsWord:| method, breaking it up into an array of space-separated components allows us to avoid making essentially four different searches:

  1) "word " at the beginning of the string,
  2) " word" at the end of the string,
  3) " word " in the middle of the string, and
  4) "word" comprising the entire string.

All four are perfectly valid (and plausible!) situations. I don't claim to know what sort of overhead we're dealing with here in terms of doing four Find()s versus converting to an NSString, creating an NSArray, then searching it, but I do know this seems a whole lot simpler and doesn't feel like it's taking any more time (compared to an un-modified debug build) when I click on the RSS icon. I'm certainly open to other ideas here if anyone has them.
Attachment #341636 - Attachment is obsolete: true
Attachment #341719 - Flags: review?(nick.kreeger)
(Assignee)

Updated

10 years ago
Attachment #341719 - Flags: review?(nick.kreeger)
(Assignee)

Comment 12

10 years ago
Comment on attachment 341719 [details] [diff] [review]
fix v1.1

<cl> so what's the point of changing [nsAutoString to nsString], other than for the sake of change? 
<kreeger> because i want it changed ;)
<kreeger> eh
<kreeger> http://mxr.mozilla.org/seamonkey/source/xpcom/glue/nsStringAPI.h#1419
<cl> file a bug to change our whole codebase, then :-p
<kreeger> no do it right now
<kreeger> doesn't make sense to double the work
<cl> sure it does. We already use nsAutoString everywhere else in that file, and there's no point in changing it just because.
<kreeger> r- until it's changed
<cl> you're actually serious, aren't you?
<kreeger> yes

Clearing review request until someone whose opinion matters makes a decision. If we switch here, I want to see a followup bug filed to change the rest of our codebase, too.

Comment 13

10 years ago
(In reply to comment #11)
> As an aside, we *really* need to make sure we pick a style and
> then enforce it rigorously from here on out

We did pick a style:
http://wiki.caminobrowser.org/Development:Reviewing#Code_style

If recent code went in the other way, that was mistake that wasn't caught.


I don't understand what's controversial about not writing new code with deprecated APIs. Use nsString, and feel free to file a follow-up bug about changing existing code if that's important to you.
(Assignee)

Comment 14

10 years ago
Created attachment 344169 [details] [diff] [review]
fix v1.2

Re-implemented searching in Gecko and addressed other review comments.
Attachment #341719 - Attachment is obsolete: true
Attachment #344169 - Flags: review?(nick.kreeger)

Comment 15

10 years ago
Comment on attachment 344169 [details] [diff] [review]
fix v1.2


>+  // If the rel attribute contains the word "feed", it's a feed.
>+  if (GeckoUtils::StringContainsWord(relAttribute, "feed", PR_TRUE)) {
>     return eFeedType;
>+  } // Otherwise, do some more testing to see what we have here. It might still be a feed.
>+  else if (GeckoUtils::StringContainsWord(relAttribute, "alternate", PR_TRUE) && !GeckoUtils::StringContainsWord(relAttribute, "stylesheet", PR_TRUE)) {

NIT: Move this comment to the next line. Also, let's not make this line super long, move the |!GeckoUtils..| to a new line. 


>+PRBool GeckoUtils::StringContainsWord(nsAString& aString, const char* aWord, PRBool caseInsensitive)
>+{
>+  nsString source(aString);
>+  nsCString wordAsPrefix(aWord);
>+  wordAsPrefix += ' ';
>+  nsCString wordAsSuffix(aWord);
>+  wordAsSuffix.Insert(' ', 0);
>+  nsCString wordAsWord(wordAsSuffix);
>+  wordAsWord += ' ';

Isn't |wordAsWord| supposed to not have any whitespace around it???

>+
>+  if (source.EqualsIgnoreCase(aWord) ||
>+      (source.Find(wordAsPrefix, caseInsensitive, 0, wordAsPrefix.Length()) == 0) ||
>+      (source.RFind(wordAsSuffix, caseInsensitive, -1) == (PRInt32)(source.Length() - wordAsSuffix.Length())) ||
>+      (source.Find(wordAsWord, caseInsensitive) != -1))
>+  {
>+    return PR_TRUE;
>+  }
>+  return PR_FALSE;
>+}


Close, just a couple of things...
1.) When using |Find()| and |RFind()| you can use the function signature that just takes a |const char *| and a |PRBool|. No need to pass in extra params.

2.)
>+      (source.RFind(wordAsSuffix, caseInsensitive, -1) == (PRInt32)(source.Length() - wordAsSuffix.Length())) ||

Not sure what the logic is here, shouldn't you be checking for != -1

From nsStringAPI.h
   * Find the last occurrence of aStr in this string.
   *
   * @return The offset of aStr from the beginning of the string,
   *         or -1 if not found.
   */

r+ with these issues addressed.
Attachment #344169 - Flags: review?(nick.kreeger) → review+
(Assignee)

Comment 16

10 years ago
Comment on attachment 344169 [details] [diff] [review]
fix v1.2

Because it seems to be such a common and persistent point of confusion (and because Gecko uses all these magic numbers in the nsString search methods), I'll explain the logic one more time.

* If the string consists entirely of "foo", return true.
* If the string's first [n] characters are "foo ", return true (where n = the length of "foo ").
* If the string's last [n] characters are " foo", return true
* If the string contains the literal " foo " anywhere in it, return true.

Targeting pink for sr since his queue is currently empty. I'm happy to fix the spacing nits for checkin but don't see much point in doing it right now.
Attachment #344169 - Flags: superreview?(mikepinkerton)
Attachment #344169 - Flags: superreview?(mikepinkerton) → superreview+
Comment on attachment 344169 [details] [diff] [review]
fix v1.2

With this patch, we're detecting a new bogus feed at http://home.gfc.org/ : 

<link rel="home" href="?zone=home&amp;ID=1">

Also, given the resulting discussion in bug 461075, is this patch actually using the right string types for the desired memory semantics?
(Assignee)

Comment 19

10 years ago
Created attachment 345232 [details] [diff] [review]
fix v1.3

As Smokey pointed out in comment 18, the existing patch (which needed a respin for spacing anyway, and now I'm glad I waited) didn't really solve the problem on the original URL; it just traded one non-feed link for another.

After more investigation and much testing, here's a new patch that a) doesn't detect any non-feeds on the original URL; b) detects what it should at the Firefox 3 test link; c) detects what it should (including three bogus feeds, which is expected) at the Firefox 3.1 test link; d) detects all three feeds at the MDC link; and e) detects all three feeds at my blog, which I don't believe has changed its feed tags since back in the Movable Type 2 days.

I've changed enough that I'd like a re-review from someone who hasn't already seen it and who's familiar enough with Gecko strings code to know if I've done anything silly here (other than making the mistake of working with Gecko's string classes to begin with, which is not an experience I'm particularly eager to repeat).
Attachment #344169 - Attachment is obsolete: true
Attachment #345232 - Flags: review?(stuart.morgan+bugzilla)

Comment 20

10 years ago
Comment on attachment 345232 [details] [diff] [review]
fix v1.3

This looks reasonable, although I'm not really a good person to review Gecko string API usage.

>-      while ((obj = [enumerator nextObject]))
>+      while ((obj = [enumerator nextObject])) {
>         [obj onLoadingStarted];
>+      }

No random changes in other functions.

>   }
>-  
>   return eOtherType;
> }

Leave the blank line for separation, since this is a completely different case than the feed block.
Attachment #345232 - Flags: review?(stuart.morgan+bugzilla) → review+
(Assignee)

Comment 21

10 years ago
Created attachment 349919 [details] [diff] [review]
fix v1.31

Pink, can you take a look at this? Since Stuart isn't all that sure about the Gecko strings API stuff, it's be great if you could pay particular attention to whether that's right.
Attachment #345232 - Attachment is obsolete: true
Attachment #349919 - Flags: superreview?(mikepinkerton)
I'm not a fan of StringContainsWord(), but I guess there's no other way to do it. What about contributing this back to nsString so that everyone can use it? That can be done as a separate patch though, it doesn't need to hold this back.

sr=pink
Attachment #349919 - Flags: superreview?(mikepinkerton) → superreview+
Landed on cvs trunk.

Chris, don't forget to file the follow-up for comment 22 when you get back.
Status: NEW → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED
(Assignee)

Comment 24

10 years ago
(In reply to comment #22)
> I'm not a fan of StringContainsWord(), but I guess there's no other way to do
> it. What about contributing this back to nsString so that everyone can use it?

Filed as bug 467425.
You need to log in before you can comment on or make changes to this bug.