Closed Bug 325080 Opened 19 years ago Closed 18 years ago

Feed Parsing Service

Categories

(Firefox Graveyard :: RSS Discovery and Preview, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bugs, Assigned: sayrer)

References

Details

(Keywords: fixed1.8.1)

Attachments

(2 files, 21 obsolete files)

314.46 KB, patch
bugs
: review+
Details | Diff | Splinter Review
53.46 KB, patch
Details | Diff | Splinter Review
We need a single shared feed parsing component, e.g. under toolkit/components/feeds.

There are several application features that would use this:

- Thunderbird News & Blogs
- Firefox Live Bookmarks
- Firefox in-page feed handling

There are likely two different ways that this component would be used: 

- on demand start-stop feed parsing: Tbird and Live Bookmarks would construct a feed parser, supply a URI and then expect feed-format-agnostic results that they could use to construct a UI

- integrated with the flow of content loading. Once bug 324985 is resolved, all feed content (regardless of content type) coming into the system will have a content type of either application/rss+xml, application/rdf+xml or application/atom+xml. A channel to the feed data will already be open and will be receiving data. There needs to be a way to connect up the parser to this somehow, maybe using a stream converter ?

Anyway, Robert can you take a first pass at identifying the requirements of this system further for each of these use cases?
Thunderbird News & Blogs

- Parsing feeds concurrently (~5-10 or so at a time)
- Needs plain text titles. Current parser strips tags and makes an excerpt of
  description/content if no title is found.
- Needs to detect remote content and enclosures. In RSS, this may be a 
  function of the enclosure media type. For example, 
  <enclosure type="audio/mpeg" /> would be an 'attachment', while 
  <enclosure type='image/jpeg' /> should probably be something more like an 
  iframe or img@src.
- Generally needs a high detail level. (title, author, link, content...)
- Markup sanitization
- identifier detection (leave null or create hash for minimal RSS?)
- Doesn't care about the actual structure of the feed (e.g. it drops summary
  data if full content is available)

Firefox Live Bookmarks 

- Checking can occur at startup, needs to be fast, low-detail level. 
  Higher levels of details require a lot of string manipulation. 
- Needs plain text titles. Current parser doesn't try too hard to find 
  alternatives to title/link and Atom explicitly allows html titles, though 
  they are rare. Realistically, RSS2 titles probably also include escaped HTML,
  and the parser could excerpt <description> the way TBird does it. 

Integrated Feed View
- Markup sanitization
- Can handle minimal formatting in title (bold, italics)
- enclosure/etc count
- feed 'characterization' (podcast/photocast/etc)
- user experience could be pretty bad here if we give up on a sufficently high
  number of feeds. So, we need to show something, no matter what the error. I
  don't think we need to try and rescue all of the entry content like we would 
  for an HTML page. If we give up in the middle of a feed, we'll still have
  some of the entries and will most likely have the feed-level data.
- detail level of preview?

* Extension Developers
I added this use case because extension developers need a good feed parser
available so they can focus on their application.

- No politics. Feed data should be accessible as feed.entry.summary or 
  channel.item.description. This is a common capability in language-specific 
  libs like Universal Feedparser (Python) and FeedTools (Ruby). 
  <http://feedparser.org/docs/content-normalization.html>
- start/stop parsing  
- Access to arbitrary extension data
- needs access to actual structure of feed, so they can make decisions for 
  their app (no choosing summary or content for them).

General XML Parsing (Applies to all)

- not-so lax parsing:
  We have some good metrics from Google Reader on XML quality in feeds. 
  <http://googlereader.blogspot.com/2005/12/xml-errors-in-feeds.html>
  There don't seem to be many big wins here. We could work around HTML 
  entities and undeclared namespace prefixes. Retrying parsing with 
  different encodings might work, too.

- Streaming parser, rather than DOM. This can help us with performance, 
  because we can avoid lots of intermediate junk. It will also help with 
  non-wellformed XML, since many errors occur at the end of feeds, and errors 
  often occur in HTML content. That means the streaming parser will likely 
  have seen the title for the entry, and stands a good chance of having seen
  the entries preceeding. Streaming APIs will catch new entries as they land
  in the feed.
 
- No RDF parser. It's not worth it, since the RDF formats must conform to a
  specific XML syntax, and can't be arbitrary RDF/XML. 

- Extension content prioritization. What's the search path for description vs.
  content:encoded vs. itunes:description?

- It might be desirable to leave the architecture open enough to allow a DOM
  to be built as well. (this is handy for editing protocols)

- TBird's root-element detection seems to work pretty well. 

- namespaces mapped to common prefixes, if possible.
  <http://feedparser.org/docs/namespace-handling.html>

- (nice-to-have) Allow client to register handlers for elements, either SAX
  handlers, instruct parser to build a little DOM for the extension, or 
  return the extension as a string.
Status: NEW → ASSIGNED
Robert, this is excellent. Now, can you define what the interfaces for such a component would look like? (IDL).
<http://www.intertwingly.net/blog/2006/02/02/Plain-Text>

- treat all RSS2 elements other than item/description as the plain text they are supposed to be.

Let's be compatible with IE7. (still working on that IDL...)
Attached file idl for a feed processor (obsolete) —
I don't think I have much to add to the FeedService and FeedConverter classes you made here:
<https://bugzilla.mozilla.org/attachment.cgi?id=209868&action=diff>

With the design for the FeedProcessor in the attached IDL, the converter would work much better, because it would only have to buffer until StartFeed or ReportError was called. So, the flow would be right in with loading. 

I consider this IDL a first pass; wasn't sure what to do about the collections, arrays and such, or the less-common properties of feeds and entries. Ideally, these would be available from script as object properties similar to the Python dictionaries in Universal Feedparser. I could spec out interfaces for everything down to rss2:cloud, but I'm not sure it's worth it.
Attachment #210543 - Flags: review?(bugs)
Attached file Feed Processing SAX Handler in JS (obsolete) —
Just saving my work. I'll put together a full patch later today. This parser handles RSS2 and OPML. It has explicit state transitions right now, but I'll be abbreviating it after I do Atom. After finishing RSS2, I was able to write the OPML parser without tweaking the input handlers, so I think that's a good sign (yay first class functions and closures!). 

I have about 75 unit tests covering RSS and OPML core elements and attributes, which I'll include in the patch. The parser ignores all extensions right now.
Comment on attachment 210543 [details]
idl for a feed processor

This looks like a great start Robert, but of course you'll want to correct the idl syntax, add a license, complete the comments (document the parameters in javadoc, please!)

I can't wait to see the implementation! You'll put this somewhere in toolkit/components/ right?
Attachment #210543 - Flags: review?(bugs) → review+
Attached patch updated tree w/ tests (obsolete) — Splinter Review
Attachment #211070 - Attachment is obsolete: true
Depends on: 315826
Attached patch feed interfaces (obsolete) — Splinter Review
This diff shows has a SAX FeedProcessor that handles RSS2 and OPML and sends the parse result to the dispatch XUL screen. Need to rework the unit tests for the new interface, add tables for Atom/RSS1/RSS09x, and handle feed typing as the old patch did.

Sample files this patch works with:
http://www.intertwingly.net/blog/index.rss2
http://www.intertwingly.net/blog/index.opml
Attachment #210779 - Attachment is obsolete: true
Attachment #211071 - Attachment is obsolete: true
Attachment #211644 - Attachment is obsolete: true
Robert, can you exclude my UI patches from your final patch here? I would like to see the feed parser independently of any of that. Have you thought about who is going to review the SAX pieces? They live under parser/... have you mentioned this to mrbkap? 

It looks like there will be these pieces:

parser/sax/...                    <-- sax parser (you)
toolkit/components/feeds/...      <-- feed stuff (nsIFeed*) (you)
browser/components/feeds/...      <-- browser specific fe (let me handle this)

(In reply to comment #11)
>
> Robert, can you exclude my UI patches from your final patch here? I would like
> to see the feed parser independently of any of that.

Will do.

> Have you thought about who
> is going to review the SAX pieces

peterv is next.

> They live under parser/... have you mentioned this to mrbkap?

I'm pretty sure sicking, peterv, and mrbkap know it's coming.

>
> parser/sax/...                    <-- sax parser (you)

SAX patch is separate: bug 315826. I included it here only so the
patch would work if applied. I've had an initial review on it, but
there's another bug blocking it. I'll get a new patch reviewed by
peterv asap.

> toolkit/components/feeds/...      <-- feed stuff (nsIFeed*) (you)

Right OK.

> browser/components/feeds/...      <-- browser specific fe (let me handle this)

OK, I'll cut that stuff out and test on the command line.
Attachment #210543 - Attachment is obsolete: true
Attachment #211687 - Attachment is obsolete: true
Comment on attachment 213457 [details] [diff] [review]
RSS2 and RSS1 working w/ extensions

Lots of style nits, to start with. 

>Index: Makefile.in
> DIRS += \
> 	alerts \
> 	console \
>+        feeds \

too much whitespace?

>Index: feeds/public/nsIEntry.idl

>+
>+[scriptable, uuid(ef51d60c-cad1-42ea-9f6c-c1811f733d90)]
>+interface nsIEntry : nsIFeedContainer {

nit: comments above all your interfaces to say what they represents. 

>Index: feeds/public/nsIFeed.idl

>Index: feeds/public/nsIFeedProcessor.idl
>===================================================================
>RCS file: feeds/public/nsIFeedProcessor.idl
>diff -N feeds/public/nsIFeedProcessor.idl
>--- /dev/null	1 Jan 1970 00:00:00 -0000
>+++ feeds/public/nsIFeedProcessor.idl	28 Feb 2006 15:21:56 -0000
>@@ -0,0 +1,33 @@
>+
>+#include "nsISupports.idl"
>+
>+interface nsIURI;
>+interface nsIFeedResultListener;
>+interface nsIInputStream;
>+
>+// XXXben - this should ultimately become a nifty scriptable feed processor.
>+//          usable like so:
>+//          var FeedProcessor = 
>+//            Components.constructor("@mozilla.org/browser/feed-processor;1",
>+//                                   "nsIFeedProcessor", "init");
>+//           ...
>+//          var processor = new FeedProcessor(feedDoc, feedURI);
>+//          if (processor.isFeed) 
>+//            var feed = processor.getFeed();
>+//            feed.title ...
>+//
>+//          ... we could even replace the one in live bookmarks with this.

Since you're implementing this, you can remove this XXX comment. 

>+/* Parse a feed with callbacks to listener */
>+[scriptable, uuid(8a0b2908-21b0-45d7-b14d-30df0f92afc7)]
>+interface nsIFeedProcessor : nsISupports {
>+  
>+  attribute nsIFeedResultListener listener;
>+
>+  /* Level is where to listen for the extension, a constant: FEED, ENTRY, BOTH. */
>+  // XXX todo void registerExtensionHandler(in nsIFeedExtensionHandler, in long level);
>+  
>+  void parseFromStream(in nsIInputStream stream, in nsIURI aUri);
>+  [noscript] void parseFromBuffer([const,array,size_is(bufLen)] in octet buf, in PRUint32 bufLen, in nsIURI aUri);
>+  void parseFromString(in AString str, in nsIURI aUri);

Documentation? 

>Index: feeds/src/FeedProcessor.js
>+function ASSERT(condition, message) {

Replace this by doing (at the end of the file):

#include ../../../../toolkit/content/debug.js

Replace all the call sites with NS_ASSERT()

>+var ioService = Cc[IO_CONTRACTID].getService(Ci.nsIIOService);

Globals start with 'g'

>+/***** Namespace Map *****/
>+var  namespaces =
>+  {
>+    'http://www.w3.org/2005/Atom':'atom',
>+    'http://purl.org/dc/elements/1.1/':'dc',
>+    'http://www.w3.org/1999/02/22-rdf-syntax-ns#':'rdf',
>+    'http://purl.org/rss/1.0/':'rss1',
>+    'http://wellformedweb.org/CommentAPI/':'wfw',                              
>+    'http://purl.org/rss/1.0/modules/wiki/':'wiki' 
>+  }

Quote style is double quote: " 

>+function FeedResult() {}
>+FeedResult.prototype = 
>+{
>+  bozo: false,
>+  doc: null,
>+  version: null,
>+  headers: null,
>+  uri: null,

If any of these are private, prefix with _.

>+function Feed(){

space between () and {

function Feed() {

>+  QueryInterface: function(iid) {
>+    if (iid.equals(Ci.nsIFeed) ||
>+        iid.equals(Ci.nsIFeedContainer) ||
>+        iid.equals(Ci.nsISupports))
>+    return this;

Indent return 2 more spaces.

>+  for(var key in fields){

Space between ) and {

for (var key in fields) {

>+    searchList = fields[key];
>+    for(var i=0; i<searchList.length; i++){

spaces between things:

for (var i = 0; i < searchList.length; ++i){

& use preincrement instead of post, unless you have to. 

>+      try{

try { 

>+        prop = container.fields.getProperty(field);
>+      }catch(e){}

}
catch (e) {
}

>+      if(prop){

if (prop) {

>+var rssCatTerm = function(s,cat){
>+  // add slash handling?
>+  cat.setPropertyAsAString('term',trimString(s));
>+  return cat;
>+} 

Why define anonymous functions? Why not function rssCatTerm(..)

spaces between parameters:

function rssCatTerm(s, cat) {

>+var trimString = function(s)
>+{
>+  return(s.replace(/^\s+/,'').replace(/\s+$/,''));
>+}

Opening brace on same line as function() {

>+  if(!isNaN(parseInt(date,10))) 
>+  { 

opening brace on same line as if () {

>+    var yeardiff = now.getFullYear()-d.getFullYear();
>+    if((yeardiff >= 0) && (yeardiff<3))

Spaces!

>+function ExtensionHandler(aProc){

No prefixing parameters with 'a'. Say 'processor' too, not 'proc'. That term is overloaded, and where we can be complete within text limits we should. 

>+  this.buf = '';
>+  this.mIsSimple = true;

Private data and methods prefixed with '_'.

>+    this.mDepth += 1;

++this._depth;

>+    this.mDepth -= 1;

--this._depth;
>+  endPrefixMapping: function(){},
>+  processingInstruction: function(){}, 
>+}

Semicolon after definition

};

>+  process: function(atts){

Say 'attributes'

>+    }else if(this.closeFunc){

}
else if () {

>+  this.trans = 
>+  {   
>+  'START':      {"rss"        : function() { self.docVerified('rss2');
>+                                             return 'IN_RSS2';},
>+                 "rdf:RDF"    : function() { stack.push([self.feed,'IN_RDF']);
>+                                             return 'IN_RDF'; },
>+                 "atom:feed"  : function() { self.docVerified('atom');
>+                                             stack.push([self.feed,'IN_ATOM']);
>+                                             return 'IN_ATOM';}},

This sort of pretty formatting is difficult to maintain over time and tends to make files pass the 80th column. Can you de-prettify? 

this.trans = {
  "START": { "rss": function () { ...

>+    if(trans_func){

no underscores in variable names. use interCaps. 

>+  endElement:  function(uri, localName, qName){

don't have anonymous functions, rather prefix them wiht the acronym for your object e.g.:

endElement: function FP_endElement(..)

so that you can get stack traces in NS_ASSERT.

More later!
Attached patch RSS2, RSS1, some Atom (obsolete) — Splinter Review
(In reply to comment #14)
> (From update of attachment 213457 [details] [diff] [review] [edit])

Fixed based on your comments.

> >Index: feeds/src/FeedProcessor.js
> >+function ASSERT(condition, message) {
> 
> Replace this by doing (at the end of the file):
> 
> #include ../../../../toolkit/content/debug.js
>
> you can get stack traces in NS_ASSERT.

That's not going to work with my unit tests, so I'm leaving it out for now. Last time I asked, there wasn't a good way to run them.
Attachment #213457 - Attachment is obsolete: true
Comment on attachment 216692 [details] [diff] [review]
RSS2, RSS1, some Atom

Index: components/feeds/public/nsIEntry.idl

+/**
+ * An nsIEntry represents an Atom or RSS entry/item.
+ */
+[scriptable, uuid(ef51d60c-cad1-42ea-9f6c-c1811f733d90)]
+interface nsIEntry : nsIFeedContainer {

I'm going to hazard a guess here and say that "nsIEntry" is probably too generic for a name here. You might want to try nsIFeedEntry. Rename the file, too. 

+  /**
+  * Uses description, subtitle, summary, content and extensions
+  * to generate a summary. 
+  */
+  AString summary(in boolean stripTags, in long maxLength);

You don't describe what the parameters are, their units, etc for this method or for content().

Index: components/feeds/public/nsIFeed.idl
+  /**
+  * The items or entries in feed.
+  */
+  attribute nsIArray items;

If these are nsIFeedEntrys, and "item" isn't some special word in feed parlance, then I recommend calling this property "entries". 

+  /**
+  * No one really knows what cloud is for.
+  */
+  attribute nsIPropertyBag2 cloud;

Is this something special in certain feed formats?

+  /**
+  * An image url and some metadata.
+  */
+  attribute nsIPropertyBag2 image;

What things are folk likely to find defined on this property bag? Is that documented somewhere?

+  /**
+  * No one really knows what textInput is for.
+  */
+  attribute nsIPropertyBag2 textInput;

Why not?

+  /**
+  * Days to skip fetching.
+  */
+  attribute nsIArray skipDays;
+
+ /**
+  * Hours to skip fetching.
+  */
+  attribute nsIArray skipHours;

What do these things do? Your comments don't help me much as someone who isn't a feed-fiend ;-) 

Index: components/feeds/public/nsIFeedResult.idl
+  /**
+  * HTTP response headers that accompanyed the feed. 
+  */
+  attribute nsIProperties headers;

nit: accompanied

+
+  /**
+  * Registers a prefix used to access an extensions in the feed/entry 
+  */
+  void registerExtensionPrefix(in AString namespace, in AString prefix);

nit: an extensions? do you mean "any extensions"?

btw make sure only files I created have Initial Developer set to Google Inc. The rest are Robert Sayre. 

Index: components/feeds/src/FeedProcessor.js

+/***** Namespace Map *****/
+var  gNamespaces =
+  {

nit: Forgot to mention, your indentation here is wacky. Try this instead:

var gNamespaces = {
 "foo",
 "bar",
 "goat"
};

+function Feed() {
+  this._sub = null;
+  this.items = [];
+}
+Feed.prototype = {
+  subtitle: function(aBool){
+    return aBool ? stripTags(this._sub) : this._sub;
+  },

nit: aBool = bad parameter name. shouldStripTags is better. (no 'a' prefixes). 

+    if (iid.equals(Ci.nsIFeed) ||
+        iid.equals(Ci.nsIFeedContainer) ||
+        iid.equals(Ci.nsISupports))
+    return this;

nit: Your indentation looks weird here. 2 more spaces for return this;

+    for (var i=0; i < searchList.length; ++i) {

nit:spaces around i=0.

+function rssCatTerm(s ,cat) {
+  // add slash handling?
+  cat.setPropertyAsAString("term", trimString(s));
+  return cat;
+} 
+
+function rssGuid(s, guid) {
+  guid.setPropertyAsAString("guid", trimString(s));
+  return guid;
+}

What are these for? There are no comments!

+function rssAuthor(s,author) {
+  author.QueryInterface(Ci.nsIWritablePropertyBag2);
+  var open = s.indexOf("(");
+  var close = s.indexOf(")");
+  var email = trimString(s.substring(0,open)) || null;
+  author.setPropertyAsAString("email", email);
+  var name = null; 
+  if (open >= 0 && close>open) 
+    name = trimString(s.substring(open+1,close));
+  author.setPropertyAsAString("name", name);
+  return author;
+}
+
+function rssArrayElement(s,hour) {
+  var str = Cc["@mozilla.org/supports-string;1"].
+              createInstance(Ci.nsISupportsString);
+  str.data = s;
+  str.QueryInterface(Ci.nsISupportsString);
+  return str;
+}

hour is an unused parameter? What are these functions for? You never say. 

+function isValidRFC822Date(aDateStr)
+{
+  var regex = new RegExp(FZ_RFC822_RE);
+  return regex.test(aDateStr);
+}

nit: write as: function isValidRFC822Date(dateString) {

+function dateParse(dateString)
+{

nit: function dateParse(dateString) {

How important is it that you get the fixme in the comment correct before you turn this on? The severity is not indicated. 

+
+ExtensionHandler.prototype = {
+  startDocument: function(){},
+  endDocument: function(){},
+  startElement: function(uri, localName, qName, attrs){
+  endElement: function(uri, localName, qName){

nit: space between ) { 

+    if(attrs.length > 0) 

nit: if (attrs

+      this._isSimple = false;
+    if(this._depth == 1){

nit: if (..) {

+    }else{

nit: 

}
else {

+    --this._depth;
+    if(this._depth == 0){
+      if(this._isSimple){
+        this._processor.returnExt(this._uri, this._localName, 
+                                  trimString(this._buf));
+      }else{
+        this._processor.returnExt(null,null,null);
+      }

See above for more nits!

+  characters: function(data)
+  {

nit: characters: function EH_characters(data) {

Prefix all of your anonymous methods like this, it'll make them easier to debug. 

+FieldSet.prototype = {
+  process: function(attributes){
+    var obj, key, prefix;
+    if(this._containerClass)
+      obj = this._containerClass.createInstance(Ci.nsIFeedContainer);
+    else
+      obj = Cc[BAG_CONTRACTID].createInstance(Ci.nsIWritablePropertyBag2);
+    for(var i=0; i < attributes.length; i++){

Please just run through the entire file and search for these patterns:

- function() -> replace anonymous functions with named functions!
- if( -> if (
- for( -> for(
- i=0 -> i = 0
- i++ -> ++i (except where postincrement is actually necessary)
- ){ -> ) {
- }else{ -> }\nelse {
- }else if -> }\nelse if

+  close: function(buf){

BTW what methods are these functions for/implementing? Do you say in documentation?

+    var x = this._stack.pop();
+    var y = x[0];
+    if(isIArray(y)){
+      var z = y.queryElementAt(y.length-1, Ci.nsISupports);
+      if(this._closeFunc)
+        z = this._closeFunc(buf, z);
+      y.replaceElementAt(z, y.length-1, false);
+    }else if(this._closeFunc){
+      y = this._closeFunc(buf, y);
+    }
+    var rv = this._stack[this._stack.length-1][1];
+    return rv;
+  }

You'll make it a lot easier for other folk to understand what's going on if you explain your intent in documentation. 

You have a bunch of data structures here. What are they for? Where's the d11n?

+/***** The Processor *****/
+function FeedProcessor() 
+{

nit: function FeedProcessor() {

+  this.trans = 
+  {   


nit: this.trans = {

+    "START":{

nit: "START": {

+FeedProcessor.prototype = 
+{ 

nit: FeedProcessor.prototype = {

+  startElement: function FP_startElement(uri, localName, qName, attributes) {

GOOD.. you should be writing function definitions like this everywhere. 

Some notes on how the parsing works would be appreciated. It will help make the next level of my review go more smoothly. Remember, your goal should be to make it so that your code is accessible to someone who is familiar with the basic constructs of the Mozilla codebase, but not necessarily your approach to solving the problem. 

+/**
+ * XXX Thunderbird's W3C-DTF function
+ *
+ * Converts a W3C-DTF (subset of ISO 8601) date string to an IETF date
+ * string.  W3C-DTF is described in this note:
+ * http://www.w3.org/TR/NOTE-datetime IETF is obtained via the Date
+ * object's toUTCString() method.  The object's toString() method is
+ * insufficient because it spells out timezones on Win32
+ * (f.e. "Pacific Standard Time" instead of "PST"), which Mail doesn't
+ * grok.  For info, see
+ * http://lxr.mozilla.org/mozilla/source/js/src/jsdate.c#1526.
+ */
+function W3CToIETFDate(dateString) {

nit: I'm generally loathe to put any actual content after the Module in js components. It makes finding stuff harder... more disorienting.
Comment on attachment 216692 [details] [diff] [review]
RSS2, RSS1, some Atom

>Index: components/feeds/src/FeedProcessor.js
>+function FieldSet(aStack, aFieldName, Cc, closeFunc) {
>+  this._stack = aStack;
>+  this._fieldName = aFieldName;
>+  this._containerClass = Cc;
>+  this._closeFunc = closeFunc;
>+}

It looks like you create this object all over the place with varying argument lengths. I really find that suspicious... can you explicitly pass null for the arguments you don't want to supply in each case?

Also pass 'containerClass' as an argument, not 'Cc', it masks the Cc defined in the global scope. 

>+FieldSet.prototype = {
>+  process: function(attributes){
>+    var obj, key, prefix;
>+    if(this._containerClass)
>+      obj = this._containerClass.createInstance(Ci.nsIFeedContainer);
>+    else
>+      obj = Cc[BAG_CONTRACTID].createInstance(Ci.nsIWritablePropertyBag2);
>+    for(var i=0; i < attributes.length; i++){
>+      prefix = gNamespaces[attributes.getURI(i)] ? 
>+        gNamespaces[attributes.getURI(i)] + ":" : "";
>+      key =  prefix + attributes.getLocalName(i);
>+      var val = attributes.getValue(i);
>+      obj.setPropertyAsAString(key, val);
>+    }
>+    return this.processLeaf(obj);
>+  },

Sometimes obj is a nsIFeedContainer, sometimes it's a nsIWritablePropertyBag2. In eithr case, you treat it as if it's the latter inside the for loop, calling setPropertyAsAString. If obj is a nsIFeedContainer, this will fail. If nsIFeedContainer is supposed to be a nsIWritablePropertyBag2, you need to make it one, and QI it to that in createInstance.
More comments:

>Index: components/feeds/public/nsIFeedProcessor.idl
>+  /**
>+   * Parse a feed from an nsIInputStream.
>+   *
>+   * @param stream The input stream.
>+   * @param uri The base URI.
>+   */
>+  void parseFromStream(in nsIInputStream stream, in nsIURI uri);

What is the "base URI"? baseURI means something very specific in the io-service, for instance, it means the URI that the path provided is relative to. Is that the case here, or is it something else? I don't see it being used that way...

>Index: components/feeds/public/nsIFeedService.idl

This is an application-specific piece and I will implement this in browser/components/feeds... you can remove it from your patch.

>Index: components/feeds/src/FeedProcessor.js
>+function ExtensionHandler(processor) {
>+  this._buf = "";
>+  this._isSimple = true;
>+  this._processor = processor;
>+  this._depth = 0;
>+  this._localName = null;
>+  this._uri = null;
>+  this._qName = null;
>+  this._attrs = null;
>+}

What is this object for? What are the private members?

>+function FieldSet(aStack, aFieldName, Cc, closeFunc) {

Or this one...

>+  processLeaf: function(newObj){
>+    var newState = "IN_" + this._fieldName.toUpperCase();

Move this closer to where it's used. (above this._stack.push()). 

>+function ArraySet(aStack, aFieldName, Cc, closeFunc) {

...or this one?

>+  var self = this;
>+  var stack = [];
>+  this.stack = stack;

Is this a private member? Name it as one. Also, "stack" is not a good name. What are you tracking? States? Say so.

>+      "rss": function() { self.docVerified("rss2");
>+                          return "IN_RSS2";},
>+      "rdf:RDF":function() { stack.push([self._feed,"IN_RDF"]);
>+                           return "IN_RDF"; },
>+      "atom:feed":function() { self.docVerified("atom");
>+                               stack.push([self._feed,"IN_ATOM"]);
>+                               return "IN_ATOM";}

I don't understand why you verify rss2 and atom, but not RDF-rss. There is not enough documentation here as to what these functions are supposed to do. I have a vague idea, but you should make it explicit in the way they are written and documented. 

>+    "IN_CHANNEL": {
>+      "_channel":function(){ return "IN_RSS2" },

what is "_channel"? Why is this the only field that is a function, and the rest are Array/FieldSets?

>+    "IN_ITEMS": {
>+      "author":new ArraySet(stack, "authors", null, rssAuthor),

Why is there no state function here? What differentiates all of these cases?

>+  docVerified: function FP_docVerified(version) {
>+    this._result.doc = Cc[FEED_CONTRACTID].createInstance(Ci.nsIFeed);
>+    this._result.doc.fields = this._feed;
>+    this._result.version = version;
>+  },

This and init() look like they're private. Use _init and _docVerified then. 

>+  sendResult: function FP_sendResult() {
>+    this._result.doc.normalize();
>+    if(this._listener != null){ 
>+      this._listener.handleResult(this._result);
>+    }
>+  },

Ditto.

>+  // nsIFeedProcessor
>+  get listener() { return this._listener; },
>+  set listener (val) { this._listener = val; },

If this is just a simple read/write field you can have:

listener: null,

(In reply to comment #18)
> More comments:
> 

How about you hold off on further comments until tomorrow?

> What is the "base URI"? baseURI means something very specific in the
> io-service, for instance, it means the URI that the path provided is relative
> to. Is that the case here, or is it something else? 

That's what it will be used for here, too. To resolve relative references in feed content (it provides the default value for xml:base). xml:base support and companion feed HTML sanitization is the last thing one adds to one of these things.

> 
> What is this object for? What are the private members?
>

Say you have 

<feed xmlns:foo="http://example.org/ns/foo">
 ...
 <entry>
 ...
 <foo:bar>example content</foo:bar>
 </entry>
</feed>
 
This provides an easy way to deal with those types of fields (almost all extensions). Given a decent list of extension names, one can avoid writing code for most of them. I'll document the class before I ask for review.

> I don't understand why you verify rss2 and atom, but not RDF-rss.

Bug. Testcase added.

As I said above, give me a chance to document and update the patch to the latest code before more comments.
(In reply to comment #19)
> (In reply to comment #18)
> > More comments:
> > I don't understand why you verify rss2 and atom, but not RDF-rss.
> 
> Bug. Testcase added.
> 

Ah oops, I already was verifying rss1 further down, and I shouldn't add code on the root element, because that would characterize non-RSS RDF files (e.g. FOAF) as RSS1. 

Attached patch refactored, comments, more tests (obsolete) — Splinter Review
hopefully this addresses your comments so far. I still have more work to do on HTML handling, some Atom details, entry normalization, and extension support. I'm continuing on that today.
Attachment #216692 - Attachment is obsolete: true
Attachment #218318 - Flags: review?(bugs)
Comment on attachment 218318 [details] [diff] [review]
refactored, comments, more tests


>Index: toolkit/components/feeds/src/FeedProcessor.js
...
>+ * Contributor(s):
>+ *   Ben Goodger <beng@google.com>
>+ *
...
>+function W3CToIETFDate(dateString) {

Nit: I should probably be listed as a contributor, as I'm the author of the W3CToIETFDate function.
Depends on: 334304
************************************************************
* Call to xpconnect wrapped JSObject produced this error:  *
[Exception... "'JavaScript component does not have a method named: "summary"' wh
en calling method: [nsIFeedEntry::summary]"  nsresult: "0x80570030 (NS_ERROR_XPC
_JSOBJECT_HAS_NO_FUNCTION_NAMED)"  location: "JS frame :: chrome://browser/conte
nt/feeds/feedhandler.js :: FH_init :: line 84"  data: no]
************************************************************

Entry.prototype defines no such function "summary" or "content", as defined in nsIFeedEntry.idl...
Comment on attachment 218318 [details] [diff] [review]
refactored, comments, more tests

cancelling review. this needs to be updated for the just-landed async patch.
Attachment #218318 - Flags: review?(bugs)
Comment on attachment 218318 [details] [diff] [review]
refactored, comments, more tests

>Index: toolkit/components/feeds/src/FeedProcessor.js

Robert, there are still way too many unfixed nits in this patch. Here is a more complete breakdown of them all. As you write code, try and think about these, they'll make this process go faster. 

>+function isIID(a,iid) {

spaces after commas in parameter lists...

function isIID(a, iid) {

>+  try {
>+    a.QueryInterface(iid);
>+    rv = true;
>+  } catch(e) {}

newline after } and {, and space between catch and (:

try {
 ..
}
catch (e) {
}

>+  for (var i = 0; i < array.length; i++) {

++i > i++ - Nit #0

>+
>+/***** Namespace Map *****/
>+var  gNamespaces = {

Extraneous space before gNamespaces

>+/** 
>+ * Given a list of fields and a target object, map them.

"Map a list of fields into properties on a container" is probably more accurate. 

>+function rssAuthor(s,author) {

space after comma

>+  if (open >= 0 && close>open) 

spaces around >: 

close > open

>+function trimString(s) {
>+  return(s.replace(/^\s+/,"").replace(/\s+$/,""));

space after commas in replace param list

>+  if(!isNaN(parseInt(date, 10))) { 
>+    //It's an integer, so maybe it's a timestamp
>+    var d = new Date(parseInt(date, 10) * 1000);
>+    var now = new Date();
>+    var yeardiff = now.getFullYear() - d.getFullYear();
>+    if((yeardiff >= 0) && (yeardiff < 3)) {
>+      //it's quite likely the correct date

This seems arbitrary... why 3?

>+/* The XHTMLHandler handles inline XHTML found in things like atom:summary */

If you're going to use /* */ for section comments, do

/**
 *
 */

Like everywhere else. Otherwise just use //

>+/* The fidelity can be improved here, to allow handling of stuff like
>+ * SVG and MathML. XXX */

// for non-documentation comments. ... I use:

/**
 *
 */

... for function documentation only. 

>+XHTMLHandler.prototype = {
>+  startDocument: function XH_startDocument(){},

Space after ) and before {: - I'm going to call this "nit #1"

  startDocument: function XH_startDocument() {},


>+  endDocument: function XH_endDocument(){},

#1

>+  startElement: function XH_startElement(uri, localName, qName, attributes){

#1

>+    if(this._isAtom && this._depth == 1 && localName == "div")
>+      return;

Space after if and before (: - I'm going to call this "nit #2"

if (

>+    // If it's an XHTML element, record it. Otherwise, it's ignored.
>+    if(uri == XHTML_NS) {

#2

>+      this._buf += "<" + localName;
>+      for(var i=0; i < attributes.length; i++){

#1, #0

>+        // XHTML attributes aren't in a namespace
>+        if(attributes.getURI(i) == ""){ 

#1 & #2

>+  endElement: function XH_endElement(uri, localName, qName){

#1

>+    if(this._isAtom && this._depth == 0 && localName == "div")

#2

>+    if(this._depth < 0){
>+      this._processor.returnFromXHTMLHandler(trimString(this._buf),
>+                                             uri, localName, qName);

#1 & #2

>+    // If it's an XHTML element, record it. Otherwise, it's ignored.
>+    if(uri == XHTML_NS) {

#1

>+  characters: function XH_characters(data)
>+  {

Opening brace on the same line as function... - Nit #3

>+  startPrefixMapping: function XH_startPrefixMapping(){},

#1

>+  endPrefixMapping: function XH_endPrefixMapping(){},

#1

>+  processingInstruction: function XH_processingInstruction(){}, 

#1

>+ExtensionHandler.prototype = {
>+  startDocument: function EH_startDocument(){},

#1

>+  endDocument: function EH_endDocument(){},

#1

>+  startElement: function EH_startElement(uri, localName, qName, attrs){

#1

>+  endElement: function EH_endElement(uri, localName, qName){

#1

>+  characters: function EH_characters(data)
>+  {

#3

>+    if(this._isSimple)

#2

>+  startPrefixMapping: function EH_startPrefixMapping(){},

#1

>+  endPrefixMapping: function EH_endPrefixMapping(){},

#1

>+  processingInstruction: function EH_processingInstruction(){}, 

#1

>+function ElementInfo(fieldName, containerClass, closeFunc, isArray)
>+{

#3

>+/***** The Processor *****/
>+function FeedProcessor() 
>+{

#3

>+  // These elements can contain (X)HTML or plain text.
>+  this._textConstructs = ['atom:title','atom:summary','atom:rights',
>+                          'atom:content','atom:subtitle'];

Use double quotes not single. Nit #4

>+  this.trans = 
>+  {   

#3

>+      "channel":function(){ self._stack.push([self._feed,"IN_CHANNEL"]); 

#1

Also, space between : and function... Nit #5

>+      "_channel":function(){ return "IN_RSS2" }, // close up the RSS channel

#1

>+      "item":new ElementInfo("items", Cc[ENTRY_CONTRACTID], null, true),
>+      "category":new ElementInfo("categories", null, rssCatTerm, true),
>+      "cloud": new ElementInfo("cloud", null, null, false),
>+      "image":new ElementInfo("image", null, null, false),
>+      "textInput":new ElementInfo("textInput", null, null, false),
>+      "skipDays":new ElementInfo("skipDays", null, null, false),
>+      "skipHours":new ElementInfo("skipHours", null, null, false)

#5 for all of these lines

>+    },
>+
>+    "IN_ITEMS": {
>+      "author":new ElementInfo("authors", null, rssAuthor, true),
>+      "category":new ElementInfo("categories", null, rssCatTerm, true),
>+      "enclosure":new ElementInfo("enclosure", null, null, true),
>+      "guid": new ElementInfo("guid", null, rssGuid, false)

#5 here too

>+    "IN_SKIPDAYS": {
>+      "day":new ElementInfo("days", null, rssArrayElement, true)

#5 here too

>+    "IN_SKIPHOURS":{
>+      "hour":new ElementInfo("hours", null, rssArrayElement, true)

#5 here too

>+    "IN_RDF": {
>+      // If we hit a rss1:channel, we can verify that we have RSS1
>+      "rss1:channel":function(){ self._docVerified("rss1");
>+                                 return "IN_RDF_CHAN"; },

#1, #5

>+      "rss1:image":new ElementInfo("image", null, null, false),
>+      "rss1:textinput":new ElementInfo("textInput", null, null, false),
>+      "rss1:item":new ElementInfo("items", Cc[ENTRY_CONTRACTID], null, true)

#5

>+    "IN_RDF_CHAN": {
>+      "_rss1:channel" :  function(){ return "IN_RDF" } // close the channel

No space _before_ the : now! Nit #6. Also, #1

>+    "IN_ATOM": {
>+      "atom:author":new ElementInfo("author", null, null, true),
>+      "atom:generator":new ElementInfo("generator", null,
>+                                       atomGenerator, false),
>+      "atom:contributor":new ElementInfo("contributor", null, null, true),
>+      "atom:link":new ElementInfo("links", null, null, true),
>+      "atom:entry":new ElementInfo("entries", Cc[ENTRY_CONTRACTID], null, true)
>+    },

#5 here

>+    if (this._result.version == 'atom' && 

#4

>+  startPrefixMapping: function FP_startPrefixMapping(){},
>+  endPrefixMapping: function FP_endPrefixMapping(){},
>+  processingInstruction: function FP_processingInstruction(){},

#1 on all of these lines

>+  // Handle our more complicated elements--those that contain
>+  // attributes and child elements.
>+  _processComplexElement: function FP_handleComplexElements(elementInfo,
>+                                                            attributes) {

If you like, you can make these lines wrap earlier:

_processComplexElement:
function FP__processComplexElements(elementInfo, attributes) {

Remember, make the actual name of the function(the one with the FP_ prefix) match the name of the property it is assigned to. Also, make sure the right number of underscores _ is reflected into the name. This is Nit #7

>+    for (var i = 0; i < attributes.length; i++) {

#0

>+      var lastOne = container.queryElementAt(container.length-1,

Space around operator... container.length - 1... Nit #8

>+      container.replaceElementAt(lastOne, container.length-1, false);

#8

>+    else if(elementInfo.closeFunc) {

#2

>+  returnFromXHTMLHandler: function FP_returnFromXHTML(chars, uri, 
>+                                                      localName, qName)
>+  {

#3.. wrap after the : if need be. 

>Index: toolkit/content/widgets/tabbrowser.xml

Ditch these changes, turns out they're not necessary. Look for newer patches from me in bug 324588.
Er, I meant 333751
Attached patch fix nits, add async (obsolete) — Splinter Review
async mode not tested yet
Attachment #218318 - Attachment is obsolete: true
Attachment #219798 - Attachment is obsolete: true
Attached patch add content:encoded support (obsolete) — Splinter Review
Attachment #219835 - Attachment is obsolete: true
OK, now every feed I visit on my test page is reported as being bozo...

http://weblogs.mozillazine.org/ben/index.rdf

Also, while .summary() etc no longer give a js error, they also don't work...
(In reply to comment #30)
> OK, now every feed I visit on my test page is reported as being bozo...
> 
> http://weblogs.mozillazine.org/ben/index.rdf

I just downloaded that feed and tried it in my test suite. It did not come up bozo. I believe what's happening is that your feed is being served as application/rdf+xml with no charset, which defaults to utf-8. Even though your XML file has a charset in the declaration, the headers supposedly override that. If my diagnosis is correct, we probably don't want this behavior by default.
(In reply to comment #31)
> If my diagnosis is correct,

Ah, but it isn't. I'm building with --enable-feeds right now. will investigate further.
Attached patch Fix XPConnect weirdness (obsolete) — Splinter Review
I'm not getting any bozo errors with this, or the patch before it. However, I *was* getting random bugs caused by parser internal data structures losing track of their XPConnect interface (same native address, but magically reverting to nsISupports). Filed bug 335638 on that. The workaround is to do some extra QI-ing that shouldn't be necessary.

I'm in the middle of dealing w/ entry fields so this patch is kind of half way with that stuff. Got interrupted by XPConnect. :/ More later.
Attachment #219874 - Attachment is obsolete: true
(In reply to comment #31)
> (In reply to comment #30)
> > OK, now every feed I visit on my test page is reported as being bozo...
> > 
> > http://weblogs.mozillazine.org/ben/index.rdf
> 
> I just downloaded that feed and tried it in my test suite. It did not come up
> bozo. I believe what's happening is that your feed is being served as
> application/rdf+xml with no charset, which defaults to utf-8. Even though your
> XML file has a charset in the declaration, the headers supposedly override
> that.

Your diagnosis is correct but your conclusion is not.  There is no "default" encoding implied by the lack of a charset parameter in the HTTP headers, unless the Content-type is "text/xml".

This feed is being served with "Content-type: application/rdf+xml" with no charset parameter, the XML body does not contain a BOM, and the XML body does include an explicit encoding attribute of "iso-8859-1".  Therefore it fits into section 8.11 of RFC 3023 and the final encoding should be "iso-8859-1".
(In reply to comment #34)
> Therefore it fits into section 8.11 of RFC 3023 and the final encoding 
> should be "iso-8859-1". 

Yes, you're right. I got that mixed up. The feed-validator-users list set me straight last night.
I changed toolkit/components/Makefile.in. Hope that's all that I needed to do.
Attachment #219985 - Attachment is obsolete: true
Attachment #220082 - Attachment is obsolete: true
Attachment #220279 - Attachment is obsolete: true
Attachment #220280 - Attachment is obsolete: true
Attachment #220316 - Attachment is obsolete: true
Comment on attachment 220461 [details] [diff] [review]
Add the ability to capture a stylesheet URI if there is one

r=ben@mozilla.org for checkin controlled by the MOZ_FEEDS define. We should get this in so you can make further improvements incrementally with piecemeal review.
Attachment #220461 - Flags: review+
Comment on attachment 220461 [details] [diff] [review]
Add the ability to capture a stylesheet URI if there is one

a=ben@mozilla.org for the branch too
Attachment #220461 - Flags: approval-branch-1.8.1+
Checked in on trunk, waiting for branch SAX checkin in bug 315826.
Attached patch Add an HTML entity decoder (obsolete) — Splinter Review
just making sure the approach is sane before I start chasing conformance tests.
checked in on branch
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Keywords: fixed1.8.1
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED
Hopefully this is the right bug. :-) Is there a particular reason why in "Subscription Options" the "What are LiveBookmarks" is in the form of a button instead of a "normal" UI hyperlink like in other areas of FF (Add-ons->"Get more extentions" and others)?

This is a big inconsistency but isn't the only one as new functionality has been added for 2.x.

I assume I should file a new bug on this too?

~B
Attached patch entities and atom03 (obsolete) — Splinter Review
This patch covers Atom03 and gets the various text/html encodings w/ entities and markup correct
Attachment #220849 - Attachment is obsolete: true
Attached patch use the toolkitcomp module (obsolete) — Splinter Review
Attachment #221039 - Attachment is obsolete: true
Attachment #221048 - Flags: review?(bugs)
Attachment #221048 - Flags: approval-branch-1.8.1?(bugs)
Comment on attachment 221048 [details] [diff] [review]
use the toolkitcomp module

r+a=ben@mozilla.org
Attachment #221048 - Flags: review?(bugs)
Attachment #221048 - Flags: review+
Attachment #221048 - Flags: approval-branch-1.8.1?(bugs)
Attachment #221048 - Flags: approval-branch-1.8.1+
Attachment #221048 - Flags: review+
Attachment #221048 - Flags: approval-branch-1.8.1+
Attached patch don't clobber ben's patch (obsolete) — Splinter Review
Attachment #221048 - Attachment is obsolete: true
Attachment #221049 - Attachment is obsolete: true
This checkin makes toolkit-based SeaMonkey ("suiterunner") builds fail, see
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1146926100.3594.gz&fulltext=1
and
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1146945180.25884.gz&fulltext=1

How can we resolve this?
I thought feeds should only matter to the build when MOZ_FEEDS is on?
(In reply to comment #52)
> This checkin makes toolkit-based SeaMonkey ("suiterunner") builds fail, see
...
> I thought feeds should only matter to the build when MOZ_FEEDS is on?
> 

I see "-DMOZ_FEEDS=1" in those build logs. 
hrmm, yes, it is as the default is for feeds to be turned on, and http://lxr.mozilla.org/mozilla/source/toolkit/components/Makefile.in#60
excludes feeds from building, so we lack some parts and build others.
That's clearly bad, I'll look into that.

BTW, feeds being included two times in toolkit/components/Makefile.in (line 63 and 70) was erroneously introduced in bug 335443 without being part of the reviewed patch.
Depends on: 337216
Depends on: 336858
QA Contact: nobody → rss.preview
Status: ASSIGNED → RESOLVED
Closed: 18 years ago18 years ago
Resolution: --- → FIXED
Blocks: 479976
No longer blocks: 486402
Product: Firefox → Firefox Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: