Closed Bug 377450 Opened 17 years ago Closed 5 years ago

HTML parser hooks for efficient microformat parsing

Categories

(Core :: DOM: HTML Parser, enhancement)

enhancement
Not set
normal

RESOLVED INCOMPLETE

People

(Reporter: dmosedale, Unassigned)

As we discussed in IRC a while back, if microformats support is to ship as part of the browser, we need to be able to detect and parse microformatted content with only a minimal Tp hit, both on pages without such content (presumably most of the web) and on pages with it.

As we discussed in IRC a couple of weeks back, a reasonable strategy may be to allow chrome (maybe even content?) to register with the parser for post-DOM-construction callbacks so that it can get notified if any looked-for microformatted content is available in the page and then use the relevant DOM methods to get it.  

An interesting way to start might be to see what sort of hit Operator adds to Tp in both situations today, and then try to decide if that hit is acceptable, and, if not, what would be acceptable.
Is it sufficient to inspect @class only? How much attention must be paid to @rel?
It would be nice to have rel in the future, but I don't know if we need it to start. We'll probably do just hcard and hcalendar to begin.

Based on our IRC conversations, my understanding was that class can be done "easier" than rel because class is stored in a different way?
There are two microformats based on rel that we might want to detect in Firefox: rel-license (http://microformats.org/wiki/rel-license) and rel-tag (http://microformats.org/wiki/rel-tag).

For rel-license, we could possibly expose the license information in Page Info.  I am sure the Creative Commons people would love to have this exposed in primary UI (although it likely isn't worth cluttering our interface).

For rel-tag, we could use these to suggest possible tags when the user is adding tags to a page with Places.

Both of these features are of course up for debate, but those are the only rel-based microformats that we have been considering for possible native support.
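
For illustration only, here is a minimal sketch of what detecting these two rel values could look like. It is written in Python against the standard library HTML parser, not against Gecko, and the names RelCollector and collect_rel_microformats are hypothetical. It assumes the rel-tag convention that the tag name is the last path segment of the href, with "+" or "%20" standing for a space; treating <link> the same as <a> for both values is a simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote


class RelCollector(HTMLParser):
    """Hypothetical helper: collects rel-tag and rel-license links."""

    def __init__(self):
        super().__init__()
        self.tags = []       # tag names derived from rel="tag" hrefs
        self.licenses = []   # hrefs of rel="license" links

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # rel holds a space-separated list of tokens; compare case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "tag" in rel_tokens:
            # Assumed rel-tag rule: the tag is the last segment of the URL path.
            path = urlparse(href).path.rstrip("/")
            name = unquote(path.rsplit("/", 1)[-1]).replace("+", " ")
            if name:
                self.tags.append(name)
        if "license" in rel_tokens:
            self.licenses.append(href)


def collect_rel_microformats(html):
    """Return (tags, licenses) found in the given HTML string."""
    collector = RelCollector()
    collector.feed(html)
    collector.close()
    return collector.tags, collector.licenses
```

The tags list is the kind of data that could feed tag suggestions in Places, and the licenses list is what Page Info could display.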
I have my API done, so it's time to start thinking about this bug. Any thoughts on how we could pass data from the JS Microformats API to the parser?

New microformats are added via the Microformats.add API and the list of the class names associated with these microformats would need to somehow be passed to the parser so it would know what to look for.
Blocks: 384186
One possible solution would be to hook things up similar to the way the DOMLinkAdded/DOMLinkRemoved events work.  These events get dispatched whenever a <link> element is bound, unbound, added, or removed from the DOM.

A similar event could be added that fires whenever an element is added that has a "class" or "rel" attribute that contains one of the registered microformats (previously registered with some kind of microformats service).  I wonder if there is a way to defer this until the subtree below the root of the microformat is completely built?

Additionally, whenever a DOM node is changed, we could walk up the tree from that node looking for an element that is a registered microformat root.  If one is found, also raise an event that says the microformat was modified.

It would still be the responsibility of the microformats manager to determine if these events relate to elements that are indeed valid microformats.  However, these new events should take care of the hard (slow) stuff and let the manager deal with the details.
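
To make the walk-up idea concrete, here is a rough sketch in Python using a toy node type rather than real DOM classes. Node, REGISTERED_ROOTS, and find_microformat_root are hypothetical names; the registered set uses the hCard ("vcard") and hCalendar ("vevent") root class names as examples.

```python
class Node:
    """Toy stand-in for a DOM element: a tag, a class list, and a parent link."""

    def __init__(self, tag, classes=(), parent=None):
        self.tag = tag
        self.classes = set(classes)
        self.parent = parent


# Example registered root class names (hCard and hCalendar events).
REGISTERED_ROOTS = {"vcard", "vevent"}


def find_microformat_root(node, registered=REGISTERED_ROOTS):
    """Return the nearest node, starting at `node` and walking up through its
    ancestors, whose class list contains a registered root name, else None."""
    current = node
    while current is not None:
        if current.classes & registered:
            return current
        current = current.parent
    return None
```

The cost of this walk is proportional to the depth of the changed node, and it only says that a candidate root exists; validating the microformat would still be the manager's job.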

Thoughts?
I'd like to try to get something concrete in this bug as to what our next steps are. I know we talked a little bit on IRC about this, but I didn't feel we had anything definitive. Issues are:

1. How to get the data about which microformat classnames are available to the parser. This means passing things from a JS imported file into C code somehow.

2. How does an extension find out there are microformats on the page. Is a message sent? Is there a parameter on the document?

3. How does an extension get notification of changes in the document that relate to microformats? Like if a microformat DOM node is added.
(In reply to comment #6)
> 2. How does an extension find out there are microformats on the page. Is a
> message sent? Is there a parameter on the document?

I was thinking of a boolean on the document, but see below.

> 
> 3. How does an extension get notification of changes in the document that
> relate to microformats? Like if a microformat DOM node is added.

We can add a check for microformat classnames after an attribute has been added. This is the same place we check for contentEditable. I guess we could fire an event instead?

- Rob
Drive-by two cents flinging: DOM custom event sounds right to me.

/be
I don't think we should use custom DOM events, as they come with a fairly large overhead and raise the usual complexities of firing script at unsafe times, etc. They also raise the issue of content-detectable events, which ideally should be standardized.

So I would go for a custom notification mechanism instead. Possibly an nsIObserverService notification, or something even more custom.

As for doing the actual detection, I suggest we have a service where anyone can register classnames that we're interested in. We then build a list of atoms that the content code checks parsed class names against. We could even implement the service inside gklayout for extra speed goodness.

We can then in the various SetAttr implementations check if it's the class that's being set, check against the registered atom list, and then notify if a match is found.
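
As a rough model of the register-then-check flow (not Gecko code), the sketch below uses a Python set in place of the atom list and a plain callback list in place of whatever notification mechanism is chosen. ClassNameRegistry, Element, and set_attr are hypothetical names for illustration.

```python
class ClassNameRegistry:
    """Hypothetical service: holds the class names callers care about."""

    def __init__(self):
        self._names = set()      # stands in for the list of atoms
        self._observers = []     # stands in for the notification mechanism

    def register(self, class_name):
        self._names.add(class_name)

    def add_observer(self, callback):
        self._observers.append(callback)

    def check(self, element, class_value):
        """Notify observers if any token in class_value is registered."""
        matches = [name for name in class_value.split() if name in self._names]
        if matches:
            for callback in self._observers:
                callback(element, matches)
        return matches


class Element:
    """Toy element whose set_attr checks the class attribute only."""

    def __init__(self, tag, registry):
        self.tag = tag
        self.attrs = {}
        self._registry = registry

    def set_attr(self, name, value):
        self.attrs[name] = value
        if name == "class":
            self._registry.check(self, value)
```

Setting any attribute other than class costs one string comparison here, which is the property that keeps the overhead low on pages without microformats.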

This will support dynamic changes. However one tricky aspect is that during parsing, when a match is found, the element will not have any surrounding contents. It's not yet inserted into its parent, and it does not yet have any children.

Another possibility would be to attach an nsIMutationObserver on all documents, then we could look for ContentInserted/ContentAppended notifications and walk all subtrees that are inserted. We can then call GetClasses and see if it contains any of the registered classes. We'd also have to watch AttributeChanged to watch out for dynamic modifications. This seems like a cleaner approach, but I'm a little worried about performance.
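
A sketch of that observer-based alternative, again in Python with toy types rather than nsIMutationObserver itself: the method names content_inserted and attribute_changed loosely mirror the notifications named above, and TreeNode and MicroformatObserver are hypothetical.

```python
class TreeNode:
    """Toy stand-in for a DOM element with a class list and children."""

    def __init__(self, tag, classes=(), children=()):
        self.tag = tag
        self.classes = list(classes)
        self.children = list(children)


class MicroformatObserver:
    """Hypothetical observer that records nodes carrying a registered class."""

    def __init__(self, registered):
        self.registered = set(registered)
        self.found = []

    def _check(self, node):
        if self.registered.intersection(node.classes):
            self.found.append(node)

    def content_inserted(self, root):
        # Walk the whole inserted subtree in document order.
        stack = [root]
        while stack:
            node = stack.pop()
            self._check(node)
            stack.extend(reversed(node.children))

    def attribute_changed(self, node, attr_name):
        # Only class changes can add or remove a microformat root here.
        if attr_name == "class":
            self._check(node)
```

The subtree walk visits every inserted node, which is where the performance concern comes from: the work scales with the size of each insertion, whether or not the page contains any microformats.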
(In reply to comment #9)
> 
> We can then in the various SetAttr implementations

We only need to check this for nsGenericHTMLElement::AfterSetAttr, right? Are there HTML elements that don't call this?
That part isn't that important as performance will generally be the same. Though nsGenericHTMLElement::AfterSetAttr won't work if we also want this to work on, say, svg elements.
(In reply to comment #11)
> That part isn't that important as performance will generally be the same.
> Though nsGenericHTMLElement::AfterSetAttr won't work if we also want this to
> work on, say, svg elements.

Microformats are only for HTML, afaik.
Assignee: sayrer → nobody

Marking incomplete. I don't think we're doing anything here.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INCOMPLETE