Closed Bug 102699 Opened 23 years ago Closed 12 years ago

Implement text/html for @mozilla.org/xmlextras/domparser

Categories: Core :: DOM: Core & HTML (enhancement)

Tracking: RESOLVED FIXED, Target Milestone mozilla12

People: Reporter: mikeypotter, Assigned: hsivonen

References: Depends on 1 open bug, Blocks 1 open bug

Details: Keywords: dev-doc-complete, Whiteboard: [jetpack]

Attachments: 2 files, 8 obsolete files

Given a string of HTML, I need to be able to create an HTML DOM document and
walk through it via the DOM.
The following works for XML:

var domParser =
Components.classes["@mozilla.org/xmlextras/domparser;1"].getService(
Components.interfaces.nsIDOMParser );
     
var xmlDocument = domParser.parseFromString( ResponseObject.responseText,
"text/xml" );

but it returns an error:
Error: [Exception... "Component returned failure code: 0x80004001
(NS_ERROR_NOT_IMPLEMENTED) [nsIDOMParser.parseFromString]"  nsresult:
"0x80004001 (NS_ERROR_NOT_IMPLEMENTED)"  location: "JS frame ::
chrome://personalpage/content/HtmlTest.xul :: onResponse :: line 17"  data: no]
Source File: chrome://personalpage/content/HtmlTest.xul
Line: 17

when HTML is used, like this:

var domParser =
Components.classes["@mozilla.org/xmlextras/domparser;1"].getService(
Components.interfaces.nsIDOMParser );
     
var xmlDocument = domParser.parseFromString( ResponseObject.responseText,
"text/html" );

I have a test case I will attach.
Keywords: oeone
Nominating for mozilla1.0. I'm not entirely certain how ugly the Penzilla hack
was to get around this problem. Asking MikeP to comment on the importance of
this bug for mozilla1.0.
Keywords: mozilla1.0
The hack for this was pretty ugly, having to do a bunch of string manipulations
in order to get out the parts that I need.
It would be really nice to have this work for html text, I can see a lot of
benefit in having it.
Priority: -- → P4
Marking P3, not critical for 1.0 ("nice to have").
Priority: P4 → P3
Severity: normal → enhancement
OS: Linux → All
Hardware: PC → All
Mass-reassigning bugs to dom_bugs@netscape.com
Assignee: jst → dom_bugs
To get around this bug I am using a simple C++ XPCOM object that forwards the HTML string to the "Tidy HTML" library and produces XHTML or XML.
This way I can handle most html documents (those that are recoverable and tidy html can handle).

This way the HTML can be parsed with the normal DOMParser or any other XML/XHTML parser, e.g.:
var html = ... // html string
const cid = "@mipis.hr/HtmlHelpers/HtmlToXml;1"
var obj = Components.classes[cid].createInstance().QueryInterface(Components.interfaces.IHtmlToXml);
var pureHtml = obj.convertHtmlToXml(html);
// ... use as normal XHTML from here on
(In reply to comment #7)
> To get around this bug I am using a simple C++ XPCOM object that simply
> forwards the html string to "Tidy HTML" library and produces XHTML or XML.
> This way I can handle most html documents (those that are recoverable and tidy
> html can handle).

Neat idea. Would you be willing to share the implementation of your component?
> Neat idea. Would you be willing to share the implementation of your component?
> 
Sure - download from:
http://multiterra.mooload.com/file.php?file=files/210306/1142980898/htmltoxml.rar
(free file host - I couldn't attach it here)

To compile, use the MS VS workspace (change the "additional includes/libraries" settings to point to your Gecko 1.7 SDK).

To install copy the IHtmlToXml.xpt and HtmlToXml.dll to the <firefox install>\components. "Touch" (open/save) the <firefox install>\.autoreg.

Test case is attached.

The extension was made by following the tutorial at http://www.iosart.com/firefox/xpcom/.
See the page if you have any problems.

Disclaimer: this is my first attempt to make something for Mozilla - I have no experience with Mozilla whatsoever (not counting the last ~5 days! :)). It seems to work, but I can make no guarantees. It works under Windows, but Tidy HTML is pretty portable, so it should be easy to make it work on other platforms - see the project page at http://tidy.sourceforge.net/.
Pretty cool. I can imagine it would be useful to have such a component in the Mozilla/Firefox core. There are lots of other uses for a "Tidy HTML" library (rendering for visually impaired users maybe).
Patching in HTML Tidy is fine, but the point of fixing this bug is to gain access to Mozilla's rendering and HTML digestion, which is vastly superior to the parsing in HTML Tidy. The TagSoup engine seems to be commonly regarded as better than HTML Tidy, yet it still seems inferior to what Mozilla does; the complex examples I've been working with render far better in their original HTML than in the XHTML that TagSoup produces.
In a thread in mdt.layout <http://groups.google.com/group/mozilla.dev.tech.layout/browse_frm/thread/225287c6d3098ca5/4698d4007ab27e1f#4698d4007ab27e1f>, bzbarsky said:

> So as I see it, the steps to get this working are:
> 
> 1)  Decide what the problem we're solving is.  Specifically, how should 
> noscript, noframes, and such be parsed in these documents?  Keep in mind 
> that depending on user settings (like whether script is enabled) we 
> create different DOMs from the same source.
> 
> 2)  Decide what the plan is for charsets (currently we depend on having 
> a docshell to handle charset autodetect and in some cases <meta> tags, 
> because we have to throw away the document and reparse).
> 
> 3)  Go through the HTML content sink and HTML document, and make sure 
> all the places that use the docshell or window can survive without one.
> 
> 4)  Do whatever we decided to do for charsets.
> 
> 5)  Make DOMParser parse HTML.
> 
>> 1. Will things get better in Gecko 1.9/Firefox 3 (i.e. are there 
>> concrete plans or promising developments in this area)?
> 
> I'm not aware of significant changes in this area since 1.8, and I'm not 
> sure anyone is working on this actively.  I strongly suspect that given 
> our existing code, once item #1 above is sorted out handling item #3 and 
> item #5 should not be that bad -- a few days work at most.  Items #2 and 
> #4 I'm really not sure about; I guess in large part it depends on what 
> we decide to do about #2.

Given the value of doing this for Firefox's microsummary service, not to mention extension authors and other Gecko consumers, it seems worthy of prioritization for 1.9, so requesting blocking1.9.
Flags: blocking1.9?
If this bug doesn't get fixed in Gecko 1.9, then would it make sense for us to turn the MicrosummaryResource JS object, which is what currently parses HTML for the microsummary service, into an XPCOM component usable by other code?
Flags: blocking1.9? → blocking1.9-
Whiteboard: [wanted-1.9]
Bug 361449 contains a patch that I believe fixes this bug.
Attached patch WIP -- it's alive (obsolete) — Splinter Review
Assignee: general → sayrer
Status: NEW → ASSIGNED
Blocks: 370946
> 
> > So as I see it, the steps to get this working are:
> > 
> > 1)  Decide what the problem we're solving is.  Specifically, how should 
> > noscript, noframes, and such be parsed in these documents?  Keep in mind 
> > that depending on user settings (like whether script is enabled) we 
> > create different DOMs from the same source.

nsXMLDocument contains a useful concept called "LoadedAsData" that makes these decisions for XHTML and other XML dialects, and disables layout, the CSSLoader, scripts, etc. I am content to mirror its decisions by hoisting that code into nsDocument. Users that need a live DOM with scripts and CSS can always use an iframe. Perhaps we can revisit the CSS capabilities after the initial landing.

> > 
> > 2)  Decide what the plan is for charsets (currently we depend on having 
> > a docshell to handle charset autodetect and in some cases <meta> tags, 
> > because we have to throw away the document and reparse).

Charset detection should be identical to the shipping browser, so we'll have to figure out a way to restart.

> > 
> > 3)  Go through the HTML content sink and HTML document, and make sure 
> > all the places that use the docshell or window can survive without one.
> > 

This hasn't been too hard.

> > 4)  Do whatever we decided to do for charsets.
> > 

So, we have a large block of code in the HTML content sink that tries different ways of getting a charset from the docshell and other sources. There are some situations where we're going to want to ignore the usual hints. For instance, parseFromString input is always going to be Unicode, so we should ignore meta tags in that case.

Attached patch cleaned up, still WIP (obsolete) — Splinter Review
Attachment #258028 - Attachment is obsolete: true
Attached patch updated patch (obsolete) — Splinter Review
updated to tip, unit tests coming up next
Attachment #258050 - Attachment is obsolete: true
updated to trunk, again. This patch has nsDOMParser implement nsIWebShellServices... I think that will be sufficient to get a callback from chardet and start a reparse.

I also added unit testing for the parser from XPC Shell. It doesn't check the results yet, but it does successfully parse all of the test cases from htmlparser, except for the empty string, which seems to fool nsStringStream into thinking it isn't buffered and triggers an assertion.
Attachment #262013 - Attachment is obsolete: true
(In reply to comment #19)
> which seems to fool nsStringStream into thinking it isn't buffered and
> triggers an assertion.

That's bug 368328, for what it's worth.
Attached patch WIP - updated patch (obsolete) — Splinter Review
unit tests failing for the cases we would expect: script, frames, style
Attachment #264537 - Attachment is obsolete: true
Blocks: 297903
Last steps: 

* add tests for noscript
* populate the pseudo-docshell methods I've added to nsDOMParser. These will let us restart parsing based on charsets in meta tags.
Attachment #264619 - Attachment is obsolete: true
Comment on attachment 270241 [details] [diff] [review]
passing all 181 WHAT-WG HTML parser tests that our browser does

>Index: tools/test-harness/xpcshell-simple/test_all.sh
>-    NATIVE_TOPSRCDIR="$native_topsrcdir" TOPSRCDIR="$topsrcdir" $xpcshell -s $headfiles -f $t $tailfiles 2> $t.log 1>&2
>+    NATIVE_TOPSRCDIR="$native_topsrcdir" TOPSRCDIR="$topsrcdir" $xpcshell -v 170 -s $headfiles -f $t $tailfiles 2> $t.log 1>&2

What's the -v 170 for?
(In reply to comment #23)
> 
> What's the -v 170 for?

Turns on generators and such in xpcshell. Or did we land a patch to always use the newest JS version?

jsshell does (see bug 385159), but I don't know whether xpcshell does.
Is someone still working on this or is it dead again?
Flags: wanted1.9+
Whiteboard: [wanted-1.9]
Requesting wanted-1.9.1 with P1 per 1.9.1 Tracking. Fixing this bug would be vital to anyone doing web scraping. Given some of the changes going into Thunderbird 3, I think this would help that product's extensions in some key areas, and I would like to see this fixed before the release of said product.
Flags: wanted1.9.1?
Priority: P3 → P1
wanted1.9.1+.  Sayre, are you still working on this?
Flags: wanted1.9.1? → wanted1.9.1+
Attached patch Unbitrotted patch (obsolete) — Splinter Review
This is a patch (derived from the 1.9.0.* CVS branch) that I use locally to get the DOM parser parsing HTML. I have no idea if this is correct in all cases, only that it works for my purposes.
(In reply to comment #29)
> Created an attachment (id=325352) [details]
> Unbitrotted patch
> 
> This is a patch (derived from the 1.9.0.* CVS branch) that I locally use to get
> the dom parser parsing HTML. I have no idea if this is correct in all cases,
> only that it works for my purposes.
> 

The original patch has tests.
Why not borrow from DOM range's createContextualFragment()?
createContextualFragment doesn't behave the same way as the normal HTML parser, for what it's worth.  It also _always_ operates in the context of a docshell, and _never_ in the context of a document, neither of which is the case for DOMParser.
As mentioned on public-html, it would be natural for DOMParser text/html behavior to match XHR2 text/html behavior wrt. noscript, noframes, etc. Does this patch make things match the current XHR2 draft?
(In reply to comment #33)
> As mentioned on public-html, it would be natural for DOMParser text/html
> behavior to match XHR2 text/html behavior wrt. noscript, noframes, etc.. Does
> this patch make things match the current XHR2 draft?

I thought XHR2 specified <noscript> parsing behavior. No request for this feature has ever asked for that, so I think XHR2 is broken.
To underscore what Joshua mentioned, this functionality is also important for Ubiquity, both in terms of enabling non-strict xhtml command feeds (ease of development) as well as page parsing without having to load iframes in hidden windows.
QA Contact: stummala → general
Attached patch unbitrotted (obsolete) — Splinter Review
doesn't really work anymore, but applies and builds. I think I'm going to break nsHTMLDocument::StartDocumentLoad into pieces before I try again.
Attachment #270241 - Attachment is obsolete: true
Attachment #325352 - Attachment is obsolete: true
Attachment #402676 - Attachment is patch: true
Attachment #402676 - Attachment mime type: application/octet-stream → text/plain
Depends on: 518711
Depends on: 518860
Just adding [jetpack] to the whiteboard, as this will be very useful to have in Jetpack.
Whiteboard: [jetpack]
Blocks: 543856
No longer blocks: 543856
Component: DOM: Abstract Schemas → DOM: Mozilla Extensions
Fully functional workaround: https://gist.github.com/1129031
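A minimal sketch of the idea behind that gist (assumptions: a browser or chrome context where DOMParser and document.implementation exist; the function name parseHTMLString is mine, and the actual gist patches DOMParser.prototype.parseFromString in place rather than wrapping it):

```javascript
function parseHTMLString(markup) {
  // Engines with this bug fixed (Gecko 12+) accept "text/html" directly.
  try {
    return new DOMParser().parseFromString(markup, "text/html");
  } catch (e) {
    // Older Gecko throws NS_ERROR_NOT_IMPLEMENTED for "text/html";
    // fall back to an inert document plus innerHTML, as the gist does.
    var doc = document.implementation.createHTMLDocument("");
    doc.documentElement.innerHTML = markup;
    return doc;
  }
}
```

Either path yields an inert HTML document: scripts are not executed and stylesheets are not loaded.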
Assignee: sayrer → hsivonen
Depends on: 651072
Priority: P1 → --
Blocks: 650787
Depends on: 714777
I'm not planning to add support for HTML to methods that aren't exposed to Web content.
Attachment #402676 - Attachment is obsolete: true
Attachment #589852 - Flags: review?(bugs)
Comment on attachment 589852 [details] [diff] [review]
Support text/html in DOMParser.parseFromString()


>+    // XXX why do we set these again?
>+    // Make sure to give this document the right base URI
>+    document->SetBaseURI(mBaseURI);
>+    // And the right principal
>+    document->SetPrincipal(mPrincipal);
There is the comment about "Here we have to cheat a little..."
Or is your XXX about something else?

> HTMLContentSink::ProcessLINKTag(const nsIParserNode& aNode)
> {
>-  nsresult  result = NS_OK;
>-
>-  if (mCurrentContext) {
>-    // Create content object
>-    nsCOMPtr<nsIContent> element;
>-    nsCOMPtr<nsINodeInfo> nodeInfo;
>-    nodeInfo = mNodeInfoManager->GetNodeInfo(nsGkAtoms::link, nsnull,
>-                                             kNameSpaceID_XHTML,
>-                                             nsIDOMNode::ELEMENT_NODE);
>-
>-    result = NS_NewHTMLElement(getter_AddRefs(element), nodeInfo.forget(),
>-                               NOT_FROM_PARSER);
>-    NS_ENSURE_SUCCESS(result, result);
>-
>-    nsCOMPtr<nsIStyleSheetLinkingElement> ssle(do_QueryInterface(element));
>-
>-    if (ssle) {
>-      // XXX need prefs. check here.
>-      if (!mInsideNoXXXTag) {
>-        ssle->InitStyleLinkElement(false);
>-        ssle->SetEnableUpdates(false);
>-      } else {
>-        ssle->InitStyleLinkElement(true);
>-      }
>-    }
>-
>-    // Add in the attributes and add the style content object to the
>-    // head container.
>-    result = AddAttributes(aNode, element);
>-    if (NS_FAILED(result)) {
>-      return result;
>-    }
>-
>-    mCurrentContext->AddLeaf(element); // <link>s are leaves
>-
>-    if (ssle) {
>-      ssle->SetEnableUpdates(true);
>-      bool willNotify;
>-      bool isAlternate;
>-      result = ssle->UpdateStyleSheet(mFragmentMode ? nsnull : this,
>-                                      &willNotify,
>-                                      &isAlternate);
>-      if (NS_SUCCEEDED(result) && willNotify && !isAlternate && !mFragmentMode) {
>-        ++mPendingSheetCount;
>-        mScriptLoader->AddExecuteBlocker();
>-      }
>-
>-      // look for <link rel="next" href="url">
>-      nsAutoString relVal;
>-      element->GetAttr(kNameSpaceID_None, nsGkAtoms::rel, relVal);
>-      if (!relVal.IsEmpty()) {
>-        PRUint32 linkTypes = nsStyleLinkElement::ParseLinkTypes(relVal);
>-        bool hasPrefetch = linkTypes & PREFETCH;
>-        if (hasPrefetch || (linkTypes & NEXT)) {
>-          nsAutoString hrefVal;
>-          element->GetAttr(kNameSpaceID_None, nsGkAtoms::href, hrefVal);
>-          if (!hrefVal.IsEmpty()) {
>-            PrefetchHref(hrefVal, element, hasPrefetch);
>-          }
>-        }
>-        if (linkTypes & DNS_PREFETCH) {
>-          nsAutoString hrefVal;
>-          element->GetAttr(kNameSpaceID_None, nsGkAtoms::href, hrefVal);
>-          if (!hrefVal.IsEmpty()) {
>-            PrefetchDNS(hrefVal);
>-          }
>-        }
>-      }
>-    }
>-  }
>-
>-  return result;
>+  MOZ_NOT_REACHED("Old HTMLContentSink used for processing links.");
>+  return NS_ERROR_NOT_IMPLEMENTED;
> }
What is this stuff about?



> HTMLContentSink::ProcessSTYLEEndTag(nsGenericHTMLElement* content)
> {
>-  nsCOMPtr<nsIStyleSheetLinkingElement> ssle = do_QueryInterface(content);
>-
>-  NS_ASSERTION(ssle,
>-               "html:style doesn't implement nsIStyleSheetLinkingElement");
>-
>-  nsresult rv = NS_OK;
>-
>-  if (ssle) {
>-    // Note: if we are inside a noXXX tag, then we init'ed this style element
>-    // with mDontLoadStyle = true, so these two calls will have no effect.
>-    ssle->SetEnableUpdates(true);
>-    bool willNotify;
>-    bool isAlternate;
>-    rv = ssle->UpdateStyleSheet(mFragmentMode ? nsnull : this,
>-                                &willNotify,
>-                                &isAlternate);
>-    if (NS_SUCCEEDED(rv) && willNotify && !isAlternate && !mFragmentMode) {
>-      ++mPendingSheetCount;
>-      mScriptLoader->AddExecuteBlocker();
>-    }
>-  }
>-
>-  return rv;
>+  MOZ_NOT_REACHED("Old HTMLContentSink used for processing style elements.");
>+  return NS_ERROR_NOT_IMPLEMENTED;
And this.
Attachment #589852 - Flags: review?(bugs) → review-
Comment on attachment 589852 [details] [diff] [review]
Support text/html in DOMParser.parseFromString()

(In reply to Olli Pettay [:smaug] from comment #40)
> >+    // XXX why do we set these again?
> >+    // Make sure to give this document the right base URI
> >+    document->SetBaseURI(mBaseURI);
> >+    // And the right principal
> >+    document->SetPrincipal(mPrincipal);
> There is the comment about "Here we have to cheat a little..."
> Or is you XXX about something else?

The XXX is about setting base URI and principal again even though they were passed to nsContentSink::CreateDocument. I'm just imitating the XML code path here. Dunno why things are done that way.

> >-  return result;
> >+  MOZ_NOT_REACHED("Old HTMLContentSink used for processing links.");
> >+  return NS_ERROR_NOT_IMPLEMENTED;
> > }
> What is this stuff about?
...
> >-  return rv;
> >+  MOZ_NOT_REACHED("Old HTMLContentSink used for processing style elements.");
> >+  return NS_ERROR_NOT_IMPLEMENTED;
> And this.

Most of the old HTML content sink is now dead code. The changes to the superclass would have required tweaks to the dead code, so I figured it makes more sense to remove the affected parts of the dead code than to tweak dead code.

Setting back to review?, since your comment contained questions rather than comments obviously requiring patch changes.
Attachment #589852 - Flags: review- → review?(bugs)
Comment on attachment 589852 [details] [diff] [review]
Support text/html in DOMParser.parseFromString()


>+
>+    // XXX why do we set these again?
Remove this, but add perhaps a comment here and in XML
document creation that both should do the same thing.

OK, explanations received on IRC.
Attachment #589852 - Flags: review?(bugs) → review+
(In reply to Olli Pettay [:smaug] from comment #42)
> Remove this, but add perhaps a comment here and in XML
> document creation that both should do the same thing.

Done. Thanks.

https://hg.mozilla.org/integration/mozilla-inbound/rev/10db8f79cd65
https://hg.mozilla.org/mozilla-central/rev/10db8f79cd65
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla12
This page says that text/html is not supported, which is no longer true:

https://developer.mozilla.org/en/DOM/DOMParser
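With the fix landed in mozilla12, the call from comment 0 works without any workaround. A minimal sketch (assuming a context where DOMParser is available; the wrapper name parseHTML is mine):

```javascript
function parseHTML(markup) {
  // "text/html" is now accepted alongside "text/xml" and "application/xml";
  // the parser fills in html/head/body as needed and returns an inert
  // document (scripts do not run, stylesheets do not load).
  return new DOMParser().parseFromString(markup, "text/html");
}

// e.g. parseHTML("<p>Hello <b>world</b></p>").body.textContent
```

Unlike the createContextualFragment approach discussed in comment 31, this runs the real HTML parsing algorithm end to end with no docshell required.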
Keywords: dev-doc-needed
No longer depends on: 728093
Depends on: CVE-2012-3975
Component: DOM: Mozilla Extensions → DOM
Component: DOM → DOM: Core & HTML