
Implement text/html for @mozilla.org/xmlextras/domparser

RESOLVED FIXED in mozilla12

Status


Core
DOM
--
enhancement
RESOLVED FIXED
16 years ago
4 years ago

People

(Reporter: Mike Potter, Assigned: hsivonen)

Tracking

(Depends on: 1 bug, Blocks: 2 bugs, {dev-doc-complete})

Trunk
mozilla12
dev-doc-complete
Points:
---
Dependency tree / graph
Bug Flags:
wanted1.9.1 +
blocking1.9 -
wanted1.9 +

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [jetpack])

Attachments

(2 attachments, 8 obsolete attachments)

(Reporter)

Description

16 years ago
Having a string of HTML, I need to be able to create an HTML DOM document and
walk through the HTML via the DOM.
The following works for XML:

var domParser =
Components.classes["@mozilla.org/xmlextras/domparser;1"].getService(
Components.interfaces.nsIDOMParser );
     
var xmlDocument = domParser.parseFromString( ResponseObject.responseText,
"text/xml" );

but the same call fails with an error
Error: [Exception... "Component returned failure code: 0x80004001
(NS_ERROR_NOT_IMPLEMENTED) [nsIDOMParser.parseFromString]"  nsresult:
"0x80004001 (NS_ERROR_NOT_IMPLEMENTED)"  location: "JS frame ::
chrome://personalpage/content/HtmlTest.xul :: onResponse :: line 17"  data: no]
Source File: chrome://personalpage/content/HtmlTest.xul
Line: 17

when "text/html" is used, like this:

var domParser =
Components.classes["@mozilla.org/xmlextras/domparser;1"].getService(
Components.interfaces.nsIDOMParser );
     
var xmlDocument = domParser.parseFromString( ResponseObject.responseText,
"text/html" );

I have a test case I will attach.
(Reporter)

Comment 1

16 years ago
Created attachment 51701 [details]
A Sample XUL file which produces the error. You need to run this from a chrome:// URL
(Reporter)

Updated

16 years ago
Keywords: oeone

Comment 2

16 years ago
Nominating for mozilla1.0. I'm not entirely certain how ugly the Penzilla hack
was to get around this problem. Asking MikeP to comment on the importance of
this bug for mozilla1.0.
Keywords: mozilla1.0
(Reporter)

Comment 3

16 years ago
The hack for this was pretty ugly: it involved a bunch of string manipulations
in order to get out the parts that I need.
It would be really nice to have this work for HTML text; I can see a lot of
benefit in having it.

Updated

16 years ago
Priority: -- → P4

Comment 4

16 years ago
Marking P3, not critical for 1.0 ("nice to have").
Priority: P4 → P3
Severity: normal → enhancement
OS: Linux → All
Hardware: PC → All
Mass-reassigning bugs to dom_bugs@netscape.com
Assignee: jst → dom_bugs

Comment 6

12 years ago
In this newsgroup thread, bz explained why this is hard:
http://groups-beta.google.com/group/netscape.public.mozilla.dom/browse_thread/thread/126739f8904f94d9/7c5cd3d4ad884074

Comment 7

12 years ago
To get around this bug I am using a simple C++ XPCOM object that simply forwards the html string to "Tidy HTML" library and produces XHTML or XML.
This way I can handle most html documents (those that are recoverable and tidy html can handle).

This way the HTML can be parsed with the normal DOMParser or any other XML/XHTML parser, e.g.:

var html = ...; // html string
const cid = "@mipis.hr/HtmlHelpers/HtmlToXml;1";
var obj = Components.classes[cid].createInstance().QueryInterface(Components.interfaces.IHtmlToXml);
var pureHtml = obj.convertHtmlToXml(html);
// ... use as normal XHTML from here on

Comment 8

12 years ago
(In reply to comment #7)
> To get around this bug I am using a simple C++ XPCOM object that simply
> forwards the html string to "Tidy HTML" library and produces XHTML or XML.
> This way I can handle most html documents (those that are recoverable and tidy
> html can handle).

Neat idea. Would you be willing to share the implementation of your component?

Comment 9

12 years ago
> Neat idea. Would you be willing to share the implementation of your component?
> 
Sure - download from:
http://multiterra.mooload.com/file.php?file=files/210306/1142980898/htmltoxml.rar
(free file host - I couldn't attach it here)

To compile, use the MS VS workspace (change the "additional includes/libraries" settings to point to your Gecko 1.7 SDK).

To install, copy the IHtmlToXml.xpt and HtmlToXml.dll to <firefox install>\components, then "touch" (open/save) <firefox install>\.autoreg.

Test case is attached.

The extension was made by following the tutorial at http://www.iosart.com/firefox/xpcom/.
See the page if you have any problems.

Disclaimer: this is my first attempt to make something for Mozilla - I have no experience with Mozilla whatsoever (not counting the last ~5 days! :)). It seems to work, but I can make no guarantees. It works under Windows, but Tidy HTML is pretty portable, so it should be easy to make it work on other platforms - see the project page at http://tidy.sourceforge.net/.

Comment 10

12 years ago
Pretty cool. I can imagine it would be useful to have such a component in the Mozilla/Firefox core. There are lots of other uses for a "Tidy HTML" library (rendering for visually impaired users maybe).

Comment 11

12 years ago
Patching in HTML Tidy is fine, but the point of fixing this bug is to gain access to Mozilla's rendering and HTML digestion, which is vastly superior to the parsing in HTML Tidy. The TagSoup engine seems to be commonly regarded as better than HTML Tidy, and even it seems to be inferior to what Mozilla does; the complex examples I've been working with render far better in their original HTML than in the XHTML that TagSoup produces.
In a thread in mdt.layout <http://groups.google.com/group/mozilla.dev.tech.layout/browse_frm/thread/225287c6d3098ca5/4698d4007ab27e1f#4698d4007ab27e1f>, bzbarsky said:

> So as I see it, the steps to get this working are:
> 
> 1)  Decide what the problem we're solving is.  Specifically, how should 
> noscript, noframes, and such be parsed in these documents?  Keep in mind 
> that depending on user settings (like whether script is enabled) we 
> create different DOMs from the same source.
> 
> 2)  Decide what the plan is for charsets (currently we depend on having 
> a docshell to handle charset autodetect and in some cases <meta> tags, 
> because we have to throw away the document and reparse).
> 
> 3)  Go through the HTML content sink and HTML document, and make sure 
> all the places that use the docshell or window can survive without one.
> 
> 4)  Do whatever we decided to do for charsets.
> 
> 5)  Make DOMParser parse HTML.
> 
>> 1. Will things get better in Gecko 1.9/Firefox 3 (i.e. are there 
>> concrete plans or promising developments in this area)?
> 
> I'm not aware of significant changes in this area since 1.8, and I'm not 
> sure anyone is working on this actively.  I strongly suspect that given 
> our existing code, once item #1 above is sorted out handling item #3 and 
> item #5 should not be that bad -- a few days work at most.  Items #2 and 
> #4 I'm really not sure about; I guess in large part it depends on what 
> we decide to do about #2.

Given the value of doing this for Firefox's microsummary service, not to mention extension authors and other Gecko consumers, it seems worthy of prioritization for 1.9, so requesting blocking1.9.
Flags: blocking1.9?
Blocks: 361449
If this bug doesn't get fixed in Gecko 1.9, then would it make sense for us to turn the MicrosummaryResource JS object, which is what currently parses HTML for the microsummary service, into an XPCOM component usable by other code?
Flags: blocking1.9? → blocking1.9-
Whiteboard: [wanted-1.9]

Comment 14

11 years ago
Bug 361449 contains a patch that I believe fixes this bug.

Comment 15

11 years ago
Created attachment 258028 [details] [diff] [review]
WIP -- it's alive
Assignee: general → sayrer
Status: NEW → ASSIGNED

Updated

11 years ago
Blocks: 370946

Comment 16

11 years ago
> 
> > So as I see it, the steps to get this working are:
> > 
> > 1)  Decide what the problem we're solving is.  Specifically, how should 
> > noscript, noframes, and such be parsed in these documents?  Keep in mind 
> > that depending on user settings (like whether script is enabled) we 
> > create different DOMs from the same source.

nsXMLDocument contains a useful concept called "LoadedAsData" that makes these decisions for XHTML and other XML dialects, and disables layout, CSSLoader, scripts, etc. I am content to mirror its decisions by hoisting that code into nsDocument. Users that need a live DOM with scripts and CSS can always use an iframe. Perhaps we can revisit the CSS capabilities after the initial landing.

> > 
> > 2)  Decide what the plan is for charsets (currently we depend on having 
> > a docshell to handle charset autodetect and in some cases <meta> tags, 
> > because we have to throw away the document and reparse).

Charset detection should be identical to the shipping browser, so we'll have to figure out a way to restart.

> > 
> > 3)  Go through the HTML content sink and HTML document, and make sure 
> > all the places that use the docshell or window can survive without one.
> > 

This hasn't been too hard.

> > 4)  Do whatever we decided to do for charsets.
> > 

So, we have a large block of code in the HTML content sink that tries different ways of getting a charset from the docshell and other sources. There are some situations where we're going to want to ignore the usual hints. For instance, parseFromString input is always going to be Unicode, so we should ignore meta tags in that case.
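This charset point carries over to the eventual web-exposed API: parseFromString receives an already-decoded string, so a <meta charset> inside it is inert. A minimal browser-side sketch (assuming the post-fix, standard DOMParser):

```javascript
// Browser-context sketch: the input is already a decoded JS string,
// so the meta charset declaration below cannot trigger a re-decode.
var markup = '<meta charset="ISO-8859-2"><p>caf\u00e9</p>';
var doc = new DOMParser().parseFromString(markup, "text/html");
// The paragraph holds exactly the characters of the input string.
console.log(doc.querySelector("p").textContent === "caf\u00e9"); // true
```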

Comment 17

11 years ago
Created attachment 258050 [details] [diff] [review]
cleaned up, still WIP
Attachment #258028 - Attachment is obsolete: true

Comment 18

10 years ago
Created attachment 262013 [details] [diff] [review]
updated patch

updated to tip, unit tests coming up next
Attachment #258050 - Attachment is obsolete: true

Comment 19

10 years ago
Created attachment 264537 [details] [diff] [review]
WIP -- add unit tests and nsIWebShellServices

updated to trunk, again. This patch has nsDOMParser implement nsIWebShellServices... I think that will be sufficient to get a callback from chardet and start a reparse.

I also added unit testing for the parser from XPC Shell. It doesn't check the results yet, but it does successfully parse all of the test cases from htmlparser, except for the empty string, which seems to fool nsStringStream into thinking it isn't buffered and triggers an assertion.
Attachment #262013 - Attachment is obsolete: true
(In reply to comment #19)
> which seems to fool nsStringStream into thinking it isn't buffered and
> triggers an assertion.

That's bug 368328, for what it's worth.

Comment 21

10 years ago
Created attachment 264619 [details] [diff] [review]
WIP - updated patch

unit tests failing for the cases we would expect: script, frames, style
Attachment #264537 - Attachment is obsolete: true

Updated

10 years ago
Blocks: 297903

Comment 22

10 years ago
Created attachment 270241 [details] [diff] [review]
passing all 181 WHAT-WG HTML parser tests that our browser does

Last steps: 

* add tests for noscript
* populate the pseudo-docshell methods I've added to nsDOMParser. These will let us restart parsing based on charsets in meta tags.
Attachment #264619 - Attachment is obsolete: true
Comment on attachment 270241 [details] [diff] [review]
passing all 181 WHAT-WG HTML parser tests that our browser does

>Index: tools/test-harness/xpcshell-simple/test_all.sh
>-    NATIVE_TOPSRCDIR="$native_topsrcdir" TOPSRCDIR="$topsrcdir" $xpcshell -s $headfiles -f $t $tailfiles 2> $t.log 1>&2
>+    NATIVE_TOPSRCDIR="$native_topsrcdir" TOPSRCDIR="$topsrcdir" $xpcshell -v 170 -s $headfiles -f $t $tailfiles 2> $t.log 1>&2

What's the -v 170 for?

Comment 24

10 years ago
(In reply to comment #23)
> 
> What's the -v 170 for?

Turns on generators and such in xpcshell. Or did we land a patch to always use the newest JS version?

Comment 25

10 years ago
jsshell does (see bug 385159), but I don't know whether xpcshell does.

Comment 26

10 years ago
Is someone still working on this or is it dead again?
Flags: wanted1.9+
Whiteboard: [wanted-1.9]
Requesting wanted-1.9.1 with P1 per 1.9.1 Tracking. Fixing this bug would be vital to anyone doing web scraping. Given some of the changes going into Thunderbird 3, I think this would help that product's extensions in some key areas, and I would like to see this fixed before its release.
Flags: wanted1.9.1?
Priority: P3 → P1
wanted1.9.1+.  Sayre, are you still working on this?
Flags: wanted1.9.1? → wanted1.9.1+
Created attachment 325352 [details] [diff] [review]
Unbitrotted patch

This is a patch (derived from the 1.9.0.* CVS branch) that I locally use to get the dom parser parsing HTML. I have no idea if this is correct in all cases, only that it works for my purposes.

Comment 30

9 years ago
(In reply to comment #29)
> Created an attachment (id=325352) [details]
> Unbitrotted patch
> 
> This is a patch (derived from the 1.9.0.* CVS branch) that I locally use to get
> the dom parser parsing HTML. I have no idea if this is correct in all cases,
> only that it works for my purposes.
> 

The original patch has tests.

Comment 31

9 years ago
Why not borrow from DOM:range.createContextualFragment()?

Comment 32

9 years ago
createContextualFragment doesn't behave the same way as the normal HTML parser, for what it's worth.  It also _always_ operates in the context of a docshell, and _never_ in the context of a document, neither of which is the case for DOMParser.
As mentioned on public-html, it would be natural for DOMParser text/html behavior to match XHR2 text/html behavior with respect to noscript, noframes, etc. Does this patch make things match the current XHR2 draft?

Comment 34

9 years ago
(In reply to comment #33)
> As mentioned on public-html, it would be natural for DOMParser text/html
> behavior to match XHR2 text/html behavior wrt. noscript, noframes, etc.. Does
> this patch make things match the current XHR2 draft?

I thought XHR2 specified <noscript> parsing behavior. No request for this feature has ever asked for that, so I think XHR2 is broken.

Comment 35

9 years ago
To underscore what Joshua mentioned, this functionality is also important for Ubiquity, both in terms of enabling non-strict xhtml command feeds (ease of development) as well as page parsing without having to load iframes in hidden windows.
QA Contact: stummala → general

Comment 36

8 years ago
Created attachment 402676 [details] [diff] [review]
unbitrotted

doesn't really work anymore, but applies and builds. I think I'm going to break nsHTMLDocument::StartDocumentLoad into pieces before I try again.
Attachment #270241 - Attachment is obsolete: true
Attachment #325352 - Attachment is obsolete: true
Attachment #402676 - Attachment is patch: true
Attachment #402676 - Attachment mime type: application/octet-stream → text/plain

Updated

8 years ago
Depends on: 518711

Updated

8 years ago
Depends on: 518860

Comment 37

8 years ago
Just adding [jetpack] to the whiteboard, as this will be very useful to have in Jetpack.
Whiteboard: [jetpack]

Updated

8 years ago
Blocks: 543856
No longer blocks: 543856

Updated

7 years ago
Component: DOM: Abstract Schemas → DOM: Mozilla Extensions
Blocks: 596002

Comment 38

6 years ago
Fully functional workaround: https://gist.github.com/1129031
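The linked gist replaces DOMParser.prototype.parseFromString in browsers lacking native text/html support. A sketch of the general technique such workarounds rely on (not the gist's exact code) uses document.implementation.createHTMLDocument:

```javascript
// Sketch of the common pre-fix workaround: build a detached HTML
// document and let innerHTML do the parsing. The resulting document
// has no browsing context, so scripts in the markup stay inert.
function parseHTML(markup) {
  var doc = document.implementation.createHTMLDocument("");
  if (/<(!doctype|html)/i.test(markup)) {
    doc.documentElement.innerHTML = markup;
  } else {
    doc.body.innerHTML = markup;
  }
  return doc;
}

var doc = parseHTML("<p>hello</p>");
// doc.body.firstChild is now a <p> element
```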

Updated

6 years ago
Assignee: sayrer → hsivonen
Depends on: 651072
Priority: P1 → --
Blocks: 650787
Depends on: 714777
Created attachment 589852 [details] [diff] [review]
Support text/html in DOMParser.parseFromString()

I'm not planning to add support for HTML to methods that aren't exposed to Web content.
Attachment #402676 - Attachment is obsolete: true
Attachment #589852 - Flags: review?(bugs)
Comment on attachment 589852 [details] [diff] [review]
Support text/html in DOMParser.parseFromString()


>+    // XXX why do we set these again?
>+    // Make sure to give this document the right base URI
>+    document->SetBaseURI(mBaseURI);
>+    // And the right principal
>+    document->SetPrincipal(mPrincipal);
There is the comment about "Here we have to cheat a little..."
Or is your XXX about something else?

> HTMLContentSink::ProcessLINKTag(const nsIParserNode& aNode)
> {
>-  nsresult  result = NS_OK;
>-
>-  if (mCurrentContext) {
>-    // Create content object
>-    nsCOMPtr<nsIContent> element;
>-    nsCOMPtr<nsINodeInfo> nodeInfo;
>-    nodeInfo = mNodeInfoManager->GetNodeInfo(nsGkAtoms::link, nsnull,
>-                                             kNameSpaceID_XHTML,
>-                                             nsIDOMNode::ELEMENT_NODE);
>-
>-    result = NS_NewHTMLElement(getter_AddRefs(element), nodeInfo.forget(),
>-                               NOT_FROM_PARSER);
>-    NS_ENSURE_SUCCESS(result, result);
>-
>-    nsCOMPtr<nsIStyleSheetLinkingElement> ssle(do_QueryInterface(element));
>-
>-    if (ssle) {
>-      // XXX need prefs. check here.
>-      if (!mInsideNoXXXTag) {
>-        ssle->InitStyleLinkElement(false);
>-        ssle->SetEnableUpdates(false);
>-      } else {
>-        ssle->InitStyleLinkElement(true);
>-      }
>-    }
>-
>-    // Add in the attributes and add the style content object to the
>-    // head container.
>-    result = AddAttributes(aNode, element);
>-    if (NS_FAILED(result)) {
>-      return result;
>-    }
>-
>-    mCurrentContext->AddLeaf(element); // <link>s are leaves
>-
>-    if (ssle) {
>-      ssle->SetEnableUpdates(true);
>-      bool willNotify;
>-      bool isAlternate;
>-      result = ssle->UpdateStyleSheet(mFragmentMode ? nsnull : this,
>-                                      &willNotify,
>-                                      &isAlternate);
>-      if (NS_SUCCEEDED(result) && willNotify && !isAlternate && !mFragmentMode) {
>-        ++mPendingSheetCount;
>-        mScriptLoader->AddExecuteBlocker();
>-      }
>-
>-      // look for <link rel="next" href="url">
>-      nsAutoString relVal;
>-      element->GetAttr(kNameSpaceID_None, nsGkAtoms::rel, relVal);
>-      if (!relVal.IsEmpty()) {
>-        PRUint32 linkTypes = nsStyleLinkElement::ParseLinkTypes(relVal);
>-        bool hasPrefetch = linkTypes & PREFETCH;
>-        if (hasPrefetch || (linkTypes & NEXT)) {
>-          nsAutoString hrefVal;
>-          element->GetAttr(kNameSpaceID_None, nsGkAtoms::href, hrefVal);
>-          if (!hrefVal.IsEmpty()) {
>-            PrefetchHref(hrefVal, element, hasPrefetch);
>-          }
>-        }
>-        if (linkTypes & DNS_PREFETCH) {
>-          nsAutoString hrefVal;
>-          element->GetAttr(kNameSpaceID_None, nsGkAtoms::href, hrefVal);
>-          if (!hrefVal.IsEmpty()) {
>-            PrefetchDNS(hrefVal);
>-          }
>-        }
>-      }
>-    }
>-  }
>-
>-  return result;
>+  MOZ_NOT_REACHED("Old HTMLContentSink used for processing links.");
>+  return NS_ERROR_NOT_IMPLEMENTED;
> }
What is this stuff about?



> HTMLContentSink::ProcessSTYLEEndTag(nsGenericHTMLElement* content)
> {
>-  nsCOMPtr<nsIStyleSheetLinkingElement> ssle = do_QueryInterface(content);
>-
>-  NS_ASSERTION(ssle,
>-               "html:style doesn't implement nsIStyleSheetLinkingElement");
>-
>-  nsresult rv = NS_OK;
>-
>-  if (ssle) {
>-    // Note: if we are inside a noXXX tag, then we init'ed this style element
>-    // with mDontLoadStyle = true, so these two calls will have no effect.
>-    ssle->SetEnableUpdates(true);
>-    bool willNotify;
>-    bool isAlternate;
>-    rv = ssle->UpdateStyleSheet(mFragmentMode ? nsnull : this,
>-                                &willNotify,
>-                                &isAlternate);
>-    if (NS_SUCCEEDED(rv) && willNotify && !isAlternate && !mFragmentMode) {
>-      ++mPendingSheetCount;
>-      mScriptLoader->AddExecuteBlocker();
>-    }
>-  }
>-
>-  return rv;
>+  MOZ_NOT_REACHED("Old HTMLContentSink used for processing style elements.");
>+  return NS_ERROR_NOT_IMPLEMENTED;
And this.
Attachment #589852 - Flags: review?(bugs) → review-
Comment on attachment 589852 [details] [diff] [review]
Support text/html in DOMParser.parseFromString()

(In reply to Olli Pettay [:smaug] from comment #40)
> >+    // XXX why do we set these again?
> >+    // Make sure to give this document the right base URI
> >+    document->SetBaseURI(mBaseURI);
> >+    // And the right principal
> >+    document->SetPrincipal(mPrincipal);
> There is the comment about "Here we have to cheat a little..."
> Or is you XXX about something else?

The XXX is about setting base URI and principal again even though they were passed to nsContentSink::CreateDocument. I'm just imitating the XML code path here. Dunno why things are done that way.

> >-  return result;
> >+  MOZ_NOT_REACHED("Old HTMLContentSink used for processing links.");
> >+  return NS_ERROR_NOT_IMPLEMENTED;
> > }
> What is this stuff about?
...
> >-  return rv;
> >+  MOZ_NOT_REACHED("Old HTMLContentSink used for processing style elements.");
> >+  return NS_ERROR_NOT_IMPLEMENTED;
> And this.

Most of the old HTML content sink is now dead code. The changes to the superclass would have required tweaks to the dead code, so I figured it makes more sense to remove the affected parts of the dead code than to tweak dead code.

Setting back to review?, since your comment contained questions rather than comments obviously requiring patch changes.
Attachment #589852 - Flags: review- → review?(bugs)
Comment on attachment 589852 [details] [diff] [review]
Support text/html in DOMParser.parseFromString()


>+
>+    // XXX why do we set these again?
Remove this, but perhaps add a comment here and in the XML
document creation that both should do the same thing.

OK, explanations received on IRC.
Attachment #589852 - Flags: review?(bugs) → review+
(In reply to Olli Pettay [:smaug] from comment #42)
> Remove this, but add perhaps a comment here and in XML
> document creation that both should do the same thing.

Done. Thanks.

https://hg.mozilla.org/integration/mozilla-inbound/rev/10db8f79cd65
https://hg.mozilla.org/mozilla-central/rev/10db8f79cd65
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla12
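With the fix landed in Gecko 12, the web-exposed DOMParser accepts "text/html" directly, so the chrome-side string hackery from the original report is no longer needed. A minimal example:

```javascript
// Post-fix usage (Gecko 12+ and other modern browsers): no XPCOM needed.
var parser = new DOMParser();
var doc = parser.parseFromString("<p>Hello <b>world</b></p>", "text/html");
console.log(doc.body.textContent); // "Hello world"
```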
This page says that text/html is not supported, which is no longer true:

https://developer.mozilla.org/en/DOM/DOMParser
Keywords: dev-doc-needed
https://developer.mozilla.org/en/Firefox_12_for_developers
https://developer.mozilla.org/en/DOM/DOMParser
Keywords: dev-doc-needed → dev-doc-complete
Depends on: 728093

Updated

6 years ago
No longer depends on: 728093

Updated

5 years ago
Depends on: 770684
Component: DOM: Mozilla Extensions → DOM
Product: Core → Core