Closed
Bug 102699
Opened 23 years ago
Closed 13 years ago
Implement text/html for @mozilla.org/xmlextras/domparser
Categories
(Core :: DOM: Core & HTML, enhancement)
Core
DOM: Core & HTML
Tracking
()
RESOLVED
FIXED
mozilla12
People
(Reporter: mikeypotter, Assigned: hsivonen)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Keywords: dev-doc-complete, Whiteboard: [jetpack])
Attachments
(2 files, 8 obsolete files)
2.29 KB,
text/plain
|
Details | |
48.58 KB,
patch
|
smaug
:
review+
|
Details | Diff | Splinter Review |
Having a string of HTML, I need to be able to create an HTML DOM Document and
walk through the HTML through the DOM.
The following works for XML:
var domParser =
Components.classes["@mozilla.org/xmlextras/domparser;1"].getService(
Components.interfaces.nsIDOMParser );
var xmlDocument = domParser.parseFromString( ResponseObject.responseText,
"text/xml" );
but returns an error
Error: [Exception... "Component returned failure code: 0x80004001
(NS_ERROR_NOT_IMPLEMENTED) [nsIDOMParser.parseFromString]" nsresult:
"0x80004001 (NS_ERROR_NOT_IMPLEMENTED)" location: "JS frame ::
chrome://personalpage/content/HtmlTest.xul :: onResponse :: line 17" data: no]
Source File: chrome://personalpage/content/HtmlTest.xul
Line: 17
with HTML used, like this:
var domParser =
Components.classes["@mozilla.org/xmlextras/domparser;1"].getService(
Components.interfaces.nsIDOMParser );
var xmlDocument = domParser.parseFromString( ResponseObject.responseText,
"text/html" );
I have a test case I will attach.
Reporter | ||
Comment 1•23 years ago
|
||
Comment 2•23 years ago
|
||
Nominating for mozilla1.0. I'm not entirely certain how ugly the Penzilla hack
was go get around this problem. Asking MikeP to comment on the importance of
this bug for mozilla1.0.
Keywords: mozilla1.0
Reporter | ||
Comment 3•23 years ago
|
||
The hack for this was pretty ugly, having to do a bunch of string manipulations
in order to get out the parts that I need.
It would be really nice to have this work for html text, I can see a lot of
benefit in having it.
Updated•23 years ago
|
Priority: -- → P4
Updated•23 years ago
|
Severity: normal → enhancement
OS: Linux → All
Hardware: PC → All
Comment 6•20 years ago
|
||
In this newsgroup thread, bz explained why this is hard:
http://groups-beta.google.com/group/netscape.public.mozilla.dom/browse_thread/thread/126739f8904f94d9/7c5cd3d4ad884074
Comment 7•19 years ago
|
||
To get around this bug I am using a simple C++ XPCOM object that simply forwards the html string to "Tidy HTML" library and produces XHTML or XML.
This way I can handle most html documents (those that are recoverable and tidy html can handle).
This way the html can be parsed with the normal DOMParser or any other XML/XHTML parsers, eg.
var html = ... // html string
const cid = "@mipis.hr/HtmlHelpers/HtmlToXml;1"
var obj = Components.classes[cid].createInstance().QueryInterface(Components.interfaces.IHtmlToXml);
var pureHtml = obj.convertHtmlToXml(html);
// ... use as normal XHTML from here on
Comment 8•19 years ago
|
||
(In reply to comment #7)
> To get around this bug I am using a simple C++ XPCOM object that simply
> forwards the html string to "Tidy HTML" library and produces XHTML or XML.
> This way I can handle most html documents (those that are recoverable and tidy
> html can handle).
Neat idea. Would you be willing to share the implementation of your component?
Comment 9•19 years ago
|
||
> Neat idea. Would you be willing to share the implementation of your component?
>
Sure - download from:
http://multiterra.mooload.com/file.php?file=files/210306/1142980898/htmltoxml.rar
(free file host - I couldn't attach it here)
To compile use the MS VS workspace (change the "aditional includes/libraries" settings to point to your gecko 1.7 SDK).
To install copy the IHtmlToXml.xpt and HtmlToXml.dll to the <firefox install>\components. "Touch" (open/save) the <firefox install>\.autoreg.
Test case is attached.
The extension was made by following the tutorial at http://www.iosart.com/firefox/xpcom/.
See the page if you have any problems.
Disclaimer: this is my first attempt to make something for mozilla - I have no experience with mozilla whatsoever (not counting the last ~5 days! :)). It seems to work, but I can make no guaranties. I works under win OS, but Tindy HTML is pretty portable so it should be easyto make it work on other platforms - see the project page at http://tidy.sourceforge.net/.
Comment 10•19 years ago
|
||
Pretty cool. I can imagine it would be useful to have such a component in the Mozilla/Firefox core. There are lots of other uses for a "Tidy HTML" library (rendering for visually impaired users maybe).
Comment 11•19 years ago
|
||
Patching in HTML Tidy is fine, but the point of fixing this bug is to gain access to the Mozilla rendering and HTML digestion, which is vastly superior to the parsing in HTML Tidy. The TagSoup engine seems to be commonly regarded as being better than HTML Tidy, and it also seems to be inferior to what Mozilla does; the complex examples I've been working with render far better in their original HTML than in the XHTML that TagSoup produces.
Comment 12•18 years ago
|
||
In a thread in mdt.layout <http://groups.google.com/group/mozilla.dev.tech.layout/browse_frm/thread/225287c6d3098ca5/4698d4007ab27e1f#4698d4007ab27e1f>, bzbarsky said:
> So as I see it, the steps to get this working are:
>
> 1) Decide what the problem we're solving is. Specifically, how should
> noscript, noframes, and such be parsed in these documents? Keep in mind
> that depending on user settings (like whether script is enabled) we
> create different DOMs from the same source.
>
> 2) Decide what the plan is for charsets (currently we depend on having
> a docshell to handle charset autodetect and in some cases <meta> tags,
> because we have to throw away the document and reparse).
>
> 3) Go through the HTML content sink and HTML document, and make sure
> all the places that use the docshell or window can survive without one.
>
> 4) Do whatever we decided to do for charsets.
>
> 5) Make DOMParser parse HTML.
>
>> 1. Will things get better in Gecko 1.9/Firefox 3 (i.e. are there
>> concrete plans or promising developments in this area)?
>
> I'm not aware of significant changes in this area since 1.8, and I'm not
> sure anyone is working on this actively. I strongly suspect that given
> our existing code, once item #1 above is sorted out handling item #3 and
> item #5 should not be that bad -- a few days work at most. Items #2 and
> #4 I'm really not sure about; I guess in large part it depends on what
> we decide to do about #2.
Given the value of doing this for Firefox's microsummary service, not to mention extension authors and other Gecko consumers, it seems worthy of prioritization for 1.9, so requesting blocking1.9.
Flags: blocking1.9?
Comment 13•18 years ago
|
||
If this bug doesn't get fixed in Gecko 1.9, then would it make sense for us to turn the MicrosummaryResource JS object, which is what currently parses HTML for the microsummary service, into an XPCOM component usable by other code?
Flags: blocking1.9? → blocking1.9-
Whiteboard: [wanted-1.9]
Comment 14•18 years ago
|
||
Bug 361449 contains a patch that I believe fixes this bug.
Comment 15•18 years ago
|
||
Assignee: general → sayrer
Status: NEW → ASSIGNED
Comment 16•18 years ago
|
||
>
> > So as I see it, the steps to get this working are:
> >
> > 1) Decide what the problem we're solving is. Specifically, how should
> > noscript, noframes, and such be parsed in these documents? Keep in mind
> > that depending on user settings (like whether script is enabled) we
> > create different DOMs from the same source.
nsXMLDocument contains a useful concept called "LoadedAsData" that makes these decisions for XHTML and other XML dialects, and disables layout, CSSLoader, scripts, etc. I am content to mirror its decisions by hoisting that code into nsDocument. Users that need a live DOM with scripts and CSS can always use an iframe. Perhaps we can revisit the CSS capabilites after the initial landing.
> >
> > 2) Decide what the plan is for charsets (currently we depend on having
> > a docshell to handle charset autodetect and in some cases <meta> tags,
> > because we have to throw away the document and reparse).
Charset detection should be identical to the shipping browser, so we'll have to figure out a way to restart.
> >
> > 3) Go through the HTML content sink and HTML document, and make sure
> > all the places that use the docshell or window can survive without one.
> >
This hasn't been too hard.
> > 4) Do whatever we decided to do for charsets.
> >
So, we have a large block of code in the HTML content sink that tries different ways of getting a charset from the docshell and other sources. There are some situations where we're going to want to ignore the usual hints. For instance, parseFromString is always going to be unicode, so we should ignore meta tags in that case.
Comment 17•18 years ago
|
||
Attachment #258028 -
Attachment is obsolete: true
Comment 18•18 years ago
|
||
updated to tip, unit tests coming up next
Attachment #258050 -
Attachment is obsolete: true
Comment 19•18 years ago
|
||
updated to trunk, again. This patch has nsDOMParser implement nsIWebShellServices... I think that will be sufficient to get a callback from chardet and start a reparse.
I also added unit testing for the parser from XPC Shell. It doesn't check the results yet, but it does successfully parse all of test cases from htmlparser, except for the empty string, which seems to fool nsStringStream into thinking it isn't buffered and triggers an assertion.
Attachment #262013 -
Attachment is obsolete: true
Comment 20•18 years ago
|
||
(In reply to comment #19)
> which seems to fool nsStringStream into thinking it isn't buffered and
> triggers an assertion.
That's bug 368328, for what it's worth.
Comment 21•18 years ago
|
||
unit tests failing for the cases we would expect: script, frames, style
Attachment #264537 -
Attachment is obsolete: true
Comment 22•17 years ago
|
||
Last steps:
* add tests for noscript
* populate the pseudo-docshell methods I've added to nsDOMParser. These will let us restart parsing based on charsets in meta tags.
Attachment #264619 -
Attachment is obsolete: true
Comment 23•17 years ago
|
||
Comment on attachment 270241 [details] [diff] [review]
passing all 181 WHAT-WG HTML parser tests that our browser does
>Index: tools/test-harness/xpcshell-simple/test_all.sh
>- NATIVE_TOPSRCDIR="$native_topsrcdir" TOPSRCDIR="$topsrcdir" $xpcshell -s $headfiles -f $t $tailfiles 2> $t.log 1>&2
>+ NATIVE_TOPSRCDIR="$native_topsrcdir" TOPSRCDIR="$topsrcdir" $xpcshell -v 170 -s $headfiles -f $t $tailfiles 2> $t.log 1>&2
What's the -v 170 for?
Comment 24•17 years ago
|
||
(In reply to comment #23)
>
> What's the -v 170 for?
Turns on generators and such in xpcshell. Or did we land a patch to always use the newest JS version?
Comment 25•17 years ago
|
||
jsshell does (see bug 385159), but I don't know whether xpcshell does.
Comment 26•17 years ago
|
||
Is someone still working on this or is it dead again?
Updated•17 years ago
|
Flags: wanted1.9+
Whiteboard: [wanted-1.9]
Comment 27•16 years ago
|
||
Requesting wanted-1.9.1 with P1 per 1.9.1 Tracking. The addition of this bug would be vital to anyone doing webscraping. Given some of the changes going into Thunderbird 3, I think this would help the extensions of that product in some key areas and would like to see this fixed before the release of said product.
Flags: wanted1.9.1?
Priority: P3 → P1
Comment 28•16 years ago
|
||
wanted1.9.1+. Sayre, are you still working on this?
Flags: wanted1.9.1? → wanted1.9.1+
Comment 29•16 years ago
|
||
This is a patch (derived from the 1.9.0.* CVS branch) that I locally use to get the dom parser parsing HTML. I have no idea if this is correct in all cases, only that it works for my purposes.
Comment 30•16 years ago
|
||
(In reply to comment #29)
> Created an attachment (id=325352) [details]
> Unbitrotted patch
>
> This is a patch (derived from the 1.9.0.* CVS branch) that I locally use to get
> the dom parser parsing HTML. I have no idea if this is correct in all cases,
> only that it works for my purposes.
>
The original patch has tests.
Comment 31•16 years ago
|
||
Why not borrowing from DOM:range.createContextualFragment() ?
Comment 32•16 years ago
|
||
createContextualFragment doesn't behave the same way as the normal HTML parser, for what it's worth. It also _always_ operates in the context of a docshell, and _never_ in the context of a document, neither of which is the case for DOMParser.
Assignee | ||
Comment 33•16 years ago
|
||
As mentioned on public-html, it would be natural for DOMParser text/html behavior to match XHR2 text/html behavior wrt. noscript, noframes, etc.. Does this patch make things match the current XHR2 draft?
Comment 34•16 years ago
|
||
(In reply to comment #33)
> As mentioned on public-html, it would be natural for DOMParser text/html
> behavior to match XHR2 text/html behavior wrt. noscript, noframes, etc.. Does
> this patch make things match the current XHR2 draft?
I thought XHR2 specified <noscript> parsing behavior. No request for this feature has ever asked for that, so I think XHR2 is broken.
Comment 35•16 years ago
|
||
To underscore what Joshua mentioned, this functionality is also important for Ubiquity, both in terms of enabling non-strict xhtml command feeds (ease of development) as well as page parsing without having to load iframes in hidden windows.
Updated•15 years ago
|
QA Contact: stummala → general
Comment 36•15 years ago
|
||
doesn't really work anymore, but applies and builds. I think I'm going to break nsHTMLDocument::StartDocumentLoad into pieces before I try again.
Attachment #270241 -
Attachment is obsolete: true
Attachment #325352 -
Attachment is obsolete: true
Updated•15 years ago
|
Attachment #402676 -
Attachment is patch: true
Attachment #402676 -
Attachment mime type: application/octet-stream → text/plain
Comment 37•15 years ago
|
||
Just adding [jetpack] to the whiteboard, as this will be very useful to have in Jetpack.
Whiteboard: [jetpack]
Updated•14 years ago
|
Component: DOM: Abstract Schemas → DOM: Mozilla Extensions
Updated•14 years ago
|
Blocks: Instantbird
Comment 38•13 years ago
|
||
Fully functional workaround: https://gist.github.com/1129031
Updated•13 years ago
|
Assignee | ||
Comment 39•13 years ago
|
||
I'm not planning to add support for HTML to methods that aren't exposed to Web content.
Attachment #402676 -
Attachment is obsolete: true
Assignee | ||
Updated•13 years ago
|
Attachment #589852 -
Flags: review?(bugs)
Comment 40•13 years ago
|
||
Comment on attachment 589852 [details] [diff] [review]
Support text/html in DOMParser.parseFromString()
>+ // XXX why do we set these again?
>+ // Make sure to give this document the right base URI
>+ document->SetBaseURI(mBaseURI);
>+ // And the right principal
>+ document->SetPrincipal(mPrincipal);
There is the comment about "Here we have to cheat a little..."
Or is you XXX about something else?
> HTMLContentSink::ProcessLINKTag(const nsIParserNode& aNode)
> {
>- nsresult result = NS_OK;
>-
>- if (mCurrentContext) {
>- // Create content object
>- nsCOMPtr<nsIContent> element;
>- nsCOMPtr<nsINodeInfo> nodeInfo;
>- nodeInfo = mNodeInfoManager->GetNodeInfo(nsGkAtoms::link, nsnull,
>- kNameSpaceID_XHTML,
>- nsIDOMNode::ELEMENT_NODE);
>-
>- result = NS_NewHTMLElement(getter_AddRefs(element), nodeInfo.forget(),
>- NOT_FROM_PARSER);
>- NS_ENSURE_SUCCESS(result, result);
>-
>- nsCOMPtr<nsIStyleSheetLinkingElement> ssle(do_QueryInterface(element));
>-
>- if (ssle) {
>- // XXX need prefs. check here.
>- if (!mInsideNoXXXTag) {
>- ssle->InitStyleLinkElement(false);
>- ssle->SetEnableUpdates(false);
>- } else {
>- ssle->InitStyleLinkElement(true);
>- }
>- }
>-
>- // Add in the attributes and add the style content object to the
>- // head container.
>- result = AddAttributes(aNode, element);
>- if (NS_FAILED(result)) {
>- return result;
>- }
>-
>- mCurrentContext->AddLeaf(element); // <link>s are leaves
>-
>- if (ssle) {
>- ssle->SetEnableUpdates(true);
>- bool willNotify;
>- bool isAlternate;
>- result = ssle->UpdateStyleSheet(mFragmentMode ? nsnull : this,
>- &willNotify,
>- &isAlternate);
>- if (NS_SUCCEEDED(result) && willNotify && !isAlternate && !mFragmentMode) {
>- ++mPendingSheetCount;
>- mScriptLoader->AddExecuteBlocker();
>- }
>-
>- // look for <link rel="next" href="url">
>- nsAutoString relVal;
>- element->GetAttr(kNameSpaceID_None, nsGkAtoms::rel, relVal);
>- if (!relVal.IsEmpty()) {
>- PRUint32 linkTypes = nsStyleLinkElement::ParseLinkTypes(relVal);
>- bool hasPrefetch = linkTypes & PREFETCH;
>- if (hasPrefetch || (linkTypes & NEXT)) {
>- nsAutoString hrefVal;
>- element->GetAttr(kNameSpaceID_None, nsGkAtoms::href, hrefVal);
>- if (!hrefVal.IsEmpty()) {
>- PrefetchHref(hrefVal, element, hasPrefetch);
>- }
>- }
>- if (linkTypes & DNS_PREFETCH) {
>- nsAutoString hrefVal;
>- element->GetAttr(kNameSpaceID_None, nsGkAtoms::href, hrefVal);
>- if (!hrefVal.IsEmpty()) {
>- PrefetchDNS(hrefVal);
>- }
>- }
>- }
>- }
>- }
>-
>- return result;
>+ MOZ_NOT_REACHED("Old HTMLContentSink used for processing links.");
>+ return NS_ERROR_NOT_IMPLEMENTED;
> }
What is this stuff about?
> HTMLContentSink::ProcessSTYLEEndTag(nsGenericHTMLElement* content)
> {
>- nsCOMPtr<nsIStyleSheetLinkingElement> ssle = do_QueryInterface(content);
>-
>- NS_ASSERTION(ssle,
>- "html:style doesn't implement nsIStyleSheetLinkingElement");
>-
>- nsresult rv = NS_OK;
>-
>- if (ssle) {
>- // Note: if we are inside a noXXX tag, then we init'ed this style element
>- // with mDontLoadStyle = true, so these two calls will have no effect.
>- ssle->SetEnableUpdates(true);
>- bool willNotify;
>- bool isAlternate;
>- rv = ssle->UpdateStyleSheet(mFragmentMode ? nsnull : this,
>- &willNotify,
>- &isAlternate);
>- if (NS_SUCCEEDED(rv) && willNotify && !isAlternate && !mFragmentMode) {
>- ++mPendingSheetCount;
>- mScriptLoader->AddExecuteBlocker();
>- }
>- }
>-
>- return rv;
>+ MOZ_NOT_REACHED("Old HTMLContentSink used for processing style elements.");
>+ return NS_ERROR_NOT_IMPLEMENTED;
And this.
Attachment #589852 -
Flags: review?(bugs) → review-
Assignee | ||
Comment 41•13 years ago
|
||
Comment on attachment 589852 [details] [diff] [review]
Support text/html in DOMParser.parseFromString()
(In reply to Olli Pettay [:smaug] from comment #40)
> >+ // XXX why do we set these again?
> >+ // Make sure to give this document the right base URI
> >+ document->SetBaseURI(mBaseURI);
> >+ // And the right principal
> >+ document->SetPrincipal(mPrincipal);
> There is the comment about "Here we have to cheat a little..."
> Or is you XXX about something else?
The XXX is about setting base URI and principal again even though they were passed to nsContentSink::CreateDocument. I'm just imitating the XML code path here. Dunno why things are done that way.
> >- return result;
> >+ MOZ_NOT_REACHED("Old HTMLContentSink used for processing links.");
> >+ return NS_ERROR_NOT_IMPLEMENTED;
> > }
> What is this stuff about?
...
> >- return rv;
> >+ MOZ_NOT_REACHED("Old HTMLContentSink used for processing style elements.");
> >+ return NS_ERROR_NOT_IMPLEMENTED;
> And this.
Most of the old HTML content sink is now dead code. The changes to the superclass would have required tweaks to the dead code, so I figured it makes more sense to remove the affected parts of the dead code than to tweak dead code.
Setting back to review?, since your comment contained questions rather than comments obviously requiring patch changes.
Attachment #589852 -
Flags: review- → review?(bugs)
Comment 42•13 years ago
|
||
Comment on attachment 589852 [details] [diff] [review]
Support text/html in DOMParser.parseFromString()
>+
>+ // XXX why do we set these again?
Remove this, but add perhaps a comment here and in XML
document creation that both should do the same thing.
ok, explanations got on IRC.
Attachment #589852 -
Flags: review?(bugs) → review+
Assignee | ||
Comment 43•13 years ago
|
||
(In reply to Olli Pettay [:smaug] from comment #42)
> Remove this, but add perhaps a comment here and in XML
> document creation that both should do the same thing.
Done. Thanks.
https://hg.mozilla.org/integration/mozilla-inbound/rev/10db8f79cd65
Comment 44•13 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla12
Comment 45•13 years ago
|
||
This page says that text/html is not supported, which is no longer true:
https://developer.mozilla.org/en/DOM/DOMParser
Keywords: dev-doc-needed
Comment 46•13 years ago
|
||
https://developer.mozilla.org/en/Firefox_12_for_developers
https://developer.mozilla.org/en/DOM/DOMParser
Keywords: dev-doc-needed → dev-doc-complete
Updated•12 years ago
|
Depends on: CVE-2012-3975
Updated•12 years ago
|
Component: DOM: Mozilla Extensions → DOM
Updated•6 years ago
|
Component: DOM → DOM: Core & HTML
You need to log in
before you can comment on or make changes to this bug.
Description
•