Originally reported: https://github.com/webcompat/web-bugs/issues/9420 Steps to Reproduce: 1. Navigate to: https://www.economist.com/news/leaders/21727067-their-record-power-often-worrying-they-can-be-pragmatic-and-cannot-be-ignored-blanket 2. Switch to Reader Mode Error: Failed to load article from page
Priority: -- → P2
status-firefox57: --- → wontfix
status-firefox58: --- → affected
This seems to be because the XML serializes serializes the document by giving all the HTML node an "a0" namespace. Not sure why... emilio, do you have any idea? You can reproduce the serializer issue easily by following these steps: 1. open firefox 2. open https://www.economist.com/news/leaders/21727067-their-record-power-often-worrying-they-can-be-pragmatic-and-cannot-be-ignored-blanket 3. open devtools console 4. run: (new XMLSerializer()).serializeToString(document.documentElement).substring(0, 100) Note how you get stuff like: <a0:html> ... What's up with that? :-)
Huh, wonderful, I can try to take a look.
Created attachment 8984415 [details] Reduced testcase So those are generated prefixes from nsXMLContentSerializer::GenerateNewPrefix: https://searchfox.org/mozilla-central/rev/c621276fbdd9591f52009042d959b9e19b66d49f/dom/base/nsXMLContentSerializer.cpp#591 The reason they're generated is because the default namespace URI (xmlns="https://www.w3.org/1999/xhtml") is _not_ the element's namespace URI (it's an HTML element, so the URL is http://www.w3.org/1999/xhtml), not the http:// vs https://. So afaik the behavior of the serialized is correct... Boris or Peterv could confirm, but this seems to be working as expected. Why is this breaking reading mode / why is this a problem? The page roundtrips if you parse it via DOMParser afaict.
Flags: needinfo?(emilio) → needinfo?(gijskruitbosch+bugs)
That being said, in the above test-case Blink (and WebKit I assume) doesn't generate any prefix... Boris, do you know what's up with that? Should we skip the namespace generation if the namespace is HTML? I don't know much about XML serialization, but Blink's behavior doesn't look quite right to me...
Should learn how to write... (In reply to Emilio Cobos Álvarez [:em > The reason they're generated is because the default namespace URI > (xmlns="https://www.w3.org/1999/xhtml") is _not_ the element's namespace URI > (it's an HTML element, so the URL is http://www.w3.org/1999/xhtml), not the > http:// vs https://. s/not the/note the/ > So afaik the behavior of the serialized is correct... Boris [..] s/serialized/serializer/
(In reply to Emilio Cobos Álvarez [:emilio] from comment #3) > Why is this breaking reading mode / why is this a problem? The page > roundtrips if you parse it via DOMParser afaict. We parse the DOM (and do the readability work) in a worker. Which doesn't have access to DOMParser because no DOM in workers. So we have our own limited DOM implementation ( https://dxr.mozilla.org/mozilla-central/source/toolkit/components/reader/JSDOMParser.js ) that only parses from valid XHTML, which is why we're serializing at all (otherwise we could just use document.cloneNode() and pass a 'real' DOM object instead of serializing/deserializing).
To be explicit, that limited parser definitely has 0 clue about xml namespaces. And yeah, none of this ideal, but how to background the work in some way that doesn't involve a worker / can use the 'real' DOM implementation hasn't really ever been obvious to me (or the other people who've worked on reader mode in the past), esp. not if it needs to work on fennec.
Attachment #8984415 - Attachment mime type: text/plain → text/html
Blink's XML serializer had all sorts of issues with namespaces, last I checked. That said, in spec terms, it looks like Blink is following the spec, such as it is. https://w3c.github.io/DOM-Parsing/#dfn-recording-the-namespace-information ignores attributes named "xmlns" that are not in the "http://www.w3.org/2000/xmlns/" namespace and doesn't do anything special for them. This seems like a bug when the "require well-formed" flag is set. But https://w3c.github.io/DOM-Parsing/#the-xmlserializer-interface doesn't set that flag anyway. So the upshot is that per current spec draft you can totally have a DOM that serializes to a string that parses to a different DOM, or can't parse at all with an XML parser. Anne, do you know what the state of this spec is, and whether this is expected behavior? Gijs, the upshot is that per current spec you cannot ever rely on XMLSerializer producing "valid XHTML". You just need a parser that will deal with things that are not. If we keep our current behavior, you can at least assume you will have well-formed XML that preserves namespaces. But per the spec draft, the output of XMLSerializer may not even be well-formed XML... That said, if your parser is not namespace-aware, presumably it's just creating nodes with names like "a0:html", right? Is the issue then that whatever DOM manipulation happens is expecting specific tag names?
No, active maintenance stopped mid-2016 and I'm pretty sure it was not developed with tests and browser bugs as progress was made (and likely biased towards Edge).
OK. That _is_ the right canonical draft, right? http://domparsing.spec.whatwg.org/ seems to redirect there...
(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #8) > So the upshot is that per current spec draft you can totally have a DOM that > serializes to a string that parses to a different DOM, or can't parse at all > with an XML parser. I... ugh. Does that actually happen with "our" xml serializer? I'm pretty sure we used to rely on some XPCOM goop to instantiate the serializer. Apparently that changed in bug 1452235. > <snip> > > Gijs, the upshot is that per current spec you cannot ever rely on > XMLSerializer producing "valid XHTML". I mean, 'valid enough' would be nice, but the current state is clearly a bit of a tip, compared to the DOM that we render and treat like a perfectly normal (X)HTML document in gecko in that we render it etc... > You just need a parser that will deal with things that are not. I think given the current constraints that type of parser is in short supply... in terms of codesize, memory usage etc., including e.g. all of jsdom ( https://github.com/jsdom/jsdom ) is not a great option, esp. on mobile. DOMParser is unavailable. Are there alternatives I'm missing here? > If we keep our current behavior, you can at > least assume you will have well-formed XML that preserves namespaces. But > per the spec draft, the output of XMLSerializer may not even be well-formed > XML... Do we have alternative chrome-only things that give us better guarantees? We don't need to rely on the spec here (reader mode isn't web-exposed, and readability-as-a-library takes a DOM, it's only desktop/fennec that do the crazy xml serialization+parsing-in-a-worker thing), just on the gecko behavior. > That said, if your parser is not namespace-aware, presumably it's just > creating nodes with names like "a0:html", right? Is the issue then that > whatever DOM manipulation happens is expecting specific tag names? I guess... I haven't checked, I just know that the result fails this check: https://github.com/mozilla/readability/blob/master/Readability.js#L29-L35 I wouldn't be surprised if JSDOMParser just chokes and outputs null or something, but JS debugging chrome workers is still impossible so I can't easily tell atm. But even if it didn't and somehow converted the namespaced tagnames to include the namespace in the tagname (ie "<a0:div>" instead of "<div>") then lots of logic in readability would break. I wouldn't even be opposed to adding a hack/workaround here, I'm just not sure what it'd be. I assume just replacing the attribute before serializing won't retroactively renamespace all the contained nodes and as a result wouldn't fix this aspect of the serialization. We could be *really* evil and just regex-replace opening and closing namespace bits on the result, I guess?
I put up a parser hackaround on https://github.com/mozilla/readability/pull/458 . :-)
Assignee: nobody → gijskruitbosch+bugs
Status: NEW → ASSIGNED
See Also: → https://github.com/mozilla/readability/pull/458
Fixed in github and uplifted in bug 1467742, so this should work now...
Status: ASSIGNED → RESOLVED
Last Resolved: a month ago
status-firefox60: --- → wontfix
status-firefox61: --- → wontfix
status-firefox62: --- → fixed
Depends on: 1467742
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.