Closed Bug 1394941 Opened 7 years ago Closed 6 years ago

"Failed to load article from page" on economist.com

Categories

(Toolkit :: Reader Mode, defect, P2)

56 Branch
defect

Tracking

()

RESOLVED FIXED
Tracking Status
firefox57 --- wontfix
firefox58 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- fixed

People

(Reporter: adamopenweb, Assigned: Gijs)

References

()

Details

(Whiteboard: [reader-mode-readability-algorithm])

Attachments

(1 file)

Originally reported: https://github.com/webcompat/web-bugs/issues/9420

Steps to Reproduce:

1. Navigate to: https://www.economist.com/news/leaders/21727067-their-record-power-often-worrying-they-can-be-pragmatic-and-cannot-be-ignored-blanket
2. Switch to Reader Mode

Error: 
Failed to load article from page
Priority: -- → P2
Whiteboard: [reader-mode-readability-algorithm]
This seems to be because the XML serializes serializes the document by giving all the HTML node an "a0" namespace. Not sure why... emilio, do you have any idea? You can reproduce the serializer issue easily by following these steps:


1. open firefox
2. open https://www.economist.com/news/leaders/21727067-their-record-power-often-worrying-they-can-be-pragmatic-and-cannot-be-ignored-blanket
3. open devtools console
4. run: (new XMLSerializer()).serializeToString(document.documentElement).substring(0, 100)


Note how you get stuff like: <a0:html> ...

What's up with that? :-)
Flags: needinfo?(emilio)
Huh, wonderful, I can try to take a look.
Attached file Reduced testcase
So those are generated prefixes from nsXMLContentSerializer::GenerateNewPrefix:

  https://searchfox.org/mozilla-central/rev/c621276fbdd9591f52009042d959b9e19b66d49f/dom/base/nsXMLContentSerializer.cpp#591

The reason they're generated is because the default namespace URI (xmlns="https://www.w3.org/1999/xhtml") is _not_ the element's namespace URI (it's an HTML element, so the URL is http://www.w3.org/1999/xhtml), not the http:// vs https://.

So afaik the behavior of the serialized is correct... Boris or Peterv could confirm, but this seems to be working as expected.

Why is this breaking reading mode / why is this a problem? The page roundtrips if you parse it via DOMParser afaict.
Flags: needinfo?(emilio) → needinfo?(gijskruitbosch+bugs)
That being said, in the above test-case Blink (and WebKit I assume) doesn't generate any prefix...

Boris, do you know what's up with that? Should we skip the namespace generation if the namespace is HTML? I don't know much about XML serialization, but Blink's behavior doesn't look quite right to me...
Flags: needinfo?(bzbarsky)
Should learn how to write...

(In reply to Emilio Cobos Álvarez [:em
> The reason they're generated is because the default namespace URI
> (xmlns="https://www.w3.org/1999/xhtml") is _not_ the element's namespace URI
> (it's an HTML element, so the URL is http://www.w3.org/1999/xhtml), not the
> http:// vs https://.

s/not the/note the/

> So afaik the behavior of the serialized is correct... Boris [..]

s/serialized/serializer/
(In reply to Emilio Cobos Álvarez [:emilio] from comment #3)
> Why is this breaking reading mode / why is this a problem? The page
> roundtrips if you parse it via DOMParser afaict.

We parse the DOM (and do the readability work) in a worker. Which doesn't have access to DOMParser because no DOM in workers. So we have our own limited DOM implementation ( https://dxr.mozilla.org/mozilla-central/source/toolkit/components/reader/JSDOMParser.js ) that only parses from valid XHTML, which is why we're serializing at all (otherwise we could just use document.cloneNode() and pass a 'real' DOM object instead of serializing/deserializing).
Flags: needinfo?(gijskruitbosch+bugs)
To be explicit, that limited parser definitely has 0 clue about xml namespaces. And yeah, none of this ideal, but how to background the work in some way that doesn't involve a worker / can use the 'real' DOM implementation hasn't really ever been obvious to me (or the other people who've worked on reader mode in the past), esp. not if it needs to work on fennec.
Attachment #8984415 - Attachment mime type: text/plain → text/html
Flags: needinfo?(bzbarsky)
Blink's XML serializer had all sorts of issues with namespaces, last I checked.

That said, in spec terms, it looks like Blink is following the spec, such as it is.  https://w3c.github.io/DOM-Parsing/#dfn-recording-the-namespace-information ignores attributes named "xmlns" that are not in the "http://www.w3.org/2000/xmlns/" namespace and doesn't do anything special for them.  This seems like a bug when the "require well-formed" flag is set.  But https://w3c.github.io/DOM-Parsing/#the-xmlserializer-interface doesn't set that flag anyway.

So the upshot is that per current spec draft you can totally have a DOM that serializes to a string that parses to a different DOM, or can't parse at all with an XML parser.

Anne, do you know what the state of this spec is, and whether this is expected behavior?

Gijs, the upshot is that per current spec you cannot ever rely on XMLSerializer producing "valid XHTML".  You just need a parser that will deal with things that are not.  If we keep our current behavior, you can at least assume you will have well-formed XML that preserves namespaces.  But per the spec draft, the output of XMLSerializer may not even be well-formed XML...

That said, if your parser is not namespace-aware, presumably it's just creating nodes with names like "a0:html", right?  Is the issue then that whatever DOM manipulation happens is expecting specific tag names?
Flags: needinfo?(annevk)
No, active maintenance stopped mid-2016 and I'm pretty sure it was not developed with tests and browser bugs as progress was made (and likely biased towards Edge).
Flags: needinfo?(annevk)
OK.  That _is_ the right canonical draft, right?  http://domparsing.spec.whatwg.org/ seems to redirect there...
Yes.
(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #8)
> So the upshot is that per current spec draft you can totally have a DOM that
> serializes to a string that parses to a different DOM, or can't parse at all
> with an XML parser.

I... ugh. Does that actually happen with "our" xml serializer?

I'm pretty sure we used to rely on some XPCOM goop to instantiate the serializer. Apparently that changed in bug 1452235.

> <snip>
> 
> Gijs, the upshot is that per current spec you cannot ever rely on
> XMLSerializer producing "valid XHTML". 

I mean, 'valid enough' would be nice, but the current state is clearly a bit of a tip, compared to the DOM that we render and treat like a perfectly normal (X)HTML document in gecko in that we render it etc...

> You just need a parser that will deal with things that are not.

I think given the current constraints that type of parser is in short supply... in terms of codesize, memory usage etc., including e.g. all of jsdom ( https://github.com/jsdom/jsdom ) is not a great option, esp. on mobile. DOMParser is unavailable. Are there alternatives I'm missing here?

> If we keep our current behavior, you can at
> least assume you will have well-formed XML that preserves namespaces.  But
> per the spec draft, the output of XMLSerializer may not even be well-formed
> XML...

Do we have alternative chrome-only things that give us better guarantees? We don't need to rely on the spec here (reader mode isn't web-exposed, and readability-as-a-library takes a DOM, it's only desktop/fennec that do the crazy xml serialization+parsing-in-a-worker thing), just on the gecko behavior.

> That said, if your parser is not namespace-aware, presumably it's just
> creating nodes with names like "a0:html", right?  Is the issue then that
> whatever DOM manipulation happens is expecting specific tag names?

I guess... I haven't checked, I just know that the result fails this check:

https://github.com/mozilla/readability/blob/master/Readability.js#L29-L35

I wouldn't be surprised if JSDOMParser just chokes and outputs null or something, but JS debugging chrome workers is still impossible so I can't easily tell atm. But even if it didn't and somehow converted the namespaced tagnames to include the namespace in the tagname (ie "<a0:div>" instead of "<div>") then lots of logic in readability would break.

I wouldn't even be opposed to adding a hack/workaround here, I'm just not sure what it'd be. I assume just replacing the attribute before serializing won't retroactively renamespace all the contained nodes and as a result wouldn't fix this aspect of the serialization. We could be *really* evil and just regex-replace opening and closing namespace bits on the result, I guess?
Flags: needinfo?(bzbarsky)
> I... ugh. Does that actually happen with "our" xml serializer?

Sort of.  For example:

  new XMLSerializer().serializeToString(document.createElement("abc:def"))

will produce:

  <abc:def xmlns="http://www.w3.org/1999/xhtml"></abc:def>

which won't parse with a namespace-aware XML parser.  It'll be OK with a namespace-unaware one, of course.

I _think_ all the weirdness is around namespaces so shouldn't trip up a namespace-unaware parser per se.

> I'm pretty sure we used to rely on some XPCOM goop to instantiate the serializer.

Doesn't matter; it's the same behavior.

> compared to the DOM that we render and treat like a perfectly normal (X)HTML document in gecko in that we render it etc

Well, there we parse that with an HTML parser, not an XML parser.

> in terms of codesize, memory usage etc., including e.g. all of jsdom ( https://github.com/jsdom/jsdom ) is not a great option

So I'd like to separate two things here.  What you actually have in https://dxr.mozilla.org/mozilla-central/source/toolkit/components/reader/JSDOMParser.js is a cut-down DOM implementation _and_ an XML parser.  You could conceivably use that same cut-down DOM implementation with an HTML parser.  It's still going to be larger code, of course, since an HTML parser is just a lot more stuff than an XML parser...

Poking around a bit, https://github.com/blowsie/Pure-JavaScript-HTML5-Parser (and forks like <https://github.com/Munawwar/neutron-html5parser>) is a thing, at least.  It doesn't seem to be very actively maintained, but I dunno that much has changed in the HTML parsing algorithm recently.

If you did something like this, you'd probably want to use .outerHTML to get the string instead of XMLSerializer.

> Do we have alternative chrome-only things that give us better guarantees?

Not off-the-shelf.

We _could_ add a chromeonly thing on XMLSerializer that just doesn't even try to create namespace-well-formed documents, since this consumer clearly doesn't care about namespace-well-formedness anyway and the attempt to preserve it is tripping it up.  

> I wouldn't be surprised if JSDOMParser just chokes

At first glance, makeElementNode just grabs everything up to whitespace, '>', or '/' as a tag name.  So that part is not a problem....  I don't quite see why it would fail.

That said, we could probably just hack JSDOMParser.js some more to just strip off things that look like prefixes from tagnames.

> I assume just replacing the attribute before serializing won't retroactively renamespace
> all the contained nodes

I'm not sure what you mean.

If you remove xmlns attributes before you pass your thing through XMLSerializer, that would certainly avoid it trying to do anything interesting with prefixes; it only messes with prefixes when it needs to put nodes in namespaces that don't match their default namespace.

But that would mean mutating the page DOM; I'm not sure whether that's ok in this case.
Flags: needinfo?(bzbarsky)
I put up a parser hackaround on https://github.com/mozilla/readability/pull/458 . :-)
Assignee: nobody → gijskruitbosch+bugs
Status: NEW → ASSIGNED
Fixed in github and uplifted in bug 1467742, so this should work now...
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Depends on: 1467742
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: