Closed Bug 1240035 Opened 7 years ago Closed 6 years ago

Page from books.portonvictor.org is misparsed and everything gets stuck in an <a> tag

Categories

(Toolkit :: Reader Mode, defect, P2)

43 Branch
defect

Tracking

()

RESOLVED DUPLICATE of bug 1170955

People

(Reporter: porton, Assigned: evanxd)

References

(Blocks 2 open bugs, )

Details

(Whiteboard: [reader-mode-readability-algorithm])

Attachments

(3 files)

User Agent: Mozilla/5.0 (X11; Linux i686; rv:43.0) Gecko/20100101 Firefox/43.0 Iceweasel/43.0.4
Build ID: 20160107230306

Steps to reproduce:

I opened http://books.portonvictor.org/binaries/nt-commentary.html and clicked "Reader View"


Actual results:

The text is bluish and underlined.


Expected results:

The text (except of links) should not be colored nor underlined.
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:46.0) Gecko/20100101 Firefox/46.0
20160115030235
Status: UNCONFIRMED → NEW
Component: Untriaged → Reader Mode
Ever confirmed: true
OS: Unspecified → All
Product: Firefox → Toolkit
Hardware: Unspecified → All
Priority: -- → P2
Summary: Certain page is displayed wrongly in Reader View → Page from books.portonvictor.org is misparsed and everything gets stuck in an <a> tag
Whiteboard: [reader-mode-readability-algorithm]
Attached file serializied.html
The root cause is about the `xmlserializer`[1] module, not related with reader mode module itself. It parse the webpage[2] wrongly and add `<a>` tags for some text paragraphs. For example, the below HTML code is from the webpage[2].

```
<h1 class="Heading_20_1"><a id="a_Chapter_2_Preface"><span>Chapter 2 </span></a><a id="__RefHeading___Toc4450_854563822"/><a id="chap_Preface"/>Preface</h1>
<p class="P3">In this book I am going to tell you my thoughts on the New Testament, so you will know the same mysteries as me.</p>
```

After parsed by the `xmlserializer`, it will become to

```
<h1 class="Heading_20_1"><a id="a_Chapter_2_Preface"><span>Chapter 2 </span></a><a id="__RefHeading___Toc4450_854563822"></a><a id="chap_Preface">Preface</a></h1>
<p class="P3"><a id="chap_Preface">In this book I am going to tell you my thoughts on the New Testament, so you will know the same mysteries as me.</a></p>
```
(check full HTML code in the attachment)

Then our Reader Mode module will generate reader mode page with wrong `<a>` tags. So we should fix this issue by fixing the `xmlserializer` module.

[1]: http://searchfox.org/mozilla-central/source/toolkit/components/reader/ReaderMode.jsm#424-426
[2]: http://books.portonvictor.org/binaries/nt-commentary.html
The xmlserializer operates on the DOM. So you should compare the serialization and the DOM, not the serialization with the HTML text as sent. AFAICT the <a> tag is there in the DOM, so it is dutifully serialized by xmlserializer.

It feels like this will be some obscure consequence of how the content (which tries to be XHTML but is really parsed as HTML) is being parsed by the HTML parser. Chrome produces the same DOM structure based off that HTML.
That said, the <a> tag doesn't have an href, so I wouldn't expect it to be blue and underlined. Perhaps we're manually styling links in reader mode using an |a| tag selector and we should be using the :link pseudoclass with it to avoid the misstyling?
(In reply to :Gijs Kruitbosch from comment #3)
> The xmlserializer operates on the DOM. So you should compare the
> serialization and the DOM, not the serialization with the HTML text as sent.
> AFAICT the <a> tag is there in the DOM, so it is dutifully serialized by
> xmlserializer.
> 
> It feels like this will be some obscure consequence of how the content
> (which tries to be XHTML but is really parsed as HTML) is being parsed by
> the HTML parser. Chrome produces the same DOM structure based off that HTML.
Got it. But I still worry about the DOM structure of the webpage[1] is not correct. Are you sure that the DOM tree result of it is correct by following the DOM spec?

(In reply to :Gijs Kruitbosch from comment #4)
> That said, the <a> tag doesn't have an href, so I wouldn't expect it to be
> blue and underlined. Perhaps we're manually styling links in reader mode
> using an |a| tag selector and we should be using the :link pseudoclass with
> it to avoid the misstyling?
But anyway, that is a good idea. Let's do it.

[1]: http://books.portonvictor.org/binaries/nt-commentary.html
(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #6)
> (In reply to :Gijs Kruitbosch from comment #3)
> > The xmlserializer operates on the DOM. So you should compare the
> > serialization and the DOM, not the serialization with the HTML text as sent.
> > AFAICT the <a> tag is there in the DOM, so it is dutifully serialized by
> > xmlserializer.
> > 
> > It feels like this will be some obscure consequence of how the content
> > (which tries to be XHTML but is really parsed as HTML) is being parsed by
> > the HTML parser. Chrome produces the same DOM structure based off that HTML.
> Got it. But I still worry about the DOM structure of the webpage[1] is not
> correct. Are you sure that the DOM tree result of it is correct by following
> the DOM spec?

I'm very much not an expert, and this is outside the scope of this bug, but it looks to me like, because "/>" is not a valid way of closing an HTML element. I believe this bit of the spec: https://dev.w3.org/html5/spec-preview/tokenization.html , see step << A start tag whose tag name is "a" >> , has an example that roughly matches what's going on here.

If you're very concerned, file a separate bug against the html5 parser and someone (probably Hixie) will have to look at it.
(In reply to :Gijs Kruitbosch from comment #7)
> (In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #6)
> > (In reply to :Gijs Kruitbosch from comment #3)
> > > The xmlserializer operates on the DOM. So you should compare the
> > > serialization and the DOM, not the serialization with the HTML text as sent.
> > > AFAICT the <a> tag is there in the DOM, so it is dutifully serialized by
> > > xmlserializer.
> > > 
> > > It feels like this will be some obscure consequence of how the content
> > > (which tries to be XHTML but is really parsed as HTML) is being parsed by
> > > the HTML parser. Chrome produces the same DOM structure based off that HTML.
> > Got it. But I still worry about the DOM structure of the webpage[1] is not
> > correct. Are you sure that the DOM tree result of it is correct by following
> > the DOM spec?
> 
> I'm very much not an expert, and this is outside the scope of this bug, but
> it looks to me like, because "/>" is not a valid way of closing an HTML
> element.

... the link didn't get closed before the start of the <p> and gets reconstructed afterwards.
Comment on attachment 8807089 [details]
Bug 1240035 - Do not show underline when a tag is without href,

https://reviewboard.mozilla.org/r/90366/#review90074

I just checked, and AFAICT this is already fixed on nightly - bug 1170955 fixed the selector a few lines up from here to use a:link instead of just a, so while it's still broken on 50 it's working on Nightly, and this patch is unnecessary.
Attachment #8807089 - Flags: review?(gijskruitbosch+bugs) → review-
(In reply to :Gijs Kruitbosch from comment #9)
> Comment on attachment 8807089 [details]
> Bug 1240035 - Do not show underline when a tag is without href,
> 
> https://reviewboard.mozilla.org/r/90366/#review90074
> 
> I just checked, and AFAICT this is already fixed on nightly - bug 1170955
> fixed the selector a few lines up from here to use a:link instead of just a,
> so while it's still broken on 50 it's working on Nightly, and this patch is
> unnecessary.
Yes, you're right. So this bug is just a duplicate of Bug 1170955.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1170955
Assignee: nobody → evan
(In reply to :Gijs Kruitbosch from comment #7)
> (In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #6)
> > (In reply to :Gijs Kruitbosch from comment #3)
> > > The xmlserializer operates on the DOM. So you should compare the
> > > serialization and the DOM, not the serialization with the HTML text as sent.
> > > AFAICT the <a> tag is there in the DOM, so it is dutifully serialized by
> > > xmlserializer.
> > > 
> > > It feels like this will be some obscure consequence of how the content
> > > (which tries to be XHTML but is really parsed as HTML) is being parsed by
> > > the HTML parser. Chrome produces the same DOM structure based off that HTML.
> > Got it. But I still worry about the DOM structure of the webpage[1] is not
> > correct. Are you sure that the DOM tree result of it is correct by following
> > the DOM spec?
> 
> I'm very much not an expert, and this is outside the scope of this bug, but
> it looks to me like, because "/>" is not a valid way of closing an HTML
> element. I believe this bit of the spec:
> https://dev.w3.org/html5/spec-preview/tokenization.html , see step << A
> start tag whose tag name is "a" >> , has an example that roughly matches
> what's going on here.
Looks like that is what's going on here.
You need to log in before you can comment on or make changes to this bug.