Closed Bug 1240035 Opened 9 years ago Closed 9 years ago

Page from books.portonvictor.org is misparsed and everything gets stuck in an <a> tag

Tracking

()

Status:

RESOLVED DUPLICATE of bug 1170955

People

(Reporter: porton, Assigned: evanxd)

References

(Blocks 2 open bugs,
URL
)

Details

(Whiteboard: [reader-mode-readability-algorithm])

Attachments

(3 files)

Screenshot from 2016-01-15 17-17-55.png 9 years ago Victor Porton 133.66 KB, image/png		Details
serializied.html 9 years ago Evan Tseng [:evanxd] 268.20 KB, text/html		Details
Bug 1240035 - Do not show underline when a tag is without href, 9 years ago Evan Tseng [:evanxd] 58 bytes, text/x-review-board-request	Gijs : review-	Details

Victor Porton

Reporter

Description

•

9 years ago

Attached image Screenshot from 2016-01-15 17-17-55.png — Details

User Agent: Mozilla/5.0 (X11; Linux i686; rv:43.0) Gecko/20100101 Firefox/43.0 Iceweasel/43.0.4 Build ID: 20160107230306 Steps to reproduce: I opened http://books.portonvictor.org/binaries/nt-commentary.html and clicked "Reader View" Actual results: The text is bluish and underlined. Expected results: The text (except of links) should not be colored nor underlined.

Victor Porton

Reporter

Updated

•

9 years ago

URL: http://books.portonvictor.org/binarie...

Gingerbread Man

Comment 1

•

9 years ago

Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:46.0) Gecko/20100101 Firefox/46.0 20160115030235

Blocks: fix-readability

Status: UNCONFIRMED → NEW

Component: Untriaged → Reader Mode

Ever confirmed: true

OS: Unspecified → All

Product: Firefox → Toolkit

Hardware: Unspecified → All

:Gijs (he/him)

Updated

•

9 years ago

Priority: -- → P2

Summary: Certain page is displayed wrongly in Reader View → Page from books.portonvictor.org is misparsed and everything gets stuck in an <a> tag

:Gijs (he/him)

Updated

•

9 years ago

Whiteboard: [reader-mode-readability-algorithm]

:Gijs (he/him)

Updated

•

9 years ago

Blocks: 1286221

Evan Tseng [:evanxd]

Assignee

Comment 2

•

9 years ago

Attached file serializied.html — Details

The root cause is about the `xmlserializer`[1] module, not related with reader mode module itself. It parse the webpage[2] wrongly and add `<a>` tags for some text paragraphs. For example, the below HTML code is from the webpage[2]. ``` <h1 class="Heading_20_1"><a id="a_Chapter_2_Preface"><span>Chapter 2 </span></a><a id="__RefHeading___Toc4450_854563822"/><a id="chap_Preface"/>Preface</h1> <p class="P3">In this book I am going to tell you my thoughts on the New Testament, so you will know the same mysteries as me.</p> ``` After parsed by the `xmlserializer`, it will become to ``` <h1 class="Heading_20_1"><a id="a_Chapter_2_Preface"><span>Chapter 2 </span></a><a id="__RefHeading___Toc4450_854563822"></a><a id="chap_Preface">Preface</a></h1> <p class="P3"><a id="chap_Preface">In this book I am going to tell you my thoughts on the New Testament, so you will know the same mysteries as me.</a></p> ``` (check full HTML code in the attachment) Then our Reader Mode module will generate reader mode page with wrong `<a>` tags. So we should fix this issue by fixing the `xmlserializer` module. [1]: http://searchfox.org/mozilla-central/source/toolkit/components/reader/ReaderMode.jsm#424-426 [2]: http://books.portonvictor.org/binaries/nt-commentary.html

:Gijs (he/him)

Comment 3

•

9 years ago

The xmlserializer operates on the DOM. So you should compare the serialization and the DOM, not the serialization with the HTML text as sent. AFAICT the <a> tag is there in the DOM, so it is dutifully serialized by xmlserializer. It feels like this will be some obscure consequence of how the content (which tries to be XHTML but is really parsed as HTML) is being parsed by the HTML parser. Chrome produces the same DOM structure based off that HTML.

:Gijs (he/him)

Comment 4

•

9 years ago

That said, the <a> tag doesn't have an href, so I wouldn't expect it to be blue and underlined. Perhaps we're manually styling links in reader mode using an |a| tag selector and we should be using the :link pseudoclass with it to avoid the misstyling?

Comment hidden (mozreview-request)

Evan Tseng [:evanxd]

Assignee

Comment 6

•

9 years ago

(In reply to :Gijs Kruitbosch from comment #3) > The xmlserializer operates on the DOM. So you should compare the > serialization and the DOM, not the serialization with the HTML text as sent. > AFAICT the <a> tag is there in the DOM, so it is dutifully serialized by > xmlserializer. > > It feels like this will be some obscure consequence of how the content > (which tries to be XHTML but is really parsed as HTML) is being parsed by > the HTML parser. Chrome produces the same DOM structure based off that HTML. Got it. But I still worry about the DOM structure of the webpage[1] is not correct. Are you sure that the DOM tree result of it is correct by following the DOM spec? (In reply to :Gijs Kruitbosch from comment #4) > That said, the <a> tag doesn't have an href, so I wouldn't expect it to be > blue and underlined. Perhaps we're manually styling links in reader mode > using an |a| tag selector and we should be using the :link pseudoclass with > it to avoid the misstyling? But anyway, that is a good idea. Let's do it. [1]: http://books.portonvictor.org/binaries/nt-commentary.html

:Gijs (he/him)

Comment 7

•

9 years ago

(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #6) > (In reply to :Gijs Kruitbosch from comment #3) > > The xmlserializer operates on the DOM. So you should compare the > > serialization and the DOM, not the serialization with the HTML text as sent. > > AFAICT the <a> tag is there in the DOM, so it is dutifully serialized by > > xmlserializer. > > > > It feels like this will be some obscure consequence of how the content > > (which tries to be XHTML but is really parsed as HTML) is being parsed by > > the HTML parser. Chrome produces the same DOM structure based off that HTML. > Got it. But I still worry about the DOM structure of the webpage[1] is not > correct. Are you sure that the DOM tree result of it is correct by following > the DOM spec? I'm very much not an expert, and this is outside the scope of this bug, but it looks to me like, because "/>" is not a valid way of closing an HTML element. I believe this bit of the spec: https://dev.w3.org/html5/spec-preview/tokenization.html , see step << A start tag whose tag name is "a" >> , has an example that roughly matches what's going on here. If you're very concerned, file a separate bug against the html5 parser and someone (probably Hixie) will have to look at it.

:Gijs (he/him)

Comment 8

•

9 years ago

(In reply to :Gijs Kruitbosch from comment #7) > (In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #6) > > (In reply to :Gijs Kruitbosch from comment #3) > > > The xmlserializer operates on the DOM. So you should compare the > > > serialization and the DOM, not the serialization with the HTML text as sent. > > > AFAICT the <a> tag is there in the DOM, so it is dutifully serialized by > > > xmlserializer. > > > > > > It feels like this will be some obscure consequence of how the content > > > (which tries to be XHTML but is really parsed as HTML) is being parsed by > > > the HTML parser. Chrome produces the same DOM structure based off that HTML. > > Got it. But I still worry about the DOM structure of the webpage[1] is not > > correct. Are you sure that the DOM tree result of it is correct by following > > the DOM spec? > > I'm very much not an expert, and this is outside the scope of this bug, but > it looks to me like, because "/>" is not a valid way of closing an HTML > element. ... the link didn't get closed before the start of the <p> and gets reconstructed afterwards.

:Gijs (he/him)

Comment 9

•

9 years ago

mozreview-review

Comment on attachment 8807089 [details] Bug 1240035 - Do not show underline when a tag is without href, https://reviewboard.mozilla.org/r/90366/#review90074 I just checked, and AFAICT this is already fixed on nightly - bug 1170955 fixed the selector a few lines up from here to use a:link instead of just a, so while it's still broken on 50 it's working on Nightly, and this patch is unnecessary.

Attachment #8807089 - Flags: review?(gijskruitbosch+bugs) → review-

Evan Tseng [:evanxd]

Assignee

Comment 10

•

9 years ago

(In reply to :Gijs Kruitbosch from comment #9) > Comment on attachment 8807089 [details] > Bug 1240035 - Do not show underline when a tag is without href, > > https://reviewboard.mozilla.org/r/90366/#review90074 > > I just checked, and AFAICT this is already fixed on nightly - bug 1170955 > fixed the selector a few lines up from here to use a:link instead of just a, > so while it's still broken on 50 it's working on Nightly, and this patch is > unnecessary. Yes, you're right. So this bug is just a duplicate of Bug 1170955.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → DUPLICATE

Evan Tseng [:evanxd]

Assignee

Updated

•

9 years ago

Assignee: nobody → evan

Evan Tseng [:evanxd]

Assignee

Comment 11

•

9 years ago

(In reply to :Gijs Kruitbosch from comment #7) > (In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #6) > > (In reply to :Gijs Kruitbosch from comment #3) > > > The xmlserializer operates on the DOM. So you should compare the > > > serialization and the DOM, not the serialization with the HTML text as sent. > > > AFAICT the <a> tag is there in the DOM, so it is dutifully serialized by > > > xmlserializer. > > > > > > It feels like this will be some obscure consequence of how the content > > > (which tries to be XHTML but is really parsed as HTML) is being parsed by > > > the HTML parser. Chrome produces the same DOM structure based off that HTML. > > Got it. But I still worry about the DOM structure of the webpage[1] is not > > correct. Are you sure that the DOM tree result of it is correct by following > > the DOM spec? > > I'm very much not an expert, and this is outside the scope of this bug, but > it looks to me like, because "/>" is not a valid way of closing an HTML > element. I believe this bit of the spec: > https://dev.w3.org/html5/spec-preview/tokenization.html , see step << A > start tag whose tag name is "a" >> , has an example that roughly matches > what's going on here. Looks like that is what's going on here.

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Page from books.portonvictor.org is misparsed and everything gets stuck in an <a> tag

Categories

(Toolkit :: Reader Mode, defect, P2)

Tracking

()

People

(Reporter: porton, Assigned: evanxd)

References

(Blocks 2 open bugs,
URL
)

Details

(Whiteboard: [reader-mode-readability-algorithm])

Crash Data

Security

(public)

User Story

Attachments

(3 files)

Description

Updated

Comment 1

Updated

Updated

Updated

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Updated

Comment 11

Attachment

General

Description

File Name

Content Type