Closed Bug 1153920 Opened 9 years ago Closed 3 years ago

Bring htmlparser ampersand-error-reporting into conformance with HTML spec

Tracking

()

Status:

RESOLVED FIXED

Milestone:

93 Branch

Tracking Flags:

Tracking

Status

firefox93

---

fixed

People

(Reporter: sideshowbarker, Assigned: hsivonen)

References

Details

Attachments

(1 file, 7 obsolete files)

patch-validator-ampersands 9 years ago Michael[tm] Smith [:sideshowbarker] 7.23 KB, patch		Details \| Diff \| Splinter Review
patch for Java sources 9 years ago Michael[tm] Smith [:sideshowbarker] 6.95 KB, patch		Details \| Diff \| Splinter Review
patch for Gecko sources 9 years ago Michael[tm] Smith [:sideshowbarker] 12.63 KB, patch		Details \| Diff \| Splinter Review
Patch for Java sources. 9 years ago Michael[tm] Smith [:sideshowbarker] 6.86 KB, patch		Details \| Diff \| Splinter Review
Patch for Gecko sources. 9 years ago Michael[tm] Smith [:sideshowbarker] 12.44 KB, patch		Details \| Diff \| Splinter Review
Bug 1153920 - Conform ampersand error reporting to HTML spec. 4 years ago Henri Sivonen (:hsivonen) 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1153920 addendum - Limit ambiguous-ampersand wasteful buffering. 4 years ago Henri Sivonen (:hsivonen) 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1153920 test - Avoid marking non-error ampersands as errors in View Source. 3 years ago Henri Sivonen (:hsivonen) 48 bytes, text/x-phabricator-request		Details \| Review

Michael[tm] Smith [:sideshowbarker]

Reporter

Description

•

9 years ago

see https://bugzilla.validator.nu/show_bug.cgi?id=841

This is a long-standing known issue for which we continue to get validator bug reports. This is the issue that, e.g., ampersands in URLs in href and src values etc. such as http://example.com?foo=bar&hoge=moge are not errors according to the current HTML spec. So reporting errors for them is non-conforming.

Michael[tm] Smith [:sideshowbarker]

Reporter

Comment 1

•

9 years ago

Attached patch patch-validator-ampersands (obsolete) — Details — Splinter Review

Attachment #8591755 - Flags: review?(hsivonen)

Michael[tm] Smith [:sideshowbarker]

Reporter

Comment 2

•

9 years ago

Attached patch patch for Java sources (obsolete) — Details — Splinter Review

Attachment #8591755 - Attachment is obsolete: true

Attachment #8591755 - Flags: review?(hsivonen)

Attachment #8591774 - Flags: review?(hsivonen)

Michael[tm] Smith [:sideshowbarker]

Reporter

Comment 3

•

9 years ago

Attached patch patch for Gecko sources (obsolete) — Details — Splinter Review

Attachment #8591781 - Flags: review?(hsivonen)

Henri Sivonen (:hsivonen)

Assignee

Comment 4

•

9 years ago

Comment on attachment 8591774 [details] [diff] [review]
patch for Java sources

Review of attachment 8591774 [details] [diff] [review]:
-----------------------------------------------------------------

::: src/nu/validator/htmlparser/impl/Tokenizer.java
@@ +3244,5 @@
>                      // XXX reorder point
> +                case AMBIGUOUS_AMPERSAND:
> +                    ampersandloop: for (;;) {
> +                        if (reconsume) {
> +                            reconsume = false;

It seems that this state is special in the sense that it's always entered into with reconsume set to true. Perhaps that should be asserted here and the 'else' branch be removed.

@@ +3259,5 @@
> +                                || (c >= 'a' && c <= 'z')) {
> +                            appendLongStrBuf(c);
> +                            continue;
> +                        }
> +                        pos--;

This is a pattern that doesn't occur elsewhere in the tokenizer. It seems that what's wanted here is
reconsume = true;

reconsume is set to false near the start of this state. Right now, it seems to me that this state should never consume a character and should always be entered and left with reconsume set to true. This makes this state rather unusual. I will need to think more to understand why it is OK to call appendLongStrBuf(c) in this kind of state that doesn't consume anything and exists mainly for error reporting.

Michael[tm] Smith [:sideshowbarker]

Reporter

Comment 5

•

9 years ago

Attached patch Patch for Java sources. (obsolete) — Details — Splinter Review

Attachment #8591774 - Attachment is obsolete: true

Attachment #8591774 - Flags: review?(hsivonen)

Attachment #8600565 - Flags: review?(hsivonen)

Michael[tm] Smith [:sideshowbarker]

Reporter

Comment 6

•

9 years ago

Attached patch Patch for Gecko sources. (obsolete) — Details — Splinter Review

Attachment #8591781 - Attachment is obsolete: true

Attachment #8591781 - Flags: review?(hsivonen)

Attachment #8600590 - Flags: review?(hsivonen)

Michael[tm] Smith [:sideshowbarker]

Reporter

Comment 7

•

9 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #4)
> Comment on attachment 8591774 [details] [diff] [review]
> patch for Java sources
> 
> Review of attachment 8591774 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: src/nu/validator/htmlparser/impl/Tokenizer.java
> @@ +3244,5 @@
> >                      // XXX reorder point
> > +                case AMBIGUOUS_AMPERSAND:
> > +                    ampersandloop: for (;;) {
> > +                        if (reconsume) {
> > +                            reconsume = false;
> 
> It seems that this state is special in the sense that it's always entered
> into with reconsume set to true. Perhaps that should be asserted here and
> the 'else' branch be removed.

OK, I've made it so in the updated patches most-recently attached.
 
> @@ +3259,5 @@
> > +                                || (c >= 'a' && c <= 'z')) {
> > +                            appendLongStrBuf(c);
> > +                            continue;
> > +                        }
> > +                        pos--;
> 
> This is a pattern that doesn't occur elsewhere in the tokenizer. It seems
> that what's wanted here is
> reconsume = true;
> 
> reconsume is set to false near the start of this state. Right now, it seems
> to me that this state should never consume a character and should always be
> entered and left with reconsume set to true. This makes this state rather
> unusual. I will need to think more to understand why it is OK to call
> appendLongStrBuf(c) in this kind of state that doesn't consume anything and
> exists mainly for error reporting.

For that, I've not made any change yet because I couldn't tell if you were saying that we should go ahead and make a change here or if you wanted to hold off any make a change for this until you've had a chance to consider it further.

Henri Sivonen (:hsivonen)

Assignee

Updated

•

9 years ago

Status: UNCONFIRMED → NEW

Ever confirmed: true

Henri Sivonen (:hsivonen)

Assignee

Comment 8

•

4 years ago

Comment on attachment 8600565 [details] [diff] [review]
Patch for Java sources.

Review moved to GitHub after extreme delay. Sorry.

Attachment #8600565 - Flags: review?(hsivonen)

Henri Sivonen (:hsivonen)

Assignee

Comment 9

•

4 years ago

Comment on attachment 8600590 [details] [diff] [review]
Patch for Gecko sources.

Review moved to GitHub after extreme delay. Sorry.

Attachment #8600590 - Flags: review?(hsivonen)

Henri Sivonen (:hsivonen)

Assignee

Updated

•

4 years ago

Attachment #8600565 - Attachment is obsolete: true

Henri Sivonen (:hsivonen)

Assignee

Updated

•

4 years ago

Attachment #8600590 - Attachment is obsolete: true

Henri Sivonen (:hsivonen)

Assignee

Comment 10

•

4 years ago

Attached file Bug 1153920 - Conform ampersand error reporting to HTML spec. — Details

Henri Sivonen (:hsivonen)

Assignee

Comment 11

•

4 years ago

Attached file Bug 1153920 addendum - Limit ambiguous-ampersand wasteful buffering. (obsolete) — Details

This refines 9ce4bd4 by only buffering if we’re actually inside an
attribute value.

Phabricator Automation

Updated

•

4 years ago

Assignee: nobody → hsivonen

Status: NEW → ASSIGNED

Henri Sivonen (:hsivonen)

Assignee

Comment 12

•

4 years ago

This patch appears to change the DOM in an unwanted way:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=59a15640760bc1c1a4629c90b849cfa43a20c057&selectedTaskRun=GdCS5FrVRbCXm2jscZdZkw.0

Michael[tm] Smith [:sideshowbarker]

Reporter

Comment 13

•

4 years ago

(In reply to Henri Sivonen (:hsivonen) (not reading bugmail until 2020-08-03) from comment #12)

This patch appears to change the DOM in an unwanted way:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=59a15640760bc1c1a4629c90b849cfa43a20c057&selectedTaskRun=GdCS5FrVRbCXm2jscZdZkw.0

Using the current Java sources for the parser — including the https://phabricator.services.mozilla.com/D81992 patch to the Java sources — I can’t reproduce the test failures shown at https://treeherder.mozilla.org/#/jobs?repo=try&revision=59a15640760bc1c1a4629c90b849cfa43a20c057&selectedTaskRun=GdCS5FrVRbCXm2jscZdZkw.0

For example, given this input:

<html><head><body>&ammmp;

…the Java parser produces the expected output — specifically, it outputs the &ammmp; as expected.

See https://validator.w3.org/nu/parsetree/?parser=html5&content=%3Chtml%3E%3Chead%3E%3Cbody%3E%26ammmp%3B&submit=Print+Tree

So I’m wondering if it’s possible that some difference could be getting introduced during the Java-to-C++ translation.

Michael[tm] Smith [:sideshowbarker]

Reporter

Comment 14

•

4 years ago

The Java parser can also be tested by doing this:

curl -L -O https://github.com/validator/validator/releases/download/jar/vnu.jar \
  && echo "<html><head><body>&ammmp;" > TEST.html 
  && java -cp vnu.jar nu.validator.htmlparser.test.TreePrinter TEST.html

That outputs this tree:

| <html>
|   <head>
|   <body>
|     "&ammmp;
"

Henri Sivonen (:hsivonen)

Assignee

Comment 15

•

4 years ago

For Bugzilla record: This is being discussed on GitHub. The difference is not from C++ but from where the test harness puts buffer boundaries.

Henri Sivonen (:hsivonen)

Assignee

Comment 16

•

3 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2a9f43fb9c89f65549d7746c7d6a5f22b7ee6b8f

Henri Sivonen (:hsivonen)

Assignee

Comment 17

•

3 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6899e3b9cbc6854802996e0fb14370c3890ea6aa

Phabricator Automation

Updated

•

3 years ago

Attachment #9161183 - Attachment is obsolete: true

Phabricator Automation

Updated

•

3 years ago

Attachment #9160900 - Attachment description: Bug 1153920 - Conform ampersand-error reporting to HTML spec. → Bug 1153920 - Conform ampersand-error reporting to HTML spec

Henri Sivonen (:hsivonen)

Assignee

Comment 18

•

3 years ago

Attached file Bug 1153920 test - Avoid marking non-error ampersands as errors in View Source. (obsolete) — Details

Phabricator Automation

Updated

•

3 years ago

Attachment #9160900 - Attachment description: Bug 1153920 - Conform ampersand-error reporting to HTML spec → Bug 1153920 - Conform ampersand error reporting to HTML spec.

Phabricator Automation

Updated

•

3 years ago

Attachment #9237036 - Attachment is obsolete: true

Henri Sivonen (:hsivonen)

Assignee

Comment 19

•

3 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=acbaa4bfc7df3ef277aec828b230c90d114e462a

Henri Sivonen (:hsivonen)

Assignee

Comment 20

•

3 years ago

https://github.com/validator/htmlparser/commit/44427c0df60ec899c52b149bba03e1abffad2f5e

Pulsebot

Comment 21

•

3 years ago

Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f8b2924ea4a3
Conform ampersand error reporting to HTML spec. r=smaug

Web Platform Test Sync Bot [:wpt-sync] (Matrix: #interop:mozilla.org)

Comment 22

•

3 years ago

Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/30301 for changes under testing/web-platform/tests

Alexandru Michis [:malexandru]

Comment 23

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/f8b2924ea4a3

Status: ASSIGNED → RESOLVED

Closed: 3 years ago

status-firefox93: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 93 Branch

Web Platform Test Sync Bot [:wpt-sync] (Matrix: #interop:mozilla.org)

Comment 24

•

3 years ago

Upstream PR merged by moz-wptsync-bot

You need to log in before you can comment on or make changes to this bug.