Closed Bug 1153920 Opened 9 years ago Closed 3 years ago

Bring htmlparser ampersand-error-reporting into conformance with HTML spec

Categories

(Core :: DOM: HTML Parser, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
93 Branch
Tracking Status
firefox93 --- fixed

People

(Reporter: sideshowbarker, Assigned: hsivonen)

References

Details

Attachments

(1 file, 7 obsolete files)

see https://bugzilla.validator.nu/show_bug.cgi?id=841

This is a long-standing known issue for which we continue to get validator bug reports. This is the issue that, e.g., ampersands in URLs in href and src values etc. such as http://example.com?foo=bar&hoge=moge are not errors according to the current HTML spec. So reporting errors for them is non-conforming.
Attached patch patch-validator-ampersands (obsolete) — Splinter Review
Attachment #8591755 - Flags: review?(hsivonen)
Attached patch patch for Java sources (obsolete) — Splinter Review
Attachment #8591755 - Attachment is obsolete: true
Attachment #8591755 - Flags: review?(hsivonen)
Attachment #8591774 - Flags: review?(hsivonen)
Attached patch patch for Gecko sources (obsolete) — Splinter Review
Attachment #8591781 - Flags: review?(hsivonen)
Comment on attachment 8591774 [details] [diff] [review]
patch for Java sources

Review of attachment 8591774 [details] [diff] [review]:
-----------------------------------------------------------------

::: src/nu/validator/htmlparser/impl/Tokenizer.java
@@ +3244,5 @@
>                      // XXX reorder point
> +                case AMBIGUOUS_AMPERSAND:
> +                    ampersandloop: for (;;) {
> +                        if (reconsume) {
> +                            reconsume = false;

It seems that this state is special in the sense that it's always entered into with reconsume set to true. Perhaps that should be asserted here and the 'else' branch be removed.

@@ +3259,5 @@
> +                                || (c >= 'a' && c <= 'z')) {
> +                            appendLongStrBuf(c);
> +                            continue;
> +                        }
> +                        pos--;

This is a pattern that doesn't occur elsewhere in the tokenizer. It seems that what's wanted here is
reconsume = true;

reconsume is set to false near the start of this state. Right now, it seems to me that this state should never consume a character and should always be entered and left with reconsume set to true. This makes this state rather unusual. I will need to think more to understand why it is OK to call appendLongStrBuf(c) in this kind of state that doesn't consume anything and exists mainly for error reporting.
Attached patch Patch for Java sources. (obsolete) — Splinter Review
Attachment #8591774 - Attachment is obsolete: true
Attachment #8591774 - Flags: review?(hsivonen)
Attachment #8600565 - Flags: review?(hsivonen)
Attached patch Patch for Gecko sources. (obsolete) — Splinter Review
Attachment #8591781 - Attachment is obsolete: true
Attachment #8591781 - Flags: review?(hsivonen)
Attachment #8600590 - Flags: review?(hsivonen)
(In reply to Henri Sivonen (:hsivonen) from comment #4)
> Comment on attachment 8591774 [details] [diff] [review]
> patch for Java sources
> 
> Review of attachment 8591774 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: src/nu/validator/htmlparser/impl/Tokenizer.java
> @@ +3244,5 @@
> >                      // XXX reorder point
> > +                case AMBIGUOUS_AMPERSAND:
> > +                    ampersandloop: for (;;) {
> > +                        if (reconsume) {
> > +                            reconsume = false;
> 
> It seems that this state is special in the sense that it's always entered
> into with reconsume set to true. Perhaps that should be asserted here and
> the 'else' branch be removed.

OK, I've made it so in the updated patches most-recently attached.
 
> @@ +3259,5 @@
> > +                                || (c >= 'a' && c <= 'z')) {
> > +                            appendLongStrBuf(c);
> > +                            continue;
> > +                        }
> > +                        pos--;
> 
> This is a pattern that doesn't occur elsewhere in the tokenizer. It seems
> that what's wanted here is
> reconsume = true;
> 
> reconsume is set to false near the start of this state. Right now, it seems
> to me that this state should never consume a character and should always be
> entered and left with reconsume set to true. This makes this state rather
> unusual. I will need to think more to understand why it is OK to call
> appendLongStrBuf(c) in this kind of state that doesn't consume anything and
> exists mainly for error reporting.

For that, I've not made any change yet because I couldn't tell if you were saying that we should go ahead and make a change here or if you wanted to hold off any make a change for this until you've had a chance to consider it further.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment on attachment 8600565 [details] [diff] [review]
Patch for Java sources.

Review moved to GitHub after extreme delay. Sorry.
Attachment #8600565 - Flags: review?(hsivonen)
Comment on attachment 8600590 [details] [diff] [review]
Patch for Gecko sources.

Review moved to GitHub after extreme delay. Sorry.
Attachment #8600590 - Flags: review?(hsivonen)
Attachment #8600565 - Attachment is obsolete: true
Attachment #8600590 - Attachment is obsolete: true

This refines 9ce4bd4 by only buffering if we’re actually inside an
attribute value.

Assignee: nobody → hsivonen
Status: NEW → ASSIGNED

(In reply to Henri Sivonen (:hsivonen) (not reading bugmail until 2020-08-03) from comment #12)

This patch appears to change the DOM in an unwanted way:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=59a15640760bc1c1a4629c90b849cfa43a20c057&selectedTaskRun=GdCS5FrVRbCXm2jscZdZkw.0

Using the current Java sources for the parser — including the https://phabricator.services.mozilla.com/D81992 patch to the Java sources — I can’t reproduce the test failures shown at https://treeherder.mozilla.org/#/jobs?repo=try&revision=59a15640760bc1c1a4629c90b849cfa43a20c057&selectedTaskRun=GdCS5FrVRbCXm2jscZdZkw.0

For example, given this input:

<html><head><body>&ammmp;

…the Java parser produces the expected output — specifically, it outputs the &ammmp; as expected.

See https://validator.w3.org/nu/parsetree/?parser=html5&content=%3Chtml%3E%3Chead%3E%3Cbody%3E%26ammmp%3B&submit=Print+Tree

So I’m wondering if it’s possible that some difference could be getting introduced during the Java-to-C++ translation.

The Java parser can also be tested by doing this:

curl -L -O https://github.com/validator/validator/releases/download/jar/vnu.jar \
  && echo "<html><head><body>&ammmp;" > TEST.html 
  && java -cp vnu.jar nu.validator.htmlparser.test.TreePrinter TEST.html

That outputs this tree:

| <html>
|   <head>
|   <body>
|     "&ammmp;
"

For Bugzilla record: This is being discussed on GitHub. The difference is not from C++ but from where the test harness puts buffer boundaries.

Attachment #9161183 - Attachment is obsolete: true
Attachment #9160900 - Attachment description: Bug 1153920 - Conform ampersand-error reporting to HTML spec. → Bug 1153920 - Conform ampersand-error reporting to HTML spec
Attachment #9160900 - Attachment description: Bug 1153920 - Conform ampersand-error reporting to HTML spec → Bug 1153920 - Conform ampersand error reporting to HTML spec.
Attachment #9237036 - Attachment is obsolete: true
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f8b2924ea4a3
Conform ampersand error reporting to HTML spec. r=smaug
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/30301 for changes under testing/web-platform/tests
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 93 Branch
Upstream PR merged by moz-wptsync-bot
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: