Bring htmlparser ampersand-error-reporting into conformance with HTML spec
Categories
(Core :: DOM: HTML Parser, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox93 | --- | fixed |
People
(Reporter: sideshowbarker, Assigned: hsivonen)
References
Details
Attachments
(1 file, 7 obsolete files)
see https://bugzilla.validator.nu/show_bug.cgi?id=841 This is a long-standing known issue for which we continue to get validator bug reports. This is the issue that, e.g., ampersands in URLs in href and src values etc. such as http://example.com?foo=bar&hoge=moge are not errors according to the current HTML spec. So reporting errors for them is non-conforming.
Reporter | ||
Comment 1•9 years ago
|
||
Reporter | ||
Comment 2•9 years ago
|
||
Reporter | ||
Comment 3•9 years ago
|
||
Assignee | ||
Comment 4•9 years ago
|
||
Comment on attachment 8591774 [details] [diff] [review] patch for Java sources Review of attachment 8591774 [details] [diff] [review]: ----------------------------------------------------------------- ::: src/nu/validator/htmlparser/impl/Tokenizer.java @@ +3244,5 @@ > // XXX reorder point > + case AMBIGUOUS_AMPERSAND: > + ampersandloop: for (;;) { > + if (reconsume) { > + reconsume = false; It seems that this state is special in the sense that it's always entered into with reconsume set to true. Perhaps that should be asserted here and the 'else' branch be removed. @@ +3259,5 @@ > + || (c >= 'a' && c <= 'z')) { > + appendLongStrBuf(c); > + continue; > + } > + pos--; This is a pattern that doesn't occur elsewhere in the tokenizer. It seems that what's wanted here is reconsume = true; reconsume is set to false near the start of this state. Right now, it seems to me that this state should never consume a character and should always be entered and left with reconsume set to true. This makes this state rather unusual. I will need to think more to understand why it is OK to call appendLongStrBuf(c) in this kind of state that doesn't consume anything and exists mainly for error reporting.
Reporter | ||
Comment 5•9 years ago
|
||
Reporter | ||
Comment 6•9 years ago
|
||
Reporter | ||
Comment 7•9 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #4) > Comment on attachment 8591774 [details] [diff] [review] > patch for Java sources > > Review of attachment 8591774 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: src/nu/validator/htmlparser/impl/Tokenizer.java > @@ +3244,5 @@ > > // XXX reorder point > > + case AMBIGUOUS_AMPERSAND: > > + ampersandloop: for (;;) { > > + if (reconsume) { > > + reconsume = false; > > It seems that this state is special in the sense that it's always entered > into with reconsume set to true. Perhaps that should be asserted here and > the 'else' branch be removed. OK, I've made it so in the updated patches most-recently attached. > @@ +3259,5 @@ > > + || (c >= 'a' && c <= 'z')) { > > + appendLongStrBuf(c); > > + continue; > > + } > > + pos--; > > This is a pattern that doesn't occur elsewhere in the tokenizer. It seems > that what's wanted here is > reconsume = true; > > reconsume is set to false near the start of this state. Right now, it seems > to me that this state should never consume a character and should always be > entered and left with reconsume set to true. This makes this state rather > unusual. I will need to think more to understand why it is OK to call > appendLongStrBuf(c) in this kind of state that doesn't consume anything and > exists mainly for error reporting. For that, I've not made any change yet because I couldn't tell if you were saying that we should go ahead and make a change here or if you wanted to hold off any make a change for this until you've had a chance to consider it further.
Assignee | ||
Updated•9 years ago
|
Assignee | ||
Comment 8•4 years ago
|
||
Comment on attachment 8600565 [details] [diff] [review] Patch for Java sources. Review moved to GitHub after extreme delay. Sorry.
Assignee | ||
Comment 9•4 years ago
|
||
Comment on attachment 8600590 [details] [diff] [review] Patch for Gecko sources. Review moved to GitHub after extreme delay. Sorry.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 10•4 years ago
|
||
Assignee | ||
Comment 11•4 years ago
|
||
This refines 9ce4bd4 by only buffering if we’re actually inside an
attribute value.
Updated•4 years ago
|
Assignee | ||
Comment 12•4 years ago
|
||
This patch appears to change the DOM in an unwanted way:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=59a15640760bc1c1a4629c90b849cfa43a20c057&selectedTaskRun=GdCS5FrVRbCXm2jscZdZkw.0
Reporter | ||
Comment 13•4 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) (not reading bugmail until 2020-08-03) from comment #12)
This patch appears to change the DOM in an unwanted way:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=59a15640760bc1c1a4629c90b849cfa43a20c057&selectedTaskRun=GdCS5FrVRbCXm2jscZdZkw.0
Using the current Java sources for the parser — including the https://phabricator.services.mozilla.com/D81992 patch to the Java sources — I can’t reproduce the test failures shown at https://treeherder.mozilla.org/#/jobs?repo=try&revision=59a15640760bc1c1a4629c90b849cfa43a20c057&selectedTaskRun=GdCS5FrVRbCXm2jscZdZkw.0
For example, given this input:
<html><head><body>&ammmp;
…the Java parser produces the expected output — specifically, it outputs the &ammmp;
as expected.
So I’m wondering if it’s possible that some difference could be getting introduced during the Java-to-C++ translation.
Reporter | ||
Comment 14•4 years ago
|
||
The Java parser can also be tested by doing this:
curl -L -O https://github.com/validator/validator/releases/download/jar/vnu.jar \
&& echo "<html><head><body>&ammmp;" > TEST.html
&& java -cp vnu.jar nu.validator.htmlparser.test.TreePrinter TEST.html
That outputs this tree:
| <html>
| <head>
| <body>
| "&ammmp;
"
Assignee | ||
Comment 15•4 years ago
|
||
For Bugzilla record: This is being discussed on GitHub. The difference is not from C++ but from where the test harness puts buffer boundaries.
Assignee | ||
Comment 16•3 years ago
|
||
Assignee | ||
Comment 17•3 years ago
|
||
Updated•3 years ago
|
Updated•3 years ago
|
Assignee | ||
Comment 18•3 years ago
|
||
Updated•3 years ago
|
Updated•3 years ago
|
Assignee | ||
Comment 19•3 years ago
|
||
Assignee | ||
Comment 20•3 years ago
|
||
Comment 21•3 years ago
|
||
Pushed by hsivonen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f8b2924ea4a3 Conform ampersand error reporting to HTML spec. r=smaug
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/30301 for changes under testing/web-platform/tests
Comment 23•3 years ago
|
||
bugherder |
Upstream PR merged by moz-wptsync-bot
Description
•