Closed Bug 155047 Opened 23 years ago Closed 23 years ago

Character entity names missing ';' are substituted in URLs / page requests

Categories

(Core :: DOM: HTML Parser, defect)

x86
Windows 2000
defect
Not set
major

RESOLVED INVALID

People

(Reporter: lilienthal_, Assigned: harishd)

Details

Certain character entity names of the form &amp;name (missing the ';') are substituted in URLs / page requests. Because we used character entity names (without ';') in the '&amp;'-separated variable list to pass values to CGI scripts, this caused serious problems.
It happens for: &amp;nbsp, &amp;pound, &amp;yen, &amp;deg, &amp;cent, &amp;#123
but not for: &amp;plus, &amp;period, &amp;equals, &amp;dollar
I found this note (but not the SGML or XML specification):
Note: In SGML, it is possible to eliminate the final ";" after a numeric or named character reference in some cases (e.g., at a line break or directly before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word).
But even if this makes sense in normal page text, it's definitely a bug to substitute those character entities in addresses / page requests.
Examples:
<a href="http://www.test.com/test.pl?&nbsp=error&pound=error&yen=error&deg=error&cent=error&plus=ok&period=ok&equals=ok&dollar=ok"> http://www.test.com/test.pl?&nbsp=error&pound=error&yen=error&deg=error&cent=error&plus=ok&period=ok&equals=ok&dollar=ok </a>
http://www.test.com/test.pl?&nbsp=error&pound=error&yen=error&deg=error&cent=error&plus=ok&period=ok&equals=ok&dollar=ok
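The substitution described in the report can be reproduced outside a browser. The sketch below uses Python's standard `html.unescape`, which follows the HTML5 named character reference table rather than Gecko's parser of the time, so it is only an illustration of the same class of behavior; the variable names are made up for the example.

```python
from html import unescape

# Query string from the report, with unescaped '&' separators.
query = ("&nbsp=error&pound=error&yen=error&deg=error&cent=error"
         "&plus=ok&period=ok&equals=ok&dollar=ok")

# unescape() expands references it recognises even without a trailing ';'
# and leaves names it does not recognise in that form untouched.
decoded = unescape(query)
print(decoded)

# The numeric reference from the report is expanded as well.
brace = unescape("&#123")
print(brace)
```

With this table, the first five parameter names are replaced by the characters they name, while `&plus`, `&period`, `&equals` and `&dollar` survive, which matches the split the reporter observed.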
No, it's not actually a bug. You just need to write your URIs just like you wrote your Bugzilla comment: <a href="http://www.test.com/test.pl?&amp;nbsp=error&amp;pound=error&amp;yen=error&amp;deg=error&amp;cent=error&amp;plus=ok&amp;period=ok&amp;equals=ok&amp;dollar=ok"> Text </a> since entity expansion should and does happen in all attribute values. The differentiation between "normal text" and "something else" is not really relevant to an SGML processor in this case.
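When the markup is generated by a script, the escaping recommended above can be done mechanically. This is a minimal sketch using Python's standard library; the parameter names and the `test.com` URL are taken from the report, and the variable names are illustrative only.

```python
from html import escape, unescape
from urllib.parse import urlencode

# Build the raw URL first; '&' is only a parameter separator at this stage.
params = {"nbsp": "error", "pound": "error", "plus": "ok"}
url = "http://www.test.com/test.pl?" + urlencode(params)

# Escape for use inside an HTML attribute: every '&' becomes '&amp;'.
href = escape(url, quote=True)
anchor = f'<a href="{href}">Text</a>'
print(anchor)

# Decoding the attribute value gives back the intended URL.
roundtrip = unescape(href)
```

Keeping URL construction and HTML escaping as two separate steps means the CGI script still receives plain `&`-separated parameters.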
Assignee: Matti → harishd
Component: Browser-General → Parser
QA Contact: imajes-qa → moied
Whiteboard: DUPEME
Perfectly correct. In SGML applications permitting OMITTAG and/or SHORTTAG (such as HTML), this is an integral part of entity reference recognition, regardless of context. INVALID.
Status: UNCONFIRMED → RESOLVED
Closed: 23 years ago
Resolution: --- → INVALID
*** Bug 256264 has been marked as a duplicate of this bug. ***
*** Bug 278404 has been marked as a duplicate of this bug. ***
*** Bug 290101 has been marked as a duplicate of this bug. ***
*** Bug 293009 has been marked as a duplicate of this bug. ***
This doesn't make sense to me. https://bugzilla.mozilla.org/show_bug.cgi?id=484389 Here I have a test case where I don't have the ";" after "lang". So if a web page just says "&language", then it will show the left angle mark? That defies common sense to me.
No, &language would be an unknown entity. But you have &lang=. Please look up the SGML rules on entity name termination (hint: while a ';' does terminate entity names, it is NOT the only character that does so).
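The difference between `&language` and `&lang=` can be sketched with a toy expander. This is a deliberate simplification, not the SGML rule set: it assumes a name is a letter followed by letters, digits, '.', '-', '_' or ':', that any other character ends the name, and it uses a hypothetical three-entry entity table.

```python
import re

# Hypothetical, tiny entity table for illustration only.
ENTITIES = {"lang": "\u2329", "nbsp": "\u00a0", "amp": "&"}

# Assumption: any character outside this class terminates the name,
# so '=' ends a name just as ';' does.
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9._:-]*)(;?)")

def expand(text):
    """Replace known entity references; leave unknown names untouched."""
    def replace(match):
        name = match.group(1)
        if name in ENTITIES:
            return ENTITIES[name]   # an optional trailing ';' is consumed
        return match.group(0)       # unknown name: keep the original text
    return REFERENCE.sub(replace, text)

print(expand("&language"))            # name is "language": unknown, kept
print(expand("test.pl?a=1&lang=en"))  # name is "lang", ended by '='
```

In the first call the whole word `language` is read as the name, so nothing is replaced; in the second, `=` ends the name after `lang`, which is in the table.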
FROM https://listserv.heanet.ie/cgi-bin/wa?A3=ind9503&L=HTML-WG&E=0&P=982677&B=--&T=text%2Fplain
2.2.4 Entity References
SGML uses entity references, indicated by an ampersand (&) and immediately followed by a name and terminated by a semicolon (;), to represent a named substitution of data (the entity). HTML 2.0 only uses entity references to represent peculiar and special characters. The reference can be used in place of a character when the character itself would be misinterpreted as markup. The entity sets defined for use by HTML 2.0 documents are listed in Section 13.
FROM http://en.wikipedia.org/wiki/SGML_entity "Parameter entities are referenced by placing the entity name between "%" and ";". Parsed general entities are referenced by placing the entity name between "&" and ";". Unparsed entities are referenced by placing the entity name in the value of an attribute declared as type ENTITY."
FROM http://www.w3.org/TR/html4/sgml/entities.html Here it doesn't say what the other "termination" characters for entity references are, other than the semicolon. If "=" is a termination character, then please point me to the document that lists all characters that can be considered to terminate an entity reference, in SGML or HTML.
Maybe I should have been clearer. To look up the precise rules for this (which are fairly complicated), you'll have to go to the SGML standard, not to tutorials which oversimplify. Unfortunately, this standard is not available in electronic form due to copyright issues, last I checked; you might need to visit your local library or bookstore.
Here's what HTML4 has to say on the matter, in an informative note at <http://www.w3.org/TR/html4/charset.html#h-5.3>:
In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
The Wikipedia article you cite simply presents the situation from the author's point of view: any time you _want_ to use an entity, you should stick a ';' on the end. If you don't, behavior becomes complicated. The UA point of view is that the complicated behavior is what you implement.
Of course, from the author point of view, if you do NOT want to use an entity, then you need to escape the '&' character instead of producing invalid documents and then relying on unknown entity handling.
In any case, the &lang issue is bug 107320.
Just for the record, neither IE nor Opera shows the entity. You can test with an .html file that *only* has this line: pa&ge in new tab No tags, just a single string. Considering the different rendering of the existing engines, maybe the W3C should add 'entities without closing semicolon' to their list of SGML features with limited support: http://www.w3.org/TR/html4/appendix/notes.html#sgmlfeatures
Yeah, most likely. Note that Safari (and any other webkit-based browser) has the same behavior as Gecko here, and that Opera used to have this behavior as well; I don't know when they changed it.