Closed Bug 13015 Opened 25 years ago Closed 25 years ago

Mozilla confused by comments in titles

Categories

(Core :: DOM: HTML Parser, defect, P3)

Tracking

()

VERIFIED WORKSFORME

People

(Reporter: jmorzins, Assigned: harishd)

References

()

Details

What were you trying to do?
    Read the web page http://technology.news.com.au/news/4276366.htm ,
    a page that contains a comment in its title.

What's wrong?
    Mozilla does not realize that the comment is a comment,
    and renders it as part of the page's title.

What should have happened:
    Mozilla should have omitted the comment before memorizing
    the page's title.

Got any documentation?
    I've put up a simple example page at
        http://web.mit.edu/jmorzins/www/netscape-title-bug.html
    The page is accepted as valid HTML 4.0 by validator.w3.org,
    and contains a comment inside its title.  When viewed in
    Netscape, the browser is unaware of the comment, and treats
    the page as if its title were
        "<!--This title has a comment in it.--> This is the title"


Thank you,
 Jacob Morzinski
Assignee: rickg → harishd
Assigning bug to myself.
Excerpt (w3c.org):
 "Titles may contain character entities (for accented characters, special
  characters, etc.), but may not contain other markup."
Target Milestone: M14
Moving to M14.
The content model of TITLE is #PCDATA, not CDATA, so comments are allowed, I
think.  By other markup, the spec meant other elements.
OS: Linux → All
Hardware: PC → All
Yup, comments in <title>...</title> should be treated exactly like comments in,
say, <span>...</span>.
I asked the question "HTML comments in <title> elements - valid or not?" on the
www-html mailing list, and the list has replied. (See the thread of the same
name in <URL:http://lists.w3.org/Archives/Public/www-html/1999Nov/>, now dormant.)

Summarizing the responses, the important point is that comments are considered
to be markup in HTML - actually, are defined to be markup in the SGML
productions for HTML - see
<URL:http://www.w3.org/MarkUp/SGML/productions.html#prod91>.

Given that the only markup that is allowed in <title>s is character entities,
<URL:http://www.w3.org/TR/REC-html40/struct/global.html#h-7.4.2>,
comments must not occur within <title>s, despite the content model of
title being #PCDATA.

This is one of the places where the specification constrains valid HTML
beyond the constraints imposed by the DTD.  For better or for worse,
not all of the HTML specification's requirements can be expressed in a DTD.
(Under <URL:http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.1> the
following statement about any application of SGML is applicable:
"3.A specification that describes the semantics to be ascribed to the markup.
This specification also imposes syntax restrictions that cannot be expressed
within the DTD.")

(It was pointed out that most major browsers treat the <title> element as
having a content model of CDATA, sidestepping the issue of parsing entities
but no other markup. Mozilla gets this right: "R&eacute;n&eacute;" shows
up as four characters, two of them accented "e"s, and a title like
"Why to avoid using <FONT> tags" shows up exactly like that.)

It looks like this bug report can be marked INVALID.

(BTW, no validator that works from SGML DTDs alone can catch compliance
issues with aspects of the specification that are not encoded in the DTD,
like this one. Such a validator would also allow
WIDTH="ceci n'est pas une pipe" in order to be able to allow WIDTH="50%",
because the DTD declares the type only as
"<!ENTITY % Length "CDATA" -- nn for pixels or nn% for percentage length -->".)
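The gap between the two kinds of check can be sketched in Python. This is only an illustration, not how any real validator is implemented: both function names are hypothetical, the DTD-level check is reduced to "CDATA accepts any string", and the pattern is my reading of the prose comment in the entity declaration ("nn" or "nn%").

```python
import re

# Assumed reading of the DTD's prose comment: digits, optionally followed by "%".
LENGTH_RE = re.compile(r"\d+%?")


def dtd_accepts_width(value: str) -> bool:
    """DTD-only check (simplified): %Length is CDATA, so any string passes."""
    return True


def spec_accepts_width(value: str) -> bool:
    """Check against the prose description: nn pixels or nn% percentage."""
    return LENGTH_RE.fullmatch(value) is not None


for candidate in ("50%", "50", "ceci n'est pas une pipe"):
    print(candidate, dtd_accepts_width(candidate), spec_accepts_width(candidate))
```

Only the second function rejects the nonsense value, which is the same kind of constraint-outside-the-DTD as the rule about markup in titles.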
No, we should be treating comments in <TITLE> blocks the same as in any other
#PCDATA blocks. The spec quote you give is a validity constraint, which means
that documents that break it are invalid. It does not apply to user agents (web
browsers). The part of the spec that applies to the parser is indeed the DTD,
and that says that we should parse comments in TITLE elements.
If markup in #PCDATA in <title>s *could* be honoured just as in any other
#PCDATA, then it would make sense to leave it at that, and the validity
constraint "Titles may contain character entities (for accented characters,
special characters, etc.), but may not contain other markup" at
<URL:http://www.w3.org/TR/REC-html40/struct/global.html#h-7.4.2> would not be
necessary or meaningful. But even if #PCDATA in <title>s were parsed the same
as any other #PCDATA, the markup it contained still could not be used, in the
contexts where <title> data gets used, in the way that the same markup can be
used in other contexts.

As far as I can tell, neither the spec nor the DTD defines what is to be
done by a User Agent with invalidly present comments or other markup. For
comments, the choice is between honouring it (parsing it out) or leaving it
as it is. For other markup, the choices are to try to honour it, parse it
out as if it were unknown markup, or leave it as it is.

Given that the spec *does* say that markup other than character entities
(including comments) is not to appear in <title>s, the writer who
puts it in despite that has no reason to expect any of those choices in
particular, certainly not consistently.

Given those choices, given the lack of direction from the spec,
and given that as a practical matter #PCDATA in <title> is a special case
(whatever else is done, markup other than comments (honoured by removing)
and character entities cannot be honoured in a typical application title bar)
the simplest way to handle the content of <title> elements is exactly how
Mozilla is currently handling it: replace the character entities with
the characters they represent and otherwise display as-is.
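To make the two candidate behaviours concrete, here is a minimal Python sketch. It is not Mozilla's parser code: the function names are hypothetical, and the comment pattern is a simplification that matches only the common <!-- ... --> form rather than full SGML comment syntax.

```python
import html
import re

# Simplified comment pattern (assumption): "<!--", anything, then "-->".
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def title_as_displayed(raw: str) -> str:
    """Current behaviour as described: decode entities, keep the rest as-is."""
    return html.unescape(raw)


def title_with_comments_removed(raw: str) -> str:
    """Requested behaviour: drop comments, then decode entities and trim."""
    return html.unescape(COMMENT_RE.sub("", raw)).strip()


raw_title = "<!--This title has a comment in it.--> This is the title"
print(title_as_displayed(raw_title))
print(title_with_comments_removed(raw_title))
print(title_as_displayed("R&eacute;n&eacute;"))
print(title_as_displayed("Why to avoid using <FONT> tags"))
```

The first function leaves the reporter's example title untouched, comment included; the second reduces it to "This is the title". Neither does anything with <FONT> or other element markup, which is the case the questions further down are about.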

Aside from that, given a choice between an implementation that discourages
invalid HTML by putting it on display, and an implementation that honours
invalid HTML and thus does nothing to discourage it, what exactly is gained
by putting in additional effort to make the latter happen?

As far as I can see, the validity of this bug report depends on which of
those two implementations is better. Is the latter truly *required* by
the DTD for <title> regardless of what the spec says, or is an implementor
free to choose an implementation truer to the intent of the whole of the
spec? This is the main question.

Finally, if #PCDATA for <title> were parsed exactly like any other #PCDATA,
what would be done with the other markup that might also be invalidly present
after it was parsed and made part of the DOM? This awkward question is
not about parsing but about what happens after parsing, and if the answer is
"nothing useful" - why parse markup (including comments) at all?

The obvious rejoinder I can see to that is "No, treat <title> content as
#PCDATA except for markup that isn't comments or character entities"
- but would that not then be just as much a violation of the DTD?

The only other rejoinder I can see to that is "Because the DTD says to"
- so, is that the absolute last word, and what then is to be done with the
parsed <I> and <B> and <FONT> tags (and these, deprecated as they are,
are about the only markup that those flouting the spec would want to put
in) that were never valid in the first place?
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → WORKSFORME
I'm closing this bug. As far as I can tell, we're behaving identically to both
Navigator and IE.
Status: RESOLVED → VERIFIED
ok