Closed Bug 44458 Opened 20 years ago Closed 18 years ago

HTML entities are not recognized in XHTML documents


(Core :: XML, defect, P2)






(Reporter: rbs, Assigned: hjtoi-bugzilla)



(4 keywords, Whiteboard: [Hixie-P4])


(4 files, 3 obsolete files)

HTML includes a number of default entities (e.g., α, β, etc.).
Since the XHTML specification doesn't say that these entities have been
removed, it is expected that they should be available in XHTML as well.

Load the attachment (to follow). It contains the following document:

<?xml version="1.0"?>
 "-//W3C//DTD XHTML 1.0 Strict//EN"
<html xmlns="">  
    <title>XHTML Document with entities</title>  
      alpha: &alpha; <br/>
      beta:  &beta;

The browser should display
alpha: [greek alpha here]
beta:  [greek beta here]

The browser is displaying
alpha: [nothing here]
beta:  [nothing here]

My tree is one week old, but looking at the check-ins and their
associated comments, I don't see a change that could have fixed this.
I am yet again having trouble attaching the test case.
If someone can do it please go ahead.
This bug is a severe problem when authoring non-English XHTML documents.
e.g., the French-accented letters (&eacute; &egrave; &ucirc; &agrave; etc.)
are lost.
Attached file testcase
Yes, this bug is happening because we don't load up the entity sets defined in 
the XHTML DTD referenced from the XHTML document.  We need to cache the 
three XHTML DTDs locally, look at the PUBLIC ID in the DOCTYPE declaration, map 
the ID into a XHTML DTD, load that DTD, and pass it to expat so that the entity 
declarations in the DTD become available to the XHTML document.
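
The flow described above (look at the PUBLIC ID, map it to a locally cached DTD, hand that DTD to expat) can be sketched with Python's expat binding. This is only an illustration under stated assumptions, not the Mozilla implementation: `CATALOG`, `LOCAL_ENTITY_DTD` and `parse_text` are hypothetical names, and the cached DTD holds just two entities instead of the full HTML 4 list.

```python
import xml.parsers.expat as expat

# Hypothetical stand-in for a locally cached entity DTD; a real one would
# carry the full HTML 4 entity list rather than two entries.
LOCAL_ENTITY_DTD = '<!ENTITY alpha "&#945;">\n<!ENTITY beta "&#946;">\n'

# Hypothetical catalog: PUBLIC ID from the DOCTYPE -> cached DTD text.
CATALOG = {"-//W3C//DTD XHTML 1.0 Strict//EN": LOCAL_ENTITY_DTD}


def parse_text(document: str) -> str:
    """Return the character data of `document`, resolving the external
    DTD subset through CATALOG instead of fetching the system ID."""
    chunks = []
    parser = expat.ParserCreate()
    # Without this, expat never reports the external DTD subset.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def external_entity_ref(context, base, system_id, public_id):
        dtd_text = CATALOG.get(public_id)
        if dtd_text is not None:
            # Feed the cached DTD to a child parser so its entity
            # declarations become visible to the main document.
            child = parser.ExternalEntityParserCreate(context)
            child.Parse(dtd_text, True)
        return 1  # non-zero tells expat to keep going

    parser.ExternalEntityRefHandler = external_entity_ref
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)


XHTML_DOC = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "htmlEntities.dtd">\n'
    "<html><body>alpha: &alpha; beta: &beta;</body></html>"
)
```

The system ID is never opened here; only the PUBLIC ID decides which cached DTD is used, which is the behavior proposed in the comment above.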

Marking nsbeta3 and adding Heikki to the cc list.
Keywords: nsbeta3
Target Milestone: --- → M18
If the attachment is saved in the "bin/dtd" directory as "htmlEntities.dtd",
then no missing entities arise when passing it as the SYSTEM ID:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "htmlEntities.dtd">

But as nisheeth pointed out, the right way to go about this is to load up
the list based on the PUBLIC ID. Fortunately, we only need one list (that of
HTML4) because XHTML relies on HTML4.

(The attached htmlEntities.dtd file is a replica in DTD form of 
Perf: would it be possible to cache/share the list? Otherwise, it is discarded
only to be re-loaded when navigating from document to document, and there could
even be several copies at the same time in different windows/frames. (We have
the same problem with the over 2000 MathML entities, BTW.) On the other hand, if
they are shared and users' scripts alter them (i.e., if the spec allows entities
to be writable?), that could be another can of worms that doesn't warrant such
trouble at this stage.
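
One way to picture the sharing idea: build the entity table once per process and hand every document a read-only view, so there is a single copy and nothing can alter it. A minimal sketch, assuming the HTML 4 list shipped in Python's standard library is an acceptable stand-in; `shared_entity_table` is a hypothetical name.

```python
from functools import lru_cache
from html.entities import name2codepoint
from types import MappingProxyType


@lru_cache(maxsize=None)
def shared_entity_table():
    """Build the HTML 4 entity table once and return a read-only view."""
    table = {name: chr(code) for name, code in name2codepoint.items()}
    return MappingProxyType(table)
```

Every caller gets the same object back, and an attempt to assign into it raises `TypeError`, which sidesteps the writable-entities worry at the cost of making the table immutable.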
1) If we don't fix this, there's no way to use HTML entities in XHTML documents, 
essentially limiting the text content of XHTML docs to straight Unicode text 
only, so we would like to fix this if possible.
2) However, there's no commitment to any XHTML support at all in the first 
release, so we're not required to fix this. We could choose just to not support 
XHTML at all in 6.0 and delay support to 6.01, for example.
3) Definitely don't implement a fix that will blow out our memory footprint on a 
per-frame or per-window basis. If we don't have time to do a memory-efficient 
fix, Future this.
4) How long do you think it will take to fix this?

I'm going to ask for a new xhtml keyword so we can track issues related to xhtml 
This bug has been marked nsbeta3- because the original netscape engineer working 
on this is over-burdened. If you feel this is an error, that you or another
known resource will be working on this bug, or if it blocks your work in some way 
-- please attach your concern to the bug for reconsideration, but do not clear 
the nsbeta3- nomination.
Whiteboard: [nsbeta3-]
Target Milestone: M18 → Future
*** Bug 68202 has been marked as a duplicate of this bug. ***
Nominating for beta1. HTML entities need to be supported in XHTML since they are 
commonly used by popular sites.
Keywords: nsbeta1
Updated report
Keywords: nsbeta3
Whiteboard: [nsbeta3-]
Setting target milestone to 0.9.1...
Target Milestone: Future → mozilla0.9.1
Nisheeth: This relates strongly to bug 74172. We want to be able to include
XHTML1 entities in XUL files in the chrome. The simple solution I had is to put
the entity files in the chrome (xpfe/global/resources/content) but maybe there's
a better place to put them...?
OS: Windows 98 → All
Hardware: PC → All
Blocks: 15391
dr, the real challenge is to avoid unnecessarily cluttering the memory with the
same list of entities. For example, if not done properly, all the numerous XUL
fragments will be carrying their own copies of the same thing.
Compatibility consideration:
The test case has a relative URL to the DTD. Some XML user agents (eg. IE 5 for
Mac*) might attempt to fetch the DTD and display an alert if the DTD isn't
found. Since the examples in the XHTML spec use relative URLs (a Bad Thing, IMO)
it is likely that authors will include those same relative URLs in their
documents without actually providing the DTD in the corresponding location.

If Mozilla doesn't check the existence of the DTD, a compatibility problem will
be introduced. (Author uses Mozilla but the site visitor has a browser that
attempts to fetch the DTD.)

However, generating unnecessary network traffic would be bad, too. One way to
solve this would be *not fetching* the DTD if the system identifier is an
absolute URL to the DTD hosted at the W3C and checking for existence on the
document's server in the case of a relative URL. (OK, I know this isn't very
likely to be implemented, but this really is a potential compatibility problem,
because validating parsers will have problems if they can't find the DTD.)

* IE 5 for Mac OS Classic actually tries to fetch the DTD and signals an error,
if the DTD is not found. However, the XHTML features of that browser are
otherwise too broken to be of any use.
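
The policy suggested above can be written down as a small decision function. This is a sketch of the proposal only, not anything Mozilla does: `dtd_fetch_action` is a hypothetical name, treating `www.w3.org` as "hosted at the W3C" is an assumption, and the comment says nothing about other absolute URLs, so they are reported as "unspecified".

```python
from urllib.parse import urlparse


def dtd_fetch_action(system_id: str) -> str:
    """Classify a DOCTYPE system identifier under the proposed policy."""
    parsed = urlparse(system_id)
    # Assumption: the W3C-hosted DTDs live on www.w3.org.
    if parsed.scheme in ("http", "https") and parsed.hostname == "www.w3.org":
        return "skip"            # absolute W3C URL: do not fetch
    if not parsed.scheme and not parsed.netloc:
        return "check-exists"    # relative URL: check the document's server
    return "unspecified"         # case not covered by the proposal
```

For example, `dtd_fetch_action("DTD/xhtml1-strict.dtd")` falls into the relative-URL branch, which is the case the XHTML spec examples encourage.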
moving to TM of 0.9.2 per PDT triage (you can check it into 0.9.1 until Friday,
18/May/01 or into 0.9.2 after the tree opens)
Target Milestone: mozilla0.9.1 → mozilla0.9.2
Priority: P3 → P2
Moving P2 and P3 bugs over to 0.9.3...
Target Milestone: mozilla0.9.2 → mozilla0.9.3
Seems like this bug would require some deeper thoughts, along the lines of
what was done for the new image lib. A DTD manager could keep remote DTDs in 
a disk DTD cache in necko, while selected local DTDs could stay in the memory
DTD cache. The manager could then be a memory pressure listener and such. Looks like
the same issues that were addressed in the new image lib may arise here.
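
A toy version of the DTD manager idea, to make the memory side concrete. It is a sketch under assumptions: the class and method names are hypothetical, the disk cache is not modelled at all, and "memory pressure" is just a method someone else has to call.

```python
class DTDCache:
    """Toy in-memory DTD cache with pinned and evictable entries."""

    def __init__(self):
        self._pinned = {}     # selected local DTDs: always kept in memory
        self._evictable = {}  # remote DTDs: assumed to also exist on disk

    def put(self, public_id, dtd_text, pinned=False):
        target = self._pinned if pinned else self._evictable
        target[public_id] = dtd_text

    def get(self, public_id):
        if public_id in self._pinned:
            return self._pinned[public_id]
        return self._evictable.get(public_id)

    def on_memory_pressure(self):
        """Drop everything that could be re-read from the disk cache."""
        self._evictable.clear()
```

After `on_memory_pressure()` only the pinned local DTDs remain, which is the split between the memory cache and the disk cache described above.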
Missed 0.9.3.
Target Milestone: mozilla0.9.3 → mozilla0.9.4
Bulk re-assign of my 0.9.4 bugs to Heikki.  I will not have the cycles to work
on these bugs while Clayton is on sabbatical for the next six weeks.
Assignee: nisheeth → heikki
Target Milestone: mozilla0.9.4 → mozilla0.9.5
Using Mozilla 0.9.4 on RedHat Linux 7.1, the testcase is displayed correctly. Is
this bug fixed?
Target Milestone: mozilla0.9.5 → mozilla0.9.6
Aleksey Nogin, you must have tried the text/html testcase which works, not the
xhtml one, which still fails. Since is attempting to be xhtml compliant,
doesn't this mean top100? Adding another bug as being blocked by this too.
Blocks: 95770
This doesn't really block bug 95770. (I wouldn't consider this as a blocker for
bug 15391, either.) Numeric character references and UTF-8 work.

MSN isn't serving real XHTML using a real XML content type. (It wouldn't work in
IE if they did.) This bug has nothing to do with MSN.

Does the XML spec really require non-validating parsers to support external
character entities?
What do you mean by "external"? External files that are linked in main DTD? Well
XHTML 1.1 is entirely built on that. Nothing is defined in its main DTD and it
has multiple levels of referencing. So I don't see how this could be optional
for XHTML support.
Keywords: dataloss, intl
Target Milestone: mozilla0.9.6 → mozilla0.9.7
*** Bug 107736 has been marked as a duplicate of this bug. ***
By "external" I mean the XML meaning: not in the same storage object as the main
document. So a separate DTD file is external to an XML document.
Attachment #10952 - Attachment mime type: text/html → application/xhtml+xml
Attachment #49600 - Attachment is obsolete: true
Keywords: testcase
*** Bug 108079 has been marked as a duplicate of this bug. ***
A preliminary workaround that works with the current Mozilla browser could be:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                      "" [
<!ENTITY copy   "&#169;"> ]> 
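
The workaround relies on the internal DTD subset, which a non-validating parser reads even when it ignores the external DTD. A quick check with Python's expat binding (same parser family, but this is only a sketch; `text_of` and the sample body are made up for the illustration):

```python
import xml.parsers.expat as expat


def text_of(document: str) -> str:
    """Return the character data of `document` using default expat settings,
    i.e. without loading any external DTD."""
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)


WORKAROUND_DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    '                      "" [\n'
    '<!ENTITY copy   "&#169;"> ]>\n'
    "<html><body>&copy; 2001</body></html>"
)
```

Here `text_of(WORKAROUND_DOC)` yields the copyright sign followed by " 2001", because `copy` is declared in the document itself. The drawback is that every entity the document uses has to be declared this way.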

Target Milestone: mozilla0.9.7 → mozilla0.9.9
Whiteboard: [Hixie-P4]
For Mozilla 1.0 I think it will be good enough if we just do the same thing our
MathML implementation is doing, basically adding the XHTML public ids to the
catalog table and creating xhtml10.dtd.

rbs, could you break the XHTML-only entities away from mathml.dtd and have
mathml.dtd include xhtml10.dtd entities as an external entity file?
It probably should be xhtml11.dtd, although probably not much has changed since
>rbs, could you break the XHTML-only entities away from mathml.dtd and have
>mathml.dtd include xhtml10.dtd entities as an external entity file?

Is this level of granularity worth the trouble at this stage? Since there are 
over 2000 MathML entities, removing a couple of hundred wouldn't make much 
difference (except... the tediousness of trying to figure out the duplicates 
with XHTML and weeding them out :-)

It would be speedier to just save attachment 11622 (which already has 
everything, I think) as xhtml11.dtd and then add two entries in the catalog 
table for: "-//W3C//DTD XHTML 1.0 Strict//EN" and "-//W3C//DTD XHTML 1.1//EN".
With that, we will be done, and gzip will do the rest.
or use exact W3C entity DTD definitions for XHTML.
I've provided the URLs for them above in comment #24.
The interest is about efficiency. Only the entity definitions are of interest.
The comments are going to slow the parsing unnecessarily. When trimmed down, the
result is the lightweight version in attachment 11622 (which could be compiled,
a la fastload, if there was a way to do that).
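
Trimming the comments out of an entity DTD is mechanical. A minimal sketch, assuming the file contains only entity declarations and comments, and that no entity value contains comment-like text (a regular expression cannot tell the difference); `strip_dtd_comments` is a hypothetical helper, not part of any build script.

```python
import re

# Non-greedy match so each comment is removed separately.
_DTD_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_dtd_comments(dtd_text: str) -> str:
    """Remove comments and the blank lines they leave behind."""
    without_comments = _DTD_COMMENT.sub("", dtd_text)
    lines = (line.rstrip() for line in without_comments.splitlines())
    return "\n".join(line for line in lines if line) + "\n"
```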
I can easily strip out the comments from the versions W3C supplies, and combine
them into one file to speed loading. The reason I did not do that yet was
because I was wondering about a comment in the mathml.dtd:

This is a *customized* list for Mozilla: characters originally specified 
as combined pairs and plane 1 characters have been remapped to internal
code points within the Unicode's Private Use Area (PUA).

If this does not apply (or I don't need to do anything special) to the XHTML
1.0/1.1 entities, I am all set and can finish this soon. rbs, could you explain
the above comment to us (me) non-Unicode experts ;) ?

Or does attachment 11622 list every entity the W3C lists include so that I
wouldn't need to do anything, basically? I am slightly concerned about just
accepting a list compiled from our HTML code...

The remapping is about plane-1 MathML characters (5-digit code points) which
don't work on any application yet -- although some work is in progress to
eventually support them in Mozilla (bug 118000). Since HTML/XHTML code points
are only 4-digit, the comments about the remapping don't apply there.
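
The 4-digit versus 5-digit distinction is the difference between the Basic Multilingual Plane and the supplementary planes. A small illustration (the helper names are made up; U+1D504, MATHEMATICAL FRAKTUR CAPITAL A, is used as a sample plane-1 character):

```python
def is_beyond_bmp(ch: str) -> bool:
    """True for code points above U+FFFF, i.e. plane 1 and higher."""
    return ord(ch) > 0xFFFF


def utf16_code_units(ch: str) -> int:
    """Number of 16-bit units needed to store `ch` in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2


ALPHA = "\u03b1"            # HTML entity alpha: inside the BMP
FRAKTUR_A = "\U0001D504"    # plane-1 mathematical letter
```

`ALPHA` fits in one 16-bit unit while `FRAKTUR_A` needs a surrogate pair, which is why the plane-1 characters needed special treatment and the HTML/XHTML entities do not.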

The list from which attachment 11622 was compiled is meant to be identical to
the expected list (otherwise something would be out of sync and need fixing). It
seems to be complete to me. (The other bit is to add 'dtd/*' in the packages as
I am doing over at bug 109826.)
Attached patch Proposed fix 1 (obsolete)
Using attachment 11622 as the xhtml11.dtd entity list in this patch. I
would assume we need to add mappings for 4 new public IDs: 3 for XHTML 1.0
(Strict, Transitional and Frameset) and 1 for XHTML 1.1. I also made packaging
changes but I am not sure how to test them yet (MOZILLA_OFFICIAL?). This patch
is missing a (one-liner?) change to make the Mac build system copy the DTD. I
think I have a fix but I want to test it first. Haven't tested on Linux yet.
Looking good. You might want to get rid of this one since it is a built-in
entity in the XML spec & parser, etc (right?)
+<!-- Navigator entity extensions; apos is from XML -->
+<!ENTITY apos "&#39;">

Also, the ordering of the catalog table could be revamped to put the most
frequent stuff first.
Doh, forgot the apos entity; you are right! The most frequently used order would
probably just translate to putting the XHTML ids first, or do you have other

Also this seems to work on Linux, now compiling on Mac.
Possible re-ordering from:
+ {"-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN", "mathml.dtd"  },
+ {"-//W3C//DTD SVG 20001102//EN",              "svg.dtd"     },
+ {"-//W3C//DTD XHTML 1.0 Strict//EN",          "xhtml11.dtd" },
+ {"-//W3C//DTD XHTML 1.0 Transitional//EN",    "xhtml11.dtd" },
+ {"-//W3C//DTD XHTML 1.0 Frameset//EN",        "xhtml11.dtd" },
+ {"-//W3C//DTD XHTML 1.1//EN",                 "xhtml11.dtd" },
to:
+ {"-//W3C//DTD XHTML 1.0 Transitional//EN",    "xhtml11.dtd" },
+ {"-//W3C//DTD XHTML 1.1//EN",                 "xhtml11.dtd" },
+ {"-//W3C//DTD XHTML 1.0 Strict//EN",          "xhtml11.dtd" },
+ {"-//W3C//DTD XHTML 1.0 Frameset//EN",        "xhtml11.dtd" },
+ {"-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN", "mathml.dtd"  },
+ {"-//W3C//DTD SVG 20001102//EN",              "svg.dtd"     },
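
To see what the reordering buys, assume the table is searched linearly from the top (an assumption; the patch itself is not shown here). The sketch below copies the two orderings from this comment and counts string comparisons per lookup; `lookup` is a hypothetical helper.

```python
ORIGINAL_ORDER = [
    ("-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN", "mathml.dtd"),
    ("-//W3C//DTD SVG 20001102//EN", "svg.dtd"),
    ("-//W3C//DTD XHTML 1.0 Strict//EN", "xhtml11.dtd"),
    ("-//W3C//DTD XHTML 1.0 Transitional//EN", "xhtml11.dtd"),
    ("-//W3C//DTD XHTML 1.0 Frameset//EN", "xhtml11.dtd"),
    ("-//W3C//DTD XHTML 1.1//EN", "xhtml11.dtd"),
]

REORDERED = [
    ("-//W3C//DTD XHTML 1.0 Transitional//EN", "xhtml11.dtd"),
    ("-//W3C//DTD XHTML 1.1//EN", "xhtml11.dtd"),
    ("-//W3C//DTD XHTML 1.0 Strict//EN", "xhtml11.dtd"),
    ("-//W3C//DTD XHTML 1.0 Frameset//EN", "xhtml11.dtd"),
    ("-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN", "mathml.dtd"),
    ("-//W3C//DTD SVG 20001102//EN", "svg.dtd"),
]


def lookup(catalog, public_id):
    """Linear search; returns (dtd_file, comparisons_made)."""
    for comparisons, (known_id, dtd_file) in enumerate(catalog, start=1):
        if known_id == public_id:
            return dtd_file, comparisons
    return None, len(catalog)
```

A Transitional document costs four comparisons in the first ordering and one in the second. The result is the same either way, so with six entries this is a tidiness question more than a performance one.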
XHTML Basic is missing?
"-//W3C//DTD XHTML Basic 1.0//EN" "xhtml-basic10.dtd"

I was also wondering (probably a stupid question) whether the current syntax includes
declarations with a URL such as 

"-//W3C//DTD XHTML Basic 1.0//EN"

Is there a difference in how Mozilla treats that?

I would think that people that author XHTML from scratch would use Strict... but
on the other hand I prefer using Transitional because I am lazy, and that would
seem the best choice if you want to do a minimal job converting your old HTML to
XHTML. I am fine with your suggestion as well.

And it seems we are also missing Basic. 

And we are also missing Modularization of XHTML public IDs. They are a bit more
problematic. Basically what is required is that they conform to the Formal
Public Identifier specification and contain 'XHTML' in the description section.
For example, this would be a correct public ID: "-//Heikki Toivonen//DTD XHTML
Programming Extensions//EN". But currently some validating XML parsers seem to
have difficulty handling that DTD, and apparently there are no XML editors that
can produce documents in that specification. Based on that, I would be happy to
leave this problem until XML Catalogs are implemented.
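
The rule as stated (a Formal Public Identifier whose description contains 'XHTML') can be approximated with a string check. This is a deliberately simplified sketch: it only handles the four-part "-//Owner//Class Description//Language" shape, it assumes the text class must be DTD, and `looks_like_xhtml_family_fpi` is a hypothetical name.

```python
def looks_like_xhtml_family_fpi(public_id: str) -> bool:
    """Rough test: four-part FPI, class DTD, 'XHTML' in the description."""
    parts = public_id.split("//")
    if len(parts) != 4:
        return False
    prefix, owner, text, language = parts
    if prefix not in ("-", "+") or not owner or not language:
        return False
    text_class, _, description = text.partition(" ")
    return text_class == "DTD" and "XHTML" in description
```

With this check the example ID above is accepted without being listed anywhere, while the SVG public ID is not. That also shows why a fixed table cannot cover Modularization of XHTML: the set of matching IDs is open-ended.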
Attachment #66202 - Attachment is obsolete: true
Attachment #66334 - Attachment is obsolete: true
Comment on attachment 66355
New combined patch

Attachment #66355 - Flags: review+
Comment on attachment 66359
Embedding packager changes, want these?

Attachment #66359 - Flags: review+
Comment on attachment 66359
Embedding packager changes, want these?

Comment on attachment 66355
New combined patch

Attachment #66355 - Flags: superreview+
Comment on attachment 66359
Embedding packager changes, want these?

Attachment #66359 - Flags: superreview+
Closed: 18 years ago
Resolution: --- → FIXED
*** Bug 121808 has been marked as a duplicate of this bug. ***
verified on Win2K
works with:

XHTML 1.0 Strict
XHTML 1.0 Transitional
XHTML 1.0 Frameset
XHTML Basic 1.0

as both application/xhtml+xml and text/xml
+    MakeAlias(":mozilla:content:xml:content:src:xhtml11.dtd",                          

This makes a 'dtd' folder next to the application. On Mac OS, this adds to the 
clutter that users see when they view the Mozilla/Netscape folder in the Finder, 
which we really try to avoid.

I filed bug 122710 on this issue.