Closed Bug 44458 Opened 24 years ago Closed 23 years ago

HTML entities are not recognized in XHTML documents

Status:

VERIFIED FIXED

Milestone:

mozilla0.9.9

People

(Reporter: rbs, Assigned: hjtoi-bugzilla)

References

Details

(4 keywords, Whiteboard: [Hixie-P4])

Attachments

(4 files, 3 obsolete files)

testcase; 24 years ago; rbs; 462 bytes, application/xhtml+xml
List of HTML entities in DTD form; 24 years ago; rbs; 7.63 KB, text/plain
XHTML testcase; 23 years ago; Heikki Toivonen (remove -bugzilla when emailing directly); 462 bytes, application/xhtml+xml
Proposed fix 1; 23 years ago; Heikki Toivonen (remove -bugzilla when emailing directly); 14.08 KB, patch
Mac build changes; 23 years ago; Heikki Toivonen (remove -bugzilla when emailing directly); 783 bytes, patch
New combined patch; 23 years ago; Heikki Toivonen (remove -bugzilla when emailing directly); 14.98 KB, patch; rbs: review+, jst: superreview+
Embedding packager changes, want these?; 23 years ago; Heikki Toivonen (remove -bugzilla when emailing directly); 1.12 KB, patch; rbs: review+, jst: superreview+

rbs

Reporter

Description

•

24 years ago

HTML includes a number of default entities (e.g., &alpha;, &beta;, etc.).
Since the XHTML specification doesn't say that these entities have been
removed, it is expected that they should be available in XHTML as well.

STEPS TO REPRODUCE
==================
Load the attachment (to follow). It contains the following document:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC
 "-//W3C//DTD XHTML 1.0 Strict//EN"
 "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">  
  <head>  
    <title>XHTML Document with entities</title>  
  </head>  
  <body>  
    <p>
      alpha: &alpha; <br/>
      beta:  &beta;
    </p>  
  </body>  
</html>

EXPECTED RESULTS
================
The browser should display
alpha: [greek alpha here]
beta:  [greek beta here]

ACTUAL RESULTS
==============
The browser is displaying
alpha: [nothing here]
beta:  [nothing here]

ADDITIONAL DETAILS
==================
My tree is one week old, but looking at the check-ins and their
associated comments, I don't see a change that could have fixed this.

rbs

Reporter

Comment 1

•

24 years ago

I am yet again having trouble attaching the test case.
If someone can do it please go ahead.

rbs

Reporter

Comment 2

•

24 years ago

This bug is a severe problem when authoring non-English XHTML documents.
e.g., the French-accented letters (&eacute; &egrave; &ucirc; &agrave; etc.)
are lost.

rbs

Reporter

Comment 3

•

24 years ago

Attached file testcase — Details

Nisheeth Ranjan

Comment 4

•

24 years ago

Yes, this bug is happening because we don't load up the entity sets defined in 
the XHTML DTD referenced from the XHTML document.  We need to cache the 
three XHTML DTDs locally, look at the PUBLIC ID in the DOCTYPE declaration, map 
the ID to an XHTML DTD, load that DTD, and pass it to expat so that the entity 
declarations in the DTD become available to the XHTML document.
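
A minimal sketch of the lookup described above, written against Python's `xml.parsers.expat` binding rather than Mozilla's own C++ parser glue. The `LOCAL_DTDS` table, its two-entity DTD text, and the function name are assumptions made up for illustration; a real cache would hold the full XHTML entity sets.

```python
import xml.parsers.expat as expat

# Hypothetical local cache: PUBLIC ID -> DTD text with entity declarations.
# Only two entities are listed here to keep the sketch short.
LOCAL_DTDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN":
        '<!ENTITY alpha "&#945;">\n<!ENTITY beta "&#946;">\n',
}


def parse_with_local_dtd(document):
    """Return the text content of `document`, resolving entities declared in
    a locally cached DTD chosen by the DOCTYPE's PUBLIC ID."""
    parser = expat.ParserCreate()
    # Without this, expat never asks for the external DTD subset.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    chunks = []
    parser.CharacterDataHandler = chunks.append

    def external_entity_ref(context, base, system_id, public_id):
        dtd_text = LOCAL_DTDS.get(public_id)
        if dtd_text is None:
            return 1  # unknown PUBLIC ID: carry on without the external subset
        # Feed the cached DTD to a sub-parser; the SYSTEM ID is never fetched.
        sub_parser = parser.ExternalEntityParserCreate(context)
        sub_parser.Parse(dtd_text, True)
        return 1

    parser.ExternalEntityRefHandler = external_entity_ref
    parser.Parse(document, True)
    return "".join(chunks)
```

Called on the testcase document, this should yield the Greek letters in place of `&alpha;` and `&beta;`, without touching the relative `DTD/xhtml1-strict.dtd` URL.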

Marking nsbeta3 and adding Heikki to the cc list.

Status: NEW → ASSIGNED

Keywords: nsbeta3

Target Milestone: --- → M18

rbs

Reporter

Comment 5

•

24 years ago

Attached file List of HTML entities in DTD form — Details

rbs

Reporter

Comment 6

•

24 years ago

If the attachment is saved in the "bin/dtd" directory as "htmlEntities.dtd",
then no missing entities arise when passing it as the SYSTEM ID:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "htmlEntities.dtd">

But as nisheeth pointed out, the right way to go about this is to load up
the list based on the PUBLIC ID. Fortunately, we only need one list (that of
HTML4) because XHTML relies on HTML4.

(The attached htmlEntities.dtd file is a replica in DTD form of 
mozilla/htmlparser/src/nsHTMLEntityList.h)

rbs

Reporter

Comment 7

•

24 years ago

Perf: would it be possible to cache/share the list? Otherwise, it is discarded
only to be re-loaded when navigating from document to document, and there could
even be several copies at the same time in different windows/frames. (We have
the same problem with the over 2000 MathML entities, BTW.) On the other hand, if
they are shared and users' scripts alter them (i.e., if the spec allows entities
to be writable?), that could open another can of worms that doesn't warrant the
trouble at this stage.
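
A rough sketch of the sharing idea, assuming the entity list is parsed once and handed out as a read-only mapping so that no document can alter the shared copy. The `DTD_SOURCES` dictionary, the regular expression, and the function name are hypothetical stand-ins, not anything from the Mozilla tree.

```python
import re
from functools import lru_cache
from types import MappingProxyType

# Hypothetical in-memory stand-in for the DTD files on disk.
DTD_SOURCES = {
    "xhtml-entities": '<!ENTITY alpha "&#945;">\n<!ENTITY beta "&#946;">\n',
}

# Matches only the simple form <!ENTITY name "&#NNN;"> used in this sketch.
_ENTITY_DECL = re.compile(r'<!ENTITY\s+(\w+)\s+"&#(\d+);">')


@lru_cache(maxsize=None)
def shared_entity_table(name):
    """Parse the named entity list once; later callers get the same object."""
    table = {
        match.group(1): chr(int(match.group(2)))
        for match in _ENTITY_DECL.finditer(DTD_SOURCES[name])
    }
    return MappingProxyType(table)  # read-only view, safe to share
```

Every window or frame that asks for the same list gets the one cached object, and an attempt to assign into it raises `TypeError`.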

ekrock's old account (dead)

Updated

•

24 years ago

Keywords: correctness

ekrock's old account (dead)

Comment 8

•

24 years ago

Considerations:
1) If we don't fix this, there's no way to use HTML entities in XHTML documents, 
essentially limiting the text content of XHTML docs to straight Unicode text 
only, so we would like to fix this if possible.
2) However, there's no commitment to any XHTML support at all in the first 
release, so we're not required to fix this. We could choose just to not support 
XHTML at all in 6.0 and delay support to 6.01, for example.
3) Definitely don't implement a fix that will blow out our memory footprint on a 
per-frame or per-window basis. If we don't have time to do a memory-efficient 
fix, Future this.
4) How long do you think it will take to fix this?

I'm going to ask for a new xhtml keyword so we can track issues related to xhtml 
support.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Updated

•

24 years ago

Keywords: xhtml

Nisheeth Ranjan

Comment 9

•

24 years ago

This bug has been marked nsbeta3- because the original netscape engineer working 
on this is over-burdened. If you feel this is an error, that you or another
known resource will be working on this bug, or if it blocks your work in some way 
-- please attach your concern to the bug for reconsideration, but do not clear 
the nsbeta3- nomination.

Whiteboard: [nsbeta3-]

Target Milestone: M18 → Future

rbs

Reporter

Comment 10

•

24 years ago

*** Bug 68202 has been marked as a duplicate of this bug. ***

Chris Petersen

Comment 11

•

24 years ago

Nominating for beta1. HTML entities need to be supported in XHTML since they are 
commonly used by popular sites.

Keywords: nsbeta1

Chris Petersen

Comment 12

•

24 years ago

Updated report

Keywords: nsbeta3

Whiteboard: [nsbeta3-]

Nisheeth Ranjan

Comment 13

•

24 years ago

Setting target milestone to 0.9.1...

Target Milestone: Future → mozilla0.9.1

Dan Rosen

Comment 14

•

23 years ago

Nisheeth: This relates strongly to bug 74172. We want to be able to include
XHTML1 entities in XUL files in the chrome. The simple solution I had is to put
the entity files in the chrome (xpfe/global/resources/content) but maybe there's
a better place to put them...?

OS: Windows 98 → All

Hardware: PC → All

rbs

Reporter

Updated

•

23 years ago

Blocks: 15391

rbs

Reporter

Comment 15

•

23 years ago

dr, the real challenge is to avoid unnecessarily cluttering memory with the
same list of entities. For example, if not done properly, all the numerous XUL
fragments will be carrying their own copies of the same thing.

Heikki Toivonen (remove -bugzilla when emailing directly)

Assignee

Updated

•

23 years ago

Keywords: nsbeta1 → nsbeta1+

Henri Sivonen (:hsivonen)

Comment 16

•

23 years ago

Compatibility consideration:
The test case has a relative URL to the DTD. Some XML user agents (eg. IE 5 for
Mac*) might attempt to fetch the DTD and display an alert if the DTD isn't
found. Since the examples in the XHTML spec use relative URLs (a Bad Thing, IMO)
it is likely that authors will include those same relative URLs in their
documents without actually providing the DTD in the corresponding location.

If Mozilla doesn't check the existence of the DTD, a compatibility problem will
be introduced. (Author uses Mozilla but the site visitor has a browser that
attempts to fetch the DTD.)

However, generating unnecessary network traffic would be bad, too. One way to
solve this would be *not fetching* the DTD if the system identifier is an
absolute URL to the DTD hosted at the W3C and checking for existence on the
document's server in the case of a relative URL. (OK, I know this isn't very
likely to be implemented, but this really is a potential compatibility problem,
because validating parsers will have problems if they can't find the DTD.)

* IE 5 for Mac OS Classic actually tries to fetch the DTD and signals an error,
if the DTD is not found. However, the XHTML features of that browser are
otherwise too broken to be of any use.

Marek Z. Jeziorek

Comment 17

•

23 years ago

moving to TM of 0.9.2 per PDT triage (you can check it into 0.9.1 until Friday,
18/May/01 or into 0.9.2 after the tree opens)

Target Milestone: mozilla0.9.1 → mozilla0.9.2

Nisheeth Ranjan

Updated

•

23 years ago

Priority: P3 → P2

Nisheeth Ranjan

Comment 18

•

23 years ago

Moving P2 and P3 bugs over to 0.9.3...

Target Milestone: mozilla0.9.2 → mozilla0.9.3

rbs

Reporter

Comment 19

•

23 years ago

Seems like this bug would require some deeper thoughts, along the lines of
what was done for the new image lib. A DTD manager could keep remote DTDs in 
a disk DTD cache in necko, while selected local DTDs could stay in the memory
DTD cache. The manager could then be a memory pressure listener and such. Looks like
the same issues that were addressed in the new image lib may arise here.

Blake Ross

Comment 20

•

23 years ago

Missed 0.9.3.

Target Milestone: mozilla0.9.3 → mozilla0.9.4

Nisheeth Ranjan

Comment 21

•

23 years ago

Bulk re-assign of my 0.9.4 bugs to Heikki.  I will not have the cycles to work
on these bugs while Clayton is on sabbatical for the next six weeks.

Assignee: nisheeth → heikki

Status: ASSIGNED → NEW

Heikki Toivonen (remove -bugzilla when emailing directly)

Assignee

Updated

•

23 years ago

Target Milestone: mozilla0.9.4 → mozilla0.9.5

Aleksey Nogin

Comment 22

•

23 years ago

Using Mozilla 0.9.4 on RedHat Linux 7.1, the testcase is displayed correctly. Is
this bug fixed?

Heikki Toivonen (remove -bugzilla when emailing directly)

Assignee

Comment 23

•

23 years ago

Attached file XHTML testcase (obsolete) — Details

Heikki Toivonen (remove -bugzilla when emailing directly)

Assignee

Updated

•

23 years ago

Target Milestone: mozilla0.9.5 → mozilla0.9.6

Alexey Chernyak

Comment 24

•

23 years ago

All entities for XHTML 1.0 are defined here:
http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent

For XHTML Basic and XHTML 1.1 here:
http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-lat1.ent
http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-special.ent
http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-symbol.ent

James Green

Comment 25

•

23 years ago

Aleksey Nogin, you must have tried the text/html testcase which works, not the
xhtml one, which still fails. Since msn.com is attempting to be xhtml compliant,
doesn't this mean top100? Adding another bug as being blocked by this too.

Blocks: 95770

Henri Sivonen (:hsivonen)

Comment 26

•

23 years ago

This doesn't really block bug 95770. (I wouldn't consider this a blocker for
bug 15391, either.) Numeric character references and UTF-8 work.

MSN isn't serving real XHTML using a real XML content type. (It wouldn't work in
IE if they did.) This bug has nothing to do with MSN.

Does the XML spec really require non-validating parsers to support external
character entities?

Alexey Chernyak

Comment 27

•

23 years ago

What do you mean by "external"? External files that are linked from the main DTD? Well,
XHTML 1.1 is entirely built on that. Nothing is defined in its main DTD and it
has multiple levels of referencing. So I don't see how this could be optional
for XHTML support.

Alexey Chernyak

Updated

•

23 years ago

Keywords: dataloss, intl

Heikki Toivonen (remove -bugzilla when emailing directly)

Assignee

Updated

•

23 years ago

Target Milestone: mozilla0.9.6 → mozilla0.9.7

Sascha Claus

Comment 28

•

23 years ago

*** Bug 107736 has been marked as a duplicate of this bug. ***

Henri Sivonen (:hsivonen)

Comment 29

•

23 years ago

By "external" I mean the XML meaning: not in the same storage object as the main
document. So a separate DTD file is external to an XML document.

Alexey Chernyak

Updated

•

23 years ago

Attachment #10952 - Attachment mime type: text/html → application/xhtml+xml

Alexey Chernyak

Updated

•

23 years ago

Attachment #49600 - Attachment is obsolete: true

Alexey Chernyak

Updated

•

23 years ago

Keywords: testcase

Christopher Hoess (gone)

Comment 30

•

23 years ago

*** Bug 108079 has been marked as a duplicate of this bug. ***

Joerg Heber

Comment 31

•

23 years ago

A preliminary workaround that works with the current Mozilla browser could be:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" [
<!ENTITY copy   "&#169;"> ]>
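
A small sketch of why this workaround helps, using Python's `xml.parsers.expat` binding as a stand-in for a non-validating expat-based parser. An entity declared in the internal subset is expanded without the external DTD ever being read. The `text_content` helper and the sample body text are assumptions for illustration.

```python
import xml.parsers.expat as expat

# The DOCTYPE from the workaround, followed by a made-up one-paragraph body.
DOC = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" [
<!ENTITY copy   "&#169;"> ]>
<html xmlns="http://www.w3.org/1999/xhtml"><body><p>&copy; Example</p></body></html>'''


def text_content(document):
    """Collect character data; no handler is installed for external entities,
    so the DTD URL in the DOCTYPE is never fetched."""
    parser = expat.ParserCreate()
    chunks = []
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)
```

The drawback is that every entity used in the document has to be redeclared this way, which is what the fix for this bug is meant to avoid.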

Heikki Toivonen (remove -bugzilla when emailing directly)

Assignee

Updated

•

23 years ago

Target Milestone: mozilla0.9.7 → mozilla0.9.9

Hixie (not reading bugmail)

Updated

•

23 years ago

Whiteboard: [Hixie-P4]

Heikki Toivonen (remove -bugzilla when emailing directly)

Assignee

Comment 32

•

23 years ago

For Mozilla 1.0 I think it will be good enough if we just do the same thing our
MathML implementation is doing, basically adding the XHTML public ids to the
catalog table and creating xhtml10.dtd.

rbs, could you break the XHTML-only entities away from mathml.dtd and have
mathml.dtd include xhtml10.dtd entities as an external entity file?

Status: NEW → ASSIGNED

Joerg Heber

Comment 33

•

23 years ago

It probably should be xhtml11.dtd, although probably not much has changed since
XHTML1.0.

rbs

Reporter

Comment 34

•

23 years ago

>rbs, could you break the XHTML-only entities away from mathml.dtd and have
>mathml.dtd include xhtml10.dtd entities as an external entity file?

Is this level of granularity worth the trouble at this stage? Since there are 
over 2000 MathML entities, removing a couple of hundred wouldn't make much 
difference (except... the tediousness of trying to figure out the duplicates 
with XHTML and weeding them out :-)

It would be speedier to just save attachment 11622 [details] (which already has 
everything, I think) as xhtml11.dtd and then adding two entries in the catalog 
table for: "-//W3C//DTD XHTML 1.0 Strict//EN" and "-//W3C//DTD XHTML 1.1//EN".
With that, we will be done, and gzip will do the rest.

Alexey Chernyak

Comment 35

•

23 years ago

or use exact W3C entity DTD definitions for XHTML.
I've provided the URLs for them above in comment #24.

rbs

Reporter

Comment 36

•

23 years ago

The interest is about efficiency. Only the entity definitions are of interest.
The comments are going to slow the parsing unnecessarily. When trimmed down, the
result is the lightweight version in attachment 11622 [details] (which could be compiled,
a la fastload, if there was a way to do that).

Heikki Toivonen (remove -bugzilla when emailing directly)

Assignee

Comment 37

•

23 years ago

I can easily strip out the comments from the versions W3C supplies, and combine
them into one file to speed loading. The reason I have not done that yet is
that I was wondering about this comment in mathml.dtd:

This is a *customized* list for Mozilla: characters originally specified 
as combined pairs and plane 1 characters have been remapped to internal
code points within the Unicode's Private Use Area (PUA).

If this does not apply (or I don't need to do anything special) to the XHTML
1.0/1.1 entities, I am all set and can finish this soon. rbs, could you explain
the above comment to us (me) non-Unicode experts ;) ?

Or does attachment 11622 [details] list every entity the W3C lists include so that I
wouldn't need to do anything, basically? I am slightly concerned about just
accepting a list compiled from our HTML code...

rbs

Reporter

Comment 38

•

23 years ago

The remapping is about plane-1 MathML characters (5-digit code points), which
don't work in any application yet, although some work is in progress to
eventually support them in Mozilla (bug 118000). Since HTML/XHTML code points
are only 4-digit, the comments about the remapping don't apply there.

The list from which attachment 11622 [details] was compiled is meant to be identical to
the expected list (otherwise something would be out of sync and need fixing). It
seems to be complete to me. (The other bit is to add 'dtd/*' in the packages as
I am doing over at bug 109826.)

Heikki Toivonen (remove -bugzilla when emailing directly)

Assignee

Comment 39

•

23 years ago

Attached patch Proposed fix 1 (obsolete) — Details — Splinter Review

Using the attachment 11622 [details] as the xhtml11.dtd entity list in this patch. I
would assume we need to add mappings for 4 new public IDs: 3 for XHTML 1.0
(Strict, Transitional and Frameset) and 1 for XHTML 1.1. I also made packaging
changes but I am not sure how to test them yet (MOZILLA_OFFICIAL?). This patch
is missing a (one-line?) change to make the Mac build system copy the DTD. I
think I have a fix but I want to test it first. Haven't tested on Linux yet.

rbs

Reporter

Comment 40

•

23 years ago

Looking good. You might want to get rid of this one since it is a built-in
entity in the XML spec & parser, etc (right?)
+
+<!-- Navigator entity extensions; apos is from XML -->
+<!ENTITY apos "&#39;">

Also, the ordering of the catalog table could be revamped to put the most
frequent stuff first.

Heikki Toivonen (remove -bugzilla when emailing directly)

Assignee

Comment 41

•

23 years ago

Doh, forgot the apos entity; you are right! The most frequently used order would
probably just translate to putting the XHTML ids first, or do you have other
opinions?

Also this seems to work on Linux, now compiling on Mac.

rbs

Reporter

Comment 42

•

23 years ago

Possible re-ordering from:
+ {"-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN", "mathml.dtd"  },
+ {"-//W3C//DTD SVG 20001102//EN",              "svg.dtd"     },
+ {"-//W3C//DTD XHTML 1.0 Strict//EN",          "xhtml11.dtd" },
+ {"-//W3C//DTD XHTML 1.0 Transitional//EN",    "xhtml11.dtd" },
+ {"-//W3C//DTD XHTML 1.0 Frameset//EN",        "xhtml11.dtd" },
+ {"-//W3C//DTD XHTML 1.1//EN",                 "xhtml11.dtd" },

To:
+ {"-//W3C//DTD XHTML 1.0 Transitional//EN",    "xhtml11.dtd" },
+ {"-//W3C//DTD XHTML 1.1//EN",                 "xhtml11.dtd" },
+ {"-//W3C//DTD XHTML 1.0 Strict//EN",          "xhtml11.dtd" },
+ {"-//W3C//DTD XHTML 1.0 Frameset//EN",        "xhtml11.dtd" },
+ {"-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN", "mathml.dtd"  },
+ {"-//W3C//DTD SVG 20001102//EN",              "svg.dtd"     },
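
To illustrate why the order matters, here is a sketch that assumes the catalog is scanned linearly from the top, so entries near the front are found with fewer comparisons. It is written in Python rather than the C++ of the patch, and `CATALOG` and `local_dtd_for` are hypothetical names.

```python
from typing import Optional

# Entries copied from the reordered table above, most frequent first.
CATALOG = [
    ("-//W3C//DTD XHTML 1.0 Transitional//EN",    "xhtml11.dtd"),
    ("-//W3C//DTD XHTML 1.1//EN",                 "xhtml11.dtd"),
    ("-//W3C//DTD XHTML 1.0 Strict//EN",          "xhtml11.dtd"),
    ("-//W3C//DTD XHTML 1.0 Frameset//EN",        "xhtml11.dtd"),
    ("-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN", "mathml.dtd"),
    ("-//W3C//DTD SVG 20001102//EN",              "svg.dtd"),
]


def local_dtd_for(public_id: str) -> Optional[str]:
    """Return the local DTD file for a PUBLIC ID, or None if it is unknown."""
    for known_id, dtd_file in CATALOG:  # linear scan: earlier entries win sooner
        if known_id == public_id:
            return dtd_file
    return None
```

With only six entries the difference is tiny, so the reordering is a nicety rather than a measurable win.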

Joerg Heber

Comment 43

•

23 years ago

XHTML Basic is missing?
"-//W3C//DTD XHTML Basic 1.0//EN" "xhtml-basic10.dtd"

I was also wondering (probably a stupid question) whether the current syntax
covers declarations with a URL such as 

"-//W3C//DTD XHTML Basic 1.0//EN"
    "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd"

Is there a difference in how Mozilla treats that?

Heikki Toivonen (remove -bugzilla when emailing directly)

Assignee

Comment 44

•

23 years ago

I would think that people that author XHTML from scratch would use Strict... but
on the other hand I prefer using Transitional because I am lazy, and that would
seem the best choice if you want to do the minimum work converting your old HTML to
XHTML. I am fine with your suggestion as well.

And it seems we are also missing Basic. 

And we are also missing Modularization of XHTML public IDs. They are a bit more
problematic. Basically what is required is that they conform to the Formal
Public Identifier specification and contain 'XHTML' in the description section.
For example, this would be a correct public ID: "-//Heikki Toivonen//DTD XHTML
Programming Extensions//EN". But currently some validating XML parsers seem to
have difficulty handling that DTD, and apparently there are no XML editors that
can produce documents in that specification. Based on that, I would be happy to
leave this problem until XML Catalogs are implemented.
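
A simplified sketch of the check described above: split the public ID into its formal-public-identifier fields and look for 'XHTML' in the description. It assumes only the common "-//Owner//Class Description//Language" and "+//..." shapes and does not implement the full FPI grammar; the function name is hypothetical.

```python
def looks_like_xhtml_family(public_id: str) -> bool:
    """Heuristic: is this a DTD public ID whose description mentions XHTML?"""
    parts = public_id.split("//")
    if len(parts) != 4:
        return False
    registration, owner, text, language = parts
    if registration not in ("-", "+") or not owner or not language:
        return False
    # The third field is "<text class> <description>", e.g. "DTD XHTML 1.1".
    text_class, _, description = text.partition(" ")
    return text_class == "DTD" and "XHTML" in description
```

Under this heuristic the made-up "Programming Extensions" ID above is accepted just like the W3C ones, while the SVG ID is not.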

Heikki Toivonen (remove -bugzilla when emailing directly)

Assignee

Comment 45

•

23 years ago

Attached patch Mac build changes (obsolete) — Details — Splinter Review

Heikki Toivonen (remove -bugzilla when emailing directly)

Assignee

Comment 46

•

23 years ago

Attached patch New combined patch — Details — Splinter Review

Attachment #66202 - Attachment is obsolete: true

Attachment #66334 - Attachment is obsolete: true

Heikki Toivonen (remove -bugzilla when emailing directly)

Assignee

Comment 47

•

23 years ago

Attached patch Embedding packager changes, want these? — Details — Splinter Review

rbs

Reporter

Comment 48

•

23 years ago

Comment on attachment 66355 [details] [diff] [review]
New combined patch

r=rbs

Attachment #66355 - Flags: review+

rbs

Reporter

Comment 49

•

23 years ago

Comment on attachment 66359 [details] [diff] [review]
Embedding packager changes, want these?

r=rbs

Attachment #66359 - Flags: review+

Adam Lock

Comment 50

•

23 years ago

Comment on attachment 66359 [details] [diff] [review]
Embedding packager changes, want these?

r=adamlock

Johnny Stenback (:jst)

Comment 51

•

23 years ago

Comment on attachment 66355 [details] [diff] [review]
New combined patch

sr=jst

Attachment #66355 - Flags: superreview+

Johnny Stenback (:jst)

Comment 52

•

23 years ago

Comment on attachment 66359 [details] [diff] [review]
Embedding packager changes, want these?

sr=jst

Attachment #66359 - Flags: superreview+

Heikki Toivonen (remove -bugzilla when emailing directly)

Assignee

Comment 53

•

23 years ago

Fixed.

Status: ASSIGNED → RESOLVED

Closed: 23 years ago

Resolution: --- → FIXED

Christopher Aillon (sabbatical, not receiving bugmail)

Comment 54

•

23 years ago

*** Bug 121808 has been marked as a duplicate of this bug. ***

Alexey Chernyak

Comment 55

•

23 years ago

verified on Win2K
works with:

XHTML 1.0 Strict
XHTML 1.0 Transitional
XHTML 1.0 Frameset
XHTML Basic 1.0
XHTML 1.1

as both application/xhtml+xml and text/xml

Status: RESOLVED → VERIFIED

Simon Fraser [no longer active]

Comment 56

•

23 years ago

+    
+    MakeAlias(":mozilla:content:xml:content:src:xhtml11.dtd", "$dist_dir"."dtd:");
+    

This makes a 'dtd' folder next to the application. On Mac OS, this adds to the 
clutter that users see when they view the Mozilla/Netscape folder in the Finder, 
which we really try to avoid.

I filed bug 122710 on this issue.
