Closed Bug 22942 (entities) Opened 25 years ago Closed 13 years ago

Load external DTDs (entity/entities) (local and remote) if a pref is set

Categories

(Core :: XML, defect, P3)

Tracking

RESOLVED WONTFIX

People

(Reporter: nisheeth_mozilla, Unassigned)

References

Details

Attachments

(2 files, 5 obsolete files)

Currently, XML DTDs are loaded if they are pointed to by a chrome URL or if they
are placed in a special local directory.  We need to extend this functionality
so that all local and remote DTDs get loaded if the user sets a pref.
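For illustration, a minimal pair of files that such a pref would affect (the
file names, URL and entity name here are hypothetical):

   x.xml: <?xml version="1.0"?>
          <!DOCTYPE doc SYSTEM "http://example.org/dtds/doc.dtd">
          <doc>&greeting;</doc>
   doc.dtd: <!ENTITY greeting "Hello">

Today &greeting; only resolves if doc.dtd is reachable via a chrome URL or sits
in the special directory; with the pref set, it would also be fetched from the
remote URL.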
Status: NEW → ASSIGNED
Target Milestone: M14
Setting milestone to M14...
spam: added self to cc list as this might affect my realm.
*** Bug 11538 has been marked as a duplicate of this bug. ***
Not a beta blocker.  Setting milestone to M15...
Target Milestone: M14 → M15
...and the pref should be "on" by default.

The "off" mode is mostly useful to Mozilla components like MathML (or other
vendors) that can get their "corporate" DTD auto-installed in the client's
special dtd directory. Web designers out there, by contrast, will not have the
privilege of getting their DTD auto-installed on the user's side.
Moving bugs out by one milestone...
Target Milestone: M15 → M16
Will look at this post beta 2...
Target Milestone: M16 → M17
Marking M18...
Target Milestone: M17 → M18
This bug has been marked "future" because the original Netscape engineer working 
on this is over-burdened. If you feel this is an error, that you or another
known resource will be working on this bug, or if it blocks your work in some way 
-- please attach your concern to the bug for reconsideration.
Target Milestone: M18 → Future
In bug 11538 this problem was proposed to be fixed by a modification to the 
expat glue.

Is this still the way to fix it?

I want to be able to develop multi-lingual apps using XPFE, and since XUL loads 
over http nicely I'd like to load the language-specific DTD stuff over http 
too.  I still don't understand how to get chrome://myfirstxulapp/locale/file.dtd 
to resolve to an http: address rather than a local address.

Since I have a vested interest in this bug, if it is the 'preventer', and if I 
can understand the solution, I can probably work on it ...

HELP!
added myself to cc:
Vidur Apparao is exploring the possibility of implementing synchronous XML 
document loading over HTTP in his XML Extras component.  Once he's done, his 
code could serve as a resource for implementing synchronous DTD loading over 
HTTP.  I suggest that you sync up with him.  Please feel free to take ownership 
of this bug if you want to retarget the milestone to something earlier than 
"Future", which means post Netscape 6.0.
QA Contact: chrisd → petersen
Suggest: all/all  for platform/OS.
OS: Windows NT → All
Hardware: PC → All
adding myself to cc:
*** Bug 68615 has been marked as a duplicate of this bug. ***
Anything new on this issue? DTD over the wire would be great.
*** Bug 69799 has been marked as a duplicate of this bug. ***
For the record, what is the "special local directory"?
bin/dtd
If the external DTD isn't loaded then Mozilla won't be able to navigate to
elements by id (e.g. http://someserver/somedir/somefile.xml#someid) for that file.
The workaround is to move the external DTD into the internal subset.
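A minimal sketch of that workaround (the element and attribute names are
hypothetical): the ID declaration that would normally live in the external DTD
is inlined in the DOCTYPE instead:

   <!DOCTYPE doc [
     <!ATTLIST section name ID #REQUIRED>
   ]>
   <doc>
     <section name="intro">...</section>
   </doc>

With the declaration in the internal subset, somefile.xml#intro navigation
works even though no external DTD is ever fetched.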
I agree that is the current workaround and I thank you for pointing it out here
and on n.p.p.browser. However, that isn't really an acceptable workaround if the
number of element declarations with ID attributes is large and/or the person
linking to the document doesn't have write access to the document.
The code for mangling dtd URLs is in
http://lxr.mozilla.org/seamonkey/source/htmlparser/src/nsExpatTokenizer.cpp#815,
right?
Could we somehow fix this to make it support XML for file: URLs?
Maybe leave the code as is, and if it can't find the file in dtd, check for 
IsScheme("file", isLoadable)?
We XSLT folks would really like to have most of Mozilla's XML support, at least
for local files, so we can run tests and benchmarks from third parties.
Like the DocBook XSLT stylesheets, for example.
For XUL files it would also be great to load external DTDs, especially via http
to enable remote XUL apps.

The behavior for XUL should IMHO not be bound to a pref. If you think it should,
consider two flags: one for XUL (remote app) and one for arbitrary XML.
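A sketch of what two such flags might look like in a Mozilla prefs file (the
pref names here are hypothetical, nothing like them exists yet):

   pref("xml.load_external_dtds", false);   // arbitrary XML documents
   pref("xul.load_remote_dtds", true);      // remote XUL apps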
Mitch, per comment #24, am I right that security considerations would emerge
when enabling the loading of external DTDs for XUL files via http?
Considering security for DTD via http for XUL (comments #24 and #25):

The access could be restricted (as usual) to DTDs served from the originating
server.

Though this might be a problem for "standard" DTDs, e.g. from the W3C. A remote
XUL app would have to rely on the availability of a local copy.

PS: What is the status of synchronous loading mentioned in comment #12?
Blocks: remote-xul
*** Bug 145507 has been marked as a duplicate of this bug. ***
*** Bug 153603 has been marked as a duplicate of this bug. ***
Re: comment #22

The workaround is for users to copy the DTD file into Mozilla's special res/dtd
directory. More precisely, if the URI of the original DTD file is
"protocol://long-path/to/filename.ext", then save the DTD file as "filename.ext"
(i.e., the same basename) in the local res/dtd directory. From there on, Mozilla
will re-associate the external DTD with the local copy it has whenever a DOCTYPE
with that DTD is encountered.
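For example (URL and file name hypothetical), given a document starting with

   <!DOCTYPE doc SYSTEM "http://example.org/dtds/mydoc.dtd">

saving a copy of that DTD as res/dtd/mydoc.dtd makes Mozilla load the local
file in place of the remote one.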
*** Bug 161096 has been marked as a duplicate of this bug. ***
*** Bug 181231 has been marked as a duplicate of this bug. ***
When can we work this out?
QA Contact: petersen → rakeshmishra
Blocks: 114376
Changing summary to make it easier to find this bug.
Summary: Load external DTDs (local and remote) if a pref is set → Load external DTDs (local and remote) if a pref is set/implement validating XML parser
*** Bug 178308 has been marked as a duplicate of this bug. ***
So just a question, and please, I don't intend to sound arrogant, since I have not contributed 
anything to Mozilla (except for some bug reports), but when will this bug be solved? Is it really 
so difficult to solve this little bug? 
Solving this bug requires either switching to a different XML parser completely
or rewriting the existing XML parser to be validating.  Which part of that is a
"little bug"?
bz, IIRC, this is not as bad as you indicate it is.
The main problem is a good strategy for performance.
Marking a dependency on XML catalogs, which should get rid of the requirement
to load dtds for some xml files, and getting that list to be extensible.

About validation, being non-validating just says that we are not required to
load external DTDs, not that we must not.

On the expat side of things, we may have to block the parser in the external
entity ref handler, or even cache the results of it.

All I can say is, DTDs work fine from chrome, and with a little patch, from file:// too.

(oops, just noticed that heikki made this bug a bit about validating parsers,
which is something completely different, IMHO. Shouldn't that be a futured bug
with a dependency on this one in some way, so folks finding one can get to
this one if they just look for DTDs?)
Depends on: xmlcatalog
*** Bug 196188 has been marked as a duplicate of this bug. ***
bz, Christian asked if it is a difficult bug to fix. He didn't imply he thinks it is.

Anyway, I don't get what loading of external DTDs has to do with validating them.
I can see issues with blocking I/O over the net, but no requirement to validate
the DTDs. Mozilla does not validate them for chrome, so why validate them when
they come over the net?

Suggest dropping the suffix from the subject.
Broke validation into bug 196355.
Summary: Load external DTDs (local and remote) if a pref is set/implement validating XML parser → Load external DTDs (local and remote) if a pref is set
*** Bug 201352 has been marked as a duplicate of this bug. ***
Now this patch looks horrible, and it violates most of our coding conventions
etc. This is just the first patch that seemed to work with trivially simple
testcase:

   x.xml: <!DOCTYPE doc SYSTEM "x.dtd"><doc>&hello;</doc>
   x.dtd: <!ENTITY hello "Hello There">

I fully expect this to not work if there are several XML files being loaded at
the same time, or the DTD includes other DTDs. It probably messes up
internalSubset etc. Also this approach blocks the UI completely until the DTD
load finishes. There is not even a pref to set, this tries to load remote DTDs
unconditionally.
Taking, but don't expect anything soon.
Assignee: nisheeth → heikki
Status: ASSIGNED → NEW
Target Milestone: Future → ---
Cc:ing alecf & darin, maybe it should be possible these days to use a
"background" stream that doesn't block the UI.
Not that I know of (I would have done it with XMLHttpRequest and document.load()
synchronous portions otherwise, and the XBL syncloader would do it as well).

However, I thought of another way for this case that should work without
blocking the UI. Our Expat can be blocked without blocking the UI. Therefore, it
should be possible to block the parser when we start to load external DTD
asynchronously, and unblock it once the asynchronous load finishes. Something to
keep in mind is the fact that DTDs can also load other DTDs, so we would need to
block & unblock the parser(s) loading DTDs as necessary.
Some pointers for the "background" stream: bug 11232, bug 190730.

Another thing to consider with the other alternative of blocking the parser is
that some DTDs can be huge.
Personally I'm strongly against having a pref. On any site that depends on
external DTDs, this pref would act as a "make Mozilla fail" pref, which seems
pretty useless. Which user would ever _not_ want to load external DTDs?

We could possibly have a compile-time switch for embedders
We can KIND OF do this in the background by at least letting the DTD load on a
background thread, but I don't think that buys us much since we'd still have to
block the XML parser. Using NS_BackgroundInputStream in combination with fixing
bug 197114 would get you a minor win...
Blocks: 69799
Just a question: how hard would it be to implement XML catalogs in Mozilla? 
From reading all the discussion it seems that catalogs would be the way to go, 
and that it would be much easier to teach people about adding to a catalog than 
about some of the other things suggested in this discussion.
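For context, a catalog is just a mapping file; a minimal OASIS XML catalog
sketch (the system identifier and local path are hypothetical) that redirects a
remote DTD to a local copy would look like:

   <?xml version="1.0"?>
   <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
     <system systemId="http://example.org/dtds/doc.dtd"
             uri="file:///usr/share/dtds/doc.dtd"/>
   </catalog>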
XML catalogs are bug 98413.
Alias: entities
*** Bug 207874 has been marked as a duplicate of this bug. ***
I agree with comment #48, being able to load external DTDs should not depend on
a preference setting. Is there any estimate of when this might eventually get
fixed? I am still unable to view
<http://www.w3.org/2000/xp/Group/2/06/LC/soap12-part1.xml>, which I think is a
bit embarrassing for Mozilla (what, Mozilla cannot load a W3C document? ;-)).
QA Contact: rakeshmishra → ashishbhatt
I also agree with comment #48 and comment #53.  

Most users who have the most to gain from XML will not understand what a DTD is
or why they should have to make a change to their preferences in order to
support it.

It is now more important than ever that Mozilla is seen to be totally W3C
compliant as regards XML.  
*** Bug 202291 has been marked as a duplicate of this bug. ***
*** Bug 225949 has been marked as a duplicate of this bug. ***
(In reply to comment #29)
> Re: comment #22
> 
> Workaround is for users to copy the DTD file in the special Mozilla's res/dtd
> directory...

If I have a DTD which itself includes an external parameter entity file it
seems that only the DTD file and not the subsequent parameter entity file
can be loaded using this workaround. Does anyone have a better understanding
of what is supposed to happen in this case?
(In reply to comment #57)
> (In reply to comment #29)
> If I have a DTD which itself includes an external parameter entity file it
> seems that only the DTD file and not the subsequent parameter entity file
> can be loaded using this workaround. Does anyone have a better understanding
> of what is supposed to happen in this case?

Put the entities as well in the special local directory?
*** Bug 225487 has been marked as a duplicate of this bug. ***
Summary: Load external DTDs (local and remote) if a pref is set → Load external DTDs (entity/entities) (local and remote) if a pref is set
This blocks RSS files containing entities defined in an external DTD (RSS 0.91).
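For reference, such feeds start with a DOCTYPE along these lines (this is the
customary RSS 0.91 declaration; the DTD defines character entities like &deg;):

   <!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
             "http://my.netscape.com/publish/formats/rss-0.91.dtd">

Because the external DTD is never fetched, every entity it defines is unknown
to the parser and loading the feed aborts with an error.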
Blocks: 267375
Blocks: 267350
I'm a total n00b in the field of asynchronous XML parsing and UI blocking
stuff. But would it be possible to have the XML document display while the
external DTDs and entities are loaded, and when an entity is loaded it replaces
the previously displayed entity 'placeholder'? Like an <img ..> tag that is
still loading?
No, because to check for well-formedness (which must be done) entities have to
be expanded as they are found. For example,

   <test>
     &foo;
   </test>

...with:

   <!ENTITY foo "<test>">

...needs to trigger a well-formedness error when &foo; is expanded. AIUI.
*** Bug 267375 has been marked as a duplicate of this bug. ***
Ian, according to the XML spec it is possible to load the xml doc without
loading the external dtd. You would just have unresolved entityrefs in the DOM.
Of course, it would require mozilla DOM support for entityrefs.
And <!ENTITY foo "<test>"> is invalid, because entities must be well-formed
independent XML.
I don't think that is true. According to
http://w3c.org/TR/2004/REC-xml11-20040204/#NT-EntityValue
entity values just can't contain % and &; for the rest anything is valid, and
Ian is right that referencing &foo; would trigger a well-formedness error if
there is no &slashfoo; reference either (</foo>).

Isn't it possible to defer well-formedness and validity checking until after the
loading, while still displaying some placeholder or just &foo; while the
document is still loading? If after loading all entities there are still
undefined references or the resulting document is not well-formed, then throw an
error.
It might be confusing though for users to see their document load and suddenly
get an error. Maybe XML wfness and undefined entity errors should be
displayed in the same way that javascript errors are. Although wfness errors are
usually more critical than undefined entities.
Wouldn't rule 3 from http://w3c.org/TR/2004/REC-xml11-20040204/#sec-well-formed
mean they have to be well-formed? As I understand it, unparsed entities are not
expanded, but just made available via the entity reference. (if this is wrong,
then I'm sorry, but reading too much of the XML spec gives me a headache ;) )
Silver is correct, each parsed entity must be well-formed. If this were not the
case the DOM entityrefs spec would be useless.

It is not possible to delay well-formedness checking, because it is a question
of parsing and affects document structure. We use a non-validating parser, so we
don't have to worry about validity. We *can* use non-resolved DOM entityrefs to
do delayed loading of external DTDs, but the expat parser does not currently
support this approach, and it would require significant changes in the mozilla
core DOM. Not impossible, but a serious undertaking.
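A minimal sketch of the distinction being discussed (entity names are
hypothetical): both declarations below parse, but rule 3 of the
well-formedness rules cited above only bites when the entity is referenced:

   <!ENTITY para "<p>hello</p>">  <!-- OK: replacement text is well-formed on its own -->
   <!ENTITY open "<p>">           <!-- parses, but referencing &open; in content is a fatal error -->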
I stand corrected.
*** Bug 288767 has been marked as a duplicate of this bug. ***
See bug #299682 for my suggestion about preferences unification for different
kinds of external resources, including DTDs.
*** Bug 305877 has been marked as a duplicate of this bug. ***
Assignee: hjtoi-bugzilla → peterv
Depends on: 274777
Attached patch wip (obsolete) — Splinter Review
Some edge cases probably don't work yet.
Need to extract some of the parser changes in smaller patches.
Need to add a pref.
Personally I'm against having a pref for this. Authors need to be able to depend on this feature being there otherwise it's pretty useless. Additionally, no user is going to know what the heck that pref is.

I guess we could make it a hidden pref, but I don't really see the point in that.
What's the status of this? Did it get any easier to fix after the landing of Expat 1.95.8? The successful fix of this bug would make some XML work I'm doing much easier :)
This bug still exists in Firefox 1.5.0.1.
See for example the W3C XML Namespace specification (XML version):
http://www.w3.org/TR/2004/REC-xml-names11-20040204/REC-xml-names11-20040204.xml
It's a pity to have to resort to IE.
The status is that some pieces of the patch have been landed (bug 323299) and I need to sync it with trunk.
And please, no more comments like comment 76, they're not useful.
Status: NEW → ASSIGNED
Would it be possible to make the DTD search path configurable for a single XMLHttpRequest?  The reason I ask this is regarding the Newsfox plugin http://newsfox.mozdev.org/ which breaks when an entity such as &deg; is included with a proper RSS 0.91 feed.  Being able to tell it to search in chrome://newsfox/DTDs or some such would allow the problem to be overcome.

While on Win32 you can install to the res/dtd dir, on *nix you would generally require root access to install there.
I'm not sure exactly what you mean, but in any event that's a different bug.
Re: comment 79

Basically, because external DTDs aren't loaded, and we can't reliably install new ones to the res/dtd dir, some way of working around this bug would be very handy (will open a new feature request).
Any news on this bug yet?

C'mon guys, this was first logged in Jan 2000! There's plenty of votes, plenty of people commenting, and there's been plenty of patience. Can't someone change the milestone and put some effort in this?

(I know I'm going to get flamed and told that this comment "is not useful", but so far this bug has stagnated and NO comment seems to have helped)
What's the main reason this is not yet resolved?  I admit I know little about Mozilla's/expat's parsing of XML, but as *some* external entities are loaded, why can't this restriction simply be lifted so *all* external entities will be recognized?

If you all finally agree this should get in, why can't we simply load any DTDs the way chrome ones and those in res/dtd are loaded?  While there are already some patches, I'd volunteer to work on this if that's all that is needed here.
(In reply to comment #82)
> What's the main reason this is not yet resolved?  I admit I know little about
> Mozilla's/expat's parsing of XML, but as *some* external entities are loaded,
> why can't this restriction simply be lifted so *all* external entities will be
> recognized?

External DTDs need to be loaded asynchronously, so you need to block Expat while loading the DTD. Expat does not support that currently, mainly because it's non-trivial. It's unclear whether the patch I have is complete.
So the difference here between chrome/res DTDs and general external DTDs is that for the local ones you can assume they load quickly and blocking is OK, while for the others you cannot?
Attached patch v1 (obsolete) — Splinter Review
This probably regresses bug 61630 and bug 191482, so need to figure out a solution for that.
Attachment #120224 - Attachment is obsolete: true
Attachment #202535 - Attachment is obsolete: true
Attached patch v1.1 (obsolete) — Splinter Review
Fix the two issues mentioned in comment 85.
Attachment #267628 - Attachment is obsolete: true
Attached patch v1.2 (obsolete) — Splinter Review
I've used a same-origin policy for loading the DTDs for now; chrome and the known local DTDs can still be loaded by anyone as before. If a DTD can't be loaded (because of a different origin, a denied redirect, or failed authentication) we continue parsing, though you'll probably still see an error because of missing entities then. Errors in the DTD, or recursively loading the same entity, do get reported and stop parsing.
I'll ask mrbkap to take a look at the nsParser changes too.
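To illustrate the same-origin rule described above (URLs hypothetical):

   http://example.com/doc.xml -> SYSTEM "http://example.com/doc.dtd"    loaded
   http://example.com/doc.xml -> SYSTEM "http://other.example/doc.dtd"  refused; parsing continues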
Attachment #267712 - Attachment is obsolete: true
Attachment #268004 - Flags: superreview?(jst)
Attachment #268004 - Flags: review?(jst)
Comment on attachment 268004 [details] [diff] [review]
v1.2

Will post a new patch that doesn't recurse for DTDs loading DTDs.
Attachment #268004 - Flags: superreview?(jst)
Attachment #268004 - Flags: review?(jst)
Attached patch v1.3Splinter Review
Attachment #268004 - Attachment is obsolete: true
Attachment #269859 - Flags: review?(mrbkap)
Attachment #269859 - Flags: review?(mrbkap) → review+
Quite a while ago, I submitted this bug report.  The original problem I reported was that an error occurs when viewing an XML document with an entity reference that is defined in a DTD referenced by the DTD attached to the XML document.  Secondly, if you use a workaround for the entity reference problem, the application of the XSLT stylesheet fails to produce the desired result.  I'm concerned that some of the issues raised by the initial report have been forgotten.

After expanding the attachment, point your browser at simpdoc/simpdoc.xml.

The README in this attachment includes an extended description of the problem.  The contents of the README follow.

Simpdoc is a stylesheet for using XML to produce XHTML documents with
automatically numbered sections and references, and an automatically
generated table of contents.  It was designed with the idea that the
stylesheet would be applied within a browser, and one would publish
content by making the XML document, its DTD, and the stylesheets,
accessible.  Browsers would be given the URL for the XML document.

For Firefox, there are currently two problems when attempting to view
the XML document.  The XML reader doesn't understand the &copy; entity
reference even though it is defined in files referenced by the DTD.
If one replaces the &copy; entity reference with &#0169;, one can view
the document, but the XSLT stylesheet's transformations fail to
produce the numbered sections, references, and table of contents.

The enclosed Java program validates the document and correctly applies
the transformation.  See the GNUmakefile for instructions on how to
run the program behind a proxy.
Attachment #269859 - Flags: superreview?(bzbarsky)
Comment on attachment 269859 [details] [diff] [review]
v1.3

>Index: content/base/src/nsContentSink.cpp
>   if (mCanInterruptParser) {
>-    mDocument->UnblockOnload(PR_TRUE);
>+    UnblockOnload();

So to be honest... I think we can just nuke this code, no?

nsDocument::BeginLoad and nsDocument::EndLoad block and unblock onload respectively.  These are called for everything except XUL.  XUL does its own load blocking in PrepareToWalk and EndLoad.  So I don't think you need these even there.

I meant to file a bug on removing it some time back and forgot to.

>Index: parser/expat/lib/expat.h

The new members could use some serious documenting.  What do they actually mean?  Where are the magic "2" values coming from?  This is very hard to review as-is.

I'd really like those docs before I look at this further.
Comment on attachment 269859 [details] [diff] [review]
v1.3

sr- pending that documentation
Attachment #269859 - Flags: superreview?(bzbarsky) → superreview-
Comment on attachment 269859 [details] [diff] [review]
v1.3

added approval1.9 request
Attachment #269859 - Flags: approval1.9?
Uh... you generally want to have reviews before doing that.
Blocks: tomtom
Am I right in thinking that this won't land for Fx3? If it won't, is it something that could make it into a point release, or is it something that won't be seen in the wild until Fx4?

I'm about to start work on a large remote-XUL application in the XUL dark matter world, for which Fx3 or above (or Prism) will be a requirement. This issue affects the way in which I'll handle localisation, so even a vague idea of a timescale would let me know whether external DTDs will be a practical solution, or if I'll have to do something server-side instead.
I hate to make a comment without having the ability to do something to help, but if there is one bug fix that would make a whole lot of people happy (if the 74 votes weren't a good indication), I think it must be this one. (that and https://bugzilla.mozilla.org/show_bug.cgi?id=267350 ) There is so much data we're missing out on in our favorite browser because of this... 
Please don't WONTFIX this - or at least not without defining some useful way of handling localisation for Remote XUL (running on an intranet, in our case).

Presently our Remote XUL app deals with translations by using DTD files for entities which are inserted inline into each page on the server-side. Basically every page carries the whole DTD for the language with it, which has a huge effect on the amount of data we transfer. If this bug ever gets fixed, we'll be able to instead insert a link to the DTD, significantly reducing this burden (assuming the DTD itself gets cached).

There may be issues with DTDs on the web at large, but as far as I know there's presently no recommended way to deal with localisation of Remote XUL. Fixing this issue (for XUL, anyway) at least provides a viable option, whereas a WONTFIX would leave Remote XUL even further out in the cold than it already is.
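A sketch of what a fix would allow for this localisation use case (the host,
file and entity names here are hypothetical):

   <?xml version="1.0"?>
   <!DOCTYPE window SYSTEM "http://intranet.example/locale/de/app.dtd">
   <window xmlns="http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul"
           title="&app.title;">
     <label value="&app.greeting;"/>
   </window>

with app.dtd containing only the translations:

   <!ENTITY app.title "Beispielanwendung">
   <!ENTITY app.greeting "Hallo">

so each page carries a short DTD reference instead of a full inline copy of
every entity.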
Henri, the articles are irrelevant, because they talk about validation (and fetching the DTD instead of using built-in, local versions of the DTDs).
This bug is mainly needed for remote XUL, desperately so. We need to be able to use the same XUL locally and remotely and not entirely rewrite the XUL source when putting it on a server. Thus, in practical effect, this bug has prevented usage of remote XUL in several bigger commercial projects.
The articles don't talk about validation. They talk about loading DTDs in a Web context (mainly for character entities).

For the localization use case, the server could be made aware of the user's locale (e.g. via Accept-Language) and the server could parse the document with the locale-specific DTD and reserialize the resulting infoset as DTDless XML.

(Using XUL box model with HTML5 markup seems like an approach that would be more compatible with other HTML user agents than using XUL markup remotely.)
Well, your article does talk about validation. Usually, each remote XUL document
would have its own DTD file, located on the same server - so what's the point of
talking about a single point of failure?

And while everybody here is aware of alternative approaches to localization
(which all have their advantages and disadvantages), localizing remote XUL in
the same way local XUL can be localized would still be a great improvement.
The single point of failure issue isn't about XUL. It's about enabling DTD loading for XML in general (which would include XHTML, SVG and MathML).
Note that currently the patch enforces the same origin policy on the DTD loading, making single point of failure less of an issue.
By such logic (about having no external DTDs), FF should not support stylesheets or external scripts because some browsers might not use them or know how to use them, or because some people link to the stylesheets used on external sites.

This is also relevant for allowing document creators to work within their document applications using convenient shortcuts (entities) in a language-neutral way. To give an example, here is a page I created in my own translation of XHTML code into Chinese equivalent (i.e., a "Chinese XHTML", where the tags themselves have a one-to-one correspondence with XHTML (besides allowing CSS-as-XML to allow CSS to be internationalized as well) but which use the Chinese script): http://bahai-library.com/zamir/chintest9.xml . This works with a stylesheet to convert it into the "English" XHTML which browsers can render. (If CSS were comprehensive enough to cover things like forcing a tag to display as a form, CSS could be used instead of XSL.)  If DTD parsing were supported, one could conveniently redefine XHTML entities such that entities such as &nbsp; could instead be represented with Chinese character equivalents (or other language equivalents)--without having to add all possible entities to the internal subset.

A solution for entities wishing to make DTD's available (as opposed to DTD's making entities available!) without facing undue burdens on their servers, is simply not to make their DTD(s) available as files which can be linked to (or serve them differently perhaps). And couldn't the PUBLIC identifier be relied on to preclude or limit external loading for well known dialects? Isn't that its purpose?

But in some cases, sites such as Yahoo have even actually encouraged people to point to scripts for reuse at their site, so no doubt some sites (with deep pockets or small communities) would similarly want to even encourage reuse of their own DTDs without users needing to save and define them locally.

As far as there being a problem with external DTD parsing being optional, while obviously using one will break things for some people, since people are already using them anyways, why not set the bar higher instead of lower and try to implement it so that the incentive for others to implement becomes higher?

As far as Tim Bray's statement you cite, I happen to disagree with that. Why couldn't browsers cache entity files that it finds referenced? Anyhow, if document creators put documents with DTDs on the web (which they are already using offline), it is certainly better to be able to load a document in some form than to get an error message! If the users will feel burned by having choppy rendering, they can urge the document creator to let a script preprocess and render it dynamically with entities resolved, but I don't see how it's our business to tell people (including myself and apparently a good number of others based on the popularity of this bug who'd just like to put their XML documents out there without having to rewrite them) how they must use XML and prevent them from sharing documents already in wide use off of the web. If DTD's are cached, this could even SPEED UP rendering in some cases, especially if external entities could be used.

As others have also said, I hope this will be enabled by default, and only be DISABLE-able by a preference.

This is not only about XUL either, while that is one good argument. This is about the EXTENSIBLE Markup Language that allows people to have the freedom to conveniently make their own applications and share them. This is also relevant to XML languages which are already standards, like TEI or DocBook, as well as new applications.

And why do people need to make a false dichotomy between web and non-web? If I have a document that works offline, why not be able to share it online? Respectfully, what's so difficult to understand about that? I also think that it is a very limiting assumption to say that web users only wish to use X/HTML/SVG/MathML. Let a hundred standards bloom!
(In reply to comment #97)
> I suggest WONTFIXing this.
> 
> See
> http://hsivonen.iki.fi/no-dtd/
> http://groups.google.com/group/mozilla.dev.tech.mathml/browse_thread/thread/e7f7efbb5e161348/8d64a935fe730de7

The “single point of failure” argument doesn’t seem to hold water in light of comment #103 and DTD catalogs. You could also deal with new DTDs by implementing a (DTD) update service similar to the ones used for Firefox add‐ons, browser updates, or Live Bookmarks.

The fact that the feature is optional is a problem for documents that must be used in multiple browsers, but having the feature is certainly better than nothing and implementing it may encourage other vendors to do the same. This is pretty much the same process for /anything/ a browser vendor is the first to implement (even when the feature is required, not optional, per a given specification).

Anyway, I think this feature is more useful than for just character references which can already be dealt with via numeric character references or UTF-8. One example of where I’d find this feature useful is in reducing the verbosity of repeated code; e.g., repeated occurrences of |<abbr title="Extensible Hypertext Markup Language">XHTML</abbr>| could be changed to the much less verbose and more human‐readable |&XHTML;| via |<!ENTITY XHTML "<abbr title='Extensible Hypertext Markup Language'>XHTML</abbr>">|. Another example: |<a href="&YTV;jGUQDdfr2ZQ">&YTV;jGUQDdfr2ZQ</a>| via |<!ENTITY YTV "http://www.youtube.com/watch?v=">|. I think that you could also use this feature to implement external CSS style sheets with constants/variables (until something like CSS Variables are implemented, at least).
While comment #104 is the real reason why this should be implemented, few people will use that. Comment #105 is what most web authors will use this feature for, me included, and it has awesome potential.
(In reply to comment #103)
> Note that currently the patch enforces the same origin policy on the DTD
> loading, making single point of failure less of an issue.

I didn't realize that when first commenting. Yeah, same origin takes away the DDoS and single point of failure issues. However, it also precludes the use of well-known DTDs for character entities. (Relaxing the same origin policy using Access-Control would reintroduce the DDoS and the single point of failure problems.)

Enforcing the same origin policy (and it indeed needs to be enforced in the general case to avoid data leakage) means that authors can only use DTDs as a macro mechanism where they themselves host the expansions. When the entity references and the entity definitions come from the same origin, the usefulness of late expansion in the browser is greatly diminished. It seems to me that what is left is a rather small gain at the cost of introducing an incompatibility of Fatal Error proportions with previous Gecko versions and other browsers and making it prohibitively harder to kill DTDs in a future version of XML.

When the remote XUL and the DTDs come from the same origin, the entities could be expanded by the server at the cost of breaking the cacheability of the XUL document. Doing so wouldn't cause a "Fatal Error" level of incompatibility with previous Gecko versions, though, and wouldn't make the Web dependent on external DTD processing.

(In reply to comment #104)
> By such logic (about having no external DTDs), FF should not support
> stylesheets or external scripts because some browsers might not use them or
> know how to use them, or because some people link to the stylesheets used on
> external sites.

Style sheets are optional by design. Style sheets also provide more usefulness to the interoperable Web platform than external DTD processing would.

> This is also relevant for allowing document creators to work within their
> document applications using convenient shortcuts (entities) in a language
> neutral way. To give an example, here is a page I created in my own translation
> of XHTML code into Chinese equivalent (i.e., a "Chinese XHTML", where the tags
> themselves have a one-to-one correspondence with XHTML (besides allowing
> CSS-as-XML to allow CSS to be internationalized as well) but which use the
> Chinese script): http://bahai-library.com/zamir/chintest9.xml . This works with
> a stylesheet to convert it into the "English" XHTML which browsers can render.
> (If CSS were comprehensive enough to cover things like forcing a tag to display
> as a form, CSS could be used instead of XSL.).  If DTD parsing were supported,
> one could conveniently redefine XHTML entities such that
> entities such as &nbsp; could instead be represented with Chinese character
> equivalents (or other language equivalents)--without having to add all possible
> entities to the internal subset.

You can do the substitutions on the server side. Sending home-grown vocabularies without well-known semantics over the public Web breaks processing based on well-known semantics, which leads to bad Babelization of markup.

> A solution for entities wishing to make DTD's available (as opposed to DTD's
> making entities available!) without facing undue burdens on their servers, is
> simply not to make their DTD(s) available as files which can be linked to (or
> serve them differently perhaps). 

I fail to see how it is an undue burden to expand entities on the server. Servers hosting Web apps do much more complex tasks all the time.

> And couldn't the PUBLIC identifier be relied
> on to preclude or limit external loading for well known dialects? Isn't that
> its purpose?

The public id is a legacy construct from pre-URI SGML era.

> But in some cases, sites such as Yahoo have even actually encouraged people to
> point to scripts for reuse at their site, so no doubt some sites (with deep
> pockets or small communities) would similarly want to even encourage reuse of
> their own DTDs without users needing to save and define them locally.

Preventing data leakage would require the use of Access-Control, which would reintroduce the DDoS on www.w3.org.
http://dev.w3.org/2006/waf/access-control/

> As far as there being a problem with external DTD parsing being optional, while
> obviously using one will break things for some people, since people are already
> using them anyways, why not set the bar higher instead of lower and try to
> implement it so that the incentive for others to implement becomes higher?

People can't be already relying on external DTDs in Web content (even if they refer to external DTDs due to copying and pasting from a W3C example) as the top three browsers that support XHTML and SVG don't load external DTDs.

If fetching external DTDs is introduced to the Web platform, we can never get rid of the feature once people start relying on it. It would be a shame not to be able to kill DTDs in a future version of XML, because DTDs represent the vast majority of complexity of XML (both spec and implementations) but DTDs represent the tiny minority of usefulness of XML. (Indeed, the trend is clearly away from DTDs in XML vocabulary design just about everywhere outside the XHTML2 WG.)

Moreover, for a couple of years now, just about every proposal of what a new major revision of XML should be like makes killing DTDs or killing external DTDs a point of improvement. Clearly, the way the wind is blowing is away from DTDs.

> As far as Tim Bray's statement you cite, I happen to disagree with that. Why
> couldn't browsers cache entity files that it finds referenced?

Caching the bytes doesn't remove the perf hit when parsing. It also doesn't remove the DDoS problem when first fetching a well-known DTD.

> This is not only about XUL either, while that is one good argument. This is
> about the EXTENSIBLE Markup Language that allows people to have the freedom to
> conveniently make their own applications and share them.

Not using well-known vocabularies is bad for semantic-dependent processing e.g. for accessibility and search.

> This is also relevant
> to XML languages which are already standards, like TEI or DocBook, as well as
> new applications.

TEI and DocBook aren't Web languages. XHTML5, SVG and MathML are. New applications of XML are pretty much always DTDless (outside the XHTML2 WG).

> And why do people need to make a false dichotomy between web and non-web?

Non-Web doesn't burden browsers, so it isn't a concern when considering what browsers need to keep supporting for decades if not centuries to come.

> Let a hundred standards bloom!

That goes against the very point of having a standard.

(In reply to comment #105)
> You could also deal with new DTDs by
> implementing a (DTD) update service similar to the ones used for Firefox
> add‐ons, browser updates, or Live Bookmarks.

I think the cost/benefit ratio of introducing an update mechanism to support a sunsetting legacy feature of XML is unfavorable.
 
> The fact that the feature is optional is a problem for documents that must be
> used in multiple browsers, but having the feature is certainly better than
> nothing and implementing it may encourage other vendors to do the same. 

That's part of the problem. If other vendors implement this too, a new version of XML can't remove DTDs.
> When the remote XUL and the DTDs come from the same origin, 
> the entities could be expanded by the server at the cost 
> of breaking the cacheability of the XUL document.

You throw around the idea of expanding entities on the server as though it is a panacea that will make this bug redundant. Unfortunately this is not the case:

1) As you mention, it breaks the cacheability of the XUL document
2) It breaks code compatibility between local and remote XUL
3) It assumes the availability of a server-side processing environment, precluding the use of a static web server
4) It assumes that the developer actually knows how to code such a system
5) Even if they do, it adds an extra burden on them to write the code
6) It introduces non-portable code that needs to be re-written for each supported server-side language
7) It automatically places a larger burden on the server even if it's not always appropriate to do so (see 3, for example)


If the answer to external references is to just "expand them on the server" then why do we have separate CSS files - their content could just be injected into style attributes on the page. Separate Javascript files? Let's just inject them directly into <script> blocks. Heck, while we're at it we may as well convert all image references into their corresponding data URLs and inject them as well.


"Fix it on the server" should be an option, not a requirement.
> (In reply to comment #103)
> > Note that currently the patch enforces the same origin policy on the DTD
> > loading, making single point of failure less of an issue.
>
> I didn't realize that when first commenting. Yeah, same origin takes away the
> DDoS and single point of failure issues. However, it also precludes the use of
> well-known DTDs for character entities. (Relaxing the same origin policy using
> Access-Control would reintroduce the DDoS and the single point of failure
> problems.)
>
> Enforcing the same origin policy (and it indeed needs to be enforced in the
> general case to avoid data leakage) means that authors can only use DTDs as a
> macro mechanism where they themselves host the expansions. When the entity
> references and the entity definitions come from the same origin, the usefulness
> of late expansion in the browser is greatly diminished. It seems to me that
> what is left is a rather small gain at the cost of introducing an
> incompatibility of Fatal Error proportions with previous Gecko versions and
> other browsers and making it prohibitively harder to kill DTDs in a future
> version of XML.

As far as incompatibilities with previous Gecko versions, it seems most Gecko users upgrade eventually anyways ( http://en.wikipedia.org/wiki/Mozilla_Firefox#Market_adoption ), so I don't see this as a very big issue, especially when we're talking about documents which are likely to be a relatively niche interest. While this might affect access to new XUL applications from older Gecko-based browsers, so is also the case with new extensions which only work with the latest Gecko version.

As far as incompatibilities with other browsers, yes they won't work in some of them--until other browsers decide to support the feature. It should be noted that one other important browser, IE, already DOES by default allow parsing of entities in a referenced DTD (for XML), so the "other browsers" argument is, I think, made a little weaker thereby. Also, allowing it also enables people to easily access documents which were created and consumed originally for off-web uses (or for use in some environments with server-side preprocessing but which are shared in other environments); to those who have made such documents--this is hardly a small gain. And, as far as XUL applications which rely on them, these won't work in non-Gecko browsers anyways.

Nevertheless, I do concede there is this significant paragraph from the spec:

"For maximum reliability in interoperating between different XML processors, applications which use non-validating processors SHOULD NOT rely on any behaviors not required of such processors. Applications which require DTD facilities not related to validation (such as the declaration of default attributes and internal entities that are or may be specified in external entities) SHOULD use validating XML processors."

    (Section 5.2 at http://www.w3.org/TR/xml/#safe-behavior )

Some might, however, take this as an argument to also add DTD validation (for XML), as IE can also be made to do... And these are not "MUST"'s either.

The most persuasive argument here against DTD support I think is what you say about it being prohibitively harder to kill DTDs in a future version of XML, or rather the corollary to that, that if they are killed off, document creators who've taken the time to put their documents on the web may suffer, as support for them gets dropped (assuming a new XML would be backwards compatible). However, I tend to think that once people become familiar with the concept of a technology, they may find it EASIER to move to a new way of doing it, if they've had a chance to use it in some form already (and a new XML could still use the same syntax for entities potentially), which is not to speak of the benefits now.

(Hopefully a future version of XML would not preclude external entities in some form--whether integrated within attached schemas or not, and have it be required or at least encouraged by the spec--entities/macros/aliases seem to have little relation to validation while still being very helpful to those preparing documents for quick sharing and who don't like typing a lot.)

> When the remote XUL and the DTDs come from the same origin, the entities could
> be expanded by the server at the cost of breaking the cacheability of the XUL
> document. Doing so wouldn't cause a "Fatal Error" level of incompatibility with
> previous Gecko versions, though, and wouldn't make the Web dependent on
> external DTD processing.

I heartily agree with the commenter for #108 on this one. Whole XML languages like XSL and XForms were designed in part to make things simpler for average people precisely by avoiding the _dependency_ on a server or server-side scripting languages.


> (In reply to comment #104)

>> By such logic (about having no external DTDs), FF should not support
>> stylesheets or external scripts because some browsers might not use them or
>> know how to use them, or because some people link to the stylesheets used on
>> external sites.
>
> Style sheets are optional by design.

Sorry I came off a little stronger in my emphasis than I meant to on this; I do understand your reasoning, even while I still disagree strongly with the conclusion. While  a DTD failure is more serious (though presumably a user of pure XML is going to be more savvy when they see such a failure than the average HTML user) and style sheets are indeed optional by design, please also consider that even stylesheets (not to mention scripts), while they are, per accessibility guidelines, supposed to enhance functionality, often inevitably end up being created by some users in such a way as to make them essential for understanding a page (such as when using absolute positioning); by allowing this technology, it also allows for the possibility of a single point of failure.

While this sometimes dependency may be relatively infrequent with stylesheets (not to mention a practice to be discouraged), potential dependencies on JavaScript being created as a result of it being allowed with XHTML is much harder to avoid. Even accessibility guidelines speak about trying to improve accessibility of scripts and not unconditionally forego them if a dependency may be created. Given that not every user agent can implement every feature one might think suitable for the web, sometimes modularity will mean that some things do not work everywhere. The important thing is open standards for new features and not preventing the possibility of single points of failure at any cost (if that were the case, SVG should never have been added to Firefox or even remote XUL, XForms should not have been introduced, etc.).

> Style sheets also provide more usefulness to
> the interoperable Web platform than external DTD processing would.

For most document viewers (besides those who like to learn from source code or those who could take advantage of browser caching), the benefits would probably be less than for stylesheets, but for document creators, especially those who simply want to put an XML document online which could be viewed as a tree (while being made discoverable and processable via the web by remote applications in the process) or for users of pure XHTML familiar with entities who want to write quickly, they may be more concerned with DTDs than stylesheets.

And besides for XUL localization, I see some admittedly smaller spillover benefits that might occur if this bug were resolved, as DTD-related DOM functions could be meaningfully implemented to allow developers to work with available DTD entities or create or identify entity references in a standard way (e.g., bug 9850).

>> This is also relevant for allowing document creators to work within their
>> document applications using convenient shortcuts (entities) in a language
>> neutral way. To give an example, here is a page I created in my own translation
>> of XHTML code into Chinese equivalent (i.e., a "Chinese XHTML", where the tags
>> themselves have a one-to-one correspondence with XHTML (besides allowing
>> CSS-as-XML to allow CSS to be internationalized as well) but which use the
>> Chinese script): http://bahai-library.com/zamir/chintest9.xml . This works with
>> a stylesheet to convert it into the "English" XHTML which browsers can render.
>> (If CSS were comprehensive enough to cover things like forcing a tag to display
>> as a form, CSS could be used instead of XSL.). If DTD parsing were supported,
>> one could conveniently redefine XHTML entities such that
>> entities such as &nbsp; could instead be represented with Chinese character
>> equivalents (or other language equivalents)--without having to add all possible
>> entities to the internal subset.
>
> You can do the substitutions on the server side. Sending home-grown
> vocabularies without well-known semantics over the public Web breaks processing
> based on well-known semantics, which leads to bad Babelization of markup.

Look--if one does not learn English (or latin scripts) fairly early in life, English code is itself a Babel-ish (or babble, but not babelfish). Home-grown vocabularies can very easily become standardized, as so frequently happens. Someone innovates, and others may revise or take it up. Should we reject all of the features that came out of the Netscape-Explorer wars if they weren't originally designed in a W3C committee and subsequently implemented universally? Yes, now things are mature enough that the benefits of reaching a consensus before adding markup for new FEATURES is more recognized, but that still doesn't hold back subcommunities (including very large ones such as WHATWG) to work on alternative standards which eventually do get adopted across the board. Things happen like this on a smaller scale all the time as well, and good, useful vocabularies come out of this process.

XML can already be styled on the web with CSS. While there is some validity I think to being concerned about a Babelization occurring with people using their own markup, especially when multiple ones for the SAME functionality become more popular (i.e., "Which standard do you pick?"), I think the average XML document creator understands that if they use vocabularies which do not have some following, they are risking some mild interoperability issues, such as a presumably potentially lesser prominence in some search engine results.

As long as the code can be rendered in a standard way, interoperability is not a serious issue (except perhaps where multiple similar standards crop up though I think this problem is inevitable where innovation is allowed to take place). As long as people are not constantly adding STYLING languages, one can choose from a conveniently small number of languages like CSS, XSL, or just XHTML and make any semantic markup work without there being serious problems.

>> A solution for entities wishing to make DTD's available (as opposed to DTD's
>> making entities available!) without facing undue burdens on their servers, is
>> simply not to make their DTD(s) available as files which can be linked to (or
>> serve them differently perhaps).
> I fail to see how it is an undue burden to expand entities on the server.
> Servers hosting Web apps do much more complex tasks all the time.

Sorry, I think I meant to say here that the undue burdens were faced by the individuals, not the servers (though in taking advantage of browser caching, it could also relieve the server).

And as far as the burden on document creators, it is only not a burden for those:
1) who have access to a server
2) who wish to distribute their documents over a server.
3) who are familiar enough with scripting languages to perform entity expansions
4) who consider the benefits great enough to overcome the hassle of coding the expansions if they don't have a library to do it for them
5) who don't desire to give a convenient way to others viewing their output's source code to be able to see or reuse their translations (e.g. to contextually discover useful entity files in the vein of http://www.w3.org/TR/xml-entity-names/ )

>> And couldn't the PUBLIC identifier be relied
>> on to preclude or limit external loading for well known dialects? Isn't that
>> its purpose?
>The public id is a legacy construct from pre-URI SGML era.

Regardless of its origins, it became and is a part of the XML spec.

>> But in some cases, sites such as Yahoo have even actually encouraged people to
>> point to scripts for reuse at their site, so no doubt some sites (with deep
>> pockets or small communities) would similarly want to even encourage reuse of
>> their own DTDs without users needing to save and define them locally.
>
> Preventing data leakage would require the use of Access-Control, which would
> reintroduce the DDoS on www.w3.org.
> http://dev.w3.org/2006/waf/access-control/

As I mentioned, sites can do what they wish--not include DTDs at public URLs at all (telling people how to create a SYSTEM identifier if the users will wish to use one to get access to their DTD's defined entities) or hope user agents rely on PUBLIC identifiers. What I was saying above is that some sites are willing to take the risk (they don't want to prevent data leakage at all), either due to having deep pockets like Yahoo or having a small community base.

>> As far as there being a problem with external DTD parsing being optional, while
>> obviously using one will break things for some people, since people are already
>> using them anyways, why not set the bar higher instead of lower and try to
>> implement it so that the incentive for others to implement becomes higher?
>
> People can't be already relying on external DTDs in Web content (even if they
> refer to external DTDs due to copying and pasting from a W3C example) as the top
> three browsers that support XHTML and SVG don't load external DTDs.

Who says we're only talking about XHTML and SVG? As I mentioned, Explorer (at least as of v7) does grab external DTDs for entity parsing in XML. Anyways, the idea about setting the bar higher still applies to being able to use entities in XHTML and SVG in the future.

>> As far as Tim Bray's statement you cite, I happen to disagree with that. Why
>> couldn't browsers cache entity files that it finds referenced?
>
> Caching the bytes doesn't remove the perf hit when parsing.

No, but transmission over the net is by far the bigger bottleneck to worry about.

> It also doesn't remove the DDoS problem when first fetching a well-known DTD.

Again, let browsers rely on PUBLIC identifiers for those DTDs which are assigned a live URL, and let people choose whether they want to take that risk on making their own DTDs available.

>> This is not only about XUL either, while that is one good argument. This is
>> about the EXTENSIBLE Markup Language that allows people to have the freedom to
>> conveniently make their own applications and share them.
>
> Not using well-known vocabularies is bad for semantic-dependent processing e.g.
> for accessibility and search.

How so for accessibility? Don't accessibility applications already interpret CSS (e.g., that display:none means don't read)? If non-disabled users are unaware of the semantics, why should it bother disabled users (e.g., whether it is <p> or <foo>, it can still be display:block and that's all that will be noticed by most people).

Even for search of an unknown vocabulary, text can still be indexed and even tagging features potentially supported (e.g., let the user specify things in a generic way like inNamespace:"myNS" inelement:"foo" text:"bar").

>> This is also relevant
>> to XML languages which are already standards, like TEI or DocBook, as well as
>> new applications.
>
> TEI and DocBook aren't Web languages. XHTML5, SVG and MathML are. New
> applications of XML are pretty much always DTDless (outside the XHTML2 WG).

TEI and DocBook can already be used on the web, except when used with external DTDs in Firefox (and some other browsers, but not Explorer). They can be used with or without stylesheets that render them into X/HTML. Just because they are not, unaided by stylesheets, directly interpreted into a (non-tree) visual presentation (as you are apparently defining "web languages") does not mean they have no use being shared over the web. If you really don't like browsers supporting generic XML, you can always file a separate bug to try to get Firefox to remove its present support for pure/plain XML (a bug I will not be voting for!). But those of us who create XML documents and enjoy their availability in Firefox (as the numerous others voting for this bug apparently do) also want to be able to take advantage of DTDs!

>> And why do people need to make a false dichotomy between web and non-web?
>
> Non-Web doesn't burden browsers, so it isn't a concern when considering what
> browsers need to keep supporting for decades if not centuries to come.

Pure XML is pretty well supported in FF as it is. This is mostly about one missing feature, for which there is apparently already a working patch here. Since XML is well used off the web (in documents that may be around for decades or centuries to come), I still fail to see why these cannot be made supportable over the web.

To my mind, if I want to put an XML document over the web, it becomes a web document and if XML is supported in browsers, it already is a web language.

>> Let a hundred standards bloom!
>
> That goes against the very point of having a standard.

C'mon now. It only goes against the point of having a standard if the uses are identical (clearly I and others here are not talking about things like OOXML and ODF, assuming that is even a fair example). We're talking about the freedom for entirely new uses (admittedly more semantic than visual) to be devised. Otherwise you are left with the choice of:
1) Freezing the web entirely (except, say, to implement the rest of CSS 2.1)
2) Making one giant spec which encompasses XHTML, SVG, X3D, etc. etc. and every foreseeable general multi-purpose viewing scheme we could have, so we only have one standard.

And even #2 if it were possible would still not meet the specialized (semantic) interests of niche communities.

> (In reply to comment #105)
>> You could also deal with new DTDs by
>> implementing a (DTD) update service similar to the ones used for Firefox
>> add‐ons, browser updates, or Live Bookmarks.
>
> I think the cost/benefit ratio of introducing an update mechanism to support a
> sunsetting legacy feature of XML is unfavorable.

I think the above is an excellent idea, since while the DTD as a validating tool may be on its way out (though some may prefer its relative simplicity, at least compared to W3C XML Schema), there is at present no other way to define entities in an attached external schema (at least for XML as it now stands).
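To make the use case concrete (file names here are hypothetical, just a minimal sketch), an external DTD is currently the only attached-schema mechanism that can supply entity definitions by reference:

entities.dtd:

<!ENTITY copy  "&#169;">
<!ENTITY mdash "&#8212;">

doc.xml:

<?xml version="1.0"?>
<!DOCTYPE doc SYSTEM "entities.dtd">
<doc>&copy; 2008 Example Corp &mdash; all rights reserved</doc>

Neither W3C XML Schema nor RELAX NG can hand those &copy;/&mdash; definitions to the parser.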

>> The fact that the feature is optional is a problem for documents that must be
>> used in multiple browsers, but having the feature is certainly better than
>> nothing and implementing it may encourage other vendors to do the same.
>
> That's part of the problem. If other vendors implement this too, a new version
> of XML can't remove DTDs.

The XML spec, as it stands, requires internal document subsets to work, and such documents (with some exceptions like bug 267350) do already work in FF. Thus if internal document subsets were disallowed, those adding external DTDs to FF (if a patch were applied) would need to update later to stay current, but so would those who rely on the internal subset now. While a future XML spec might conceivably disallow external DTDs and merely deprecate internal document subsets (since it would be very odd to allow the latter perpetually but not the former), we're still talking about internal subsets eventually being dropped anyway (and thus backwards compatibility broken), no?
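For comparison, here is the internal-subset form of the same kind of thing, which already works in FF today (a minimal sketch):

<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ENTITY copy "&#169;">
]>
<doc>&copy; 2008 Example Corp</doc>

The entity machinery is thus already in the parser; this bug is essentially about letting those declarations live in a separate file fetched by the browser.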
I'd also like to make one more set of comments related to the TEI/DocBook discussion.  Let's say I come to a website which shares some XML files as XML. Why allow such direct access to these files if the user may only see them as trees or rendered the same as XHTML? Because one can potentially do specialized processing immediately on such XML over the web which DOES take advantage of rich semantics.

To take one example, my FF extension XqUSEme, to offer just one of my personal practical interests in this bug,
https://addons.mozilla.org/en-US/firefox/addon/5515 , lets you perform XQueries on XML (or HTML) files to which one navigates over the web (admittedly post-stylesheet processing only, at present, though I have plans to fix that). (For those who are not aware, XQuery lets you query XML, similarly to how SQL queries relational databases, but taking advantage of XML's more hierarchic nature.)

So, rather than having to load the document in an external program (more and more people talk about enjoying "living" in their browsers, esp. FF), you can immediately begin semantic-rich queries of such documents, such as to extract all of the unique salutations in a list of letters, sort a group of letters by date, find all uses of foreign words in a book, etc. (Or perform XSL: https://addons.mozilla.org/en-US/firefox/addon/5023 ). Queries might similarly be performed against web data stores; applications may be unwilling to grant public SQL access to let others take advantage of their live data, but XML output may be more feasible (at least if the data files are not extremely large).
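As a sketch of the kind of query I mean (the element names are hypothetical, and I assume the letters are gathered in one document):

for $l in doc("letters.xml")//letter
order by xs:date($l/@when)
return $l/salutation

or, for just the unique salutations:

distinct-values(doc("letters.xml")//letter/salutation)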

These kinds of XML applications cannot be used meaningfully and conveniently if they are only shared as downloads or are converted server-side into so-called "web languages".

I also have plans to make the above extensions work Greasemonkey-style, allowing one to apply XQueries or XSLT to XML files as they are loaded, thus giving users a chance to take advantage of numerous semantic hooks for styling purposes (hooks which can be harder to come by in non-semantically-rich "web languages", at least those whose document creators wish to avoid the extra bandwidth and trouble involved in providing numerous and consistent classes).

I offer the above to point out that, while historically people on the web have seen little use for semantic markup (beyond simple stylistic hooks, the somewhat controversial and undeniably clunky overloading of class attributes in microformats, and primitive metadata), especially given the paucity of semantic tags in XHTML (how common is it that people want to search for or extract <em> contents only, for example?), I see no good reason why such a divide must continue to exist between web and non-web, especially if this bug could be resolved.
The silence here is deafening.  Could we get a status update, please?  This is the single most infuriating/frustrating bug in Firefox for me as a user -- it's the _only_ thing that ever causes me to voluntarily launch IE.  _Please_ could we get Firefox in line with the XML spec on this one!
I should point out that I have experienced this bug in other browsers as well. I think I had it in Safari, Opera, and Google Chrome (although it may have only been in two of those browsers).
(In reply to comment #111)
> The silence here is deafening.  Could we get a status update, please?

There's not much to tell really, but if noise makes people happy: it's currently not a priority for me so I'll reassign to nobody.
There's a working patch. Probably needs updating, more documentation (see comment 91) and reviews.
Assignee: peterv → nobody
I'm wondering if there might be enough interest to take up a collection for this one, if we could pay for Peter's patch to be landed on trunk in short order. If this is kosher with Mozilla policies, and if we could get a cost estimate of what might be expected from him (or whoever else could legitimately deliver), I'd be happy to coordinate communication among any other potential donors watching this feature request about what we could chip in to make it happen (e.g., if people would only contribute once a certain number of others were giving a certain amount)... (I'm not a deep-pockets guy myself, but I figure we might be able to collect enough together...)
Also, I'd like to know what it would take (as I presume the patch doesn't solve this) to be able to handle XML with external DTD's within extension code (e.g., document.load, document.normalizeDocument/document.createEntityReference)...
I see Peter said, "Errors in the DTD or recursively loading the same entity do get reported and stop parsing", but I'm not clear on whether that addresses the "billion laughs" attack issue: e.g., http://www.ruby-lang.org/en/news/2008/08/23/dos-vulnerability-in-rexml/.
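For anyone unfamiliar with that attack, it is a handful of nested (but non-recursive) entity definitions whose expansion grows exponentially; a truncated sketch:

<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
  <!-- ...and so on up to lol9... -->
]>
<lolz>&lol9;</lolz>

Ten levels of tenfold expansion yields around 10^9 copies of "lol" from a tiny input, so a recursion check alone wouldn't catch it: nothing here is recursive, it's pure expansion.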
Poking at this again... The same-origin restriction is still going to be too limiting for some cases. I've been writing some IETF documents in XML and it would be nice to be able to proofread them (as local files) in the browser (using rfc2629.xslt). But, since I'm using the entity references on xml.resource.org all of my xref targets are broken since Mozilla doesn't load them. I suppose I could try downloading all of those references to my local machine, but then I'd have the burden of checking for updated revisions myself. That seems pretty silly when Mozilla already has code to handle locally cached copies of remote documents, along with automatically updating (using HTTP If-Modified-Since header, etc.) as needed.
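For the record, such a draft source looks roughly like this (a sketch from memory; the bibxml path is the scheme xml.resource.org uses for its reference files):

<?xml version="1.0"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
  <!ENTITY rfc2119 SYSTEM
    "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
]>
<rfc>
  <references title="Normative References">
    &rfc2119;
  </references>
</rfc>

Each such reference is an external entity, so with external loading disabled every one of them comes up empty.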

The whole web vs non-web discussion seems out of place; the implication is that if I save a local copy of a document I originally viewed on the web then it will break when I try to view the local copy. I've been writing HTML documents for years, and all of them work whether I view them on my local hard drive or when uploaded to my web servers. It would certainly break expectations if our XML documents weren't equally transportable.

As for the blocking / rendering issue, this doesn't seem much different than rendering a page with multiple frames, or loading a page with lots of embedded images and other objects. In that case, you still have to wait until everything is retrieved before you can complete the rendering.

Why is external entity retrieval more of a privacy risk than image retrieval?
Status: ASSIGNED → NEW
QA Contact: ashshbhatt → xml
Cross-posting on relevant bug pages:

For this bug, and a number of other associated bugs (Bug 204102, Bug 267350, Bug 22942, and to a lesser extent Bug 196355), I've started a pledge drive at http://pledgie.com/campaigns/7732 to try to hire a developer(s) who can work with the Mozilla devs (if they are ineligible themselves) to get these long-standing and niche but important-to-XML-users bugs fixed. Feel free to make a pledge to donate toward these fixes or, if you are a developer, make a bid in the comments there to offer to fix, in conjunction with Mozilla devs, this or any of the other aforementioned XML-related bugs/feature requests. 

(If we can get enough momentum, Bug 234485, Bug 98413, Bug 275196, and Bug 94270 might be also nice candidates to get addressed too, but I've started with the (single-point-of-failure-causing) DTD issues.)
Unsure if this is where I should post this...

Using Firefox 3.6.2.

When an internal subset is included in the DOCTYPE of an HTML page, I'm seeing a flaw. You can see the "]>" output to the screen because the internal subset isn't being loaded; the parser appears to stop at the first ">" it finds. Here is an example of the HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[
  <!ATTLIST img hsrc CDATA #IMPLIED >
]>

I added "hsrc" attribute to "img" here, but the parser is stopping at the ">" after the tag.   According to w3c (http://www.w3.org/TR/REC-xml/#dtd *NOTE: "'[' intSubset  ']'" is optional, but declared) what I have written should work, but doesn't appear to.

Not sure if this is the code responsible, but nsParser (https://hg.mozilla.org/mozilla-central/file/e9312d05488f/parser/htmlparser/src/nsParser.cpp), in the function "ParseDocTypeDecl", lines 1072 through 1091, appears to be the state engine that searches for the doctype. It ignores "[" entirely after reaching the point where it should be looking for one. The issue is probably not a small fix, though, because the content found would have to be parsed and used when the DTD is loaded at the end. Maybe I'm way off... Oh, here's the state engine in nsParser:

  1072   do {
  1073     theIndex = aBuffer.FindChar('<', theIndex);
  1074     if (theIndex == kNotFound) break;
  1075     PRUnichar nextChar = aBuffer.CharAt(theIndex+1);
  1076     if (nextChar == PRUnichar('!')) {
  1077       PRInt32 tmpIndex = theIndex + 2;
  1078       if (kNotFound !=
  1079           (theIndex=aBuffer.Find("DOCTYPE", PR_TRUE, tmpIndex, 0))) {
  1080         haveDoctype = PR_TRUE;
  1081         theIndex += 7; // skip "DOCTYPE"
  1082         break;
  1083       }
  1084       theIndex = ParsePS(aBuffer, tmpIndex);
  1085       theIndex = aBuffer.FindChar('>', theIndex);
  1086     } else if (nextChar == PRUnichar('?')) {
  1087       theIndex = aBuffer.FindChar('>', theIndex);
  1088     } else {
  1089       break;
  1090     }
  1091   } while (theIndex != kNotFound);
(In reply to comment #119)

Oh, I forgot the xml declaration at the top

<?xml version="1.1" encoding="utf-8"?>
(In reply to comment #119)

Oh, I forgot the xml declaration at the top

<?xml version="1.0" encoding="utf-8"?>
(In reply to comment #121)

Ok, not that anyone will do it, but it seems the relevant code is in the tokenizers (the big state machine like the one in https://hg.mozilla.org/mozilla-central/file/e9312d05488f/parser/html/javasrc/Tokenizer.java) for the pages. After the AFTER_DOCTYPE_NAME state there should be a search for "[", and then another state added, a "doctype internal subset" state or something, until the final "]" is found (though that might be a problem in itself, since the data in between would have to be parsed or else other problems might arise), and then a transition into BOGUS_DOCTYPE (because SYSTEM or PUBLIC should have already been handled) unless the ">" is found. Obviously whitespace is allowed after the "]". I suppose all that data could just go to an empty variable or be dumped.

As far as I can tell, the only browsers that handle this are Opera and Lynx (I think they catch and dump the subset), so it is obviously not a widespread problem. Plus, the declarations can be put into a local DTD.
(In reply to comment #119)
> Unsure if this is where I should post this...

It's most certainly not. Please file a separate bug.
Someone should change the title of the bug to "Load external DTDs *in XML*..." or something similar to avoid confusion. Also, I'm not sure that "if a pref is set" is really required, assuming a same-origin restriction is used in any patch for this bug.

@Joseph and Peter Van der Beken: This isn't a bug so I wouldn't waste time filing it. Referencing the XML specification when you are serving an XHTML document with the HTML MIME-type doesn't make any sense; when you serve it with an XML MIME-type, Firefox should exhibit the expected behavior.

Under an HTML MIME-type, the rules of the current HTML 4.01 or upcoming HTML5 specs would apply:

The construct you described is forbidden by HTML 4.01 Section 7.2, which only allows three very specific |DOCTYPE| declarations (none of which includes an internal DTD subset); the spec doesn't define error handling, so outputting the characters is a valid interpretation. (I'm a bit curious why the spec documents use |DOCTYPE| declarations that violate the rules of the spec, though.) This might have been considered an enhancement request against HTML 4.01 anyway, except that:

HTML5 (2010-03-24) specifically requires this behavior per Sections 10.2.4.67, 10.2.4.68, and 10.2.4.1: when |<| of |<!ATTLIST| is encountered, the document enters an error state in which |>| ends the |DOCTYPE| declaration; so, the |>| at the end of the |ATTLIST| declaration ends the *entire* |DOCTYPE| declaration. Then, |]>| still remains and is output as character data.
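To spell out what the tokenizer does with the markup from comment #119 in text/html: the "[" (or at latest the "<" of "<!ATTLIST") puts it into an error state, so the ">" at the end of

<!ATTLIST img hsrc CDATA #IMPLIED >

closes the entire DOCTYPE token, and the leftover "]>" is then tokenized as ordinary character data -- which is exactly the stray "]>" seen on screen.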
(In reply to comment #123)
> (In reply to comment #119)
> > Unsure if this is where I should post this...
> 
> It's most certainly not. Please file a separate bug.

No need to file a bug about the ]> in text/html. It's per spec. Authors must not use what looks like an internal subset in text/html.

As for this bug itself, I'm not the module owner but I think this should be WONTFIX even if a patch was paid for. Loading external DTDs in XML in Web content would tie Gecko to a complex legacy feature for little gain. In addition to having to implement the feature in the first place, having it would make it harder to maneuver in the area of XML support in the future (e.g. moving to off-the-main-thread XML parsing, moving to XML5 parsing if XML5 happens, or moving to another XML parser for performance reasons).
Note that since removal of remote XUL support is planned (bug 546857), adding support for external DTDs is less relevant.
> for little gain.

Sorry, but the gain is far from little. This bug is *the* main reason why remote XUL is unusable. And remote XUL was once proclaimed one of the cool features of Mozilla, and it would still be very useful for company-internal applications, XULRunner apps loading remote content, etc.

This has been one of the reasons why TomTom originally chose XULRunner, and the fact that this is not practically possible has been a major bummer for TomTom, who distributes a XULRunner app to about 10 million people.

In other words, this bug hurts the "XUL dark matter" tremendously.
And yes, I'm aware that you can't and shouldn't run privileged JS from remote XUL. The point of remote XUL would have been to tie our web offerings in naturally with the native application, both from the styling perspective (user POV: native looks, identical to the native app, so seamless integration) and the coding perspective (the same dev team can develop the remote and local parts with identical technology). As it is, you practically can't do that; you can't make a website look like a native app or XULRunner app, and there's a huge, visible gap between the native app and the web offering.
Also, right now the Text Encoding Initiative, which encodes classical and other important texts with very rich semantic markup and is used in universities around the world, is in the midst of applying for its own content type in the hope of seeing TEI documents directly available on the web in their native format. Many of these are large documents (a microformat version would add a lot of markup), and they can be prepared by teams using DTDs for convenient entity creation, since these documents require a lot of attention and benefit from DTDs' modularity (e.g., for conveniently adding obscure symbols; see the sketch below). Support for DTDs over the web could invite more universities to put these documents online, and applications could more readily be created for them.
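A sketch of what such a team's file might look like (the DTD location and the particular entity are hypothetical):

<?xml version="1.0"?>
<!DOCTYPE TEI SYSTEM "tei_all.dtd" [
  <!ENTITY stigma "&#x03DA;">
]>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>the numeral &stigma; appears in the manuscript</p></body></text>
</TEI>

except with the shared symbol declarations living in the external DTD rather than being repeated in every file's internal subset.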
The original target of this bug has been lost sight of, and overreaction has set in. The original bug, which is still with us, is that application/xml documents for which Mozilla today _does_ process both internal and external subsets do not get that processing when they are fetched from the local disk. For application/xml documents which reference stylesheets, the processing of the subsets is often important and regularly used. See for example http://www.w3.org/TR/2009/REC-sml-20090512/sml.xml, which works just fine today.

So, please be very careful to separate discussion of text/html processing from application/xml or application/xhtml+xml processing. The patch above for the latter would be a welcome change. Getting _rid_ of subset processing for the XML cases would be a disastrous regression.
Lest I be misunderstood, by 'patch above', I mean the patch submitted by Peter Van der Beken [:peterv] on 2007-06-26 07:31:53 PDT
Oops, my bad: the external subset doesn't _ever_ work for entities (or, I guess, IDness). We are still hung up waiting for Boris to explain his override on Peter's patch.
(In reply to comment #132)
> Oops, my bad, external subset doesn't _ever_ work for entities (or, I guess,
> IDness).  We are still hung waiting for Boris to explain his override on
> Peter's patch.

We're not waiting on Boris; he already reviewed the patch and rejected it due to lack of documentation. (Comment #91 and Comment #92) Peter Van der Beken, who wrote the patch, says that this is no longer a priority for him (Comment #113); so any waiting is actually on someone to write a new patch with documentation (or for a WONTFIX as Henri wants).
Flags: wanted-fennec1.0?
WONTFIXing per discussion at the All Hands in December 2010: bug 651049 is going to be on my plate; with off-the-main-thread parsing, our IO APIs and expat's external entity APIs don't go together nicely; and the XML spec left this feature optional precisely in order to allow Web browsers not to be burdened with external DTDs. (See Tim Bray's annotated XML spec.)
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX
This is a serious interop problem, please reconsider. Both Opera and Chrome (and therefore, I presume, Safari) process the external subset, so that e.g. access to the MathML entity declarations is straightforward for them. This decision is also a serious roadblock for Polyglot HTML5, since all those entities _are_ available for the HTML serialisation, but, w/o external subset processing, will _not_ be available for the XHTML serialisation.

Maintaining an entity stack is already part of expat, and you need it for processing the _internal_ subset.  Why isn't Van der Beken's patch usable?
Further to Comment #136 here's an example of existing content on the web which works in Opera and Chrome but not in Firefox:

 http://www.w3.org/2001/tag/doc/metaDataInURI-31-20070102.xml

Don't Break the Web :-)
(In reply to Henry S. Thompson from comment #136)
> Both Opera and Chrome (and, therefore, I presume, Safari) process the external 
> subset

Can you point me to verifiable evidence, please?

http://hsivonen.iki.fi/test/moz/external-subset.xml shows that neither Opera (with default settings) nor Chrome load the external subset even from the same origin. (Opera has a pref for loading the external subset, but the pref is off by default.)
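The shape of such a test is simple (a minimal sketch, not necessarily the exact file contents):

external-subset.xml:

<?xml version="1.0"?>
<!DOCTYPE root SYSTEM "external-subset.dtd">
<root>&fromExternalSubset;</root>

external-subset.dtd:

<!ENTITY fromExternalSubset "external subset was loaded">

A browser that loads the external subset renders the entity's replacement text; one that doesn't shows an undefined-entity error or nothing.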

> This decision is also a serious roadblock for Polyglot HTML5, since all
> those entities _are_ available for the HTML serialisation, but, w/o external
> subset processing, will _not_ be available for the XHTML serialisation.

They are available to the XHTML serialization if you use one of the special public ids in the doctype. E.g.
<!DOCTYPE html PUBLIC
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
    "http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd">

Also, the various hacks for making the well-known set of entities available are a different issue from actually performing network IO to fetch external subsets.

> Maintaining an entity stack is already part of expat, and you need it for
> processing the _internal_ subset.  Why isn't Van der Beken's patch usable?

As I said in comment 135, the plan is to move XML parsing off the main thread and expat's external entity API and our network IO APIs don't work nicely together in that case.

(In reply to Henry S. Thompson from comment #137)
> Further to Comment #136 here's an example of existing content on the web
> which works in Opera and Chrome but not in Firefox:
> 
>  http://www.w3.org/2001/tag/doc/metaDataInURI-31-20070102.xml

This is a different issue: what happens once an external entity isn't fetched from the network. Also, it doesn't "work" in Opera. Look for the word "December" in Opera: with default settings, you see the literal text "&nbsp;" (ampersand, the letters nbsp, and a semicolon) on both sides of the word December instead of a non-breaking space.