Closed Bug 603716 (xml-entity-defs) Opened 14 years ago Closed 13 years ago

[MathML3] Update XML Entity Definitions for Characters

Categories

(Core :: MathML, defect)

defect
Not set
normal

Tracking

RESOLVED FIXED
mozilla5

People

(Reporter: fredw, Assigned: fredw)

References

(Blocks 1 open bug)

Attachments

(2 files, 4 obsolete files)

The recommendation entitled "XML Entity Definitions for Characters" contains HTML and MathML entity definitions that we should probably use instead of the list generated from the XHTML+MathML+SVG DTD:

content/mathml/content/src/mathml.dtd

David Carlisle mentioned this file:

http://www.w3.org/2003/entities/2007/htmlmathml-f.ent

The REC also points to this file, which seems to have more entities (though I guess the extra ones are not used in HTML/MathML):

http://www.w3.org/2003/entities/2007/w3centities-f.ent

We have also made some modifications to content/mathml/content/src/mathml.dtd that, I suppose, must be preserved.
(In reply to comment #0)
> We have also made some modifications to content/mathml/content/src/mathml.dtd
> that, I suppose, must be preserved.

If you are referring to http://hg.mozilla.org/try/rev/a8807ac28d1e
then that may have been a mistake, and I think I'm happy to revert.
(See bug 289938 comment 16 and subsequent.)
I was actually thinking more generally of the modifications put at the top of the file (see attachment 483194 [details]). I've made a diff of our current version and the one furnished by David Carlisle (attachment 483195 [details] [diff] [review]). Some of our modifications are now part of htmlmathml-f.ent, but others, such as imath, varphi, etc., are not (I haven't carefully followed the discussion on the MathWG's list about these entity refs, though).

For the entities from the group HTML5-UPPERCASE, I suppose we are already taking them into account elsewhere in our parser, so that we don't need to add them in mathml.dtd?

BTW, I've seen that Opera now supports MathML entity references:
http://my.opera.com/mathml/forums/topic.dml?id=213671
Attachment #483194 - Attachment is patch: false
(In reply to comment #4)
> I was actually thinking more generally of the modifications put at the top of
> the file (see attachment 483194 [details]). I've made a diff of our current version and
> the one furnished by David Carlisle (attachment 483195 [details] [diff] [review]). Some of our
> modifications are now part of htmlmathml-f.ent, but others, such as imath, varphi,
> etc., are not (I haven't carefully followed the discussion on the MathWG's
> list about these entity refs, though).

See attachment 295889 [details] if you are interested in the reasons for those modifications.  I raised the issues on the www-mathml list.  As you note, some were accepted but others were not.  We should just follow the recommended list now.

> For the entities from the group HTML5-UPPERCASE, I suppose we are already
> taking them into account elsewhere in our parser, so that we don't need to add
> them in mathml.dtd?

I assume so, but I don't know the details about in which particular document types they are or should be included.

I also don't know in which particular situations htmlmathml-f.ent or w3centities-f.ent should be used.
(In reply to comment #5)

> > For the entities from the group HTML5-UPPERCASE, I suppose we are already
> > taking them into account elsewhere in our parser, so that we don't need to add
> > them in mathml.dtd?
> 
> I assume so, but I don't know the details about in which particular document
> types they are or should be included.

They aren't handled by the XML parser anywhere. They are handled by the HTML5 parser, but the HTML5 parser includes the entity definitions on its own without using mathml.dtd.

I don't see a strong practical reason not to support them on the XML side as well.

> I also don't know in which particular situations htmlmathml-f.ent or
> w3centities-f.ent should be used.

http://www.w3.org/2003/entities/2007/htmlmathml-f.ent corresponds to what is supported by HTML5 for text/html, so I think it makes sense to put http://www.w3.org/2003/entities/2007/htmlmathml-f.ent in mathml.dtd. Supporting w3centities-f.ent would suck, because we'd end up supporting more and more entities without a compat motivator.

Personally, I'd be inclined to try to get rid of xhtml11.dtd and use mathml.dtd instead whenever xhtml11.dtd is currently used. That's a bit radical, though. If you don't feel that radical, yet, please at least update the mappings for &lang; and &rang; in xhtml11.dtd, too.
Attached patch Patch V1 (obsolete) — Splinter Review
David Carlisle confirmed that w3centities-f.ent is enough:
http://lists.w3.org/Archives/Public/www-math/2010Oct/0019.html

The patch removes the DTDs for XHTML and MathML and adds w3centities-f.ent instead. I wonder if I should modify w3centities-f.ent to include a Mozilla License Boilerplate. Also, it seems that nsCatalogData->mLocalDTD is no longer needed, but I'm not sure whether it should be removed.
Assignee: nobody → fred.wang
Status: NEW → ASSIGNED
(In reply to comment #7)
> The patch removes the DTDs for XHTML and MathML and adds w3centities-f.ent
> instead. I wonder if I should modify w3centities-f.ent to include a Mozilla
> License Boilerplate.

I don't think we should add Mozilla license to that because it is not a Mozilla file.  We already have a number of mochitests under that licence.
I don't know whether it makes a difference that it is distributed with a binary, so asking Gerv to comment on that.
https://bugzilla.mozilla.org/attachment.cgi?id=488674&action=diff#a/content/xml/content/src/htmlmathml-f.ent_sec1
Frédéric: please use http://www.mozilla.org/MPL/license-policy-flowchart.png and tell me where you end up :-) (This is a new image, and so you can beta test its clarity for us :-)

Gerv
The file has been written by David Carlisle from the MathML WG and is already publicly available on www.w3.org. So at first I thought "Use the license it already uses", but I was not sure what is meant by "repository" and whether it includes something like the W3C site. After reading license-policy, I understand that it means "mozilla.org repository", so in the present case I must discuss with licensing@mozilla.org, and the code must be tri-licensed or compatible (which I think is the case for the W3C's license).

It would be nice if the flowchart was an SVG image and in particular if the references at the top were actual links.
Frédéric: that's great feedback, thanks :-) The correct answer is that it should keep its existing licence and, if that licence has requirements e.g. about "including a copy in the documentation", we need to do that. I will get the flowchart improved.

Can you post a copy of the licence text here, please?

Gerv
The file is distributed under the W3C Software Notice and License (http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231). Apparently it is not in about:license yet.

W3C Software Notice and License

This work (and included software, documentation such as READMEs, or other related items) is being provided by the copyright holders under the following license.

License

By obtaining, using and/or copying this work, you (the licensee) agree that you have read, understood, and will comply with the following terms and conditions.

Permission to copy, modify, and distribute this software and its documentation, with or without modification, for any purpose and without fee or royalty is hereby granted, provided that you include the following on ALL copies of the software and documentation or portions thereof, including modifications:

    The full text of this NOTICE in a location viewable to users of the redistributed or derivative work.
    Any pre-existing intellectual property disclaimers, notices, or terms and conditions. If none exist, the W3C Software Short Notice should be included (hypertext is preferred, text is permitted) within the body of any redistributed or derivative code.
    Notice of any changes or modifications to the files, including the date changes were made. (We recommend you provide URIs to the location from which the code is derived.)

Disclaimers

THIS SOFTWARE AND DOCUMENTATION IS PROVIDED "AS IS," AND COPYRIGHT HOLDERS MAKE NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE SOFTWARE OR DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.

COPYRIGHT HOLDERS WILL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF ANY USE OF THE SOFTWARE OR DOCUMENTATION.

The name and trademarks of copyright holders may NOT be used in advertising or publicity pertaining to the software without specific, written prior permission. Title to copyright in this software and any associated documentation will at all times remain with copyright holders.
The file also contains the following information:

"Some entity names in this file are derived from files carrying the
 following notices:

     (C) International Organization for Standardization 1986,1991
     Permission to copy in any form is granted for use with
     conforming SGML systems and applications as defined in
     ISO 8879, provided this notice is included in all copies."
Urk. The W3C licence is fine, but that latter clause is not open source if you only have permission to copy.

However, it relates to only the entity _names_. Can you process the file to remove the comments giving the entity names, rename it (perhaps to htmlmathml-f-stripped.ent) and add a comment at the top explaining what has been done, and referencing the original file?

Then, you can remove the "entity names" paragraph from the licence copy in the file. And I will add the W3C licence to about:license.

Gerv
The comments contain the Unicode character names (not the entity names). The comments can be stripped, but it won't help AFAICT.

What we want from the file is the mappings between entity names and Unicode code points.
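For illustration, here is a minimal Python sketch of extracting that mapping from declarations written in the style of htmlmathml-f.ent. The sample lines are copied from declarations quoted elsewhere in this bug; the function names and the regex-based approach are assumptions made for this sketch and are not how Gecko reads the file. It expands character references twice, once as an XML processor does when reading the declaration and once as it does when the entity is referenced, which is a simplification that holds for values made only of character references.

```python
import re

# Matches general entity declarations such as: <!ENTITY lt "&#38;#60;" >
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\S+)\s+"([^"]*)"\s*>')
# Matches decimal (&#38;) and hexadecimal (&#x027E8;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def expand_char_refs(text):
    """Replace every character reference in `text` with its character."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        return chr(int(hex_digits, 16) if hex_digits else int(dec_digits))
    return CHAR_REF.sub(replace, text)


def entity_map(ent_text):
    """Return {entity name: expanded string} for the declarations in ent_text."""
    mapping = {}
    for name, literal in ENTITY_DECL.findall(ent_text):
        # First pass: expansion done when the declaration is read.
        # Second pass: expansion done when the entity is referenced.
        value = expand_char_refs(expand_char_refs(literal))
        mapping.setdefault(name, value)  # in XML the first declaration wins
    return mapping


SAMPLE = '''
<!ENTITY lt               "&#38;#60;" ><!--LESS-THAN SIGN -->
<!ENTITY amp              "&#38;#38;" ><!--AMPERSAND -->
<!ENTITY nvlt             "&#38;#x0003C;&#x020D2;" ><!--LESS-THAN SIGN with vertical line -->
<!ENTITY lang             "&#x027E8;" ><!--MATHEMATICAL LEFT ANGLE BRACKET -->
'''

for name, value in entity_map(SAMPLE).items():
    print(name, " ".join(f"U+{ord(ch):04X}" for ch in value))
```

The comments after each declaration are ignored by the pattern, so stripping them from the file would not change the extracted mapping.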

Is permission required to copy individual words from a document?
I wonder whether HTML is even "conforming SGML".
FYI, I've raised the issue on the MathML list:
http://lists.w3.org/Archives/Public/www-math/2010Nov/0005.html
As I noted in my reply on list at

http://lists.w3.org/Archives/Public/www-math/2010Nov/0006.html

I don't think that that comment is a restriction, just a statement of fact that some of the entity names come from ISO. My build script puts in that comment if any of the names in the file built come from the ISO entity sets, so it would do that even if it only listed the html entities such as nbsp. HTML1 (or was it 2 that added these) may not have so explicitly acknowledged the fact, but the names certainly came from the ISO sets. I can (probably) reword that comment to look less like a licence restriction.

I note that this bug is against the mathml component but I think (since this exact set of entities is used for html5) that it would make sense for this entity set to be used for any xhtml/svg/mathml document, which would give the most compatibility between the xhtml and html sides of the world. Formally as an XML parser this is equivalent to saying you are using an xml catalog that substitutes any dtd specified for this, and uses this if no dtd is specified.
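The catalog behaviour described in the previous paragraph can be sketched with Python's Expat bindings. This is only an illustration under stated assumptions: LOCAL_ENTITY_SET is a three-entry stand-in for the real entity file, parse_text is a hypothetical helper, and none of this is Gecko's actual nsExpatDriver code. The handler ignores whatever public and system identifiers the document supplies and feeds the local declarations instead, and UseForeignDTD makes Expat ask for a DTD even when the document has no DOCTYPE.

```python
import xml.parsers.expat as expat

# Hypothetical stand-in for the bundled entity set (only three names listed).
LOCAL_ENTITY_SET = (
    '<!ENTITY nbsp "&#x000A0;">'
    '<!ENTITY lang "&#x027E8;">'
    '<!ENTITY rang "&#x027E9;">'
)


def parse_text(document):
    """Return the character data of `document`, resolving entities locally."""
    chunks = []
    parser = expat.ParserCreate()
    # Call the external entity handler even if there is no DOCTYPE at all.
    parser.UseForeignDTD(True)
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def substitute_dtd(context, base, system_id, public_id):
        # Called for the external DTD subset. The identifiers are ignored:
        # every document gets the same local declarations, with no network
        # access. (External general entities are not handled in this sketch.)
        subparser = parser.ExternalEntityParserCreate(context)
        subparser.Parse(LOCAL_ENTITY_SET, True)
        return 1

    parser.ExternalEntityRefHandler = substitute_dtd
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)


with_doctype = (
    '<!DOCTYPE math PUBLIC "-//W3C//DTD MathML 2.0//EN" '
    '"http://www.w3.org/Math/DTD/mathml2/mathml2.dtd">'
    "<math>&lang;x&rang;</math>"
)
print([hex(ord(ch)) for ch in parse_text(with_doctype)])
print([hex(ord(ch)) for ch in parse_text("<p>a&nbsp;b</p>")])
```

Because the same declarations are used whichever DTD is named, an XHTML, SVG, or MathML document would see one entity set, which is the compatibility argument made above.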


Finally while I note that you say the W3C licence is OK, I have no particular wish to make browser distributions carry yet another licence text. I put the W3C licence on as I wanted to make sure it carried some open source licence, and that one seemed appropriate since it's distributed from the W3C, but I can't see any problem in saying that it is dual (or more) licenced and that it may be used under (say) either W3C or GPL. I won't change anything about the licence until the ISO reference position is clarified, but let me know if that would be useful (Otherwise I probably wouldn't do it)
If the entity names were copied from a file with a copyright notice, then we have a copyright issue, unless we can show the information is not copyrightable.

CCing Luis for his view, as I'm not around much for the next week.

Regarding the license text, the only license which would not require us to add more text to about:license (which we are very happy to do) is the MIT licence. All BSD licences require it. Or you could use the MPL/LGPL/GPL triple-licence, but I suspect you have other users who want a BSD-style permissive licence.

Gerv
(In reply to comment #18)
> If the entity names were copied from a file with a copyright notice, then we
> have a copyright issue, unless we can show the information is not
> copyrightable.

They were not mechanically copied from anywhere, but it's undoubtedly the case that most of the entity names used in html and mathml originated in ISO publications. The original ISO files that have the comment quoted are unusable in any relevant systems; they just list the names with no mapping to unicode (the mapping to Unicode, while of course not totally dissimilar to other attempts, is I would say definitely a new original work in the xml entities spec).

But the list of names comes at least in part (including most of the html ones) from ISO 8879 which is of course copyright ISO, but does the fact that ISO specified an entity name of eacute in ISOLAT1 mean that HTML should not be using that? Even though the important detail about eacute, that it be mapped to U+00E9 was definitely not specified by ISO.

All the names have been in use in mathml in public documents since 1998 and at one point I was explicitly coopted to the relevant ISO working group to update ISO9573 to match the W3C entity definitions but unfortunately the ISO side of that work stalled, so we are where we are. Technically it's most accurate to say the sources in question were copied from the mathml2 sources (since they were) but that I suppose leaves open the question of whether mathml and html are clearly able to use those names.
Hi, David- is it correct to say that all these entities are also part of http://www.w3.org/TR/xml-entity-names/ ?
(In reply to comment #20)
> Hi, David- is it correct to say that all these entities are also part of
> http://www.w3.org/TR/xml-entity-names/ ?

oh yes, the entity file in question is made from the same source, by the same script as that TR spec. The list of entity names in html5 is also generated from the same source (using different tools, I don't think Ian Hickson is so fond of xslt). The entity files in MathML2 and MathML1 going back to 1998 were also generated by earlier incarnations of essentially the same sources.
there were two versions of the comment under question.

In files that are explicitly mimicking the original ISO sets (but in XML form with Unicode mappings) I used

     Entity names in this file are derived from files carrying the
     following notice:

     (C) International Organization for Standardization 1991
     Permission to copy in any form is granted for use with
     conforming SGML systems and applications as defined in
     ISO 8879, provided this notice is included in all copies.

see for example

http://www.w3.org/2003/entities/2007/isoamsa.ent


However in the combined files (in particular the htmlmathml-f.ent that it is suggested be used by browsers) the wording was

     Some entity names in this file are derived from files carrying the
     following notices:

     (C) International Organization for Standardization 1986,1991
     Permission to copy in any form is granted for use with
     conforming SGML systems and applications as defined in
     ISO 8879, provided this notice is included in all copies.


This doesn't seem very useful, since "some entity names" gives no indication of which ones, and it's causing concern, so I have removed the comment. In no sense is this file a copy of any previous file.

The only question is I think, irrespective of what comments I put on these files, whether HTML systems are supposed to be using entity names like eacute and nbsp derived from ISO. I am sure (from my dealings with the relevant ISO working group) that it certainly is not their intention to restrict such use, but I leave it for lawyers to decide if their intentions and the details of their copyright notices are in alignment.


It's a bit hard to find an original source for the ISO sets on the public web but there are copies at for example

http://www.opensource.apple.com/source/samba/samba-23/docs/docbook/dbsgml/ent/ISOnum

which is a copy of the original ISO file defining nbsp,

the current files are not copied from these SGML files in any sense other than the fact that the same entity names are used (but with different definitions)
Thanks for the clarification, David. I have to check with some other folks here to see what the best next step is; hopefully it'll happen today but it might not happen until Monday.
Has there been any progress on these legal issues?
Alias: xml-entity-defs
Version: unspecified → Trunk
Luis: ping?

Gerv
Still no update from Mozilla or W3C/ISO on that issue?
Mozilla's update was handled in a separate bug; I don't have the bug number handy but Henri should be able to find it. Sorry for not updating this one.
(In reply to comment #27)
> Mozilla's update was handled in a separate bug; I don't have the bug number
> handy but Henri should be able to find it. Sorry for not updating this one.

I think you mean bug 619497.
(In reply to comment #27)
> Mozilla's update was handled in a separate bug; I don't have the bug number
> handy but Henri should be able to find it. Sorry for not updating this one.

This bug was about updating the dtd for entity handling in the xml parser, the other bug is (if I read it right) about the text/html parser. Does this mean that the entity files used for xhtml/mathml may be (or have been) similarly updated?

If it makes the licensing situation easier, all the relevant data may be extracted from the whatwg html spec under the whatwg license.
Yes, you can use the same techniques and resulting licensing terms used in that bug to update any other lists from the same sources.

Is that what you wanted to know?

Gerv
(In reply to comment #30)

> Is that what you wanted to know?
> 

I think so, yes, thanks. I think this means I don't need to do anything and the maintainers of the mozilla mathml project can use a set of html/mathml entities under the whatwg licence. Or, if they prefer, I could make a version of the file listed in the original comment of this bug with an explicit whatwg licence (instead of, or as well as, the w3c licence it has at present).
My intention was to copy the official htmlmathml-f.ent directly from the W3C. Since I think we don't have the W3C license text listed in about:license, maybe the most convenient option would be for David to use the WHATWG license instead?
(In reply to comment #32)
> My intention was to copy the official htmlmathml-f.ent directly from the W3C.
> Since I think we don't have the W3C license text listed in about:license, maybe
> the most convenient option would be for David to use the WHATWG license instead?


I'll do that. As I stated in an earlier comment, I have no particular wish to force any particular open source license. I put the w3c one on to make sure that it had some licence, but am happy for any generally accepted licence to be used.

Simplest for me is if I make the "standard" file say explicitly that it may be distributed under either the w3c or whatwg licence. But if a single, dual-licenced file is difficult for you, I'll make two files available from the same directory.

David
 > I'll do that.

done.

cvs commit -m "w3c or whatwg licence" 2007 2007xml


http://www.w3.org/2003/entities/2007/htmlmathml-f.ent
Attached patch Patch V2 (obsolete) — Splinter Review
Thanks David.

Here is a new version of the patch with the updated entity file.
Note: I've seen for example that the torture test uses ϕ. Thus we need to remember to check if our MathML demos contain entity names that have changed...
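A check of that kind could be scripted along the following lines. This is a hedged Python sketch: the two dictionaries are tiny hand-written samples (the lang/rang values are the old and new mappings from the DTD and the new entity file), and the function names are made up for the example. A real run would need the full old and new mappings.

```python
import re

# Matches named entity references such as &lang; (numeric references excluded).
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def changed_names(old_map, new_map):
    """Names whose expansion differs in new_map, or which new_map lacks."""
    return {name for name, value in old_map.items() if new_map.get(name) != value}


def affected_references(document_text, old_map, new_map):
    """Sorted entity names used in document_text whose expansion changed."""
    changed = changed_names(old_map, new_map)
    used = {match.group(1) for match in ENTITY_REF.finditer(document_text)}
    return sorted(used & changed)


# Sample data only: old values from the previous DTD, new values from the new file.
OLD = {"lang": "\u2329", "rang": "\u232A", "nbsp": "\u00A0"}
NEW = {"lang": "\u27E8", "rang": "\u27E9", "nbsp": "\u00A0"}

demo = "<mo>&lang;</mo><mi>x</mi><mo>&rang;</mo>&nbsp;"
print(affected_references(demo, OLD, NEW))  # ['lang', 'rang']
```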
Attachment #488674 - Attachment is obsolete: true
Attachment #508145 - Flags: review?(karlt)
Comment on attachment 508145 [details] [diff] [review]
Patch V2

Requesting sr for the "radical" changes suggested by Henri in comment 6.
Attachment #508145 - Flags: superreview?(jonas)
Comment on attachment 508145 [details] [diff] [review]
Patch V2

AIUI, htmlmathml-f.ent is available under either of two licences:

1. The W3C license, which requires reproduction of a notice.

2. The WHATWG license, which requires nothing.

I gather we are essentially choosing the WHATWG license by not presenting any
notices in about:license.  That seems fine to me.

The ISO notice is no longer present, so that is no longer an issue.

Thank you, David.  That makes things easy for us.

>--- a/browser/installer/removed-files.in
>+++ b/browser/installer/removed-files.in
>@@ -838,18 +838,17 @@ xpicleanup@BIN_SUFFIX@
>   #ifdef XP_WIN
>     modules/WindowsJumpLists.jsm
>     modules/WindowsPreviewPerTab.jsm
>   #endif
>   modules/XPCOMUtils.jsm
>   modules/XPIProvider.jsm
>   res/contenteditable.css
>   res/designmode.css
>-  res/dtd/mathml.dtd
>-  res/dtd/xhtml11.dtd
>+  res/dtd/htmlmathml-f.ent

Looking at the history of this file, it seems that it is what is used by the
new version being installed rather than the old version being removed.

i.e. This should contain a list of all files that ever existed in an old
version but are not in the new version.

mathml.dtd and xhtml11.dtd are here because they are now part of omnijar
rather than individual files.

No changes to removed-files.in should be made in this patch.
r=me on the rest.
Attachment #508145 - Flags: review?(karlt) → review+
Attachment #508145 - Flags: superreview?(jonas) → superreview+
Attachment #508145 - Attachment is obsolete: true
Depends on: post2.0
David, AMP is defined "&#38;#38;" in your file. Shouldn't it be just "&#38;"?
these entities too:

<!ENTITY lt               "&#38;#60;" ><!--LESS-THAN SIGN -->
<!ENTITY LT               "&#38;#60;" ><!--LESS-THAN SIGN -->
<!ENTITY amp              "&#38;#38;" ><!--AMPERSAND -->
<!ENTITY nvlt             "&#38;#x0003C;&#x020D2;" ><!--LESS-THAN SIGN with vertical line -->
(In reply to comment #39)
> David, AMP is defined "&#38;#38;" in your file. Shouldn't it be just "&#38;"?

no, an entity definition of just "&#38;" is not well formed; see the definitions of amp and friends in the xml spec at

http://www.w3.org/TR/xml/#sec-predefined-ent
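The difference can be observed with any conforming XML parser. The following Python sketch (the helper name expand and the wrapper document are assumptions for illustration) declares AMP in an internal DTD subset both ways. With "&#38;#38;" the replacement text stored for the entity is "&#38;", which is re-parsed to "&" when the entity is referenced. With "&#38;" the stored replacement text is a bare "&", which is not well formed once it is re-parsed as content.

```python
import xml.etree.ElementTree as ET


def expand(literal):
    """Parse a small document whose AMP entity is declared as `literal`."""
    doc = f'<!DOCTYPE r [<!ENTITY AMP "{literal}">]><r>a&AMP;b</r>'
    return ET.fromstring(doc).text


# Double-escaped form, as used in htmlmathml-f.ent: expands to a single "&".
print(repr(expand("&#38;#38;")))

# Single-escaped form: the entity's replacement text is a bare "&".
try:
    expand("&#38;")
except ET.ParseError as error:
    print("rejected by the parser:", error)
```

The same reasoning applies to lt, LT, and nvlt, whose expansions contain "<".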
> no, an entity definition of just "&#38;" is not well formed; see the definitions
> of amp and friends in the xml spec at
> 
> http://www.w3.org/TR/xml/#sec-predefined-ent

OK, thanks.
I suspect this patch triggered a failure in test_bug422403-2.xhtml

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1300955124.1300957715.18509.gz
No longer depends on: post2.0
Yes, I can confirm it triggers a reftest failure.
I think it is due to &rang;/&lang; (HTML/MathML did not use the same characters for these entities).

>-<!ENTITY lang "&#9001;">
>-<!ENTITY rang "&#9002;">
>+<!ENTITY lang             "&#x027E8;" ><!--MATHEMATICAL LEFT ANGLE BRACKET -->
>+<!ENTITY rang             "&#x027E9;" ><!--MATHEMATICAL RIGHT ANGLE BRACKET -->

I guess we should update file_xhtmlserializer_2*.xhtml as well as 
parser/htmlparser/src/nsHTMLEntityList.h.
(In reply to comment #44)
> Yes, I can confirm it triggers a reftest failure.
> I think it is due to &rang;/&lang; (HTML/MathML did not use the same characters

Yes. Unfortunately, Unicode added a canonical decomposition to these characters, unifying them with U+3008 and U+3009 (intended for CJK punctuation), which meant that the old expansions of rang and lang were not in NFC form. Unicode 3.2 therefore added new characters, essentially the same as the old ones but without the unfortunate canonical decomposition. To follow the policy that entities for use on the web should expand to NFC form, rang and lang were changed (after some discussion on the www-math and public-html lists) to use the new characters.
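The NFC point can be checked with Python's unicodedata module. In this small sketch the names OLD, NEW, and is_nfc are chosen for the example; the code points are the old (&#9001;, &#9002;) and new (&#x027E8;, &#x027E9;) expansions quoted in the diff above.

```python
import unicodedata

OLD = {"lang": "\u2329", "rang": "\u232A"}  # &#9001; and &#9002;
NEW = {"lang": "\u27E8", "rang": "\u27E9"}  # &#x027E8; and &#x027E9;


def is_nfc(text):
    """True if NFC normalization leaves `text` unchanged."""
    return unicodedata.normalize("NFC", text) == text


for name in ("lang", "rang"):
    old, new = OLD[name], NEW[name]
    normalized = unicodedata.normalize("NFC", old)
    print(
        f"{name}: old U+{ord(old):04X} normalizes to U+{ord(normalized):04X}; "
        f"new U+{ord(new):04X} is NFC: {is_nfc(new)}"
    )
```

A serializer or test that normalizes its output would therefore silently turn the old expansions into the CJK brackets, while the new ones pass through unchanged.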
BTW, it seems that there is an error in the failing test: it tests the same page, file_xhtmlserializer_2_basic.xhtml, four times instead of testing each of the file_xhtmlserializer_2_*.xhtml files.

http://mxr.mozilla.org/mozilla-central/source/content/base/test/test_bug422403-2.xhtml#61
Attached patch Patch rang/lang — Splinter Review
Finally, I don't really understand the purpose of the other file_xhtmlserializer_2_*.xhtml files. I think the expected result is always file_xhtmlserializer_2_basic.xhtml (with all entities expanded). I've only modified the characters for lang/rang and the file parser/htmlparser/src/nsHTMLEntityList.h.
Attachment #522310 - Flags: review?(karlt)
(In reply to comment #47)
> I don't really understand the purpose of the other
> file_xhtmlserializer_2_*.xhtml files. I think the expected result is always
> file_xhtmlserializer_2_basic.xhtml (with all entities expanded).

Looks like those other files have been unused since bug 545644.
Comment on attachment 522310 [details] [diff] [review]
Patch rang/lang

This looks right to me.
Asking Jonas to confirm that there is no reason why nsHTMLEntityList.h should be different from what nsExpatDriver.cpp uses.
Attachment #522310 - Flags: superreview?(jonas)
Attachment #522310 - Flags: review?(karlt)
Attachment #522310 - Flags: review+
Comment on attachment 522310 [details] [diff] [review]
Patch rang/lang

This is really dead code at this point. The entity list that the HTML5 parser uses lives in parser/html/nsHtml5NamedCharactersInclude.h

However I can't find where that file is generated from. Henri would know.
Attachment #522310 - Flags: superreview?(jonas) → review?(hsivonen)
Comment on attachment 522310 [details] [diff] [review]
Patch rang/lang

(In reply to comment #51)
> Comment on attachment 522310 [details] [diff] [review]
> Patch rang/lang
> 
> This is really dead code at this point.

Unfortunately, it isn't. Even if you count the old parser dead for parsing (even though it isn't quite completely dead just yet), nsParserService exposes this entity list to nsHTMLContentSerializer. I believe the rang and lang mappings aren't exposed to Web content though. That is, the serializer configurations that use them are only available to chrome-privileged code, AFAICT.

The serializers can also use other tables. I filed bug 648491 about the tables in http://mxr.mozilla.org/mozilla-central/source/intl/unicharutil/tables/

> The entity list that the HTML5 parser
> uses lives in parser/html/nsHtml5NamedCharactersInclude.h
> 
> However I can't find where that file is generated from. Henri would know.

It is generated by scraping the tables in the WHATWG HTML spec. The generator lives in http://hg.mozilla.org/projects/htmlparser/file/1f633cef7de7/translator-src/nu/validator/htmlparser/generator/GenerateNamedCharactersCpp.java
Attachment #522310 - Flags: review?(hsivonen) → review+
Attachment #483194 - Attachment is obsolete: true
Attachment #483195 - Attachment is obsolete: true
Keywords: checkin-needed
Pushed to m-c:
http://hg.mozilla.org/mozilla-central/rev/1a9a58693f6f
http://hg.mozilla.org/mozilla-central/rev/ca93335759fc
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Thanks!
Keywords: checkin-needed
Target Milestone: --- → mozilla2.2
Blocks: 1223829