Bug 289938 - Should use real astral chars (not PUA) for math chars outside the Basic Multilingual Plane
Status: RESOLVED FIXED
Whiteboard: [swag:3d]
Product: Core
Classification: Components
Component: MathML
Version: Trunk
Platform: All All
Importance: P2 normal with 3 votes
Target Milestone: mozilla1.9beta5
Assigned To: Karl Tomlinson (:karlt)
QA Contact: Hixie (not reading bugmail)
Depends on: 413115
Blocks: 321438 324857 cambria-math 400938 asana-math
Reported: 2005-04-11 11:16 PDT by Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26)
Modified: 2010-10-17 12:39 PDT
roc: blocking1.9-


Attachments
script used to update our entity list (6.22 KB, text/plain)
2008-01-07 20:07 PST, Karl Tomlinson (:karlt)
mathml.dtd patch [checked-in] (125.73 KB, patch)
2008-01-07 20:12 PST, Karl Tomlinson (:karlt)
pavlov: review+
entity changes in sorted pseudo-unified-diff format (23.96 KB, text/plain)
2008-01-07 20:40 PST, Karl Tomlinson (:karlt)
include &#x200B; in short arrow entities (1.84 KB, patch)
2008-01-09 20:17 PST, Karl Tomlinson (:karlt)
include ZWSP in short arrow entities (including slarr and srarr) [checked in] (1.85 KB, patch)
2008-01-09 20:28 PST, Karl Tomlinson (:karlt)
pavlov: review+
operator dictionary changes consistent with entity changes [checked-in] (36.58 KB, patch)
2008-01-09 20:34 PST, Karl Tomlinson (:karlt)
pavlov: review+
corresponding nsIEntityConverter table changes [checked-in] (20.64 KB, patch)
2008-01-09 22:27 PST, Karl Tomlinson (:karlt)
pavlov: review+

Description Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26) 2005-04-11 11:16:51 PDT
In order to keep internal strings as UCS2, Mozilla fakes astral math entities by
mapping them to PUA characters. Since strings are now UTF-16, I think this hack
should be removed and the real characters should be allowed in the DOM and
passed to the gfx with ATSUI/Pango/Uniscribe rendering the correct glyphs
provided a properly mapped font is installed.

Pure UTF-8 already goes the real astral route, so I think it would make sense to
fix the gfx implementations to make sure the pure UTF-8 route is the first-class
route.

Actual results:
Astral entities map to PUA chars which are special cased in Win32 and X11 gfx
impls but (it seems) not on Mac.

Expected results:
Astral entities map to the right astral chars and gfx impls on all platforms
deal with those chars appropriately.
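
To make the UTF-16 point concrete, here is a minimal Python sketch (not Mozilla code) of how an astral character is carried as a surrogate pair in UTF-16 and as four bytes in UTF-8. The helper name `to_surrogate_pair` and the choice of U+1D49C MATHEMATICAL SCRIPT CAPITAL A as the sample are assumptions made for illustration.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only astral code points need a surrogate pair")
    offset = code_point - 0x10000
    # The top 10 bits go into the high surrogate, the bottom 10 into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


script_a = "\U0001D49C"  # MATHEMATICAL SCRIPT CAPITAL A, a Plane 1 character
high, low = to_surrogate_pair(ord(script_a))
print(f"U+{ord(script_a):04X} -> surrogates {high:04X} {low:04X}")
print("UTF-16BE bytes:", script_a.encode("utf-16-be").hex())
print("UTF-8 bytes:   ", script_a.encode("utf-8").hex())
```

No Private Use Area code point is involved at any step; the character keeps its real identity in both encodings.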
Comment 1 rbs 2007-07-19 06:02:48 PDT
We use fictional Unicode points (from the so-called Private Use Area - PUA) to reference the special math glyphs that do not have official Unicode assignments. This is in principle what the PUA is meant for. The special math glyphs are needed especially in the stretching process because the process involves "half pieces" that will never get separate individual Unicode points. However, with Cairo, we can modernize the remapping and avoid the detour via the PUA. Doing this will involve pushing the glyph lookup process down to thebes (a lot of the code in nsMathMLChar.cpp). We needed nsMathMLChar.cpp to factor the common functionality away from the disparate GFX platforms. Now with Cairo, we have a common gateway to these platforms, and could push the lookup functionality there, and in the process eliminate our internal assignments to the PUA. This is a major work, but would provide a more elegant approach. Without a unifying Cairo, it would be a nightmare to have multiple GFX implementations of this process.
Comment 2 Karl Tomlinson (:karlt) 2007-09-26 15:34:59 PDT
(In reply to comment #1)
> However, with Cairo, we can modernize the remapping and avoid the detour via
> the PUA.

Do all necessary complete math characters (if we exclude "half pieces") have (non-PUA) Unicode code points?

> Doing this will involve pushing the glyph lookup process down to thebes
> (a lot of the code in nsMathMLChar.cpp). We needed nsMathMLChar.cpp to factor
> the common functionality away from the disparate GFX platforms. Now with
> Cairo, we have a common gateway to these platforms, and could push the lookup
> functionality there, and in the process eliminate our internal assignments to
> the PUA. This is a major work, but would provide a more elegant approach.

Are you suggesting that thebes does the splitting into half pieces?
And therefore manages the stretching process?
Comment 3 Karl Tomlinson (:karlt) 2007-12-07 14:37:42 PST
Things that need to be done:

1) Update layout/mathml/content/src/mathml.dtd
   - scripts in layout/mathml/tools/ may be helpful here

2) Ensure Plane 1 mathvariant entries in mathfont.properties can be parsed
   appropriately and update those entries.

3) Update intl/unicharutil/tables/mathml20.properties.
   - Is this still used?  What for?

Fortunately, so far, we've only needed Plane 0 entries in the mathfontFONT.properties files for stretchy chars, but for Asana Math we'll need Plane 16 for its PUA mappings.
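
As a rough illustration of the plane terminology used here, the following Python sketch classifies a code point by plane and checks whether it falls in one of Unicode's three private use ranges. The function names are hypothetical and this is not how the properties-file parser is implemented.

```python
# The three private use ranges defined by Unicode:
# the BMP Private Use Area plus the two supplementary private use planes (15 and 16).
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def plane_of(code_point):
    """Return the plane number (0 to 16) of a code point."""
    return code_point >> 16


def is_private_use(code_point):
    """Return True if the code point lies in any private use range."""
    return any(low <= code_point <= high for low, high in PRIVATE_USE_RANGES)


for cp in (0x2192, 0xE000, 0x1D49C, 0x100000):
    print(f"U+{cp:04X}: plane {plane_of(cp)}, private use: {is_private_use(cp)}")
```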
Comment 4 Karl Tomlinson (:karlt) 2008-01-07 20:07:46 PST
Created attachment 295889
script used to update our entity list

There are some entities that are handled specially due to updates to Unicode, but most are generated from the XHTML 1.1 plus MathML 2.0 plus SVG 1.1 dtd.
Comment 5 Karl Tomlinson (:karlt) 2008-01-07 20:12:02 PST
Created attachment 295890
mathml.dtd patch [checked-in]
Comment 6 Karl Tomlinson (:karlt) 2008-01-07 20:40:04 PST
Created attachment 295894
entity changes in sorted pseudo-unified-diff format

Changes include:
* new Unicode assignments
* removal of inconsistent definitions
* adding of missing entities
* spaces preceding characters represented by combining marks
  (so they don't combine with the previous character or xml ">")
* private use area is no longer used
* Use of the U+FE00 variation selector (which will fall back to the standard character
  with the same meaning when no variant is provided by the font).
* Use of the U+0333 COMBINING DOUBLE LOW LINE, U+0338 COMBINING LONG SOLIDUS OVERLAY,
  U+20D2 COMBINING LONG VERTICAL LINE OVERLAY, and U+20E5 COMBINING REVERSE
  SOLIDUS OVERLAY combining marks to generate symbols from combinations of
  Unicode code points.
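
The reason for the leading space can be sketched in Python. This is only an illustration under assumptions: `guard_leading_mark` is a hypothetical helper, and NFC normalization is used here merely as a convenient way to show a mark attaching to the preceding ">"; the actual entity list was produced by the attached script.

```python
import unicodedata


def starts_with_combining_mark(text):
    """True if the first character has a non-zero canonical combining class."""
    return bool(text) and unicodedata.combining(text[0]) != 0


def guard_leading_mark(text):
    """Prefix a space so a leading combining mark has its own base character."""
    return " " + text if starts_with_combining_mark(text) else text


overlay = "\u0338"  # COMBINING LONG SOLIDUS OVERLAY

# Unguarded, the mark attaches to whatever precedes it, here the ">" that ends a tag.
print(ascii(unicodedata.normalize("NFC", ">" + overlay)))
# Guarded, the mark combines with the inserted space and ">" is left alone.
print(ascii(unicodedata.normalize("NFC", ">" + guard_leading_mark(overlay))))
```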
Comment 7 Karl Tomlinson (:karlt) 2008-01-09 20:17:29 PST
Created attachment 296287
include &#x200B; in short arrow entities

This makes short arrows detectably different from normal arrows so they don't stretch (consistent with previous behavior).

This is the message that I sent to www-math@w3.org yesterday but appears to have got stuck in a moderation queue or similar:

ShortRightArrow and RightArrow entities represent the same character
(http://www.w3.org/TR/MathML2/bycodes.html)
but have different operator dictionary definitions in
http://www.w3.org/TR/2007/WD-MathML3-20070427/appendixf.html#oper-dict.entries

ShortRightArrow has default attribute stretchy="false" while RightArrow has
stretchy="true".

"The choice of name for a given character in MathML has no effect on its
rendering"
(http://www.w3.org/TR/2007/WD-MathML3-20070427/appendixf.html#oper-dict.names)
so I'm trying to work out what the default value of the stretchy attribute
should be.

My first impression was to make RightArrow (etc.) stretchy="false" by default
because LongRightArrow etc can be used for stretchy arrows.

However, RightArrow is given as an example of a stretchy operator here:
http://www.w3.org/TR/2007/WD-MathML3-20070427/chapter3.html#id.3.2.5.8

The operator dictionary is non-normative, so I guess the example is then
definitive.

Would it make sense that a ShortRightArrow (or ShortUpArrow) can
stretch by default to become non-Short?

I'm considering using a different entity definition for
ShortRightArrow, to distinguish from RightArrow.

One option is: <!ENTITY ShortRightArrow '<mo stretchy="false">&#x2192;</mo>'>
With this entity, <mo>&ShortRightArrow;</mo> would then be nested <mo>
elements.  What evil might this cause?

Another option is to include another (hopefully) insignificant character such
as ZERO WIDTH SPACE: <!ENTITY ShortRightArrow "&#x2192;&#x200B;">
Is this any better?  Is there a better character that could be used?
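
A minimal Python sketch of the ZERO WIDTH SPACE option follows. It assumes a hypothetical operator dictionary (`OPERATOR_STRETCHY`) keyed by the full text content of the <mo> element; Gecko's real operator dictionary has a different format, so this only shows why the extra character makes the two entities distinguishable after expansion.

```python
import xml.etree.ElementTree as ET

# Hypothetical operator dictionary keyed by the text content of <mo>.
OPERATOR_STRETCHY = {
    "\u2192": True,         # RightArrow: stretchy by default
    "\u2192\u200B": False,  # ShortRightArrow defined with a trailing ZWSP
}

DOC = """<!DOCTYPE math [
  <!ENTITY RightArrow "&#x2192;">
  <!ENTITY ShortRightArrow "&#x2192;&#x200B;">
]>
<math><mo>&RightArrow;</mo><mo>&ShortRightArrow;</mo></math>"""


def default_stretchy(mo_text):
    """Look up the default stretchy value; unknown operators do not stretch."""
    return OPERATOR_STRETCHY.get(mo_text, False)


root = ET.fromstring(DOC)  # entities from the internal subset are expanded by the parser
for mo in root.iter("mo"):
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in mo.text)
    print(code_points, "stretchy by default:", default_stretchy(mo.text))
```

Note that after expansion the DOM contains two characters for the short arrow, which is exactly the property debated later in this bug.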
Comment 8 Karl Tomlinson (:karlt) 2008-01-09 20:28:27 PST
Created attachment 296291
include ZWSP in short arrow entities (including slarr and srarr) [checked in]
Comment 9 Karl Tomlinson (:karlt) 2008-01-09 20:34:25 PST
Created attachment 296293
operator dictionary changes consistent with entity changes [checked-in]

(mathvariant properties will be done separately)
Comment 10 Stuart Parmenter 2008-01-09 20:38:05 PST
Comment on attachment 296293
operator dictionary changes consistent with entity changes [checked-in]

rs=me
Comment 11 Stuart Parmenter 2008-01-09 20:38:27 PST
Comment on attachment 295890
mathml.dtd patch [checked-in]

rs=me
Comment 12 Karl Tomlinson (:karlt) 2008-01-09 22:27:34 PST
Created attachment 296299
corresponding nsIEntityConverter table changes [checked-in]
Comment 13 Stuart Parmenter 2008-01-09 22:41:20 PST
Comment on attachment 296299
corresponding nsIEntityConverter table changes [checked-in]

rs=me
Comment 14 Karl Tomlinson (:karlt) 2008-01-10 03:07:20 PST
The entity list at http://www.mozilla.org/projects/mathml/demo/entity.js
will need updating at some stage.
Comment 15 Stuart Parmenter 2008-01-10 15:58:48 PST
Comment on attachment 296291
include ZWSP in short arrow entities (including slarr and srarr) [checked in]

rs=me
Comment 16 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26) 2008-01-17 10:57:49 PST
> One option is: <!ENTITY ShortRightArrow '<mo stretchy="false">&#x2192;</mo>'>
> With this entity, <mo>&ShortRightArrow;</mo> would then be nested <mo>
> elements.  What evil might this cause?

 * It violates the principle of least surprise when considering what an XML-savvy person should reasonably expect: You should always be able to process the DTD and reserialize without entity references on the author-side. With the above entity definition, leaving entity expansion to the client side yields a different DOM.

 * Further, if the author uses a given Unicode character (directly, as an NCR or as a named entity that has a de jure definition and entities are expanded), scripts and the clipboard should see the same character that the author put there.

 * It will cause interop grief. Even the existing character entity behavior in Gecko is a problem for Opera and WebKit: DTDs are fundamentally a design mistake as far as XML in the Web context goes. See http://hsivonen.iki.fi/no-dtd/ . Since it would be very bad for browsers to actually load DTDs from the network, Gecko doesn't. WebKit doesn't, either, so WebKit has had to add an approximation of the Gecko behavior for at least some of the public ids Gecko magically knows about in order to work with pages that have been authored by Gecko users. Opera hasn't done that yet, but Gecko's precedent is a problem for them: http://my.opera.com/Andrew%20Gregory/blog/2007/12/18/opera-not-xhtml-svg-mathml-compatible and http://annevankesteren.nl/2007/12/xml-entities . Gecko's magic list has de facto become what everyone needs to implement. Even though it is probably too late to rip out the magic list, at least it could be frozen, considered grandfathered and documented in a WHATWGish spec--perhaps even in HTML5 itself as part of processing requirements for XHTML5. Keeping Gecko's pseudo-DTD catalog a moving target isn't good for interop. In fact, the current Gecko entity resolver is a dead end for Gecko itself as well: if the magic list ever gets a new public id entry (e.g. for MathML 3) and authors start authoring with the public id, effects in older Gecko versions will be distinctly ungraceful (YSoD).

Therefore, I suggest freezing the Gecko DTD catalog (with the proper Unicode mapping for MathML stuff), documenting it for the sake of interop and advising authors to transition to DTDlessness with XHTML5 and MathML3. (MathML is too complex to be typed directly anyway, so whatever tool generates MathML from the format the author prefers to edit could generate pure UTF-8 or NCRs instead of entity references.)
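
The author-side conversion suggested above could look roughly like the following Python sketch. The three-entry `ENTITY_TABLE` is a hypothetical excerpt (a real tool would load the full entity set), and `to_ncrs` is an illustrative name, not an existing tool.

```python
import re

# Hypothetical excerpt of an entity table; a real tool would load the full mapping.
ENTITY_TABLE = {
    "RightArrow": "\u2192",
    "InvisibleTimes": "\u2062",
    "nbsp": "\u00A0",
}
# These five are built into XML itself and need no DTD, so they are left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_ncrs(source):
    """Replace named entity references with numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        if name not in ENTITY_TABLE:
            raise KeyError(f"no mapping for &{name};")
        return "".join(f"&#x{ord(ch):X};" for ch in ENTITY_TABLE[name])

    return ENTITY_REF.sub(replace, source)


print(to_ncrs("<mrow><mi>a</mi><mo>&RightArrow;</mo><mi>b</mi></mrow>"))
```

Output produced this way parses identically with or without DTD processing, which is the interop property being argued for.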
Comment 17 Karl Tomlinson (:karlt) 2008-01-29 20:50:38 PST
(the remaining issues here are not P2)
Comment 18 Karl Tomlinson (:karlt) 2008-03-15 20:17:04 PDT
(In reply to comment #16)
> > One option is: <!ENTITY ShortRightArrow '<mo stretchy="false">&#x2192;</mo>'>
> > With this entity, <mo>&ShortRightArrow;</mo> would then be nested <mo>
> > elements.  What evil might this cause?
> 
>  * It violates the principle of least surprise when considering what an
> XML-savvy person should reasonably expect: You should always be able to process
> the DTD and reserialize without entity references on the author-side. With the
> above entity definition, leaving entity expansion to the client side yields a
> different DOM.

Thanks, I didn't use the above approach, but the second suggestion in comment 7.

But with current MathML entity definitions, I still can't see a solution to the problem in comment 7 that enables reserializing on the author-side and still indicating a ShortRightArrow with default attribute stretchy="false" (unless the stretchy="false" is added explicitly when not present).

>  * Further, if the author uses a given Unicode character (directly, as an NCR
> or as a named entity that has a de jure definition and entities are expanded),
> scripts and the clipboard should see the same character that the author put
> there.

This is a good point, but remember that Unicode definitions change too, and using a named entity is sometimes a more reliable way of describing the intended character.  The meanings of varphi and straightphi are clear, but their corresponding Unicode point meanings have changed.

> but Gecko's precedent is a problem for them:
> http://my.opera.com/Andrew%20Gregory/blog/2007/12/18/opera-not-xhtml-svg-mathml-compatible

I don't think I understand the issue here.  The snippet at http://my.opera.com/Andrew%20Gregory/blog/2007/12/01/ie-xhtml-mathml-and-svg
is xhtml and specifies the dtd explicitly, and that dtd defines nbsp.  Why should nbsp not be defined?
Comment 19 Karl Tomlinson (:karlt) 2008-03-15 20:19:43 PDT
(In reply to comment #3)
> 2) Ensure Plane 1 mathvariant entries in mathfont.properties can be parsed
>    appropriately and update those entries.

Done in bug 413115.
Comment 20 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26) 2008-03-21 04:54:05 PDT
(In reply to comment #18)
> (In reply to comment #16)
> > > One option is: <!ENTITY ShortRightArrow '<mo stretchy="false">&#x2192;</mo>'>
> > > With this entity, <mo>&ShortRightArrow;</mo> would then be nested <mo>
> > > elements.  What evil might this cause?
> > 
> >  * It violates the principle of least surprise when considering what an
> > XML-savvy person should reasonably expect: You should always be able to process
> > the DTD and reserialize without entity references on the author-side. With the
> > above entity definition, leaving entity expansion to the client side yields a
> > different DOM.
> 
> Thanks, I didn't use the above approach, but the second suggestion in comment
> 7.

The ZERO-WIDTH SPACE suggestion is better, but it is still questionable to have mnemonic names of characters expand to something more than *a* character and anything other than the official definition.

> But with current MathML entity definitions, I still can't see a solution to the
> problem in comment 7 that enables reserializing on the author-side and still
> indicating a ShortRightArrow with default attribute stretchy="false" (unless
> the stretchy="false" is added explicitly when not present).

In that case, I think the MathML spec needs fixing. Having the entities map to the same character and expecting them to have different rendering is totally, utterly bogus. If there's an attribute for stretchiness, that attribute needs to be explicit in the XML source, then.

While anonymous rendering boxes are OK in CSS terms if they can be inferred from the DOM--not the serialization syntactic sugar, inferring non-anonymous DOM-visible stuff from syntactic sugar is seriously not OK with XML. Breaking the clear vocabulary-agnostic mapping between XML source and the DOM would be like giving the little finger to the mess that is <isindex> all over again.

> >  * Further, if the author uses a given Unicode character (directly, as an NCR
> > or as a named entity that has a de jure definition and entities are expanded),
> > scripts and the clipboard should see the same character that the author put
> > there.
> 
> This is a good point, but remember that Unicode definitions change too, and
> using a named entity is sometimes a more reliable way of describing the
> intended character.  The meanings of varphi and straightphi are clear, but their
> corresponding Unicode point meanings have changed.

Do you mean that Unicode changed its definitions (very bad) or that MathML changed the entity-to-Unicode mapping (also bad)?

> > but Gecko's precedent is a problem for them:
> > http://my.opera.com/Andrew%20Gregory/blog/2007/12/18/opera-not-xhtml-svg-mathml-compatible
> 
> I don't think I understand the issue here.  The snippet at
> http://my.opera.com/Andrew%20Gregory/blog/2007/12/01/ie-xhtml-mathml-and-svg
> is xhtml and specifies the dtd explicitly, and that dtd defines nbsp.  Why
> should nbsp not be defined?

XML processors are not required to process external entities. Therefore, a document that relies on external entities will not parse reliably on all XML parsers (those that opt not to process external entities). If the legacy Gecko behavior were ignored, categorically refusing to process external entities would be the most reasonable course of action for browsers (hence, killing character entities in XML on the Web--including nbsp). (Since the reasonable course of action arising from the constraints of XML 1.0 would lead to killing character entities, one might conclude that the design of XML 1.0 is broken, but it is what it is.)
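
The failure mode can be reproduced with any XML parser that does not read external DTDs. The Python sketch below uses the standard library's ElementTree purely as a stand-in for such a parser; it is not how any browser is implemented, and `parses` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parses(source):
    """Return True if the document is accepted, False on a fatal parse error."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


# A named character entity with no definition available to the parser.
print("named entity:", parses("<p>a&nbsp;b</p>"))
# The same character written as a numeric character reference.
print("numeric reference:", parses("<p>a&#xA0;b</p>"))
```

The first document is rejected outright while the second is accepted, which is the browser-level difference between an error page and a rendered page.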

Opera is taking the course of action that would be reasonable absent the legacy made possible by Gecko. Problems arise when Opera hits the legacy content out there that relies on the peculiarity of Gecko. Thus, Gecko's well-intentioned attempt not to kill XML character entities on the Web causes trouble for everyone else. Safari has already had to adapt.

If Gecko adds new stuff to its hard-wired DTD catalog, it makes itself a moving target for everyone else, which is bad for interop. Doing so would also stab the users of previous Gecko releases until they upgrade (as the new stuff would trigger the YSoD in earlier Gecko versions). For example, adding a theoretical XHTML 1.1 + MathML 3.0 DTD to Gecko's hardwired catalog would make a page using the character entity definition from that DTD give the YSoD on Firefox 2.0.

Making browsers fetch DTDs is not a viable option (http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic), therefore, we have two interop-friendly options:

1) Freezing Gecko's hard-wired DTD catalog, documenting it, making the documentation part of a "how to implement XML in a Web browser" standard and firmly resisting any spec writer (including the MathML WG) seeking to introduce new DTDs or public ids for DTDs. This does not involve changes to XML 1.0 but would only codify which optional features of XML 1.0 are used and in which exact way.

2) Defining XML5 with a large frozen built-in set of character entities encompassing all those used in existing content, deploying XML5 parsers in all major Web browsers thereby stabbing XML 1.0 implementations and causing grief over a transition period to XML5 and then firmly resisting attempts to expand the frozen built-in set of character entities.
Comment 21 distler 2008-03-21 06:17:11 PDT
>In that case, I think the MathML spec needs fixing. Having the
>entities map to the same character and expecting them to have
>different rendering is totally, utterly bogus. If there's an
>attribute for stretchiness, that attribute needs to be explicit
>in the XML source, then.

This is definitely a problem with the Spec. I would hope that the current behaviour (namely that <mo>&#x2192;</mo> is stretchy by default) is retained.

I would take the DTD as definitive (namely that both &RightArrow; and &ShortRightArrow; map to U+2192).

It's unfortunate, but the only reasonable conclusion is that an explicit <mo stretchy="false"> is needed to obtain a non-stretchy arrow.

>Do you mean that Unicode changed its definitions (very bad)
>or that MathML changed the entity-to-Unicode mapping (also bad)?

Unicode changed its definition, on the insistence of the Greeks. U+03C6 (small greek letter phi) used to be the straight phi, and U+03D5 was the curly phi. This changed in Unicode 3, since the Greeks say the curly letter should be the default rendering for running text.

However, users of TeX (say) expect that &phi; maps to the straight letter, and &varphi; to the curly letter, despite the shift in where Unicode placed these glyphs.
Comment 22 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26) 2008-03-22 02:08:02 PDT
(In reply to comment #21)
> >In that case, I think the MathML spec needs fixing. Having the
> >entities map to the same character and expecting them to have
> >different rendering is totally, utterly bogus. If there's an
> >attribute for stretchiness, that attribute needs to be explicit
> >in the XML source, then.
> 
> This is definitely a problem with the Spec. I would hope that the current
> behaviour (namely that <mo>&#x2192;</mo> is stretchy by default) is retained.

Making the stretchiness default depend on content does not break fundamental XML or DOM assumptions, so that would be OK in the XML/DOM sense.

> I would take the DTD as definitive (namely that both &RightArrow; and
> &ShortRightArrow; map to U+2192).

Makes sense.

> It's unfortunate, but the only reasonable conclusion is that an explicit <mo
> stretchy="false"> is needed to obtain a non-stretchy arrow.

I think this is better than taking liberties with entity expansions.

> >Do you mean that Unicode changed its definitions (very bad)
> >or that MathML changed the entity-to-Unicode mapping (also bad)?
> 
> Unicode changed its definition, on the insistence of the Greeks. U+03C6 (small
> greek letter phi) used to be the straight phi, and U+03D5 was the curly phi.
> This changed in Unicode 3, since the Greeks say the curly letter should be the
> default rendering for running text.

Whoa. That's *bad*.

> However, users of TeX (say) expect that &phi; maps to the straight letter,
> and &varphi; to the curly letter, despite the shift in where Unicode placed
> these glyphs.

Trouble ahead! HTML maps &phi; to U+03C6.
Comment 23 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26) 2008-03-22 02:52:47 PDT
(In reply to comment #22)
> Trouble ahead! HTML maps &phi; to U+03C6.

So now the meaning of an (X)HTML source fragment with &phi; in it changes when a MathML public id is put on top. Not good, but perhaps not hit often enough to make fixing less disruptive than not fixing.

I think this should be raised as an issue in the relevant W3C WGs. OK if I raise this there?

Comment 24 distler 2008-03-22 07:31:31 PDT
As a point of information, this is the reason itex2MML maps

   \phi -> <mi>&#x3D5;</mi>
\varphi -> <mi>&#x3C6;</mi>

completely bypassing MathML named entities. At this point, it seems impossible to get right, otherwise.
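
As a hedged illustration of that mapping (a Python sketch, not itex2MML's actual source), the table below emits numeric character references directly and prints the current Unicode character names for reference. The names `TEX_TO_CODE_POINT` and `tex_to_mi` are made up for this example.

```python
import unicodedata

# The two mappings quoted above, stored as code points.
TEX_TO_CODE_POINT = {
    r"\phi": 0x03D5,
    r"\varphi": 0x03C6,
}


def tex_to_mi(command):
    """Emit an <mi> element using a numeric character reference, never a named entity."""
    return f"<mi>&#x{TEX_TO_CODE_POINT[command]:X};</mi>"


for command, code_point in TEX_TO_CODE_POINT.items():
    print(command, "->", tex_to_mi(command), unicodedata.name(chr(code_point)))
```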

>I think this should be raised as an issue in the relevant W3C WGs. OK if I
>raise this there?

I believe they (at least, the Math WG) are well-aware of it. It was the cause of much chagrin, back in the day.
