Closed Opened 18 years ago Closed 15 years ago

# Should use real astral chars (not PUA) for math chars outside the Basic Multilingual Plane

P2
normal

RESOLVED FIXED
mozilla1.9beta5

## Attachments

### (6 files, 1 obsolete file)

* 6.22 KB, text/plain
* 125.73 KB, patch (pavlov: review+)
* 23.96 KB, text/plain
* 1.85 KB, patch (pavlov: review+)
* 36.58 KB, patch (pavlov: review+)
* 20.64 KB, patch (pavlov: review+)

In order to keep internal strings as UCS2, Mozilla fakes astral math entities by
mapping them to PUA characters. Since strings are now UTF-16, I think this hack
should be removed and the real characters should be allowed in the DOM and
passed to the gfx with ATSUI/Pango/Uniscribe rendering the correct glyphs
provided a properly mapped font is installed.

Pure UTF-8 already goes the real astral route, so I think it would make sense to
fix the gfx implementations to make sure the pure UTF-8 route is the first-class
route.

Actual results:
Astral entities map to PUA chars which are special cased in Win32 and X11 gfx
impls but (it seems) not on Mac.

Expected results:
Astral entities map to the right astral chars and gfx impls on all platforms
deal with those chars appropriately.
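
For illustration, a minimal Python sketch of what "real astral chars" look like once strings are UTF-16: a character outside the Basic Multilingual Plane is stored as a surrogate pair rather than as a single PUA code unit. U+1D49C MATHEMATICAL SCRIPT CAPITAL A is picked here only as an example of a Plane 1 math character, and the helper name `utf16_units` is hypothetical, not anything in the Mozilla tree.

```python
import struct
import unicodedata


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


astral = "\U0001D49C"   # example Plane 1 math character (assumed for illustration)
bmp = "\u2192"          # RIGHTWARDS ARROW, a Plane 0 character

print(unicodedata.name(astral))
print([hex(u) for u in utf16_units(astral)])  # two code units: a surrogate pair
print([hex(u) for u in utf16_units(bmp)])     # one code unit
```

Any code that walks such a string one 16-bit unit at a time has to recombine the pair before doing a glyph lookup, which is presumably the part each gfx implementation needs to get right.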
We use fictional Unicode points (from the so-called Private Use Area - PUA) to reference the special math glyphs that do not have official Unicode assignments. This is in principle what the PUA is meant for. The special math glyphs are needed especially in the stretching process because the process involves "half pieces" that will never get separate individual Unicode points. However, with Cairo, we can modernize the remapping and avoid the detour via the PUA. Doing this will involve pushing the glyph lookup process down to thebes (a lot of the code in nsMathMLChar.cpp). We needed nsMathMLChar.cpp to factor the common functionality away from the disparate GFX platforms. Now with Cairo, we have a common gateway to these platforms, and could push the lookup functionality there, and in the process eliminate our internal assignments to the PUA. This is major work, but would provide a more elegant approach. Without a unifying Cairo, it would be a nightmare to have multiple GFX implementations of this process.
(In reply to comment #1)
> However, with Cairo, we can modernize the remapping and avoid the detour via
> the PUA.

Do all necessary complete math characters (if we exclude "half pieces") have (non-PUA) unicode points?

> Doing this will involve pushing the glyph lookup process down to thebes
> (a lot of the code in nsMathMLChar.cpp). We needed nsMathMLChar.cpp to factor
> the common functionality away from the disparate GFX platforms. Now with
> Cairo, we have a common gateway to these platforms, and could push the lookup
> functionality there, and in the process eliminate our internal assignments to
> the PUA. This is major work, but would provide a more elegant approach.

Are you suggesting that thebes does the splitting into half pieces?
And therefore manages the stretching process?
Blocks: 400938
Assignee: rbs → mozbugz
Flags: blocking1.9?
Priority: -- → P3
Flags: blocking1.9? → blocking1.9+
Priority: P3 → P2
Blocks: 324857
Blocks: 321438
Blocks: cambria-math
Things that need to be done:

1) Update layout/mathml/content/src/mathml.dtd
- scripts in layout/mathml/tools/ may be helpful here

2) Ensure Plane 1 mathvariant entries in mathfont.properties can be parsed
appropriately and update those entries.

3) Update intl/unicharutil/tables/mathml20.properties.
- Is this still used?  What for?

Fortunately, so far, we've only needed Plane 0 entries in the mathfontFONT.properties files for stretchy chars, but for Asana Math we'll need Plane 16 for its PUA mappings.
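
As a rough aid for the plane terminology used above, here is a small Python sketch that reports the plane of a code point and whether Unicode classifies it as private use (general category `Co`). The helper names `plane` and `is_private_use` are hypothetical, and the sample code points are chosen for illustration only; they are not taken from any of the properties files.

```python
import unicodedata


def plane(cp):
    """Unicode plane number of a code point (0 is the BMP)."""
    return cp >> 16


def is_private_use(cp):
    """True if the code point has general category Co (private use)."""
    return unicodedata.category(chr(cp)) == "Co"


# Sample code points: BMP arrow, BMP PUA, Plane 1 math letter, Plane 16 PUA.
for cp in (0x2192, 0xE000, 0x1D49C, 0x100000):
    print("U+%04X plane=%d private_use=%s" % (cp, plane(cp), is_private_use(cp)))
```

A parser for the properties files would need to accept values in all of these ranges, not just four-hex-digit Plane 0 values.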
Blocks: asana-math
There are some entities that are handled specially due to updates to Unicode, but most are generated from the XHTML 1.1 plus MathML 2.0 plus SVG 1.1 dtd.
Changes include:
* new Unicode assignments
* removal of inconsistent definitions
* adding of missing entities
* spaces preceding characters represented by combining marks
(so they don't combine with the previous character or xml ">")
* private use area is no longer used
* Use of U+FE00 VARIATION SELECTOR-1 (which will fall back to the standard
character with the same meaning when no variant is provided by the font.)
* Use of U+0333 COMBINING DOUBLE LOW LINE, U+0338 COMBINING LONG SOLIDUS OVERLAY,
U+20D2 COMBINING LONG VERTICAL LINE OVERLAY, and U+20E5 COMBINING REVERSE
SOLIDUS OVERLAY combining marks to generate symbols from combinations of
Unicode points.
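
To illustrate why a space precedes characters represented by a bare combining mark, here is a small Python sketch. It uses NFC normalization as a stand-in for "combining with the previous character"; that is an assumption made for the example, not a claim about what Gecko or any particular tool does with the text. The two expansions shown are hypothetical, not copied from the patch.

```python
import unicodedata

SOLIDUS_OVERLAY = "\u0338"  # COMBINING LONG SOLIDUS OVERLAY


def nfc(text):
    """Normalize text to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# Hypothetical entity expansions: a bare mark and a space-prefixed mark.
bare = SOLIDUS_OVERLAY
spaced = " " + SOLIDUS_OVERLAY

# The character just before the expansion is the ">" that closes a start tag.
after_tag_bare = nfc(">" + bare)
after_tag_spaced = nfc(">" + spaced)

print([hex(ord(c)) for c in after_tag_bare])    # mark merged into the ">"
print([hex(ord(c)) for c in after_tag_spaced])  # ">" left untouched
```

The same mechanism is what makes the base-plus-mark combinations useful: `=` followed by U+0338 composes to U+2260 NOT EQUAL TO.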
This makes short arrows detectably different from normal arrows so they don't stretch (consistent with previous behavior).

This is the message that I sent to www-math@w3.org yesterday but appears to have got stuck in a moderation queue or similar:

ShortRightArrow and RightArrow entities represent the same character
(http://www.w3.org/TR/MathML2/bycodes.html)
but have different operator dictionary definitions in
http://www.w3.org/TR/2007/WD-MathML3-20070427/appendixf.html#oper-dict.entries

ShortRightArrow has default attribute stretchy="false" while RightArrow has
stretchy="true".

"The choice of name for a given character in MathML has no effect on its
rendering"
(http://www.w3.org/TR/2007/WD-MathML3-20070427/appendixf.html#oper-dict.names)
so I'm trying to work out what the default value of the stretchy attribute
should be.

My first impression was to make RightArrow (etc.) stretchy="false" by default
because LongRightArrow etc can be used for stretchy arrows.

However, RightArrow is given as an example of a stretchy operator here:
http://www.w3.org/TR/2007/WD-MathML3-20070427/chapter3.html#id.3.2.5.8

The operator dictionary is non-normative, so I guess the example is then
definitive.

Would it make sense that a ShortRightArrow (or ShortUpArrow) can
stretch by default to become non-Short?

I'm considering using a different entity definition for
ShortRightArrow, to distinguish from RightArrow.

One option is: <!ENTITY ShortRightArrow "<mo stretchy="false">&#x2192;</mo>">
With this entity, <mo>&ShortRightArrow;</mo> would then be nested <mo>
elements.  What evil might this cause?

Another option is to include another (hopefully) insignificant character such
as ZERO WIDTH SPACE: <!ENTITY ShortRightArrow "&#x2192;&#x200B;">
Is this any better?  Is there a better character that could be used?
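
A minimal Python sketch of how the ZERO WIDTH SPACE option would let an operator dictionary tell the two entities apart once they have been expanded. The tables `ENTITIES` and `OPERATOR_DICT` and the function `default_stretchy` are hypothetical stand-ins, not the actual Gecko data structures, and the fallback to non-stretchy for unknown operators is an assumption of the sketch.

```python
RIGHT_ARROW = "\u2192"  # RIGHTWARDS ARROW
ZWSP = "\u200b"         # ZERO WIDTH SPACE

# Hypothetical entity expansions following the second option above.
ENTITIES = {
    "RightArrow": RIGHT_ARROW,
    "ShortRightArrow": RIGHT_ARROW + ZWSP,
}

# Hypothetical operator dictionary keyed on the full text content of <mo>.
OPERATOR_DICT = {
    RIGHT_ARROW: {"stretchy": True},
    RIGHT_ARROW + ZWSP: {"stretchy": False},
}


def default_stretchy(mo_text):
    """Look up the default stretchy value for the text inside an <mo>."""
    return OPERATOR_DICT.get(mo_text, {"stretchy": False})["stretchy"]


for name, expansion in ENTITIES.items():
    print(name, default_stretchy(expansion))
```

The cost is that scripts and the clipboard see two characters where the author wrote one entity name.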
Attachment #296287 - Flags: review?
Attachment #296287 - Attachment is obsolete: true
Attachment #296287 - Flags: review?
Attachment #296291 - Flags: review?(pavlov)
(mathvariant properties will be done separately)
Attachment #296293 - Flags: review?(pavlov)
Attachment #295890 - Flags: review?(pavlov)
Comment on attachment 296293 [details] [diff] [review]
operator dictionary changes consistent with entity changes [checked-in]

rs=me
Attachment #296293 - Flags: review?(pavlov) → review+
Comment on attachment 295890 [details] [diff] [review]
mathml.dtd patch [checked-in]

rs=me
Attachment #295890 - Flags: review?(pavlov) → review+
Attachment #296299 - Flags: review?(pavlov)
Comment on attachment 296299 [details] [diff] [review]
corresponding nsIEntityConverter table changes [checked-in]

rs=me
Attachment #296299 - Flags: review?(pavlov) → review+
Attachment #296293 - Attachment description: operator dictionary changes consistent with entity changes → operator dictionary changes consistent with entity changes [checked-in]
Attachment #295890 - Attachment description: mathml.dtd patch → mathml.dtd patch [checked-in]
Attachment #296299 - Attachment description: corresponding nsIEntityConverter table changes → corresponding nsIEntityConverter table changes [checked-in]
The entity list at http://www.mozilla.org/projects/mathml/demo/entity.js
will need updating at some stage.
Comment on attachment 296291 [details] [diff] [review]
include ZWSP in short arrow entities (including slarr and srarr) [checked in]

rs=me
Attachment #296291 - Flags: review?(pavlov) → review+
Attachment #296291 - Attachment description: include &#x200B; in short arrow entities (including slarr and srarr) → include ZWSP in short arrow entities (including slarr and srarr) [checked in]
Status: NEW → ASSIGNED
> One option is: <!ENTITY ShortRightArrow "<mo stretchy="false">&#x2192;</mo>">
> With this entity, <mo>&ShortRightArrow;</mo> would then be nested <mo>
> elements.  What evil might this cause?

* It violates the principle of least surprise when considering what an XML-savvy person should reasonably expect: You should always be able to process the DTD and reserialize without entity references on the author-side. With the above entity definition, leaving entity expansion to the client side yields a different DOM.

* Further, if the author uses a given Unicode character (directly, as an NCR or as a named entity that has a de jure definition and entities are expanded), scripts and the clipboard should see the same character that the author put there.

* It will cause interop grief. Even the existing character entity behavior in Gecko is a problem for Opera and WebKit: DTDs are fundamentally a design mistake as far as XML in the Web context goes. See http://hsivonen.iki.fi/no-dtd/ . Since it would be very bad for browsers to actually load DTDs from the network, Gecko doesn't. WebKit doesn't, either, so WebKit has had to add an approximation of the Gecko behavior for at least some of the public ids Gecko magically knows about in order to work with pages that have been authored by Gecko users. Opera hasn't done that yet, but Gecko's precedent is a problem for them: http://my.opera.com/Andrew%20Gregory/blog/2007/12/18/opera-not-xhtml-svg-mathml-compatible and http://annevankesteren.nl/2007/12/xml-entities . Gecko's magic list has de facto become what everyone needs to implement. Even though it is probably too late to rip out the magic list, at least it could be frozen, considered grandfathered and documented in a WHATWGish spec--perhaps even in HTML5 itself as part of processing requirements for XHTML5. Keeping Gecko's pseudo-DTD catalog a moving target isn't good for interop. In fact, the current Gecko entity resolver is a dead end for Gecko itself as well: if the magic list ever gets a new public id entry (e.g. for MathML 3) and authors start authoring with the public id, effects in older Gecko versions will be distinctly ungraceful (YSoD).

Therefore, I suggest freezing the Gecko DTD catalog (with the proper Unicode mapping for MathML stuff), documenting it for the sake of interop and advising authors to transition to DTDlessness with XHTML5 and MathML3. (MathML is too complex to be typed directly anyway, so whatever tool generates MathML from the format the author prefers to edit could generate pure UTF-8 or NCRs instead of entity references.)
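
The "different DOM" point can be sketched with Python's `xml.etree.ElementTree`, which expands entities declared in the internal DTD subset. The internal subset is used here only to make the example self-contained; in the scenario under discussion the definitions would come from the browser's built-in catalog. The markup variant uses single quotes for the attribute so that the entity literal is well-formed. The names `TEMPLATE` and `parse_with` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Document template with one entity declared in the internal subset.
TEMPLATE = """<!DOCTYPE math [
  <!ENTITY ShortRightArrow %s>
]>
<math><mo>&ShortRightArrow;</mo></math>"""


def parse_with(entity_literal):
    """Parse the template with the given quoted entity value."""
    return ET.fromstring(TEMPLATE % entity_literal)


# Option 1: the entity expands to markup, so the tree gains a nested <mo>.
markup_root = parse_with('"<mo stretchy=\'false\'>&#x2192;</mo>"')
# Option 2: the entity expands to characters only.
text_root = parse_with('"&#x2192;&#x200B;"')

outer = markup_root[0]
print(len(outer), outer[0].tag, outer[0].get("stretchy"))
print(len(text_root[0]), repr(text_root[0].text))
```

A tool that reserializes with a character-only definition of the entity on the author side would never produce the nested element, so the two routes disagree about the tree.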
(the remaining issues here are not P2)
Priority: P2 → P3
Whiteboard: [swag:3d]
Flags: blocking1.9-
Flags: tracking1.9+
Depends on: 413115
(In reply to comment #16)
> > One option is: <!ENTITY ShortRightArrow "<mo stretchy="false">&#x2192;</mo>">
> > With this entity, <mo>&ShortRightArrow;</mo> would then be nested <mo>
> > elements.  What evil might this cause?
>
>  * It violates the principle of least surprise when considering what an
> XML-savvy person should reasonably expect: You should always be able to process
> the DTD and reserialize without entity references on the author-side. With the
> above entity definition, leaving entity expansion to the client side yields a
> different DOM.

Thanks, I didn't use the above approach, but the second suggestion in comment 7.

But with current MathML entity definitions, I still can't see a solution to the problem in comment 7 that enables reserializing on the author-side and still indicating a ShortRightArrow with default attribute stretchy="false" (unless the stretchy="false" is added explicitly when not present).

>  * Further, if the author uses a given Unicode character (directly, as an NCR
> or as a named entity that has a de jure definition and entities are expanded),
> scripts and the clipboard should see the same character that the author put
> there.

This is a good point, but remember that Unicode definitions change too, and using a named entity is sometimes a more reliable way of describing the intended character.  The meanings of varphi and straightphi are clear, but their corresponding Unicode point meanings have changed.

> but Gecko's precedent is a problem for them:
> http://my.opera.com/Andrew%20Gregory/blog/2007/12/18/opera-not-xhtml-svg-mathml-compatible

I don't think I understand the issue here.  The snippet at http://my.opera.com/Andrew%20Gregory/blog/2007/12/01/ie-xhtml-mathml-and-svg
is xhtml and specifies the dtd explicitly, and that dtd defines nbsp.  Why should nbsp not be defined?
(In reply to comment #3)
> 2) Ensure Plane 1 mathvariant entries in mathfont.properties can be parsed
>    appropriately and update those entries.

Done in bug 413115.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Priority: P3 → P2
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9beta5
(In reply to comment #18)
> (In reply to comment #16)
> > > One option is: <!ENTITY ShortRightArrow "<mo stretchy="false">&#x2192;</mo>">
> > > With this entity, <mo>&ShortRightArrow;</mo> would then be nested <mo>
> > > elements.  What evil might this cause?
> >
> >  * It violates the principle of least surprise when considering what an
> > XML-savvy person should reasonably expect: You should always be able to process
> > the DTD and reserialize without entity references on the author-side. With the
> > above entity definition, leaving entity expansion to the client side yields a
> > different DOM.
>
> Thanks, I didn't use the above approach, but the second suggestion in comment
> 7.

The ZERO WIDTH SPACE suggestion is better, but it is still questionable to have mnemonic names of characters expand to something more than *a* character and to anything other than the official definition.

> But with current MathML entity definitions, I still can't see a solution to the
> problem in comment 7 that enables reserializing on the author-side and still
> indicating a ShortRightArrow with default attribute stretchy="false" (unless
> the stretchy="false" is added explicitly when not present).

In that case, I think the MathML spec needs fixing. Having the entities map to the same character and expecting them to have different rendering is totally, utterly bogus. If there's an attribute for stretchiness, that attribute needs to be explicit in the XML source, then.

While anonymous rendering boxes are OK in CSS terms if they can be inferred from the DOM--not the serialization syntactic sugar, inferring non-anonymous DOM-visible stuff from syntactic sugar is seriously not OK with XML. Breaking the clear vocabulary-agnostic mapping between XML source and the DOM would be like giving the little finger to the mess that is <isindex> all over again.

> >  * Further, if the author uses a given Unicode character (directly, as an NCR
> > or as a named entity that has a de jure definition and entities are expanded),
> > scripts and the clipboard should see the same character that the author put
> > there.
>
> This is a good point, but remember that Unicode definitions change too, and
> using a named entity is sometimes a more reliable way of describing the
> intended character.  The meanings of varphi and straightphi are clear, but their
> corresponding Unicode point meanings have changed.

Do you mean that Unicode changed its definitions (very bad) or that MathML changed the entity-to-Unicode mapping (also bad)?

> > but Gecko's precedent is a problem for them:
> > http://my.opera.com/Andrew%20Gregory/blog/2007/12/18/opera-not-xhtml-svg-mathml-compatible
>
> I don't think I understand the issue here.  The snippet at
> http://my.opera.com/Andrew%20Gregory/blog/2007/12/01/ie-xhtml-mathml-and-svg
> is xhtml and specifies the dtd explicitly, and that dtd defines nbsp.  Why
> should nbsp not be defined?

XML processors are not required to process external entities. Therefore, a document that relies on external entities will not parse reliably on all XML parsers (those that opt not to process external entities). If the legacy Gecko behavior were ignored, categorically refusing to process external entities would be the most reasonable course of action for browsers (hence, killing character entities in XML on the Web--including nbsp). (Since the reasonable course of action arising from the constraints of XML 1.0 would lead to killing character entities, one might conclude that the design of XML 1.0 is broken, but it is what it is.)

Opera is taking the course of action that would be reasonable absent the legacy made possible by Gecko. Problems arise when Opera hits the legacy content out there that relies on the peculiarity of Gecko. Thus, Gecko's well-intentioned attempt not to kill XML character entities on the Web causes trouble for everyone else. Safari has already had to adapt.

If Gecko adds new stuff to its hard-wired DTD catalog, it makes itself a moving target for everyone else, which is bad for interop. Doing so would also stab the users of previous Gecko releases until they upgrade (as the new stuff would trigger the YSoD in earlier Gecko versions). For example, adding a theoretical XHTML 1.1 + MathML 3.0 DTD to Gecko's hardwired catalog would make a page using the character entity definition from that DTD give the YSoD on Firefox 2.0.
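
As a sketch of the failure mode, Python's `xml.etree.ElementTree` behaves like a parser that does not process the external subset: a named entity that is only defined in the external DTD is a fatal error, while a numeric character reference parses anywhere. This only illustrates the XML 1.0 behavior described above; it says nothing about how any browser is implemented. The helper name `parses` is hypothetical.

```python
import xml.etree.ElementTree as ET

# DOCTYPE pointing at an external DTD that the parser will not fetch.
DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN" '
    '"http://www.w3.org/Math/DTD/mathml2/xhtml-math11-f.dtd">'
)


def parses(body):
    """Return True if a document with the given paragraph body parses."""
    document = DOCTYPE + "<html><body><p>%s</p></body></html>" % body
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print(parses("a&nbsp;b"))   # named entity defined only in the external DTD
print(parses("a&#xA0;b"))   # numeric character reference
```

A browser that hits the first case has nothing graceful to fall back on, which is where the YSoD comes from.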

Making browsers fetch DTDs is not a viable option (http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic), therefore, we have two interop-friendly options:

1) Freezing Gecko's hard-wired DTD catalog, documenting it, making the documentation part of a "how to implement XML in a Web browser" standard and firmly resisting any spec writer (including the MathML WG) seeking to introduce new DTDs or public ids for DTDs. This does not involve changes to XML 1.0 but would only codify which optional features of XML 1.0 are used and in which exact way.

2) Defining XML5 with a large frozen built-in set of character entities encompassing all those used in existing content, deploying XML5 parsers in all major Web browsers thereby stabbing XML 1.0 implementations and causing grief over a transition period to XML5 and then firmly resisting attempts to expand the frozen built-in set of character entities.

>In that case, I think the MathML spec needs fixing. Having the
>entities map to the same character and expecting them to have
>different rendering is totally, utterly bogus. If there's an
>attribute for stretchiness, that attribute needs to be explicit
>in the XML source, then.

This is definitely a problem with the Spec. I would hope that the current behaviour (namely that <mo>&#x2192;</mo> is stretchy by default) is retained.

I would take the DTD as definitive (namely that both &RightArrow; and &ShortRightArrow; map to U+2192).

It's unfortunate, but the only reasonable conclusion is that an explicit <mo stretchy="false"> is needed to obtain a non-stretchy arrow.

>Do you mean that Unicode changed its definitions (very bad)
>or that MathML changed the entity-to-Unicode mapping (also bad)?

Unicode changed its definition, on the insistence of the Greeks. U+03C6 (small greek letter phi) used to be the straight phi, and U+03D5 was the curly phi. This changed in Unicode 3, since the Greeks say the curly letter should be the default rendering for running text.

However, users of TeX (say) expect that &phi; maps to the straight letter, and &varphi; to the curly letter, despite the shift in where Unicode placed these glyphs.
(In reply to comment #21)
> >In that case, I think the MathML spec needs fixing. Having the
> >entities map to the same character and expecting them to have
> >different rendering is totally, utterly bogus. If there's an
> >attribute for stretchiness, that attribute needs to be explicit
> >in the XML source, then.
>
> This is definitely a problem with the Spec. I would hope that the current
> behaviour (namely that <mo>&#x2192;</mo> is stretchy by default) is retained.

Making the stretchiness default depend on content does not break fundamental XML or DOM assumptions, so that would be OK in the XML/DOM sense.

> I would take the DTD as definitive (namely that both &RightArrow; and
> &ShortRightArrow; map to U+2192).

Makes sense.

> It's unfortunate, but the only reasonable conclusion is that an explicit <mo
> stretchy="false"> is needed to obtain a non-stretchy arrow.

I think this is better than taking liberties with entity expansions.

> >Do you mean that Unicode changed its definitions (very bad)
> >or that MathML changed the entity-to-Unicode mapping (also bad)?
>
> Unicode changed its definition, on the insistence of the Greeks. U+03C6 (small
> greek letter phi) used to be the straight phi, and U+03D5 was the curly phi.
> This changed in Unicode 3, since the Greeks say the curly letter should be the
> default rendering for running text.

Whoa. That's *bad*.

> However, users of TeX (say) expect that &phi; maps to the straight letter,
> and &varphi; to the curly letter, despite the shift in where Unicode placed
> these glyphs.

Trouble ahead! HTML maps &phi; to U+03C6.
(In reply to comment #22)
> Trouble ahead! HTML maps &phi; to U+03C6.

So now the meaning of an (X)HTML source fragment with &phi; in it changes when a MathML public id is put on top. Not good, but perhaps not hit often enough to make fixing less disruptive than not fixing.

I think this should be raised as an issue in the relevant W3C WGs. OK if I raise this there?


As a point of information, this is the reason itex2MML maps

\phi -> <mi>&#x3D5;</mi>
\varphi -> <mi>&#x3C6;</mi>

completely bypassing MathML named entities. At this point, it seems impossible to get right, otherwise.
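
A minimal Python sketch of that approach: map the TeX macro straight to a numeric character reference so that no named entity is involved. This is an illustration of the mapping quoted above, not itex2MML's actual code; the table `TEX_TO_CODEPOINT` and the function `to_mi` are hypothetical names.

```python
import unicodedata

# The two mappings quoted above, keyed by TeX macro.
TEX_TO_CODEPOINT = {
    "\\phi": 0x3D5,
    "\\varphi": 0x3C6,
}


def to_mi(macro):
    """Emit an <mi> element with a numeric character reference."""
    return "<mi>&#x%X;</mi>" % TEX_TO_CODEPOINT[macro]


for macro, cp in TEX_TO_CODEPOINT.items():
    print(macro, to_mi(macro), unicodedata.name(chr(cp)))
```

Because the output contains only numeric references, it parses the same way whether or not the consumer knows any DTD.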

>I think this should be raised as an issue in the relevant W3C WGs. OK if I
>raise this there?

I believe they (at least, the Math WG) are well-aware of it. It was the cause of much chagrin, back in the day.
Attachment #295889 - Attachment mime type: text/x-perl → text/plain