Bug 162431 - add non-BMP Unicode (plane 1 and above. surrogate) support to charset encoder/decoder
Status: RESOLVED WONTFIX
: intl
Product: Core
Classification: Components
Component: Internationalization
Version: Trunk
Platform: All All
Importance: -- normal with 4 votes
Target Milestone: mozilla1.7alpha
Assigned To: Jungshik Shin
:
: Makoto Kato [:m_kato]
Mentors:
Duplicates: 162403 320086
Depends on:
Blocks: jis0213 230006 403564
Reported: 2002-08-13 00:17 PDT by Frank Tang
Modified: 2014-01-08 12:35 PST
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
working patch, still need to change nsIUnicodeEncodeHelper.cpp (4.84 KB, patch)
2002-08-13 00:22 PDT, Frank Tang
Sample XML invoking DOM. (1.52 KB, text/xml)
2003-06-01 14:44 PDT, Rolf Culver
Straight XML. (567 bytes, text/xml)
2003-06-01 14:45 PDT, Rolf Culver
Straight HTML. (193 bytes, text/html)
2003-06-01 14:46 PDT, Rolf Culver
support higher planes (13.48 KB, patch)
2006-01-02 16:33 PST, YAMASHITA Makoto
multiple plane encoder (1.31 KB, text/plain)
2006-01-02 16:35 PST, YAMASHITA Makoto
multiplane encoder implementation (3.78 KB, text/plain)
2006-01-02 16:36 PST, YAMASHITA Makoto
MSBM10 (for texture) converter (2.30 KB, text/plain)
2006-01-02 16:38 PST, YAMASHITA Makoto
MSBM converter impl (2.90 KB, text/plain)
2006-01-02 16:40 PST, YAMASHITA Makoto
msbm (for texture) plane 0 uf (3.77 KB, text/plain)
2006-01-02 16:41 PST, YAMASHITA Makoto
msbm plane 1 uf (3.60 KB, text/plain)
2006-01-02 16:42 PST, YAMASHITA Makoto
higher plane support intl + gfx/mac & gfx/win (30.04 KB, patch)
2006-02-11 20:49 PST, YAMASHITA Makoto
"normalized" nsMultiPlaneEncoderSupport.cpp (3.83 KB, text/plain)
2006-02-11 20:53 PST, YAMASHITA Makoto

Description Frank Tang 2002-08-13 00:17:08 PDT
Here is my plan to add the support:
Use the upper 5 bits of aShiftTable->classID to indicate the plane of the
mapping passed with it. The mapping is still divided into 16-bit to 16-bit
mappings. For example, the cns11643 plane 3 to Unicode mapping will be divided
into two 16-16 mappings: one maps cns11643 plane 3 to the Unicode BMP, the other
maps cns11643 plane 3 to Unicode plane 2. For the second one, the 16-16 bit
mapping does not include the 2 of the 0x20000 part; that information is passed
in by the upper 5 bits of aShiftTable->classID.

We need this to fully support euc-tw.
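
The plane-tagging idea above can be sketched roughly as follows. This is only an illustration, not the attached patch: it assumes classID is a 16-bit field whose low 11 bits keep the existing class identifier, and the names PackClassID, PlaneOf, ClassOf and FullCodePoint are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: classID is 16 bits wide. The low 11 bits keep the original
// class identifier; the upper 5 bits carry the Unicode plane (0..16).
constexpr unsigned kPlaneShift = 11;
constexpr uint16_t kClassMask = (1u << kPlaneShift) - 1;  // 0x07FF

// Pack a plane number into the upper 5 bits of a class ID.
inline uint16_t PackClassID(uint16_t classId, unsigned plane) {
  assert(plane <= 16 && classId <= kClassMask);
  return static_cast<uint16_t>((plane << kPlaneShift) | classId);
}

// Extract the plane from the upper 5 bits.
inline unsigned PlaneOf(uint16_t packed) {
  return static_cast<unsigned>(packed >> kPlaneShift);
}

// Extract the original class identifier from the low 11 bits.
inline uint16_t ClassOf(uint16_t packed) {
  return static_cast<uint16_t>(packed & kClassMask);
}

// Rebuild the full code point from the plane tag and the 16-bit value
// produced by an unchanged 16-16 mapping table.
inline uint32_t FullCodePoint(uint16_t packed, uint16_t low16) {
  return (static_cast<uint32_t>(PlaneOf(packed)) << 16) | low16;
}
```

With this layout, a table tagged with plane 2 that yields the 16-bit value 0x1234 stands for U+21234, so the table itself never has to store the leading 2.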
Comment 1 Frank Tang 2002-08-13 00:20:55 PDT
Also see bug 162364; that bug talks about updating the cns11643 plane 3-7,15
tables. We also generate separate tables for Extension B now (see the
cnsIRGTp*ExtB.txt, cns*extb.uf and cns*extb.ut files).
Comment 2 Frank Tang 2002-08-13 00:22:25 PDT
Created attachment 95064
working patch, still need to change nsIUnicodeEncodeHelper.cpp

Not complete; it has not been compiled yet.
Comment 3 rbs 2002-08-13 02:44:00 PDT
*** Bug 162403 has been marked as a duplicate of this bug. ***
Comment 4 Frank Tang 2002-08-15 19:40:51 PDT
assign
rbs, why do you need it? I need it to support euc-tw cns 11643 planes 3-7. What
is your need? That will help me justify the priority of this bug.
Comment 5 rbs 2002-08-15 20:27:12 PDT
Unicode 3.2 and the MathML spec include a bunch of math characters in plane-1.
And, as usual, these math characters cannot be found in ordinary fonts. So in
order to support these surrogate characters, a number of things are needed:
- support of surrogate characters in the system (without this foundation, there
is not even a starting point).
- then, support the mapping between surrogate characters and the glyphs within
the special math fonts. Currently, nsIUnicodeEncode::Convert() is used to do the
mapping, but this breaks if the input contains surrogate pairs. So to fix the
problem, nsIUnicodeEncode::Convert() needs to be able to take an input that may
contain surrogate pairs, and do a re-map according to a factory pre-built
mapping table.

Needless to say, the tools used to factory-build the mapping tables need to be
updated to support surrogate input as well. Otherwise, it won't be possible to
easily set up the internal data sets that are necessary to do the mapping.
Comment 6 Frank Tang 2002-08-16 03:31:41 PDT
My idea is not to change the tool.
I think what we should do is keep the mapping a 16-16 mapping
and supply the additional information somewhere else.
Therefore, if we need to convert Unicode to a charset that encodes the BMP and
one additional plane, then we need two uf files for that: one for the BMP and
one for the other plane. We supply the plane information in the shift table.
Comment 7 Jungshik Shin 2002-12-18 13:55:42 PST
Frank,
What's your plan for nsICharRepresentable? Currently, it uses 2048 (= 64k / 32)
PRUint32s to store the representability of BMP characters.
Increasing the array size by a factor of 17 (planes 0 through 16) would be
quick, but is not such a good idea. Do you want to use a technique similar to
what's used in CCMap (compressed charmap) in gfx? Hmm, we can just switch to it,
can't we (is FillInfo() frozen?)?
The downsides of switching are that it takes longer to build a CCMap (at run
time) than the 'info' array currently used, and that a more extensive patch is
necessary (both in intl and in gfx, where 'nsICharRe..' is used). The former
problem can be worked around by storing precompiled CCMaps as statics in the
converter class instead of building them at run time wherever
necessary/desirable/possible (see bug 180266; the perl script there has to be
extended to generate extended CCMaps that cover non-BMP characters).
Certainly, there's a trade-off between speed and memory footprint here.

Alternatively, we can extend the 'info' array (without inflating it by a factor
of 17), as was done to extend CCMap to support plane 1 and beyond. By this, I
mean adding 16 offsets (pointers) to the 'info' array for planes 1 - 16 (or 17
if we want to avoid branching completely, in which case we also need
2048 zeroed-out PRUint32s), followed by the size and as many
arrays of 2048 PRUint32s as there are non-empty non-BMP planes.
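
A loose sketch of the second alternative, for illustration only. It keeps the 17 offsets in a separate array rather than inside the 'info' block itself, and it uses a shared zeroed block for empty planes so that lookups need no per-plane branch. The class name PlaneInfo and its methods are hypothetical and are not part of nsICharRepresentable.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kWordsPerPlane = 2048;  // 65536 bits / 32 bits per word
constexpr size_t kPlanes = 17;           // planes 0 through 16

class PlaneInfo {
 public:
  // Block 0 is a shared all-zero block; every empty plane points at it.
  PlaneInfo() : mWords(kWordsPerPlane, 0) { mOffset.fill(0); }

  // Mark a code point as representable, allocating its plane on first use.
  void Set(uint32_t cp) {
    assert(cp <= 0x10FFFF);
    size_t plane = cp >> 16;
    if (mOffset[plane] == 0) {
      mOffset[plane] = mWords.size();
      mWords.resize(mWords.size() + kWordsPerPlane, 0);
    }
    mWords[mOffset[plane] + ((cp & 0xFFFF) >> 5)] |= 1u << (cp & 31);
  }

  // Test representability; empty planes read from the shared zero block.
  bool Has(uint32_t cp) const {
    if (cp > 0x10FFFF) return false;
    uint32_t word = mWords[mOffset[cp >> 16] + ((cp & 0xFFFF) >> 5)];
    return ((word >> (cp & 31)) & 1u) != 0;
  }

  // Total storage in 32-bit words, including the shared zero block.
  size_t WordCount() const { return mWords.size(); }

 private:
  std::array<size_t, kPlanes> mOffset;
  std::vector<uint32_t> mWords;
};
```

Storage grows only with the number of non-empty planes: a converter covering the BMP and plane 1 needs three blocks of 2048 words here (zero block, plane 0, plane 1) instead of seventeen.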
 
  
 
Comment 8 Rolf Culver 2003-05-31 06:34:29 PDT
Can reach MathML SMP fonts in XML via entity reference (&Ascr;).
Numeric character references and mathvariants in XML or DOM do not work.
Comment 9 Jungshik Shin 2003-05-31 19:17:37 PDT
> Can reach MathML SMP fonts in XML via entity reference (&Ascr;).
> Numeric character references and mathvariants in XML or DOM do not work.

  It looks like this is a separate issue. What font did you use (Code2001?)? Do
NCRs (numeric character references) work in HTML?
Comment 10 Rolf Culver 2003-06-01 14:44:23 PDT
Created attachment 124694
Sample XML invoking DOM.
Comment 11 Rolf Culver 2003-06-01 14:45:47 PDT
Created attachment 124695
Straight XML.
Comment 12 Rolf Culver 2003-06-01 14:46:46 PDT
Created attachment 124696
Straight HTML.
Comment 13 Rolf Culver 2003-06-01 14:48:01 PDT
I used the fonts: TeX's CM Fonts (cmex10, cmmi10, cmr10, cmsy10) and Mathematica
4.1 Fonts (Math1-5 and Math1-5Mono) from
http://www.mozilla.org/projects/mathml/fonts/.

Numeric character references above 0xffff do not work in XML or DOM or
HTML--they show as question marks.  In the DOM I can show that an SMP character
round-trip loses 0x10000.

Please see the three attachments which show attempts to display a mathematical
script letter A.  For me only the &Ascr; method works, and this is not available
in the DOM.  I am using Windows 98 and Mozilla build 2003051008.
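
For reference, a small sketch of the arithmetic behind the "loses 0x10000" observation. Truncating a supplementary code point to 16 bits is one plausible mechanism, not something confirmed in this report; the helper names TruncateTo16 and ToSurrogates are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// What a 16-bit-only code path effectively does to a code point:
// the plane bits above 0xFFFF are dropped.
inline char16_t TruncateTo16(uint32_t cp) {
  return static_cast<char16_t>(cp & 0xFFFF);
}

// Standard UTF-16 encoding of a supplementary code point as a
// high/low surrogate pair.
inline std::pair<char16_t, char16_t> ToSurrogates(uint32_t cp) {
  assert(cp >= 0x10000 && cp <= 0x10FFFF);
  uint32_t v = cp - 0x10000;
  return {static_cast<char16_t>(0xD800 + (v >> 10)),
          static_cast<char16_t>(0xDC00 + (v & 0x3FF))};
}
```

For script A (U+1D49C), truncation gives 0xD49C, while the correct UTF-16 form is the pair D835 DC9C.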
Comment 14 Jungshik Shin 2003-06-01 16:36:54 PDT
> I used the fonts: TeX's CM Fonts (cmex10, cmmi10, cmr10, cmsy10) and Mathematica
> 4.1 Fonts (Math1-5 and Math1-5Mono) from

  Without fixing this bug, these fonts cannot be used to render non-BMP
characters (unless they're in XML and represented with entity names).
If you install a font that actually covers Plane 1 (e.g. Code2001,
http://home.att.net/~jameskass), you'll see that NCRs work fine in both HTML and
XML. In HTML, the character entity name for 'Script A' doesn't work, which has
to be filed as a separate bug. The DOM issue has to be filed as another separate
bug.

Why does 'ASCR' work in xml (not in html)? That's because it's mapped to a PUA
code point for MathML. See

http://lxr.mozilla.org/seamonkey/source/layout/mathml/content/src/mathml.dtd

This file has to be modified once this bug is fixed.
Comment 15 Jungshik Shin 2003-06-01 20:06:36 PDT
FYI, bug 207919 and bug 207923 were filed for entity names for non-BMP
characters in (X)HTML/XML and fromCharCode() in JS, respectively. 
Comment 16 Jungshik Shin 2004-01-03 21:01:50 PST
@netscape.com address doesn't work any more and ftang's aol address is not in
bugzilla.
Comment 17 Jungshik Shin 2004-01-03 21:30:19 PST
what do you think of the problem mentioned in comment #17?
Comment 18 Roland Mainz 2004-01-03 22:02:31 PST
Jungshik Shin wrote:
> Additional Comment #17 From Jungshik Shin 2004-01-03 21:30 ------- 
                     ^^^
> what do you think of the problem mentioned in comment #17?
                                                        ^^^
Uhm, now I am stuck in an endless loop... :)
... which comment do you mean ?
Comment 19 Jungshik Shin 2004-01-03 22:17:45 PST
Ooops.sorry it's comment #7.
Comment 20 Roland Mainz 2004-01-03 22:25:51 PST
Jungshik Shin wrote:
> Ooops.sorry it's comment #7.

OK... suggestion: First implement a _simple_, _working_ version for release 1.7a,
regardless of whether it makes the zilla bigger or not...
... and then do the fine-tuning and footprint work for the release 1.8 cycle.
Comment 21 Jungshik Shin 2004-01-04 23:11:39 PST
re comment #7: I was wrong to think that every instance of nsIUnicodeEncoder
carries the |info| array (2048 PRUint32s). Callers (of
nsICharRepresentable::FillInfo) have to take care of the memory alloc/dealloc.
And the only caller is nsCompressedCharMap. So, what has to be done is to make
some changes in the way FillInfo works (or add FillInfoEx) in
intl/uconv/src/(umap.c, nsUCSupport.cpp, nsUnicodeEncoderHelper.cpp, etc).
Comment 22 rbs 2005-12-13 01:42:16 PST
*** Bug 320086 has been marked as a duplicate of this bug. ***
Comment 23 YAMASHITA Makoto 2005-12-25 20:35:57 PST
Is the PRUint32 info array really needed? If I'm not missing something important, it's almost equivalent to ccmap, and is only used inside intl, between nsCompressedCharMap and nsBasicEncoder subclasses.
Then we can change the behavior of nsBasicEncoder to set the ccmap directly, rather than info arrays.
In this way the change will be invisible to gfx, and the gfx code can be modified just to use CCMAP_HAS_CHAR_EXT instead of CCMAP_HAS_CHAR...

To make the situation consistent we can introduce more radical changes (which can be hard ;)
Implement HasChar and HasChars(PRUnichar* ptr, PRUint32 len) (what's really needed in gfx) in nsCompressedCharMap, and something like a GetCompressedCharMap in nsICharRepresentable.
Then the gfx code should simply ask for nsCompressedCharMap instead of handling the CCMAP data block directly.
In this way a more algorithmic approach for CCMAPs (like the ones in the mapping tables) will be possible.
Comment 24 YAMASHITA Makoto 2006-01-02 16:33:27 PST
Created attachment 207378
support higher planes

This experimental patch is even more ad hoc than what I suggested in comment 23.
Besides applying the patch, you need to move nsCompressedCharMap.* from unicharutil/util/ to uconv/util/ and put nsMultiPlaneEncoderSupport.* there, and put the nsUnicodeToTeXMSBM files into uconv/ucvmath.
The patch introduces:
nsICharRepresentable
- added HasChars(UInt16*, UInt32*) for (surrogate-aware) representability testing
nsBasicEncoder
- stores a ccmap (constructed from its own FillInfo) to implement HasChars
- some unicode converters are changed to inherit from this
- This way we can replace the old CCMAP_HAS_CHAR(MapperToCCMAP(*),-) by *->HasChars(&-,len)
Comment 25 YAMASHITA Makoto 2006-01-02 16:35:29 PST
Created attachment 207379
multiple plane encoder

This nsMultiPlaneEncoderSupport introduces:
parent converter
- is initialized with an array of mapping tables and an array of shift tables
- has a child converter (maybe null) for each plane. The children are just instances of nsTableEncoderSupport; they convert from the lower 16 bits of the UTF-32 value and don't know which plane they are associated with. In this way we don't have to extend the mapping tables.
- FillInfo is just redirected to that of the plane 0 child, for backward compatibility.
- surrogate decomposition in conversion
- has an extended CCMAP
- The HasChars implementation can be moved to nsCompressedCharMap once we have declared an interface to access it from outside intl.
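
The parent/child arrangement described above can be sketched as follows. This is not the attached nsMultiPlaneEncoderSupport code: PlaneTable is a hypothetical stand-in for a 16-bit table encoder such as nsTableEncoderSupport, and the single-byte output and the MultiPlaneEncoder API are assumptions made to keep the example small.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for a per-plane 16-bit table encoder. It only sees
// the lower 16 bits of a code point and does not know its own plane.
struct PlaneTable {
  std::map<uint16_t, unsigned char> map;
};

class MultiPlaneEncoder {
 public:
  // Attach a child table to a plane; planes without a child stay null.
  void SetPlane(size_t plane, const PlaneTable* table) {
    assert(plane < 17);
    mPlanes[plane] = table;
  }

  // Encode UTF-16 text. Surrogate pairs are combined first, then the plane
  // selects the child and the lower 16 bits are looked up in its table.
  // Returns false at the first unmappable character.
  bool Encode(const std::u16string& src, std::string& dest) const {
    for (size_t i = 0; i < src.size(); ++i) {
      uint32_t cp = src[i];
      if ((cp & 0xFC00) == 0xD800 && i + 1 < src.size() &&
          (src[i + 1] & 0xFC00) == 0xDC00) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
        ++i;  // the low surrogate has been consumed
      }
      const PlaneTable* child = mPlanes[cp >> 16];
      if (!child) return false;
      auto it = child->map.find(static_cast<uint16_t>(cp & 0xFFFF));
      if (it == child->map.end()) return false;
      dest.push_back(static_cast<char>(it->second));
    }
    return true;
  }

 private:
  std::array<const PlaneTable*, 17> mPlanes{};  // all null initially
};
```

Because the plane is resolved by the parent, each child table stays a plain 16-bit mapping, which matches the goal of not extending the mapping table format.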
Comment 26 YAMASHITA Makoto 2006-01-02 16:36:18 PST
Created attachment 207380
multiplane encoder implementation
Comment 27 YAMASHITA Makoto 2006-01-02 16:38:03 PST
Created attachment 207381
MSBM10 (for texture) converter
Comment 28 YAMASHITA Makoto 2006-01-02 16:40:40 PST
Created attachment 207382
MSBM converter impl

As you can see, it #includes two uf's, for planes 0 and 1.
Comment 29 YAMASHITA Makoto 2006-01-02 16:41:27 PST
Created attachment 207383
msbm (for texture) plane 0 uf
Comment 30 YAMASHITA Makoto 2006-01-02 16:42:03 PST
Created attachment 207384
msbm plane 1 uf
Comment 31 YAMASHITA Makoto 2006-02-11 20:49:36 PST
Created attachment 211552
higher plane support intl + gfx/mac & gfx/win

The changes are a build bustage fix plus gfx code to illustrate how it will work. config.mk declares a ref to ucvutil because the nsICharRepresentable->CMAP conversion has moved there.
rbs, could you look into how this works?
Current shortcomings are:
- gfx/win is not yet complete (a surrogate char is accompanied by a "ghost" char).
- On Windows I somehow couldn't import the surrogate macros (IS_HIGH/LOW_SURROGATE) from nsCharTraits.h into nsMultiPlane...cpp.
- const declarations in nsUnicodeToTeXMSBM.cpp might cause errors. Please just drop them in that case.

Filenames for the previous posts are:
25 nsMultiPlaneEncoderSupport.h
27 nsUnicodeToTeXMSBM.h
28 nsUnicodeToTeXMSBM.cpp
29 msbm10p0.uf
30 msbm10p1.uf
Comment 32 YAMASHITA Makoto 2006-02-11 20:53:13 PST
Created attachment 211553
"normalized" nsMultiPlaneEncoderSupport.cpp

Replaces attachment 207380 from comment 26.
Comment 33 Ho Fung Wong 2008-12-08 13:23:34 PST
When can this bug be fixed?
Comment 34 Zack Weinberg (:zwol) 2011-11-15 16:58:08 PST
This is a very old bug and it's not clear to me that any of it is still relevant.  Please indicate what still needs to be done, or else close the bug.
Comment 35 Masatoshi Kimura [:emk] 2011-11-15 18:35:56 PST
We still need this to support EUC-TW and Big5-HKSCS fully.
Comment 36 Zack Weinberg (:zwol) 2011-11-15 20:40:48 PST
(In reply to Masatoshi Kimura [:emk] from comment #35)
> We still need this to support EUC-TW and Big5-HKSCS fully.

Thanks for the update.  What remains to be done before a patch can be reviewed?
Comment 37 Henri Sivonen (:hsivonen) 2014-01-03 00:41:02 PST
Per bug 912470 comment 25, I suggest we WONTFIX this. EUC-TW is no longer used by Firefox and we never supported a non-x- label for it. The code lingers around so that mailnews devs can decide if they want EUC-TW to be their problem. That leaves only unified Big5 from the Encoding Standard, which is different from all other encodings when it comes to astral characters. I think we should implement it from the spec in bug 912470 instead of trying to add generic astral plane support to our existing generic machinery.

(Also, the encoder patch here doesn't seem ready yet anyway. At least it doesn't seem to handle the case where halves of a surrogate pair fall on different sides of a buffer boundary.)
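
To make the buffer-boundary case concrete, here is a minimal sketch of a streaming decoder that remembers a trailing high surrogate between calls. It is an illustration of the general technique, not code from the patch or from Gecko; StreamingUtf16Decoder, Feed and Finish are hypothetical names, and emitting U+FFFD for unpaired surrogates is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class StreamingUtf16Decoder {
 public:
  // Feed one buffer of UTF-16 code units; decoded code points are appended
  // to out. A high surrogate at the end of the buffer is held back until
  // the next call.
  void Feed(const char16_t* src, size_t len, std::vector<uint32_t>& out) {
    for (size_t i = 0; i < len; ++i) {
      char16_t u = src[i];
      if (mPendingHigh != 0) {
        char16_t hi = mPendingHigh;
        mPendingHigh = 0;
        if ((u & 0xFC00) == 0xDC00) {
          out.push_back(0x10000u + ((hi - 0xD800u) << 10) + (u - 0xDC00u));
          continue;
        }
        out.push_back(0xFFFD);  // the held high surrogate was unpaired
        // u has not been consumed; it is handled below
      }
      if ((u & 0xFC00) == 0xD800) {
        mPendingHigh = u;  // its partner may arrive in the next buffer
      } else if ((u & 0xFC00) == 0xDC00) {
        out.push_back(0xFFFD);  // low surrogate without a preceding high
      } else {
        out.push_back(u);
      }
    }
  }

  // Call at end of input: a still-pending high surrogate is unpaired.
  void Finish(std::vector<uint32_t>& out) {
    if (mPendingHigh != 0) {
      out.push_back(0xFFFD);
      mPendingHigh = 0;
    }
  }

 private:
  char16_t mPendingHigh = 0;  // 0 means nothing is pending
};
```

An encoder that processes each buffer in isolation would see a lone high surrogate at the end of one buffer and a lone low surrogate at the start of the next; carrying one code unit of state across calls avoids that.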
Comment 38 Zack Weinberg (:zwol) 2014-01-03 08:47:38 PST
I have no objection as long as the MathML fonts stuff (way, way above) has been dealt with some other way, which I suspect it has.
Comment 39 Henri Sivonen (:hsivonen) 2014-01-04 03:50:34 PST
My understanding is that we now only support and only want to support TTF/OTF math fonts that know their own Unicode mapping, but needinfoing fredw to check that we don't need font-specific encoders for math anymore.
Comment 40 Frédéric Wang (:fredw) 2014-01-04 04:24:14 PST
(In reply to Henri Sivonen (:hsivonen) from comment #39)
> My understanding is that we now only support and only want to support
> TTF/OTF math fonts that know their own Unicode mapping, but needinfoing
> fredw to check that we don't need font-specific encoders for math anymore.

I think Karl took care a long time ago to move the MathML code to use only TTF/OTF fonts. At the moment, the MathML code only accesses characters by Unicode code point (and using non-BMP characters has been possible since we support Asana fonts). The plan for bug 407059 is to add access by glyph index for stretchy characters using the MATH table, but I don't think anything in this bug will help / is necessary. cc'ing Karl.
Comment 41 Karl Tomlinson (:karlt) 2014-01-08 12:35:11 PST
(In reply to Henri Sivonen (:hsivonen) from comment #39)
> My understanding is that we now only support and only want to support
> TTF/OTF math fonts that know their own Unicode mapping,

Legacy non-OTF fonts are still supported by graphics code, but I expect (but I'm not sure) that we only support non-Unicode mappings when the platform provides the translation.  Regardless, we don't need to add a non-BMP encoder for fonts or math.
