Closed Bug 162431 Opened 22 years ago Closed 10 years ago

add non-BMP Unicode (plane 1 and above. surrogate) support to charset encoder/decoder

Categories

(Core :: Internationalization, defect)

RESOLVED WONTFIX
mozilla1.7alpha

People

(Reporter: ftang, Assigned: jshin1987)

Details

(Keywords: intl)

Attachments

(11 files, 2 obsolete files)

Here is my plan to add the support:
Use the upper 5 bits of aShiftTable->classID to indicate the Unicode plane of
the mapping passed with it. Each mapping is still a 16-bit to 16-bit mapping.
For example, the CNS 11643 plane 3 to Unicode mapping will be divided into two
16-16 mappings: one maps CNS 11643 plane 3 to the Unicode BMP, the other maps
CNS 11643 plane 3 to Unicode plane 2. For the second one, the 16-16 mapping
does not include the 2 of the 0x20000 part; that information is passed in via
the upper 5 bits of aShiftTable->classID.
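
A minimal sketch of the plan above, not the actual uconv code. It assumes that classID is a 16-bit field whose top 5 bits are free to carry the plane number; the real field layout is not shown in this bug. The names PackClassId, PlaneOf and ToCodePoint are hypothetical.

```cpp
#include <cstdint>

// Assumption: classID is 16 bits wide. Bits 11..15 hold the Unicode plane
// (0..16 fits in 5 bits); bits 0..10 keep the original class value.
constexpr unsigned kPlaneShift = 11;
constexpr uint16_t kClassMask = (1u << kPlaneShift) - 1;

// Tag a class value with the plane its 16-16 mapping table belongs to.
constexpr uint16_t PackClassId(uint16_t classValue, unsigned plane) {
  return static_cast<uint16_t>((plane << kPlaneShift) |
                               (classValue & kClassMask));
}

// Recover the plane from the upper 5 bits.
constexpr unsigned PlaneOf(uint16_t classId) {
  return classId >> kPlaneShift;
}

// Combine the plane with the 16-bit result of an ordinary 16-16 table lookup.
constexpr uint32_t ToCodePoint(uint16_t classId, uint16_t mapped16) {
  return (static_cast<uint32_t>(PlaneOf(classId)) << 16) | mapped16;
}
```

With this layout the table itself never stores the leading 2 of a plane 2 code point; the caller restores it from the class ID after the lookup.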

We need this to fully support EUC-TW.
Also see bug 162364, which is about updating the CNS 11643 plane 3-7 and 15 tables.
We also generate separate tables for Extension B now (see the
cnsIRGTp*ExtB.txt, cns*extb.uf and cns*extb.ut files).
Summary: add surrogte support for nsIUnicodeDeocdeHelper, nsIUnicodeEncodeHelper → add surrogte support for nsIUnicodeDeocdeHelper, nsIUnicodeEncodeHelper
Not complete; I have not compiled it yet.
OS: Windows XP → All
Hardware: PC → All
Summary: add surrogte support for nsIUnicodeDeocdeHelper, nsIUnicodeEncodeHelper → add surrogte support for nsIUnicodeDecodeHelper, nsIUnicodeEncodeHelper
*** Bug 162403 has been marked as a duplicate of this bug. ***
Keywords: intl
QA Contact: ruixu → ylong
assign
rbs, why do you need it? I need it to support EUC-TW (CNS 11643 planes 3-7).
What is your need? That will help me justify the priority of this bug.
Status: NEW → ASSIGNED
Unicode 3.2 and the MathML spec include a bunch of math characters in plane-1.
And, as usual, these math characters cannot be found in ordinary fonts. So in
order to support these surrogate characters, a number of things are needed:
- support of surrogate characters in the system (without this foundation, there
is not even a starting point).
- then, support the mapping between surrogate characters and the glyphs within
the special math fonts. Currently, nsIUnicodeEncode::Convert() is used to do the
mapping, but this breaks if the input contains surrogate pairs. So to fix the
problem, nsIUnicodeEncode::Convert() needs to be able to take an input that may
contain surrogate pairs, and do a re-map according to a factory pre-built
mapping table.

Needless to say, the tools used to factory-build the mapping tables need to be
updated to support surrogate input as well. Otherwise, it won't be possible to
easily set up the internal data sets that are necessary to do the mapping.
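
As an illustration of what "input that may contain surrogate pairs" means for a converter, here is a small sketch that turns UTF-16 code units into code points before any table lookup. It is not the nsIUnicodeEncoder implementation; DecodeUtf16 is a hypothetical helper, and replacing unpaired surrogates with U+FFFD is an assumed policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

inline bool IsHighSurrogate(uint16_t u) { return (u & 0xFC00) == 0xD800; }
inline bool IsLowSurrogate(uint16_t u) { return (u & 0xFC00) == 0xDC00; }

// Turn UTF-16 code units into code points.
// Unpaired surrogates become U+FFFD (assumed policy).
std::vector<uint32_t> DecodeUtf16(const uint16_t* src, std::size_t len) {
  std::vector<uint32_t> out;
  for (std::size_t i = 0; i < len; ++i) {
    uint16_t u = src[i];
    if (IsHighSurrogate(u) && i + 1 < len && IsLowSurrogate(src[i + 1])) {
      // Standard UTF-16 pair: 10 bits from each half, offset by 0x10000.
      out.push_back(0x10000u + ((uint32_t(u) - 0xD800u) << 10) +
                    (src[i + 1] - 0xDC00u));
      ++i;  // the low surrogate has been consumed
    } else if (IsHighSurrogate(u) || IsLowSurrogate(u)) {
      out.push_back(0xFFFD);
    } else {
      out.push_back(u);
    }
  }
  return out;
}
```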
My idea is not to change the tool.
I think what we should do is keep the mapping a 16-16 mapping
and supply the additional information somewhere else.
Therefore, if we need to convert Unicode to a charset that encodes the BMP and
one additional plane, we need two .uf files for it: one for the BMP and one for
the other plane. We then supply the plane information in the shift table.
Target Milestone: --- → mozilla1.2beta
Frank,
What's your plan for nsICharRepresentable? Currently, it uses 2048 (= 64k / 32)
PRUint32s to store the representability of BMP characters.
Increasing the array size by a factor of 17 (planes 0 through 16) would be
quick, but is not a good idea. Do you want to use a technique similar to what's
used in CCMap (compressed charmap) in gfx? Hmm, we can just switch to it, can't
we (is FillInfo() frozen?)?
The downsides of switching are that it takes longer to build a CCMap (at run
time) than the 'info' array currently used, and that a more extensive patch is
necessary (both in intl and in gfx, wherever 'nsICharRe..' is used). The former
problem can be worked around by storing precompiled CCMaps as statics in the
converter classes instead of building them at run time, wherever
necessary/desirable/possible (see bug 180266; the perl script there has to be
extended to generate extended CCMaps that cover non-BMP characters).
Certainly, there's a trade-off between speed and memory footprint here.

Alternatively, we can extend the 'info' array (without inflating it by a factor
of 17) in the same way CCMap was extended to support plane 1 and beyond. By
this, I mean adding 16 offsets (pointers) to the 'info' arrays for planes 1-16
(or 17 offsets if we want to avoid branching completely, in which case we also
need 2048 zeroed-out PRUint32s), followed by the size and as many arrays of
2048 PRUint32s as there are non-empty non-BMP planes.
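
For scale: one plane is 2048 x 4 bytes = 8 KB, so a flat array for all 17 planes would be 136 KB. The sketch below shows the per-plane idea in simplified form. It uses owning pointers rather than the offsets into one flat block described above, and RepresentableSet is a hypothetical class, not nsICharRepresentable or CCMap.

```cpp
#include <array>
#include <cstdint>
#include <memory>

// One plane: 65536 bits packed into 2048 32-bit words,
// the same shape as the current 'info' array.
using PlaneBits = std::array<uint32_t, 2048>;

// Hypothetical container with one slot per plane (0..16). Empty planes stay
// null, so only non-empty planes cost 8 KB each.
class RepresentableSet {
 public:
  void Set(uint32_t cp) {
    if (cp > 0x10FFFF) return;
    std::unique_ptr<PlaneBits>& plane = planes_[cp >> 16];
    if (!plane) plane.reset(new PlaneBits());  // value-initialized to zero
    uint32_t low = cp & 0xFFFF;
    (*plane)[low >> 5] |= 1u << (low & 0x1F);
  }

  bool Has(uint32_t cp) const {
    if (cp > 0x10FFFF) return false;
    const std::unique_ptr<PlaneBits>& plane = planes_[cp >> 16];
    if (!plane) return false;  // this is the branch the 17th offset would avoid
    uint32_t low = cp & 0xFFFF;
    return (((*plane)[low >> 5] >> (low & 0x1F)) & 1u) != 0;
  }

 private:
  std::array<std::unique_ptr<PlaneBits>, 17> planes_;
};
```

Pointing every empty plane at one shared all-zero block, as the comment suggests, would remove the null check in Has at the cost of one extra 8 KB array.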
 
  
 
Can reach MathML SMP fonts in XML via entity reference (&Ascr;, i.e. 𝒜).
Numeric character references and mathvariants in XML or DOM do not work.
> Can reach MathML SMP fonts in XML via entity reference (&Ascr;, i.e. 𝒜).
> Numeric character references and mathvariants in XML or DOM do not work.

  It looks like this is a separate issue. What font did you use (Code2001?)? Do
NCRs (numeric character references) work in HTML?
Attached file Straight XML.
Attached file Straight HTML.
I used the fonts: TeX's CM Fonts (cmex10, cmmi10, cmr10, cmsy10) and Mathematica
4.1 Fonts (Math1-5 and Math1-5Mono) from
http://www.mozilla.org/projects/mathml/fonts/.

Numeric character references above 0xffff do not work in XML or DOM or
HTML--they show as question marks.  In the DOM I can show that an SMP character
round-trip loses 0x10000.

Please see the three attachments, which show attempts to display a mathematical
script capital A. For me only the &Ascr; entity method works, and this is not
available in the DOM. I am using Windows 98 and Mozilla build 2003051008.
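
One possible reading of "loses 0x10000" is that the code point gets squeezed into a single 16-bit unit somewhere along the round trip. That is an assumption; the cause is not identified in this comment. The sketch contrasts that truncation with the correct two-unit UTF-16 encoding. TruncateTo16 and ToSurrogates are hypothetical names.

```cpp
#include <cstdint>

// Storing a code point in a single 16-bit unit keeps only the low 16 bits.
constexpr uint16_t TruncateTo16(uint32_t cp) {
  return static_cast<uint16_t>(cp & 0xFFFF);
}

// A correct UTF-16 encoding of a supplementary code point uses two units.
struct SurrogatePair {
  uint16_t high;
  uint16_t low;
};

// Precondition: 0x10000 <= cp <= 0x10FFFF.
constexpr SurrogatePair ToSurrogates(uint32_t cp) {
  return SurrogatePair{
      static_cast<uint16_t>(0xD800 + ((cp - 0x10000) >> 10)),
      static_cast<uint16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))};
}
```

For U+1D49C (mathematical script capital A) the truncated value is 0xD49C, which is exactly the original minus 0x10000, while the surrogate pair is 0xD835 0xDC9C.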
> I used the fonts: TeX's CM Fonts (cmex10, cmmi10, cmr10, cmsy10) and Mathematica
> 4.1 Fonts (Math1-5 and Math1-5Mono) from

  Without fixing this bug, these fonts cannot be used to render non-BMP
characters (unless they're in XML and represented with entity names).
If you install a font that actually covers Plane 1 (e.g. Code2001,
http://home.att.net/~jameskass), you'll see that NCRs work fine in both html and
xml. In html, the character entity name for 'Script A' doesn't work, which has
to be filed as a separate bug. The DOM issue has to be filed as another separate
bug. 

Why does 'ASCR' work in xml (not in html)? That's because it's mapped to a PUA
code point for MathML. See

http://lxr.mozilla.org/seamonkey/source/layout/mathml/content/src/mathml.dtd

This file has to be modified once this bug is fixed.
FYI, bug 207919 and bug 207923 were filed for entity names for non-BMP
characters in (X)HTML/XML and fromCharCode() in JS, respectively. 
@netscape.com address doesn't work any more and ftang's aol address is not in
bugzilla.
Assignee: ftang → jshin
Blocks: 230006
Status: ASSIGNED → NEW
what do you think of the problem mentioned in comment #17?
Summary: add surrogte support for nsIUnicodeDecodeHelper, nsIUnicodeEncodeHelper → add non-BMP Unicode (plane 1 and above. surrogate) support to charset encoder/decoder
Target Milestone: mozilla1.2beta → mozilla1.7alpha
Jungshik Shin wrote:
> Additional Comment #17 From Jungshik Shin 2004-01-03 21:30 ------- 
                     ^^^
> what do you think of the problem mentioned in comment #17?
                                                        ^^^
Uhm, now I am stuck in an endless loop... :)
... which comment do you mean ?
Ooops.sorry it's comment #7.
Jungshik Shin wrote:
> Ooops.sorry it's comment #7.

OK... suggestion: First implement a _simple_, _working_ version for release 1.7a,
regardless of whether it makes the zilla bigger or not...
... and then do the fine-tuning and footprint work for release 1.8 cycle.
Blocks: jis0213
Re comment #7: I was wrong to think that every instance of nsIUnicodeEncoder
carries an |info| array (2048 PRUint32s). Callers (of
nsICharRepresentable::FillInfo) have to take care of the memory alloc/dealloc.
And, the only caller is nsCompressedCharMap. So, what has to be done is to make
some changes in the way FillInfo works (or add FillInfoEx) in
intl/uconv/src/(umap.c, nsUCSupport.cpp, nsUnicodeEncoderHelper.cpp, etc). 
Status: NEW → ASSIGNED
*** Bug 320086 has been marked as a duplicate of this bug. ***
Is the PRUint32 info array really needed? If I'm not missing something important, it's almost equivalent to a ccmap, and it is only used inside intl, between nsCompressedCharMap and the nsBasicEncoder subclasses.
Then we can change nsBasicEncoder to set the ccmap directly, rather than going through info arrays.
This way the change will be invisible to gfx, and gfx code can be modified just to use CCMAP_HAS_CHAR_EXT instead of CCMAP_HAS_CHAR...

To make the situation consistent we can introduce more radical changes (which can be hard ;)
Add HasChar and HasChars(PRUnichar* ptr, PRUint32 len) (what's really needed in gfx) to nsCompressedCharMap, and something like GetCompressedCharMap to nsICharRepresentable.
Then gfx code should simply ask for an nsCompressedCharMap instead of handling the CCMAP data block directly.
This way a more algorithmic approach for CCMAPs (like the ones in the mapping tables) will be possible.
Attached patch support higher planes (obsolete)
This experimental patch is even more ad hoc than what I suggested in comment 23.
Besides applying the patch, you need to move nsCompressedCharMap.* from unicharutil/util/ to uconv/util/ and put nsMultiPlaneEncoderSupport.* there, and put the nsUnicodeToTeXMSBM files into uconv/ucvmath.
The patch introduces the following changes.
nsICharRepresentable:
- adds HasChars(UInt16*, UInt32*) for (surrogate-aware) representability testing
nsBasicEncoder:
- stores a ccmap (constructed from its own FillInfo) to implement HasChars
- some Unicode converters are changed to inherit from it
- this way we can replace the old CCMAP_HAS_CHAR(MapperToCCMAP(*),-) with *->HasChars(&-,len)
Attached file multiple plane encoder
This nsMultiPlaneEncoderSupport introduces a parent converter that:
- is initialized with an array of mapping tables and an array of shift tables
- has a child converter (possibly null) for each plane. The children are plain instances of nsTableEncoderSupport that convert from the lower 16 bits of the UTF-32 value and don't know which plane they are associated with. This way we don't have to extend the mapping tables.
- redirects FillInfo to that of the plane 0 child, for backward compatibility
- performs surrogate decomposition during conversion
- has an extended CCMAP
- implements HasChars; that implementation can be moved to nsCompressedCharMap once we have declared an interface to access it from outside intl
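
The parent/child arrangement described in this list can be sketched roughly as follows. This is not the attached nsMultiPlaneEncoderSupport code: PlaneEncoder is a stand-in for an nsTableEncoderSupport child, MultiPlaneEncoder is a hypothetical parent, and the hash-map table is only a placeholder for the real .uf mapping data.

```cpp
#include <array>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <utility>

// Stand-in for a per-plane child: it sees only the low 16 bits of the code
// point and does not know which plane it serves.
struct PlaneEncoder {
  std::unordered_map<uint16_t, uint16_t> table;  // placeholder for a .uf table

  bool Encode(uint16_t low16, uint16_t* out) const {
    auto it = table.find(low16);
    if (it == table.end()) return false;
    *out = it->second;
    return true;
  }
};

// Hypothetical parent: picks the child by plane and forwards the low 16 bits.
class MultiPlaneEncoder {
 public:
  void SetPlane(unsigned plane, std::unique_ptr<PlaneEncoder> child) {
    if (plane < children_.size()) children_[plane] = std::move(child);
  }

  // Returns false when the plane has no child or the child has no mapping.
  bool Encode(uint32_t cp, uint16_t* out) const {
    if (cp > 0x10FFFF) return false;
    const std::unique_ptr<PlaneEncoder>& child = children_[cp >> 16];
    return child && child->Encode(static_cast<uint16_t>(cp & 0xFFFF), out);
  }

 private:
  std::array<std::unique_ptr<PlaneEncoder>, 17> children_;
};
```

The input here is already a code point; in a real converter the surrogate decomposition step would sit in front of Encode.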
Attached file multiplane encoder implementation (obsolete)
Attached file MSBM converter impl
as you can see, it #includes two uf's for plane 0 and 1.
Attached file msbm plane 1 uf
This change contains a build bustage fix and gfx code to illustrate how it will work. config.mk declares a reference to ucvutil because the nsICharRepresentable->CCMAP conversion has moved there.
rbs, could you look into how this works?
Current shortcomings are:
- gfx/win is not yet complete (a surrogate character is accompanied by a "ghost" character).
- On Windows I somehow couldn't import the surrogate macros (IS_HIGH/LOW_SURROGATE) from nsCharTraits.h into nsMultiPlane...cpp.
- The const declarations in nsUnicodeToTeXMSBM.cpp might cause errors. If so, please just drop them.

filenames for previous posts are:
25 nsMultiPlaneEncoderSupport.h
27 nsUnicodeToTeXMSBM.h
28 nsUnicodeToTeXMSBM.cpp
29 msbm10p0.uf
30 msbm10p1.uf
Attachment #207378 - Attachment is obsolete: true
Replaces attachment 207380 from comment 26.
Attachment #207380 - Attachment is obsolete: true
Blocks: 403564
When can this bug be fixed?
QA Contact: amyy → i18n
This is a very old bug and it's not clear to me that any of it is still relevant.  Please indicate what still needs to be done, or else close the bug.
We still need this to support EUC-TW and Big5-HKSCS fully.
(In reply to Masatoshi Kimura [:emk] from comment #35)
> We still need this to support EUC-TW and Big5-HKSCS fully.

Thanks for the update.  What remains to be done before a patch can be reviewed?
Per bug 912470 comment 25, I suggest we WONTFIX this. EUC-TW is no longer used by Firefox and we never supported a non-x- label for it. The code lingers around so that mailnews devs can decide if they want EUC-TW to be their problem. That leaves only unified Big5 from the Encoding Standard, which is different from all other encodings when it comes to astral characters. I think we should implement it from the spec in bug 912470 instead of trying to add generic astral plane support to our existing generic machinery.

(Also, the encoder patch here doesn't seem ready yet anyway. At least it doesn't seem to handle the case where halves of a surrogate pair fall on different sides of a buffer boundary.)
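
The buffer-boundary case mentioned above can be handled by carrying the pending high surrogate across calls. The following is a sketch only, unrelated to the attached patch: SurrogateJoiner is a hypothetical class, and emitting U+FFFD for unpaired surrogates is an assumed policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical streaming front end: remembers a trailing high surrogate so
// that a pair split across two Feed() calls still yields one code point.
class SurrogateJoiner {
 public:
  void Feed(const uint16_t* src, std::size_t len, std::vector<uint32_t>* out) {
    for (std::size_t i = 0; i < len; ++i) {
      uint16_t u = src[i];
      if (pendingHigh_ != 0) {
        uint16_t high = pendingHigh_;
        pendingHigh_ = 0;
        if ((u & 0xFC00) == 0xDC00) {
          out->push_back(0x10000u + ((uint32_t(high) - 0xD800u) << 10) +
                         (u - 0xDC00u));
          continue;
        }
        out->push_back(0xFFFD);  // the held high surrogate was unpaired
      }
      if ((u & 0xFC00) == 0xD800) {
        pendingHigh_ = u;  // wait for the next unit, maybe in the next buffer
      } else if ((u & 0xFC00) == 0xDC00) {
        out->push_back(0xFFFD);  // lone low surrogate
      } else {
        out->push_back(u);
      }
    }
  }

  // Call at end of input: a still-pending high surrogate is unpaired.
  void Finish(std::vector<uint32_t>* out) {
    if (pendingHigh_ != 0) {
      out->push_back(0xFFFD);
      pendingHigh_ = 0;
    }
  }

 private:
  uint16_t pendingHigh_ = 0;  // 0 means nothing is pending
};
```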
I have no objection as long as the MathML fonts stuff (way, way above) has been dealt with some other way, which I suspect it has.
My understanding is that we now only support and only want to support TTF/OTF math fonts that know their own Unicode mapping, but needinfoing fredw to check that we don't need font-specific encoders for math anymore.
Flags: needinfo?(fred.wang)
(In reply to Henri Sivonen (:hsivonen) from comment #39)
> My understanding is that we now only support and only want to support
> TTF/OTF math fonts that know their own Unicode mapping, but needinfoing
> fredw to check that we don't need font-specific encoders for math anymore.

I think Karl took care a long time ago to move the MathML code to use only TTF/OTF fonts. At the moment, the MathML code only accesses characters by Unicode code point (and using non-BMP characters has been possible since we support Asana fonts). The plan for bug 407059 is to add access by glyph index for stretchy characters using the MATH table, but I don't think anything in this bug will help / is necessary. cc'ing Karl.
Flags: needinfo?(fred.wang) → needinfo?(karlt)
(In reply to Henri Sivonen (:hsivonen) from comment #39)
> My understanding is that we now only support and only want to support
> TTF/OTF math fonts that know their own Unicode mapping,

Legacy non-OTF fonts are still supported by graphics code, but I expect (but I'm not sure) that we only support non-Unicode mappings when the platform provides the translation.  Regardless, we don't need to add a non-BMP encoder for fonts or math.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(karlt)
Resolution: --- → WONTFIX