Closed Bug 162431 Opened 23 years ago Closed 11 years ago

add non-BMP Unicode (plane 1 and above. surrogate) support to charset encoder/decoder

Status:

RESOLVED WONTFIX

Milestone:

mozilla1.7alpha

People

(Reporter: ftang, Assigned: jshin1987)

References

Details

(Keywords: intl)

Attachments

(11 files, 2 obsolete files)

working patch, still need to change nsIUnicodeEncodeHelper.cpp (23 years ago, Frank Tang, 4.84 KB, patch)
Sample XML invoking DOM. (22 years ago, Rolf Culver, 1.52 KB, text/xml)
Straight XML. (22 years ago, Rolf Culver, 567 bytes, text/xml)
Straight HTML. (22 years ago, Rolf Culver, 193 bytes, text/html)
support higher planes (19 years ago, YAMASHITA Makoto, 13.48 KB, patch)
multiple plane encoder (19 years ago, YAMASHITA Makoto, 1.31 KB, text/plain)
multiplane encoder implementation (19 years ago, YAMASHITA Makoto, 3.78 KB, text/plain)
MSBM10 (for texture) converter (19 years ago, YAMASHITA Makoto, 2.30 KB, text/plain)
MSBM converter impl (19 years ago, YAMASHITA Makoto, 2.90 KB, text/plain)
msbm (for texture) plane 0 uf (19 years ago, YAMASHITA Makoto, 3.77 KB, text/plain)
msbm plane 1 uf (19 years ago, YAMASHITA Makoto, 3.60 KB, text/plain)
higher plane support intl + gfx/mac & gfx/win (19 years ago, YAMASHITA Makoto, 30.04 KB, patch)
"normalized" nsMultiPlaneEncoderSupport.cpp (19 years ago, YAMASHITA Makoto, 3.83 KB, text/plain)

Frank Tang

Reporter

Description

•

23 years ago

Here is my plan to add the support: use the upper 5 bits of aShiftTable->classID to indicate the plane of the mapping passed with it. The mapping itself is still split into 16-bit to 16-bit mappings. For example, the cns11643 plane 3 to Unicode mapping will be divided into two 16-16 mappings: one maps cns11643 plane 3 to the Unicode BMP, and the other maps cns11643 plane 3 to Unicode plane 2. For the second one, the 16-16 mapping does not include the "2" of the 0x20000 part; that information is passed in through the upper 5 bits of aShiftTable->classID. We need this to fully support euc-tw.
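A minimal sketch of the bit packing described above, for illustration only. It assumes a 16-bit class ID whose upper 5 bits hold the plane and whose lower 11 bits hold the original class value; the real width and layout of aShiftTable->classID are not stated in this bug, and the names PackClassId, PlaneOf, and FullCodePoint are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout (not taken from the patch): 16-bit class ID,
// upper 5 bits = Unicode plane (0..16), lower 11 bits = original class.
constexpr unsigned kPlaneShift = 11;
constexpr uint16_t kClassMask = (1u << kPlaneShift) - 1;  // 0x07FF

// Combine a base class value with a plane number.
inline uint16_t PackClassId(uint16_t baseClass, unsigned plane) {
  assert(plane <= 16 && baseClass <= kClassMask);
  return static_cast<uint16_t>((plane << kPlaneShift) | baseClass);
}

// Recover the plane number from a packed class ID.
inline unsigned PlaneOf(uint16_t classId) { return classId >> kPlaneShift; }

// Rebuild the full code point from the 16-bit table result and the plane
// carried by the class ID. The 16-16 table never stores the plane itself.
inline uint32_t FullCodePoint(uint16_t classId, uint16_t mapped16) {
  return (static_cast<uint32_t>(PlaneOf(classId)) << 16) | mapped16;
}
```

With this layout, a table entry that yields 0x0123 under a class ID tagged with plane 2 stands for U+20123, so the mapping tables can stay 16 bits wide.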

Frank Tang

Reporter

Comment 1

•

23 years ago

Also see bug 162364, which is about updating the cns11643 plane 3-7 and 15 tables. We now also generate separate tables for Extension B (see the cnsIRGTp*ExtB.txt, cns*extb.uf, and cns*extb.ut files).

Summary: add surrogte support for nsIUnicodeDeocdeHelper, nsIUnicodeEncodeHelper → add surrogte support for nsIUnicodeDeocdeHelper, nsIUnicodeEncodeHelper

Frank Tang

Reporter

Comment 2

•

23 years ago

Attached patch working patch, still need to change nsIUnicodeEncodeHelper.cpp — Details — Splinter Review

Not complete; I have not compiled it yet.

Frank Tang

Reporter

Updated

•

23 years ago

OS: Windows XP → All

Hardware: PC → All

Boris Zbarsky [:bzbarsky]

Updated

•

23 years ago

Summary: add surrogte support for nsIUnicodeDeocdeHelper, nsIUnicodeEncodeHelper → add surrogte support for nsIUnicodeDecodeHelper, nsIUnicodeEncodeHelper

rbs

Comment 3

•

23 years ago

*** Bug 162403 has been marked as a duplicate of this bug. ***

Rui Xu

Updated

•

23 years ago

Keywords: intl

QA Contact: ruixu → ylong

Frank Tang

Reporter

Comment 4

•

23 years ago

Assigning. rbs, why do you need this? I need it to support euc-tw (cns 11643 planes 3-7). What is your need? That will help me justify the priority of this bug.

Status: NEW → ASSIGNED

rbs

Comment 5

•

23 years ago

Unicode 3.2 and the MathML spec include a bunch of math characters in plane 1. And, as usual, these math characters cannot be found in ordinary fonts. So in order to support these surrogate characters, a number of things are needed:

- support of surrogate characters in the system (without this foundation, there is not even a starting point).
- then, support for the mapping between surrogate characters and the glyphs within the special math fonts.

Currently, nsIUnicodeEncode::Convert() is used to do the mapping, but this breaks if the input contains surrogate pairs. So to fix the problem, nsIUnicodeEncode::Convert() needs to be able to take an input that may contain surrogate pairs, and do a re-map according to a factory pre-built mapping table.

Needless to say, the tools used to factory-build the mapping tables need to be updated to support surrogate input as well. Otherwise, it won't be possible to easily set up the internal data sets that are necessary to do the mapping.
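For reference, the surrogate-pair arithmetic that a surrogate-aware Convert() would have to perform before any table lookup can be sketched as follows. This is a standalone illustration, not code from any patch here; ToCodePoints is a hypothetical helper, and passing lone surrogates through unchanged is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

inline bool IsHighSurrogate(char16_t u) { return (u & 0xFC00) == 0xD800; }
inline bool IsLowSurrogate(char16_t u) { return (u & 0xFC00) == 0xDC00; }

// Convert UTF-16 code units to code points, joining well-formed surrogate
// pairs. Lone surrogates are passed through as-is (a simplification).
std::vector<uint32_t> ToCodePoints(const char16_t* src, size_t len) {
  assert(src || len == 0);
  std::vector<uint32_t> out;
  for (size_t i = 0; i < len; ++i) {
    uint32_t c = src[i];
    if (IsHighSurrogate(src[i]) && i + 1 < len && IsLowSurrogate(src[i + 1])) {
      // 10 payload bits from each half, offset by 0x10000.
      c = 0x10000 + ((c - 0xD800) << 10) + (src[i + 1] - 0xDC00);
      ++i;  // the low surrogate has been consumed
    }
    out.push_back(c);
  }
  return out;
}
```

For example, the pair 0xD835 0xDC9C joins to U+1D49C, the mathematical script capital A discussed later in this bug.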

Frank Tang

Reporter

Comment 6

•

23 years ago

My idea is not to change the tool. I think what we should do is keep the mapping a 16-16 mapping and supply the additional information somewhere else. Therefore, if we need to convert Unicode to a charset that encodes the BMP and one additional plane, we need two uf files for it: one for the BMP and one for the other plane. We then supply the plane information in the shift table.

Frank Tang

Reporter

Updated

•

23 years ago

Target Milestone: --- → mozilla1.2beta

Jungshik Shin

Assignee

Comment 7

•

22 years ago

Frank, what's your plan for nsICharRepresentable? Currently, it uses 2048 (= 64k / 32) PRUint32s to store the representability of BMP characters. Increasing the array size by a factor of 17 (planes 0 through 16) would be quick, but is not such a good idea.

Do you want to use a technique similar to what's used in CCMap (compressed charmap) in gfx? Hmm, we can just switch to it, can't we (is FillInfo() frozen?)? The downsides of switching are that it takes longer to build a CCMap (at run time) than the 'info' array currently used, and that a more extensive patch is necessary (both in intl and in gfx, where 'nsICharRe..' is used). The former problem can be worked around by storing precompiled CCMaps as statics in the converter class instead of building them up at run time, wherever necessary/desirable/possible (see bug 180266; the perl script there has to be extended to generate extended CCMaps to cover non-BMP characters). Certainly, there's a trade-off between speed and memory footprint here.

Alternatively, we can extend the 'info' array (without inflating it by a factor of 17) as was done to extend CCMap to support plane 1 and beyond. By this, I mean adding 16 offsets (pointers) to the 'info' arrays for planes 1-16 (or 17 if we want to avoid branching completely, in which case we also need 2048 zeroed-out PRUint32s), followed by the size and as many arrays of 2048 PRUint32s as there are non-empty non-BMP planes.
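To make the trade-off concrete: one plane's bitmap is 2048 × 4 bytes = 8 KiB, so a flat array covering all 17 planes would take 136 KiB per instance. The sketch below shows the "one pointer per plane" alternative in rough form, allocating a bitmap only for planes that actually contain characters. ExtInfo and its members are hypothetical names, and this is not the layout used by CCMap or by FillInfo().

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

constexpr size_t kWordsPerPlane = 2048;  // 65536 bits / 32 bits per word
using PlaneBits = std::array<uint32_t, kWordsPerPlane>;

// Hypothetical extended "info": one optional bitmap per plane 0..16.
// Planes with no representable characters cost only a null pointer.
struct ExtInfo {
  std::array<std::unique_ptr<PlaneBits>, 17> planes;

  // Mark a code point as representable.
  void Set(uint32_t cp) {
    assert(cp <= 0x10FFFF);
    auto& p = planes[cp >> 16];
    if (!p) p = std::make_unique<PlaneBits>();  // value-initialized to zero
    uint32_t low = cp & 0xFFFF;
    (*p)[low >> 5] |= 1u << (low & 31);
  }

  // Test representability; empty planes answer "no" without a bitmap.
  bool Has(uint32_t cp) const {
    if (cp > 0x10FFFF) return false;
    const auto& p = planes[cp >> 16];
    if (!p) return false;
    uint32_t low = cp & 0xFFFF;
    return (((*p)[low >> 5] >> (low & 31)) & 1u) != 0;
  }
};
```

This version branches on a null plane pointer; the branch-free variant mentioned above would instead point every empty plane at one shared zeroed bitmap.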

Rolf Culver

Comment 8

•

22 years ago

Can reach MathML SMP fonts in XML via entity reference (&Ascr;). Numeric character references and mathvariants in XML or DOM do not work.

Jungshik Shin

Assignee

Comment 9

•

22 years ago

> Can reach MathML SMP fonts in XML via entity reference (&Ascr;).
> Numeric character references and mathvariants in XML or DOM do not work.

It looks like this is a separate issue. What font did you use (Code2001?)? Do NCRs (numeric character references) work in HTML?

Rolf Culver

Comment 10

•

22 years ago

Attached file Sample XML invoking DOM. — Details

Rolf Culver

Comment 11

•

22 years ago

Attached file Straight XML. — Details

Rolf Culver

Comment 12

•

22 years ago

Attached file Straight HTML. — Details

Rolf Culver

Comment 13

•

22 years ago

I used the fonts: TeX's CM Fonts (cmex10, cmmi10, cmr10, cmsy10) and Mathematica 4.1 Fonts (Math1-5 and Math1-5Mono) from http://www.mozilla.org/projects/mathml/fonts/. Numeric character references above 0xffff do not work in XML or DOM or HTML--they show as question marks. In the DOM I can show that an SMP character round-trip loses 0x10000. Please see the three attachments which show attempts to display a mathematical script letter A. For me only the &Ascr; method works, and this is not available in the DOM. I am using Windows 98 and Mozilla build 2003051008.

Jungshik Shin

Assignee

Comment 14

•

22 years ago

> I used the fonts: TeX's CM Fonts (cmex10, cmmi10, cmr10, cmsy10) and Mathematica
> 4.1 Fonts (Math1-5 and Math1-5Mono) from

Without fixing this bug, these fonts cannot be used to render non-BMP characters (unless they're in XML and represented with entity names). If you install a font that actually covers Plane 1 (e.g. Code2001, http://home.att.net/~jameskass), you'll see that NCRs work fine in both HTML and XML. In HTML, the character entity name for 'Script A' doesn't work, which has to be filed as a separate bug. The DOM issue has to be filed as another separate bug.

Why does 'ASCR' work in XML (not in HTML)? That's because it's mapped to a PUA code point for MathML. See http://lxr.mozilla.org/seamonkey/source/layout/mathml/content/src/mathml.dtd
This file has to be modified once this bug is fixed.

Jungshik Shin

Assignee

Comment 15

•

22 years ago

FYI, bug 207919 and bug 207923 were filed for entity names for non-BMP characters in (X)HTML/XML and fromCharCode() in JS, respectively.

Jungshik Shin

Assignee

Comment 16

•

21 years ago

@netscape.com address doesn't work any more and ftang's aol address is not in bugzilla.

Assignee: ftang → jshin

Blocks: 230006

Status: ASSIGNED → NEW

Jungshik Shin

Assignee

Comment 17

•

21 years ago

what do you think of the problem mentioned in comment #17?

Summary: add surrogte support for nsIUnicodeDecodeHelper, nsIUnicodeEncodeHelper → add non-BMP Unicode (plane 1 and above. surrogate) support to charset encoder/decoder

Target Milestone: mozilla1.2beta → mozilla1.7alpha

Roland Mainz

Comment 18

•

21 years ago

Jungshik Shin wrote:
> Additional Comment #17 From Jungshik Shin 2004-01-03 21:30 -------
> what do you think of the problem mentioned in comment #17?

Uhm, now I am stuck in an endless loop... :) ... which comment do you mean?

Jungshik Shin

Assignee

Comment 19

•

21 years ago

Ooops.sorry it's comment #7.

Roland Mainz

Comment 20

•

21 years ago

Jungshik Shin wrote:
> Ooops.sorry it's comment #7.

OK... suggestion: First implement a _simple_, _working_ version for release 1.7a, regardless of whether it makes the zilla bigger or not... and then do the fine-tuning and footprint work for the release 1.8 cycle.

Jungshik Shin

Assignee

Updated

•

21 years ago

Blocks: jis0213

Jungshik Shin

Assignee

Comment 21

•

21 years ago

Re comment #7: I was wrong to think that every instance of nsIUnicodeEncoder carries the |info| array (2048 PRUint32s). Callers of nsICharRepresentable::FillInfo have to take care of the memory alloc/dealloc, and the only caller is nsCompressedCharMap. So, what has to be done is to make some changes in the way FillInfo works (or add FillInfoEx) in intl/uconv/src/ (umap.c, nsUCSupport.cpp, nsUnicodeEncoderHelper.cpp, etc).

Status: NEW → ASSIGNED

rbs

Comment 22

•

19 years ago

*** Bug 320086 has been marked as a duplicate of this bug. ***

YAMASHITA Makoto

Comment 23

•

19 years ago

Is the PRUint32 info array really needed? If I'm not missing something important, it's almost equivalent to a ccmap, and it is only used inside intl, between nsCompressedCharMap and the nsBasicEncoder subclasses. So we can change the behavior of nsBasicEncoder to set the ccmap directly, rather than the info arrays. This way the change will be invisible to gfx, and the gfx code can be modified to just use CCMAP_HAS_CHAR_EXT instead of CCMAP_HAS_CHAR...

To make the situation consistent we can introduce more radical changes (which can be hard ;) Implement HasChar and HasChars(PRUnichar* ptr, PRUint32 len) (which is what gfx really needs) in nsCompressedCharMap, and add something like a GetCompressedCharMap to nsICharRepresentable. Then the gfx code should simply ask for an nsCompressedCharMap instead of handling the CCMAP data block directly. This way a more algorithmic approach for CCMAPs (like the ones in the mapping tables) will be possible.
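A rough sketch of what a surrogate-aware HasChar/HasChars pair could look like from the caller's side. This is an illustration under stated assumptions: CharMapSketch is a hypothetical class, a std::set stands in for the compressed char map (the real CCMap layout is not modeled), and char16_t/size_t replace PRUnichar/PRUint32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Stand-in for a compressed char map; only the query interface matters here.
class CharMapSketch {
 public:
  void Add(uint32_t cp) { mChars.insert(cp); }

  // Single code point query (BMP or non-BMP).
  bool HasChar(uint32_t cp) const { return mChars.count(cp) != 0; }

  // True only if every code point in the UTF-16 run is representable.
  // Surrogate pairs are joined first, so callers can pass raw text.
  bool HasChars(const char16_t* s, size_t len) const {
    assert(s || len == 0);
    for (size_t i = 0; i < len; ++i) {
      uint32_t cp = s[i];
      bool pair = (s[i] & 0xFC00) == 0xD800 && i + 1 < len &&
                  (s[i + 1] & 0xFC00) == 0xDC00;
      if (pair) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
        ++i;
      }
      if (!HasChar(cp)) return false;
    }
    return true;
  }

 private:
  std::set<uint32_t> mChars;
};
```

The point of the interface is that gfx never touches the underlying data block, so the storage behind HasChar can change without affecting callers.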

YAMASHITA Makoto

Comment 24

•

19 years ago

Attached patch support higher planes (obsolete) — Details — Splinter Review

This experimental patch is even more ad hoc than what I suggested in comment 23. Besides applying the patch, you need to move nsCompressedCharMap.* from unicharutil/util/ to uconv/util/ and put nsMultiPlaneEncoderSupport.* there, and put the nsUnicodeToTeXMSBM files into uconv/ucvmath.

The patch introduces:

nsICharRepresentable
- added HasChars(UInt16*, UInt32*) for (surrogate-aware) representability testing

nsBasicEncoder
- stores a ccmap (constructed from its own FillInfo) to implement HasChars
- some unicode converters are changed to inherit from this
- this way we can replace the old CCMAP_HAS_CHAR(MapperToCCMAP(*),-) with *->HasChars(&-,len)

YAMASHITA Makoto

Comment 25

•

19 years ago

Attached file multiple plane encoder — Details

This nsMultiPlaneEncoderSupport introduces a parent converter, which:
- is initialized with an array of mapping tables and an array of shift tables
- has a child converter (possibly null) for each plane. The children are just instances of nsTableEncoderSupport; they convert from the lower 16 bits of the UTF-32 value and don't know which plane they are associated with. This way we don't have to extend the mapping tables.
- redirects FillInfo to that of the plane 0 child, for backward compatibility
- performs surrogate decomposition during conversion
- has an extended CCMAP
- implements HasChars, which can be moved to nsCompressedCharMap once we have declared an interface to access it from outside intl
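The parent/child arrangement described above can be sketched roughly as follows. This is not the attached nsMultiPlaneEncoderSupport code: PlaneEncoder and MultiPlaneEncoder are hypothetical stand-ins, a std::map replaces the uf mapping tables, and each character is assumed to encode to a single output byte.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Stand-in for a per-plane table encoder. It sees only the low 16 bits
// and has no idea which plane it serves.
struct PlaneEncoder {
  std::map<uint16_t, unsigned char> table;
  bool Encode(uint16_t low, std::string& out) const {
    auto it = table.find(low);
    if (it == table.end()) return false;
    out.push_back(static_cast<char>(it->second));
    return true;
  }
};

// Parent converter: joins surrogate pairs, picks the child by plane,
// and hands the child the lower 16 bits of the code point.
struct MultiPlaneEncoder {
  std::array<std::unique_ptr<PlaneEncoder>, 17> children;  // null = no mapping

  bool Convert(const char16_t* s, size_t len, std::string& out) const {
    assert(s || len == 0);
    for (size_t i = 0; i < len; ++i) {
      uint32_t cp = s[i];
      if ((s[i] & 0xFC00) == 0xD800 && i + 1 < len &&
          (s[i + 1] & 0xFC00) == 0xDC00) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
        ++i;
      }
      const auto& child = children[cp >> 16];
      if (!child) return false;  // plane not covered by this charset
      if (!child->Encode(static_cast<uint16_t>(cp & 0xFFFF), out)) return false;
    }
    return true;
  }
};
```

Because each child only ever receives 16-bit values, the existing 16-16 table format can be reused unchanged for every plane.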

YAMASHITA Makoto

Comment 26

•

19 years ago

Attached file multiplane encoder implementation (obsolete) — Details

YAMASHITA Makoto

Comment 27

•

19 years ago

Attached file MSBM10 (for texture) converter — Details

YAMASHITA Makoto

Comment 28

•

19 years ago

Attached file MSBM converter impl — Details

as you can see, it #includes two uf's for plane 0 and 1.

YAMASHITA Makoto

Comment 29

•

19 years ago

Attached file msbm (for texture) plane 0 uf — Details

YAMASHITA Makoto

Comment 30

•

19 years ago

Attached file msbm plane 1 uf — Details

YAMASHITA Makoto

Comment 31

•

19 years ago

Attached patch higher plane support intl + gfx/mac & gfx/win — Details — Splinter Review

This change contains a build bustage fix, plus gfx code to illustrate how it will work. config.mk declares a reference to ucvutil because the nsICharRepresentable->CMAP conversion has moved there. rbs, could you look into how this works?

Current shortcomings are:
- gfx/win is not yet complete (a surrogate char is accompanied by a "ghost" char).
- On Windows, I somehow couldn't import the surrogate macros (IS_HIGH/LOW_SURROGATE) from nsCharTraits.h into nsMultiPlane...cpp.
- const declarations in nsUnicodeToTeXMSBM.cpp might cause errors. Please just drop them in that case.

Filenames for the previous posts are:
25 nsMultiPlaneEncoderSupport.h
27 nsUnicodeToTeXMSBM.h
28 nsUnicodeToTeXMSBM.cpp
29 msbm10p0.uf
30 msbm10p1.uf

Attachment #207378 - Attachment is obsolete: true

YAMASHITA Makoto

Comment 32

•

19 years ago

Attached file "normalized" nsMultiPlaneEncoderSupport.cpp — Details

Replaces attachment 207380 from comment 26.

Attachment #207380 - Attachment is obsolete: true

Simon Montagu :smontagu

Updated

•

17 years ago

Blocks: 403564

Ho Fung Wong

Comment 33

•

16 years ago

When can this bug be fixed?

Phil Ringnalda (:philor)

Updated

•

16 years ago

QA Contact: amyy → i18n

Zack Weinberg (:zwol)

Comment 34

•

13 years ago

This is a very old bug and it's not clear to me that any of it is still relevant. Please indicate what still needs to be done, or else close the bug.

Masatoshi Kimura [:emk]

Comment 35

•

13 years ago

We still need this to support EUC-TW and Big5-HKSCS fully.

Zack Weinberg (:zwol)

Comment 36

•

13 years ago

(In reply to Masatoshi Kimura [:emk] from comment #35)
> We still need this to support EUC-TW and Big5-HKSCS fully.

Thanks for the update. What remains to be done before a patch can be reviewed?

Henri Sivonen (:hsivonen)

Comment 37

•

11 years ago

Per bug 912470 comment 25, I suggest we WONTFIX this. EUC-TW is no longer used by Firefox and we never supported a non-x- label for it. The code lingers around so that mailnews devs can decide if they want EUC-TW to be their problem. That leaves only unified Big5 from the Encoding Standard, which is different from all other encodings when it comes to astral characters. I think we should implement it from the spec in bug 912470 instead of trying to add generic astral plane support to our existing generic machinery. (Also, the encoder patch here doesn't seem ready yet anyway. At least it doesn't seem to handle the case where halves of a surrogate pair fall on different sides of a buffer boundary.)
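The buffer-boundary case mentioned above can be illustrated with a small stateful sketch: an unpaired high surrogate at the end of one input buffer is held back until the next buffer arrives. SurrogateJoiner is a hypothetical class, not code from the patch in this bug, and emitting U+FFFD for unpaired surrogates is an assumption about the desired error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Joins surrogate pairs across successive Feed() calls.
class SurrogateJoiner {
 public:
  void Feed(const char16_t* s, size_t len, std::vector<uint32_t>& out) {
    assert(s || len == 0);
    for (size_t i = 0; i < len; ++i) {
      char16_t u = s[i];
      if (mPendingHigh) {
        char16_t high = mPendingHigh;
        mPendingHigh = 0;
        if ((u & 0xFC00) == 0xDC00) {
          out.push_back(0x10000 +
                        ((static_cast<uint32_t>(high) - 0xD800) << 10) +
                        (u - 0xDC00));
          continue;
        }
        out.push_back(0xFFFD);  // the held high surrogate was unpaired
      }
      if ((u & 0xFC00) == 0xD800) {
        mPendingHigh = u;  // its partner may be in the next buffer
      } else if ((u & 0xFC00) == 0xDC00) {
        out.push_back(0xFFFD);  // low surrogate with no preceding high
      } else {
        out.push_back(u);
      }
    }
  }

  // Call once at end of stream to flush a dangling high surrogate.
  void Finish(std::vector<uint32_t>& out) {
    if (mPendingHigh) {
      out.push_back(0xFFFD);
      mPendingHigh = 0;
    }
  }

 private:
  char16_t mPendingHigh = 0;  // 0 means nothing is pending
};
```

Feeding {0x0041, 0xD835} and then {0xDC9C} produces U+0041 followed by U+1D49C, the same result as feeding all three code units in one call.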

Zack Weinberg (:zwol)

Comment 38

•

11 years ago

I have no objection as long as the MathML fonts stuff (way, way above) has been dealt with some other way, which I suspect it has.

Henri Sivonen (:hsivonen)

Comment 39

•

11 years ago

My understanding is that we now only support and only want to support TTF/OTF math fonts that know their own Unicode mapping, but needinfoing fredw to check that we don't need font-specific encoders for math anymore.

Flags: needinfo?(fred.wang)

Frédéric Wang (:fredw)

Comment 40

•

11 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #39)
> My understanding is that we now only support and only want to support
> TTF/OTF math fonts that know their own Unicode mapping, but needinfoing
> fredw to check that we don't need font-specific encoders for math anymore.

I think Karl took care a long time ago to move the MathML code to use only TTF/OTF fonts. At the moment, the MathML code only accesses characters by Unicode code point (and using non-BMP characters has been possible since we support Asana fonts). The plan for bug 407059 is to add access by glyph index for stretchy characters using the MATH table, but I don't think anything in this bug will help / is necessary. cc'ing Karl.

Flags: needinfo?(fred.wang) → needinfo?(karlt)

Karl Tomlinson (:karlt)

Comment 41

•

11 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #39)
> My understanding is that we now only support and only want to support
> TTF/OTF math fonts that know their own Unicode mapping,

Legacy non-OTF fonts are still supported by graphics code, but I expect (but I'm not sure) that we only support non-Unicode mappings when the platform provides the translation. Regardless, we don't need to add a non-BMP encoder for fonts or math.

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Flags: needinfo?(karlt)

Resolution: --- → WONTFIX
