Open Bug 284394 Opened 19 years ago Updated 2 years ago

XSLT <xsl:number format=""> does not support numbering schemes for Unicode characters with a decimal digit value of 1, other than 0x31

Categories

(Core :: XSLT, defect)

People

(Reporter: FrankTang, Assigned: peterv)

References

(Depends on 1 open bug, )

Details

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

The XSLT spec says:
"
Any token where the last character has a decimal digit value of 1 (as specified
in the Unicode character property database), and the Unicode value of preceding
characters is one less than the Unicode value of the last character generates a
decimal representation of the number where each number is at least as long as
the format token. Thus, a format token 1 generates the sequence 1 2 ... 10 11 12
..., and a format token 01 generates the sequence 01 02 ... 09 10 11 12 ... 99
100 101."

Also, according to the Unicode 4.0 database at
http://www.unicode.org/Public/4.0-Update1/extracted/DerivedNumericValues-4.0.1.txt

the following 51 Unicode code points have the numeric value 1:
0031          ; 1.0 # Nd       DIGIT ONE
00B9          ; 1.0 # No       SUPERSCRIPT ONE
0661          ; 1.0 # Nd       ARABIC-INDIC DIGIT ONE
06F1          ; 1.0 # Nd       EXTENDED ARABIC-INDIC DIGIT ONE
0967          ; 1.0 # Nd       DEVANAGARI DIGIT ONE
09E7          ; 1.0 # Nd       BENGALI DIGIT ONE
09F4          ; 1.0 # No       BENGALI CURRENCY NUMERATOR ONE
0A67          ; 1.0 # Nd       GURMUKHI DIGIT ONE
0AE7          ; 1.0 # Nd       GUJARATI DIGIT ONE
0B67          ; 1.0 # Nd       ORIYA DIGIT ONE
0BE7          ; 1.0 # Nd       TAMIL DIGIT ONE
0C67          ; 1.0 # Nd       TELUGU DIGIT ONE
0CE7          ; 1.0 # Nd       KANNADA DIGIT ONE
0D67          ; 1.0 # Nd       MALAYALAM DIGIT ONE
0E51          ; 1.0 # Nd       THAI DIGIT ONE
0ED1          ; 1.0 # Nd       LAO DIGIT ONE
0F21          ; 1.0 # Nd       TIBETAN DIGIT ONE
1041          ; 1.0 # Nd       MYANMAR DIGIT ONE
1369          ; 1.0 # Nd       ETHIOPIC DIGIT ONE
17E1          ; 1.0 # Nd       KHMER DIGIT ONE
17F1          ; 1.0 # No       KHMER SYMBOL LEK ATTAK MUOY
1811          ; 1.0 # Nd       MONGOLIAN DIGIT ONE
1947          ; 1.0 # Nd       LIMBU DIGIT ONE
2081          ; 1.0 # No       SUBSCRIPT ONE
215F          ; 1.0 # No       FRACTION NUMERATOR ONE
2160          ; 1.0 # Nl       ROMAN NUMERAL ONE
2170          ; 1.0 # Nl       SMALL ROMAN NUMERAL ONE
2460          ; 1.0 # No       CIRCLED DIGIT ONE
2474          ; 1.0 # No       PARENTHESIZED DIGIT ONE
2488          ; 1.0 # No       DIGIT ONE FULL STOP
24F5          ; 1.0 # No       DOUBLE CIRCLED DIGIT ONE
2776          ; 1.0 # No       DINGBAT NEGATIVE CIRCLED DIGIT ONE
2780          ; 1.0 # No       DINGBAT CIRCLED SANS-SERIF DIGIT ONE
278A          ; 1.0 # No       DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE
3021          ; 1.0 # Nl       HANGZHOU NUMERAL ONE
3192          ; 1.0 # No       IDEOGRAPHIC ANNOTATION ONE MARK
3220          ; 1.0 # No       PARENTHESIZED IDEOGRAPH ONE
3280          ; 1.0 # No       CIRCLED IDEOGRAPH ONE
4E00          ; 1.0 # Lo       CJK UNIFIED IDEOGRAPH-4E00
58F1          ; 1.0 # Lo       CJK UNIFIED IDEOGRAPH-58F1
58F9          ; 1.0 # Lo       CJK UNIFIED IDEOGRAPH-58F9
5F0C          ; 1.0 # Lo       CJK UNIFIED IDEOGRAPH-5F0C
FF11          ; 1.0 # Nd       FULLWIDTH DIGIT ONE
10107         ; 1.0 # No       AEGEAN NUMBER ONE
10320         ; 1.0 # No       OLD ITALIC NUMERAL ONE
104A1         ; 1.0 # Nd       OSMANYA DIGIT ONE
1D7CF         ; 1.0 # Nd       MATHEMATICAL BOLD DIGIT ONE
1D7D9         ; 1.0 # Nd       MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
1D7E3         ; 1.0 # Nd       MATHEMATICAL SANS-SERIF DIGIT ONE
1D7ED         ; 1.0 # Nd       MATHEMATICAL SANS-SERIF BOLD DIGIT ONE
1D7F7         ; 1.0 # Nd       MATHEMATICAL MONOSPACE DIGIT ONE

# Total code points: 51

But currently, the XSLT implementation only knows how to handle the value 0x31.

See
http://lxr.mozilla.org/seamonkey/source/extensions/transformiix/source/xslt/txXSLTNumberCounters.cpp

To fix this, we first need to make the txDecimalCounter constructor take a
Unicode code point as the base character for '1'.
Then we need to change the implementation of
void txDecimalCounter::appendNumber(PRInt32 aNumber, nsAString& aDest)
to convert the number into decimal digits based on the code point for '1'
that was passed in.
Finally, we need to change the switch statement in
txFormattedCounter::getCounterFor to consider those 50 characters for decimal.
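As a rough illustration of the first two steps, here is a minimal, self-contained sketch in standard C++. It does not use the real transformiix types: DecimalCounter, appendCodePoint, the minLength parameter, and std::u16string (standing in for nsAString) are all hypothetical. It also assumes that the script's digits 0 through 9 occupy ten consecutive code points, so that ZERO is the code point just below ONE.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: append one code point to a UTF-16 string,
// emitting a surrogate pair for code points above U+FFFF.
void appendCodePoint(char32_t cp, std::u16string& dest) {
    if (cp < 0x10000) {
        dest.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        dest.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        dest.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Sketch of a decimal counter parameterized on the code point for digit ONE.
// ASSUMPTION: digits 0..9 are contiguous, so ZERO is one below ONE.
class DecimalCounter {
public:
    DecimalCounter(char32_t oneDigit, int minLength)
        : mZero(oneDigit - 1), mMinLength(minLength) {}

    void appendNumber(uint32_t number, std::u16string& dest) const {
        uint32_t digits[10];  // a 32-bit value has at most 10 decimal digits
        int count = 0;
        // Collect digit values, least significant first.
        do {
            digits[count++] = number % 10;
            number /= 10;
        } while (number != 0);
        // Pad with the script's own zero up to the minimum length.
        for (int i = count; i < mMinLength; ++i) {
            appendCodePoint(mZero, dest);
        }
        // Emit the digits, most significant first.
        while (count > 0) {
            appendCodePoint(mZero + digits[--count], dest);
        }
    }

private:
    char32_t mZero;
    int mMinLength;
};
```

With oneDigit = 0x0E51 (THAI DIGIT ONE) and minLength = 2, the number 7 would come out as U+0E50 U+0E57.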



Reproducible: Always
In particular,
"format="&#x0E51;" specifies numbering with Thai digits",
as mentioned in the XSLT spec, does not work.
I don't recommend using the decimal rule for characters outside the Nd
category.

To test it, take your normal <xsl:number> test case and change
format="1" to
format="&#x0E51;"
or
format="&#x0661;"
or
format="&#x09E7;"

etc.
Also see bug 284395.
Looking at the link mentioned in comment 0, this seems non-trivial. The problem
is that apparently not all of these have their digits in consecutive order in
the Unicode table. For example, SUPERSCRIPT ONE is 00B9, but SUPERSCRIPT ZERO
is 2070 and SUPERSCRIPT TWO is 00B2. Also, are all of these base 10?

Anyhow, my point is that we shouldn't just hack this into transformiix. This
needs support from intl. I filed bug 284420 on that.

Also, I think the XSLT spec is wrong here.
# and the Unicode value of preceding characters is one less than the Unicode
# value of the last character
doesn't seem to work for the superscript numbers mentioned above. Or am I
misunderstanding the term "Unicode value"?
Status: UNCONFIRMED → NEW
No longer depends on: 284420
Ever confirmed: true
I understand "a decimal digit value of 1" to mean that the character should have
the General Category "Nd" and "1" in field 6. That includes 26 characters, if
my grepping is correct:

0031;DIGIT ONE
0661;ARABIC-INDIC DIGIT ONE
06F1;EXTENDED ARABIC-INDIC DIGIT ONE
0967;DEVANAGARI DIGIT ONE
09E7;BENGALI DIGIT ONE
0A67;GURMUKHI DIGIT ONE
0AE7;GUJARATI DIGIT ONE
0B67;ORIYA DIGIT ONE
0BE7;TAMIL DIGIT ONE
0C67;TELUGU DIGIT ONE
0CE7;KANNADA DIGIT ONE
0D67;MALAYALAM DIGIT ONE
0E51;THAI DIGIT ONE
0ED1;LAO DIGIT ONE
0F21;TIBETAN DIGIT ONE
1041;MYANMAR DIGIT ONE
17E1;KHMER DIGIT ONE
1811;MONGOLIAN DIGIT ONE
1947;LIMBU DIGIT ONE
FF11;FULLWIDTH DIGIT ONE
104A1;OSMANYA DIGIT ONE
1D7CF;MATHEMATICAL BOLD DIGIT ONE
1D7D9;MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
1D7E3;MATHEMATICAL SANS-SERIF DIGIT ONE
1D7ED;MATHEMATICAL SANS-SERIF BOLD DIGIT ONE
1D7F7;MATHEMATICAL MONOSPACE DIGIT ONE
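Read that way, the spec's format token rule can be checked mechanically. The sketch below is only an illustration: isDecimalFormatToken and kNdDigitOnes are made-up names, the table is simply the 26 code points listed above, and the token is assumed to be already decoded into 32-bit code points.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// The 26 Nd code points listed above (digit value 1), in ascending order.
static const char32_t kNdDigitOnes[] = {
    0x0031, 0x0661, 0x06F1, 0x0967, 0x09E7, 0x0A67, 0x0AE7, 0x0B67, 0x0BE7,
    0x0C67, 0x0CE7, 0x0D67, 0x0E51, 0x0ED1, 0x0F21, 0x1041, 0x17E1, 0x1811,
    0x1947, 0xFF11, 0x104A1, 0x1D7CF, 0x1D7D9, 0x1D7E3, 0x1D7ED, 0x1D7F7,
};

// Returns true if the token ends in one of the listed digit ONE characters
// and every preceding character is exactly one code point below it,
// e.g. "1", "01", "001".
bool isDecimalFormatToken(const std::u32string& token) {
    if (token.empty()) {
        return false;
    }
    const char32_t one = token.back();
    if (!std::binary_search(std::begin(kNdDigitOnes),
                            std::end(kNdDigitOnes), one)) {
        return false;
    }
    return std::all_of(token.begin(), token.end() - 1,
                       [one](char32_t c) { return c == one - 1; });
}
```

Under this check "01" and U+0E50 U+0E51 qualify, while U+2070 U+00B9 (SUPERSCRIPT ZERO, SUPERSCRIPT ONE) does not, because SUPERSCRIPT ONE is not in the Nd list.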
(In reply to comment #7)
Simon: I think you are right, except for
0BE7          ; 1.0 # Nd       TAMIL DIGIT ONE
TAMIL DIGIT ZERO does not exist in Unicode.

I think it is a good idea to refactor the number generation part into intl.
Well, if it's just those chars and they are all consecutive unichar values and
they are all base 10, then it should be fine to do in transformiix. Though it
would be nice to have a function like:

PRBool IsZeroDigit(PRUnichar c)

Though I'm not sure how feasible it is to handle non-BMP characters. I don't
know how much support Mozilla has for that in general. Are there string
iterators that can iterate over UTF-16 strings and expose decoded unichar
values? Do we even have a datatype like PRUnichar that is 32-bit?
Btw, how do you write 10 or 101 in tamil numbers if you don't have a zero digit?
(In reply to comment #10)
> Btw, how do you write 10 or 101 in tamil numbers if you don't have a zero digit?

They do have 
TAMIL NUMBER TEN U+0BF0
TAMIL NUMBER ONE HUNDRED U+0BF1
TAMIL NUMBER ONE THOUSAND U+0BF2

How they are used is not clear. I suggest you stay away from them for now.

(In reply to comment #11)
> (In reply to comment #10)
> > Btw, how do you write 10 or 101 in tamil numbers if you don't have a zero digit?
> 
> They do have 
> TAMIL NUMBER TEN U+0BF0
> TAMIL NUMBER ONE HUNDRED U+0BF1
> TAMIL NUMBER ONE THOUSAND U+0BF2
> 
> How they are used is not clear. I suggest you stay away from them for now.
> 
> 
If you REALLY REALLY care about the Tamil number system, read
http://weblogs.asp.net/michkap/archive/2005/01/24/359347.aspx

"It is an additive and positional system (unlike Roman numerals, there is no
subtraction involved) that has no zero but includes characters for 10, 100, and
1000.

In the traditional system the number 3,782 would be represented as ௩௲௭௱௮௰௨
(literally Three-Thousand(s)-Seven-Hundread(s)-Eight-Ten(s)-Two).

At least since the early 1800s, however, usage of the Tamil numerals as digits
has been more and more common. Thus the number 3,782 would often be represented
as ௩௭௮௨ (literally 3782). "
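For the curious, the traditional scheme in the quote can be sketched in a few lines of C++. This is only an illustration based on the single 3,782 example: traditionalTamil is a made-up name, the range is limited to 1..9999, and leaving out the multiplier when it is 1 is an assumption that the quoted text does not confirm.

```cpp
#include <string>

// Sketch: traditional (additive) Tamil numerals for 1..9999.
// Returns an empty string outside that range.
std::u16string traditionalTamil(unsigned n) {
    // Place value and the code unit of its symbol:
    // U+0BF2 ONE THOUSAND, U+0BF1 ONE HUNDRED, U+0BF0 TEN.
    static const struct { unsigned value; char16_t symbol; } kPlaces[] = {
        {1000, 0x0BF2}, {100, 0x0BF1}, {10, 0x0BF0},
    };
    std::u16string out;
    if (n == 0 || n > 9999) {
        return out;
    }
    for (const auto& place : kPlaces) {
        const unsigned d = n / place.value;
        n %= place.value;
        if (d == 0) {
            continue;
        }
        // ASSUMPTION: a multiplier of 1 is left out (10 is just TEN).
        if (d > 1) {
            // TAMIL DIGIT ONE is U+0BE7, so digit d is U+0BE6 + d.
            out.push_back(static_cast<char16_t>(0x0BE6 + d));
        }
        out.push_back(place.symbol);
    }
    if (n > 0) {
        out.push_back(static_cast<char16_t>(0x0BE6 + n));  // units digit
    }
    return out;
}
```

For 3782 this produces U+0BE9 U+0BF2 U+0BED U+0BF1 U+0BEE U+0BF0 U+0BE8, which matches the traditional form given in the quote.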
(In reply to comment #8)
> Simon: I think you are right, except for
> 0BE7          ; 1.0 # Nd       TAMIL DIGIT ONE
> TAMIL DIGIT ZERO does not exist in Unicode. 

It's being added (at 0BE6) in Unicode 4.1, due to be released this month.
(In reply to comment #9)

> PRBool IsZeroDigit(PRUnichar c)
> 
> Though I'm not sure how feasible it is to handle non-BMP characters.

  |PRBool IsZeroDigit(PRUint32 c)| would be better.

> I don't know how much support mozilla has for that in general. 
> Are there stringiterators that
> can iterate UTF-16 strings and expose decoded unichar values? 

  Currently, it's done 'manually' (check whether the current 'char' is surrogate
or not, etc...) as necessary in a few places. It might be a good idea to add
this iterator to 'nsAString'. 

> Do we even have a datatype like PRUnichar that is 32bit?

  PRUint32  :-) We don't have UCS4 string classes (perhaps, we'll never have...)
except that we have ns(Value)Array.
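The "manual" surrogate handling mentioned above could look roughly like the following. This is a standalone sketch in standard C++, not existing Mozilla code: decodeUtf16 is a hypothetical name, std::u16string stands in for nsAString, and char32_t plays the role of PRUint32. Passing unpaired surrogates through unchanged is just one possible policy.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-16 string into 32-bit code points.
// Unpaired surrogates are passed through unchanged.
std::vector<char32_t> decodeUtf16(const std::u16string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t c = s[i];
        const bool isHigh = c >= 0xD800 && c <= 0xDBFF;
        if (isHigh && i + 1 < s.size()) {
            const char32_t low = s[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                // Combine the surrogate pair into one code point.
                c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        out.push_back(c);
    }
    return out;
}
```

A format token containing OSMANYA DIGIT ONE (U+104A1, stored as the pair D801 DCA1) would then decode to a single 32-bit value that a function like IsZeroDigit(PRUint32) could examine.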

Adding the editor of the XSLT spec to Cc.
QA Contact: keith → xslt
Severity: normal → S3