Closed Bug 136292 Opened 18 years ago Closed 11 years ago

jsdIValue doesn't handle Unicode string values at all


(Other Applications Graveyard :: Venkman JS Debugger, defect)






(Reporter: pwilson, Assigned: timeless)




(3 files)

When trying to encode the XML character classes as RegExps, I use (cut down):
const Digit    = "[\u0030-\u0039] | [\u0660-\u0669] | [\u06F0-\u06F9] | \
                  [\u0966-\u096F] | [\u09E6-\u09EF] | [\u0A66-\u0A6F] | \
                  [\u0AE6-\u0AEF] | [\u0B66-\u0B6F] | [\u0BE7-\u0BEF] | \
                  [\u0C66-\u0C6F] | [\u0CE6-\u0CEF] | [\u0D66-\u0D6F] | \
                  [\u0E50-\u0E59] | [\u0ED0-\u0ED9] | [\u0F20-\u0F29]"; 

Venkman Trace:
var R = new RegExp("(" + Digit + ")+", "gm")
$[37] = [void] void
A = R.toString()
$[38] = [string] "/([0-9] | [`-i] | [\xF0-\xF9] | [f-o] | [\xE6-\xEF] | [f-o] | 
[\xE6-\xEF] | [f-o] | [\xE7-\xEF] | [f-o] | [\xE6-\xEF] | [f-o] | [P-Y] | 
[\xD0-\xD9] | [ -)])+/gm"
A = R.test("123")
$[39] = [boolean] false

1. The output of the regexp back to a string is clearly wrong.
2. The "123" input should return true.

If I reduce the Digit expression to just the first range, then it works.
Sorry, I forgot to note the build: 2002040214 Win2000.
Of course it helps if you enter the regexp correctly - but not much.

If you remove the spaces from Digit:
const Digit 	= "[\u0030-\u0039]|[\u0660-\u0669]|[\u06F0-\u06F9]|" +
                 "[\u0966-\u096F]|[\u09E6-\u09EF]|[\u0A66-\u0A6F]|" +
                 "[\u0AE6-\u0AEF]|[\u0B66-\u0B6F]|[\u0BE7-\u0BEF]|" +
                 "[\u0C66-\u0C6F]|[\u0CE6-\u0CEF]|[\u0D66-\u0D6F]|" +
The toString() method still chops the high-order bytes.

The test() call is more interesting:
A = R.test("123") returns true (also true for longer digit strings).
A = R.test("12") returns false
A = R.test("21") returns true
A = R.test("1") returns false
A = R.test("0") returns true

Attached file HTML testcase
I'm not seeing any problem at all with RegExp when I run the HTML testcase
with Mozilla trunk binaries 20020406xx WinNT, 20020407xx Linux. For example,
here is my output on WinNT:

var R = new RegExp("(" + Digit + ")+", "gm")    where:

Digit =

R.toString() =

R.test("123") = true
R.test("123") = false

R.test("12") = true
R.test("12") = false

R.test("21") = true
R.test("21") = false

R.test("1") = true
R.test("1") = false

R.test("0") = true
R.test("0") = false
Of course the visual representation of the Unicode characters depends
on your environment. The way these characters look after I posted them
to the Bugzilla server is not the way they look when I run the HTML
testcase locally. The real point is, |R.toString()| is identical to
|Digit|, only wrapped like this: /(Digit)+/gm

So if there is a problem, it's not with R.toString(), but with the 
way the string |Digit| is being represented. Again, this depends on 
your environment. I put a <META> tag in the HTML testcase to set 
content="text/html; charset=utf-8". In Mozilla, this will force
View ---> Character Coding to equal "Unicode (UTF-8)". However, the
visual representation will still depend on what fonts you have, etc.

To test this, try the testcase in IE6 and in Mozilla. For me on WinNT,
the output of the testcase is identical in both browsers.
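For what it's worth, the engine's behavior can also be checked without relying on fonts or character rendering at all, by inspecting the code units directly. A minimal check (plain browser/shell JS, not part of the original testcase):

```javascript
// Verify, independently of how the characters render on screen, that the
// engine keeps the full 16-bit code units in RegExp.source.
var r = new RegExp("[\u0660-\u0669]");            // ARABIC-INDIC digit range
console.log(r.source.charCodeAt(1).toString(16)); // "660" -- high byte intact
console.log(r.test("\u0665"));                    // true  -- ARABIC-INDIC DIGIT FIVE matches
console.log(r.test("5"));                         // false -- ASCII "5" is outside the range
```

If the engine were really truncating the characters, `r.source.charCodeAt(1)` would come back as 0x60 rather than 0x660.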

Note I have expanded your example by applying R.test() twice to each
test string. This shows why you have alternating results of true and false.
The reason is, you have the global flag set. Whenever the global flag
is set, each successful match advances the property |R.lastIndex| to
the position just past the end of that match. If the next match fails,
R.lastIndex gets reset to 0, i.e. to the beginning of the string.

So for example, after R.test("123") = true, we have R.lastIndex = 3.
When we do R.test("123") again, the search will start from position 3
in the string, i.e. after the end of the string! Thus there is no 
match, we have R.test("123") = false, and R.lastIndex gets set to 0.

Thus, when we try the next test, R.test("12"), we begin the search from
the beginning of the string again, and we get a successful match again.
This is what happens when you repeatedly try R.test() when the global
flag is set. See bug 98409 (INVALID) for another example of this behavior.
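A minimal sketch of this statefulness in plain JS (using a simple `\d+` pattern rather than the full Digit expression):

```javascript
// test() with the g flag is stateful: lastIndex advances on a match
// and resets to 0 on a failed match.
var r = /\d+/g;

console.log(r.test("123"), r.lastIndex); // true 3  -- matched, lastIndex moved past the match
console.log(r.test("123"), r.lastIndex); // false 0 -- search started at index 3, past the end
console.log(r.test("123"), r.lastIndex); // true 3  -- fresh match from index 0 again
```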
I'm going to mark this one Invalid for now, and ask Peter to confirm
this by running the testcase in Mozilla and IE6. Note: NN4.7 and IE4
cannot handle the regexp, so don't try the testcase with them.

What I'm expecting is this:

1. Your output in Mozilla is the same as your output in IE6

2. In each browser, |R.toString()| is identical to |Digit|
   up to the wrapping /()+/gm

If so, we have no bug in RegExp; it's then a question of how the
browser is displaying Unicode strings in your environment.

If we do have a problem with that, I'll have to reassign this to the
International component to assess. The way Unicode strings are handled
in the browser is not the task of the JS Engine. Or perhaps it has to do
with how Venkman handles Unicode strings. Here is my output from Venkman:

$[2] = [string]

$[1] = [string]

Again, we see they are identical strings except for the wrapping /()+/gm.
Here is the output of the same testcase in the standalone JS shell, 
when run in my Cygwin shell on WinNT:

Digit =

R.toString() =

Peter: if you agree there is no bug here, please mark this bug 
"Verified". If you disagree, please state why, and reopen the bug.
Then I will reassign it to International or Venkman - thanks.
Closed: 18 years ago
Resolution: --- → INVALID
> 1. Your output in Mozilla is the same as your output in IE6

I guess I can't ask for that exactly; my IE6 is showing empty-box
characters where Mozilla is showing [0-9]. You can see the difference
in the way the two browsers handle Unicode characters on your system
simply by trying this javascript: URL in each one:

                 javascript: alert('\u0F29')
Note: I tried the |evald| command in Venkman on this.

evald "[\u0030-\u0039]|[\u0660-\u0669]|[\u06F0-\u06F9]|" + 
"[\u0966-\u096F]|[\u09E6-\u09EF]|[\u0A66-\u0A6F]|" + 
"[\u0AE6-\u0AEF]|[\u0B66-\u0B6F]|[\u0BE7-\u0BEF]|" + 
"[\u0C66-\u0C6F]|[\u0CE6-\u0CEF]|[\u0D66-\u0D6F]|" + 

The output I get looks the same as the browser output when I run
the HTML testcase above. I shouldn't dare post this to Bugzilla,
but here goes:


The point is, this is NOT what we see when debugging the above
HTML testcase on the same string (see Comment #6): 


Reopening bug and assigning to Venkman to ask why -
Resolution: INVALID → ---
Reassigning. If you open Venkman and do |evald| on a Unicode character:

evald '\u06F0';    you get:

But if Venkman is running a script, and you are stopped at a breakpoint,
then if you use |eval| on the Unicode character, you get this:

eval '\u06F0'

This might mean that Venkman is at the mercy of the browser; in other
words, the browser treats the Unicode character as '\xF0', so that's
what Venkman has to say. But that doesn't seem to be true, by trying
this little testcase I will attach below:

<SCRIPT language="JavaScript">
var Digit = "\u06F0";

function test()
{
  document.write("\\" + "u06F0 = " + Digit);
  alert("\\" + "u06F0 = " + Digit);
  alert(Digit === '\xF0');
}
</SCRIPT>

So why is Venkman coming up with this \xHH value for a \uHHHH character?
This is probably an International-component issue, but I thought I'd 
ask Rob first -
Assignee: rogerl → rginda
Component: JavaScript Engine → JavaScript Debugger
Ever confirmed: true
QA Contact: pschwartau → caillon
Attached file Reduced HTML testcase
Changing OS: Win2K ---> All.

Resummarizing from: "RegExp does not handle Unicode char ranges correctly"
                to: "Does Venkman handle Unicode characters correctly?"
OS: Windows 2000 → All
Summary: RegExp does not handle unicode char ranges correctly → Does Venkman handle Unicode characters correctly?
Again, note when you post Unicode characters to the Bugzilla server,
they get represented differently than how they appear locally.

The expression |&#1776;| in Comment #9 above is a decimal value which
translates in hex to 0x6F0, which corresponds to our assignment
|var Digit = "\u06F0";|.

By contrast, the hex value \xF0, which is what Venkman shows in
debug mode, corresponds to the decimal value 240.
I was not aware that RegExps are stateful when the g flag is used. Hence my
confusion about the apparently erroneous values. I can verify that the RegExp
is working as expected.

----------------------------- SUMMARY -----------------------------------

In Venkman, |evald| and |eval| are giving different output on
Unicode characters. The reduced HTML testcase uses '\u06F0'.

evald '\u06F0'  ---> a graphical representation of character '\u06F0'
eval  '\u06F0'  ---> '\xF0'

In other words, |eval| is chopping off the high byte. I was tempted
to blame the browser for causing this problem by providing Venkman
with the truncated information. But the browser seems to retain the
high byte, as seen when we run the reduced HTML testcase.

So why is Venkman missing the high byte? Note I tried this in 
Venkman; recall that 1776 is the decimal representation of 0x6F0:

evald '\u06F0'.charCodeAt(0)  ---> 1776  
eval  '\u06F0'.charCodeAt(0)  ---> 1776

evald '\u06F0'.charAt(0)  ---> a graphical representation of character '\u06F0'
eval  '\u06F0'.charAt(0)  ---> '\xF0' 

So |evald|, |eval| both agree on charCodeAt(), but differ on charAt();
|eval| seems to drop the high byte.
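The symptom can be simulated in plain JS. This only reproduces what the truncation looks like (masking each UTF-16 code unit to its low byte); it is not a claim about where in Venkman the bytes are actually lost:

```javascript
// Simulate "dropping the high byte" of each UTF-16 code unit.
// Illustration of the symptom only, not Venkman's actual code path.
function dropHighBytes(s) {
  var out = "";
  for (var i = 0; i < s.length; i++)
    out += String.fromCharCode(s.charCodeAt(i) & 0xFF);
  return out;
}

var digit = "\u06F0";                         // code unit 0x6F0 (decimal 1776)
console.log(digit.charCodeAt(0));             // 1776 -- what both |eval| and |evald| report
console.log(dropHighBytes(digit) === "\xF0"); // true -- the value |eval| displays
```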
Depends on: 138720
Product: Core → Other Applications

Can you see the problem yet?

At a guess, this needs to be upped to |wstring| in the IDL (or is one of the other seemingly random string types better?) and wchar** or whatever in the C++.
Hardware: PC → All
Summary: Does Venkman handle Unicode characters correctly? → jsdIValue doesn't handle Unicode string values at all
It'd probably be better to add a getUnicodeValue() method than whack the signature of getStringValue().  For backward compatibility's sake.
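In XPIDL terms, that additive route might look roughly like this. This is a hypothetical sketch of the suggestion only — the member names and layout are illustrative, not the real jsdIValue.idl, and the patch that eventually landed (attachment 352710) took a different route, converting the IDL to UTF-8 strings:

// Hypothetical sketch -- not the actual jsdIValue interface or the landed patch.
// Base interface and other members elided for illustration.
interface jsdIValue
{
    // existing accessor, kept so current C++ callers don't break;
    // |string| is 8-bit, which is where the high bytes get lost
    readonly attribute string stringValue;

    // proposed addition: |wstring| carries full 16-bit code units
    readonly attribute wstring unicodeStringValue;
};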
I'm concerned at the number of other |string|s in the IDL; does JSD actually have any C++ users? (I always think of it as JS-only, although I know it isn't technically.)
Brendan, I share James' concern, and am wondering if it would be possible/feasible to change the idl to use wstring wherever possible - in the 1.9 timeframe.

After 1.9 we can stop caring, given Tamarin will do away with all these interfaces (or require extensive tinkering to them) - as far as I know, anyway. Which is why I'd like to get this in before 1.9.
Flags: blocking1.9?
Blocks: 335098
So, Brendan, I guess you're really really busy, but could you please take a look at this? It's been two months and we're close to shipping betas and all that... :-(
Flags: blocking1.9?
QA Contact: caillon → venkman
I hit this while chasing certificates; this also fixes the other bug about nulls.
Assignee: rginda → timeless
Attachment #352710 - Flags: review?(jst)
Comment on attachment 352710 [details] [diff] [review]
convert idl to use utf8 strings to make xpconnect happy

- In jsdStackFrame::GetFunctionName():

+    _rval.Assign(JSD_GetNameForStackFrame(mCx, mThreadState,
+                                                mStackFrameInfo));

Fix the second line argument indentation.

Attachment #352710 - Flags: superreview+
Attachment #352710 - Flags: review?(jst)
Attachment #352710 - Flags: review+
Closed: 18 years ago → 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.2a1
Blocks: 482809
It would've been nice if you hadn't changed IIDs of interfaces that did not actually change :/
Any chance we can roll back the IID changes for interfaces that didn't change in 1.9.2 and beyond? It'd help to simplify the code I'm working on now (working with jsdIScriptHook, jsdICallHook, etc)
IID should not change if the interface didn't. File a followup bug and make it block this bug.

Depends on: 519276
Depends on: 700302
Product: Other Applications → Other Applications Graveyard