Closed
Bug 136292
Opened 23 years ago
Closed 16 years ago
jsdIValue doesn't handle Unicode string values at all
Categories
(Other Applications Graveyard :: Venkman JS Debugger, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
mozilla1.9.2a1
People
(Reporter: pwilson, Assigned: timeless)
Attachments (3 files)
1.99 KB, text/html
541 bytes, text/html
25.89 KB, patch (jst: review+, jst: superreview+)
When trying to encode the XML character classes as RegExps I use (cutdown):
const Digit = "[\u0030-\u0039] | [\u0660-\u0669] | [\u06F0-\u06F9] | \
[\u0966-\u096F] | [\u09E6-\u09EF] | [\u0A66-\u0A6F] | \
[\u0AE6-\u0AEF] | [\u0B66-\u0B6F] | [\u0BE7-\u0BEF] | \
[\u0C66-\u0C6F] | [\u0CE6-\u0CEF] | [\u0D66-\u0D6F] | \
[\u0E50-\u0E59] | [\u0ED0-\u0ED9] | [\u0F20-\u0F29]";
Venkman Trace:
var R = new RegExp("(" + Digit + ")+", "gm")
$[37] = [void] void
A = R.toString()
$[38] = [string] "/([0-9] | [`-i] | [\xF0-\xF9] | [f-o] | [\xE6-\xEF] | [f-o] | [\xE6-\xEF] | [f-o] | [\xE7-\xEF] | [f-o] | [\xE6-\xEF] | [f-o] | [P-Y] | [\xD0-\xD9] | [ -)])+/gm"
A = R.test("123")
$[39] = [boolean] false
1. The output of the regexp back to a string is clearly wrong.
2. The "123" input should return true.
If I reduce the Digit expression to just the first range then it works
correctly.
Reporter | ||
Comment 1•23 years ago
Sorry, I forgot to note the build: 2002040214 Win2000
Reporter | ||
Comment 2•23 years ago
Of course it helps if you enter the regexp correctly - but not much.
If you remove the spaces from Digit:
const Digit = "[\u0030-\u0039]|[\u0660-\u0669]|[\u06F0-\u06F9]|" +
"[\u0966-\u096F]|[\u09E6-\u09EF]|[\u0A66-\u0A6F]|" +
"[\u0AE6-\u0AEF]|[\u0B66-\u0B6F]|[\u0BE7-\u0BEF]|" +
"[\u0C66-\u0C6F]|[\u0CE6-\u0CEF]|[\u0D66-\u0D6F]|" +
"[\u0E50-\u0E59]|[\u0ED0-\u0ED9]|[\u0F20-\u0F29]";
The toString() method still chops the high order bytes.
The test() call is more interesting:
A = R.test("123") returns true (also true for longer digit strings).
A = R.test("12") returns false
A = R.test("21") returns true
A = R.test("1") returns false
A = R.test("0") returns true
Comment 3•23 years ago
Comment 4•23 years ago
I'm not seeing any problem at all with RegExp when I run the HTML testcase
with Mozilla trunk binaries 20020406xx WinNT, 20020407xx Linux. For example,
here is my output on WinNT:
var R = new RegExp("(" + Digit + ")+", "gm") where:
Digit =
"[0-9]|[٠-٩]|[۰-۹]|[०-९]|[০-৯]|[੦-੯]|[૦-૯]|[୦-୯]|[௧-௯]|[౦-౯]|[೦-೯]|[൦-൯]|[๐-๙]|[໐-໙]|[༠-༩]"
R.toString() =
"/([0-9]|[٠-٩]|[۰-۹]|[०-९]|[০-৯]|[੦-੯]|[૦-૯]|[୦-୯]|[௧-௯]|[౦-౯]|[೦-೯]|[൦-൯]|[๐-๙]|[໐-໙]|[༠-༩])+/gm"
R.test("123") = true
R.test("123") = false
R.test("12") = true
R.test("12") = false
R.test("21") = true
R.test("21") = false
R.test("1") = true
R.test("1") = false
R.test("0") = true
R.test("0") = false
Comment 5•23 years ago
Of course the visual representation of the Unicode characters depends
on your environment. The way these characters look after I posted them
to the Bugzilla server is not the way they look when I run the HTML
testcase locally. The real point is, |R.toString()| is identical to
|Digit|, only wrapped like this: /(Digit)+/gm
So if there is a problem, it's not with R.toString(), but with the
way the string |Digit| is being represented. Again, this depends on
your environment. I put a <META> tag in the HTML testcase to set
content="text/html; charset=utf-8". In Mozilla, this will force
View ---> Character Coding to equal "Unicode (UTF-8)". However, the
visual representation will still depend on what fonts you have, etc.
To test this, try the testcase in IE6 and in Mozilla. For me on WinNT,
the output of the testcase is identical in both browsers.
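The claim that |R.toString()| is just |Digit| wrapped in /()+/gm can be checked without relying on how any font or console draws the characters. The following is a minimal sketch, assuming a modern JavaScript engine and a shortened Digit string; the names `wrapped` and `identical` are illustrative only:

```javascript
// Shortened version of the Digit string from this bug (three ranges only).
const Digit = "[\u0030-\u0039]|[\u0660-\u0669]|[\u06F0-\u06F9]";
const R = new RegExp("(" + Digit + ")+", "gm");

// Build the expected text by hand and compare code unit for code unit,
// so the result does not depend on how the characters are displayed.
const wrapped = "/(" + Digit + ")+/gm";
const identical = R.toString() === wrapped;

console.log(identical);
// The high byte of the first Arabic-Indic digit is still present in the pattern.
console.log(R.source.indexOf("\u0660") !== -1);
```

Comparing strings with === sidesteps the display question entirely, which is the distinction this comment is drawing.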
Note I have expanded your example by applying R.test() twice to each
test string. This shows why you have alternating results of true and false.
The reason is, you have the global flag set. Whenever the global flag
is set, a successful match sets the property |R.lastIndex| to the position
just after the end of that match, and the next search starts from there.
If the match fails, R.lastIndex gets reset to 0, i.e. to the beginning
of the string.
So for example, after R.test("123") = true, we have R.lastIndex = 3.
When we do R.test("123") again, the search will start from position 3
in the string, i.e. after the end of the string! Thus there is no
match, we have R.test("123") = false, and R.lastIndex gets set to 0.
Thus, when we try the next test, R.test("12"), we begin the search from
the beginning of the string again, and we get a successful match again.
This is what happens when you repeatedly try R.test() when the global
flag is set. See bug 98409 (INVALID) for another example of this behavior.
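The lastIndex behavior described above can be reproduced with the reporter's own sequence of inputs from Comment 2. This is a minimal sketch, assuming a modern JavaScript engine and an ASCII-only pattern for brevity; the variable names are illustrative:

```javascript
// Global regexps remember where the last match ended (lastIndex).
const R = /([0-9])+/gm;
const inputs = ["123", "12", "21", "1", "0"];

// Reusing the same regexp: every other call starts past the end of the
// new input, fails, and resets lastIndex to 0.
const stateful = inputs.map((s) => R.test(s));
console.log(stateful); // [ true, false, true, false, true ]

// Resetting lastIndex before each call removes the dependence on history.
const stateless = inputs.map((s) => {
  R.lastIndex = 0;
  return R.test(s);
});
console.log(stateless); // [ true, true, true, true, true ]
```

Dropping the g flag when only a yes/no answer is needed has the same effect as resetting lastIndex by hand.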
Comment 6•23 years ago
I'm going to mark this one Invalid for now, and ask Peter to confirm
this by running the testcase in Mozilla and IE6. Note: NN4.7 and IE4
cannot handle the regexp, so don't try the testcase with them.
What I'm expecting is this:
1. Your output in Mozilla is the same as your output in IE6
2. In each browser, |R.toString()| is identical to |Digit|
up to the wrapping /()+/gm
If so, we have no bug in RegExp; it's then a question of how the
browser is displaying Unicode strings in your environment.
If we do have a problem with that, I'll have to reassign this to the
International component to assess. The way Unicode strings are handled
in the browser is not the task of the JS Engine. Or perhaps it has to do
with how Venkman handles Unicode strings. Here is my output from Venkman:
Digit
$[2] = [string]
"[0-9]|[`-i]|[\xF0-\xF9]|[f-o]|[\xE6-\xEF]|[f-o]|[\xE6-\xEF]|[f-o]|[\xE7-\xEF]|[f-o]|[\xE6-\xEF]|[f-o]|[P-Y]|[\xD0-\xD9]|[ -)]"
R.toString()
$[1] = [string]
"/([0-9]|[`-i]|[\xF0-\xF9]|[f-o]|[\xE6-\xEF]|[f-o]|[\xE6-\xEF]|[f-o]|[\xE7-\xEF]|[f-o]|[\xE6-\xEF]|[f-o]|[P-Y]|[\xD0-\xD9]|[ -)])+/gm"
Again, we see they are identical strings except for the wrapping /()+/gm.
Here is the output of the same testcase in the standalone JS shell,
when run in my Cygwin shell on WinNT:
Digit =
"[0-9]|[`-i]|[≡-∙]|[f-o]|[µ-∩]|[f-o]|[µ-∩]|[f-o]|[τ-∩]|[f-o]|[µ-∩]|[f-o]|[P-Y]|[╨-┘]|[ -)]"
R.toString() =
"/([0-9]|[`-i]|[≡-∙]|[f-o]|[µ-∩]|[f-o]|[µ-∩]|[f-o]|[τ-∩]|[f-o]|[µ-∩]|[f-o]|[P-Y]|[╨-┘]|[ -)])+/gm"
Peter: if you agree there is no bug here, please mark this bug
"Verified". If you disagree, please state why, and reopen the bug.
Then I will reassign it to International or Venkman - thanks.
Status: UNCONFIRMED → RESOLVED
Closed: 23 years ago
Resolution: --- → INVALID
Comment 7•23 years ago
> 1. Your output in Mozilla is the same as your output in IE6
I guess I can't ask for that exactly; my IE6 is showing empty-box
characters where Mozilla is showing [0-9]. You can see the difference
in the way the two browsers handle Unicode characters on your system
simply by trying this javascript: URL in each one:
javascript: alert('\u0F29')
Comment 8•23 years ago
Note: I tried the |evald| command in Venkman on this.
evald "[\u0030-\u0039]|[\u0660-\u0669]|[\u06F0-\u06F9]|" +
"[\u0966-\u096F]|[\u09E6-\u09EF]|[\u0A66-\u0A6F]|" +
"[\u0AE6-\u0AEF]|[\u0B66-\u0B6F]|[\u0BE7-\u0BEF]|" +
"[\u0C66-\u0C6F]|[\u0CE6-\u0CEF]|[\u0D66-\u0D6F]|" +
"[\u0E50-\u0E59]|[\u0ED0-\u0ED9]|[\u0F20-\u0F29]";
The output I get looks the same as the browser output when I run
the HTML testcase above. I shouldn't dare post this to Bugzilla,
but here goes:
[0-9]|[٠-٩]|[۰-۹]|[०-९]|[০-৯]|[੦-੯]|[૦-૯]|[୦-୯]|[௧-௯]|[౦-౯]|[೦-೯]|[൦-൯]|[๐-๙]|[໐-໙]|[༠-༩]
The point is, this is NOT what we see when debugging the above
HTML testcase on the same string (see Comment #6):
"[0-9]|[`-i]|[\xF0-\xF9]|[f-o]|[\xE6-\xEF]|[f-o]|[\xE6-\xEF]|[f-o]|[\xE7-\xEF]|[f-o]|[\xE6-\xEF]|[f-o]|[P-Y]|[\xD0-\xD9]|[ -)]"
Reopening bug and assigning to Venkman to ask why -
Status: RESOLVED → UNCONFIRMED
Resolution: INVALID → ---
Comment 9•23 years ago
Reassigning. If you open Venkman and do |evald| on a Unicode character:
evald '\u06F0'; you get:
۰
But if Venkman is running a script, and you are stopped at a breakpoint,
then if you use |eval| on the Unicode character, you get this:
eval '\u06F0'
"\xF0"
This might mean that Venkman is at the mercy of the browser; in other
words, the browser treats the Unicode character as '\xF0', so that's
what Venkman has to say. But that doesn't seem to be true, judging by
this little testcase, which I will attach below:
<SCRIPT language="JavaScript">
var Digit = "\u06F0";
function test()
{
document.write("\\" + "u06F0 = " + Digit);
alert("\\" + "u06F0 = " + Digit);
alert(Digit === '\xF0');
}
test();
</script>
So why is Venkman coming up with this \xHH value for a \uHHHH character?
This is probably an International-component issue, but I thought I'd
ask Rob first -
Assignee: rogerl → rginda
Status: UNCONFIRMED → NEW
Component: JavaScript Engine → JavaScript Debugger
Ever confirmed: true
QA Contact: pschwartau → caillon
Comment 10•23 years ago
Comment 11•23 years ago
Changing OS: Win2K ---> All.
Resummarizing from: "RegExp does not handle Unicode char ranges correctly"
to: "Does Venkman handle Unicode characters correctly?"
OS: Windows 2000 → All
Summary: RegExp does not handle unicode char ranges correctly → Does Venkman handle Unicode characters correctly?
Comment 12•23 years ago
Again, note when you post Unicode characters to the Bugzilla server,
they get represented differently than how they appear locally.
The expression |۰| in Comment #9 above stands for the decimal character
code 1776, which translates in hex to 0x6F0, which corresponds to our
assignment |var Digit = "\u06F0";|.
By contrast, the hex value \xF0, which is what Venkman shows in
debug mode, corresponds to the decimal value 240.
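The arithmetic behind these two values is easy to check. The sketch below, assuming a modern JavaScript engine, shows that 0x6F0 is decimal 1776 and that keeping only its low 8 bits gives 0xF0, i.e. decimal 240; the constant names are illustrative:

```javascript
// The character assigned in the testcase.
const codeUnit = "\u06F0".charCodeAt(0);

console.log(codeUnit);              // 1776
console.log(codeUnit.toString(16)); // "6f0"

// Masking with 0xFF keeps only the low byte, discarding the 0x06 high byte.
const lowByte = codeUnit & 0xFF;
console.log(lowByte);                       // 240
console.log(lowByte.toString(16));          // "f0"
console.log("\xF0".charCodeAt(0) === lowByte); // true
```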
Reporter | ||
Comment 13•23 years ago
I was not aware that RegExps are stateful when the g flag is used. Hence my
confusion about the apparently erroneous values. I can verify that the RegExp is
working as expected.
Comment 14•23 years ago
----------------------------- SUMMARY -----------------------------------
In Venkman, |evald| and |eval| are giving different output on
Unicode characters. The reduced HTML testcase uses '\u06F0'.
evald '\u06F0' ---> a graphical representation of character '\u06F0'
eval '\u06F0' ---> '\xF0'
In other words, |eval| is chopping off the high byte. I was tempted
to blame the browser for causing this problem by providing Venkman
with the truncated information. But the browser seems to retain the
high byte, as seen when we run the reduced HTML testcase.
So why is Venkman missing the high byte? Note I tried this in
Venkman; recall that 1776 is the decimal representation of 0x6F0:
evald '\u06F0'.charCodeAt(0) ---> 1776
eval '\u06F0'.charCodeAt(0) ---> 1776
evald '\u06F0'.charAt(0) ---> a graphical representation of character '\u06F0'
eval '\u06F0'.charAt(0) ---> '\xF0'
So |evald|, |eval| both agree on charCodeAt(), but differ on charAt();
|eval| seems to drop the high byte.
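One way to model "dropping the high byte" is to mask every UTF-16 code unit of the string down to 8 bits. The sketch below does exactly that; `dropHighBytes` is a hypothetical helper written for illustration, not code from Venkman or JSD, and whether the debugger narrows strings in precisely this way is an assumption. It does reproduce the ranges seen in the Venkman output quoted in this bug:

```javascript
// Hypothetical helper: keep only the low 8 bits of each UTF-16 code unit.
function dropHighBytes(s) {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    out += String.fromCharCode(s.charCodeAt(i) & 0xFF);
  }
  return out;
}

// A few of the ranges from the Digit string in this bug.
const Digit = "[\u0030-\u0039]|[\u0660-\u0669]|[\u06F0-\u06F9]|" +
              "[\u0E50-\u0E59]|[\u0F20-\u0F29]";
const narrowed = dropHighBytes(Digit);

// Note the last range: U+0F20 narrows to 0x20, an ordinary space.
console.log(JSON.stringify(narrowed));
console.log(narrowed === "[0-9]|[`-i]|[\xF0-\xF9]|[P-Y]|[ -)]"); // true

// The number returned by charCodeAt() never passes through the narrowing,
// which fits the observation that |eval| still reports 1776.
console.log("\u06F0".charCodeAt(0)); // 1776
```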
Updated•20 years ago
Product: Core → Other Applications
Comment 15•19 years ago
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/js/jsd/idl/jsdIDebuggerService.idl&rev=1.33&mark=1105#1102
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/js/jsd/jsd_xpc.cpp&rev=1.75&mark=2166#2165
Can you see the problem yet?
At a guess, this needs to be upped to |wstring| in the IDL (or is one of the other seemingly random string types better?) and wchar** or whatever in the C++.
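The patch that eventually landed is titled "convert idl to use utf8 strings" rather than switching to |wstring|. As a rough illustration of why a byte-oriented string can still carry the character when it holds UTF-8, here is a sketch in plain JavaScript using the standard TextEncoder and TextDecoder; it says nothing about how XPIDL or XPConnect implement the conversion, and all names are illustrative:

```javascript
const original = "\u06F0";

// UTF-8 spreads the code point over two bytes instead of discarding bits.
const utf8 = new TextEncoder().encode(original);
console.log(Array.from(utf8, (b) => b.toString(16))); // [ 'db', 'b0' ]

// Decoding the bytes as UTF-8 restores the original character.
const roundTripped = new TextDecoder("utf-8").decode(utf8);
console.log(roundTripped === original); // true

// By contrast, forcing the code unit into a single byte loses the high byte.
const narrowed = String.fromCharCode(original.charCodeAt(0) & 0xFF);
console.log(narrowed === "\xF0"); // true
```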
Hardware: PC → All
Summary: Does Venkman handle Unicode characters correctly? → jsdIValue doesn't handle Unicode string values at all
Comment 16•19 years ago
It'd probably be better to add a getUnicodeValue() method than to whack the signature of getStringValue(), for backward compatibility's sake.
Comment 17•19 years ago
I'm concerned at the number of other |string|s in the IDL; does JSD actually have any C++ users? (I always think of it as JS-only, although I know it isn't technically.)
Comment 18•17 years ago
Brendan, I share James' concern, and am wondering if it would be possible/feasible to change the idl to use wstring wherever possible - in the 1.9 timeframe.
After 1.9 we can stop caring given Tamarin will do away with all these interfaces (or require extensive tinkering to them) - as far as I know, anyway. Which is why I'd like to get this before 1.9.
Flags: blocking1.9?
Comment 19•17 years ago
So, Brendan, I guess you're really really busy, but could you please take a look at this? It's been two months and we're close to shipping betas and all that... :-(
Updated•17 years ago
Flags: blocking1.9?
Assignee | ||
Comment 20•16 years ago
I hit this while chasing certificates; this also fixes the other bug about nulls.
Comment 21•16 years ago
Comment on attachment 352710: convert idl to use utf8 strings to make xpconnect happy
- In jsdStackFrame::GetFunctionName():
+ _rval.Assign(JSD_GetNameForStackFrame(mCx, mThreadState,
+ mStackFrameInfo));
Fix the second line argument indentation.
r+sr=jst
Attachment #352710 - Flags: superreview+
Attachment #352710 - Flags: review?(jst)
Attachment #352710 - Flags: review+
Comment 22•16 years ago
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 16 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.2a1
Comment 23•15 years ago
It would've been nice if you hadn't changed IIDs of interfaces that did not actually change :/
Comment 24•15 years ago
Any chance we can roll back the IID changes for interfaces that didn't change in 1.9.2 and beyond? It'd help to simplify the code I'm working on now (working with jsdIScriptHook, jsdICallHook, etc.).
Comment 25•15 years ago
IID should not change if the interface didn't. File a followup bug and make it block this bug.
/be
Comment 26•15 years ago
Updated•6 years ago
Product: Other Applications → Other Applications Graveyard