Closed Bug 136292 Opened 18 years ago Closed 11 years ago

jsdIValue doesn't handle Unicode string values at all


(Other Applications Graveyard :: Venkman JS Debugger, defect)






(Reporter: pwilson, Assigned: timeless)




(3 files)

When trying to encode the XML character classes as RegExps, I use (cut down):
const Digit    = "[\u0030-\u0039] | [\u0660-\u0669] | [\u06F0-\u06F9] | \
                  [\u0966-\u096F] | [\u09E6-\u09EF] | [\u0A66-\u0A6F] | \
                  [\u0AE6-\u0AEF] | [\u0B66-\u0B6F] | [\u0BE7-\u0BEF] | \
                  [\u0C66-\u0C6F] | [\u0CE6-\u0CEF] | [\u0D66-\u0D6F] | \
                  [\u0E50-\u0E59] | [\u0ED0-\u0ED9] | [\u0F20-\u0F29]"; 

Venkman Trace:
var R = new RegExp("(" + Digit + ")+", "gm")
$[37] = [void] void
A = R.toString()
$[38] = [string] "/([0-9] | [`-i] | [\xF0-\xF9] | [f-o] | [\xE6-\xEF] | [f-o] | 
[\xE6-\xEF] | [f-o] | [\xE7-\xEF] | [f-o] | [\xE6-\xEF] | [f-o] | [P-Y] | 
[\xD0-\xD9] | [ -)])+/gm"
A = R.test("123")
$[39] = [boolean] false

1. The output of the regexp back to a string is clearly wrong.
2. The "123" input should return true.

If I reduce the Digit expression to just the first range, then it works.
Sorry, I forgot to note the build: 2002040214 Win2000.
Of course it helps if you enter the regexp correctly - but not much.

If you remove the spaces from Digit:
const Digit 	= "[\u0030-\u0039]|[\u0660-\u0669]|[\u06F0-\u06F9]|" +
                 "[\u0966-\u096F]|[\u09E6-\u09EF]|[\u0A66-\u0A6F]|" +
                 "[\u0AE6-\u0AEF]|[\u0B66-\u0B6F]|[\u0BE7-\u0BEF]|" +
                 "[\u0C66-\u0C6F]|[\u0CE6-\u0CEF]|[\u0D66-\u0D6F]|" +
The toString() method still chops the high-order bytes.

The test() call is more interesting:
A = R.test("123") returns true (also true for longer digit strings).
A = R.test("12") returns false
A = R.test("21") returns true
A = R.test("1") returns false
A = R.test("0") returns true

Attached file HTML testcase
I'm not seeing any problem at all with RegExp when I run the HTML testcase
with Mozilla trunk binaries 20020406xx WinNT, 20020407xx Linux. For example,
here is my output on WinNT:

var R = new RegExp("(" + Digit + ")+", "gm")    where:

Digit =

R.toString() =

R.test("123") = true
R.test("123") = false

R.test("12") = true
R.test("12") = false

R.test("21") = true
R.test("21") = false

R.test("1") = true
R.test("1") = false

R.test("0") = true
R.test("0") = false
Of course the visual representation of the Unicode characters depends
on your environment. The way these characters look after I posted them
to the Bugzilla server is not the way they look when I run the HTML
testcase locally. The real point is, |R.toString()| is identical to
|Digit|, only wrapped like this: /(Digit)+/gm

So if there is a problem, it's not with R.toString(), but with the 
way the string |Digit| is being represented. Again, this depends on 
your environment. I put a <META> tag in the HTML testcase to set 
content="text/html; charset=utf-8". In Mozilla, this will force
View ---> Character Coding to equal "Unicode (UTF-8)". However, the
visual representation will still depend on what fonts you have, etc.

To test this, try the testcase in IE6 and in Mozilla. For me on WinNT,
the output of the testcase is identical in both browsers.
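For what it's worth, the engine's behavior can also be checked without relying on fonts or character rendering at all, by inspecting the code units directly. A minimal check (plain browser/shell JS, not part of the original testcase):

```javascript
// Verify, independently of how the characters render on screen, that the
// engine keeps the full 16-bit code units in RegExp.source.
var r = new RegExp("[\u0660-\u0669]");            // ARABIC-INDIC digit range
console.log(r.source.charCodeAt(1).toString(16)); // "660" -- high byte intact
console.log(r.test("\u0665"));                    // true  -- ARABIC-INDIC DIGIT FIVE matches
console.log(r.test("5"));                         // false -- ASCII "5" is outside the range
```

If the engine were really truncating the characters, `r.source.charCodeAt(1)` would come back as 0x60 rather than 0x660.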

Note I have expanded your example by applying R.test() twice to each
test string. This shows why you have alternating results of true and false.
The reason is, you have the global flag set. Whenever the global flag
is set, each successful match advances the property |R.lastIndex| to
the position just past the end of that match. If the next match fails,
R.lastIndex gets reset to 0, i.e. to the beginning of the string.

So for example, after R.test("123") = true, we have R.lastIndex = 3.
When we do R.test("123") again, the search will start from position 3
in the string, i.e. after the end of the string! Thus there is no 
match, we have R.test("123") = false, and R.lastIndex gets set to 0.

Thus, when we try the next test, R.test("12"), we begin the search from
the beginning of the string again, and we get a successful match again.
This is what happens when you repeatedly try R.test() when the global
flag is set. See bug 98409 (INVALID) for another example of this behavior.
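A minimal sketch of this statefulness in plain JS (using a simple `\d+` pattern rather than the full Digit expression):

```javascript
// test() with the g flag is stateful: lastIndex advances on a match
// and resets to 0 on a failed match.
var r = /\d+/g;

console.log(r.test("123"), r.lastIndex); // true 3  -- matched, lastIndex moved past the match
console.log(r.test("123"), r.lastIndex); // false 0 -- search started at index 3, past the end
console.log(r.test("123"), r.lastIndex); // true 3  -- fresh match from index 0 again
```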
I'm going to mark this one Invalid for now, and ask Peter to confirm
this by running the testcase in Mozilla and IE6. Note: NN4.7 and IE4
cannot handle the regexp, so don't try the testcase with them.

What I'm expecting is this:

1. Your output in Mozilla is the same as your output in IE6

2. In each browser, |R.toString()| is identical to |Digit|
   up to the wrapping /()+/gm

If so, we have no bug in RegExp; it's then a question of how the
browser is displaying Unicode strings in your environment.

If we do have a problem with that, I'll have to reassign this to the
International component to assess. The way Unicode strings are handled
in the browser is not the task of the JS Engine. Or perhaps it has to do
with how Venkman handles Unicode strings. Here is my output from Venkman:

$[2] = [string]

$[1] = [string]

Again, we see they are identical strings except for the wrapping /()+/gm.
Here is the output of the same testcase in the standalone JS shell, 
when run in my Cygwin shell on WinNT:

Digit =

R.toString() =

Peter: if you agree there is no bug here, please mark this bug 
"Verified". If you disagree, please state why, and reopen the bug.
Then I will reassign it to International or Venkman - thanks.
Closed: 18 years ago
Resolution: --- → INVALID
> 1. Your output in Mozilla is the same as your output in IE6

I guess I can't ask for that exactly; my IE6 is showing empty-box
characters where Mozilla is showing [0-9]. You can see the difference
in the way the two browsers handle Unicode characters on your system
simply by trying this javascript: URL in each one:

                 javascript: alert('\u0F29')
Note: I tried the |evald| command in Venkman on this.

evald "[\u0030-\u0039]|[\u0660-\u0669]|[\u06F0-\u06F9]|" + 
"[\u0966-\u096F]|[\u09E6-\u09EF]|[\u0A66-\u0A6F]|" + 
"[\u0AE6-\u0AEF]|[\u0B66-\u0B6F]|[\u0BE7-\u0BEF]|" + 
"[\u0C66-\u0C6F]|[\u0CE6-\u0CEF]|[\u0D66-\u0D6F]|" + 

The output I get looks the same as the browser output when I run
the HTML testcase above. I shouldn't dare post this to Bugzilla,
but here goes:


The point is, this is NOT what we see when debugging the above
HTML testcase on the same string (see Comment #6): 


Reopening bug and assigning to Venkman to ask why -
Resolution: INVALID → ---
Reassigning. If you open Venkman and do |evald| on a Unicode character:

evald '\u06F0';    you get:

But if Venkman is running a script, and you are stopped at a breakpoint,
then if you use |eval| on the Unicode character, you get this:

eval '\u06F0'

This might mean that Venkman is at the mercy of the browser; in other
words, the browser treats the Unicode character as '\xF0', so that's
what Venkman has to say. But that doesn't seem to be true, by trying
this little testcase I will attach below:

<SCRIPT language="JavaScript">
var Digit = "\u06F0";

function test()
{
  document.write("\\" + "u06F0 = " + Digit);
  alert("\\" + "u06F0 = " + Digit);
  alert(Digit === '\xF0');
}
</SCRIPT>

So why is Venkman coming up with this \xHH value for a \uHHHH character?
This is probably an International-component issue, but I thought I'd 
ask Rob first -
Assignee: rogerl → rginda
Component: JavaScript Engine → JavaScript Debugger
Ever confirmed: true
QA Contact: pschwartau → caillon
Attached file Reduced HTML testcase
Changing OS: Win2K ---> All.

Resummarizing from: "RegExp does not handle Unicode char ranges correctly"
                to: "Does Venkman handle Unicode characters correctly?"
OS: Windows 2000 → All
Summary: RegExp does not handle unicode char ranges correctly → Does Venkman handle Unicode characters correctly?
Again, note when you post Unicode characters to the Bugzilla server,
they get represented differently than how they appear locally.

The expression |&#1776;| in Comment #9 above is a decimal value which
translates in hex to 0x6F0, which corresponds to our assignment
|var Digit = "\u06F0";|.

By contrast, the hex value \xF0, which is what Venkman shows in
debug mode, corresponds to the decimal value 240.
I was not aware that RegExps are stateful when the g flag is used. Hence my
confusion about the apparently erroneous values. I can verify that the RegExp
is working as expected.

----------------------------- SUMMARY -----------------------------------

In Venkman, |evald| and |eval| are giving different output on
Unicode characters. The reduced HTML testcase uses '\u06F0'.

evald '\u06F0'  ---> a graphical representation of character '\u06F0'
eval  '\u06F0'  ---> '\xF0'

In other words, |eval| is chopping off the high byte. I was tempted
to blame the browser for causing this problem by providing Venkman
with the truncated information. But the browser seems to retain the
high byte, as seen when we run the reduced HTML testcase.

So why is Venkman missing the high byte? Note I tried this in 
Venkman; recall that 1776 is the decimal representation of 0x6F0:

evald '\u06F0'.charCodeAt(0)  ---> 1776  
eval  '\u06F0'.charCodeAt(0)  ---> 1776

evald '\u06F0'.charAt(0)  ---> a graphical representation of character '\u06F0'
eval  '\u06F0'.charAt(0)  ---> '\xF0' 

So |evald|, |eval| both agree on charCodeAt(), but differ on charAt();
|eval| seems to drop the high byte.
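The symptom can be simulated in plain JS. This only reproduces what the truncation looks like (masking each UTF-16 code unit to its low byte); it is not a claim about where in Venkman the bytes are actually lost:

```javascript
// Simulate "dropping the high byte" of each UTF-16 code unit.
// Illustration of the symptom only, not Venkman's actual code path.
function dropHighBytes(s) {
  var out = "";
  for (var i = 0; i < s.length; i++)
    out += String.fromCharCode(s.charCodeAt(i) & 0xFF);
  return out;
}

var digit = "\u06F0";                         // code unit 0x6F0 (decimal 1776)
console.log(digit.charCodeAt(0));             // 1776 -- what both |eval| and |evald| report
console.log(dropHighBytes(digit) === "\xF0"); // true -- the value |eval| displays
```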
Depends on: 138720
Product: Core → Other Applications

Can you see the problem yet?

At a guess, this needs to be upped to |wstring| in the IDL (or is one of the other seemingly random string types better?) and wchar** or whatever in the C++.
Hardware: PC → All
Summary: Does Venkman handle Unicode characters correctly? → jsdIValue doesn't handle Unicode string values at all
It'd probably be better to add a getUnicodeValue() method than whack the signature of getStringValue().  For backward compatibility's sake.
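In XPIDL terms, that additive route might look roughly like this. This is a hypothetical sketch of the suggestion only — the member names and layout are illustrative, not the real jsdIValue.idl, and the patch that eventually landed (attachment 352710) took a different route, converting the IDL to UTF-8 strings:

// Hypothetical sketch -- not the actual jsdIValue interface or the landed patch.
// Base interface and other members elided for illustration.
interface jsdIValue
{
    // existing accessor, kept so current C++ callers don't break;
    // |string| is 8-bit, which is where the high bytes get lost
    readonly attribute string stringValue;

    // proposed addition: |wstring| carries full 16-bit code units
    readonly attribute wstring unicodeStringValue;
};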
I'm concerned at the number of other |string|s in the IDL; does JSD actually have any C++ users? (I always think of it as JS-only, although I know it isn't technically.)
Brendan, I share James' concern, and am wondering if it would be possible/feasible to change the idl to use wstring wherever possible - in the 1.9 timeframe.

After 1.9 we can stop caring, given Tamarin will do away with all these interfaces (or require extensive tinkering to them) - as far as I know, anyway. Which is why I'd like to get this in before 1.9.
Flags: blocking1.9?
Blocks: 335098
So, Brendan, I guess you're really really busy, but could you please take a look at this? It's been two months and we're close to shipping betas and all that... :-(
Flags: blocking1.9?
QA Contact: caillon → venkman
I hit this while chasing certificates; this also fixes the other bug about nulls.
Assignee: rginda → timeless
Attachment #352710 - Flags: review?(jst)
Comment on attachment 352710 [details] [diff] [review]
convert idl to use utf8 strings to make xpconnect happy

- In jsdStackFrame::GetFunctionName():

+    _rval.Assign(JSD_GetNameForStackFrame(mCx, mThreadState,
+                                                mStackFrameInfo));

Fix the second line argument indentation.

Attachment #352710 - Flags: superreview+
Attachment #352710 - Flags: review?(jst)
Attachment #352710 - Flags: review+
Closed: 18 years ago → 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.2a1
Blocks: 482809
It would've been nice if you hadn't changed IIDs of interfaces that did not actually change :/
Any chance we can roll back the IID changes for interfaces that didn't change in 1.9.2 and beyond? It'd help to simplify the code I'm working on now (working with jsdIScriptHook, jsdICallHook, etc)
IID should not change if the interface didn't. File a followup bug and make it block this bug.

Depends on: 519276
Depends on: 700302
Product: Other Applications → Other Applications Graveyard