Closed Bug 200984 Opened 22 years ago Closed 21 years ago

[FIX]Arabic text in Javascript "unescape" function returns the wrong output


(Core :: Internationalization, defect, P1)






(Reporter: neokuwait, Assigned: bzbarsky)




(3 files, 2 obsolete files)


1. view testcase

the unescape function produces Latin characters instead of Arabic

only characters in their %xx.. hexadecimal form should be affected? (see IE6)
Attached file unescape testcase
Intl -- not arabic specific.
Assignee: mkaply → smontagu
Component: BiDi Hebrew & Arabic → Internationalization
Ever confirmed: true
OS: Windows XP → All
QA Contact: zach → ylong
Hardware: PC → All
Attached patch Proposed patch (obsolete) — Splinter Review
Attachment #119646 - Attachment is obsolete: true
Comment on attachment 119647 [details] [diff] [review]
Oops, that had an extra chunk in it...

Would you review?  The key here is that ToNewCString does lossy "conversion" to
a CString.... This patch makes the conversion use the proper charset encoder,
does a bit of cleanup, and removes the unnecessary call to Reset() (new
decoders already come blank).
Attachment #119647 - Flags: superreview?(jst)
Attachment #119647 - Flags: review?(smontagu)
Assignee: smontagu → bzbarsky
Priority: -- → P1
Summary: Arabic text in Javascript "unescape" function returns the wrong output → [FIX]Arabic text in Javascript "unescape" function returns the wrong output
Target Milestone: --- → mozilla1.4beta
Comment on attachment 119647 [details] [diff] [review]
Oops, that had an extra chunk in it...

   // Allocate a buffer of the maximum length
   PRUnichar *dest = (PRUnichar*)nsMemory::Alloc(sizeof(PRUnichar) *
+  rv = decoder->Convert(encodedData, &unescapedByteCount, dest, &destLen);
+  if (NS_FAILED(rv)) {
-    return result;
+    return rv;

   aReturn.Assign(dest, destLen);

Could you make dest an nsAutoPtr<PRUnichar *> here? If not, you may want to
flip the code around so that you need only one call to nsMemory::Free(dest)...

Attachment #119647 - Flags: superreview?(jst) → superreview+
nsAutoPtr will call "delete" on the pointer, so that's no good here (unless I
actually do a "new PRUnichar[]" and cast or something like that....

I'll flip the test around; good catch.
Before I review, can you quiet my suspicions that this is a non-problem?
Specifically, is non-ASCII text legal input to unescape(), and does calling
unescape() with such text have defined output?
... and another question (possibly for another bug) : is %DA%D1%C8%ED the
correct escaping of the string in the testcase? Why not %0639%0631%0628%064A? Or
possibly %D8%B9%D8%B1%D8%A8%D9%89?
Comment on attachment 119647 [details] [diff] [review]
Oops, that had an extra chunk in it...

OK, I accept that we need this for compatibility, although we need a new bug to
consider what the correct behaviour is, especially of escape().

Opera, IE, and Konqueror all render the second line of the testcase as:

escaped: %u0639%u0631%u0628%u064A%20%u0639%u0631%u0628%u064A

>+  // To gracefully deal with encoding issues, we have to do the following:

Nit: don't split infinitives

Attachment #119647 - Flags: review?(smontagu) → review+
Comment on attachment 119647 [details] [diff] [review]
Oops, that had an extra chunk in it...

Waldemar, Simon asked me to make sure you were OK with this patch.. could you
take a look, please?
Attachment #119647 - Flags: review+ → review?(waldemar)
The "new bug" I requested in comment 11 would be a dupe of bug 44272 :-)
I'll take a look at what the patch does.  I'm currently building Mozilla so I
can try it.
Attached file UTF-16 test case
I'm not familiar with the character set used in the first test case, so I used
a UTF-16 test case.  encodeURI/decodeURI work as expected, but escape and
unescape don't produce visible output with the patch applied.  Hmmmm.....
Comment on attachment 120504 [details]
UTF-16 test case

With this testcase, "escape" produces no output even without my patch...

The problem is the use of flat char* buffers to handle UTF-16 text, of course. 
Things die at the first null, which is the first byte if your endianness is

I suppose I could go ahead and rewrite all that stuff in nsGlobalWindow to not
be broken like that....
Attachment #120504 - Attachment mime type: text/utf16 → text/html
ccing some intl people for opinions
Asking the more general question: why do the JavaScript 'escape' and 'unescape'
functions care what the document character set is?  They are always passed
UTF-16 from the JS engine and should send UTF-16 back to it.  'encodeURI' and
'decodeURI' don't care and work properly.
> why do the JavaScript 'escape' and 'unescape' functions care what the document
> character set is?

Because that's the behavior real web sites expect.  In particular, one will
often have a site escape() a string, then on the server unescape it and expect
it to be in the document encoding....  Also, some sites will escape a string
server-side, converting a string in document encoding to %-escapes. When
unescaped, we get a byte sequence that's in the document encoding; to give it
back to the JS engine we have to convert it to UTF-16.
Blocks: 126524
Waldemar, please see comment 19?
Attached patch Updated to tipSplinter Review
Attachment #119647 - Attachment is obsolete: true
Attachment #119647 - Flags: review?(waldemar)
Attachment #125691 - Attachment filename: &#29696;&#25856;&#29440;&#29696;&#11776;&#28672;&#24832;&#29696;&#25344;&#26624; → test.patch
Attachment #125691 - Flags: review?(waldemar)
Attachment #125691 - Flags: review?(waldemar) → review+
Checked in.
Closed: 21 years ago
Resolution: --- → FIXED
Verified fixed in 06-17 trunk build on WinXP except the problem in bug 44272.
You need to log in before you can comment on or make changes to this bug.