Last Comment Bug 307438 - prefread.cpp doesn't support \u escape sequences
: prefread.cpp doesn't support \u escape sequences
Status: RESOLVED FIXED
: fixed1.8
Product: Core
Classification: Components
Component: Preferences: Backend (show other bugs)
: Trunk
: All All
: -- enhancement with 1 vote (vote)
: ---
Assigned To: Daniel Veditz [:dveditz]
:
: Benjamin Smedberg [:bsmedberg]
Mentors:
: 299000 (view as bug list)
Depends on:
Blocks: 301694 309128 309133 309311
  Show dependency treegraph
 
Reported: 2005-09-07 17:36 PDT by Darin Fisher
Modified: 2006-03-12 18:54 PST (History)
8 users (show)
mconnor: blocking1.8b5+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
parse \u unicode escapes in prefs (6.19 KB, patch)
2005-09-26 01:06 PDT, Daniel Veditz [:dveditz]
no flags Details | Diff | Splinter Review
also handle surrogate pairs, and build standalone (8.41 KB, patch)
2005-09-27 05:00 PDT, Daniel Veditz [:dveditz]
darin.moz: superreview-
Details | Diff | Splinter Review
third time's the charm? (9.40 KB, patch)
2005-09-27 17:13 PDT, Daniel Veditz [:dveditz]
cbiesinger: review+
darin.moz: superreview+
asa: approval1.8b5+
Details | Diff | Splinter Review
testcase user.js (441 bytes, text/plain)
2005-09-30 23:20 PDT, Daniel Veditz [:dveditz]
no flags Details

Description Darin Fisher 2005-09-07 17:36:32 PDT
prefread.cpp doesn't support \u escape sequences

it'd be nice if it did.  see, for example, bug 301694.
Comment 1 Christian :Biesinger (don't email me, ping me on IRC) 2005-09-08 03:37:28 PDT
should \x be supported too?
http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Guide:Literals#Using_Special_Characters_in_Strings
Comment 2 Darin Fisher 2005-09-08 09:33:32 PDT
yes, i think so.
Comment 3 Neil Harris 2005-09-20 06:00:33 PDT
It would really help some of the proposed work on Unicode safety if this bug
could be fixed. Some of the characters that need to be added to the blocklist
are not easily rendered by any software package, and do nasty things to text
editors. Others look identical to ordinary ASCII characters, but aren't.

This makes maintaining the appropriate patches very difficult, as it requires
dirty tricks like writing programs to edit the source files to contain the
undisplayable or misleading characters, which (in my opinion) renders the source
code effectively unmaintainable. 

Being able to use \u escapes instead would be extremely useful.
Comment 4 Darin Fisher 2005-09-20 10:00:56 PDT
I hear you.  Patches welcome :)
Comment 5 Daniel Veditz [:dveditz] 2005-09-20 13:39:00 PDT
Nominating for 1.5, entering literal unicode characters is insanely unsafe and
unmaintainable.
Comment 6 Daniel Veditz [:dveditz] 2005-09-26 01:06:49 PDT
Created attachment 197391 [details] [diff] [review]
parse \u unicode escapes in prefs

Supports \u escapes in prefs. I'm not going to support \x because prefs are
supposed to be utf8 and the chances are good people aren't going to use it
correctly.
Comment 7 Christian :Biesinger (don't email me, ping me on IRC) 2005-09-26 04:40:24 PDT
Comment on attachment 197391 [details] [diff] [review]
parse \u unicode escapes in prefs

- what about non-BMP characters?
Hm... it looks like these escape sequences are always 4 characters long, so not
an issue. Should you check the length though?
- I believe that \x is basically equivalent to \u. The page I linked to above
says:
"The character with the Latin-1 encoding specified by the two hexadecimal
digits XX between 00 and FF."
That is compatible with UTF-16.
Comment 8 Christian :Biesinger (don't email me, ping me on IRC) 2005-09-26 04:43:52 PDT
hrm, what does NS_ConvertUTF16toUTF8 do if it only gets one half of a surrogate
pair?

+                NS_ConvertUTF16toUTF8 utf8seq(&ps->unichar, 1);

Doesn't this expect nulltermination of its input string?
Comment 9 Darin Fisher 2005-09-26 10:57:18 PDT
Perhaps it would be simpler to just unescape inside pref_DoCallback.  Then, you
are not affecting the existing parse loop at all.  Just write a function to
unescape \u sequences, and call it inside pref_DoCallback.  Another choice would
be to move the unicode conversion into PREF_ReaderCallback.  (prefread.cpp was
sort of meant to be usable as a standalone program.)
Comment 10 Daniel Veditz [:dveditz] 2005-09-26 14:19:35 PDT
(In reply to comment #8)
> hrm, what does NS_ConvertUTF16toUTF8 do if it only gets one half of a surrogate
> pair?

It's just a mechanical bitwise conversion. When the string is used later it'll
still be two unicode characters next to each other that can be interpreted in
whatever way is appropriate.

> +                NS_ConvertUTF16toUTF8 utf8seq(&ps->unichar, 1);
> Doesn't this expect nulltermination of its input string?

No, that's why there's a length.
Comment 11 Christian :Biesinger (don't email me, ping me on IRC) 2005-09-26 15:43:11 PDT
(In reply to comment #10)
> > Doesn't this expect nulltermination of its input string?
> 
> No, that's why there's a length.

Oh, this is implemented using Substring. I have expected that it uses
nsDependentString.
(http://lxr.mozilla.org/seamonkey/source/xpcom/string/public/nsString.h#150)
Comment 12 Christian :Biesinger (don't email me, ping me on IRC) 2005-09-26 15:46:49 PDT
(In reply to comment #10)
> (In reply to comment #8)
> > hrm, what does NS_ConvertUTF16toUTF8 do if it only gets one half of a surrogate
> > pair?
> 
> It's just a mechanical bitwise conversion.

It's this assertion I'm worried about:
http://lxr.mozilla.org/seamonkey/source/xpcom/string/public/nsUTF8Utils.h#326
Comment 13 Christian :Biesinger (don't email me, ping me on IRC) 2005-09-26 17:31:39 PDT
> Should you check the length though?

er. nevermind, you do that already :)
Comment 14 Darin Fisher 2005-09-26 19:34:56 PDT
On second thought, I'm happy that you chose to add \uXXXX parsing in the main
parse loop.  It required you to learn my ugly parsing code, and for that I'm
truly sorry.  Hope it wasn't too painful ;-)

Anyways, could you please make sure that this code still compiles with
TEST_PREFREAD defined?  It's okay if you want to just stub NS_ConvertUTF16toUTF8
out for now, but bonus points for using nsUTF8Utils.h

Build command:
cl -DTEST_PREFREAD=1 -I. -I$MOZ_OBJDIR\dist\include\nspr
-I$MOZ_OBJDIR\dist\include\xpcom prefread.cpp
Comment 15 Daniel Veditz [:dveditz] 2005-09-27 04:58:36 PDT
(In reply to comment #12)
> It's this assertion I'm worried about:
> http://lxr.mozilla.org/seamonkey/source/xpcom/string/public/nsUTF8Utils.h#326

Bah, you're right. I figured the utf8 would get correctly reassembled into
unicode and then UTF16 code could deal with it. But the UTF8toUTF16 code will
only convert a 4-byte sequence into a surrogate pair, it won't let you
synthesize a surrogate pair out of two utf8 3-byte sequences. bad assumption...

(In reply to comment #14)
> On second thought, I'm happy that you chose to add \uXXXX parsing in the main
> parse loop.

Glad you agree :-). If we support those escapes we ought to support them at the
lowest stand-alone level, however ugly it ends up. I don't want to write and
maintain separate utf8 code though, so I'll slurp in the lower-level
nsUTF8Utils.h rather than require linking to the string library.

I went ahead and added support for \x escapes (always interpreted as Latin-1)

> Build command:
> cl -DTEST_PREFREAD=1 -I. -I$MOZ_OBJDIR\dist\include\nspr
> -I$MOZ_OBJDIR\dist\include\xpcom prefread.cpp

This now requires -I$MOZ_OBJDIR\dist\include\string and -DMOZILLA_INTERNAL_API=1
to build, but it does still build standalone. Is that OK?
Comment 16 Daniel Veditz [:dveditz] 2005-09-27 05:00:16 PDT
Created attachment 197538 [details] [diff] [review]
also handle surrogate pairs, and build standalone
Comment 17 Darin Fisher 2005-09-27 11:49:42 PDT
Comment on attachment 197538 [details] [diff] [review]
also handle surrogate pairs, and build standalone

>Index: modules/libpref/src/prefread.cpp

>+            case 'x':
>+                /* hex escape -- always interpreted as Latin-1 */
>+                state = PREF_PARSE_HEX_ESCAPE;
>+                ps->smatch = buf; /* save place in case of bad escape seq */
>+                ps->sindex = HEX_ESC_NUM_DIGITS;

Here, you're overloading the meaning of the smatch and sindex fields, right?
I think that's okay, but it deserves a more explicit comment.


>-                if (ps->lbcur == ps->lbend && !pref_GrowBuf(ps))
>+                if ((ps->lbcur+1) == ps->lbend && !pref_GrowBuf(ps))

good catch!


>+                /* bad escape sequence found, preserve partial escape as-is */
>+                *ps->lbcur++ = '\\';  /* original escape slash */
>+                if ((ps->lbcur+(buf-ps->smatch)) >= ps->lbend && !pref_GrowBuf(ps))
>+                    return PR_FALSE;
>+                while (ps->smatch < buf) {
>+                    *ps->lbcur++ = *ps->smatch++;
>+                }
>+                /* push the non-hex character back for re-parsing */
>+                --buf;
>+                state = PREF_PARSE_QUOTED_STRING;
>+                break;
>+            }

Hmm... it's not necessarily possible to decrement buf.	What if buf
points to the start of a segment containing pref data?

Also, pref_GrowBuf only promises to grow the buffer by one byte.  it
seems like this code assumes that it will grow by |buf - ps->smatch|
bytes.


>+                /* actual conversion */
>+                /* make sure there's room, 4 bytes is max utf8 len */
>+                if (ps->lbcur+4 >= ps->lbend && !pref_GrowBuf(ps))
>+                    return PR_FALSE;
>+
>+                ConvertUTF16toUTF8 converter(ps->lbcur);
>+                converter.write(ps->utf16, utf16len);
>+                ps->lbcur += converter.Size();
>+                state = PREF_PARSE_QUOTED_STRING;

same deal here.  we need to probably call pref_GrowBuf in a loop
while (ps->lbcur+4 >= ps->lbend) or something like that.


>+        case PREF_PARSE_UTF16_LOW_SURROGATE:
...
>+                 --buf;

I don't think we can always decrement buf here.
Comment 18 Daniel Veditz [:dveditz] 2005-09-27 13:13:26 PDT
(In reply to comment #17)
> Here, you're overloading the meaning of the smatch and sindex fields, right?

Extra state variables or a union seemed like overkill. I'll add comments.

> Hmm... it's not necessarily possible to decrement buf. What if buf
> points to the start of a segment containing pref data?

arg.

> Also, pref_GrowBuf only promises to grow the buffer by one byte.  it
> seems like this code assumes that it will grow by |buf - ps->smatch|
> bytes.

In fact it always doubles, starting with a 128 byte chunk. In the first version
I tried to play nice by calling grow in a byte-by-byte loop, but it's actually
easier to just change the documentation so no-one comes along in the future and
makes the code match the comments.
Comment 19 Daniel Veditz [:dveditz] 2005-09-27 15:15:02 PDT
(In reply to comment #18)
> > Hmm... it's not necessarily possible to decrement buf. What if buf
> > points to the start of a segment containing pref data?

double-arg: the start location pointed at by ps->smatch could be bogus if we've
had to load another chunk :-(
Comment 20 Daniel Veditz [:dveditz] 2005-09-27 17:13:15 PDT
Created attachment 197636 [details] [diff] [review]
third time's the charm?
Comment 21 Darin Fisher 2005-09-27 17:35:08 PDT
Comment on attachment 197636 [details] [diff] [review]
third time's the charm?

> if (ps->lbcur+4 >= ps->lbend && !pref_GrowBuf(ps))

I'd change that constant to "6" just to be safe as UTF-8 chars can be as long
as 6 bytes.  Even though we are unlikely to encounter such UTF-8 chars, it's
probably good to be future-proof here since it doesn't cost us much.

Otherwise, this patch looks really good.
sr=darin
Comment 22 Christian :Biesinger (don't email me, ping me on IRC) 2005-09-28 06:35:43 PDT
Comment on attachment 197636 [details] [diff] [review]
third time's the charm?

it looks like you don't ensure that the second \u character is indeed a low
surrogate?

+		 /* didn't find expected low surrogate. Ignore high surrogate

might it make more sense to put the \u string back to the buffer, like with a
malformed \u escape?


It's sorta odd to escape a single character in two \u escapes, but I suppose JS
doesn't allow anything else...
Comment 23 Christian :Biesinger (don't email me, ping me on IRC) 2005-09-28 06:38:05 PDT
Comment on attachment 197636 [details] [diff] [review]
third time's the charm?

it looks like you don't ensure that the second \u character is indeed a low
surrogate?

+		 /* didn't find expected low surrogate. Ignore high surrogate

might it make more sense to put the \u string back to the buffer, like with a
malformed \u escape?


It's sorta odd to escape a single character in two \u escapes, but I suppose JS
doesn't allow anything else...
Comment 24 Daniel Veditz [:dveditz] 2005-09-28 07:45:56 PDT
(In reply to comment #23)
> (From update of attachment 197636 [details] [diff] [review] [edit])
> it looks like you don't ensure that the second \u character is indeed a low
> surrogate?
> might it make more sense to put the \u string back to the buffer, like with a
> malformed \u escape?

It would be slightly more helpful, but IMO not enough so to be worth it. To keep
characters from disappearing I'd have to duplicate all the error checking in the
converter (not just for surrogate pairs). So I figure half a surrogate pair *is*
correctly translated to the non-character it represents.

> It's sorta odd to escape a single character in two \u escapes,

That's UTF16 for ya... The Java/JavaScript syntax was invented back when UCS2
covered all the defined Unicode codepoints.
Comment 25 Mike Schroepfer 2005-09-28 11:17:07 PDT
Trunk land + verify?  Like to confirm no pref bustage on trunk first.
Comment 26 Christian :Biesinger (don't email me, ping me on IRC) 2005-09-29 06:33:33 PDT
could translate half a surrogate to U+FFFD :)

>The Java/JavaScript syntax was invented back when UCS2
>covered all the defined Unicode codepoints.

Well, HTML allows: data:text/html,&#x10300;
but oh well.
Comment 27 Daniel Veditz [:dveditz] 2005-09-29 11:09:24 PDT
(In reply to comment #26)
> HTML allows: data:text/html,&#x10300; but oh well.

Yes, but the entity syntax has a handy termination character. A more similar
case is URL %-escaping and those aren't allowed to be variable length either.

We could invent \Uxxxxxx escapes ("big U" six hex digit) :-)

> could translate half a surrogate to U+FFFD :)

I thought about it but decided to match the converter in case there was some
reason for it. I messed up ignoring surrogate pairs in the first patch, figured
I'm on shaky ground and should copy the experts.

fixed checked into trunk
Comment 28 Marcia Knous [:marcia - use ni] 2005-09-30 13:26:37 PDT
Dan: Could QA get a testcase so we can verify this bug?
Comment 29 Daniel Veditz [:dveditz] 2005-09-30 22:45:47 PDT
Checked into 1.8 branch
Comment 30 Daniel Veditz [:dveditz] 2005-09-30 23:20:46 PDT
Created attachment 198109 [details]
testcase user.js
Comment 31 Christian :Biesinger (don't email me, ping me on IRC) 2005-10-30 16:20:52 PST
*** Bug 299000 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.