Closed Bug 307438 Opened 19 years ago Closed 19 years ago

prefread.cpp doesn't support \u escape sequences

Categories

(Core :: Preferences: Backend, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: darin.moz, Assigned: dveditz)

References

Details

(Keywords: fixed1.8)

Attachments

(2 files, 2 obsolete files)

prefread.cpp doesn't support \u escape sequences

it'd be nice if it did.  see, for example, bug 301694.
yes, i think so.
It would really help some of the proposed work on Unicode safety if this bug
could be fixed. Some of the characters that need to be added to the blocklist
are not easily rendered by any software package, and do nasty things to text
editors. Others look identical to ordinary ASCII characters, but aren't.

This makes maintaining the appropriate patches very difficult, as it requires
dirty tricks like writing programs to edit the source files to contain the
undisplayable or misleading characters, which (in my opinion) renders the source
code effectively unmaintainable. 

Being able to use \u escapes instead would be extremely useful.
Blocks: 309311
I hear you.  Patches welcome :)
Keywords: helpwanted
Nominating for 1.5, entering literal unicode characters is insanely unsafe and
unmaintainable.
Assignee: prefs → dveditz
Flags: blocking1.8b5-
Flags: blocking1.8b5- → blocking1.8b5?
Supports \u escapes in prefs. I'm not going to support \x because prefs are
supposed to be utf8 and the chances are good people aren't going to use it
correctly.
Attachment #197391 - Flags: superreview?(darin)
Attachment #197391 - Flags: review?(cbiesinger)
Blocks: 309133
Blocks: 301694
Comment on attachment 197391 [details] [diff] [review]
parse \u unicode escapes in prefs

- what about non-BMP characters?
Hm... it looks like these escape sequences are always 4 characters long, so not
an issue. Should you check the length though?
- I believe that \x is basically equivalent to \u. The page I linked to above
says:
"The character with the Latin-1 encoding specified by the two hexadecimal
digits XX between 00 and FF."
That is compatible with UTF-16.
hrm, what does NS_ConvertUTF16toUTF8 do if it only gets one half of a surrogate
pair?

+                NS_ConvertUTF16toUTF8 utf8seq(&ps->unichar, 1);

Doesn't this expect nulltermination of its input string?
Perhaps it would be simpler to just unescape inside pref_DoCallback.  Then, you
are not affecting the existing parse loop at all.  Just write a function to
unescape \u sequences, and call it inside pref_DoCallback.  Another choice would
be to move the unicode conversion into PREF_ReaderCallback.  (prefread.cpp was
sort of meant to be usable as a standalone program.)
(In reply to comment #8)
> hrm, what does NS_ConvertUTF16toUTF8 do if it only gets one half of a surrogate
> pair?

It's just a mechanical bitwise conversion. When the string is used later it'll
still be two unicode characters next to each other that can be interpreted in
whatever way is appropriate.

> +                NS_ConvertUTF16toUTF8 utf8seq(&ps->unichar, 1);
> Doesn't this expect nulltermination of its input string?

No, that's why there's a length.
(In reply to comment #10)
> > Doesn't this expect nulltermination of its input string?
> 
> No, that's why there's a length.

Oh, this is implemented using Substring. I have expected that it uses
nsDependentString.
(http://lxr.mozilla.org/seamonkey/source/xpcom/string/public/nsString.h#150)
(In reply to comment #10)
> (In reply to comment #8)
> > hrm, what does NS_ConvertUTF16toUTF8 do if it only gets one half of a surrogate
> > pair?
> 
> It's just a mechanical bitwise conversion.

It's this assertion I'm worried about:
http://lxr.mozilla.org/seamonkey/source/xpcom/string/public/nsUTF8Utils.h#326
> Should you check the length though?

er. nevermind, you do that already :)
Flags: blocking1.8b5? → blocking1.8b5+
On second thought, I'm happy that you chose to add \uXXXX parsing in the main
parse loop.  It required you to learn my ugly parsing code, and for that I'm
truly sorry.  Hope it wasn't too painful ;-)

Anyways, could you please make sure that this code still compiles with
TEST_PREFREAD defined?  It's okay if you want to just stub NS_ConvertUTF16toUTF8
out for now, but bonus points for using nsUTF8Utils.h

Build command:
cl -DTEST_PREFREAD=1 -I. -I$MOZ_OBJDIR\dist\include\nspr
-I$MOZ_OBJDIR\dist\include\xpcom prefread.cpp
Flags: blocking1.8b5+ → blocking1.8b5?
Flags: blocking1.8b5? → blocking1.8b5+
Keywords: helpwanted
(In reply to comment #12)
> It's this assertion I'm worried about:
> http://lxr.mozilla.org/seamonkey/source/xpcom/string/public/nsUTF8Utils.h#326

Bah, you're right. I figured the utf8 would get correctly reassembled into
unicode and then UTF16 code could deal with it. But the UTF8toUTF16 code will
only convert a 4-byte sequence into a surrogate pair, it won't let you
synthesize a surrogate pair out of two utf8 3-byte sequences. bad assumption...

(In reply to comment #14)
> On second thought, I'm happy that you chose to add \uXXXX parsing in the main
> parse loop.

Glad you agree :-). If we support those escapes we ought to support them at the
lowest stand-alone level, however ugly it ends up. I don't want to write and
maintain separate utf8 code though, so I'll slurp in the lower-level
nsUTF8Utils.h rather than require linking to the string library.

I went ahead and added support for \x escapes (always interpreted as Latin-1)

> Build command:
> cl -DTEST_PREFREAD=1 -I. -I$MOZ_OBJDIR\dist\include\nspr
> -I$MOZ_OBJDIR\dist\include\xpcom prefread.cpp

This now requires -I$MOZ_OBJDIR\dist\include\string and -DMOZILLA_INTERNAL_API=1
to build, but it does still build standalone. Is that OK?
Attachment #197391 - Attachment is obsolete: true
Attachment #197538 - Flags: superreview?(darin)
Attachment #197538 - Flags: review?(cbiesinger)
Attachment #197391 - Flags: superreview?(darin)
Attachment #197391 - Flags: review?(cbiesinger)
Comment on attachment 197538 [details] [diff] [review]
also handle surrogate pairs, and build standalone

>Index: modules/libpref/src/prefread.cpp

>+            case 'x':
>+                /* hex escape -- always interpreted as Latin-1 */
>+                state = PREF_PARSE_HEX_ESCAPE;
>+                ps->smatch = buf; /* save place in case of bad escape seq */
>+                ps->sindex = HEX_ESC_NUM_DIGITS;

Here, you're overloading the meaning of the smatch and sindex fields, right?
I think that's okay, but it deserves a more explicit comment.


>-                if (ps->lbcur == ps->lbend && !pref_GrowBuf(ps))
>+                if ((ps->lbcur+1) == ps->lbend && !pref_GrowBuf(ps))

good catch!


>+                /* bad escape sequence found, preserve partial escape as-is */
>+                *ps->lbcur++ = '\\';  /* original escape slash */
>+                if ((ps->lbcur+(buf-ps->smatch)) >= ps->lbend && !pref_GrowBuf(ps))
>+                    return PR_FALSE;
>+                while (ps->smatch < buf) {
>+                    *ps->lbcur++ = *ps->smatch++;
>+                }
>+                /* push the non-hex character back for re-parsing */
>+                --buf;
>+                state = PREF_PARSE_QUOTED_STRING;
>+                break;
>+            }

Hmm... it's not necessarily possible to decrement buf.	What if buf
points to the start of a segment containing pref data?

Also, pref_GrowBuf only promises to grow the buffer by one byte.  it
seems like this code assumes that it will grow by |buf - ps->smatch|
bytes.


>+                /* actual conversion */
>+                /* make sure there's room, 4 bytes is max utf8 len */
>+                if (ps->lbcur+4 >= ps->lbend && !pref_GrowBuf(ps))
>+                    return PR_FALSE;
>+
>+                ConvertUTF16toUTF8 converter(ps->lbcur);
>+                converter.write(ps->utf16, utf16len);
>+                ps->lbcur += converter.Size();
>+                state = PREF_PARSE_QUOTED_STRING;

same deal here.  we need to probably call pref_GrowBuf in a loop
while (ps->lbcur+4 >= ps->lbend) or something like that.


>+        case PREF_PARSE_UTF16_LOW_SURROGATE:
...
>+                 --buf;

I don't think we can always decrement buf here.
Attachment #197538 - Flags: superreview?(darin) → superreview-
(In reply to comment #17)
> Here, you're overloading the meaning of the smatch and sindex fields, right?

Extra state variables or a union seemed like overkill. I'll add comments.

> Hmm... it's not necessarily possible to decrement buf. What if buf
> points to the start of a segment containing pref data?

arg.

> Also, pref_GrowBuf only promises to grow the buffer by one byte.  it
> seems like this code assumes that it will grow by |buf - ps->smatch|
> bytes.

In fact it always doubles, starting with a 128 byte chunk. In the first version
I tried to play nice by calling grow in a byte-by-byte loop, but it's actually
easier to just change the documentation so no-one comes along in the future and
makes the code match the comments.
(In reply to comment #18)
> > Hmm... it's not necessarily possible to decrement buf. What if buf
> > points to the start of a segment containing pref data?

double-arg: the start location pointed at by ps->smatch could be bogus if we've
had to load another chunk :-(
Attachment #197538 - Attachment is obsolete: true
Attachment #197538 - Flags: review?(cbiesinger)
Attachment #197636 - Flags: superreview?(darin)
Attachment #197636 - Flags: review?(cbiesinger)
Status: NEW → ASSIGNED
Comment on attachment 197636 [details] [diff] [review]
third time's the charm?

> if (ps->lbcur+4 >= ps->lbend && !pref_GrowBuf(ps))

I'd change that constant to "6" just to be safe as UTF-8 chars can be as long
as 6 bytes.  Even though we are unlikely to encounter such UTF-8 chars, it's
probably good to be future-proof here since it doesn't cost us much.

Otherwise, this patch looks really good.
sr=darin
Attachment #197636 - Flags: superreview?(darin) → superreview+
Whiteboard: has darin's sr=, need r=, a=
Comment on attachment 197636 [details] [diff] [review]
third time's the charm?

it looks like you don't ensure that the second \u character is indeed a low
surrogate?

+		 /* didn't find expected low surrogate. Ignore high surrogate

might it make more sense to put the \u string back to the buffer, like with a
malformed \u escape?


It's sorta odd to escape a single character in two \u escapes, but I suppose JS
doesn't allow anything else...
Attachment #197636 - Flags: review?(cbiesinger) → review+
Comment on attachment 197636 [details] [diff] [review]
third time's the charm?

it looks like you don't ensure that the second \u character is indeed a low
surrogate?

+		 /* didn't find expected low surrogate. Ignore high surrogate

might it make more sense to put the \u string back to the buffer, like with a
malformed \u escape?


It's sorta odd to escape a single character in two \u escapes, but I suppose JS
doesn't allow anything else...
(In reply to comment #23)
> (From update of attachment 197636 [details] [diff] [review] [edit])
> it looks like you don't ensure that the second \u character is indeed a low
> surrogate?
> might it make more sense to put the \u string back to the buffer, like with a
> malformed \u escape?

It would be slightly more helpful, but IMO not enough so to be worth it. To keep
characters from disappearing I'd have to duplicate all the error checking in the
converter (not just for surrogate pairs). So I figure half a surrogate pair *is*
correctly translated to the non-character it represents.

> It's sorta odd to escape a single character in two \u escapes,

That's UTF16 for ya... The Java/JavaScript syntax was invented back when UCS2
covered all the defined Unicode codepoints.
Attachment #197636 - Flags: approval1.8b5?
Blocks: 309128
Trunk land + verify?  Like to confirm no pref bustage on trunk first.
Keywords: qawanted
could translate half a surrogate to U+FFFD :)

>The Java/JavaScript syntax was invented back when UCS2
>covered all the defined Unicode codepoints.

Well, HTML allows: data:text/html,&#x10300;
but oh well.
(In reply to comment #26)
> HTML allows: data:text/html,&#x10300; but oh well.

Yes, but the entity syntax has a handy termination character. A more similar
case is URL %-escaping and those aren't allowed to be variable length either.

We could invent \Uxxxxxx escapes ("big U" six hex digit) :-)

> could translate half a surrogate to U+FFFD :)

I thought about it but decided to match the converter in case there was some
reason for it. I messed up ignoring surrogate pairs in the first patch, figured
I'm on shaky ground and should copy the experts.

fixed checked into trunk
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Whiteboard: has darin's sr=, need r=, a=
Dan: Could QA get a testcase so we can verify this bug?
Attachment #197636 - Flags: approval1.8b5? → approval1.8b5+
Checked into 1.8 branch
Keywords: fixed1.8
Attached file testcase user.js
Keywords: qawanted
*** Bug 299000 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: