307438 - prefread.cpp doesn't support \u escape sequences

Reporter

Description

•

19 years ago

prefread.cpp doesn't support \u escape sequences

it'd be nice if it did.  see, for example, bug 301694.

Christian :Biesinger (don't email me, ping me on IRC)

Comment 1

•

19 years ago

should \x be supported too?
http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Guide:Literals#Using_Special_Characters_in_Strings

Darin Fisher

Reporter

Comment 2

•

19 years ago

yes, i think so.

Neil Harris

Comment 3

•

19 years ago

It would really help some of the proposed work on Unicode safety if this bug
could be fixed. Some of the characters that need to be added to the blocklist
are not easily rendered by any software package, and do nasty things to text
editors. Others look identical to ordinary ASCII characters, but aren't.

This makes maintaining the appropriate patches very difficult, as it requires
dirty tricks like writing programs to edit the source files to contain the
undisplayable or misleading characters, which (in my opinion) renders the source
code effectively unmaintainable. 

Being able to use \u escapes instead would be extremely useful.

Neil Harris

Updated

•

19 years ago

Blocks: 309311

Darin Fisher

Reporter

Comment 4

•

19 years ago

I hear you.  Patches welcome :)

Keywords: helpwanted

Daniel Veditz [:dveditz]

Assignee

Comment 5

•

19 years ago

Nominating for 1.5, entering literal unicode characters is insanely unsafe and
unmaintainable.

Assignee: prefs → dveditz

Flags: blocking1.8b5-

:Gavin Sharp [email: gavin@gavinsharp.com]

Updated

•

19 years ago

Flags: blocking1.8b5- → blocking1.8b5?

Daniel Veditz [:dveditz]

Assignee

Comment 6

•

19 years ago

Attached patch parse \u unicode escapes in prefs (obsolete) — Details — Splinter Review

Supports \u escapes in prefs. I'm not going to support \x because prefs are
supposed to be utf8 and the chances are good people aren't going to use it
correctly.

Attachment #197391 - Flags: superreview?(darin)

Attachment #197391 - Flags: review?(cbiesinger)

Daniel Veditz [:dveditz]

Assignee

Updated

•

19 years ago

Blocks: 309133

Daniel Veditz [:dveditz]

Assignee

Updated

•

19 years ago

Blocks: 301694

Christian :Biesinger (don't email me, ping me on IRC)

Comment 7

•

19 years ago

Comment on attachment 197391 [details] [diff] [review]
parse \u unicode escapes in prefs

- what about non-BMP characters?
Hm... it looks like these escape sequences are always 4 characters long, so not
an issue. Should you check the length though?
- I believe that \x is basically equivalent to \u. The page I linked to above
says:
"The character with the Latin-1 encoding specified by the two hexadecimal
digits XX between 00 and FF."
That is compatible with UTF-16.

Christian :Biesinger (don't email me, ping me on IRC)

Comment 8

•

19 years ago

hrm, what does NS_ConvertUTF16toUTF8 do if it only gets one half of a surrogate
pair?

+                NS_ConvertUTF16toUTF8 utf8seq(&ps->unichar, 1);

Doesn't this expect nulltermination of its input string?

Darin Fisher

Reporter

Comment 9

•

19 years ago

Perhaps it would be simpler to just unescape inside pref_DoCallback.  Then, you
are not affecting the existing parse loop at all.  Just write a function to
unescape \u sequences, and call it inside pref_DoCallback.  Another choice would
be to move the unicode conversion into PREF_ReaderCallback.  (prefread.cpp was
sort of meant to be usable as a standalone program.)

Daniel Veditz [:dveditz]

Assignee

Comment 10

•

19 years ago

(In reply to comment #8)
> hrm, what does NS_ConvertUTF16toUTF8 do if it only gets one half of a surrogate
> pair?

It's just a mechanical bitwise conversion. When the string is used later it'll
still be two unicode characters next to each other that can be interpreted in
whatever way is appropriate.

> +                NS_ConvertUTF16toUTF8 utf8seq(&ps->unichar, 1);
> Doesn't this expect nulltermination of its input string?

No, that's why there's a length.

Christian :Biesinger (don't email me, ping me on IRC)

Comment 11

•

19 years ago

(In reply to comment #10)
> > Doesn't this expect nulltermination of its input string?
> 
> No, that's why there's a length.

Oh, this is implemented using Substring. I have expected that it uses
nsDependentString.
(http://lxr.mozilla.org/seamonkey/source/xpcom/string/public/nsString.h#150)

Christian :Biesinger (don't email me, ping me on IRC)

Comment 12

•

19 years ago

(In reply to comment #10)
> (In reply to comment #8)
> > hrm, what does NS_ConvertUTF16toUTF8 do if it only gets one half of a surrogate
> > pair?
> 
> It's just a mechanical bitwise conversion.

It's this assertion I'm worried about:
http://lxr.mozilla.org/seamonkey/source/xpcom/string/public/nsUTF8Utils.h#326

Christian :Biesinger (don't email me, ping me on IRC)

Comment 13

•

19 years ago

> Should you check the length though?

er. nevermind, you do that already :)

Mike Connor [:mconnor]

Updated

•

19 years ago

Flags: blocking1.8b5? → blocking1.8b5+

Darin Fisher

Reporter

Comment 14

•

19 years ago

On second thought, I'm happy that you chose to add \uXXXX parsing in the main
parse loop.  It required you to learn my ugly parsing code, and for that I'm
truly sorry.  Hope it wasn't too painful ;-)

Anyways, could you please make sure that this code still compiles with
TEST_PREFREAD defined?  It's okay if you want to just stub NS_ConvertUTF16toUTF8
out for now, but bonus points for using nsUTF8Utils.h

Build command:
cl -DTEST_PREFREAD=1 -I. -I$MOZ_OBJDIR\dist\include\nspr
-I$MOZ_OBJDIR\dist\include\xpcom prefread.cpp

Flags: blocking1.8b5+ → blocking1.8b5?

Mike Connor [:mconnor]

Updated

•

19 years ago

Flags: blocking1.8b5? → blocking1.8b5+

Keywords: helpwanted

Daniel Veditz [:dveditz]

Assignee

Comment 15

•

19 years ago

(In reply to comment #12)
> It's this assertion I'm worried about:
> http://lxr.mozilla.org/seamonkey/source/xpcom/string/public/nsUTF8Utils.h#326

Bah, you're right. I figured the utf8 would get correctly reassembled into
unicode and then UTF16 code could deal with it. But the UTF8toUTF16 code will
only convert a 4-byte sequence into a surrogate pair, it won't let you
synthesize a surrogate pair out of two utf8 3-byte sequences. bad assumption...

(In reply to comment #14)
> On second thought, I'm happy that you chose to add \uXXXX parsing in the main
> parse loop.

Glad you agree :-). If we support those escapes we ought to support them at the
lowest stand-alone level, however ugly it ends up. I don't want to write and
maintain separate utf8 code though, so I'll slurp in the lower-level
nsUTF8Utils.h rather than require linking to the string library.

I went ahead and added support for \x escapes (always interpreted as Latin-1)

> Build command:
> cl -DTEST_PREFREAD=1 -I. -I$MOZ_OBJDIR\dist\include\nspr
> -I$MOZ_OBJDIR\dist\include\xpcom prefread.cpp

This now requires -I$MOZ_OBJDIR\dist\include\string and -DMOZILLA_INTERNAL_API=1
to build, but it does still build standalone. Is that OK?

Daniel Veditz [:dveditz]

Assignee

Comment 16

•

19 years ago

Attached patch also handle surrogate pairs, and build standalone (obsolete) — Details — Splinter Review

Attachment #197391 - Attachment is obsolete: true

Attachment #197538 - Flags: superreview?(darin)

Attachment #197538 - Flags: review?(cbiesinger)

Daniel Veditz [:dveditz]

Assignee

Updated

•

19 years ago

Attachment #197391 - Flags: superreview?(darin)

Attachment #197391 - Flags: review?(cbiesinger)

Darin Fisher

Reporter

Comment 17

•

19 years ago

Comment on attachment 197538 [details] [diff] [review]
also handle surrogate pairs, and build standalone

>Index: modules/libpref/src/prefread.cpp

>+            case 'x':
>+                /* hex escape -- always interpreted as Latin-1 */
>+                state = PREF_PARSE_HEX_ESCAPE;
>+                ps->smatch = buf; /* save place in case of bad escape seq */
>+                ps->sindex = HEX_ESC_NUM_DIGITS;

Here, you're overloading the meaning of the smatch and sindex fields, right?
I think that's okay, but it deserves a more explicit comment.


>-                if (ps->lbcur == ps->lbend && !pref_GrowBuf(ps))
>+                if ((ps->lbcur+1) == ps->lbend && !pref_GrowBuf(ps))

good catch!


>+                /* bad escape sequence found, preserve partial escape as-is */
>+                *ps->lbcur++ = '\\';  /* original escape slash */
>+                if ((ps->lbcur+(buf-ps->smatch)) >= ps->lbend && !pref_GrowBuf(ps))
>+                    return PR_FALSE;
>+                while (ps->smatch < buf) {
>+                    *ps->lbcur++ = *ps->smatch++;
>+                }
>+                /* push the non-hex character back for re-parsing */
>+                --buf;
>+                state = PREF_PARSE_QUOTED_STRING;
>+                break;
>+            }

Hmm... it's not necessarily possible to decrement buf.	What if buf
points to the start of a segment containing pref data?

Also, pref_GrowBuf only promises to grow the buffer by one byte.  it
seems like this code assumes that it will grow by |buf - ps->smatch|
bytes.


>+                /* actual conversion */
>+                /* make sure there's room, 4 bytes is max utf8 len */
>+                if (ps->lbcur+4 >= ps->lbend && !pref_GrowBuf(ps))
>+                    return PR_FALSE;
>+
>+                ConvertUTF16toUTF8 converter(ps->lbcur);
>+                converter.write(ps->utf16, utf16len);
>+                ps->lbcur += converter.Size();
>+                state = PREF_PARSE_QUOTED_STRING;

same deal here.  we need to probably call pref_GrowBuf in a loop
while (ps->lbcur+4 >= ps->lbend) or something like that.


>+        case PREF_PARSE_UTF16_LOW_SURROGATE:
...
>+                 --buf;

I don't think we can always decrement buf here.

Attachment #197538 - Flags: superreview?(darin) → superreview-

Daniel Veditz [:dveditz]

Assignee

Comment 18

•

19 years ago

(In reply to comment #17)
> Here, you're overloading the meaning of the smatch and sindex fields, right?

Extra state variables or a union seemed like overkill. I'll add comments.

> Hmm... it's not necessarily possible to decrement buf. What if buf
> points to the start of a segment containing pref data?

arg.

> Also, pref_GrowBuf only promises to grow the buffer by one byte.  it
> seems like this code assumes that it will grow by |buf - ps->smatch|
> bytes.

In fact it always doubles, starting with a 128 byte chunk. In the first version
I tried to play nice by calling grow in a byte-by-byte loop, but it's actually
easier to just change the documentation so no-one comes along in the future and
makes the code match the comments.

Daniel Veditz [:dveditz]

Assignee

Comment 19

•

19 years ago

(In reply to comment #18)
> > Hmm... it's not necessarily possible to decrement buf. What if buf
> > points to the start of a segment containing pref data?

double-arg: the start location pointed at by ps->smatch could be bogus if we've
had to load another chunk :-(

Daniel Veditz [:dveditz]

Assignee

Updated

•

19 years ago

Attachment #197538 - Attachment is obsolete: true

Attachment #197538 - Flags: review?(cbiesinger)

Daniel Veditz [:dveditz]

Assignee

Comment 20

•

19 years ago

Attached patch third time's the charm? — Details — Splinter Review

Attachment #197636 - Flags: superreview?(darin)

Attachment #197636 - Flags: review?(cbiesinger)

Daniel Veditz [:dveditz]

Assignee

Updated

•

19 years ago

Status: NEW → ASSIGNED

Darin Fisher

Reporter

Comment 21

•

19 years ago

Comment on attachment 197636 [details] [diff] [review]
third time's the charm?

> if (ps->lbcur+4 >= ps->lbend && !pref_GrowBuf(ps))

I'd change that constant to "6" just to be safe as UTF-8 chars can be as long
as 6 bytes.  Even though we are unlikely to encounter such UTF-8 chars, it's
probably good to be future-proof here since it doesn't cost us much.

Otherwise, this patch looks really good.
sr=darin

Attachment #197636 - Flags: superreview?(darin) → superreview+

Daniel Veditz [:dveditz]

Assignee

Updated

•

19 years ago

Whiteboard: has darin's sr=, need r=, a=

Christian :Biesinger (don't email me, ping me on IRC)

Comment 22

•

19 years ago

Comment on attachment 197636 [details] [diff] [review]
third time's the charm?

it looks like you don't ensure that the second \u character is indeed a low
surrogate?

+		 /* didn't find expected low surrogate. Ignore high surrogate

might it make more sense to put the \u string back to the buffer, like with a
malformed \u escape?


It's sorta odd to escape a single character in two \u escapes, but I suppose JS
doesn't allow anything else...

Attachment #197636 - Flags: review?(cbiesinger) → review+

Christian :Biesinger (don't email me, ping me on IRC)

Comment 23

•

19 years ago

Comment on attachment 197636 [details] [diff] [review]
third time's the charm?

it looks like you don't ensure that the second \u character is indeed a low
surrogate?

+		 /* didn't find expected low surrogate. Ignore high surrogate

might it make more sense to put the \u string back to the buffer, like with a
malformed \u escape?


It's sorta odd to escape a single character in two \u escapes, but I suppose JS
doesn't allow anything else...

Daniel Veditz [:dveditz]

Assignee

Comment 24

•

19 years ago

(In reply to comment #23)
> (From update of attachment 197636 [details] [diff] [review] [edit])
> it looks like you don't ensure that the second \u character is indeed a low
> surrogate?
> might it make more sense to put the \u string back to the buffer, like with a
> malformed \u escape?

It would be slightly more helpful, but IMO not enough so to be worth it. To keep
characters from disappearing I'd have to duplicate all the error checking in the
converter (not just for surrogate pairs). So I figure half a surrogate pair *is*
correctly translated to the non-character it represents.

> It's sorta odd to escape a single character in two \u escapes,

That's UTF16 for ya... The Java/JavaScript syntax was invented back when UCS2
covered all the defined Unicode codepoints.

Daniel Veditz [:dveditz]

Assignee

Updated

•

19 years ago

Attachment #197636 - Flags: approval1.8b5?

Darin Fisher

Reporter

Updated

•

19 years ago

Blocks: 309128

Mike Schroepfer

Comment 25

•

19 years ago

Trunk land + verify?  Like to confirm no pref bustage on trunk first.

Keywords: qawanted

Christian :Biesinger (don't email me, ping me on IRC)

Comment 26

•

19 years ago

could translate half a surrogate to U+FFFD :)

>The Java/JavaScript syntax was invented back when UCS2
>covered all the defined Unicode codepoints.

Well, HTML allows: data:text/html,&#x10300;
but oh well.

Daniel Veditz [:dveditz]

Assignee

Comment 27

•

19 years ago

(In reply to comment #26)
> HTML allows: data:text/html,&#x10300; but oh well.

Yes, but the entity syntax has a handy termination character. A more similar
case is URL %-escaping and those aren't allowed to be variable length either.

We could invent \Uxxxxxx escapes ("big U" six hex digit) :-)

> could translate half a surrogate to U+FFFD :)

I thought about it but decided to match the converter in case there was some
reason for it. I messed up ignoring surrogate pairs in the first patch, figured
I'm on shaky ground and should copy the experts.

fixed checked into trunk

Status: ASSIGNED → RESOLVED

Closed: 19 years ago

Resolution: --- → FIXED

Daniel Veditz [:dveditz]

Assignee

Updated

•

19 years ago

Whiteboard: has darin's sr=, need r=, a=

Marcia Knous [:marcia]

Comment 28

•

19 years ago

Dan: Could QA get a testcase so we can verify this bug?

Asa Dotzler [:asa]

Updated

•

19 years ago

Attachment #197636 - Flags: approval1.8b5? → approval1.8b5+

Daniel Veditz [:dveditz]

Assignee

Comment 29

•

19 years ago

Checked into 1.8 branch

Keywords: fixed1.8

Daniel Veditz [:dveditz]

Assignee

Comment 30

•

19 years ago

Attached file testcase user.js — Details

Asa Dotzler [:asa]

Updated

•

19 years ago

Keywords: qawanted

Christian :Biesinger (don't email me, ping me on IRC)

Comment 31

•

19 years ago

*** Bug 299000 has been marked as a duplicate of this bug. ***

parse \u unicode escapes in prefs 19 years ago Daniel Veditz [:dveditz] 6.19 KB, patch		Details \| Diff \| Splinter Review
also handle surrogate pairs, and build standalone 19 years ago Daniel Veditz [:dveditz] 8.41 KB, patch	darin.moz : superreview-	Details \| Diff \| Splinter Review
third time's the charm? 19 years ago Daniel Veditz [:dveditz] 9.40 KB, patch	Biesinger : review+ darin.moz : superreview+ asa : approval1.8b5+	Details \| Diff \| Splinter Review
testcase user.js 19 years ago Daniel Veditz [:dveditz] 441 bytes, text/plain		Details