Closed
Bug 307438
Opened 19 years ago
Closed 19 years ago
prefread.cpp doesn't support \u escape sequences
Categories
(Core :: Preferences: Backend, enhancement)
Core
Preferences: Backend
Tracking
()
RESOLVED
FIXED
People
(Reporter: darin.moz, Assigned: dveditz)
References
Details
(Keywords: fixed1.8)
Attachments
(2 files, 2 obsolete files)
9.40 KB,
patch
|
Biesinger
:
review+
darin.moz
:
superreview+
asa
:
approval1.8b5+
|
Details | Diff | Splinter Review |
441 bytes,
text/plain
|
Details |
prefread.cpp doesn't support \u escape sequences it'd be nice if it did. see, for example, bug 301694.
Comment 1•19 years ago
|
||
should \x be supported too? http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Guide:Literals#Using_Special_Characters_in_Strings
Reporter | ||
Comment 2•19 years ago
|
||
yes, i think so.
Comment 3•19 years ago
|
||
It would really help some of the proposed work on Unicode safety if this bug could be fixed. Some of the characters that need to be added to the blocklist are not easily rendered by any software package, and do nasty things to text editors. Others look identical to ordinary ASCII characters, but aren't. This makes maintaining the appropriate patches very difficult, as it requires dirty tricks like writing programs to edit the source files to contain the undisplayable or misleading characters, which (in my opinion) renders the source code effectively unmaintainable. Being able to use \u escapes instead would be extremely useful.
Assignee | ||
Comment 5•19 years ago
|
||
Nominating for 1.5, entering literal unicode characters is insanely unsafe and unmaintainable.
Assignee: prefs → dveditz
Flags: blocking1.8b5-
Updated•19 years ago
|
Flags: blocking1.8b5- → blocking1.8b5?
Assignee | ||
Comment 6•19 years ago
|
||
Supports \u escapes in prefs. I'm not going to support \x because prefs are supposed to be utf8 and the chances are good people aren't going to use it correctly.
Attachment #197391 -
Flags: superreview?(darin)
Attachment #197391 -
Flags: review?(cbiesinger)
Comment 7•19 years ago
|
||
Comment on attachment 197391 [details] [diff] [review] parse \u unicode escapes in prefs - what about non-BMP characters? Hm... it looks like these escape sequences are always 4 characters long, so not an issue. Should you check the length though? - I believe that \x is basically equivalent to \u. The page I linked to above says: "The character with the Latin-1 encoding specified by the two hexadecimal digits XX between 00 and FF." That is compatible with UTF-16.
Comment 8•19 years ago
|
||
hrm, what does NS_ConvertUTF16toUTF8 do if it only gets one half of a surrogate pair? + NS_ConvertUTF16toUTF8 utf8seq(&ps->unichar, 1); Doesn't this expect nulltermination of its input string?
Reporter | ||
Comment 9•19 years ago
|
||
Perhaps it would be simpler to just unescape inside pref_DoCallback. Then, you are not affecting the existing parse loop at all. Just write a function to unescape \u sequences, and call it inside pref_DoCallback. Another choice would be to move the unicode conversion into PREF_ReaderCallback. (prefread.cpp was sort of meant to be usable as a standalone program.)
Assignee | ||
Comment 10•19 years ago
|
||
(In reply to comment #8) > hrm, what does NS_ConvertUTF16toUTF8 do if it only gets one half of a surrogate > pair? It's just a mechanical bitwise conversion. When the string is used later it'll still be two unicode characters next to each other that can be interpreted in whatever way is appropriate. > + NS_ConvertUTF16toUTF8 utf8seq(&ps->unichar, 1); > Doesn't this expect nulltermination of its input string? No, that's why there's a length.
Comment 11•19 years ago
|
||
(In reply to comment #10) > > Doesn't this expect nulltermination of its input string? > > No, that's why there's a length. Oh, this is implemented using Substring. I have expected that it uses nsDependentString. (http://lxr.mozilla.org/seamonkey/source/xpcom/string/public/nsString.h#150)
Comment 12•19 years ago
|
||
(In reply to comment #10) > (In reply to comment #8) > > hrm, what does NS_ConvertUTF16toUTF8 do if it only gets one half of a surrogate > > pair? > > It's just a mechanical bitwise conversion. It's this assertion I'm worried about: http://lxr.mozilla.org/seamonkey/source/xpcom/string/public/nsUTF8Utils.h#326
Comment 13•19 years ago
|
||
> Should you check the length though?
er. nevermind, you do that already :)
Updated•19 years ago
|
Flags: blocking1.8b5? → blocking1.8b5+
Reporter | ||
Comment 14•19 years ago
|
||
On second thought, I'm happy that you chose to add \uXXXX parsing in the main parse loop. It required you to learn my ugly parsing code, and for that I'm truly sorry. Hope it wasn't too painful ;-) Anyways, could you please make sure that this code still compiles with TEST_PREFREAD defined? It's okay if you want to just stub NS_ConvertUTF16toUTF8 out for now, but bonus points for using nsUTF8Utils.h Build command: cl -DTEST_PREFREAD=1 -I. -I$MOZ_OBJDIR\dist\include\nspr -I$MOZ_OBJDIR\dist\include\xpcom prefread.cpp
Flags: blocking1.8b5+ → blocking1.8b5?
Updated•19 years ago
|
Flags: blocking1.8b5? → blocking1.8b5+
Keywords: helpwanted
Assignee | ||
Comment 15•19 years ago
|
||
(In reply to comment #12) > It's this assertion I'm worried about: > http://lxr.mozilla.org/seamonkey/source/xpcom/string/public/nsUTF8Utils.h#326 Bah, you're right. I figured the utf8 would get correctly reassembled into unicode and then UTF16 code could deal with it. But the UTF8toUTF16 code will only convert a 4-byte sequence into a surrogate pair, it won't let you synthesize a surrogate pair out of two utf8 3-byte sequences. bad assumption... (In reply to comment #14) > On second thought, I'm happy that you chose to add \uXXXX parsing in the main > parse loop. Glad you agree :-). If we support those escapes we ought to support them at the lowest stand-alone level, however ugly it ends up. I don't want to write and maintain separate utf8 code though, so I'll slurp in the lower-level nsUTF8Utils.h rather than require linking to the string library. I went ahead and added support for \x escapes (always interpreted as Latin-1) > Build command: > cl -DTEST_PREFREAD=1 -I. -I$MOZ_OBJDIR\dist\include\nspr > -I$MOZ_OBJDIR\dist\include\xpcom prefread.cpp This now requires -I$MOZ_OBJDIR\dist\include\string and -DMOZILLA_INTERNAL_API=1 to build, but it does still build standalone. Is that OK?
Assignee | ||
Comment 16•19 years ago
|
||
Attachment #197391 -
Attachment is obsolete: true
Attachment #197538 -
Flags: superreview?(darin)
Attachment #197538 -
Flags: review?(cbiesinger)
Assignee | ||
Updated•19 years ago
|
Attachment #197391 -
Flags: superreview?(darin)
Attachment #197391 -
Flags: review?(cbiesinger)
Reporter | ||
Comment 17•19 years ago
|
||
Comment on attachment 197538 [details] [diff] [review] also handle surrogate pairs, and build standalone >Index: modules/libpref/src/prefread.cpp >+ case 'x': >+ /* hex escape -- always interpreted as Latin-1 */ >+ state = PREF_PARSE_HEX_ESCAPE; >+ ps->smatch = buf; /* save place in case of bad escape seq */ >+ ps->sindex = HEX_ESC_NUM_DIGITS; Here, you're overloading the meaning of the smatch and sindex fields, right? I think that's okay, but it deserves a more explicit comment. >- if (ps->lbcur == ps->lbend && !pref_GrowBuf(ps)) >+ if ((ps->lbcur+1) == ps->lbend && !pref_GrowBuf(ps)) good catch! >+ /* bad escape sequence found, preserve partial escape as-is */ >+ *ps->lbcur++ = '\\'; /* original escape slash */ >+ if ((ps->lbcur+(buf-ps->smatch)) >= ps->lbend && !pref_GrowBuf(ps)) >+ return PR_FALSE; >+ while (ps->smatch < buf) { >+ *ps->lbcur++ = *ps->smatch++; >+ } >+ /* push the non-hex character back for re-parsing */ >+ --buf; >+ state = PREF_PARSE_QUOTED_STRING; >+ break; >+ } Hmm... it's not necessarily possible to decrement buf. What if buf points to the start of a segment containing pref data? Also, pref_GrowBuf only promises to grow the buffer by one byte. it seems like this code assumes that it will grow by |buf - ps->smatch| bytes. >+ /* actual conversion */ >+ /* make sure there's room, 4 bytes is max utf8 len */ >+ if (ps->lbcur+4 >= ps->lbend && !pref_GrowBuf(ps)) >+ return PR_FALSE; >+ >+ ConvertUTF16toUTF8 converter(ps->lbcur); >+ converter.write(ps->utf16, utf16len); >+ ps->lbcur += converter.Size(); >+ state = PREF_PARSE_QUOTED_STRING; same deal here. we need to probably call pref_GrowBuf in a loop while (ps->lbcur+4 >= ps->lbend) or something like that. >+ case PREF_PARSE_UTF16_LOW_SURROGATE: ... >+ --buf; I don't think we can always decrement buf here.
Attachment #197538 -
Flags: superreview?(darin) → superreview-
Assignee | ||
Comment 18•19 years ago
|
||
(In reply to comment #17) > Here, you're overloading the meaning of the smatch and sindex fields, right? Extra state variables or a union seemed like overkill. I'll add comments. > Hmm... it's not necessarily possible to decrement buf. What if buf > points to the start of a segment containing pref data? arg. > Also, pref_GrowBuf only promises to grow the buffer by one byte. it > seems like this code assumes that it will grow by |buf - ps->smatch| > bytes. In fact it always doubles, starting with a 128 byte chunk. In the first version I tried to play nice by calling grow in a byte-by-byte loop, but it's actually easier to just change the documentation so no-one comes along in the future and makes the code match the comments.
Assignee | ||
Comment 19•19 years ago
|
||
(In reply to comment #18) > > Hmm... it's not necessarily possible to decrement buf. What if buf > > points to the start of a segment containing pref data? double-arg: the start location pointed at by ps->smatch could be bogus if we've had to load another chunk :-(
Assignee | ||
Updated•19 years ago
|
Attachment #197538 -
Attachment is obsolete: true
Attachment #197538 -
Flags: review?(cbiesinger)
Assignee | ||
Comment 20•19 years ago
|
||
Attachment #197636 -
Flags: superreview?(darin)
Attachment #197636 -
Flags: review?(cbiesinger)
Assignee | ||
Updated•19 years ago
|
Status: NEW → ASSIGNED
Reporter | ||
Comment 21•19 years ago
|
||
Comment on attachment 197636 [details] [diff] [review] third time's the charm? > if (ps->lbcur+4 >= ps->lbend && !pref_GrowBuf(ps)) I'd change that constant to "6" just to be safe as UTF-8 chars can be as long as 6 bytes. Even though we are unlikely to encounter such UTF-8 chars, it's probably good to be future-proof here since it doesn't cost us much. Otherwise, this patch looks really good. sr=darin
Attachment #197636 -
Flags: superreview?(darin) → superreview+
Assignee | ||
Updated•19 years ago
|
Whiteboard: has darin's sr=, need r=, a=
Comment 22•19 years ago
|
||
Comment on attachment 197636 [details] [diff] [review] third time's the charm? it looks like you don't ensure that the second \u character is indeed a low surrogate? + /* didn't find expected low surrogate. Ignore high surrogate might it make more sense to put the \u string back to the buffer, like with a malformed \u escape? It's sorta odd to escape a single character in two \u escapes, but I suppose JS doesn't allow anything else...
Attachment #197636 -
Flags: review?(cbiesinger) → review+
Comment 23•19 years ago
|
||
Comment on attachment 197636 [details] [diff] [review] third time's the charm? it looks like you don't ensure that the second \u character is indeed a low surrogate? + /* didn't find expected low surrogate. Ignore high surrogate might it make more sense to put the \u string back to the buffer, like with a malformed \u escape? It's sorta odd to escape a single character in two \u escapes, but I suppose JS doesn't allow anything else...
Assignee | ||
Comment 24•19 years ago
|
||
(In reply to comment #23) > (From update of attachment 197636 [details] [diff] [review] [edit]) > it looks like you don't ensure that the second \u character is indeed a low > surrogate? > might it make more sense to put the \u string back to the buffer, like with a > malformed \u escape? It would be slightly more helpful, but IMO not enough so to be worth it. To keep characters from disappearing I'd have to duplicate all the error checking in the converter (not just for surrogate pairs). So I figure half a surrogate pair *is* correctly translated to the non-character it represents. > It's sorta odd to escape a single character in two \u escapes, That's UTF16 for ya... The Java/JavaScript syntax was invented back when UCS2 covered all the defined Unicode codepoints.
Assignee | ||
Updated•19 years ago
|
Attachment #197636 -
Flags: approval1.8b5?
Comment 25•19 years ago
|
||
Trunk land + verify? Like to confirm no pref bustage on trunk first.
Keywords: qawanted
Comment 26•19 years ago
|
||
could translate half a surrogate to U+FFFD :)
>The Java/JavaScript syntax was invented back when UCS2
>covered all the defined Unicode codepoints.
Well, HTML allows: data:text/html,𐌀
but oh well.
Assignee | ||
Comment 27•19 years ago
|
||
(In reply to comment #26) > HTML allows: data:text/html,𐌀 but oh well. Yes, but the entity syntax has a handy termination character. A more similar case is URL %-escaping and those aren't allowed to be variable length either. We could invent \Uxxxxxx escapes ("big U" six hex digit) :-) > could translate half a surrogate to U+FFFD :) I thought about it but decided to match the converter in case there was some reason for it. I messed up ignoring surrogate pairs in the first patch, figured I'm on shaky ground and should copy the experts. fixed checked into trunk
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•19 years ago
|
Whiteboard: has darin's sr=, need r=, a=
Comment 28•19 years ago
|
||
Dan: Could QA get a testcase so we can verify this bug?
Updated•19 years ago
|
Attachment #197636 -
Flags: approval1.8b5? → approval1.8b5+
Assignee | ||
Comment 30•19 years ago
|
||
Comment 31•19 years ago
|
||
*** Bug 299000 has been marked as a duplicate of this bug. ***
You need to log in
before you can comment on or make changes to this bug.
Description
•