Closed
Bug 84186
Opened 24 years ago
Closed 24 years ago
Add ACString, AUTF8String, AString types to XPIDL
Categories
(Core :: XPCOM, defect)
Core
XPCOM
Tracking
()
VERIFIED
FIXED
mozilla0.9.9
People
(Reporter: dmosedale, Assigned: nisheeth_mozilla)
References
Details
Attachments
(4 files, 8 obsolete files)
16.87 KB,
text/plain
|
Details | |
85.61 KB,
patch
|
Details | Diff | Splinter Review | |
61.90 KB,
patch
|
dbradley
:
review+
jband_mozilla
:
superreview+
shaver
:
approval+
|
Details | Diff | Splinter Review |
567 bytes,
patch
|
Details | Diff | Splinter Review |
shaver and I discussed this briefly in #mozilla a while ago. The short story is
the nsIURI is moving in the direction of being UTF8 encoded rather than ASCII
encoded. Since we need to be able to deal with this reasonably from JS, and as
a way to allow folks to write less bloated scriptable interfaces, shaver
suggested, with some hesitation, adding a [utf8] qualifier to xpidl. Thoughts?
Comment 1•24 years ago
|
||
Please No. If we *need* a UTF8 type then add a UTF8 type. Don't make it a funky
attribute. There would still need to be a distinct xpt type anyway. You better
have plans for how to do that before you launch off into this.
scc and I exchanged ideas about extending the AString abstract interface style
string idl types and *trying* to retire the flat "null terminated array of
chars" type as much as possible.
So, did someone want to take this bug and develop a solid plan and
justification for imposing a new type on the world? Or should we just mark this
future/WONTFIX now and forget about it? :)
Comment 2•24 years ago
|
||
I agree with John. I think a new type would make more sense and be cleaner.
Updated•24 years ago
|
Status: NEW → ASSIGNED
Comment 3•24 years ago
|
||
This is ONE of many possible character encodings, that happen to fit in 8-bit
characters. I don't think this is a proper use of the attribute mechanism, to
single out a particular character encoding. Let's either add a new type, or
pressure the nsIURI folks to just move to UNICODE. Either way, the nsIURI
implementation will have to become UNICODE aware.
Reporter | ||
Comment 4•24 years ago
|
||
beard: UTF8 _is_ a Unicode encoding. Do you mean UCS2?
Anyway, I'm fine with a new type. The important thing is that there's some
standard way to write scriptable interfaces that is:
a) i18n friendly
b) doesn't require 16-bit-wide chars
Since i18n is required everywhere, and a non-trivial number of folks (including
the nsIURI owners) are unwilling to accept the footprint cost of 16-bit chars.
UTF8 is a particularly good choice for the encoding in question because at least
a few popular protocols (IMAP, LDAP, ...) use it natively as their wire format,
and this cuts down on the number of conversions necessary.
Reporter | ||
Comment 5•24 years ago
|
||
Updating the summary. So if we're gonna make it a separate type, wouldn't it be
better to base it around an|ACString| type rather than a |string|?
Summary: [utf8] attribute for xpidl |string| type → UTF8 string type for XPIDL
Comment 6•24 years ago
|
||
Marking this won't fix, since an alternative solution was presented and seems to
be accepted. If incorrect feel free to change it back and discuss the issue further.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → WONTFIX
Updated•24 years ago
|
Status: RESOLVED → VERIFIED
Comment 7•24 years ago
|
||
Marking Verified -
Reporter | ||
Comment 8•24 years ago
|
||
Um, what alternate solution are you referring to?
Status: VERIFIED → REOPENED
Resolution: WONTFIX → ---
Comment 9•24 years ago
|
||
Sorry Dan, I forgot you had changed the summary. I still had this tagged as
creating a utf8 qualifier.
Reporter | ||
Comment 10•24 years ago
|
||
I'd like to see this for Mozilla 1.0, as the LDAP string APIs are a mess, and I
think the only sane way out of their current state is to move to UTF8 encoding,
since that's how they come over the wire. I'd be interested in seeing
nsACString exposed to idl, so that we keep all the nifty advantages of the new
string fu.
Comment 11•24 years ago
|
||
John,
Suppose I have a function:
void foo (in string a)
Is 'a' required to be in ASCII? If so, many interface violate this and can we
safely extend the meaning to any encoding ( such as UTF8 )?
Comment 12•24 years ago
|
||
Yes they do and no we can't.
We've had many discussions about this on the xpcom newsgroup.
Take a look at some branches of the thread entitled...
"Encoding Wars --- more in the Big String Story"
...circa 4/13/00.
We might concoct a new type, but we are not going to redefine the conventional
meaning of xpidl's 'string' type. Regardless of the amount of abuse I think it
would be a big mistake to rewrite the contract for 'stirng'.
Comment 13•24 years ago
|
||
sounds fair to me. there are a few interfaces that want to pass around utf8
strings. URI/URL interfaces will want to do this sometime too.
Question for you, is it possible to have a string type that could contain either
UTF8 or ASCII?
/me goes reads ng history....
Comment 14•24 years ago
|
||
Yes, ASCII is a subset of UTF8. If we had a UTF8 type then putting ASCII in it
is fine. The thing we don't want to do is force users of our ASCII type to have
to be prepared to get non-ASCII UTF8. If we *really* need clean support for UTF8
then we need to suffer the pain of bringing a new type into the world.
Comment 15•24 years ago
|
||
watching this bug.
Comment 16•24 years ago
|
||
Moving out past 1.0. If someone wants this for 1.0 let me know and see rearrange
my bug list.
Target Milestone: --- → mozilla1.0.1
Comment 17•24 years ago
|
||
If we are still thinking about making nsIURI/nsIURL/nsIFile intefaces pass utf8
strings, we are going to need this to be fixed.
Comment 18•24 years ago
|
||
Ok, that's what I needed to know. Is there a bug for that change or is that what
86000 infers?
Target Milestone: mozilla1.0.1 → mozilla0.9.7
Comment 19•24 years ago
|
||
If "1.0" means frozen interfaces does that include the typelib structure or
accepted .idl types?
Comment 20•24 years ago
|
||
This is not a bug, it's a feature. I'm convinced that adding it will require
making typelibs that cannot be read by old typelib readers (e.g. NS6.2). I'm
against this.
Comment 21•24 years ago
|
||
UTF-8 can be passed as char* or unsigned char*, so this does not block making
nsIURI to use UTF-8.
Comment 22•24 years ago
|
||
that is illegal to do. The "string" keyword in xpidl is not intended for utf8
encoding. John can give you the reasoning behind this rule.
Comment 23•24 years ago
|
||
One reason is that xpconnect assumes ascii and will royally screw any interface
that's trying to pass UTF8 through a string API.
John's arguments seem to lead to the conclusion that it's better to do it before
1.0 rather than after.
I don't really see the problem if 6.2 gets broken, 6.2 isn't going to work with
components compiled against mozilla1.0 anyway due to interface and c++ signature
changes.
Comment 24•24 years ago
|
||
Let's make an effort to fix this soon, for mozilla1.0. We should of course
already be using the typelib file major version to control compatibility. See
http://www.mozilla.org/scriptable/typelib_file.html.
/be
Comment 25•24 years ago
|
||
dveditz: I think you are confusing form and content. The fact that there will be
additional frozen interfaces has little to do with the distributed overhead of
making a new incompatible typelib version. It is not just the interfaces in our
tree that take the hit. A new version impacts vendors of components and plugins,
and embedded clients. If I ship a plugin with its own interfaces for
scriptability I'd like to be able to ship a version that works in 6.2 as well as
with future clients. This implies that we'll need to support reading and writing
both old and new versions. And, we'll need to add support to the xpidl compiler
to spit out the older versions on demand. We can do this. But it is not cheap.
We also need to add support to all the consumers of type info; xpconnect,
PyXPCOM, xpcom/proxy, etc.
I have yet to see a compelling argument that says we really need to support
moving UTF8 accross scriptable interfaces. How many interfaces need to do this?
Why can't they be converted to use wstring or AString? How many cases can't be
so converted? Why? Perhaps UTF8 can be an implementation detail within
components? Maybe there is some deCOMtamination that ought to be done here? How
much code do we have that has to deal with UTF8ness? Would it not be better to
move more of the code to work with unicode? If we make an AString-ish UTF8
scheme are people going to whine that they can't easily move raw char* through
such interfaces; i.e. are we going to need 5(!) string types? Why are
non-scriptable methods not an option? You can declare a native type right now
for UTF8. You just can't call such methods using JS. But you could have parallel
scriptable methods that use unicode (xpconnect is going to have to convert to
unicode anyway).
Anyway, arguments that say we need something have to be backed up with facts and
need to address the tradeoffs and alternate solutions.
Comment 26•24 years ago
|
||
Can someone come up with a compelling argument for this enhancement that jband's
suggestions wouldn't address?
I'd like get a decision one way or another fairly soon so I can access the impact.
Comment 27•24 years ago
|
||
I'm going to take the silence as agreement. Moving -> 1.1
Target Milestone: mozilla0.9.7 → mozilla1.1
Comment 28•24 years ago
|
||
It seems like the *only* pressing reason to provide a UTF-8 datatype is for the
nsIURI interface -- which must be frozen for mozilla 1.0
I have no problem creating 'parallel'interfaces (ie. nsIURI_UCS2) to access data
as unicode.
I guess my only question is when the unicode versions should be used vs. the
ASCII ones :-)
I suppose, the ASCII ones could *always* return data as Url-Encoded 7-bit data
while the unicode ones would just return the 'raw' data as unicode.
Is this sufficient? Or can we simply ditch the ASCII ones and always access the
data as unicode -- and possibly translate it to Url-Encoded 7-bit ASCII if
necessary?
Thoughts? Opinions?
-- rick
Comment 29•24 years ago
|
||
It really depends on the protocol and we should not escape twice. With the new
urlparser rewrite we already store the url escaped. Depending on the protocol
(could be set by the protocol flags) we should call the escape functions which
do the normal escaping for the first 127 chars always (this is a must!) and
sometimes drop the escaping of the other chars.
Comment 30•24 years ago
|
||
I think the DOM APIs would like to have this too, but i could be wrong.
Comment 31•24 years ago
|
||
Nope, no need for this in the DOM interfaces.
Comment 32•24 years ago
|
||
what we found with nsIURI is that the ASCII network ready string and
string-parts are frequently accessed. this means that the ASCII url-encoded
form is what we should optimize for. if we need unicode [gs]etters then they
can live on an alternate interface (as rick suggests), and the underlying
implementation could be changed if the unicode methods start to get called more
frequently then the ASCII-network-ready ones.
Reporter | ||
Comment 33•24 years ago
|
||
Sorry to get back to this discussion late. The thing that prompted me to file
is bug is that the LDAP XPCOM SDK does currently use AString/wstring in its
interfaces. However, the XPCOM SDK is really just a wrapper around the LDAP C
SDK, which is all UTF8, because that's how LDAP stuff is encoded on the wire.
Typically, clients of the XPCOM SDK call into it a number of times in a row,
using data from each previous call in the subsequent one. So this means that
most data will be unnecessarily converted a bunch of times in a row.
As far as deCOMtamination goes, I'm not sure that's the right thing in this
case, because the LDAP code really is used in more than one part of the app
(xpfe, mailnews, autoconfig).
jband asks:
> Would it not be better to move more of the code to work with unicode?
For data that's often mostly ASCII, isn't this unnecessary bloat?
> If we make an AString-ish UTF8 scheme are people going to whine that they
> can't easily move raw char* through such interfaces; i.e. are we going to need
> 5(!) string types?
This sounds like a straw man to me.
> Why are non-scriptable methods not an option? You can declare a native type
> right now for UTF8. You just can't call such methods using JS.
Well, for one thing, the LDAP datasource (which uses the XPCOM SDK) is written
in JS.
> But you could have parallel scriptable methods that use unicode (xpconnect is
> going to have to convert to unicode anyway).
Well, maybe this is what I'll have to do, or even use arrays of bytes, and then
convert them in JS. But it's worth keeping in mind when asking all these
questions not just "is it possible to do without a new type?" but also "is it
performant and footprint-efficient?"
Comment 34•24 years ago
|
||
After further discussion it was decided 1.0.1 makes more sense as a post 1.0
milestone.
Target Milestone: mozilla1.1 → mozilla1.0.1
Comment 35•24 years ago
|
||
Kevin McClusky and I were talking last week about how much space we potentially
could save if we were to use utf8 instead of UCS2 wherever possible, converting
at output time (like inside of widget) when UCS2 is needed. i just got wind of
this bug, thought I'd at least add my observation that this might be a good
thing for helping to combat bloat.
Comment 36•24 years ago
|
||
Most of the huge UCS2 bloat is inside of layout, not from XPCOM boundary
crossing, I think. We've been talking about this for literally years, but I was
always under the impression that it was Too Hard to fix layout.
We have bugs related to using UCS-2 even when the data is all-ASCII because of
defects in AppendData, if people are looking for a place to start.
Comment 37•24 years ago
|
||
Most of the string data in layout (or at least in the DOM) is stored as char*'s
if it's ASCII data, which is what most data on most pages is. Do we have other
huge consumers of string data in layout? Style system?
Comment 38•24 years ago
|
||
I'm seeing a lot more reasons to *not* do this than reasons to do it.
I do want to reiterate the thought that if we *do* decide to add such support to
xpidl then it should be as a way to talk about an aString-like *class* (yet to
be written!) and not about a pointer to an array of UTF8 chars. We want to get
rid of uses of string and wstring and their ilk, not encourage their use.
Reporter | ||
Comment 39•24 years ago
|
||
Changing bug summary to more accurately describe what is desired after
discussion with bryner and jag in irc.
OS: Linux → All
Hardware: PC → All
Summary: UTF8 string type for XPIDL → 8-bit UTF8-capable string type for XPIDL
Comment 40•24 years ago
|
||
I don't see how the summary change from "UTF8 string type" to "UTF8-capable
string type" helps. The current string type is "utf8 capable", the problem is
that xpconnect or other idl users can't know whether any given string does in
fact contain utf8 and so can't safely convert to UCS2 if desired.
Reporter | ||
Comment 41•24 years ago
|
||
dveditz: according to jband, passing UTF8 through |string| is illegal; why this
is so, I don't entirely understand. Despite it being illegal, it does work.
What we need is a type for which this both works and is legal. Perhaps jband
can shed some light on why this is illegal for |string|.
Comment 42•24 years ago
|
||
It's "illegal" because xpconnect assumes ASCII and will mangle the conversion to
UCS2; utf8 can be made to work, though, in non-scriptable interfaces. My
question was about the distinction between a UTF8 type and a UTF8 "capable"
type, and why that distinction was important enough to change the summary for. I
know why we want a UTF8 type, what I don't know is what else a "capable" type
would do.
I imagine some sort of argument pair (char* plus charset arg) would be utf8
capable, but I suspect most of the people who want a UTF8 type want to declare
"This routine takes UTF8" in their interfaces not "This routine will handle any
arbitrary charset you throw at it". I know some people would like the latter,
but I believe there are other bugs for that, and if we have to wait for the
arbitrary case we might not ever get it. Since UTF8 is an encoding not a
character set it's much easier to deal with than arbitrary character sets, and
in western languages saves memory over UCS2 which is one reason we want it now.
Reporter | ||
Comment 43•24 years ago
|
||
> It's "illegal" because xpconnect assumes ASCII and will mangle the conversion
> to UCS2; utf8 can be made to work, though, in non-scriptable interfaces.
That's actually not correct. It doesn't do any conversion at all, but merely
narrows or widens the chars as appropriate. So as a result, you can take the
string and convert it by hand from inside JS between UCS2 <-> UTF8. There's
even code that does this in the tree now.
> My question was about the distinction between a UTF8 type and a UTF8 "capable"
> type, and why that distinction was important enough to change the summary for.
> I know why we want a UTF8 type, what I don't know is what else a "capable"
> type would do.
This is an interesting distinction because an "encoding-agnostic" |nsACString|
(or even |string|) would fulfill this requirement. Special knowledge of UTF8
would be cool, but is not necessary.
Comment 44•24 years ago
|
||
so, i've got an example of why having a UTF8 string type would be very useful.
nsIURI::spec
nsIURI::host
nsIURI::prePath
all can contain UTF8 characters, yet these are declared |string| in the
nsIURI.idl. all components of an URL with the exception of the hostname are
escaped. this is done to support internationalized domain names, which must be
passed down to the bowls of necko in UTF8 format.
up until very recently, mozilla did not support internationalized domain names
(see bug 42898), and it has always been assumed that these attributes of nsIURI
would just be plain ASCII. now that assumption has changed.
for C++ code, i've had to manually convert consumers from making ASCII
assumptions to treating these attributes more generally as UTF8. no problem so far.
the problem comes in when i consider the JS consumers. what happens when JS
code accesses some of the attributes? if |string| means ASCII, then my JS
consumers will get garbage.
so, i either need |string| to mean UTF8 to xpconnect, or i need some other
string type that i can use to designate UTF8 strings in my IDL. and, since
nsIURI is an interface that must be frozen by mozilla1.0, i think we need some
sort of solution to this bug soon.
oh, and making these interfaces use wstring or AString is not going to cut it.
these strings need to be UTF8... for performance reasons, i can't afford to
convert them everytime someone asks for them. moreover, storing a UCS2 version
along side a UTF8 version is just going to bloat mozilla unnecessarily.
please please please add a UTF8 string type :-)
(my preference would be for |string| to be treated as UTF8, unless it can be
shown that this would be a performance problem.)
Keywords: mozilla1.0
Comment 45•24 years ago
|
||
Can we cut the Gordian knot? Brute-force and ignorance: redefine XPIDL string
to mean UTF-8, instead of ASCII. Make sure all our code that was inflating
ASCII to UCS-2 (or is it UTF-16? Neither, actually) now converts from UTF-8 to
PRUnichar vectors. Darin has had to go down this path. Can we follow and take
our chances with the performance hit? My little finger says "yes".
Jband, throw your trump card. Backward compatibility in the pre-1.0 world,
which mixed ASCII and UTF-8 in string, doesn't count with me -- except to
underscore that (by design) UTF-8 is a superset of ASCIIZ, so the only hazard is
the possible loss of performance converting from string to wstring.
/be
Comment 46•24 years ago
|
||
I don't suppose I've going to go on fighting the whole friggin' world. But...
I've argued from the beginning that the real reason for 'string' even existing
is to interface with code that needs to use flat C style strings (even in a
Unicode world with a modern 16 bit string class available). And, the point of
the ASCII restrictions is so that code using these strings can reasonable make
assumptions about the nature of the strings. If you say that 'string' now means
UTF8 then you've erased any way of tracking that part of the *explicit* contract
about the strings. And, *all* code that receives such strings is on its own to
figure out if they are ASCII or not - and have *some* sort of strategy to deal
with the unexpected. There is no way to track mismatches between code that vends
UTF8 and code that does not know how to deal with UTF8 (and has always been
promised ASCII!).
I was serious in my comments above that I think it is simply *wrong* to unleash
another 'dumb' string type. It is also wrong (but apparently an unavoidable
legacy) that we are passing around flat ASCII buffers rather than objects whose
length is determined *once*. It is even whackier to reused that ancient C string
paradigm for UTF8 when we *ought* to have an object that can answer both the
question of length and the question of whether or not any 'extended' chars are
present (when usually they are not).
I was arguing that if people think support for UTF8 is so important then they
should commit the resources to construct a modern class for representing them.
*Then* we can decide if we are willing to pay the cost of extending the typelib
spec (and client code) to support the new class at interface boundaries.
Just changing the spec and hacking xpconnect (et al.) a little will get you UTF8
support. But this is like proclaiming that there is no difference betwen signed
and unsigned numbers and turning off static type checking. It will mostly work,
but some code will invariably get screwed when handed an instance whose value
does not conform to expectations. And you'd be throwing away the tools to track
the distinction.
Comment 47•24 years ago
|
||
jband: I think we're already in the world you describe in your last paragraph,
and XPConnect's string-must-be-ASCII checks won't save C++ code called directly
from C++ code (of which we have plenty).
I agree we're also risking handing UTF-8 with high bit set off to old ASCII-only
stdio routines, or whatnot (though aren't some platforms modernizing? Or do the
C and C++ standards dictate that char[] memory passed, e.g., to printf, is ASCII
only?). This in addition to any perf hit from string => wstring where now we
merely inflate.
But worse-is-better is a harsh master, and it is beating us about the face and
shoulders here. Unless someone (darin with jag help?) is willing to evolve a
"strongly typed" UTF-8 nsAString subclass and get dbradley to extend typelib
format and XPConnect *quickly*, I predict string will be redefined to be UTF-8.
Then at least our current ASCII vs. UTF-8 confusion will yield to UTF-8 unity,
even where we do not want it. :-/.
/be
Comment 48•24 years ago
|
||
i'm a little confused about darin's argument for needing a utf-8 string type.
As far as i can tell, he is saying that we need a utf-8 datatype for javascript
callers.
but then, he says that AString or wstring won't be sufficient for *performance*
reasons. performace for what? the javascript consumers?
i can't imagine that performing a utf-8 to unicode conversion before calling
into javascript will kill our performance.
Right now, nsIURI is the *only* public API that could use utf-8 strings. It
seems crazy to completely eliminate ASCII strings just for this one interface !!
-- rick
Comment 49•24 years ago
|
||
>i can't imagine that performing a utf-8 to unicode conversion before calling
>into javascript will kill our performance.
Right, that's not the problem. I think darin meant C++ code performance if we
used wstring or anything widening for utf-8.
dmose has use-cases for utf-8 strings, it's not just nsIURI.
But what I would like to emphasize (not sure this helps, tell me why not, if
not) is that we aren't losing ASCII strings -- UTF-8 is a proper superset of
ASCII (ASCIZ, I mean). We are losing the current string=>wstring performance
profile of inflating char to PRUnichar, if we go this route. Instead, for an
ASCII string, we'll inflate only after testing bit 7 in each char. I don't fear
that right now.
I'm probably missing something, however! There are lots of different cases.
/be
Comment 50•24 years ago
|
||
> we aren't losing ASCII strings
But you *are* losing ASCII strings.
You are making *everyone* pay the cost of not being able to have a contract that
says ASCII and only ASCII when a *few* want a superset. The cost is more
checking everywhere and/or more uncertainty in general.
Comment 51•24 years ago
|
||
I see, we're losing a "contract" that only XPConnect enforced.
I'm full of insecurity right now. UTF-8 strings will leak, I bet they are now.
How less secure are we by facing this problem and switching string from "I hope
it's ASCII and XPConnect debug builds may spank me" to "I know it's UTF-8 and
maybe it'll leak into some old stdio and print garbage (to a debug-only log)"?
/be
Comment 52•24 years ago
|
||
I don't understand this concern about a contract for "ASCII and only ASCII".
Few if any contracts will accept arbitrary strings taken from the entire range
of ASCII, including control characters. A contract will normally specify some
strict subset of the language defined by NUL-terminated ASCII strings as being
legal.
Comment 53•24 years ago
|
||
rpotts: what brendan said.
currently nsIURI::host is declared |string|, but it's really UTF-8 to allow for
internationalized domain names (this also applies to nsIURI::spec and nsIURI::
prePath since they contain the hostname). at one time the hostname was URL
escaped, but in order to solve a security problem with embedded %00 sequences, i
(perhaps naively) decided to do away with URL escaping the hostname. (this
actually matches the behavior of other browsers (such as IE and Nav4) as far as
i can tell.)
i say that this was perhaps done naively because i was unaware at the time that
|string| was meant to "strictly?" represent a null terminated string of ASCII
characters. as i've learned more about this issue, i see that there are many
pitfalls resulting from this... namely having to do with xpconnect, since it
will not know to treat nsIURI::host as UTF-8.
so, if this usage of |string| is invalid (and if there will never be a UTF-8
string type), then i need to find some other solution for nsIURI. it probably
needs to have AString attributes in addition to ASCII attributes for the
portions that need to be in network ready format. i'm not sure what this will
mean to the implementation of nsStandardURL. there will likely be foot-print /
run-time performance issues to consider.
maybe all i need to do is go back to URL escaping the hostname and then try to
solve the issues w.r.t. %00 and other fun escape sequences in another way :-/
Comment 54•24 years ago
|
||
darin, it's not just nsIURI.
I wrote "UTF-8 strings will leak" and probably confused people -- by "leak", I
meant that these char arrays will flow through C++ and even C code into places
that do not expect UTF-8, and be wrongly widened to PRUnichar arrays simply by
inflating each char with a zero high byte.
I don't know that this is a worse problem than the "leakage" of UTF-8 strings
into code that expects only ASCII, but I suspect it is. Mozilla is a fairly
closed system, apart from debug logging, for portability and localizability.
Code tends to do AssignWithConversion more than pass a char array off to a stdio
or other legacy ASCII-only routine.
Anyway, I think a "contract" that's enforced only by debug builds for data
flowing through XPConnect is likely to be observed in the breach more than we
would like. I prefer to think we can change such a contract still, document the
change, and flush out bugs. Who is with me?
/be
Comment 55•24 years ago
|
||
darin: how does UTF8 help with embedded nulls? If you're using 0xC080 to
represent null then you're violating the UTF8 spec which says that each char
must be represented in shortest form. NULL must be 0x00 as in ASCII.
Our UTF8 converters probably don't validate for speed reasons, but don't count
on it staying that way. We don't want to leak invalid UTF8 outside Mozilla so we
may end up adding validation to debug builds.
Comment 56•24 years ago
|
||
Our UTF8 converters validate strictly for security reasons.
Comment 57•24 years ago
|
||
internally, our xpcom-centered utf8 converters do not validate (i.e.
NS_ConvertUTF8toUCS2)..
Comment 58•24 years ago
|
||
NS_ConvertUTF8toUCS2() calls class ConvertUTF8ToUCS2, which does validate. Note
the use of minUCS4 in ConvertUTF8ToUCS2::write() -- a C0 80 sequence will decode
to U+FFFD.
Reporter | ||
Comment 59•24 years ago
|
||
It seems like we need to choose our evil here. We need UTF8 functionality one
way or another. If noone can afford the time to evolve a new UTF8 type &
typelib functionality (and I don't yet see any volunteers :-), then I agree with
brendan that we need to redefine |string| slightly.
If the extra checking turns out to be a real perf problem, we could go the
more-uncertainty route:|string| could be defined in general to be UTF8, but
XPconnect could continue to treat |string| as it does today, as ASCII: |string|
would be inflated with zeros and UCS2 would be truncated to get back to
|string|. This would force the few JS callers that need to use these interfaces
with UTF8 data to do the UCS2<->UTF8 munging by hand.
Reporter | ||
Comment 60•24 years ago
|
||
Note that in my previous comment, when I say "ASCII", I really mean "8-bit
ASCII". XPconnect currently does treat |string| that way.
Comment 61•24 years ago
|
||
By "8-bit ASCII" I think you really mean "ISO-8859-1".
Comment 62•24 years ago
|
||
dveditz: i should have been more specific: i was talking about the case of a
href with an embedded %00 sequence in the hostname, such as:
www.amazon.com%00www.nasty.com
back in the days when we used to URL escape hostnames, this would have gotten
unescaped in certain places... which, has the unintended side effect of
introducing an embedded null. a security hole was discovered that exploited
this unescaping, so we decided to disable URL escaping and unescaping of the
hostname to counter this exploit (see bug 110418).
ok, that aside... i've thought more about my previous comments. URL escaping is
not the solution for nsIURI. perhaps it's worth it for me to explain why.
nsIURI setters currently allow for unescaped input, and nsIURI getters are
expected to generate unescaped output. this means, that for attributes like
nsIURI::host, escaping will not help us, because the nsIURI::host getter will
still want to return a UTF-8 string.
so, what does this mean for nsIURI? well, if |string| were modified to mean
UTF-8, then nsIURI, it's consumers, and all of it's implementations would remain
as is.
if |string| were to mean ASCII (or ISO-Latin-1) exclusively, then nsIURI would
have to be changed to use wstring (or AString) attributes. this would be a
costly change, not only in development time but also in performance (resulting
in either increased foot-print or increased execution time).
you might be tempted to solve nsIURI's problems by changing the escaping policy,
such that the attribute values would always be in escaped form, but this would
also be a very difficult solution to implement.
OK, i think i've convinced myself (perhaps twice now) that nsIURI and it's
consumers would really benefit from |string| becoming UTF-8. existing code that
uses |string| to represent raw network data can be modified to use [array]
octet, without even breaking any existing C++ code :-)
so, brendan's plan get's my vote!
Comment 63•24 years ago
|
||
This bug needs to be pulled back from mozilla1.0.1 to mozilla0.9.9 -- dbradley,
can you do it? We can certainly muster helpers, jag, darin, dmose, and myself
among them, to check for NS_ConvertASCIItoUCS2 calls that need to be converted
to NS_ConvertUTF8toUCS2.
/be
Comment 64•24 years ago
|
||
Pulling back to 0.9.9.
I'll be in Mountainview next week. We can discuss it then. I'll take a look this
weekend at it.
Target Milestone: mozilla1.0.1 → mozilla0.9.9
Comment 65•24 years ago
|
||
yeah, if i can help out with any of the grunt work just let me know.
Comment 66•24 years ago
|
||
I certainly see where jband is coming from - tracking down encoding related
errors could be gruesome. Take PyXPCOM for example; this change gives me a
number of dilemmas about how to handle string conversions. Every option has a
significant downside, including too much encoding magic (Python has a default
encoding (generally strict 7-bit ASCII), Moz uses UTF-8, on Windows all the
narrow CRT functions use "mbcs" encoding, etc) causing strange and hard to
reproduce failures when first tested with non 7 bit ascii data.
So I would prefer to see this not go ahead, but jband's idea of a new IDL type
explored. I do, however, appreciate the reality of the situation and that my
2c doesn't change it :)
Comment 67•24 years ago
|
||
Trying to put the finishing touches on a patch, I ran into something that I
wanted to get a concensus on. For a type T_CHAR (single character) that is being
translated into UCS2, how should I handle that? Should I still call the UTF8 to
UCS2 function or assume that it can only be ASCII and leave it as it is. There's
warning code for testing the high bit, but I'm not familiar enough with UTF8
encoding to know if there's a valid UTF8 character that is one byte and has the
high bit set. My take is that it doesn't make sense to use T_CHAR for UTF8 and
that it should be assumed ASCII, leaving the assert in.
Conversely if a conversion from UCS2 character converted to T_CHAR I probably
should fail the conversion if it would result in a non-ASCII character?
Comment 68•24 years ago
|
||
Any UTF8 byte with the high-bit set is part of a multi-byte sequence (or invalid).
Comment 69•24 years ago
|
||
Here's a first whack. I haven't double checked all the uses of ASCII strings
yet. What I wanted to get feedback on is the use of NS_ConvertXXXtoXXX classes
and if there are more efficient methods to do this.
The complicated part was the fact that the JSData2Native is externally told
whether or not to allocate the string. And if told not to alloc it just returns
back the deflated string JS provides which from what I could tell truncates and
does not translate the UCS2 characters. So I adjusted the callers (where I
think it matters) to call it with useAllocator to true and set the is allocated
flag on the variant. It's not a great solution if someone comes up with a good
alternative let me know. In the mean time I'll double check to see if I missed
other conversion and if I can come up with a better solution for the
useAllocator issue.
Also need some UTF8 test cases to pump true UTF8 through this code.
Comment 70•24 years ago
|
||
dbradley, please contact ftang, I think he wrote the old code.
Reporter | ||
Comment 71•24 years ago
|
||
dbradley writes:
> My take is that it doesn't make sense to use T_CHAR for UTF8 and
> that it should be assumed ASCII, leaving the assert in.
>
> Conversely if a conversion from UCS2 character converted to T_CHAR I probably
> should fail the conversion if it would result in a non-ASCII character?
Both of these behaviors sound to me like the right thing.
Comment 72•24 years ago
|
||
I'm w/ jband here. 'string' has strong meaning of ASCII only. Changing that to
include UTF8 changes the ground rules and would be bad (too many conversion/leak
bugs to track down w/ consumers). It's a pre-1.0 contract sure, but, a contract
none-the-less. There's grey area around contract integrity pre-1.0, but, this
one lands on the "let's not break this contract" side of things IMO. We have
lots of pre-1.0 contracts that can't be broken (see
http://lxr.mozilla.org/seamonkey/search?string=us+frozen).
Brendan: Mozilla is not a "fairly closed system." We have *many* consumers of
'string' (both JS and C++). Pulling a switcheroo (sp?) on them would be rude.
"Oh, that's UTF8 all of a sudden? Now you tell me; I've been leaking all along.
Oh, I need to do all this encoding/conversion magic; that bites!" Of course, if
they're going to be thanking us later for it, that's another thing, but, my
crystal ball broke a couple of years ago.
As for the re-negging on the nsIURI::host 'string' contract (now it can include
UTF8), Darin, please add a comment to the interface indicating that it can be
UTF8 so unsuspecting consumers know it's an exception (for now).
I see LDAP and nsIURI::host's need for UTF8 to pale in comparison to the system
(open and closed) wide impact of this mod.
IMHO, we pound out (I can't sign up for the work :-/ ) a new string type, or
special case the UTF8 needs in a new interface. dmose, customers needing your
LDAP data can use said interface, and we can either leave exceptions
(nsIURI::host) as exceptions, or push them (it.?) to use the new interface.
Comment 73•24 years ago
|
||
Extending string from US-ASCII only to be capable of handling UTF-8 does not
"break" existing contracts unless such contracts use the current undocumented
semantics of permitting ISO-8859-1. Existing contracts restrict the set of
legal values for string arguments to some subset of the language of US-ASCII
strings; these languages are still valid subsets of the language of UTF-8 strings.
Put another way, extending string to handle UTF-8 does not itself require any
existing contract to start accepting UTF-8.
Interfaces that handle data which originate from untrusted sources cannot now
rely on the "string" type carrying data conforming to the US-ASCII subset. They
are already required to fail safe when given non-ASCII data.
Comment 74•24 years ago
|
||
No. The proposed change *would* change contracts. The current xpidl spec is that
*all* 'string' values are zero terminated arrays of 7bit ASCII chars. Any
consumer of type 'string' can reasonably expect to only recieve 7bit ASCII
strings. That is part of the contract. Saying that 'stirng' is extended to
include other values can result in those implementations being given values that
fall outside of range of values they were promised. They had no preexisting need
to protect themselves from such values. This is not only possible, but we'd be
explicitly changing XPConnect code such that JSStrings that previoulsy did not
convert to anything other than 7bit ASCII would now create UTF-8. This is a big
change. (Yes, it can be argued that the existing conversion code *ought* to
detect the loss of information). This is absolutely a unilateral change of
contract for all 'string' consumers.
Comment 75•24 years ago
|
||
No to both point.
The current xpidl spec may say that all 'string' values are zero terminated
arrays of 7bit ASCII characters, but the current implementation treats these as
zero terminated arrays of 8bit ISO-8859-1 characters. Security sensitive
implementations currently *must* safely deal with non 7bit characters in these
strings as the current implementation permits such values to be passed from
untrusted sources.
Changing the definition of 'string' does not change existing contracts using
'string'. Take for example nsIClassInfo::contractID. A contractID has a
specific syntax. Consumers of nsIClassInfo::contractID do not currently have to
handle the full range of US-ASCII strings in values received from this
interface, they only have to handle values that are valid per the syntax
specified by the nsIClassInfo contract. Just as consumers of this interface
don't have to handle contractID values consisting of, say, 103945 consecutive
Control-G characters, consumers of this interface after the proposed interface
would still not be required to handle a contractID value containing, say, a
U+0234 character.
Comment 76•24 years ago
|
||
All this talk about contracts is making me queasy; it reminds me of the scene in
"Play It Again, Sam" where Woody Allen's wife is leaving him:
Wife: "My lawyer will call your lawyer."
Woody: "I don't have a lawyer -- have him call my doctor."
We have more than a "little bit" of contract violation in the current codebase,
and even a little bit can hurt. I don't see how valeski's comment 78 request
that darin warn consumers of nsIURI attributes about the UTF-8 hazard helps, if
those callers are written in JS or another language that current inflates char
arrays to 16-bit-character-type arrays. Such callers will get bogus ISO-Latin-1
code points instead of proper UTF-8 decoding (and JS ones will be spanked by
XPConnect's 8th-bit-must-be-zero check in debug builds).
Jud, is the proposal that we remove the XPConnect sanity check and make all such
callers use a UTF-8 decoder? That removes all "contract" enforcement and adds
avoidable costs to such callers. See comment 44 for darin's rejection of using
other existing, scriptable string types.
I am not proposing a utopia here. I still view switching string as the lesser
evil, given all the constraints.
/be
Comment 77•24 years ago
|
||
John Myers, the point I'm making (and that I believe jband is making as well) is
that regardless of ASCII being a subset of UTF8, there is an implied usage
semantic that 'string' is null terminated, and that no decoding has to occur.
Wrong or right (technically it's wrong, I agree), it's a contract we'd be breaking.
Brendan, if you're queasy now, maybe you should jump into some of the UA string
mod discussions going on elsewhere ;-). I'm not proposing switching XPConnect's
definition of 'string' (from ASCII to UTF8) and pushing the UTF8 decoding onto
callers. IMO, that's asking callers to do a lot of work for what, AFAICT, is an
edge case that can be handled by a "new" (non existing) string type that
explicitly defines itself as UTF8. If I can be convinced this isn't an edge
case, then my view changes. If there is true benefit to exposing UTF8 all over
the place, then, asking callers to explicitly decode doesn't feel as bad.
I'll note here that whatever version of IE I'm running has a preference checkbox
to "treat all URL strings as UTF8." That to me feels like IE is testing the
water w/ the very contract change we're discussing here (changing char * to mean
UTF8 instead of ASCII).
Comment 78•24 years ago
|
||
The semantic that 'string' is null terminated would not be changed by the proposal.
I have no idea what you mean by "no decoding has to occur". Any language that
uses 16-bit string types has to decode 'string' somehow.
Comment 79•24 years ago
|
||
(I just bet that IE already has their entire story except "what does the user
expect us to do with the incompatible content on the web?" nicely in hand.)
I want us to have a widely-used UTF8 type, because I want to be able to gently
migrate us away from the pervasive bloat of UCS2 when the vast (vast!) majority
of the data we handle is ASCII. If this change (string == utf8) will get us
closer to that (and I think it will), I am _all_for_it_.
(Even if we have to sacrifice counted strings to do it.)
Reporter | ||
Comment 80•24 years ago
|
||
It may be interesting to think about exactly how the additional XPConnect
decoding load would affect interfaces that don't need it. As jgmyers just
pointed out, we need to traverse the string once anyway to inflate it even in
the current world. So for interfaces that only actually use 7-bit ASCII, the
cost would be still be O(n) for strings of length n, with the constant being
made slightly larger because each char will have a single test of the high bit,
which will fail. Exactly what this means in terms of real-time cost is left as
an exercise to the reader, but my suspicion is that this is gonna be lost in the
noise of crossing XPConnect. In languages that support 8-bit strings natively
(eg C++), there's no "decoding" that needs to be done at all.
Comment 81•24 years ago
|
||
I wanted to comment briefly saying that jband set me straight on XPConnect's
current enforcement code being not just for DEBUG builds -- he gleefully clears
the eighth bit in each char when inflating string to JS string. Sorry if this
was known, I missed it and kept babbling as if it weren't the case. It means
that our scripted uses of nsIURI attributes are not going to work with IDNS, but
that's also a big "duh".
So a bunch of people got together and hashed out an alternative proposal that
rpotts, jband, and vidur presented: implement an nsAString subclass for UTF-8
and use that in nsIURI and LDAP. Darin'll be commenting with the details soon,
but while I was here I thought I'd spread the word about the new proposal.
/be
Comment 82•24 years ago
|
||
brendan: sorry, we miscommunicated. I did not mean that xpconnect clears the
high bit of each 8bit char, but that it uses only the low 8 bits of each 16bit
char - throwing away the high 8 bits. Really it just lets the JS engine do the
conversion in its simple way by calling JS_GetStringBytes...
http://lxr.mozilla.org/seamonkey/source/js/src/xpconnect/src/xpcconvert.cpp#637
XPConnect does not currently enforce in release builds that nothing other than
7bit ASCII is passed. It just punishes you (by mangling your data) if you do
pass something else. This is certainly not ideal. There is an old bug of mine on
dbradley's list about defining and enforcing a coherent set of data conversion
rules for xpconnect - esp. in edge cases where the caller is actually passing
'illegal' values.
John.
Comment 83•24 years ago
|
||
Ok, so the proposal that brendan aluded to in comment #81 is the following:
1) modify nsAString to have setters that take UTF-8 strings |const char *|.
2) create a nsAString implementation (nsSharableUTF8String??) that implements
nsAString with UTF-8 storage instead of UCS2 storage.
3) change all uses of |string| that really intend UTF-8 to use AString.
This solution is good for a number of reasons. First of all, it does not impact
the typelibs at all, since we can use an existing string type (namely AString).
This means that there would be no changes required to xpconnect. Also, callers
of the nsIURI getters could now choose between the different implementations of
AString and either get a UTF-8 result or a UCS2 result. It is even possible that
some of the nsIURI attributes could now be "shared" instead of strdup'd.
This feels like the right compromise between adding a new UTF-8 specific string
type to xpidl and forcing |string| to mean UTF-8.
If there is rough concensus that this is the right thing to do, then I'll file a
bug and get started :-)
Comment 84•24 years ago
|
||
so jag and shaver demonstrated on IRC why this solution is not so straight-
forward...
it goes against the idea that charset conversions should be explicit for one
thing, and that can mean trouble. another problem: nsAString::BeginReading
cannot be easily implemented if the storage type is UTF-8.
perhaps jag and/or shaver will want to elaborate further on these problems.
Comment 85•24 years ago
|
||
See also the log I'll attach...
The point of BeginReading was that once a UTF8String is passed through a
nsAString interface, you don't know whether it's a UTF8String or a nsString or
whatever, so it should behave like a UCS-2 string and hand you an iterator on a
UCS-2 buffer there. In other words, you're going to end up with a lot of
implicit conversions.
The alternative is adding UTF-8 (read: const char*) methods to nsAString (and
all its derivatives), which means adding support for implicit (expensive)
conversions back and forth to UCS-2, which is something we've been trying to
avoid in the current string design.
My question is, say we accept the above as a necessary complication to avoid
having to add another xpidl type and to avoid having to go through and fix our
functions that can't cleanly deal with UTF-8 input ('coz they're expecting pure
ASCII or some form of extended ASCII), and prevent our code from handing
non-ASCII input to outside systems, how is xpconnect supposed to know that for
one |AString| it's supposed to hand over the string as 16-bit UCS-2 encoded
characters, and for another |AString| it's supposed to take the string and
perform a UCS-2 to UTF-8 conversion on it and store it in a UTF8String? Or
alternatively, assuming that the string is already in UTF-8 (but in 16-bit
characters), chop off the high 8 bits and store it in a UTF8String? One way
would be to mark the parameter somehow (which I think was rejected earlier as an
option on |string|, but if possible and chosen I'd prefer the generated
interface to have |nsUTF8String&| as its parameter), the other is that you don't
discriminate between the two, pass UCS-2 across the XPCOM boundary, and let
UTF-8 conversion happen automatically (since all nsAStrings then will have the
hidden auto UCS-2 to UTF-8 conversion), in which case I suggest you set up a
non-scriptable interface for the |const char*| version of these methods that
takes UTF-8 encoded strings, and a scriptable version that takes UCS-2 encoded
strings, which would do the conversion explicitely (or not until necessary, as
suggested by shaver).
Comment 86•24 years ago
|
||
Comment 87•24 years ago
|
||
ok, i'm very discouraged by all of this. there does not seem to be a good
solution for nsIURI given the current constraints of XPCOM.
people at yesterdays meeting all seemed to think that nsIURI should use AString
attributes (myself included). now jag tells me that AString cannot be used
without their being implicit or explicit conversions to UCS-2 (and i agree),
which is NOT acceptable for nsIURI. why should we bloat our URL structures to
use UCS-2 just to support intl domain names? mozilla cannot even use intl
domain names w/o a plugin from i-dns.net! and their plugin is not open-source!
any solution to the nsIURI problem that imposes an UCS-2 encoding is an
unnecessary performance burden. all we need is a way to safely transmit UTF-8
across a XPCOM border w/o converting to another character encoding. are we
really happy with the fact that there is no way to do this in an efficient
manner? why is it ok that one should have to convert UTF-8 to UCS-2 just to
play with XPCOM?
i do not think it is correct to slow down mozilla just to be conformant w/ the
current rules of XPCOM when XPCOM itself could change to avoid the cost.
the original bug description by dmose sums up this issue the best, and i think
we've gone around in circles trying to get around it:
"The short story is the nsIURI is moving in the direction of being UTF8 encoded
rather than ASCII encoded. Since we need to be able to deal with this
*reasonably* from JS, and as a way to allow folks to write *less bloated*
scriptable interfaces, shaver suggested, with some hesitation, adding a [utf8]
qualifier to xpidl."
i really believe that this problem would be best served by the addition of a new
string type that is UTF-8 specific. i don't care too much if this manifests
itself as |char *| in C++ or if it uses some special string class (how about
nsACString?). i think our C++ code can deal... besides we already have things
like NS_ConvertUCS2toUTF8, which introduce UTF-8 into our C++ code, and there is
no protection against using the results from NS_ConvertUCS2toUTF8 with code that
doesn't support UTF-8... grep 'printf.*UCS2toUTF8' if you're not convinced!!
so what's the problem with this simple approach.
if making a change to the typelibs is difficult now... i imagine it will only
get more difficult later on. why not do the grunt work of expanding the
typelibs now (before 1.0)? add a utf-8 string type... we have good use for it.
and |string| should not imply any particular charset encoding IMO. it seems
wrong to put any restrictions on |string|. anyways, we don't enforce a charset
in optimized code, so what's the big deal? there are so many places in our code
where |string| is used for passing around byte arrays that make use of the 8th
bit... i think it is unreasonable to say that we should go now and "fix" all of
these "broken" uses of |string|. there's nothing broken about them. |string|
just means null terminated array of bytes, and |wstring| just means null
terminated array of double-bytes... can't we just live with this definition?
i'm suggesting that we really have no choice... this definition has grown up all
around us... interfaces make use of it, and if mozilla is ever to reach 1.0 we
can't start enforcing a particular charset on |string| now... not ASCII, not
UTF8. i think we really have to just leave |string| alone... it is a null
terminated array of bytes whether we like it or not. too late to change this
now, but we can add another string type that would enforce a specific character
encoding, namely UTF-8.
why are we making our lives so difficult? let's just add a UTF-8 string type
and be done with it. haven't we tested all other options? i don't see any
other solution besides something along the lines of the original proposal.
Comment 88•24 years ago
|
||
darin:
> there are so many places in our code where |string| is used for passing
> around byte arrays that make use of the 8th bit...
Really? Apart from image data going through nsIInputStream, where else?
> i think it is unreasonable to say
> that we should go now and "fix" all of these "broken" uses of |string|.
> there's nothing broken about them. |string| just means null terminated array
> of bytes, and |wstring| just means null terminated array of double-bytes...
> can't we just live with this definition?
First, and perhaps least important, you are arguing against history. Jband is
right that string originally and mostly meant ASCII, even if that meaning wasn't
enforced well. Second, the stronger reason we can't just treat string as a
nul-terminated byte string is that XPConnect and layers like it need to know how
to convert string to JS string or a similar UCS2-like string (or to UTF-16, or
to UCS-4, or whatever wider code another language or subsystem may require). If
we merely inflate by interleaving zeroes, then we *are* saying something more
about such strings: that they are ISO-Latin-1 character sequences. That
contradicts your premise.
We can't have it both ways: string means a character sequence, not a byte array
with no zero bytes. This character sequence definition begs the question of
"which charset" (as well as what about NULs, which we have answered both ways:
NUL terminated unless a [size_is(aCount)] property is used). The answer has
been "ASCII". I am proposing "UTF-8" without much conviction, but to cut the
Gordian knot or smoke out a better plan.
Say we leave string as ASCII and fix the places that want a byte array to use
[array, size_is(aCount)] in octet aBuffer, or whatever. For "inplaceout" type
data transfer, _a la_ Unix read(2), we'd need these methods to be [noscript] or
even [notxpcom]. Ok, back to the question of UTF-8.
Can we not do as darin says, and extend typelib format to include a UTF-8 string
type *now*, before 1.0, more easily than later, after 1.0? Won't it always be
harder, never easier? Jband, dbradley: can you guys provide more detail on the
costs (like, it would take dbradley three weeks) of adding a new string type and
supporting old typelibs?
/be
Comment 89•24 years ago
|
||
> Really? Apart from image data going through nsIInputStream, where else?
ok, some examples from netwerk/base/public where the ASCII requirement of
|string| is currently violated:
interface nsIIDNService {
// input is UTF-8
string UTF8ToIDNHostName(in string input);
};
interface nsIIOService {
// all of the uses of |string| here are actually UTF-8
nsIURI newURI(in string aSpec, in nsIURI aBaseURI);
nsIChannel newChannel(in string aSpec, in nsIURI aBaseURI);
string extractScheme(in string urlString,
out unsigned long schemeStartPos,
out unsigned long schemeEndPos);
string extractUrlPart(in string urlString,
in short flag,
out unsigned long startPos,
out unsigned long endPos);
long extractPort(in string str);
string getURLSpecFromFile(in nsIFile file);
void initFileFromURLSpec(in nsIFile file, in string url);
};
interface nsIProtocolHandler {
// aSpec is UTF-8 encoded
nsIURI newURI(in string aSpec, in nsIURI aBaseURI);
};
interface nsIProtocolProxyService {
// host is UTF-8 encoded
nsIProxyInfo newProxyInfo(in string type, in string host, in long port);
void addNoProxyFor(in string host, in long port);
void removeNoProxyFor(in string host, in long port);
// url is UTF-8 encoded
void configureFromPAC(in string url);
};
interface nsISocketTransportService {
// host is UTF-8 encoded
nsITransport createTransport(in string host, ...);
nsITransport createTransportOfType(in string socketType,
in string host, ...);
nsITransport createTransportOfTypes(..., in string host, ...);
};
interface nsIStreamLoaderObserver {
// result is not limited to ASCII... it is network data.
void onStreamComplete(...,
in unsigned long resultLength,
[size_is(resultLength)] in string result);
};
interface nsIURIChecker {
// uri is UTF-8 encoded
nsIRequest asyncCheckURI(in string uri, ...);
};
interface nsIURI {
// these are all UTF-8 encoded
attribute string spec;
attribute string prePath;
attribute string host;
};
interface nsIStandardURL {
// initialSpec is UTF-8 encoded
void init(..., in string initialSpec, ...);
};
interface nsIURLParser {
// spec is UTF-8 encoded
void parseURL(in string spec, ...);
// authority is UTF-8 encoded
void parseAuthority(in string authority, ...);
// serverinfo is UTF-8 encoded
void parseServerInfo(in string serverinfo, ...);
};
here's some examples from xpcom/io where the ASCII requirement of |string| is
violated:
interface nsIStringInputStream {
void setData(in string data, in long dataLen);
[noscript] void shareData(in string data, in long dataLen);
};
interface nsIScriptableInputStream {
// returns null terminated data
string read(in unsigned long aCount);
};
interface nsIOutputStream {
// buf is just data, length is count
unsigned long write(in string buf, in unsigned long count);
};
interface nsISearchableInputStream {
// forString can be any null terminated hunk of bytes
void search(in string forString, ...);
};
interface nsIFastLoadService {
// not that fastload should include iDNS hostnames, but...
void startMuxedDocument(..., in string aURISpec, ...);
};
interface nsIFastLoadFileControl {
// not that fastload should include iDNS hostnames, but...
void startMuxedDocument(..., in string aURISpec);
};
interface nsIBinaryInputStream {
// documentation says that this is a 8-bit string
string readStringZ();
};
interface nsIBinaryOutputStream {
// documentation says that this is a 8-bit string
void writeStringZ(in string aString);
};
OK, so if |string| means 7-bit ASCII, then all of these interfaces would need to
change. I really don't look forward to changing all of this code before mozilla
1.0.
Why place such strict requirements on what |string| can be? I don't see what it
buys us. Just let it be an opaque string of bytes. Always inflate to double
byte using zero padding. If this results in something that cannot be displayed,
so be it. It probably doesn't matter since the |string| value probably wasn't
meant to be displayed.
This will work nicely if we have a UTF-8 string type. Then we can fix the
string attributes/parameters that know they will not be opaque but rather UTF-8
if 8-bit at all.
Comment 90•24 years ago
|
||
Darin, I was asking for places where we pump opaque, zero-terminated (or not if
you use size_is, as I pointed out) octet arrays. The UTF-8 cases you cite,
which are relevant to this bug, are not germane to the question of "should
string be an octet array, whether terminated or counted?"
The xpcom/io cases that matter should use octet arrays, I think. Then a caller
such as JS won't get the unwanted inflation that presumes the octets are code
points in a certain character set (never mind that JS-without-JS2-or-extensions
will get a big fat JS Array of uint8-domain elements). If the transfer mode is
not scriptable, then forget about JS -- the C++ signature will still be char *.
It seems to me you are trying to simplify the work before 1.0 at the expense of
the (string) => (JS or other, wider-char string) problem. But I admit, you are
matching current (ab)usage and minizing effort! You're in danger of winning
right there.
So perhaps the opaque octet array type is off-topic for the current bug,
although I hope it's still of interest to the cc: list. In all this, you and I
have not disagreed on the need for a new XPIDL and typelib string type for
UTF-8, but we're not helping design or implement that. We need someone to say
"it'll take me two weeks" or "I did it, here's the patch!" dbradley, jband: can
you spew some words of advice? Maybe someone else will do the grunt-work (I
swear, I will soon if I get fed up enough with this logjam).
/be
Comment 91•24 years ago
|
||
brendan:
right, my examples do not exactly answer your question, but i needed to spew out
that list because i think of lot of folks don't appreciate how many interfaces
actually "mis-use" |string|. you're question just prompted me to spew out the
list of such instances.
but, my point stands unanswered: why is it important to be strict about the
character encoding implied by |string| attributes? why not just let it be the
case that playing w/ the 8th bit leaves you to fend for yourself... that XPCOM
will not guarantee that the string will be interpreted as anything but an array
of bytes (meaning that |string| to double-byte representation would result in 0
padding, and double-byte to |string| would result in truncation of the upper 8
bits). this is what we currently do (with the exception of a debug warning),
and i think it is OK. it works if the data is 7-bit ASCII... ie. the result in
JS land can be displayed... can be matched to a charset. but, when people play
with the 8th bit they are on their own. would this really be too flexible? do
we need the added structure of requiring |string| to be 7-bit ASCII? i don't
see why. moving to the octet solution requires support for inplaceout, which we
don't have. moreover it'll slow down JS code (like chatzilla) which doesn't
need to be any slower. JS2 is not here yet, so we need to deal with performance
issues given the JS that we got. afterall, if we've gotten this far without
enforcing |string| == 7-bit ASCII, then what's the real problem?
anyways, back to the discussion of UTF-8... jag has said that he is willing to
work on and support a separate UTF-8 string heirarchy. he'll get some help from
me, but as for the required changes to XPIDL to support a UTF-8 string class...
i'm obviously not going to be of much help there.
Comment 92•24 years ago
|
||
Will the new variant class need to handle UTF8 strings or can it get away with
ignoring it?
Comment 93•24 years ago
|
||
Also would adding a type to the idl affect/break the Python and maybe Perl
connects?
Comment 94•24 years ago
|
||
It may possibly break Python :( I will ensure that whatever is decided will
work eventually, though :)
Comment 95•24 years ago
|
||
I want to make sure I understand the latest direction. We're looking at building
a new string class to support UTF8. We need a new type in the idl that will map
to this. We're willing to break typelib compatibility. XPConnect conversion
routines will have to be adjusted to convert this type. Python, nsIVariant, and
Perl may be affected.
At this point I have a pretty full list of 0.9.9 bugs, plus I'm tasked with a
non-mozilla project as well. I can get this done, but I suspect some of my 0.9.9
bugs may have to be sacrificed.
Status: REOPENED → ASSIGNED
Assignee | ||
Comment 96•24 years ago
|
||
David, I can help out. If you list out the to do items, I'll go hack on the
code. Hopefully, you won't delay your 0.9.9 bugs and we'll get this in too.
Comment 97•24 years ago
|
||
What are we going to call the string class and the type in IDL? Is UTF8String
for IDL ok?
Comment 98•24 years ago
|
||
UTF8String sounds good to me as the name.
I'm psyched that nisheeth can help -- big thanks to him!
/be
Comment 99•24 years ago
|
||
UTF8String or AUTF8String... given that we have AString, shouldn't it be
AUTF8String? however, i'd have to say that UTF8String is slightly easier to read.
Comment 100•24 years ago
|
||
The 'A' in "AString" is for "Abstract". If UTF8String is a concrete type
implementing (derived from) nsAString, then I see no need for an 'A'.
/be
Comment 101•24 years ago
|
||
concrete class, really? i didn't think we'd want to pass concrete classes via
XPCOM. shouldn't it be an abstract class, with virtual functions only? so as
to hide the implementation, etc, etc.? maybe i'm missing something.
Comment 102•24 years ago
|
||
On the one hand, we don't bother to hide the "implementation" (memory layout) of
string or wstring right now, so why do we need it for utf8string?
On the other, we apparently needed a bunch of options for AString, so we only
specified the abstract base. Do we not need all these options (Sliding,
Sharing, COW, whatever) for UTF-8 encoded stuff? If so, I think we need
UTF8String to be abstract. If not, why not just specify it as a sequence of bytes?
Comment 103•24 years ago
|
||
I mentioned concrete because we want static type checking benefits: we don't
want the compiler to silently swallow lots of hidden runtime conversions to and
from UTF-8. That's an argument against implementing nsAString with UTF-8
storage and declaring victory in this bug.
But if we support nsACString in typelibs, and make a concrete class nsUTF8String
that implements nsACString, do we need a more specific IDL type? The issue is
what .Length() and iteration over elements numbered by [0, length) count --
bytes or characters? I understand current nsACString users want Length to count
bytes. But UTF8String users would want Length to count characters, and string
users are too comfortable already with calling Length (to the point of ignoring
IsEmpty). Jag, please advise.
If we do need a more specific IDL type for UTF-8, I don't see the need to put a
gratuitous 'A' in its name.
We were talking about all this just now, and I believe a consensus in favor of
UTF-8 first class support in typelibs and strings emerged. We (I especially)
feel that UTF-8 should have been used in the ancient days instead of PRUnichar
vectors (wstring, "UCS-2" but not really). We're facing plane 1 character codes
coming in via certain locales (Korean, I hear), and if we have to switch to an
encoding that doesn't store characters in equal-sized units, we might as well
use UTF-8 and save space in this hemisphere.
/be
Comment 104•24 years ago
|
||
"Suicide is Painless"
Words by Mike Altman
Music by Johnny Mandel
Through early morning fog I see
Visions of the things to be
The pains that are withheld for me
I realize and I can see that
Chorus: Suicide is painless
It brings on many changes
And I can take or leave it if I please.
I try to find a way to make
All our little joys relate
Without that ever-present hate
But now I know that it's too late, and
(chorus)
The game of life is hard to play
I'm going to lose it anyway
The losing card I'll someday lay
And this is all I have to say, that
(chorus)
The only way to win is cheat
And lay it down before I'm beat
And to another give a seat
For that's the only painless feat, cause
(chorus)
The sword of time will pierce our skins
It doesn't hurt when it begins
But as it works its way on in
The pain grows stronger - watch it grin
(chorus)
A brave man once requested me
To answer questions that are key
Is it to be or not to be?
And I replied, "Oh why ask me?", cause
(chorus)
And you can do the same thing if you please.
======
Let's choose carefully :-)
-- rick
Assignee | ||
Comment 105•24 years ago
|
||
OK, just so that everyone is on the same page, here's what I plan to do:
a) Add two new types to XPIDL: UTF8String and CString. (I know there's a holy
war on to decide what to name them. I'm gonna avoid the "A" prefix for now.
I'll change it to the consensus demand before I land the patch)
b) In XPConnect, hook up UTF8String and CString types to the native nsACString
object for now.c) Add a cmd line option to XPIDL that generates old style type
libs.
d) Make XPIDL barf at IDL that uses the new types if the old style type lib cmd
line option is set.e) Jag will create a new nsUTF8String class that inherits
from the ACString abstract base class. Once that is done, I will change
XPConnect to hook up the UTF8String type to native nsUTF8String objects.
Sounds good?
For (e), shouldn't it be distinct from both nsACString and nsAString (i.e,
nsAUTF8String or something of the sort)?
Comment 107•24 years ago
|
||
weren't people concered with adding lots of new types? (not that i don't think
we need them)
Comment 108•24 years ago
|
||
dbaron: that was actually the plan. Make a new nsAUTF8String line parallel to
nsAString and nsACString.
Comment 109•24 years ago
|
||
We should also bump the version number as well ->
http://lxr.mozilla.org/seamonkey/source/xpcom/typelib/xpt/public/xpt_struct.h#118
I think we should bump the major, the interface manager will then refuse to
process the file. This will keep a new component from causing problems with an
older version of Netscape. Also read the comment below that and bump the
incompatible value as well.
Comment 110•24 years ago
|
||
dbradley: Brendan, Nisheeth, and I convinced ourselves that we could get away
with only incrementing the minor version number. See the discusion in the first
comment of bug 65762. We are flirting with tripping up much older browsers by
doing this - especially if we add more than one type. Except for the issue with
*real* old browsers with the shorter lookup table in xpcconvert.cpp, we are free
(as far as xpconnect is concerned) to add more type tags. XPConnect will simply
fail to reflect methods with unsupported types. We *may* be playing a little
fast and loose here, but my hope is to avoid crossing the
XPT_MAJOR_INCOMPATIBLE_VERSION threshold until we find a compelling reason to
actually change the pattern of bytes in the structs that get deserialized from
the typelibs. At that point the problem would be that the old readers would
literally be unable to correctly read the files, and not just a problem of types
that clients of the reader would be unable to proxy. Certainly other classes of
change might come along that might make us cross that threshold. But, we are
thinking that we can get away with not doing it now.
Also, before we go hogwild adding new types, let's not forget that there is
still a very short list of available type tags. And that adding *any* types adds
work for supporting those types at various places in our xpconnect and proxy
code. Also, as Brendan pointed out, the last available tag should absolutely be
reserved for use as a future use when we (may) reach the point where some type
tags will require an additional byte. That tag value would allow us to flag that
case while still leaving the existing tags unchanged. (Of course, that *would*
be an incompatible version change - and would be hell on typeinfo clients too!)
As I said in that other bug, the tools (xpidl compiler *and* xpt_link) should
support creating the current xpt version 1.1 files on request and fail with an
error if non-1.1 types are used. I think this is necessary. Though, I suspect,
some plugin vendors (etc.) who might have use of this feature will likely fail
to discover it and will be blindly playing hit and miss depending on what types
they happen to attempt to pass thorough their interfaces.
Comment 112•24 years ago
|
||
nisheeth: what's the ETA on this one looking like? i'd like to squeeze the
nsIURI changes in by 0.9.9 if possible.
Assignee | ||
Comment 113•24 years ago
|
||
I expect to land this by Wednesday (Feb 13) of next week. Would that work for you?
Comment 114•24 years ago
|
||
sure, that'll be fine.
Assignee | ||
Comment 115•24 years ago
|
||
Just to clarify. I am only gonna get the XPIDL and XPConnect changes landed.
Jag or some string guy needs to do the new string classes needed for the world
to be happy.
No longer blocks: 124042
Assignee | ||
Comment 116•24 years ago
|
||
Also want to acknowledge David Bradley's contribution. Even though he is bogged
down with 0.9.9 bugs, he's sent me the XPIDL/XPConnect changes for adding
UTF8String. Hopefully, I'm just gonna have to replicate them (famous last
words! :)) for CString and then test, test, test.
Comment 117•24 years ago
|
||
here is my comment:
1. I think adding a new type UTF8String is better than changing of an existing type
so... we are in the right path
2. even we add UTF8String, I want to make sure all the caller be careful
when they use such data type.
Assignee | ||
Comment 118•24 years ago
|
||
Darin, Rick, Jag and I had a meeting today to firm up what needs to happen here.
Here's what we decided:
1) "AUTF8String and ACString" in XPIDL will map into nsACString objects in the
native world.
2) For AUTF8Strings, XPConnect will convert JS data from UCS2 to UTF8 and back.
3) For ACStrings, XPConnect will convert JS Data from UCS2 to ASCII and back.
3) We will not implement a new nsAUTF8String class. Till now, abstract string
classes have only specified the storage size of the string, not enforced the
encoding. nsACString and nsAString dictate single byte and double byte storage
respectively. We do not want to break this model and create a new abstract
string class that enforces encoding. We will only add the notion of a
UTF8String to XPIDL so that XPConnect can convert strings appropriately. In the
future, we can get the benefit of type checking, admittedly at run time not
compile time, by implemementing a concrete class, nsUTF8String, that enforces
the encoding behind the nsACString interface.
The fact that we need to do less work this way is also a big factor in making
this decision.
Comments?
Status: NEW → ASSIGNED
Comment 119•24 years ago
|
||
> 3) For ACStrings, XPConnect will convert JS Data from UCS2 to ASCII and back.
also, XPConnect will preserve the 8-th bit as it currently does for |string|.
Comment 120•24 years ago
|
||
So in other words, for ACStrings, XPConnect will convert JS data from UCS2 and
ISO-8859-1 and back.
Comment 121•24 years ago
|
||
Um, this has nothing to do with ISO-8859-1, UCS2->UTF8->UCS2 I believe the
convesion will be in XPConnect...
Comment 122•24 years ago
|
||
sure, that would be one interpretation... but i really just meant that ACString
should be sufficient for passing raw data provided both ends understand that
that's what they're getting ;-)
Comment 123•24 years ago
|
||
If you don't expose a new interface and implmentation for UTF8 now, you'll
probably never be able to role one out without breaking the world. You're giving
up on the whole idea of C++ static type checking to distinguish between ASCII v.
UTF8 expectations?
ACString is going to support embedded nulls?
The fact that xpconnect does not enforce the 0 in the high bit of any 8bit
string types does not mean that it *shouldn't* do that enforcement. What do
people *really* want the rules to be for the various string types?
Comment 124•24 years ago
|
||
jband: darin sought my comments on the plan in comment #118, and I lobbied him
and jag strongly in favor of static type checking. Jag was game. We didn't
want to rock the boat too much in the bug yet, but since you raised the same
issue, I thought I'd give a brief update. For details, jag should say more.
/be
Comment 125•24 years ago
|
||
jband: 'You're giving up on the whole idea of C++ static type checking to
distinguish between ASCII v. UTF8 expectations?'
yes :-( at the C++ level i've given up on trying to maintain static type
checking. Part of the reason is simply time pressure -- it requires more work
to create a C++ abstract UTF8 interface and concrete implementation...
the other part of the reason is that treating aCString as an encoding-free
representation of the data is (actually) following the same design pattern as
was chosen for aString.
Currently, we are using aString to represent *both* UCS-2 and UTF-16 data.
Unfortunately, at the IDL level we do *not* have a UTF16 string type... so
xpconnect is potentially broken :-( this is unfortunate, but does not alter the
fact that we have chosen aString to be representation of 16-bit storage units
(without explicitly enforcing an encoding)...
so, at the C++ interface level, we are expanding that notion by saying that
aCString represents 8-bit storage units without specifying a particular encoding.
I know what you're going to say -- this is just BS. and i agree with you :-)
but at this point, we are in so deep and so screwed up, that i think that this
is the best that we can hope for (at this point in time)...
at least, at the IDL level we are providing encoding information - allowing
navive <-> JS string conversions to work.
anything more will require MAJOR work on the codebase :-( and we just don't
have the time/resources to make it happen :-(
the one issue that we may want to address is whether we should *also* introduce
a UTF-16 string type at the IDL level. This would allow us to properly
encode/decode UTF16 strings at the XPConnect boundary (in the future)... Of
course the C++ interface would still translate to aString :-(
i think that everyone will agree that this is not an 'ideal' solution... i just
need to decide if this approach solves 'enough' of the problem.
-- rick
Comment 126•24 years ago
|
||
*i* just need to decide if this approach solves 'enough' of the problem.
should be changed to:
*everyone* here needs to decide if this pproach solves 'enough' of the problem.
-- rick
Comment 127•24 years ago
|
||
On ACString: yes, it supports embedded nulls and is really just a container of
8-bit data, though the methods on it heavily lean towards interpreting that data
as strings ;-)
I think we should do a C++ nsAUTF8String branch, where we could just copy the
nsACString branch and do appropriate renames, which shouldn't take more than a
few hours, at least for 0.9.9, and then decide what we need to do to give it
more value (other than being able to use the type system to discern between any
8-bit string type and UTF8 strings). The 1.0 cycle should give us plenty of time
for that :-)
Yes, that all sounds rather ugly, but it will allow us to move forward without
painting ourselves into a corner. I don't think we need to make all of our code
use this nsAUTF8String type immediately, and for darin it'll be just as much
work to convert to AUTF8String as it would be to convert to ACString.
Comment 128•24 years ago
|
||
Rick: I fear what happens when we change the meaning of single-byte within our
codebase, without compiler/xpconnect assistance in tracking down mismatched
expectations. Are we going to be repaying that effort savings with interest,
tracking down bugs that only occur on crazy international pages, or when people
have non-ASCII characters in passwords, etc.? Wasn't that the whole point
behind not just changing the |string| documentation to permit UTF-8?
Comment 129•24 years ago
|
||
This begs the question of why people shouldn't be able to have non-ASCII
characters in passwords.
Comment 130•24 years ago
|
||
I thought we permitted non-ASCII characters in passwords now. A patch I
recently reviewed certainly seemed to think so, since it was using |wstring| for
password values in interfaces. But I can see us wanting to use UTF8String for
those interfaces, especially in places like IMAP where the on-the-wire encoding
will need to be UTF8 anyway.
Comment 131•24 years ago
|
||
rick: brendan did a pretty good job of convincing me that we shouldn't miss this
albeit rushed opportunity to somewhat fix our string story... that we should try
if possible to work in a way to have static type checking. jag says he'll be
able to make it happen... the scariest thing is perhaps the fact that
nsAUTF8String will also need to be frozen for mozilla 1.0 :-/
Comment 132•24 years ago
|
||
In comment #130, it is mentioned:
> I thought we permitted non-ASCII characters in passwords now.
> A patch I recently reviewed certainly seemed to think so,
I would like to see us support non-ASCII characters in password and
if Mozilla is supporting it, that is good. Note that some protcols
such as HTTP has a recommended way of dealing with such password
input.
(Cf. http://www.cis.ohio-state.edu/cs/Services/rfc/rfc-text/rfc2617.txt)
Who said any nsA*String is frozen for 1.0? If they are going to be, we've got
some work to do.
Comment 134•24 years ago
|
||
dbaron: if nsIURI is to be frozen and if its attributes are of type
nsAUTF8String & nsACString, then it seems like we have no choice but to also
freeze these string interfaces. i thought this situation was well understood...
afterall, isn't nsAString freezing? as i understood it, nsAString is used on
some interfaces that are frozen or are going to be frozen.
If nsAString is going to be frozen for 1.0, then there are some changes that
need to be made first. I was under the impression that it was considered
acceptable not to freeze it.
Reporter | ||
Comment 136•24 years ago
|
||
Rick mentioned both UCS2 and UTF16. How is the code in the tree split between
these two?
dmose (re: comment 136): See bug 118000, particularly bug 118000 comment 19
through bug 118000 comment 27.
Comment 138•24 years ago
|
||
hey dbaron,
unfortunately since we are freezing the DOM interfaces we need to freeze
nsAString as well... since it is used extensively in these interfaces...
similarly, since nsIURI is freezing, we'll need to freeze nsAUTFString too.
Currently, no public interfaces expose the nsACString interface so this one is
still up in the air ;-) but realistically, if AString and AUTF8String freeze,
then ACString may end up being frozen too...
-- rick
All I've heard about freezing the DOM interfaces is bug 110795 comment 11.
Who's willing to make the changes needed to nsAString needed before we freeze
it? (At a minimum we need the changes that are needed to implement
nsStackString (the nsAutoString in the new world), PromiseSingleFragmentString
(15 minutes of work), API cleanup for Mid/Right/Left, and that's just what I
thought of off the top of my head without even looking at the header file.)
Anyone who thinks nsAString needs to be frozen should file a bug on the string
component requesting that it happen. (And please cc: me, and/or note the bug here.)
Comment 140•24 years ago
|
||
rick: nsIURI will be using nsACString as well for the ASCII compatible versions
of the spec and host attributes. this is not necessary i guess since it could
just use |string| for these, but i would prefer |ACString| since it would result
in something potentially more efficient.
Assignee | ||
Comment 141•24 years ago
|
||
The bad news is that I couldn't meet tonight's deadline to land my changes. The
good news is that there is enough done to unblock Darin.
Darin, the patch I'm about to attach should get you going with your nsIURI
related changes. With it in your tree, you should be able to define
'AUTF8String' and 'ACString' types in your IDL files. These types will map into
'nsACString' in the generated header file and XPConnect will perform appropriate
string conversions for them. Just as a sanity check run the following test to
make sure things work for you:
1) Run dist/bin/xpcshell and pass in the JS file
"js/src/xpconnect/tests/js/old/xpctest_echo.js" as an argument.
2) The js file does a bunch of tests and prints the results to the console.
Make sure that all lines that begin with "In2OutOneUTF8String" and
"EchoIn2OutOneUTF8String" end with "passed".
If the above works, you are all set to start using these new types in IDL and
revamping your nsIURI interfaces.
Here's whats I plan to do next *before* I land:
1) Make a XPIDL compiler flag that forces old style typelib generation and
reports errors when new types are used.
2) Add a similar flag to the XPT linker that forces it to report errors if it
comes across an XPT file that uses these new types.
I will post the patches for these two changes to this bug.
Here's what I will do *after* I land:
1) Change nsIVariant and XPCOM's proxy event mechanism to support these new
string types. I have filed bug 125465 and bug 125466 on myself to track these
two changes respectively. I'll post patches on those bugs.
General question: The DOMString type uses an XPCVoidableString class that
inherits from nsAutoString and adds the ability to store a JS null value. So, a
boolean comparison of a null DOMString with the null JS value yields TRUE. When
the DOMString is passed into the JS print() method, it prints out "null". Do we
need to replicate this behavior for CStrings and UTF8Strings?
jband, dbradley, shaver, I'd like you guys to r/sr this patch and the upcoming
compiler/linker patches. You can start on this one now or wait until all the
patches are on here. Up to you... :-)
An extra but currently unavoidable copy of UTF8Strings is made in the XPConnect
conversion routines. Jag is aware of this problem and will fix them when he
lands his string changes. CString conversions are fine, AFAIK.
Assignee | ||
Comment 142•24 years ago
|
||
Attachment #66086 -
Attachment is obsolete: true
Comment 143•24 years ago
|
||
Just a blind copy of nsACString and kids so you can see the ramifications of
this addition and the changes needed to ns{Reading,Writing}Iterator.
Assignee | ||
Comment 144•24 years ago
|
||
I've made and tested XPIDL and XPTLink changes that generate typelibs of a
version specified on the command line. I'm about to attach a unified patch.
Now, I really am ready for a code review. jband, dbradley, shaver, please jump
in! Thanks!
Assignee | ||
Comment 145•24 years ago
|
||
This patch subsumes attachment 69457 [details] [diff] [review] and adds the [-t output version number]
command line argument to the XPIDL compiler and the XPT Linker. When the
output version number is specified, the compiler checks that the input IDL file
only uses constructs that are supported in that version. The linker checks
that all typelib files it reads are of the version specified or below. Both
the compiler and the linker report errors and abort if these checks fail.
Attachment #69457 -
Attachment is obsolete: true
Comment 146•24 years ago
|
||
We'd hold mozilla1.0 for this, but I think that means holding for followup
bugfixes to the main patches that land in 0.9.9. The main patches need talkback
and ~300-400K downloads' worth of testing.
/be
Blocks: 122050
Keywords: mozilla0.9.9
Assignee | ||
Comment 147•24 years ago
|
||
I forgot to add this earlier. Thanks a lot to David Bradley, Jag, and John
Bandhauer for their help on this bug!
Comment 148•24 years ago
|
||
This is a variation on just the xpcom/typelib part of Nisheeth's patch.
It changes the following:
- don't use 3 different schemes to parse the version string.
- combine that parsing into shared code in libxpt
- don't declare vars in library code and implement them in exes
- don't use // comments in .c files
- don't have big comment blocks that go past column 80
- allow -t to specify *current* version
Nisheeth, I didn't really test this. If you like these changes then please try
them using the tests you have. I'll look at the xpconnect part of the changes
soon.
Comment 149•24 years ago
|
||
I assume the OS/2 changes are needed. I notice that your lines use tab chars and
the lines in that file use spaces. I have no idea if it matters, but you ought
to not use the tabs just in case.
In xpconnect we yse 'if(' not 'if ('. Please fix those.
Please wrap lines longer than 80 columns.
In XPCConvert::NativeData2JS the T_UTF8STRING case looks reasonable. The
T_CSTRING case makes me uncomfortable. I like avoiding the extra string copy,
but I don't like the implicit assumption that the JS string system uses the same
allocator as the xpcom global allocator. I'd rather see us do a one-time
registration of a finalizer using JS_AddExternalStringFinalizer and then use
JS_NewExternalString in this T_CSTRING string creation case.
For JSData2Native I don't think you gain anything by trying to factor out the
GetJSStringInternals because... You tried to copy the T_DOMSTRING stuff here,
but that stuff has wierd behavior specifically for the DOM. I think you should
follow the pattern used for the other string types here and just convert void
and null jsvals to nsnull and be done with them. Also, because in the T_CSTRING
case you can use JS_GetStringBytes to get the 8-bit char arrays that the JS
engine caches for us. This saves us a convertion when the same string is passed
through more than once.
Also, Is the plan to land all of this using nCSting for both types? Or is the
AUTF8String going in too?
Please address these issues (and whatever you want from my previous attachement)
and post us another patch when you have it. Thanks!
I'll look at doing the nsIVariant changes.
Comment 150•24 years ago
|
||
Looks like John got most of the big things. Some minor issues:
I'm wondering if we really need the create_old_typelib flag used outside of the
parameter checking for additional -t's. It eliminates some version checks, but
those don't look too costly. From what I can tell, all the other checks for
create_old_typelib can be eliminated safely and would make the code a little
simpler.
The outFileName declared in the main function in xpt_link.c should be const char *.
You commented out the Foo attribute in interface nsITestXPCFoo in xpctest.idl.
Just curious if that was accidental or not.
Comment 151•24 years ago
|
||
One more issue came up in a short offline discussion between nisheeth, jband and
myself.
Currently we have special code in XPConnect that deals with making string
conversions from JS to native DOMString's special case |undefined| for backwards
compatibility in the DOM. Currently passing |undefined| from JS as a DOMString
into native code will convert the |undefined| value into the string "undefined".
This is different from how we're treating |undefined| for string and wstring
(where |undefined| is treated same as |null|), and it will be different from how
we treat |undefined| in AUTF8String and ACString. My suggestion for avoiding
this special case for the probably most commonly used (non-DOM) XPIDL string
type is to separate DOMString from AString. Now that we're extending the
typelibs anyway taking this additional step for consistensy between non-DOM
strings would IMO be worth the trouble, and others seemd to agree with me.
DOMString is after all the special case from an XPConnect point of view, not the
other way around. The amount of changes needed to do this should be trivial,
since the treatment of DOMString would remain almost identical to the treatment
of AString, the only difference beeing how we treat conversion from JS
|undefined| to native strings.
IOW, let's add one more string type, AString (!= DOMString) :-)
Comment 152•24 years ago
|
||
WOOT!
Please do :-)
Assignee | ||
Comment 153•24 years ago
|
||
Sorry for the delay. I was only able to start looking at jband and dbradley's
comments today (Tuesday) afternoon.
Replies to jband's comments:
--
Your patch to xpcom/typelib looks good except that <math.h> was included in
xpt_link.c. Now that you've done away with the math based parsing, we no longer
need to include that file.
> Also, Is the plan to land all of this using nCSting for both types? Or is the
> AUTF8String going in too?
The plan was to land with using nsACString for both types. When Jag lands
nsUTF8String, he will change XPConnect to use it for the AUTF8String type.
But, the final decision on this depends on how far along Jag and Darin think
they are with the nsUTFString class and nsIURI interface changes respectively.
Darin, Jag, please add your 2 cents about what you think the landing strategy
for our changes should be. Thanks!
>For JSData2Native I don't think you gain anything by trying to factor out the
>GetJSStringInternals because... You tried to copy the T_DOMSTRING stuff here,
>but that stuff has wierd behavior specifically for the DOM. I think you should
>follow the pattern used for the other string types here and just convert void
>and null jsvals to nsnull and be done with them. Also, because in the T_CSTRING
>case you can use JS_GetStringBytes to get the 8-bit char arrays that the JS
>engine caches for us. This saves us a convertion when the same string is passed
>through more than once.
After consulting with Johnny and other DOM hackers, we had decided to make the
null and void JS string cases behave the same across DOMString, UTF8String, and
CString. That is why you see the DOMString code copied for the UTF8String and
CString cases. Bug 125481 (Make ACStrings and AUTF8Strings voidable) was filed
with the same strategy in mind.
On a followup discussion with Jband and Johnny, we have changed our minds on
this. We will now make UTF8Strings and CStrings behave like the char and wchar
strings. The patch I'm about to attach reflects this change. Bug 125481 now
becomes invalid.
I've used JS_GetStringBytes() as you suggested to reduce a conversion for the
CString case.
Replies to dbradley's comments:
--
>You commented out the Foo attribute in interface nsITestXPCFoo in xpctest.idl.
>Just curious if that was accidental or not.
Yup, this was accidental. I caught this when I did a clobber build, fixed it,
but didn't attach the new patch just for the one line change.
>I'm wondering if we really need the create_old_typelib flag used outside of the
>parameter checking for additional -t's. It eliminates some version checks, but
>those don't look too costly. From what I can tell, all the other checks for
>create_old_typelib can be eliminated safely and would make the code a little
>simpler.
I agree that create_old_typelib isn't necessary anymore. I've left it as a
local variable in xpidl.c and removed it from other places.
>The outFileName declared in the main function in xpt_link.c should be const char
>*.
Done!
Miscellaneous comments:
--
- I've line wrapped better, fixed if statement whitespace in XPConnect, and
converted tabs to spaces in the OS2 .asm file.
- I've used JS_ExternalStringFinalizer() for the case where the unicode string
buffer is created by XPCOM.
Assignee | ||
Comment 154•24 years ago
|
||
Attachment #69802 -
Attachment is obsolete: true
Attachment #70041 -
Attachment is obsolete: true
Assignee | ||
Comment 155•24 years ago
|
||
In comment 153, s/bug 125481/bug 125841.
Darin, how are your nsIURI changes coming along? The plan is still to our
changes together for 0.9.9 without waiting for nsUTF8String, right?
jband/dbradley/shaver, please review attachment 70474 [details] [diff] [review]. Thanks!
Comment 156•24 years ago
|
||
Yes, that's the plan as far as I know.
Assignee | ||
Comment 157•24 years ago
|
||
OK, I think this patch addresses everything so far except Johnny's request to
add a new string type. Please look through this patch to see if anything bad
jumps out at you. I will go through the code to add a new string type
tomorrow. You guys have one day to object! :-)
Once I'm done there will be four IDL string types (among others):
DOMString - handles void (undefined) and null JS string values in a special DOM
compatible way as described in comment 151 from Johnny.
AString, ACString, and AUTF8String - will treat null and void JS string values
as empty strings.
Attachment #70474 -
Attachment is obsolete: true
Comment 158•24 years ago
|
||
Um, I had the |undefined| conversion a bit wrong in my mind, non-DOM strings
throw an exception when asked to convert |undefined| into a string value,
whereas when converted into a DOMString it converts into the string "undefined".
I vote for parity between all the non-DOMString string types.
Comment 159•24 years ago
|
||
jst: You mean when converting from JS to native, right? For the normal
non-domstring types JSVAL_VOID *does* convert to null. It is only in the
(unlikely) case of a 'out' [w]string declared as a C++ reference that this is an
error - only because it is not legal for a C++ reference to hold null.
http://lxr.mozilla.org/seamonkey/source/js/src/xpconnect/src/xpcconvert.cpp#690
Comment 160•24 years ago
|
||
Um, my previous comment was based on what Nisheeth told me happens when we try
to convert |undefined| in JS into a native string or wstring, maybe I didn't
understand what he was testing...
Either way, my vote is for consistency between the non-DOMString string types...
Assignee | ||
Comment 161•24 years ago
|
||
In attachment 70479 [details] [diff] [review], xpt_link.c still includes <math.h> even though it doesn't
need to. Please ignore that nit because I've fixed it in my local build. The
rest of the patch looks good.
Hang in there for a final patch that will add an additional string type to IDL.
Assignee | ||
Comment 162•24 years ago
|
||
I should have realized this earlier but better late than never. Just to clarify:
I am not really adding a new string type (AString) in addition to DOMString,
AUTF8String, and ACString. AString was already an accepted string type in XPIDL
but it was mapped behind the scenes to do *exactly* what DOMStrings do. What I
am doing is changing things so that AStrings behave *mostly* like DOMStrings
except that they assign empty strings into themseleves when null and void js
vals are passed into them.
So, please ignore my earlier posts about a whole new string type.
Comment 163•24 years ago
|
||
Sure, you are not adding *another* new string C++ type for AString (abstract or
otherwise). But you *are* adding a new xpidl type.
Assignee | ||
Comment 164•24 years ago
|
||
Thanks for the clarification, jband.
Here's the final patch with support for the new AString XPIDL type.
jband, shaver, dbradley are requested to r/sr. Thanks!
Attachment #70479 -
Attachment is obsolete: true
Comment 165•24 years ago
|
||
Just a nit, if the dipper/idl type is "astring", shouldn't it be "acstring" and
"autf8string" too for consistency?
Comment 166•24 years ago
|
||
Looks good to me. We need someone with a good handle on strings to make sure
everything string wise is correct. Might want someone familar with the OS/2
assembler code take a quick look at the small addition.
Some minor cleanup issues:
Looks like you interjected some additional spaces here in the xpidl.c
- break;
-
+ break;
+
In XPCConvert::JSData2Native the jschar* chars = nsnull; initialization is
unnecessary in the T_UTF8STRING case.
Looks like you may have some tab/space issues in xpctest_echo.js. Braces and
code look a little misaligned.
Assignee | ||
Comment 167•24 years ago
|
||
Thanks for the comments, David. I'll make the changes you suggest in my build.
I don't know anyone familiar with OS/2 assembler code off the top of my head.
Anyone out there know anyone?
Jag, would you please give me a code review of the string related code in the
XPConnect conversion routines? Thanks!
Jband, would you please sr? Thanks!
Assignee | ||
Comment 168•24 years ago
|
||
Jband pointed out a performance problem in JSData2Native() which is fixed in
this patch. Dbradley's comments are also addressed.
David, would you please give this patch a review stamp also? You need only
look at the xpcconvert.cpp diffs. The rest of the files haven't changed since
the last patch.
Jag, please review the string code in the conversion routines in
xpcconvert.cpp.
Jband, please sr.
Thanks!
Attachment #70643 -
Attachment is obsolete: true
Assignee | ||
Comment 169•24 years ago
|
||
Rearranged code in the UTF8String and CString cases inside JSData2Native() in
xpcconvert.cpp per jband's request. Rest of the patch is the same as earlier.
David, please r=.
Jband, please sr=.
Thanks!
Attachment #71031 -
Attachment is obsolete: true
Comment 170•24 years ago
|
||
Comment on attachment 71051 [details] [diff] [review]
version 3.0 of patch (I've given up on calling it final.)
sr=jband This is the one. I agree that we should land it. Note that I will be
very annoyed if we *don't follow through with any of the following:
- utf8 string class.
- changes to this to use that class.
- changes to nsVariant.
This change by itself leaves things in a state that we should pass through
quickly. But, let's get it in and get on with things.
Attachment #71051 -
Flags: superreview+
Comment 171•24 years ago
|
||
Comment on attachment 71051 [details] [diff] [review]
version 3.0 of patch (I've given up on calling it final.)
I thought the non-DOM strings were going to do what |string| and |wstring| do,
that is to say, no support for null/void/undefined converting from or to.
>Index: js/src/xpconnect/src/xpcconvert.cpp
>===================================================================
>RCS file: /cvsroot/mozilla/js/src/xpconnect/src/xpcconvert.cpp,v
>retrieving revision 1.73
>diff -u -r1.73 xpcconvert.cpp
>--- js/src/xpconnect/src/xpcconvert.cpp 5 Feb 2002 06:45:02 -0000 1.73
>+++ js/src/xpconnect/src/xpcconvert.cpp 23 Feb 2002 06:04:30 -0000
>@@ -623,7 +740,7 @@
> {
> nsAWritableString* ws = *((nsAWritableString**)d);
>
>- if(JSVAL_IS_NULL(s))
>+ if(JSVAL_IS_NULL(s) || (!isDOMString && JSVAL_IS_VOID(s)))
> {
> ws->Truncate();
> ws->SetIsVoid(PR_TRUE);
>@@ -726,6 +843,116 @@
> return JS_TRUE;
> }
>
>+ case nsXPTType::T_UTF8STRING:
>+ {
>+ jschar* chars;
>+ PRUint32 length;
>+ JSString* str;
>+
>+ if(JSVAL_IS_NULL(s) || JSVAL_IS_VOID(s))
>+ {
>+ if(useAllocator)
>+ {
>+ nsACString *rs = new nsCString();
>+ if(!rs)
>+ return JS_FALSE;
>+
>+ rs->SetIsVoid(PR_TRUE);
nsCString's SetIsVoid does nothing, there's no point in setting this.
>+ *((nsACString**)d) = rs;
>+ }
>+ else
>+ {
>+ nsCString* rs = *((nsCString**)d);
>+ rs->Truncate();
>+ rs->SetIsVoid(PR_TRUE);
Same here and further below.
r=jag on the string changes, but check with jst on the use of DOMString-like
IsVoid.
Assignee | ||
Comment 172•24 years ago
|
||
Jag, we will have the notion of "voidness" for AStrings, CStrings and
UTF8Strings (see bug 125841). You are right that the SetISVoid() don't do
anything for CStrings and UTF8Strings currently. The plan is to implement
something similar to XPCVoidableString for CStrings and UTF8String that *do*
implement "voidness". The calls to SetIsVoid() are no-ops right now but will do
real work in the future.
Thanks for the review!
Comment 173•24 years ago
|
||
Comment on attachment 71051 [details] [diff] [review]
version 3.0 of patch (I've given up on calling it final.)
r=dbradley
Attachment #71051 -
Flags: review+
Comment 174•24 years ago
|
||
when i tried the very first version of this patch, i had to explicitly
%{C++
#include "nsAString.h"
%}
at the top of each .idl file that referenced AUTF8String. is this still required?
Comment 175•24 years ago
|
||
Darin: that has long been the case for AString idl users. It is debatable
whether that block should be added to the idl or if the #include'rs of the
generated .h should be expected to pre-#include the string header. Nevertheless,
there is no automatic generation of that #include line. There is a bug around
suggesting that this would be nice, but it requires xpidl hacking far beyond the
scope of this bug.
Comment 176•24 years ago
|
||
Let #includers of those headers #include "nsAString.h". Scott Meyers says so,
if you need a reason!
Comment 177•24 years ago
|
||
Comment on attachment 71051 [details] [diff] [review]
version 3.0 of patch (I've given up on calling it final.)
a=shaver for 0.9.9, but I won't cry if you take out those %{C++%} header warts.
Nice work, all.
Attachment #71051 -
Flags: approval+
Assignee | ||
Comment 178•24 years ago
|
||
The fix is in. Finally! :-)
I've updated the status summary to reflect reality.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago → 24 years ago
Resolution: --- → FIXED
Summary: 8-bit UTF8-capable string type for XPIDL → Add ACString, AUTF8String, and AString to XPIDL
Comment 179•24 years ago
|
||
jband,shaver,"scott-meyers": so adding #include "nsAString.h" to nsrootidl.idl
is bad because it causes everyone to include that file? how about at least adding
class nsAString;
class nsACString;
to nsrootidl.idl?
Summary: Add ACString, AUTF8String, and AString to XPIDL → 8-bit UTF8-capable string type for XPIDL
Assignee | ||
Updated•24 years ago
|
Summary: 8-bit UTF8-capable string type for XPIDL → Add ACString, AUTF8String, AString types to XPIDL
Assignee | ||
Comment 180•24 years ago
|
||
>sr=jband This is the one. I agree that we should land it. Note that I will be
>very annoyed if we *don't follow through with any of the following:
>- utf8 string class.
>- changes to this to use that class.
I've filed bug 127789 on jag to implement utf8strings. I've added a comment on
it to go fix up the extra string copies in xpcconvert.cpp.
>- changes to nsVariant.
I'd filed bug 125465 on myself last week to track this.
Reporter | ||
Comment 181•24 years ago
|
||
I like darin's most recent suggestion. The bug in question that jag mentioned
earlier is bug 78848.
Comment 182•24 years ago
|
||
I can't think of any good reason to not do the forward declarations that Darin
suggested.
Assignee | ||
Comment 183•24 years ago
|
||
OK, I'll make the nsrootidl.idl change and check it in with my changes to add
the new string types to nsVariant (bug 125465).
Assignee | ||
Comment 184•24 years ago
|
||
I'm attaching the patch suggested by Darin that will get checked in as part of
bug 125465's fix. Please holler if you see problems with it. Thanks!
Comment 185•24 years ago
|
||
Yeah, don't #include when you can forward-declare the class name.
/be
Assignee | ||
Comment 186•24 years ago
|
||
Brendan, please clarify. Exactly what should I not #include? Are you saying
that I should remove the #include lines that were there in nsrootidl.idl before
my changes?
Comment 187•24 years ago
|
||
nisheeth, sorry I was unclear. I was supporting your final patch, which adds
class nsACString; to nsrootidl.idl's C++ section, thereby relieving others from
having to #include "nsACString.h" or repeat the class forward decl. IOW, I'm
all happy here.
/be
Assignee | ||
Comment 188•23 years ago
|
||
Attachment 71980 [details] [diff] got checked into the 0.9.9 branch and the trunk just now as
part of the fix to bug 125465.
Jag, when you implement nsAUTF8String, please remember to add a forward
declaration to it in nsrootidl.idl. Thanks!
You need to log in
before you can comment on or make changes to this bug.
Description
•