Closed
Bug 534136
Opened 15 years ago
Closed 15 years ago
Switch atoms back to being UTF-16; have them hold string buffers (dromaeo)
Categories
(Core :: XPCOM, defect)
Tracking
()
RESOLVED
FIXED
mozilla1.9.3a3
Tracking | Status | |
---|---|---|
blocking2.0 | --- | - |
People
(Reporter: bzbarsky, Assigned: sicking)
References
Details
Attachments
(8 files, 8 obsolete files)
79.72 KB,
patch
|
benjamin
:
review+
|
Details | Diff | Splinter Review |
11.75 KB,
patch
|
peterv
:
review+
jst
:
superreview+
|
Details | Diff | Splinter Review |
13.89 KB,
patch
|
jst
:
review+
|
Details | Diff | Splinter Review |
3.47 KB,
patch
|
jst
:
review+
|
Details | Diff | Splinter Review |
9.82 KB,
patch
|
jst
:
review+
|
Details | Diff | Splinter Review |
77.38 KB,
patch
|
bzbarsky
:
review+
mrbkap
:
superreview+
|
Details | Diff | Splinter Review |
64.95 KB,
patch
|
bzbarsky
:
review+
hsivonen
:
review+
|
Details | Diff | Splinter Review |
1.66 KB,
patch
|
sicking
:
review+
|
Details | Diff | Splinter Review |
I suspect that a lot more consumers get them as UTF-16, and the string copies involved in the DOM's access to atoms (for attribute names, values, etc) show up in DOM benchmark profiles quite noticeably.
I guess I should measure to see how many AtomImpls we create with UTF-16 as opposed to UTF-8.
Assignee | ||
Comment 1•15 years ago
|
||
I ran some numbers a long time ago, which I think I put somewhere in bugzilla, on how often we have UTF16 data and want to get or compare to an atom, vs. how often we have UTF8 data. It was a bit tricky since many places, especially performance critical ones, are aware of the fact that atoms are UTF8 based and adjust the code to this fact. For example by converting UTF16->UTF8 before comparing to an atom.
Assignee | ||
Comment 2•15 years ago
|
||
Assignee | ||
Comment 3•15 years ago
|
||
See also bug 277479 comment 9
![]() |
Reporter | |
Comment 4•15 years ago
|
||
Ah. Sounds like we should do it (per bug 277479 comment 9). Do you want me to dup this bug there, or vice versa? Doing this, and fixing GetExistingAttrNameFromQName to not use ACString, would speed up DOM attribute perf a good bit...
Again, I think we should make the atoms hold shareable string buffers, not just PRUnichar* strings.
Summary: Switch atoms back to being UTF-16; have them hold string buffers → Switch atoms back to being UTF-16; have them hold string buffers (dromaeo)
Assignee | ||
Comment 5•15 years ago
|
||
Lets keep this bug open, the other covers enough topics as it is. Don't really care if we dupe the other one or just mark it dependent.
Totally agree on the holding sharable buffers part. Not sure how to make that happen for static atoms though? Can we create static "fake" buffers which we give an extra addref to ensure that they are freed.
![]() |
Reporter | |
Comment 6•15 years ago
|
||
OK. Let's mark dependent for now.
As far as how to make static atoms happy, that sounds like a good plan. Basically create an nsStringBuffer (not nsStringBuffer*) member and pre-generate not only the PRUnichars but the mRefCnt (1) and the mStorageSize (whatever length is plus 1 and then times 2). Might have to stick the nsStringBuffer in a union with a struct that has the relevant members and use an initializer on the struct, or something....
Jonas, want to just do this?
Blocks: 277479
blocking2.0: --- → ?
Assignee | ||
Comment 7•15 years ago
|
||
Yup, I can take this
Assignee: nobody → jonas
Status: NEW → ASSIGNED
![]() |
Reporter | |
Comment 8•15 years ago
|
||
![]() |
Reporter | |
Comment 9•15 years ago
|
||
Maybe we should have a UTF16** getter on atoms, instead of refcounting the string buffer in some of these cases?
Assignee | ||
Comment 10•15 years ago
|
||
Use nsStringBuffers rather than inline-allocated buffers. Still uses 8-bit chars so that the same nsIAtom API can be maintained. Mostly changes the macro-magic around at the places where we use static atoms.
This also allows getting rid of the static-atom-wrapped implementation of nsIAtom. So static atoms are now simply permanent atoms that use a buffer that's statically allocated on startup.
Attachment #417283 -
Attachment is obsolete: true
Attachment #417284 -
Attachment is obsolete: true
Attachment #419641 -
Flags: review?(bzbarsky)
Assignee | ||
Comment 11•15 years ago
|
||
Changes to 16-bit PRUnichars to store string data inside the atoms, rather than 8bit chars. This requires changing nsIAtom::getUTF8String to nsIAtom::getUTF16String, and thus all callers of getUTF8String.
Note that this doesn't change any callers more than necessary, later patches will change code to be more performant.
Attachment #419642 -
Flags: review?(mrbkap)
Assignee | ||
Comment 12•15 years ago
|
||
Change callers to take advantage of the fact that we now use 16bit data in atoms. And use the new nsDependentAtomString/nsAtomString/nsAtomCString classes where it makes sense.
I went through all callsites of nsIAtom::EqualsUTF8 and nsIAtom::ToString (not sure if I did nsIAtom::Equals). Should also go through NS_NewAtom/do_GetAtom, but haven't done so yet.
Attachment #419643 -
Flags: review?(bzbarsky)
Assignee | ||
Updated•15 years ago
|
Attachment #419643 -
Attachment is patch: true
Attachment #419643 -
Attachment mime type: application/octet-stream → text/plain
Assignee | ||
Comment 13•15 years ago
|
||
This is sort of ugly, but seems to work. Allows inlining the most perf critical methods on nsIAtom, but means that nsIAtom can't be implemented from script. Although that's already impossible since double-wrapping kills pointer identity.
Attachment #419645 -
Flags: superreview?(jst)
Attachment #419645 -
Flags: review?(benjamin)
Assignee | ||
Updated•15 years ago
|
Attachment #419641 -
Flags: review?(bzbarsky) → review?(benjamin)
Assignee | ||
Updated•15 years ago
|
Attachment #419642 -
Flags: review?(mrbkap) → review?(bzbarsky)
Assignee | ||
Comment 14•15 years ago
|
||
There's really no reason to to use the hashkey trickeries. We don't use "real" hashentries as keys ever anyway.
Attachment #419646 -
Flags: review?(jst)
Assignee | ||
Comment 15•15 years ago
|
||
The downside here is that we'll never be able to reclaim memory used by atoms (though we do reclaim the memory used by the string data inside the atom). So if a page creates 100000 different atom names, we'll forever use 1200000 bytes of memory for atom pools.
Attachment #419647 -
Flags: review?(bzbarsky)
Assignee | ||
Comment 16•15 years ago
|
||
Alternative solution to "Part6" patch. Instead of using pools for all atoms, just use a pool for the first X atoms. This applies on top of the Part6 patch.
I do think we should use some type of pool since we currently do on trunk for static atoms, which probably saves us some cycles during startup.
Attachment #419649 -
Flags: review?(bzbarsky)
Assignee | ||
Updated•15 years ago
|
Attachment #419649 -
Attachment description: Part6b: Use an PLArenaPool to store atoms → Part6b: Use an pool to store first X atoms
Assignee | ||
Comment 17•15 years ago
|
||
Note for posterity: Why does nsSVGDataParser use 8-bit strings rather than 16-bit?
![]() |
Reporter | |
Comment 18•15 years ago
|
||
How many static atom allocations do we do from that arena on startup right now? I would rather avoid the arena complexity, for the reason described in comment 15 and because picking a good cap is hard, unless there's actually measurable Ts impact...
Assignee | ||
Comment 19•15 years ago
|
||
We allocate 2158 "static" atoms during startup. This does not include duplicates, i.e. if the same string appears in multiple atom tables passed to NS_RegisterStaticAtoms, that atom is only counted once.
During a run when I opened a bunch of different websites, including youtube, gmail, and lots of bbc news tabs, I ended up with a maximum of 10266 atoms existing at once. Including static atoms.
Simply opening and closing the browser, only displaying the default minefield startup page, I got a maximum of 3604 atoms.
We never seem to create any permanent atoms (other than through calls to NS_RegisterStaticAtoms). Nor do we ever promote a non-permanent atom to be permanent, i.e. NS_RegisterStaticAtoms never results in any atoms being promoted. So we should probably remove the ability to create permanent atoms other than through NS_RegisterStaticAtoms.
Dunno how long calling malloc 2158 times takes, but that is how much Ts would be hit by not using an arena.
![]() |
Reporter | |
Comment 20•15 years ago
|
||
> Dunno how long calling malloc 2158 times takes
I just checked on mac (where we complain all the time about the slow allocator) and it looks like a test program that does |malloc(20)| 1e7 times (allocating a total of 200MB or so) and stores all the results in an array takes about 0.7 seconds to run on my machine. So 2158 mallocs would take about 0.2ms.
Things would get a bit worse if we had to free() as well, since it looks like free() takes 2-3x as long as malloc() on Mac.
My gut feeling is that we should try without the arena, and if that shows up in Ts add a capped arena sized to somewhat above the static atom count.
Assignee | ||
Updated•15 years ago
|
Attachment #419647 -
Flags: review?(bzbarsky)
Assignee | ||
Updated•15 years ago
|
Attachment #419649 -
Flags: review?(bzbarsky)
Assignee | ||
Comment 21•15 years ago
|
||
Sounds good
Comment 22•15 years ago
|
||
Comment on attachment 419645 [details] [diff] [review]
Part4: Inline accessors into nsIAtom
Ugh. The reason this is an interface at all is that there is still some script code which needs to manipulate atoms? Do we know what that code is, and whether we can remove it short-term?
Absent that, we should really have some xpconnect magic for nsIAtom so that script can't pretend to implement it.
![]() |
Reporter | |
Comment 23•15 years ago
|
||
nsITreeView hands back arrays of atoms (getRow/Cell/ColumnProperties). There are JS implementations of nsITreeView. Then again, they just need to be able to create atoms from a given string but not DO anything with them.
There are also editor APIs that might be using atoms. On the other hand, XPConnect will convert strings to atoms for in params, so those should be usable from JS without nsIAtom as-is, right?
Full list of idl that uses atoms at http://mxr.mozilla.org/mozilla-central/search?string=nsIAtom&find=idl&findi=&filter=^[^\0]*%24&hitlimit=&tree=mozilla-central
Assignee | ||
Comment 24•15 years ago
|
||
For what it's worth, script has never been able to properly implement nsIAtom, since that would produce objects with the wrong pointer identity.
I think medium to long term we could maybe get rid of all places in script that use nsIAtom, but we would have to look out for performance.
OTOH I don't think this hack has particularly bad consequences. The uglyness is mostly confined to the nsIAtom.idl and nsAtomTable.cpp files.
Comment 25•15 years ago
|
||
Sure, it couldn't *properly* implement it, but if it wanted to crash or cause other problems it could try to improperly implement. If we're absolutely certain that content script can never implement nsITreeView, maybe that's ok, but I'd really like a followup bug for that in particular.
![]() |
Reporter | |
Comment 26•15 years ago
|
||
Untrusted script is not allowed to implement nsITreeView.
Comment 27•15 years ago
|
||
FWIW, I was pretty seriously considering converting the CSS parser to work entirely in UTF-8, in order to avoid the conversion overhead (nearly all CSS is pure ASCII and is served over the network as such). It has to look up lots of atoms, though, and I'm not sure where the tradeoff would lie if atom lookup has to convert to UTF-16.
I thought the long-term goal was to stop using UTF-16 altogether.
Assignee | ||
Comment 28•15 years ago
|
||
How are you planning on dealing with the fact that all of our charset converters output UTF16? Are you going to have a fast-path UTF8 parser, and a second UTF16 parser? Or were you planning on taking the output from the charset converters and convert to UTF8?
While I agree that it would be better if our entire codebase was UTF8 based, I think converting code here and there does not make sense. The atom experiment was a good example of that where using UTF8 in just a part of the code has lead to slower performance and more complex code.
Comment 29•15 years ago
|
||
(In reply to comment #28)
> How are you planning on dealing with the fact that all of our charset
> converters output UTF16? Are you going to have a fast-path UTF8 parser, and a
> second UTF16 parser? Or were you planning on taking the output from the charset
> converters and convert to UTF8?
I hadn't got that far yet, but I probably would have added a second set of charset converters that target UTF8; even more important from my perspective is to not have to do conversion at all for input labeled as UTF8 (note that this is the default for CSS).
> While I agree that it would be better if our entire codebase was UTF8 based, I
> think converting code here and there does not make sense. The atom experiment
> was a good example of that where using UTF8 in just a part of the code has lead
> to slower performance and more complex code.
I don't think we will ever make the switch if we have to do it all at once.
Assignee | ||
Comment 30•15 years ago
|
||
The other thing I should note is that atom-lookup using a UTF8 string is not much slower than atom-lookup using a UTF16 string, if the atom already exists. So unless you're creating a lot of atoms based on author provided random strings, you shouldn't see much of a difference.
Assignee | ||
Comment 31•15 years ago
|
||
> I don't think we will ever make the switch if we have to do it all at once.
If you want to work towards switching our codebase to UTF8 in a progressive manner I'm all for that. But I think we need a better plan than just converting code here and there and hope that it doesn't regress performance.
Comment 32•15 years ago
|
||
DOM and JS standards married UTF-16 (via UCS-2) in the '90s. This is hard to change. The trade-off between transcoding up front to UTF-16 vs. mapping the UCS-2 semantics onto UTF-8 strings under the JS and DOM hoods is not clear to me, but if sicking's right that piecewise UTF-8 regressed perf recently, that seems meaningful.
/be
Comment 33•15 years ago
|
||
(In reply to comment #30)
> The other thing I should note is that atom-lookup using a UTF8 string is not
> much slower than atom-lookup using a UTF16 string, if the atom already exists.
> So unless you're creating a lot of atoms based on author provided random
> strings, you shouldn't see much of a difference.
Good to know -- mostly these are atoms for element names and the like, so they should exist.
(In reply to comment #31)
> > I don't think we will ever make the switch if we have to do it all at once.
>
> If you want to work towards switching our codebase to UTF8 in a progressive
> manner I'm all for that. But I think we need a better plan than just converting
> code here and there and hope that it doesn't regress performance.
It seems to me that if switching to UTF8 is on the table *at all*, there should be no conversions in the other direction. Granted that nsIAtom lookups from JS/DOM currently hurt because of conversion, my choice would therefore have been to tackle switching JS/DOM over to UTF8 rather than to go backward.
(In reply to comment #32)
> DOM and JS standards married UTF-16 (via UCS-2) in the '90s. This is hard to
> change. The trade-off between transcoding up front to UTF-16 vs. mapping the
> UCS-2 semantics onto UTF-8 strings under the JS and DOM hoods is not clear to
> me, but if sicking's right that piecewise UTF-8 regressed perf recently, that
> seems meaningful.
I'm prepared to bet that the mapping is the way to go, because the only thing that loses is array indexing on string objects when those strings actually contain non-ASCII characters, ne?
Assignee | ||
Comment 34•15 years ago
|
||
(In reply to comment #33)
> > > I don't think we will ever make the switch if we have to do it all at once.
> >
> > If you want to work towards switching our codebase to UTF8 in a progressive
> > manner I'm all for that. But I think we need a better plan than just converting
> > code here and there and hope that it doesn't regress performance.
>
> It seems to me that if switching to UTF8 is on the table *at all*, there should
> be no conversions in the other direction. Granted that nsIAtom lookups from
> JS/DOM currently hurt because of conversion, my choice would therefore have
> been to tackle switching JS/DOM over to UTF8 rather than to go backward.
We need to be pragmatic. Switching to UTF8 isn't a goal in and of itself. It's a means towards getting better performance and lower footprint. Having atoms in UTF8 does not currently fulfill that goal and so we should not do it at this time.
Comment 35•15 years ago
|
||
(In reply to comment #33)
> I'm prepared to bet that the mapping is the way to go, because the only thing
> that loses is array indexing on string objects when those strings actually
> contain non-ASCII characters, ne?
Not just indexing, regular expression matching, and don't discount other parts of the world where 7 bits ain't enough. The asymptotic cost could be bad compared to just sucking up more memory and using "UTF-16".
/be
Assignee | ||
Updated•15 years ago
|
Attachment #419647 -
Attachment is obsolete: true
Assignee | ||
Updated•15 years ago
|
Attachment #419649 -
Attachment is obsolete: true
Assignee | ||
Comment 36•15 years ago
|
||
So I tested what happens if invalid UTF8 strings are passed to NS_NewAtom(char*). For strings that contain "recoverable" errors, such as overlong surrogates, the fixes in bug 541245 will handle that.
However for errors like broken surrogates, where the UTF8->UTF16 conversion code simply returns an empty UTF16 string, we need special code.
So what this patch does is if it detects that a string is "too" broken while hashing the UTF8 string, we do an upfront conversion to UTF16 (which means simply taking the empty string) and use that to look up an atom.
Attachment #422899 -
Flags: review?(jst)
Assignee | ||
Comment 37•15 years ago
|
||
This contains tests for all patches in this bug. Applies on top of tests in bug 541245.
Attachment #422901 -
Flags: review?(jst)
Assignee | ||
Updated•15 years ago
|
Attachment #422901 -
Attachment description: Atom tests → Part7: Atom tests
![]() |
Reporter | |
Comment 38•15 years ago
|
||
Comment on attachment 419642 [details] [diff] [review]
Part2: Use 16bit chars in buffers
>+++ b/xpcom/tests/TestAtoms.cpp
>@@ -76,19 +76,19 @@ int main(int argc, char** argv)
> ids[i] = NS_NewAtom(strings[i]);
> }
> PRUnichar qqs[1]; qqs[0] = 0;
> nsIAtom* qq = NS_NewAtom(qqs);
> PRTime end1 = PR_Now();
>
> // Now make sure we can find all the idents we just made
> for (i = 0; i < count; i++) {
>- const char *utf8String;
>- ids[i]->GetUTF8String(&utf8String);
>- nsIAtom* id = NS_NewAtom(utf8String);
>+ const PRUnichar *utf16String;
>+ ids[i]->GetUTF16String(&utf16String);
>+ nsIAtom* id = NS_NewAtom(utf16String);
> if (id != ids[i]) {
> id->ToString(s1);
> ids[i]->ToString(s2);
> printf("find failed: id='%s' ids[%d]='%s'\n",
> NS_LossyConvertUTF16toASCII(s1).get(), i, NS_LossyConvertUTF16toASCII(s2).get());
> return -1;
> }
> NS_RELEASE(id);
>diff --git a/xpcom/tests/TestPermanentAtoms.cpp b/xpcom/tests/TestPermanentAtoms.cpp
>--- a/xpcom/tests/TestPermanentAtoms.cpp
>+++ b/xpcom/tests/TestPermanentAtoms.cpp
>@@ -46,40 +46,38 @@
>
> static void Assert(PRBool aCondition, const char* aStatement)
> {
> printf("%s: %s\n", aCondition?"PASS":"FAIL", aStatement);
> }
>
> static void AssertString(nsIAtom *aAtom, const nsACString& aString)
> {
>- const char *str;
>- static_cast<AtomImpl*>(aAtom)->GetUTF8String(&str);
>- Assert(nsDependentCString(str) == aString, "string is correct");
>+ Assert(aAtom->EqualsUTF8(aString), "string is correct");
> }
>
> static void AssertPermanence(nsIAtom *aAtom, PRBool aPermanence)
> {
> Assert(static_cast<AtomImpl*>(aAtom)->IsPermanent() == aPermanence,
> aPermanence ? "atom is permanent" : "atom is not permanent");
> }
>
> int main()
> {
> nsCOMPtr<nsIAtom> foo = do_GetAtom("foo");
> AssertString(foo, NS_LITERAL_CSTRING("foo"));
> AssertPermanence(foo, PR_FALSE);
>
>- nsCOMPtr<nsIAtom> foop = do_GetPermanentAtom("foo");
>+ nsCOMPtr<nsIAtom> foop = NS_NewPermanentAtom(NS_LITERAL_STRING("foo"));
> AssertString(foop, NS_LITERAL_CSTRING("foo"));
> AssertPermanence(foop, PR_TRUE);
>
> Assert(foo == foop, "atoms are equal");
>
>- nsCOMPtr<nsIAtom> barp = do_GetPermanentAtom("bar");
>+ nsCOMPtr<nsIAtom> barp = NS_NewPermanentAtom(NS_LITERAL_STRING("bar"));
> AssertString(barp, NS_LITERAL_CSTRING("bar"));
> AssertPermanence(barp, PR_TRUE);
>
> nsCOMPtr<nsIAtom> bar = do_GetAtom("bar");
> AssertString(bar, NS_LITERAL_CSTRING("bar"));
> AssertPermanence(bar, PR_TRUE);
>
> Assert(bar == barp, "atoms are equal");
>
Attachment #419642 -
Flags: review?(bzbarsky) → review-
![]() |
Reporter | |
Comment 39•15 years ago
|
||
Er... those review comments were _supposed_ to be:
>+++ b/content/base/src/nsAttrValue.cpp
>@@ -803,36 +804,34 @@ nsAttrValue::Contains(nsIAtom* aValue, n
>+ return nsLayoutUtils::CheatingEqualsIgnoreCase(val1, val2);
I would prefer that:
1) We call the function EqualsIgnoreASCIICase or some such.
2) We document at all callsites why we're using it instead of the usual
unicode-aware case-insensitive compare.
Is there a reason this is in nsLayoutUtils and not nsContentUtils?
>+++ b/content/smil/nsSMILCompositor.cpp
>+ // XXX This used to hash on the string value rather than the atom.
>+ // Check that there wasn't a reason for that.
Check with dholbert, but I would guess not.
That said, is just adding two pointers going to give good keys? Would it make sense to at least shift one of them or something?
>+++ b/layout/base/nsLayoutUtils.cpp
>+nsLayoutUtils::CheatingEqualsIgnoreCase(const PRUnichar* aStr1, const PRUnichar* aStr2)
>+ // the same ascii char, but just differeing in case
"differing"
>+ // They are the same. Check if we're at the end of the strings.
This assumes that the two strings don't contain embedded nulls, right? In particular, if they do, this could return true for strings that aren't actually equal.
This somewhat worries me, but the old behavior was no different at the callsites where this happens... Add some warnings in the comments, though? If caller doesn't control at least one of the strings, this comparison can lie.
And honestly, I think at least some callers don't control these strings sufficiently. Maybe a followup patch to just do the comparison on nsStrings?
>+++ b/layout/style/nsComputedDOMStyle.cpp
>@@ -4278,19 +4278,18 @@ nsComputedDOMStyle::GetTransitionPropert
>+ property->SetString(nsAtomCString(transition->GetUnknownProperty()));
This can use nsAtomString. Save two conversions.
>+++ b/parser/htmlparser/src/nsHTMLTags.cpp
>+#ifdef DEBUG && NS_STATIC_ATOM_USE_WIDE_STRINGS
Shouldn't that be:
#if defined(DEBUG) && defined(NS_STATIC_ATOM_USE_WIDE_STRINGS)
?
>+++ b/xpcom/ds/nsAtomTable.cpp
>@@ -552,16 +559,32 @@ NS_RegisterStaticAtoms(const nsStaticAto
> *aAtoms[i].mAtom = atom;
>
> if (!gStaticAtomTableSealed) {
> nsAutoString key;
> atom->ToString(key);
> gStaticAtomTable->Put(key, atom);
You can use nsAtomString or nsDependentAtomString there to avoid the extra copy, right?
>+++ b/xpcom/ds/nsCRT.cpp
>+PRUint32 nsCRT::HashCodeAsUTF16(const char* start, PRUint32 length)
>+ else if ((c & 0xF0) == 0xE0) { // 4-byte
>+ ucs4 = (PRUint32(c) << 18) & 0x001F0000L;
Those both look wrong to me. Where did they come from? I would expect something like:
else if ((c & 0xF8) == 0xF0) { // 4-byte
ucs4 = (PRUint32(c) << 18) & 0x001C0000L;
Where did your numbers come from?
>+ else if ((c & 0xF0) == 0xE0) { // 5-byte
(c & 0xFC) == 0xF8
right?
>+ else if ((c & 0xF0) == 0xE0) { // 6-byte
(c & 0xFE) == 0xFC
>+ while (l--) {
I would almost prefer we named this |len| instead of l. |l*6| looked too much like |1*6| over here.
>+ char c = *s++;
This is shadowing the |char c| in the outer loop. Call it something else?
I have to ask... is there a reason we're reinventing UTF8CharEnumerator here? Especially in a buggy way?
If we really do need this separate copy of the code, please get smontagu or someone else intl-competent to look at it? And please write some tests for it!
>+ if (ucs4 < minUcs4) {
>+ ADD_TO_HASHVAL(h, UCS2_REPLACEMENT_CHAR);
Comment that this is an overlong sequence?
>+ else if ( ucs4 <= 0xD7FF ) {
>+ ADD_TO_HASHVAL(h, ucs4);
>+ }
>+ else if ( ucs4 <= 0xDFFF || ucs4 == 0xFFFE || ucs4 == 0xFFFF ) {
>+ // Surrogates and prohibited characters
>+ ADD_TO_HASHVAL(h, UCS2_REPLACEMENT_CHAR);
>+ }
This is fine, but I assume there was a reason to not just test IS_SURROGATE() instead of the 0xD7FF and 0xDFFF business? This looks like it was more or less copy/pasted from ConvertUTF8toUTF16, right?
>+ // Computes a hashcode for a length number of UTF8
> // characters. Returns the same hash code as the HashCode method
>- // taking a |char*| would if the string were converted to UTF8. This
>- // hash function treats invalid UTF16 data as 0xFFFD (0xEFBFBD in
>+ // taking a |char*| would if the string were converted to UTF16. This
|PRUnichar*|, right?
>+class nsDependentAtomString : public nsDependentString
>+{
>+public:
>+ nsDependentAtomString(nsIAtom* aAtom)
>+ : nsDependentString(aAtom->GetUTF16String(), aAtom->GetLength())
Do we know whether this is cheaper than nsAtomString in practice? I guess it doesn't have to refcount, but might have to copy depending on what the consumer uses it for... OK for now.
Assignee | ||
Comment 40•15 years ago
|
||
(In reply to comment #39)
> Is there a reason this is in nsLayoutUtils and not nsContentUtils?
IIRC it's mostly (or only) used by layout related code. I don't care at all though, I'm totally fine with putting it in nsContentUtils if you prefer.
> That said, is just adding two pointers going to give good keys? Would it make
> sense to at least shift one of them or something?
I suspect that shifting away the two/three (or so) lowest bits on at least one pointer would be a good idea.
> >+++ b/layout/base/nsLayoutUtils.cpp
> >+ // They are the same. Check if we're at the end of the strings.
>
> This assumes that the two strings don't contain embedded nulls, right? In
> particular, if they do, this could return true for strings that aren't actually
> equal.
>
> This somewhat worries me, but the old behavior was no different at the
> callsites where this happens... Add some warnings in the comments, though? If
> caller doesn't control at least one of the strings, this comparison can lie.
>
> And honestly, I think at least some callers don't control these strings
> sufficiently. Maybe a followup patch to just do the comparison on nsStrings?
I think you're right. I'm fine with fixing this in this patch even?
> >+++ b/parser/htmlparser/src/nsHTMLTags.cpp
> >+#ifdef DEBUG && NS_STATIC_ATOM_USE_WIDE_STRINGS
>
> Shouldn't that be:
>
> #if defined(DEBUG) && defined(NS_STATIC_ATOM_USE_WIDE_STRINGS)
Yes! Thanks.
> >+++ b/xpcom/ds/nsAtomTable.cpp
> >@@ -552,16 +559,32 @@ NS_RegisterStaticAtoms(const nsStaticAto
> > *aAtoms[i].mAtom = atom;
> >
> > if (!gStaticAtomTableSealed) {
> > nsAutoString key;
> > atom->ToString(key);
> > gStaticAtomTable->Put(key, atom);
>
> You can use nsAtomString or nsDependentAtomString there to avoid the extra
> copy, right?
This doesn't actually copy, but it does addref/release the buffer. Will use an nsDependentAtomString.
> >+++ b/xpcom/ds/nsCRT.cpp
> >+PRUint32 nsCRT::HashCodeAsUTF16(const char* start, PRUint32 length)
> >+ else if ((c & 0xF0) == 0xE0) { // 4-byte
> >+ ucs4 = (PRUint32(c) << 18) & 0x001F0000L;
>
> Those both look wrong to me. Where did they come from? I would expect
> something like:
>
> else if ((c & 0xF8) == 0xF0) { // 4-byte
> ucs4 = (PRUint32(c) << 18) & 0x001C0000L;
>
> Where did your numbers come from?
Indeed. There's some copy'n'paste errors in this. Found and fixed when I wrote tests!
> I have to ask... is there a reason we're reinventing UTF8CharEnumerator here?
> Especially in a buggy way?
I tried #including nsUTF8Utils.h, but I think I ran into macro magic errors. Though I can't see now what those errors would be. I'll have to re-try.
> If we really do need this separate copy of the code, please get smontagu or
> someone else intl-competent to look at it? And please write some tests for
> it!
Tests already attached.
> >+class nsDependentAtomString : public nsDependentString
> >+{
> >+public:
> >+ nsDependentAtomString(nsIAtom* aAtom)
> >+ : nsDependentString(aAtom->GetUTF16String(), aAtom->GetLength())
>
> Do we know whether this is cheaper than nsAtomString in practice? I guess it
> doesn't have to refcount, but might have to copy depending on what the
> consumer uses it for... OK for now.
Yeah, this depends entirely on what the consumer does. I guessed that in most cases the string isn't further assigned elsewhere, which is when having a buffer available would be faster.
Ideally nsString would support pointing to an "unowned" buffer. Should be pretty easy to implement. I'll investigate in a follow-up bug.
Anything I didn't reply to, just assume a "Agreed, Will do" comment :)
Comment 41•15 years ago
|
||
The low 3 bits on most of our pointers are 0 all the time -- probably not a useful contribution to the hash, so shifting them out as a first step seems virtuous.
Assignee | ||
Comment 42•15 years ago
|
||
Assignee | ||
Comment 43•15 years ago
|
||
Above attachment is an interdiff to address bzs review comments.
Assignee | ||
Comment 44•15 years ago
|
||
The resulting Part 2
Attachment #419642 -
Attachment is obsolete: true
Attachment #423584 -
Flags: review?(bzbarsky)
![]() |
Reporter | |
Comment 45•15 years ago
|
||
Are you sure the nsAttrValue::Contains stuff only happens in quirks mode? Note that doing ASCII-only folding may be required for correctness in some cases (e.g. the turkish i/I cases).
Assignee | ||
Comment 46•15 years ago
|
||
(In reply to comment #45)
> Are you sure the nsAttrValue::Contains stuff only happens in quirks mode? Note
> that doing ASCII-only folding may be required for correctness in some cases
> (e.g. the turkish i/I cases).
No, I don't know that. Do you have a suggestion for what comment to put there instead? I'm fine with simply saying "I don't know why we only do ascii-only folding, but that's what the old code did. Perf maybe?" sine that is case :)
![]() |
Reporter | |
Comment 47•15 years ago
|
||
> No, I don't know that.
Can you check? Do we have any callsites other than getElementsByClassName?
Assignee | ||
Comment 48•15 years ago
|
||
It appears that the only callers we have are getElementsByClassName and DOMTokenList
http://scotland.proximity.on.ca/dxr/search.cgi?tree=mozilla-central&callers=nsAttrValue%3A%3AContains
(dxr is the roxxors)
So it is in fact true that this is only used in quirks mode.
![]() |
Reporter | |
Comment 49•15 years ago
|
||
Comment on attachment 423584 [details] [diff] [review]
Part2: Use 16bit chars in buffers (updated)
Lovely.
Attachment #423584 -
Flags: review?(bzbarsky) → review+
Assignee | ||
Comment 50•15 years ago
|
||
Comment on attachment 423584 [details] [diff] [review]
Part2: Use 16bit chars in buffers (updated)
Given the interface change, this should get an sr too.
FWIW I forgot to change the interface IID. Fixed locally.
Attachment #423584 -
Flags: superreview?(mrbkap)
Comment 51•15 years ago
|
||
Comment on attachment 419645 [details] [diff] [review]
Part4: Inline accessors into nsIAtom
- In AtomImpl::AtomImpl(const nsAString& aString):
+ NS_ASSERTION(buf && buf->StorageSize() >= (mLength+1) * 2, "enough storage");
Maybe replace the magic '2' there with sizeof(PRUnichar)...
And given that this patch changes nsIAtom, the uuid needs to be bumped, which I know you've done locally, so just a reminder to make sure it makes it in :)
And also, this patch makes it so that bad things happen if someone is passing a random object from JS as an argument to a method that takes an nsIAtom argument. Granted, that should only ever be doable from chrome, but I'd still argue we should throw in that case. If you file the bug on that, I'll help with the XPConnect details to make that happen.
sr=jst
Attachment #419645 -
Flags: superreview?(jst) → superreview+
Assignee | ||
Comment 52•15 years ago
|
||
Filed bug 543613 on the XPConnect feature.
Updated•15 years ago
|
Attachment #419646 -
Flags: review?(jst) → review+
Assignee | ||
Updated•15 years ago
|
Attachment #423583 -
Attachment is obsolete: true
Updated•15 years ago
|
Attachment #422899 -
Flags: review?(jst) → review+
Comment 53•15 years ago
|
||
Comment on attachment 422901 [details] [diff] [review]
Part7: Atom tests
Looks good.
Attachment #422901 -
Flags: review?(jst) → review+
![]() |
Reporter | |
Comment 54•15 years ago
|
||
Comment on attachment 419643 [details] [diff] [review]
Part3: Take advantage of new API
In nsHTMLContentSink, don't we need to ASCII-lowercase the attr name instead of Unicode-lowercasing? It should be possible to write a testcase for this (using turkish Is and such). r- until this is resolved.
Check with hsivonen about the html5 bits?
Attachment #419643 -
Flags: review?(bzbarsky) → review-
(In reply to comment #54)
> (From update of attachment 419643 [details] [diff] [review])
> In nsHTMLContentSink, don't we need to ASCII-lowercase the attr name instead of
> Unicode-lowercasing? It should be possible to write a testcase for this (using
> turkish Is and such). r- until this is resolved.
Unicode case conversion could be language-dependent or language-independent; I think our ToLowerCase should be language-independent. However, either of those is different from doing case transformation only within the ASCII range, which may well be what the old code did.
![]() |
Reporter | |
Comment 56•15 years ago
|
||
I believe ours is language-independent, but it language-independently lowercases İ to i, as I recall. So for example something like this:
<span DİR="rtl" dir="ltr">
will set the "dir" attribute to "rtl" with the patch in question applied, but should be setting it to "ltr". The old code only performed case conversion in the ASCII range, yes.
Assignee | ||
Comment 57•15 years ago
|
||
Assignee | ||
Comment 58•15 years ago
|
||
Same as before, but with review comment fixed. Interdiff is in attachment 428409 [details] [diff] [review]
Attachment #428410 -
Flags: review?(bzbarsky)
Assignee | ||
Comment 59•15 years ago
|
||
I should add that this now depends on bug 547850, where nsContentUtils::ASCIIToLower is added.
Depends on: 547850
![]() |
Reporter | |
Comment 60•15 years ago
|
||
Comment on attachment 428410 [details] [diff] [review]
Part3: Take advantage of new API
r=bzbarsky
Attachment #428410 -
Flags: review?(bzbarsky) → review+
Assignee | ||
Updated•15 years ago
|
Attachment #428409 -
Attachment is obsolete: true
Assignee | ||
Updated•15 years ago
|
Attachment #419643 -
Attachment is obsolete: true
Assignee | ||
Updated•15 years ago
|
Attachment #419645 -
Flags: review?(benjamin) → review?(peterv)
Assignee | ||
Comment 61•15 years ago
|
||
Comment on attachment 428410 [details] [diff] [review]
Part3: Take advantage of new API
Henri, can you take a look at the HTML5-parser related changes here.
Attachment #428410 -
Flags: review?(hsivonen)
Comment 62•15 years ago
|
||
(In reply to comment #61)
> (From update of attachment 428410 [details] [diff] [review])
> Henri, can you take a look at the HTML5-parser related changes here.
>diff --git a/parser/html/nsHtml5Portability.cpp b/parser/html/nsHtml5Portability.cpp
>--- a/parser/html/nsHtml5Portability.cpp
>+++ b/parser/html/nsHtml5Portability.cpp
>@@ -74,16 +74,17 @@ nsHtml5Portability::newStringFromString(
> nsString* newStr = new nsString();
> newStr->Assign(*string);
> return newStr;
> }
>
> jArray<PRUnichar,PRInt32>
> nsHtml5Portability::newCharArrayFromLocal(nsIAtom* local)
> {
>+ // XXX what type of atom is this? Is this a real one or a fake HTML5 one?
This always happens to be a real, static one.
> nsAutoString temp;
> local->ToString(temp);
> PRInt32 len = temp.Length();
> jArray<PRUnichar,PRInt32> arr = jArray<PRUnichar,PRInt32>(len);
> memcpy(arr, temp.BeginReading(), len * sizeof(PRUnichar));
> return arr;
> }
>@@ -97,16 +98,17 @@ nsHtml5Portability::newCharArrayFromStri
> }
>
> nsIAtom*
> nsHtml5Portability::newLocalFromLocal(nsIAtom* local, nsHtml5AtomTable* interner)
> {
> NS_PRECONDITION(local, "Atom was null.");
> NS_PRECONDITION(interner, "Atom table was null");
> if (local->IsStaticAtom()) {
>+ // XXX what type of atom is this? Is this a real one or a fake HTML5 one?
Inside the if, the atom is supposed to be fake, but the if condition is the wrong way round (and apparently copied and pasted the wrong way twice below). It should be if(!local->IsStaticAtom()). I'm very surprised to find I've accidentally omitted the negation here and even more surprised that there have been no crashers filed about it.
> nsAutoString str;
> local->ToString(str);
> local = interner->GetAtom(str);
> }
> return local;
> }
>
> void
>diff --git a/parser/html/nsHtml5ReleasableAttributeName.cpp b/parser/html/nsHtml5ReleasableAttributeName.cpp
>--- a/parser/html/nsHtml5ReleasableAttributeName.cpp
>+++ b/parser/html/nsHtml5ReleasableAttributeName.cpp
>@@ -45,19 +45,17 @@ nsHtml5ReleasableAttributeName::nsHtml5R
> }
>
> nsHtml5AttributeName*
> nsHtml5ReleasableAttributeName::cloneAttributeName(nsHtml5AtomTable* aInterner)
> {
> nsIAtom* l = getLocal(0);
> if (aInterner) {
> if (l->IsStaticAtom()) {
>- nsAutoString str;
>- l->ToString(str);
>- l = aInterner->GetAtom(str);
>+ l = aInterner->GetAtom(nsDependentAtomString(l));
This can't use nsDependentAtomString, because if (l->IsStaticAtom()) is bogus and should be if (!l->IsStaticAtom()).
> }
> }
> return new nsHtml5ReleasableAttributeName(nsHtml5AttributeName::ALL_NO_NS,
> nsHtml5AttributeName::SAME_LOCAL(l),
> nsHtml5AttributeName::ALL_NO_PREFIX);
> }
>
> void
>diff --git a/parser/html/nsHtml5ReleasableElementName.cpp b/parser/html/nsHtml5ReleasableElementName.cpp
>--- a/parser/html/nsHtml5ReleasableElementName.cpp
>+++ b/parser/html/nsHtml5ReleasableElementName.cpp
>@@ -49,15 +49,13 @@ nsHtml5ReleasableElementName::release()
> }
>
> nsHtml5ElementName*
> nsHtml5ReleasableElementName::cloneElementName(nsHtml5AtomTable* aInterner)
> {
> nsIAtom* l = name;
> if (aInterner) {
> if (l->IsStaticAtom()) {
>- nsAutoString str;
>- l->ToString(str);
>- l = aInterner->GetAtom(str);
>+ l = aInterner->GetAtom(nsDependentAtomString(l));
This can't use nsDependentAtomString, either, because if (l->IsStaticAtom()) is bogus and should be if (!l->IsStaticAtom()).
> }
> }
> return new nsHtml5ReleasableElementName(l);
> }
>diff --git a/parser/html/nsHtml5TreeOperation.h b/parser/html/nsHtml5TreeOperation.h
>--- a/parser/html/nsHtml5TreeOperation.h
>+++ b/parser/html/nsHtml5TreeOperation.h
>@@ -297,16 +297,17 @@ class nsHtml5TreeOperation {
> }
>
> nsresult Perform(nsHtml5TreeOpExecutor* aBuilder, nsIContent** aScriptElement);
>
> inline already_AddRefed<nsIAtom> Reget(nsIAtom* aAtom) {
> if (!aAtom || aAtom->IsStaticAtom()) {
> return aAtom;
> }
>+ // XXX This appears to be a fake HTML5 atom, so can't use nsAtomString
Correct.
> nsAutoString str;
> aAtom->ToString(str);
> return do_GetAtom(str);
> }
>
> private:
>
> nsresult AppendTextToTextNode(PRUnichar* aBuffer,
Comment 63•15 years ago
|
||
(In reply to comment #62)
> >+ // XXX This appears to be a fake HTML5 atom, so can't use nsAtomString
Actually, the fake atoms could use nsDependentAtomString, too, if the fake atoms implemented GetUTF16String() and GetLength().
Comment 64•15 years ago
|
||
Comment on attachment 419645 [details] [diff] [review]
Part4: Inline accessors into nsIAtom
> diff --git a/parser/html/nsHtml5Atom.cpp b/parser/html/nsHtml5Atom.cpp
> nsHtml5Atom::nsHtml5Atom(const nsAString& aString)
> + }
Do we want the same assertions here as the ones you added in AtomImpl?
> }
> diff --git a/xpcom/ds/nsAtomTable.cpp b/xpcom/ds/nsAtomTable.cpp
> + NS_ASSERTION(buf && buf->StorageSize() >= (mLength+1) * 2, "enough storage");
Spaces around +
s/2/sizeof(PRUnichar)/
> + NS_ASSERTION(aStringBuffer && aStringBuffer->StorageSize() == (mLength+1) * 2, "correct storage");
See above, also long line.
Attachment #419645 -
Flags: review?(peterv) → review+
Comment 65•15 years ago
|
||
Comment on attachment 423584 [details] [diff] [review]
Part2: Use 16bit chars in buffers (updated)
sr=mrbkap with the missing iid bump fixed.
Attachment #423584 -
Flags: superreview?(mrbkap) → superreview+
Comment 66•15 years ago
|
||
Is there a reason nsFakeStringBuffer couldn't include nsStringBuffer as the first member, with a little friendship, e.g.
template<size N>
struct nsStaticAtomStringBuffer
{
nsStringBuffer mStringBuffer;
char mStringData[N];
};
The initialization would then be, I think:
static nsStaticAtomStringBuffer<sizeof(str_data)> buffer_name = {{1, sizeof(str_data)}, str_data};
If this really isn't possible, you should use static assertions that the sizeof and offsetof the members of nsFakeStringBuffer match nsStringBuffer.
Perhaps more importantly, though, I thought I understood that when NS_RegisterStaticAtoms is called, the buffer was going to be ASCII but have room to store a null-expanded version of the string, and that NS_RegisterStaticAtoms would do the null-expansion. But it appears that length=sizeof("literal string") for nsFakeStringBuffer<length>, which doesn't leave enough room for the expansion.
I may be totally confused about that point, in which case I need more/better comments in nsStaticAtom.h explaining what actually goes on.
Assignee | ||
Comment 67•15 years ago
|
||
(In reply to comment #66)
> Is there a reason nsFakeStringBuffer couldn't include nsStringBuffer as the
> first member, with a little friendship, e.g.
>
> template<size N>
> struct nsStaticAtomStringBuffer
> {
> nsStringBuffer mStringBuffer;
> char mStringData[N];
> };
>
> The initialization would then be, I think:
>
> static nsStaticAtomStringBuffer<sizeof(str_data)> buffer_name = {{1,
> sizeof(str_data)}, str_data};
I believe that this requires the members in nsStringBuffer to be public. I don't think there is a way to friend a global initializer. Suggestions welcome.
> If this really isn't possible, you should use static assertions that the
> sizeof and offsetof the members of nsFakeStringBuffer match nsStringBuffer.
That is a good idea.
> Perhaps more importantly, though, I thought I understood that when
> NS_RegisterStaticAtoms is called, the buffer was going to be ASCII but have
> room to store a null-expanded version of the string, and that
> NS_RegisterStaticAtoms would do the null-expansion. But it appears that
> length=sizeof("literal string") for nsFakeStringBuffer<length>, which doesn't
> leave enough room for the expansion.
>
> I may be totally confused about that point, in which case I need more/better
> comments in nsStaticAtom.h explaining what actually goes on.
The "Part 1" patch here does not make us use 16bit atoms. That happens in the "Part 2" patch. There I use the NS_L macro to expand the string at compile time, when that is available. When it's not available we fall back to a slow path where we allocate string buffers on startup. Shouldn't be a big deal since all our primary platforms support expanding NS_L.
See the nsStaticAtom.h changes in attachment 423584 [details] [diff] [review].
Comment 68•15 years ago
|
||
Comment on attachment 419641 [details] [diff] [review]
Part1: Use nsStringBuffers
r=me with static assertions
Attachment #419641 -
Flags: review?(benjamin) → review+
Assignee | ||
Comment 69•15 years ago
|
||
Benjamin: Just to be clear, does your r+ cover all of the "Part 1" spec, or just the nsStaticAtom.h changes? I.e. would you like me to get someone to review the rest of the patch?
Comment 70•15 years ago
|
||
I did not do a line-by-line review of the entire patch, but I did skim it and it looked like a pretty mechanical sort of change. I don't think another review is necessary unless you do.
![]() |
Reporter | |
Comment 71•15 years ago
|
||
So question.... Is it ok that we're assuming nsStringBuffer::Alloc will never return null?
Updated•15 years ago
|
Attachment #428410 -
Flags: review?(hsivonen) → review+
Comment 72•15 years ago
|
||
Comment on attachment 428410 [details] [diff] [review]
Part3: Take advantage of new API
Per Tuesday's telecon, marking r+ in the sense that r+ from me for the nsHtml5 parts provided that comment 62 or comment 63 is addressed.
Assignee | ||
Comment 73•15 years ago
|
||
Checked in!!!
I ended up removing the HTML5-parser changes, I wasn't 100% sure what to do everywhere, and there's no harm in waiting to make those changes until later. Might even be easier then given the changes in part 4.
So here are the parts, in order:
1: http://hg.mozilla.org/mozilla-central/rev/5b152f233117
2: http://hg.mozilla.org/mozilla-central/rev/cb17f4d92942
3: http://hg.mozilla.org/mozilla-central/rev/1995edaefd3f
4: http://hg.mozilla.org/mozilla-central/rev/5901c6b98f83
5: http://hg.mozilla.org/mozilla-central/rev/62bc77046ac9
6: http://hg.mozilla.org/mozilla-central/rev/9cb12f3ff156
7: http://hg.mozilla.org/mozilla-central/rev/f8dd7b5b02ef
Additionally, I forgot to actually fix petervs review comments before checking in. So landed those separately in
http://hg.mozilla.org/mozilla-central/rev/a5de049fc2fb
Thanks everyone for reviewing!
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 74•15 years ago
|
||
(for what it's worth, i don't think this is a blocker, so marking minus)
blocking2.0: ? → -
Comment 75•15 years ago
|
||
I checked in a bustage fix for the TestStaticAtoms test (which only runs on non-libxul builds), with r=sicking:
http://hg.mozilla.org/mozilla-central/rev/dcfcfca09b45
Comment 76•15 years ago
|
||
TestUTF fails too on SeaMonkey: see bug 550980 comment 9.
Comment 77•15 years ago
|
||
(In reply to comment #76)
> TestUTF fails too on SeaMonkey: see bug 550980 comment 9.
Filed bug 551153 as originally suggested.
Updated•15 years ago
|
Flags: in-testsuite+
Target Milestone: --- → mozilla1.9.3a3
Comment 78•15 years ago
|
||
Build fails in
http://mxr.mozilla.org/mozilla-central/source/gfx/thebes/src/gfxFT2Fonts.cpp#667
if MOZ_PANGO isn't enabled.
That's probably a rather unsupported build configuration and I only ran into it by accident but anyway something is wrong if it's possible to trigger a build error in that case.
Comment 79•15 years ago
|
||
Attachment #431659 -
Flags: review?
Updated•15 years ago
|
Attachment #431659 -
Flags: review? → review?(jonas)
Assignee | ||
Comment 80•15 years ago
|
||
Comment on attachment 431659 [details] [diff] [review]
patch for winmo debug build
Please use
nsAtomCString(langGroup).get()
instead That will remove the need for the #ifdef.
r=me with that.
Attachment #431659 -
Flags: review?(jonas) → review+
You need to log in
before you can comment on or make changes to this bug.
Description
•