Closed Bug 412320 (gqi) Opened 12 years ago Closed 12 years ago

Investigate code-generation for QI impls

Categories

(Core :: XPCOM, defect, P1)

x86
Linux
defect

Tracking

()

RESOLVED WONTFIX

People

(Reporter: benjamin, Assigned: benjamin)

References

(Depends on 1 open bug)

Details

(Keywords: perf)

Attachments

(7 files, 4 obsolete files)

Spun off from an IRC conversation regarding bug 412300: We could probably avoid a lot of churn and load/save instructions by optimizing code based on all the IIDs being compared at once, rather than comparing each one individually.

roc suggested something like a perfect hash generator, but I don't think that's going to work without tweaking: we don't know the set of input data offhand, so we would still have to compare for equality after the hash lookup... and data shows that misses are far more common that hits in most QI impls (70% misses).

I'm currently thinking about a series of generated switch statements: and also, because we're generating code, we can easily flatten hierarchical QI so that nsHTMLElement::QI doesn't have to call nsGenericElement::QI at the tail.

I'm going to spend a day or so on this as a feasibility thing.
(In reply to comment #0)
> roc suggested something like a perfect hash generator, but I don't think
> that's going to work without tweaking: we don't know the set of input data
> offhand, so we would still have to compare for equality after the hash
> lookup...

Right.

> and data shows that misses are far more common that hits in most QI impls (70%
> misses).

It still reduces the number of unpredictable branches to 1. That's the minimum no matter what scheme you choose.

Psuedocode for the QI impl, assuming an IID is an array of 4 32-bit words:

Code-gen supplies constant IIDs[N], adjust[N], int indexWord, int indexBitStart, int indexBitMask

QueryInterface(aIID) {
  int index = (aIID[indexWord] >> indexBitStart) & indexBitMask;
  if (aIID[0] != IIDs[index][0]) return null; // sole unpredictable branch
  if (aIID[1] != IIDs[index][1]) return null;
  if (aIID[2] != IIDs[index][2]) return null;
  if (aIID[3] != IIDs[index][3]) return null;
  return (char*)this + adjust[index];
}
Depending on the number of interfaces, you might need to increase the width of indexBitMask beyond the number of bits in N to ensure you don't have collisions. If you do that, you could either live with bigger tables or use a slightly slower QI (albeit with no additional unpredictable branches) that uses "index" to look up an array of bytes, which are indices into the actual IIDs array.
Keywords: perf
Attached patch Checkpoint (obsolete) — Splinter Review
This is a checkpoint. It processes NS_IMPL_QUERY_INTERFACEn correctly, with restrictions like the macro has to be on one line etc...

Next step is to make it a little less insane, and then add support for _INHERITING and other variants.
Assignee: nobody → benjamin
Status: NEW → ASSIGNED
Looks good.

+        print >>fd, "  if (!entry->iid || entry->iid->Equals(aIID))"

Why not insist that the entry always have an iid pointer? Saves one unpredictable branch.
Also, why set the minimum number of bits to 3? If I have a class implementing, say, nsISupports and nsIFooBar (fairly common?), then 1 bit will be enough.
+                l = list(iter_none(2 << bits))

My Python is weak, but why 2 << bits instead of 1 << bits?
The iid is empty in the "miss" case (when there is no entry in that bucket). What would you put in there instead?

The minimum number of bits is 3 to avoid false-positive hits: if you only have a two-entry table (e.g. for NS_IMPL_ISUPPORTS1) you would have a 50% false-positive rate where you have to compare the IIDs.

And the 2 << bits thing was just a bug.
(In reply to comment #7)
> The iid is empty in the "miss" case (when there is no entry in that bucket).
> What would you put in there instead?

A pointer to a dummy IID that is not associated with any interface. Checking against a null IID pointer will save a load for the "miss" case but it will penalize the "hit" case by forcing an extra pipeline stall when the null-check is mispredicted.

> The minimum number of bits is 3 to avoid false-positive hits: if you only have
> a two-entry table (e.g. for NS_IMPL_ISUPPORTS1) you would have a 50%
> false-positive rate where you have to compare the IIDs.

That shouldn't matter. You eat your one unpredictable branch by comparing the first word of the two IIDs and finding they don't match. It should be no more costly than checking a null iid pointer or a dummy IID.

Feel free to test my hypotheses with a microbenchmark (or a bigger benchmark), of course.
Is there a "safe" dummy IID that is invalid? (00000000-0000-0000-0000-000000000000} may be invalid, I'm not sure.

I tend to think that we want to optimize the "miss" case over the "hit" case if we can, because that's more common in general, but there's certainly lots of room to tweak the algorithm.
Why are misses more common than hits?  (I assume that by "misses", you mean trying to QI an object to an interface it does not support.)  Does it help to predict ahead of time which interfaces a given object will be given QI requests for, e.g. as input to the perfect-hash generator?
IIRC, xpconnect, cycle collector and other "generic" clients will query every object whether it support various common interfaces such as nsIXPCSecurityCheckedComponent, nsIClassInfo, nsICycleCollectionPartipant, etc

There are other callers who walk around a tree looking for forms, form elements, etc... I don't remember all of the cases, and it's very hard to instrument.

We could attempt to hard-code a miss in some of these cases, if it turns out to measure well, but I think that a simple table-driven lookup will perform better in general.
Attached patch Checkpoint that builds (obsolete) — Splinter Review
This checkpoint actually works, I think, at least until you start compiling non-XPCOM code, which fails linking nsIByteBuffer::COMTypeInfo<int>::kIID... I've made an "xpcomds.h" file which they need to include in that case, but that's a little invasive. I'm going to see if I can avoid the extra header file and use pr static asserts to enforce the declared pseudo-iid (in the gqi file) matches the one in the .h file.
Attachment #297227 - Attachment is obsolete: true
(In reply to comment #9)
> Is there a "safe" dummy IID that is invalid?
> (00000000-0000-0000-0000-000000000000} may be invalid, I'm not sure.

actually it's not (i tried using this, to my peril, when instrumenting QI).

just pick a random IID and define that as your dummy? we do this already (although this is a pretty ugly example), see e.g. http://mxr.mozilla.org/mozilla/source/js/src/xpconnect/src/xpcjsid.cpp#206
(In reply to comment #9)
> Is there a "safe" dummy IID that is invalid?
> (00000000-0000-0000-0000-000000000000} may be invalid, I'm not sure.

Just uuidgen your own and don't use it as an interface :-). Having said that, all-zeroes can't be used in a real interface ... if it is, someone cheated and should be punished :-).

> I tend to think that we want to optimize the "miss" case over the "hit" case if
> we can, because that's more common in general, but there's certainly lots of
> room to tweak the algorithm.

Yeah but you're only making the miss case faster by one load, and the penalty for the hit case is probably quite a bit more than that. Needs benchmarking I guess.

(In reply to comment #10)
> Why are misses more common than hits?  (I assume that by "misses", you mean
> trying to QI an object to an interface it does not support.)  Does it help to
> predict ahead of time which interfaces a given object will be given QI
> requests for, e.g. as input to the perfect-hash generator?

Probably not, but it might help to know for a particular QI implementation whether we should be optimizing for hits or misses. Although that might be a premature optimization at this stage.
(In reply to comment #14)
> if it is, someone cheated and should be punished :-).

(i actually tried to track down who was using a {0...} iid, but didn't have much luck because i couldn't get gdb to break there... someone calls QI asking for it during component registration, somewhere in the BOOT module i suspect.)
Attached patch Checkpoint that works, alpha (obsolete) — Splinter Review
This one passes all the XPCOM unit tests, and has support for basic interface-map syntax.
Attachment #297258 - Attachment is obsolete: true
This is a pretty good checkpoint. It works for nsGenericElement and nsHTMLAnchorElement... it has a declaration of QI for nsGenericHTMLElement but not an implementation because nsGenericHTMLElement doesn't actually have a QI impl.

roc's going to play with tweaking the hash algorithm to avoid the 128-entry tables that I'm seeing currently... it may be worth varying on two words and not just one.
Attachment #297354 - Attachment is obsolete: true
Oh just a note that deep dependencies aren't working (at all) yet, so if you change an imported .gqi or .idl file you'll need to remove the generated .cpp file.
nsHTMLAnchorElement has 22 interfaces, so I did an experiment with 1000 trials of 22 randomly-generated IIDs. The table sizes had the following distribution:
32 entries:  7
64 entries:  767
128 entries: 226
100000 trials, just for kicks:
32 entries:  5956
64 entries:  79100
128 entries: 20304
and there was one case of a 256-entry table.
Hmm.  Creating separate tables for all HTML elements might lead to a good bit more data size.  The tables have a _lot_ of overlap, basically...  That's why there was the helper stuff in generichtmlelement instead of just stuffing everything into the subclass QI tables.

Not sure what the performance effect is one way or the other, of course.
10000 trials, allowing ourselves to use two bitfields of the same 32-bit word of the IID. The fields are just concatenated to form the table index.
32 entries: 2693
64 entries: 7307
So it reduces the average table size from about 77 to about 55. It might be possible to reduce the average table size further by judiciously changing the IIDs of our private interfaces.

One way to compress things a lot more and address bz's issue would be to combine all the IID/action pairs common to HTML elements into a single master table (and combining the action code into a single helper function). Then each class could have a QI table that's just an array of 1-byte entries, where values 0-127 index the master table, values 128-255 index a per-class IID/action table.
That also reduces the cost of an empty slot to 1 byte, of course.
The master table sounds scary... the offsets change per-class, and even things like "new nsDOM3NodeTearoff(this);" generate different code per-class because of different static-cast offsets. But, you could get similar savings from having a per-QI-impl table:

static const PRUint8 kLookupTable[]
{
  0xFFFF, // not found
  0x0,    // kActionTable entry 0
  ...
};

static const QIAction kActionTable[] =
{
  {
    ACTION_OFFSET,
    2
  },
  {
    ACTION_POINTER,
    &MY_CCHELPER_WHATEVER
  }
};
> and even things like "new nsDOM3NodeTearoff(this);" generate different code
> per-class because of different static-cast offsets

I don't see that. If it did, the current code couldn't live in nsGenericElement.cpp.

Life would be so much better if we didn't implement the DOM interfaces as XPCOM C++ interfaces on each node, but instead the JS wrapper implemented them in terms of nsINode/nsIContent/etc. 
When you're calling a method on nsGenericElement, "this" is already cast to nsGenericElement*... you would have to explicitly offset "this" to the correct subtype table element, and since we don't know that offset at code-generation time each element would need a different table element.

Yeah, we need lots of forth wrappers for moz2!
This is a completed patch that does what I want... unfortunately there's an issue with endianness: because nsID is not four 32-bit words I have to byte-swap UUIDs to make them work... and AFAICT I don't have build-time knowledge of endianness.

So in any case, this patch should work on intel, and I'd like to consider test-landing it (even with a PPC hack) to get some real-world perf numbers.
(In reply to comment #28)
> When you're calling a method on nsGenericElement, "this" is already cast to
> nsGenericElement*... you would have to explicitly offset "this" to the correct
> subtype table element, and since we don't know that offset at code-generation
> time each element would need a different table element.

But nsGenericHTMLElement is always first in the inheritance list, so the offset is always 0, assuming a sane compiler. (Or if it's not always first, we can make it so!)
I'm surprised that we don't have any other endian-knowing code in the tree (compile-time, that is).  sqlite? mork? JS XDR?
Are the IS_BIG_ENDIAN/IS_LITTLE_ENDIAN ifdefs not what you want?
If I could get that into the code-generation tool, sure... it can't just be at compile-time... it has to be at code generation time.
Ah, you did say build-time.

It looks like there are NSPR header-like things that end up defining it correctly.  Can we make use of that somehow?
Or if all else fails, can codegen do a configure-like test (or could configure itself do the test?) to determine endianness?
Can we just make the codegen part of the build pass, such that you can cpp -E a trivial file to figure out the endianness?

Or could you generate for both endiannesses, bracketed by #ifdefs?
This generates both endiannesses and selects the right one via #ifdef... I'm not certain what #includes need to be pulled in for IS_BIG_ENDIAN... I think it's defined in NSPR and prtypes.h should be sufficient (which is always included through nsISupports.h -> nscore.h).

I don't have perf numbers yet but I have a tryserver run going now and will hopefully be able to run those to get perf numbers.
Attachment #297811 - Flags: review?(roc)
Am I really the right reviewer for this? I've never written a line of Python code :-)
Well, you suggested the algorithm... I'm happy to get Ted or... somebody to review the build-fu, as long as the algorithm is implemented correctly.
Alias: gqi
Flags: blocking1.9+
Priority: -- → P1
(In reply to comment #14)
> Yeah but you're only making the miss case faster by one load, and the penalty
> for the hit case is probably quite a bit more than that. Needs benchmarking I
> guess.

If the branch target is in the pipe, and it should be here, the penalty is small. Going by memory here, benchmarks better by a lot. I agree with roc that we should not hurt hit perf if we can avoid it with a reserved IID.

/be
Comment on attachment 297811 [details] [diff] [review]
Generate both endiannesses, rev. 1

The algorithm looks correct to me, and my own implementation gives similar results, so I guess it's good. There are of course details we've quibbled about that we should experiment with later.

It would help everyone if you attached another generated code sample.

We're targeting this for 1.9 now? Don't tell me Schrep's going soft!
Attachment #297811 - Flags: review?(roc) → review+
(In reply to comment #41)
> (From update of attachment 297811 [details] [diff] [review])
> We're targeting this for 1.9 now? Don't tell me Schrep's going soft!

I didn't know this was on the radar, but in its defense, generating code via an algorithm we debug beats hand-coding and debugging the larger body of handwritten code. Can we use the generation alg and .gqi files to create all-paths testcases?

/be
(In reply to comment #41)
> (From update of attachment 297811 [details] [diff] [review])
> The algorithm looks correct to me, and my own implementation gives similar
> results, so I guess it's good. There are of course details we've quibbled about
> that we should experiment with later.
> 
> It would help everyone if you attached another generated code sample.
> 
> We're targeting this for 1.9 now? Don't tell me Schrep's going soft!
> 

I'm soft when it comes to across-the-board perf wins, for sure.  We've got a bunch of ground to makeup in Win start time.
Attachment #297811 - Flags: review?(ted.mielczarek)
Attached patch Fix some style things, rev. 1.5 (obsolete) — Splinter Review
This version fixes a couple of inconsistent comments that crept in during development, as well as making this script work with python 2.3 and 2.4 (mainly using sets.Set() instead of the set() builtin).
Attachment #298390 - Flags: review?(ted.mielczarek)
Attachment #297811 - Flags: review?(ted.mielczarek)
Comment on attachment 298390 [details] [diff] [review]
Fix some style things, rev. 1.5

>Index: xpcom/base/gqi.py
>===================================================================

For consistency with other Python scripts in the build system, could you stuff the top-level code into a main() method, and have the usual:
if __name__ == '__main__':
  main()

>+def dump_hex_tuple(t):
>+    return "(%s)" % ", ".join(["%#x" % i for i in t])

Probably should just remove this and the commented out debug prints below, if you're not using it anymore, unless you want to support some sort of debug output mode.
>+    def asguid(self):
>+        return "{ 0x%s, 0x%s, 0x%s, { 0x%s, 0x%s, 0x%s, 0x%s, 0x%s, 0x%s, 0x%s, 0x%s } }" % self.iid

This method name doesn't do much for me.  Maybe "as struct" or something like that?

>+        if line.startswith('%{'):
>+            for line in f:
>+                if line.startswith('%}'):
>+                    break
>+            continue

A brief comment here would be nice.

>+def gqi(ifaces, endian):
>+    """Find an algorithm that uniquely identifies interfaces by a single
>+    word. Returns (indexfunction, table)"""

Any reason this function isn't just a member of |class QIImpl|?

>+                l = list(iter_none(1 << bits))

You could just use a list comprehension instead of that iter_none generator:
l = [None for x in xrange(1 << bits)]

>+class QIImpl(object):

I think the constructor should be the first method defined in this class.


>+    def __init__(self, cname, ifaces, emitrefcount=False, threadsafe="", unfound = None, base=None):

nit: consistent spacing around the assignments

>+        # list of (i, baselist)
>+        ilist = [i for i in self.iter_all_ifaces()]
>+        print >>fd, "namespace %s_QIImpl {" % self.cname

Commented on this on IRC, but can we reliably use namespace now, or does it break on AIX or some other oddball platform?  (Works on all tier 1 platforms.)

>+
>+        if len(actions) > 1:
>+            print >>fd, "enum QIActionType {"
>+            print >>fd, ", ".join([action.enumtype for action in actions])
>+            print >>fd, "};"
>+
>+        print >>fd, "struct QIAction {"
>+        if len(actions) > 1:
>+            print >>fd, "  QIActionType action;"
>+        print >>fd, "  const nsID *iid;"
>+        print >>fd, "  union {"
>+        print >>fd, "    void* ptr;"

A few of these prints could be improved using triple-quoted strings.  Actually, a lot of things in this function could.  I think it'd make it much clearer.

>+        print >>fd, "#ifdef DEBUG"
>+        for i, baselist in ilist:
>+            if i.uuid.pseudo:
>+                print >>fd, "static const nsID kGQI_%s = %s;" % (i.uuid.name, i.uuid.asguid())
>+                print >>fd, 'NS_ASSERTION(NS_GET_IID(%(iname)s).Equals(kGQI_%(iname)s), "GQI pseudo-IID doesn\'t match reality");' % {'iname': i.uuid.name }
>+        print >>fd, "#endif"

Could you mangle this into a PR_STATIC_ASSERT, or do you really need to call IID::Equals?


>+        for i in bigitable:
>+            print >>fd, "  %s," % (i, "0xFF")[i is None]

Is there a clearer way to write that that won't make this too verbose?  I find that indexing by boolean very hard to comprehend.

I'll pick this up later, I think I need to go lay down for a bit.
This fixes most of the nits: it uses a single large """-quoted string for the generated code, with named inserts as appropriate.

I did not replace the (falseval, trueval)[conditional] expression because that is the standard pythonism for a ternary operator, but I did comment it for readability.

I moved the member methods below __init__ in QIImpl, but I put the class-static data above, as with other classes.
Attachment #298502 - Flags: review?(ted.mielczarek)
Attachment #298390 - Attachment is obsolete: true
Attachment #298390 - Flags: review?(ted.mielczarek)
Comment on attachment 298502 [details] [diff] [review]
Fix more nits, rev. 1.6

>Index: xpcom/base/gqi.py
>===================================================================

>+class CCISupportsResponse(object):
>+    action = CCSupportsAction
>+
>+    def __init__(self, classname):
>+        self.classname = classname
>+        self.uuid = nsCycleCollectionISupports
>+
>+    def emitPrelue(self):
>+        pass
>+
>+    def emitTable(self, classname, baselist):
>+        return "{ 0 }"
>+

"emitPrelue" doesn't seem to be used anywhere.

r=me on the Python and build parts.  I basically skipped all the C++ changes here, that goes over my head.
Attachment #298502 - Flags: review?(ted.mielczarek) → review+
This is a simple move to generated-QI for nsXMLDocument/nsXULDocument/nsGlobalWindow, as I suspect those will have the most impact on Ts/Txul times.
Attachment #299266 - Flags: review?
Attachment #299266 - Flags: review? → review?(jonas)
This tweaks gqi.py to add the following features:

* A commandline flag to turn on debugging
* Responses which should not be inherited by subclasses
* Conditional response (needed for nsStandardURL)
* CID responses
Attachment #299280 - Flags: review?(ted.mielczarek)
nsStandardURL::QI showed up in the startup profile, so this implements it and others that I think might be likely candidates.
Attachment #299284 - Flags: review?(cbiesinger)
Comment on attachment 299280 [details] [diff] [review]
Minor tweaks to gqi.py, rev. 1

>+        # ensure that interfaces are not duplicated
>+        # maps name -> (uuid, baselist)
>+        ifdict = uniqdict()
>+        for i, baselist in self.iter_all_ifaces():
>+            if i.uuid.name in ifdict:
>+                print >>sys.stderr, "warning: interface '%s' from derived class '%s' overrides base class '%s'" % (i.uuid.name, baselist[0], baselist[len(baselist) - 1])
>+            else:
>+                ifdict[i.uuid.name] = (i, baselist)
>+

Doesn't this sort of defeat the purpose of using a uniqdict here?
Attachment #299280 - Flags: review?(ted.mielczarek) → review+
XULElement likely gets QIed more than XULDocument.  Same for nsGenericElement...
nsGenericElement is already done. I'll get to nsXULElement, in another patch.

As for the uniqdict, the problem is that because we iterate from "most-derived" to "base", I can't just blindly overwrite. Now there shouldn't ever be a case where I overwrite, but I'd prefer to be safe than sorry.
Did the original check-in help with performance? It certainly increased codesize 
and made the code a bit harder to read and maintain.
So we going to try with the additional DOM + URI classes before we call this one way or the other?
The original checkin helped performance on Linux... I didn't see obvious improvements on Mac/Win. I would like to land the xuldocument/standardurl changes to see whether they help at all.

Smaug, what are the specific concerns you had about "harder to read and maintain"?
(In reply to comment #56)
> Smaug, what are the specific concerns you had about "harder to read and
> maintain"?
- yet another Mozilla-specific coding style
  * .gqi files
- keeping some IIDs in two places
  * .h
  * .gqi
- so far not-document syntax of .gqi
  * what can a .gqi file contain?
- nsCycleCollectionParticipant and nsIClassInfo IIDs hardcoded in the middle of some python script
Comment on attachment 299284 [details] [diff] [review]
nsStandardURL and friends, rev. 1

+%pseudo-cid nsStandardURL b8e3e97b-1ccd-4b45-af5a-79596770f5d7 NS_THIS_STANDARDURL_IMPL_CID

please add a comment pointing to the other location of this constant, in case people ever want to change this constant. also in the other place.

+NS_IMPL_THREADSAFE_ISUPPORTS4(nsSocketTransport,

I'd prefer if you didn't move that, because the original file contains the same list of interfaces right next to it. but if you insist, do add a comment in both places pointing to the other one.
Attachment #299284 - Flags: review?(cbiesinger) → review+
or maybe moving the NS_IMPL_CI_INTERFACE_GETTER3(nsSocketTransport...) to the qgi file would work too, perhaps in a %{C++} block?
At least on Windows, dependencies aren't working for GQI e.g. a nsINode.h change won't rebuild htmlcontentGQI.cpp etc. You can verify this by touching content/base/public/nsINode.h then content/html/content/src/ns*.obj and then trying to make in content/html/content/src - nothing will happen.
this fact broke all three windows dep tinderboxen today; they had to be clobbered to fix it. (i revved the nsINode iid in nsINode.h and contentbase.gqi; however the dep build asserted the iid's didn't match anymore.)
OK, maybe I was too hasty there... only one of my builds had the problem. Sorry.
Looking through the log files for the affected tinderboxen, what I saw was that on Windows the files that depend on contentbase.qgi weren't being re-generated. On Linux they were.
What's odd is that it looks like .all.pp is being updated. It just doesn't trigger contentbaseQI.cpp to be regenerated.
it looks like dependencies on some people's linux builds are broken too (bug 415422). i'd also note that |make distclean| didn't clobber the gqi bits in that instance.
Shouldn't be too hard to add the GQI_SRCS to GARBAGE.

I wonder what's going on with their linux dep build though.
Could we file a different bug (or reopen 415422) on the dependency issue?
Depends on: 415422
reopened 415422.
Jonas can we review the last bit so we can decide one way or another on this approach?
Depends on: 416843
The checkin here seems to be a likely candidate for causing the new crasher bug 417485.
Looks like the Attachment #298502 [details] [diff] (  Fix more nits, rev. 1.6 ) is causing the MOzilla Firefox crash ( Bug 416300 )
I backed all this out yesterday... it only showed a perf win on linux, and that was only 0.5% or so, which by consensus isn't worth the additional complexity and codesize. Resolving this WONTFIX.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.