Closed Bug 92144 Opened 23 years ago Closed 3 years ago

investigate inlining of nsCOMPtr methods

Categories

(Core :: XPCOM, defect, P3)

defect

Tracking

()

RESOLVED INCOMPLETE
Future

People

(Reporter: dbaron, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: perf)

There are a few nsCOMPtr methods that consistently show up as a bit of time
(function, excluding descendents) on jprof profles because we call them so much
that the overhead of a function call may be a significant amount of time.

We should investigate (on multiple platforms and perhaps on multiple versions of
gcc on Linux) the code size and performance affects of inlining:
 * ~nsCOMPtr_base (which would be done by un-defining the
NSCAP_FEATURE_FACTOR_DESTRUCTOR macro), and shows up a little above 1% of time
on page loading profiles taken in jprof
 * nsCOMPtr_base::begin_assignment, which shows up around 0.8% of time on page
loading profiles taken in jprof (which would be done by defining
NSCAP_FEATURE_INLINE_STARTASSIGMENT)
 * nsCOMPtr_base::assign_from_helper, which shows up around 0.3% of time on page
loading profiles taken in jprof.  (This one is more interesting because it might
allow the |operator()| on helpers to be inlined by good compilers rather than be
executed as a virtual function call, although it would also probably increase
code size more than the other two.)
Status: NEW → ASSIGNED
Keywords: perf
Priority: -- → P3
Target Milestone: --- → mozilla1.0.1
Blocks: 71668
Blocks: deCOM
Target Milestone: mozilla1.0.1 → mozilla0.9.7
On Intel x86, inlining nsCOMPtr's destructor will yield the same
number of instructions in the callee, but the number of bytes emitted
will increase by five bytes per nsCOMPtr.

I started with this routine, using RH7.1's gcc-2.96 compiler with
optimization set at -O2.

  void
  comptr_test(nsIRDFService* aService)
  {
      nsCOMPtr<nsIRDFResource> resource;
      aService->GetResource("rdf:null", getter_AddRefs(resource));
  }

With our current nsCOMPtr.h (dtor out-of-line), we get:

00000000 <comptr_test__FP13nsIRDFService>:
   0:	55                   	push   %ebp
   1:	89 e5                	mov    %esp,%ebp
   3:	56                   	push   %esi
   4:	53                   	push   %ebx
   5:	83 ec 1c             	sub    $0x1c,%esp
   8:	8d 75 e8             	lea    0xffffffe8(%ebp),%esi
   b:	8b 5d 08             	mov    0x8(%ebp),%ebx
   e:	56                   	push   %esi
   f:	c7 45 e8 00 00 00 00 	movl   $0x0,0xffffffe8(%ebp)
  16:	e8 fc ff ff ff       	call   17 <comptr_test__FP13nsIRDFService+0x17>
  1b:	83 c4 0c             	add    $0xc,%esp
  1e:	8b 13                	mov    (%ebx),%edx
  20:	50                   	push   %eax
  21:	68 00 00 00 00       	push   $0x0
  26:	53                   	push   %ebx
  27:	ff 52 14             	call   *0x14(%edx)
  2a:	58                   	pop    %eax
  2b:	5a                   	pop    %edx
  2c:	6a 00                	push   $0x0
  2e:	56                   	push   %esi
  2f:	e8 fc ff ff ff       	call   30 <comptr_test__FP13nsIRDFService+0x30>
  34:	83 c4 10             	add    $0x10,%esp
  37:	8d 65 f8             	lea    0xfffffff8(%ebp),%esp
  3a:	5b                   	pop    %ebx
  3b:	5e                   	pop    %esi
  3c:	5d                   	pop    %ebp
  3d:	c3                   	ret    
  3e:	89 f6                	mov    %esi,%esi

That's 27 instructions, assuming that the |mov %esi,%esi| instruction
was emitted to align the function.

With nsCOMPtr::~nsCOMPtr inlined, we get:

00000000 <comptr_test__FP13nsIRDFService>:
   0:	55                   	push   %ebp
   1:	89 e5                	mov    %esp,%ebp
   3:	53                   	push   %ebx
   4:	83 ec 20             	sub    $0x20,%esp
   7:	8d 45 e8             	lea    0xffffffe8(%ebp),%eax
   a:	8b 5d 08             	mov    0x8(%ebp),%ebx
   d:	50                   	push   %eax
   e:	c7 45 e8 00 00 00 00 	movl   $0x0,0xffffffe8(%ebp)
  15:	e8 fc ff ff ff       	call   16 <comptr_test__FP13nsIRDFService+0x16>
  1a:	83 c4 0c             	add    $0xc,%esp
  1d:	8b 13                	mov    (%ebx),%edx
  1f:	50                   	push   %eax
  20:	68 00 00 00 00       	push   $0x0
  25:	53                   	push   %ebx
  26:	ff 52 14             	call   *0x14(%edx)
  29:	8b 55 e8             	mov    0xffffffe8(%ebp),%edx
  2c:	83 c4 10             	add    $0x10,%esp
  2f:	85 d2                	test   %edx,%edx
  31:	74 0c                	je     3f <comptr_test__FP13nsIRDFService+0x3f>
  33:	83 ec 0c             	sub    $0xc,%esp
  36:	8b 02                	mov    (%edx),%eax
  38:	52                   	push   %edx
  39:	ff 50 10             	call   *0x10(%eax)
  3c:	83 c4 10             	add    $0x10,%esp
  3f:	8b 5d fc             	mov    0xfffffffc(%ebp),%ebx
  42:	c9                   	leave  
  43:	c3                   	ret    

Again, 27 instructions.

For reference, re-writing the function without nsCOMPtr, thus:

  void
  comptr_test(nsIRDFService* aService)
  {
      nsIRDFResource* resource;
      aService->GetResource("rdf:null", &resource);
      NS_RELEASE(resource);
  }

yielded:

00000000 <comptr_test__FP13nsIRDFService>:
   0:	55                   	push   %ebp
   1:	89 e5                	mov    %esp,%ebp
   3:	83 ec 0c             	sub    $0xc,%esp
   6:	8b 55 08             	mov    0x8(%ebp),%edx
   9:	8d 45 fc             	lea    0xfffffffc(%ebp),%eax
   c:	8b 0a                	mov    (%edx),%ecx
   e:	50                   	push   %eax
   f:	68 00 00 00 00       	push   $0x0
  14:	52                   	push   %edx
  15:	ff 51 14             	call   *0x14(%ecx)
  18:	8b 45 fc             	mov    0xfffffffc(%ebp),%eax
  1b:	8b 10                	mov    (%eax),%edx
  1d:	89 04 24             	mov    %eax,(%esp,1)
  20:	ff 52 10             	call   *0x10(%edx)
  23:	83 c4 10             	add    $0x10,%esp
  26:	c9                   	leave  
  27:	c3                   	ret    

Only 17 instructions here.
I'm reading the inlining as a net loss (added memory usage, w/ no instruction
gain) then.?
No, since the inlined version has a direct call to the release method whereas
the non-inlined version calls a function that calls release.  Furthermore, the
function that calls release is probably more likely to lead to mispredicted
branches since the processor's cache of where we branched the last time will be
meaningless when every release goes through the same code, which would lead to
more pipeline halts and also more instruction cache misses which require going
back to memory.

In other words, we'd need to do performance testing to find out what's faster.
Target Milestone: mozilla0.9.7 → mozilla0.9.8
Would inlining all nsCOMPtr methods (constructor, getter_AddRefs, destructor) 
yield the same code as not using an nsCOMPtr at all? 

I mean the result should be the same and if the compiler can see the full method 
it should be able (maybe only in my dreams) to generate the same machine code.
Given a good compiler, one would hope it would, although there's a bit of
nulling, null-checking, and making sure to addref before releasing that it
wouldn't be able to optimize.  And it would certainly stop defeating some of the
branch-prediction stuff in the newer Pentium processors.  (~nsCOMPtr_base calls
Release on tons of different interfaces).
Target Milestone: mozilla0.9.8 → mozilla0.9.9
Target Milestone: mozilla0.9.9 → Future
collecting under |nsCOMPtr| tracking bug # 178174
Is bug 139986 related to this (see profile results there)?
QA Contact: scc → xpcom
Assignee: dbaron → nobody
Status: ASSIGNED → NEW

Probably better to file a new bug if nsCOMPtr methods show up in profiles. Hopefully with all of the PGO and whatnot that happens we don't have the manually inline things as much.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.