Evaluate performance of FixedMalloc::FindBeginning (and speed up if necessary)

RESOLVED FIXED in Q1 12 - Brannan

Status

P1
major
RESOLVED FIXED
7 years ago
7 years ago

People

(Reporter: pnkfelix, Assigned: pnkfelix)

Tracking

unspecified
Q1 12 - Brannan
Bug Flags:
flashplayer-qrb +

Details

Attachments

(5 attachments, 4 obsolete attachments)

Bug 663508, which added FixedMalloc::FindBeginning, has been signed off.  But there is a significant potential speed trap in that code.  The performance impact needs to be properly evaluated; if we are in danger of introducing a serious slow-down, that needs to be addressed.

There were some preliminary measurements done on the original ticket, see Bug 663508, comment 45.  But the slice of data presented there may not be optimal; Lars noted in Bug 663508, comment 46 that I should have run some of the apps for a longer period of time, and that peak heap blocks may not be the right metric to be analyzing (at least not in isolation).

Comment 1

7 years ago
changeset: 6540:0bdd7ace90ff
user:      Felix S Klock II <fklockii@adobe.com>
summary:   Bug 663508: FixedMalloc::FindBeginning code; it is slow (r=treilly, sr=lhansen).

http://hg.mozilla.org/tamarin-redux/rev/0bdd7ace90ff
(Assignee)

Updated

7 years ago
Priority: -- → P1
(while doing some general BZ maintenance, noticed Lars had already played with a project that would resolve this in a totally different way: getting rid of FixedMalloc entirely.  See Bug 519998; if we were to consider that option seriously, it would make all of this FixedMalloc::FindBeginning stuff moot.)
See Also: → bug 519998
Note: Performance impact under discussion is for cases where FindBeginning[AndSize] is being invoked in non-Debug builds.  The main case of that is the GCRoot construction; see Bug 664137.

(This is relevant because I realized today that there is a clear way to make this fast for the GCRoot case alone, which is probably good enough as long as I also document FindBeginning[AndSize] as being slow in general.)
Created attachment 557554 [details] [diff] [review]
patch S1: add GCRoot + FindBeginning stats on TR rev 6554

This is a revision of the stats-adding patch on Bug 663508 (aka patchD, aka attachment 540506 [details] [diff] [review]).

In particular, it adds timing data in addition to the iteration counts that patchD had.

I used it last night to gather some data on Brent's family of GC benchmark swfs.

(I also developed a patch to side-step the iteration entirely; I'll be putting that up shortly.)
Created attachment 557642 [details]
stress test for GCRoot

Stress test for GCRoot.  It illustrates the poor scaling of the current GCRoot call to FindBeginning as the heap grows and becomes more fragmented.

(Note that this stress test is allocating a lot more GCRoots than I've encountered in practice.)

Performing a series of invocations (Release 64-bit avmshell build):

  % for i in {250,500,750,1000,1250,1500,1750,2000,2250,2500,2750,3000};do \
    avmshell -Dselftest=mmgc,gcroot_alloc,gcroot_allocs$i | grep -e 'tick';done

yields the following table:

root count: 24 *  250 mean ticks:  2628 = 0.00ms max ticks:  437523 = 0.44ms 
root count: 24 *  500 mean ticks:  4027 = 0.00ms max ticks:  828482 = 0.83ms 
root count: 24 *  750 mean ticks:  5642 = 0.01ms max ticks:  938133 = 0.94ms 
root count: 24 * 1000 mean ticks:  7404 = 0.01ms max ticks: 1207910 = 1.21ms 
root count: 24 * 1250 mean ticks:  9214 = 0.01ms max ticks: 1359110 = 1.36ms 
root count: 24 * 1500 mean ticks: 11628 = 0.01ms max ticks: 1589258 = 1.59ms 
root count: 24 * 1750 mean ticks: 14690 = 0.01ms max ticks: 2031810 = 2.03ms 
root count: 24 * 2000 mean ticks: 18828 = 0.02ms max ticks: 2162801 = 2.16ms 
root count: 24 * 2250 mean ticks: 23563 = 0.02ms max ticks: 2561135 = 2.56ms 
root count: 24 * 2500 mean ticks: 28872 = 0.03ms max ticks: 2548454 = 2.55ms 
root count: 24 * 2750 mean ticks: 34794 = 0.03ms max ticks: 2650959 = 2.65ms 
root count: 24 * 3000 mean ticks: 41223 = 0.04ms max ticks: 2898228 = 2.90ms 

(Each run is iterated 10 times; the root count measures the maximum number of roots allocated at once.  The roots themselves vary widely in size; see the code.)

The above tick counts include the cost of all GCRoot construction, which is not limited to the FindBeginning call.  By using the -gcbehavior switch and the stats gathering patch S1 on this ticket (attachment 557554 [details] [diff] [review]), we can narrow the data to gauge how much time is being spent in FindBeginning iteration.

% for i in {250,500,750,1000,1250,1500,1750,2000,2250,2500,2750,3000}; do \
           ../objdir-rel64/shell/avmshell                           \
           -Dselftest=mmgc,gcroot_alloc,gcroot_allocs$i -gcbehavior \
         | grep -e 'blockiter' | grep -v 0.0$ | tail -1 ;           \
  done | cut -d ' ' -f3,10,6,11

constructed=60000 blockiters.max=299 blockiter-time-max=0.0 blockiter-time-all=111.8
constructed=120000 blockiters.max=575 blockiter-time-max=0.0 blockiter-time-all=392.5
constructed=180000 blockiters.max=851 blockiter-time-max=0.1 blockiter-time-all=875.2
constructed=240000 blockiters.max=1127 blockiter-time-max=0.1 blockiter-time-all=1583.2
constructed=300000 blockiters.max=1403 blockiter-time-max=0.1 blockiter-time-all=2560.2
constructed=360000 blockiters.max=1679 blockiter-time-max=0.2 blockiter-time-all=3885.1
constructed=420000 blockiters.max=1954 blockiter-time-max=0.2 blockiter-time-all=5989.2
constructed=480000 blockiters.max=2229 blockiter-time-max=0.2 blockiter-time-all=8658.0
constructed=540000 blockiters.max=2506 blockiter-time-max=0.2 blockiter-time-all=12284.3
constructed=600000 blockiters.max=2782 blockiter-time-max=0.3 blockiter-time-all=16867.5
constructed=660000 blockiters.max=3057 blockiter-time-max=0.3 blockiter-time-all=22622.2
constructed=720000 blockiters.max=3334 blockiter-time-max=0.4 blockiter-time-all=29267.6

The interesting points here:
1. The maximum time for a block iteration is increasing as we increase the root count, and that is because the number of blocks that we have to iterate over is growing.  This is unsurprising; it just shows that the self test is stressing the exact place that Lars has been concerned with.

2. If we divide the accumulated time for all of the iterations (the right hand number) by the total number of GCRoot constructions (the left hand number), we get the mean time spent in block iteration.  For the early rows, the result is insignificant.  But it gets larger, and looks like it might be dominating the average time spent in overall GCRoot construction that we saw in the first table above.

Here are the numbers so you don't have to do the division yourself:

> 111.8 / 60000          0.0018633333333333334
> 392.5 / 120000         0.0032708333333333335
> 875.2 / 180000         0.004862222222222223
> 1583.2 / 240000        0.0065966666666666665
> 2560.2 / 300000        0.008534
> 3885.1 / 360000        0.010791944444444444
> 5989.2 / 420000        0.01426
> 8658.0 / 480000        0.0180375
> 12284.3 / 540000       0.0227487037037037
> 16867.5 / 600000       0.0281125
> 22622.2 / 660000       0.03427606060606061
> 29267.6 / 720000       0.04064944444444444


So it seems plausible that in this stress test, the block iteration is what is growing to dominate the GCRoot construction time.
Created attachment 557646 [details] [diff] [review]
patch C v1: pre-filled one-element cache for FindBeginning

This is the trick that I alluded to in comment 3.

I've also adapted it in my tree to be applicable atop patch S1; I'm gathering the performance differences on that stress test next.

(But I can post my results from last night describing the difference it makes on Brent's real-world SWFs as well.)
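The core of the one-element cache idea can be sketched as follows.  This is an illustrative stand-in, not the actual Tamarin code (all names here are invented): GCRoot::operator new already knows the beginning and size of the block it just allocated, so it can pre-fill a cache that FindBeginning consults before resorting to the slow block iteration.

```cpp
#include <cstddef>

// Hypothetical sketch of a pre-filled one-element FindBeginning cache.
struct FindBeginningCache {
    void*  objBegin = nullptr;
    size_t objSize  = 0;

    // Called from GCRoot::operator new, which knows the beginning and
    // size of the block it has just allocated.
    void Stash(void* begin, size_t size) {
        objBegin = begin;
        objSize  = size;
    }

    // Consulted before falling back to block iteration: returns the
    // cached beginning if addr lies inside the cached object,
    // otherwise nullptr (caller then does the slow search).
    void* FindBeginning(const void* addr) const {
        const char* b = static_cast<const char*>(objBegin);
        const char* a = static_cast<const char*>(addr);
        if (b != nullptr && a >= b && a < b + objSize)
            return objBegin;
        return nullptr;
    }

    // Called on free so the cache never points at recycled memory.
    void Clear() {
        objBegin = nullptr;
        objSize  = 0;
    }
};
```

Because the GCRoot constructor's FindBeginning call happens immediately after the allocation that filled the cache, this fast path should hit every time in the GCRoot use case.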
(In reply to Felix S Klock II from comment #5)
> So it seems plausible that in this stress test, the block iteration is what
> is growing to dominate the GCRoot construction time.

This conclusion was bogus, even from the data presented.  Note in particular that the GCRoot construction time grows to the order of 2.9ms, but the maximum time spent in any block iteration was at most 0.4ms.  So the block iteration was never dominating the construction time.

(Nonetheless, the gcbehavior stats do present growth in the time spent doing block iterations, and so the cache from patch C still has value.  All this means is that the selftest is not stressing solely the block iteration, as I had hoped; there are clearly other parts of the system that are also being stressed as the root count is increased.)

Will post data from running with patch C shortly.
Adapted patch C to work properly atop the statistics gathering code and the selftest (which, *importantly*, required adapting the selftest UnzeroedRoot::operator new() to call the Stash function the same way that the GCRoot::operator new() now does with patch C).

With that adaptation in place, here is the analogous data to that presented in comment 5:

% for i in {250,500,750,1000,1250,1500,1750,2000,2250,2500,2750,3000} ; do  ../objdir-rel64/shell/avmshell -Dselftest=mmgc,gcroot_alloc,gcroot_allocs$i | grep -e 'tick'; done

root count: 24 *  250 mean ticks:   745 = 0.00ms max ticks:  442259 = 0.44ms 
root count: 24 *  500 mean ticks:   798 = 0.00ms max ticks:  740731 = 0.74ms 
root count: 24 *  750 mean ticks:   780 = 0.00ms max ticks:  941671 = 0.94ms 
root count: 24 * 1000 mean ticks:   798 = 0.00ms max ticks: 1125540 = 1.13ms 
root count: 24 * 1250 mean ticks:   763 = 0.00ms max ticks: 1274617 = 1.27ms 
root count: 24 * 1500 mean ticks:   752 = 0.00ms max ticks: 1514228 = 1.51ms 
root count: 24 * 1750 mean ticks:   754 = 0.00ms max ticks: 1648674 = 1.65ms 
root count: 24 * 2000 mean ticks:   760 = 0.00ms max ticks: 1910810 = 1.91ms 
root count: 24 * 2250 mean ticks:   748 = 0.00ms max ticks: 2121365 = 2.12ms 
root count: 24 * 2500 mean ticks:   749 = 0.00ms max ticks: 7192998 = 7.19ms 
root count: 24 * 2750 mean ticks:   737 = 0.00ms max ticks: 2391226 = 2.39ms 
root count: 24 * 3000 mean ticks:   720 = 0.00ms max ticks: 2819803 = 2.82ms 

% for i in {250,500,750,1000,1250,1500,1750,2000,2250,2500,2750,3000} ; do  ../objdir-rel64/shell/avmshell -Dselftest=mmgc,gcroot_alloc,gcroot_allocs$i -gcbehavior | grep -e 'blockiter' | grep -v 0.0$ | tail -1 ; done | cut -d ' ' -f3,10,6,11

constructed=60000 blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=1.4
constructed=120000 blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=2.8
constructed=180000 blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=4.3
constructed=240000 blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=5.6
constructed=300000 blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=7.1
constructed=360000 blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=8.2
constructed=420000 blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=9.8
constructed=480000 blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=11.0
constructed=540000 blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=12.3
constructed=600000 blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=13.7
constructed=660000 blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=15.1
constructed=720000 blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=16.5

This should be completely unsurprising, since we should expect the pre-filled cache to hit 100% of the time with this self test.

----

I'll finish up my work on this ticket by posting my results from running Brent's set of real-world SWFs.
(In reply to Felix S Klock II from comment #7)
> (In reply to Felix S Klock II from comment #5)
> > So it seems plausible that in this stress test, the block iteration is what
> > is growing to dominate the GCRoot construction time.
> 
> This conclusion was bogus, even from the data presented.  Note in particular
> that the GCRoot construction time grows to the order of 2.9ms, but the
> maximum time spent in any block iteration was at most 0.4ms.  So the block
> iteration was never dominating the construction time.

(It wasn't completely bogus.  Rereading what I wrote in comment 5, and looking at the data again, I am reminded that what I was attempting to argue in comment 5 was that the *average* GCRoot construction time is being dominated by block iteration.  NOT the worst-case GCRoot construction time.  With that thought in mind, the data presented in comment 8 actually *supports* that conclusion: Look at how the mean ticks value is no longer growing as the root count increases.)
Created attachment 557661 [details]
tarball TB: stats before the cache addition from patch C
Created attachment 557663 [details]
tarball TC: stats after the cache addition from patch C
Attachment #557661 - Attachment mime type: application/octet-stream → application/x-gzip
Here's the executive summary on attachments TB and TC:

Before the cache addition (TB), the most significant entry (that is, the one with the longest accumulated overhead from the FindBeginning block iteration) for each of the SWFs is as follows:

% for f in stats-before-cache/* ; do echo $f; grep gcroot $f | cut -d ' ' -f6,10,11 | sed -e 's/^\(.*blockiters.max=\)\([0-9.]*\) \(.*blockiter-time-all=\)\([0-9.]*\)$/\4 \2 \1\2 \3\4/' | sort -n | tail -1 | cut -d ' ' -f3,4,5,6 ; done

stats-before-cache/gcbehavior.bigpixelzombies.txt
blockiters.max=5352 blockiter-time-max=0.5 blockiter-time-all=50.2
stats-before-cache/gcbehavior.brainstorm.txt
blockiters.max=577 blockiter-time-max=0.0 blockiter-time-all=0.8
stats-before-cache/gcbehavior.checkinapp.txt
blockiters.max=91 blockiter-time-max=0.0 blockiter-time-all=0.2
stats-before-cache/gcbehavior.cignatabbing.txt
blockiters.max=817 blockiter-time-max=0.1 blockiter-time-all=0.2
stats-before-cache/gcbehavior.coverflow.txt
blockiters.max=74 blockiter-time-max=0.0 blockiter-time-all=0.0
stats-before-cache/gcbehavior.crushthecastle2.txt
blockiters.max=566 blockiter-time-max=0.0 blockiter-time-all=4.9
stats-before-cache/gcbehavior.flashversion.txt
blockiters.max=54 blockiter-time-max=0.0 blockiter-time-all=0.0
stats-before-cache/gcbehavior.flexdatagridscroll.txt
blockiters.max=36 blockiter-time-max=0.0 blockiter-time-all=0.0
stats-before-cache/gcbehavior.gettheglass.txt
blockiters.max=1210 blockiter-time-max=0.1 blockiter-time-all=193.5
stats-before-cache/gcbehavior.mechanism.txt
blockiters.max=209 blockiter-time-max=0.0 blockiter-time-all=0.4
stats-before-cache/gcbehavior.topten-mono.txt
blockiters.max=150 blockiter-time-max=0.0 blockiter-time-all=0.5
stats-before-cache/gcbehavior.topten-moodstream.txt
blockiters.max=543 blockiter-time-max=0.1 blockiter-time-all=1.7
stats-before-cache/gcbehavior.topten-waterlife.txt
blockiters.max=1081 blockiter-time-max=0.1 blockiter-time-all=15.1
stats-before-cache/gcbehavior.watsonexpress.txt
blockiters.max=318 blockiter-time-max=0.0 blockiter-time-all=1.8

(I had already noticed outlying large iteration counts for bigpixelzombies and gettheglass, and deliberately ran those benchmarks for a longer amount of time in an attempt to exacerbate the issue.  In particular, I played through level 1 of bigpixelzombies, and the *entire* game of gettheglass, all the way to the end.  At the time I was gathering the data, my goal was to evaluate how bad the problem could possibly get on real-world content, not to do a blind evaluation of average case behavior.)

So, here's the analogous analysis for TC, after the cache addition:

% for f in stats-after-cache/* ; do echo $f; grep gcroot $f | cut -d ' ' -f6,10,11 | sed -e 's/^\(.*blockiters.max=\)\([0-9.]*\) \(.*blockiter-time-all=\)\([0-9.]*\)$/\4 \2 \1\2 \3\4/' | sort -n | tail -1 | cut -d ' ' -f3,4,5,6 ; done

stats-after-cache/gcbehavior.bigpixelzombies.txt
blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=0.1
stats-after-cache/gcbehavior.brainstorm.txt
blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=0.0
stats-after-cache/gcbehavior.checkinapp.txt
blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=0.0
stats-after-cache/gcbehavior.cignatabbing.txt
blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=0.0
stats-after-cache/gcbehavior.coverflow.txt
blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=0.0
stats-after-cache/gcbehavior.crushthecastle2.txt
blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=0.0
stats-after-cache/gcbehavior.flashversion.txt
blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=0.0
stats-after-cache/gcbehavior.flexdatagridscroll.txt
blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=0.0
stats-after-cache/gcbehavior.gettheglass.txt
blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=0.2
stats-after-cache/gcbehavior.mechanism.txt
blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=0.0
stats-after-cache/gcbehavior.topten-mono.txt
blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=0.0
stats-after-cache/gcbehavior.topten-moodstream.txt
blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=0.0
stats-after-cache/gcbehavior.topten-waterlife.txt
blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=0.1
stats-after-cache/gcbehavior.watsonexpress.txt
blockiters.max=0 blockiter-time-max=0.0 blockiter-time-all=0.0

Look at all those lovely zeroes!

(Presumably the non-zero blockiter-time-all cases are due to accumulated off-by-one noise in the time-sampling process.)

So, I think this motivates patch C.  I'll tag a review request shortly.
Status: NEW → ASSIGNED
Comment on attachment 557646 [details] [diff] [review]
patch C v1: pre-filled one-element cache for FindBeginning

Lars: it's probably simpler for you to just review this patch than to slog through the data and analyses that I posted earlier.  (It certainly would have been simpler for me to go straight to the patch, but alas, I did not think of it until I had already done a run of SWFs.)
Attachment #557646 - Flags: review?(lhansen)

Comment 14

7 years ago
Comment on attachment 557646 [details] [diff] [review]
patch C v1: pre-filled one-element cache for FindBeginning

At first blush this seems fine, and I had mostly one comment:

The prevalent style (though not the uniform style, yay :-) in FixedMalloc is to prefix member variables with m_; it may be worthwhile to do that for these new variables too.  Your call really.

Then I started to wonder if it is true that a test in "free" would not pay for itself:

        // Bugzilla 681388: Clear cache (unconditionally) to ensure
        // consistency as memory is recycled.  Freeing of any object
        // is rare during intended use case (GCRoot construction); so
        // an item == cached addr test here would not pay for itself.

The reason I started to wonder is that FixedMalloc is used from multiple threads, and so an unconditional write on free would cause the cache line that contains those variables to be invalidated constantly for all threads, thus there might be a performance issue.

But then - threads?  GCRoots are entered into the roots list with a lock held, and the roots list is traversed only with the lock held too.  All that suggests strongly that GCRoots can be constructed from other threads than the main thread.  But that probably means that the caching code you have here suffers from race conditions.

Holding off on the review for now.
(In reply to Lars T Hansen from comment #14)
> But then - threads?  GCRoots are entered into the roots list with a lock
> held, and the roots list is traversed only with the lock held too.  All that
> suggests strongly that GCRoots can be constructed from other threads than
> the main thread.  But that probably means that the caching code you have
> here suffers from race conditions.
> 
> Holding off on the review for now.

I'll retract the review request and review the code for race conditions myself, see if I can come up with something "obviously correct".

(The potential for races did cross my mind when I noticed Bug 683630; I should have been more pro-active about it on my end.  Too eager to close off this issue.)
Attachment #557646 - Flags: review?(lhansen)
(In reply to Felix S Klock II from comment #15)
> I'll retract the review request and review the code for race conditions
> myself, see if I can come up with something "obviously correct".
> 
> (The potential for races did cross my mind when I noticed Bug 683630; I
> should have been more pro-active about it on my end.  Too eager to close off
> this issue.)

The patch I posted wasn't thread-safe and fixing it is not trivial.  I had vague hopes for making the cache lock-free but maybe that is not realistic.
(On the other hand, for the particular use-case under consideration, namely GCRoots, I do not think it is feasible for one thread to free a root that has not been completely constructed yet.  When you include that in the analysis, then you have a prayer of coming up with something easy that is thread-safe.  So maybe a better approach would be to have a separate API just for the GCRoot code to use, and then that API gets the cache, but the general FindBeginning API does not...)
Created attachment 558646 [details] [diff] [review]
patch C v2: pre-filled one-elem FindBeginning cache (with locking)

Addressing threading issues raised in comment 14.

The main bit of (potentially premature and/or misinformed) cleverness is in FixedMalloc::Free, where I try to avoid grabbing the lock by an initial unprotected read of m_findCacheObjBegin.  I'm pretty sure this is sound, but it does require that e.g. pointer-sized (word-based) reads will be atomic even in the presence of concurrent writes to the same location.
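The double-checked clear in Free can be sketched like this.  Only the name m_findCacheObjBegin comes from the patch itself; everything else here is illustrative, with std::mutex standing in for whatever lock the real code uses.

```cpp
#include <cstddef>
#include <mutex>

// Hypothetical sketch of the v2 locking scheme.
class FindBeginningCacheV2 {
    std::mutex m_lock;                      // stand-in for the real lock
    void*  m_findCacheObjBegin = nullptr;
    size_t m_findCacheObjSize  = 0;

public:
    void Stash(void* begin, size_t size) {
        std::lock_guard<std::mutex> g(m_lock);
        m_findCacheObjBegin = begin;
        m_findCacheObjSize  = size;
    }

    // Called from Free.  The first read is deliberately unprotected: in
    // the common case the freed item is not the cached object, so the
    // lock is skipped entirely.  This relies on pointer-sized reads
    // being atomic even under concurrent writes, as noted above.
    void ClearIfCached(void* item) {
        if (m_findCacheObjBegin != item)    // unlocked fast check
            return;
        std::lock_guard<std::mutex> g(m_lock);
        if (m_findCacheObjBegin == item) {  // re-check under the lock
            m_findCacheObjBegin = nullptr;
            m_findCacheObjSize  = 0;
        }
    }

    void* CachedBegin() {
        std::lock_guard<std::mutex> g(m_lock);
        return m_findCacheObjBegin;
    }
};
```

The re-check under the lock handles the window in which another thread re-stashes the same address between the unlocked read and lock acquisition.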
Attachment #557646 - Attachment is obsolete: true
Comment on attachment 558646 [details] [diff] [review]
patch C v2: pre-filled one-elem FindBeginning cache (with locking)

I'd like Tommy to double-check my thinking with respect to locking, if he has time.
Attachment #558646 - Flags: superreview?(lhansen)
Attachment #558646 - Flags: review?(treilly)

Comment 21

7 years ago
Cursory thoughts:

1) GCRoot construction always occurs on the main thread, and we could probably assert this; the rootlist lock exists solely for an oddball case in the player where a GCRoot can be deleted from a different thread (this theory needs to be vetted, but I'd place a medium-sized wager on it).

2) the caching mechanism could move to the GC if the constructor was new (GC*) CLASS(GC*), or a factory method could be employed to match the SafeGC factory method that's under consideration, i.e. New<T>(GC*, ...) where we call T(...) in the template expansion.  Then we could use placement new, and the FindBeginning cache pointer would just be on the stack.

3) adding any new code to the FixedMalloc::Alloc/Free critical paths seems unnecessary when we can override GCRoot new/delete.
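The factory idea in point 2 might look roughly like this.  All names below are invented for illustration, and std::malloc stands in for the real allocator; the point is only that, because the factory performs the allocation itself, the block's beginning is known on the spot and no FindBeginning call is needed.

```cpp
#include <cstdlib>
#include <new>
#include <utility>

struct GC {};  // stand-in for MMgc::GC

// Hypothetical factory: New<T>(gc, args...) allocates and then
// constructs T in place via placement new.
template <typename T, typename... Args>
T* New(GC* gc, Args&&... args) {
    (void)gc;  // a real factory would allocate from the GC's allocator
    // The beginning (and size) of the block are known right here, on
    // the stack, so the constructor never has to search for them.
    void* mem = std::malloc(sizeof(T));
    return ::new (mem) T(std::forward<Args>(args)...);
}

// Illustrative root type constructed through the factory.
struct MyRoot {
    int value;
    explicit MyRoot(int v) : value(v) {}
};
```

The trade-off Felix raises below is that every allocation site of every GCRoot subclass would have to be rewritten to go through the factory.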
(In reply to Tommy Reilly from comment #21)
> Cursory thoughts:
> 
> 1) GCRoot construction always occurs on the main thread and we could
> probably assert this, the rootlist lock exists solely for a oddball case in
> the player where a GCRoot can be deleted from a different thread (this
> theory needs to be vetted but I'd place a medium sized wager on it).  

That's worth knowing.  But even so, I still need to clear the cache when a root is freed, if only to handle the pathological case of two roots where the second root is allocated and freed in between the points where the first root is operator new'ed and where the first root's GCRoot constructor is invoked.

(Do I expect that to occur?  No; but then again I didn't expect to see GCRoot multiply-inherited either.)

> 2) the caching mechanism could move to the GC if the constructor was new
> (GC*) CLASS(GC*), or a factory method could be employed to match the SafeGC
> factory method that's under consideration, ie New<T>(GC*, ...) where we call
> T(...) in the template expansion.   Then we could use the placement new and
> the find beginning cache pointer would just be on the stack.

If I were willing to rewrite all the spots that allocated any subclass of GCRoot, I probably would have more seriously investigated the other (seriously invasive) option of revising our class hierarchies to not multiply-inherit from GCRoot, side-stepping the bug that prompted all this.

(Admittedly, the comparison is not fair because your suggestion is a local transformation while class hierarchy restructuring is not.  But I think the point remains that I don't want to muck around much with the player code to address this.)

> 3) adding any new code to the FixedMalloc::Alloc/Free critical paths seems
> unnecessary when we can override GCRoot new/delete.

Okay, the critical path argument convinces me that I should follow through on my thoughts from comment 17: I'll shift more of the operations into GCRoot operators.

I still need to stash the relevant state somewhere; my current plan is to keep hanging it off of the FixedMalloc instance, but only GCRoot will maintain it.
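The division of labor described here can be sketched with stand-in types.  Stash and the idea of allocating via FixedMalloc appear in the comments on this bug, but everything below is an invented illustration, with std::malloc standing in for the real allocator: only the GCRoot operators touch the cache, so FixedMalloc::Alloc/Free stay off the hook.

```cpp
#include <cstddef>
#include <cstdlib>

// Stand-in for the one-element cache hanging off the FixedMalloc
// instance; in this sketch only the GCRoot operators ever touch it.
struct FixedMallocStub {
    void*  cachedBegin = nullptr;
    size_t cachedSize  = 0;
    void Stash(void* p, size_t n) { cachedBegin = p; cachedSize = n; }
    void ClearIfCached(void* p) {
        if (p == cachedBegin) { cachedBegin = nullptr; cachedSize = 0; }
    }
};

FixedMallocStub theFixedMalloc;   // stand-in for the singleton

struct GCRootStub {
    virtual ~GCRootStub() {}

    // Pre-fill the cache here, not in the allocator's Alloc, so the
    // critical path is untouched for non-root allocations.
    static void* operator new(size_t size) {
        void* item = std::malloc(size);
        theFixedMalloc.Stash(item, size);
        return item;
    }

    // Clear on delete so the cache never points at recycled memory.
    static void operator delete(void* item) {
        theFixedMalloc.ClearIfCached(item);
        std::free(item);
    }
};
```

This matches Tommy's point 3: ordinary FixedMalloc allocations pay nothing, and only root construction and destruction carry the extra stores.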
Comment on attachment 558646 [details] [diff] [review]
patch C v2: pre-filled one-elem FindBeginning cache (with locking)

(retracting review request while I address comment 21)
Attachment #558646 - Flags: superreview?(lhansen)
Attachment #558646 - Flags: review?(treilly)

Comment 24

7 years ago
(In reply to Felix S Klock II from comment #22)

> (Admittedly, the comparison is not fair because your suggestion is a local
> transformation while class hierarchy restructuring is not.  But I think the
> point remains that I don't want to muck around much with the player code to
> address this.)

You won't get much sympathy from me on this point; it's not as bad as you think.  We already have a script to find all the GCRoots, and a little regexp could do the transformation automatically.

That said, a solution that shifts the extra code from the hot path to the GCRoot operators will probably get an R+ from me.
Assignee: nobody → fklockii

Flags: flashplayer-qrb+
Target Milestone: --- → Q1 12 - Brannan
Created attachment 560909 [details] [diff] [review]
patch C v3: pre-filled one-elem FindBeginning cache (with locking)

(Following a request from Tommy, this version is GCRoot-specific, and has benefited from associated cleanup, IMO.)
Attachment #558646 - Attachment is obsolete: true
(In reply to Felix S Klock II from comment #25)
> Created attachment 560909 [details] [diff] [review]
> patch C v3: pre-filled one-elem FindBeginning cache (with locking)
> 
> (following request from tom, this version is gcroot specific.  and has
> benefited from associated cleanup, IMO.)

(I'm posting it here for safe-keeping until I get a chance to double-check that it performs properly on my iMac.)
Blocks: 663159
Created attachment 562804 [details]
tarball TB: stats before cache addition from patch C v3
Attachment #557661 - Attachment is obsolete: true
Created attachment 562805 [details]
tarball TC: stats after cache addition from patch C v3
Attachment #557663 - Attachment is obsolete: true
(In reply to Felix S Klock II from comment #26)
> (In reply to Felix S Klock II from comment #25)
> > Created attachment 560909 [details] [diff] [review]
> > patch C v3: pre-filled one-elem FindBeginning cache (with locking)
> > 
> > (following request from tom, this version is gcroot specific.  and has
> > benefited from associated cleanup, IMO.)
> 
> (i'm posting it here for safe-keeping until I get a chance to double-check
> that it performs properly on my iMac.)

The results (see attachment 562804 [details] and attachment 562805 [details]) basically match those presented in comment 12.

Let's finish this off.
Attachment #560909 - Flags: review?(lhansen)

Comment 30

7 years ago
Comment on attachment 560909 [details] [diff] [review]
patch C v3: pre-filled one-elem FindBeginning cache (with locking)

I am not 100% convinced that it is correct to read m_objBegin unlocked, but I've not been able to construct a scenario where this is a problem without there being a more serious synchronization bug elsewhere in the program.

Effectively, the cache would have to be filled with an object R from some other thread T and the store of R into the cache would have to not propagate to the main thread by the time the main thread chooses to delete the root R.  We can assume that T has not left the critical section and hence has not executed a memory barrier; T could be descheduled while in the critical section, say.  Then the main thread runs along happily and decides to delete R.  Now it checks the cache; the store has not propagated from T; hence the fast unlocked test in Clear will fail; hence the cache is not cleared and once the main thread is done deleting and T leaves the critical section the cache contains a deleted item R.

Now it seems that that scenario critically depends on a race between the use of R in one thread and its unsynchronized deletion in another, so I don't worry a lot about it, but I wonder whether the fast check in Clear is really worth the risk.

Your call.
Attachment #560909 - Flags: review?(lhansen) → review+
(In reply to Lars T Hansen from comment #30)
> Now it seems that that scenario critically depends on a race between the use
> of R in one thread and its unsynchronized deletion in another, so I don't
> worry a lot about it, but I wonder whether the fast check in Clear is really
> worth the risk.

I want to understand the scenario you are outlining, so I would appreciate it if you could elaborate a bit further.

I'm having difficulty understanding how the deleting thread [*] would get its hands on a reference to R, since R itself does not escape during the dynamic extent from where it is first OutOfLineAlloc'ed in GCRoot::operator new to the point where it is written by Stash in GCRoot::operator new.

(Or is that an example of a bug that would have to pre-exist in order for the code to exhibit a race condition?  Such a scenario strikes me as a very serious bug.)

[*] I shifted the terminology here; what I call "the deleting thread" you call "the main thread", but I think that conflicts with the claims that Tommy made in comment 21.  So let's just talk about the allocating thread "T" and the deleting thread "D".

----

As for the patch itself: My current inclination is to land the patch as-is, but post another review request for Tommy.  Then if he agrees with you (that the fast check is not worth the risk), then I'll happily remove it?

Comment 32

7 years ago
Operator new had nothing to do with my scenario.

The bug would have to pre-exist; it's not introduced by this patch.  All I'm noting is that this patch makes the code more vulnerable to the existing bug, because the bug becomes potentially more observable.

I thought Tommy's comment was that the main thread was the only thread that could allocate and free a root?

It sounds like you're making an assumption that the only code to call FindBeginning on a root is the main thread.  I am not making that assumption; I'm assuming any thread can do that, hence that's how thread T could enter R into the cache.

Comment 33

7 years ago
Ah, I see my mistake.  When I wrote the comments after reviewing, I was making the incorrect assumption that a successful call to FindBeginning would replace the cache contents with the value that was found.  That is obviously not the case; I did not look carefully enough.  And that invalidates my scenario above.

Consider the issue retracted.
(In reply to Lars T Hansen from comment #33)
> Ah, I see my mistake.  When I wrote the comments after reviewing I was
> making the incorrect assumption that a successfull call to FindBeginning
> would replace the cache contents with the value that was found.

Heh.  That would have been a typical way to write a cache, wouldn't it?  This really is more of a secret communication channel (from GCRoot::operator new to FixedMalloc) than a cache.  But I think I'll keep the names the way they are.

Comment 35

7 years ago
changeset: 6606:ff336ed526d2
user:      Felix S Klock II <fklockii@adobe.com>
summary:   Bug 681388: Add GCRoot-only FindBeginning 'cache' (r=lhansen).

http://hg.mozilla.org/tamarin-redux/rev/ff336ed526d2
Status: ASSIGNED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED