Bug 711623 - rt->gcNumFreeArenas is not updated properly
Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: JavaScript Engine
Version: Trunk
Platform: All All
Importance: -- normal
Target Milestone: mozilla11
Assigned To: Igor Bukanov
Blocks: 700357 707162
Reported: 2011-12-16 14:41 PST by Igor Bukanov
Modified: 2011-12-20 05:53 PST
CC: 11 users


Attachments
v1: Fix desync of gcNumFreeArenas (1.46 KB, patch), 2011-12-16 15:39 PST, Terrence Cole [:terrence], no flags
full fix (11.46 KB, patch), 2011-12-16 16:10 PST, Igor Bukanov, terrence.d.cole: review+
v2 (11.52 KB, patch), 2011-12-18 07:52 PST, Igor Bukanov, terrence.d.cole: review+

Description Igor Bukanov 2011-12-16 14:41:03 PST
Bug 700357 added the rt->gcNumFreeArenas counter that is used for GC scheduling. However, modifications to the counter can currently race. In particular, the counter is modified in Chunk::init.

That function is called outside the GC lock by the background thread, so the increment rt->gcNumFreeArenas += ArenasPerChunk in the init method can race with a counter decrement on the allocation thread. That could lead to the counter wrapping around 0 and becoming a huge positive number near UINT_MAX. That in turn would permanently trigger a periodic GC every 20 seconds.

That could be the reason for bug 707162.
Comment 1 Igor Bukanov 2011-12-16 15:10:46 PST
AFAICS there is another bug with rt->gcNumFreeArenas that is more likely to lead to bug 707162. When a chunk becomes empty, it is moved to the GC free list and later destroyed, but rt->gcNumFreeArenas is not decreased by ArenasPerChunk. So as soon as we release the first empty chunk, rt->gcNumFreeArenas becomes permanently positive.
Comment 2 Terrence Cole [:terrence] 2011-12-16 15:39:26 PST
Created attachment 582415 [details] [diff] [review]
v1: Fix desync of gcNumFreeArenas

Good catch.  I think it is fine to have the race as long as we reliably reset the value to something valid occasionally.  I am not sure why I did not implement it like this in the first place: as is, it is clearly buggy.
Comment 3 Justin Lebar (not reading bugmail) 2011-12-16 16:00:28 PST
Do we have telemetry set up so that, when this change is checked in, we'll be able to see whether it solves bug 707162?  If not, we should set up that telemetry ASAP.
Comment 4 Igor Bukanov 2011-12-16 16:10:43 PST
Created attachment 582428 [details] [diff] [review]
full fix

For bug 702251 and related bugs it is essential to avoid the race, since with those bugs the reset point may never be reached.

The fix here solves that and yet another (mostly theoretical) problem with counter desynchronization: the counter is not reset when PickChunk fails to add the chunk to the set and calls Chunk::release(rt, chunk).

With the fix the counter reflects the number of free committed arenas in non-empty chunks. I also renamed it to match ChunkInfo::numArenasFreeCommitted.
Comment 5 Igor Bukanov 2011-12-16 16:30:40 PST
(In reply to Justin Lebar [:jlebar] from comment #3)
> Do we have telemetry set up so that, when this change is checked in, we'll
> be able to see whether it solves bug 707162?  If not, we should set up that
> telemetry ASAP.

I suppose the GC_REASON histogram should show a drop in frequency for MAYBEGC (1).
Comment 6 Gregor Wagner [:gwagner] 2011-12-16 16:40:26 PST
(In reply to Igor Bukanov from comment #5)
> (In reply to Justin Lebar [:jlebar] from comment #3)
> > Do we have telemetry set up so that, when this change is checked in, we'll
> > be able to see whether it solves bug 707162?  If not, we should set up that
> > telemetry ASAP.
> 
> I suppose GC_REASON histogram should show a drop in frequency for MAYBEGC
> (1).

Where can I see this histogram?
Comment 8 Igor Bukanov 2011-12-16 16:57:47 PST
(In reply to Justin Lebar [:jlebar] from comment #7)
> https://metrics.mozilla.com/pentaho/content/pentaho-cdf-dd/
> Render?solution=metrics2&path=telemetry&file=telemetryHistogram.wcdf

That does not show GC_REASON for some reason...
Comment 9 Bill McCloskey (:billm) 2011-12-16 17:00:59 PST
(In reply to Igor Bukanov from comment #8)
> That does not show GC_REASON for some reasons...

Ehsan recently pointed out that telemetry histograms are not allowed to be defined with a range that starts at 0. (Why this is I have no idea; it seems insane.) Because of this, the GC_REASON histogram was getting ignored.
Comment 10 Igor Bukanov 2011-12-16 17:04:55 PST
(In reply to Bill McCloskey (:billm) from comment #9)
> Ehsan recently pointed out that telemetry histograms are not allowed to be
> defined with a range that starts at 0. (Why this is I have no idea; it seems
> insane.) Because of this, the GC_REASON histogram was getting ignored.

Is it ignored during submission so we have absolutely no data?
Comment 11 Bill McCloskey (:billm) 2011-12-16 17:18:12 PST
The bug is when the histogram data structure is created on the client side.
  http://hg.mozilla.org/mozilla-central/file/cbb0233c7ba8/toolkit/components/telemetry/Telemetry.cpp#l135
Unfortunately, Telemetry::Accumulate drops errors on the floor.

Ehsan fixed it so that the range starts at 1. However, we're still adding 0 values to the histogram, so I'm not actually sure what will happen.
Comment 12 Igor Bukanov 2011-12-16 17:45:34 PST
(In reply to Bill McCloskey (:billm) from comment #11)
> The bug is when the histogram data structure is created on the client side.
>  
> http://hg.mozilla.org/mozilla-central/file/cbb0233c7ba8/toolkit/components/
> telemetry/Telemetry.cpp#l135
> Unfortunately, Telemetry::Accumulate drops errors on the floor.
> 
> Ehsan fixed it so that the range starts at 1. However, we're still adding
> 0 values to the histogram, so I'm not actually sure what will happen.

Do we have a bug to either add 1 to histogram values or remap 0 into some other unused value to preserve submitted data if any?
Comment 13 Terrence Cole [:terrence] 2011-12-17 13:45:55 PST
Comment on attachment 582428 [details] [diff] [review]
full fix

Review of attachment 582428 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!  That is a huge improvement.
Comment 14 Igor Bukanov 2011-12-17 15:52:35 PST
Comment on attachment 582428 [details] [diff] [review]
full fix

My patch has a bad bug: when the chunk is returned to the empty pool, the global counter should be decreased not by ArenasPerChunk, but rather by the number of committed arenas in the chunk that is put into the pool.
Comment 15 Igor Bukanov 2011-12-18 07:52:39 PST
Created attachment 582667 [details] [diff] [review]
v2

The new version hopefully fixes all the issues with the counter update. As before the patch, the runtime counter accounts for decommitted arenas both in non-empty chunks and in empty chunks. If we have a lot of empty chunks but only a few arenas in them are committed, that should not count as memory pressure.
Comment 16 Terrence Cole [:terrence] 2011-12-19 11:26:21 PST
Comment on attachment 582667 [details] [diff] [review]
v2

Review of attachment 582667 [details] [diff] [review]:
-----------------------------------------------------------------

I like the name changes and I especially like that this removes the rt from Chunk::init.  However, this does introduce one potential problem: we include FreeCommitted from the ChunkPool in the count, but don't decommit them in DecommitFreePages, leaving us with a persistent positive number.  With the current constants it is impossible to get into a loop where we GC_SHRINK at every MaybeGC, but this is a subtle interaction for anyone tweaking those constants to keep in mind.  I'll give this an r+ given the addition of sufficient static asserts to ensure that the maximum ChunkPool size and the FreeCommittedArenaThreshhold have a sane relation.
Comment 17 Igor Bukanov 2011-12-19 11:37:29 PST
(In reply to Terrence Cole [:terrence] from comment #16)
> However, this does introduce one potential problem: we include
> FreeCommitted from the ChunkPool in the count, but don't decommit them in
> DecommitFreePages, leaving us with a persistent positive number.

I am not sure what you mean here. Surely we skip empty chunks when doing the decommit, but when we do decommit, we later remove all the empty chunks on the background thread. As the new GC always waits until the background sweep f

  With the
> current constants it is impossible to get into a loop where we GC_SHRINK at
> every MaybeGC, but this is a subtle interaction for anyone tweaking those
> constants to keep in mind.  I'll give this an r+ given the addition of
> sufficient static asserts to ensure that the maximum ChunkPool size and the
> FreeCommittedArenaThreshhold have a sane relation.
Comment 18 Igor Bukanov 2011-12-19 11:42:01 PST
Sorry for the incomplete comment; I meant:

(In reply to Terrence Cole [:terrence] from comment #16)
> However, this does introduce one potential problem: we include
> FreeCommitted from the ChunkPool in the count, but don't decommit them in
> DecommitFreePages, leaving us with a persistent positive number.

I am not sure what you mean here. Surely we skip empty chunks when doing the decommit, but when we do decommit, we later remove all the empty chunks on the background thread. As the new GC always waits until the background sweep finishes, we cannot have several GCs with decommits without releasing all empty chunks.

Also note that in bug 702251 I will implement decommit together with the chunk release on the background thread, so there would be no arbitrary pause between decommit and release.
Comment 19 Terrence Cole [:terrence] 2011-12-19 13:19:24 PST
(In reply to Igor Bukanov from comment #18)
> Sorry for incomplete comment, I meant:
> 
> (In reply to Terrence Cole [:terrence] from comment #16)
> > However, this does introduce one potential problem: we include
> > FreeCommitted from the ChunkPool in the count, but don't decommit them in
> > DecommitFreePages, leaving us with a persistent positive number.
> 
> I am not sure what do you mean here. Surely we skip empty chunks when doing
> the decommit, but when we do decommit, we remove later on the background
> thread all the empty chunks. As the new GC always waits until the background
> sweep finishes, we cannot have several GC with decommits without releasing
> all empty chunks.

Ah, okay, I did not know that we freed the ChunkPool on GC_SHRINK, but it makes perfect sense.  In that case, this is fine.  It would be nice if we could also add an assert at the end of GC_SHRINK that the counter is always 0, but that would require us to decide that the Decommit is infallible, which I'm not sure we want to guarantee yet.

> Also note that in bug 702251 I will implement decommit together with
> the chunk release on the background thread, so there would be no arbitrary
> pause between decommit and release.

I'm looking forward to seeing that -- particularly how you will avoid taking a lock at arena allocation if we may be decommitting at the same time.
Comment 20 Igor Bukanov 2011-12-19 13:40:45 PST
(In reply to Terrence Cole [:terrence] from comment #19)
> I'm looking forward to seeing that -- particularly how you will avoid taking
> a lock at arena allocation if we may be decommitting at the same time.

That was not the most difficult part. The bigger complexity comes from avoiding races with the allocation thread and from decommitting from the tail of the available chunk lists, not from the start. I need to fix a few issues with the patch before asking for a review...
Comment 22 Ed Morley [:emorley] 2011-12-20 05:53:15 PST
https://hg.mozilla.org/mozilla-central/rev/42b04bce9deb
