Bug 711623 - rt->gcNumFreeArenas is not updated properly
Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: JavaScript Engine
Version: Trunk
Platform: All All
Importance: -- normal
Target Milestone: mozilla11
Assigned To: Igor Bukanov
Blocks: 700357 707162
Reported: 2011-12-16 14:41 PST by Igor Bukanov
Modified: 2011-12-20 05:53 PST
CC: 11 users


Attachments
v1: Fix desync of gcNumFreeArenas (1.46 KB, patch), 2011-12-16 15:39 PST, Terrence Cole [:terrence], no flags
full fix (11.46 KB, patch), 2011-12-16 16:10 PST, Igor Bukanov, terrence.d.cole: review+
v2 (11.52 KB, patch), 2011-12-18 07:52 PST, Igor Bukanov, terrence.d.cole: review+

Description Igor Bukanov 2011-12-16 14:41:03 PST
Bug 700357 added the rt->gcNumFreeArenas counter that is used for GC scheduling. However, modifications to the counter can currently race. In particular, the counter is modified in Chunk::init.

That function is called outside the GC lock by the background thread, so the increment rt->gcNumFreeArenas += ArenasPerChunk in the init method can race with a counter decrement on the allocation thread. That could lead to the counter wrapping around 0 and becoming a huge positive number near UINT_MAX. That in turn would permanently trigger a periodic GC every 20 seconds.

That could be the reason for bug 707162.
Comment 1 Igor Bukanov 2011-12-16 15:10:46 PST
AFAICS there is another bug with rt->gcNumFreeArenas that is more likely to lead to bug 707162. When a chunk becomes empty, it is moved to the GC free list and later destroyed, but rt->gcNumFreeArenas is not decreased by ArenasPerChunk. So as soon as we release the first empty chunk, rt->gcNumFreeArenas becomes permanently positive.
Comment 2 Terrence Cole [:terrence] 2011-12-16 15:39:26 PST
Created attachment 582415 [details] [diff] [review]
v1: Fix desync of gcNumFreeArenas

Good catch.  I think it is fine to have the race as long as we reliably reset the value to something valid occasionally.  I am not sure why I did not implement it like this in the first place: as is, it is clearly buggy.
Comment 3 Justin Lebar (not reading bugmail) 2011-12-16 16:00:28 PST
Do we have telemetry set up so that, when this change is checked in, we'll be able to see whether it solves bug 707162?  If not, we should set up that telemetry ASAP.
Comment 4 Igor Bukanov 2011-12-16 16:10:43 PST
Created attachment 582428 [details] [diff] [review]
full fix

For bug 702251 and related bugs it is essential to avoid the race, since with those bugs the reset point may never be reached.

The fix here solves that and yet another (mostly theoretical) problem with counter desynchronization: the counter is not reset when PickChunk fails to add the chunk to the set and calls Chunk::release(rt, chunk).

With the fix the counter reflects the number of free committed arenas in non-empty chunks. I also renamed it to match ChunkInfo::numArenasFreeCommitted.
Comment 5 Igor Bukanov 2011-12-16 16:30:40 PST
(In reply to Justin Lebar [:jlebar] from comment #3)
> Do we have telemetry set up so that, when this change is checked in, we'll
> be able to see whether it solves bug 707162?  If not, we should set up that
> telemetry ASAP.

I suppose the GC_REASON histogram should show a drop in frequency for MAYBEGC (1).
Comment 6 Gregor Wagner [:gwagner] 2011-12-16 16:40:26 PST
(In reply to Igor Bukanov from comment #5)
> (In reply to Justin Lebar [:jlebar] from comment #3)
> > Do we have telemetry set up so that, when this change is checked in, we'll
> > be able to see whether it solves bug 707162?  If not, we should set up that
> > telemetry ASAP.
> 
> I suppose GC_REASON histogram should show a drop in frequency for MAYBEGC
> (1).

Where can I see this histogram?
Comment 8 Igor Bukanov 2011-12-16 16:57:47 PST
(In reply to Justin Lebar [:jlebar] from comment #7)
> https://metrics.mozilla.com/pentaho/content/pentaho-cdf-dd/
> Render?solution=metrics2&path=telemetry&file=telemetryHistogram.wcdf

That does not show GC_REASON for some reason...
Comment 9 Bill McCloskey (:billm) 2011-12-16 17:00:59 PST
(In reply to Igor Bukanov from comment #8)
> That does not show GC_REASON for some reasons...

Ehsan recently pointed out that telemetry histograms are not allowed to be defined with a range that starts at 0. (Why this is I have no idea; it seems insane.) Because of this, the GC_REASON histogram was getting ignored.
Comment 10 Igor Bukanov 2011-12-16 17:04:55 PST
(In reply to Bill McCloskey (:billm) from comment #9)
> Ehsan recently pointed out that telemetry histograms are not allowed to be
> defined with a range that starts at 0. (Why this is I have no idea; it seems
> insane.) Because of this, the GC_REASON histogram was getting ignored.

Is it ignored during submission so we have absolutely no data?
Comment 11 Bill McCloskey (:billm) 2011-12-16 17:18:12 PST
The bug is when the histogram data structure is created on the client side.
  http://hg.mozilla.org/mozilla-central/file/cbb0233c7ba8/toolkit/components/telemetry/Telemetry.cpp#l135
Unfortunately, Telemetry::Accumulate drops errors on the floor.

Ehsan fixed it so that the range starts at 1. However, we're still adding 0 values to the histogram, so I'm not actually sure what will happen.
Comment 12 Igor Bukanov 2011-12-16 17:45:34 PST
(In reply to Bill McCloskey (:billm) from comment #11)
> The bug is when the histogram data structure is created on the client side.
>  
> http://hg.mozilla.org/mozilla-central/file/cbb0233c7ba8/toolkit/components/
> telemetry/Telemetry.cpp#l135
> Unfortunately, Telemetry::Accumulate drops errors on the floor.
> 
> Ehsan fixed it so that the range starts at 1. However, we're still adding
> 0 values to the histogram, so I'm not actually sure what will happen.

Do we have a bug to either add 1 to histogram values or remap 0 into some other unused value to preserve submitted data if any?
Comment 13 Terrence Cole [:terrence] 2011-12-17 13:45:55 PST
Comment on attachment 582428 [details] [diff] [review]
full fix

Review of attachment 582428 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!  That is a huge improvement.
Comment 14 Igor Bukanov 2011-12-17 15:52:35 PST
Comment on attachment 582428 [details] [diff] [review]
full fix

My patch has a bad bug: when the chunk is returned to the empty pool, the global counter should be decreased not by ArenasPerChunk, but rather by the number of committed arenas in the chunk that is put into the pool.
Comment 15 Igor Bukanov 2011-12-18 07:52:39 PST
Created attachment 582667 [details] [diff] [review]
v2

The new version hopefully fixes all the issues with the counter update. As before the patch, the runtime counter accounts for decommitted arenas both in non-empty chunks and in empty chunks. If we have a lot of empty chunks but only a few arenas in them are committed, that should not count as memory pressure.
Comment 16 Terrence Cole [:terrence] 2011-12-19 11:26:21 PST
Comment on attachment 582667 [details] [diff] [review]
v2

Review of attachment 582667 [details] [diff] [review]:
-----------------------------------------------------------------

I like the name changes and I especially like that this removes the rt from Chunk::init.  However, this does introduce one potential problem: we include FreeCommitted from the ChunkPool in the count, but don't decommit them in DecommitFreePages, leaving us with a persistent positive number.  With the current constants it is impossible to get into a loop where we GC_SHRINK at every MaybeGC, but this is a subtle interaction for anyone tweaking those constants to keep in mind.  I'll give this an r+ given the addition of sufficient static asserts to ensure that the maximum ChunkPool size and the FreeCommittedArenaThreshhold have a sane relation.
Comment 17 Igor Bukanov 2011-12-19 11:37:29 PST
(In reply to Terrence Cole [:terrence] from comment #16)
> However, this does introduce one potential problem: we include
> FreeCommitted from the ChunkPool in the count, but don't decommit them in
> DecommitFreePages, leaving us with a persistent positive number.

I am not sure what you mean here. Surely we skip empty chunks when doing the decommit, but when we do decommit, we later remove all the empty chunks on the background thread. As the new GC always waits until the background sweep f

  With the
> current constants it is impossible to get into a loop where we GC_SHRINK at
> every MaybeGC, but this is a subtle interaction for anyone tweaking those
> constants to keep in mind.  I'll give this an r+ given the addition of
> sufficient static asserts to ensure that the maximum ChunkPool size and the
> FreeCommittedArenaThreshhold have a sane relation.
Comment 18 Igor Bukanov 2011-12-19 11:42:01 PST
Sorry for the incomplete comment; I meant:

(In reply to Terrence Cole [:terrence] from comment #16)
> However, this does introduce one potential problem: we include
> FreeCommitted from the ChunkPool in the count, but don't decommit them in
> DecommitFreePages, leaving us with a persistent positive number.

I am not sure what you mean here. Surely we skip empty chunks when doing the decommit, but when we do decommit, we later remove all the empty chunks on the background thread. As the new GC always waits until the background sweep finishes, we cannot have several GCs with decommits without releasing all empty chunks.

Also note that in bug 702251 I will implement decommit together with the chunk release on the background thread, so there would be no arbitrary pause between decommit and release.
Comment 19 Terrence Cole [:terrence] 2011-12-19 13:19:24 PST
(In reply to Igor Bukanov from comment #18)
> Sorry for incomplete comment, I meant:
> 
> (In reply to Terrence Cole [:terrence] from comment #16)
> > However, this does introduce one potential problem: we include
> > FreeCommitted from the ChunkPool in the count, but don't decommit them in
> > DecommitFreePages, leaving us with a persistent positive number.
> 
> I am not sure what do you mean here. Surely we skip empty chunks when doing
> the decommit, but when we do decommit, we remove later on the background
> thread all the empty chunks. As the new GC always waits until the background
> sweep finishes, we cannot have several GC with decommits without releasing
> all empty chunks.

Ah, okay, I did not know that we freed the ChunkPool on GC_SHRINK, but it makes perfect sense.  In that case, this is fine.  It would be nice if we could also add an assert at the end of GC_SHRINK that the counter is always 0, but that would require us to decide that the Decommit is infallible, which I'm not sure we want to guarantee yet.

> Also note that in bug 702251 I will implement decommit together with
> the chunk release on the background thread, so there would be no arbitrary
> pause between decommit and release.

I'm looking forward to seeing that -- particularly how you will avoid taking a lock at arena allocation if we may be decommitting at the same time.
Comment 20 Igor Bukanov 2011-12-19 13:40:45 PST
(In reply to Terrence Cole [:terrence] from comment #19)
> I'm looking forward to seeing that -- particularly how you will avoid taking
> a lock at arena allocation if we may be decommitting at the same time.

That was not the most difficult part. The bigger complexity comes from avoiding races with the allocation thread and from decommitting from the tail of the available chunk lists, not from the start. I need to fix a few issues with the patch before asking for a review...
Comment 22 Ed Morley [:emorley] 2011-12-20 05:53:15 PST
https://hg.mozilla.org/mozilla-central/rev/42b04bce9deb
