Closed Bug 1727157 Opened 3 years ago Closed 3 years ago

UnmarkGrayGCThingRecursively calling TimeStamp::Now causes hangs on Windows (via UnmarkGrayJSListenersInCCGenerationDocuments)

Categories

(Core :: JavaScript: GC, defect)

Desktop
Windows
defect

Tracking

()

RESOLVED FIXED
94 Branch
Tracking Status
firefox94 --- fixed

People

(Reporter: florian, Assigned: jonco)

References

Details

(Keywords: hang, perf, Whiteboard: [bhr:UnmarkGrayGCThingRecursively])

Attachments

(1 file)

See https://fqueze.github.io/hang-stats/#date=20210815&row=0&filter=UnmarkGrayGCThingRecursively

The typical stack is:

NtQueryPerformanceCounter ntdll
RtlQueryPerformanceCounter ntdll
static mozilla::TimeStamp::Now(bool) mozglue
JS::UnmarkGrayGCThingRecursively(JS::GCCellPtr) xul
...

The frequency of this hang varies a lot depending on the day. There's probably a difference between week days and week-end days.

It looks like this is happening during nsContentUtils::UnmarkGrayJSListenersInCCGenerationDocuments(). I'm not sure why we're seeing hangs from TimeStamp::Now(). Maybe we could batch up these calls into the JS engine so there wouldn't be as many TimeStamp calls?

It looks like the call to TimeStamp::Now is from Statistics::recordPhaseBegin. I guess this can be an overhead if we unmark many gray things in a loop.

Besides batching these unmark grey calls, another better way to reduce the performance overhead would be to move more things into mozilla::dom::TraceBlackJS(), so we mark these things black initially, instead of marking them grey and then later unmarking them to black.

Severity: -- → S3
Priority: -- → P3

While fixing this might need work on the JS side, I'd argue that the primary cause of this is the way we dump tons of stuff into UnmarkGrey from the DOM side, so I'm going to move it to DOM.

Severity: S3 → --
Component: JavaScript: GC → DOM: Core & HTML
Priority: P3 → --

I believe performance related things go to Performance, so...

Component: DOM: Core & HTML → Performance

(NI'ing smaug just in case)

Flags: needinfo?(bugs)

Hmm, having TimeStamp::Now() in a hot code path is surprising.

DOM could optimize how often calls are being made, but UnmarkGrayGCThingRecursively is a hot method, so TimeStamp::Now() call there should be probably optimized out anyhow.

Component: Performance → JavaScript Engine
Flags: needinfo?(bugs)

(In reply to Olli Pettay [:smaug] from comment #7)
Steve, I'd be OK with removing the GC stats phase recording from gray unmarking. Gray unmarking is a hot code path that can be called many times repeatedly, unlike the other GC phases which occur as part of a GC and do a bunch of work in each phase. What do you think?

Flags: needinfo?(sphink)
Summary: UnmarkGrayGCThingRecursively calling TimeStamp::Now causes hangs on Windows → UnmarkGrayGCThingRecursively calling TimeStamp::Now causes hangs on Windows (via UnmarkGrayJSListenersInCCGenerationDocuments)

I'm a little torn. It seems like timing gray unmarking is exactly what we need in order to motivate things like bug 1727587 and measure resulting improvements. On the other hand, grabbing a timestamp in hot code feels very wrong. I guess one alternative would be to count calls without timing them, or only time calls that recurse past the initial object. But that's getting more complex.

I will also note that we also have a profiling category for gray unmarking, which covers some of the same ground.

An in-between solution would be to make GC stats recording for gray unmarking be conditional on the profiler being enabled. You would have to rely on the profiler to tell you that there is a problem, but then when you're already in the profiler the phase data for the GC would be correct.

All in all, I'm in favor of just removing GC stats recording from gray unmarking entirely.

Flags: needinfo?(sphink)

Is TimeStamp::Now() hanging? Or is it just that we are spending a lot of time unmarking things, and that we are reporting the first thing which happens to be observable by the BHR?

Another option would be that we have a loop of GCs which are not doing anything beside reporting statistics, but this would be unlikely to hang on its own, and the issue would be in the thing which causes this GC loop.

(In reply to Nicolas B. Pierron [:nbp] from comment #10)

Is TimeStamp::Now() hanging?

It's not really hanging, but calling RtlQueryPerformanceCounter is known to be slow on Windows, and it's likely called many times.

Could this issue be caused by Bug 1724980?

(In reply to Nicolas B. Pierron [:nbp] from comment #12)

Could this issue be caused by Bug 1724980?

I don't believe so. That runnable isn't running any GCs, so it wouldn't be gray-unmarking anything.

Assignee: nobody → jcoppeard
Component: JavaScript Engine → JavaScript: GC
Pushed by jcoppeard@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/eac1397f4352 Remove GC stats phase for gray unmarking r=sfink
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 94 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: