Open Bug 1619213 Opened 2 years ago Updated 1 year ago

Crash in [@ IPCError-browser | ShutDownKill | js::gc::ParallelWorker<T>::run]

Categories

(Core :: JavaScript: GC, defect, P5)

Unspecified
Windows 10
defect

Tracking

()

ASSIGNED
Tracking Status
firefox73 --- unaffected
firefox74 --- unaffected
firefox75 --- wontfix

People

(Reporter: calixte, Assigned: allstars.chh)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-272ad86e-cc39-48e2-b116-469840200229.

Top 10 frames of crashing thread:

0 xul.dll js::gc::ParallelWorker<ArenaListSegment, ArenasToUnmark>::run js/src/gc/ParallelWork.h:49
1 xul.dll js::GCParallelTask::runTask js/src/gc/GCParallelTask.cpp:143
2 xul.dll js::gc::GCRuntime::joinTask js/src/gc/GC.cpp:5035
3 xul.dll js::gc::AutoRunParallelWork<ArenaListSegment, ArenasToUnmark>::~AutoRunParallelWork js/src/gc/ParallelWork.h:103
4 xul.dll js::gc::GCRuntime::unmarkCollectedZones js/src/gc/GC.cpp:4003
5 xul.dll js::gc::GCRuntime::beginMarkPhase js/src/gc/GC.cpp:4055
6 xul.dll js::gc::GCRuntime::incrementalSlice js/src/gc/GC.cpp:6494
7 xul.dll js::gc::GCRuntime::gcCycle js/src/gc/GC.cpp:6979
8 xul.dll js::gc::GCRuntime::collect js/src/gc/GC.cpp:7161
9 xul.dll js::gc::GCRuntime::startGC js/src/gc/GC.cpp:7249

There are 6 crashes (from 6 installations) in nightly 75 starting with buildid 20200228161940. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1617902.

[1] https://hg.mozilla.org/mozilla-central/rev?node=1434bd7bae35

Flags: needinfo?(jcoppeard)
Crash Signature: [@ IPCError-browser | ShutDownKill | js::gc::ParallelWorker<T>::run] → [@ IPCError-browser | ShutDownKill | js::gc::ParallelWorker<T>::run] [@ IPCError-browser | ShutDownKill | ArenasToUnmark::next]

Jonco is on PTO, taking this bug.

Assignee: nobody → allstars.chh
Status: NEW → ASSIGNED
Flags: needinfo?(jcoppeard)

BTW we already had frequent signatures in the GC, see bug 1616018 comment 4. It would be interesting to know if these signatures are new (i.e. they appear in addition to the old ones) or if they've replaced the old ones. Yoshi, FYI if you're not familiar with this type of signature this is not a crash per se. What is happening is that a content process is being slow during shutdown and we're killing it after having waited for 5 seconds for it to close itself. We grab a minidump which is sent as a crash report just before we kill the process to know where it was getting stuck.

Yeah, this feels like it is probably a signature change.

See Also: → 1616018

Joh or Yoshi, could please you set the priority flag?

Flags: needinfo?(jcoppeard)
Flags: needinfo?(allstars.chh)
Flags: needinfo?(jcoppeard)
Flags: needinfo?(allstars.chh)
Priority: -- → P1

I think the problem is the AutoRunParallelWork in unmarkCollectedZones keeps holding the lock
https://searchfox.org/mozilla-central/rev/13b081a62d3f3e3e3120f95564529257b0bf451c/js/src/gc/GC.cpp#4006

so JS Helpers aren't run because they cannot acquire the lock.

And when unmarkCollectedZones exits, the destructor of AutoRunParallelWork calls joinTask
https://searchfox.org/mozilla-central/rev/13b081a62d3f3e3e3120f95564529257b0bf451c/js/src/gc/ParallelWork.h#103

and because those tasks are not run by helpers, so main thread runs them
https://searchfox.org/mozilla-central/rev/13b081a62d3f3e3e3120f95564529257b0bf451c/js/src/gc/GC.cpp#5031
The stack trace in Comment 0 also explains this.

And because it's only main thread runs the task, it takes a long time, and as Gabriele explains in comment 2, the timer kills it
(Thanks Gabriels)

So I will try to fix this with adding another joinTask() that won't run the task by main thread. (or maybe add another condition, say main thread will run the task if helper threads are all busy)

Crash Signature: [@ IPCError-browser | ShutDownKill | js::gc::ParallelWorker<T>::run] [@ IPCError-browser | ShutDownKill | ArenasToUnmark::next] → [@ IPCError-browser | ShutDownKill | js::gc::ParallelWorker<T>::run] [@ IPCError-browser | ShutDownKill | ArenasToUnmark::next] [@ IPCError-browser | ShutDownKill | __lll_lock_wait ]

AutoRunParallelWork will try to dispatch the tasks to other helper
threads, but it keeps holding the lock so helper threads cannot run the
tasks. Later when ~AutoRunParallelWork calls joinTask, which will try to
run those tasks on main thread. And in some cases it's taking too long
hence gets killed by Shutdown timer.

So adding a overload version of joinTask to forcibly to run the tasks on
helper thread without dispatching it to main thread.

(In reply to Yoshi Cheng-Hao Huang [:allstars.chh][:allstarschh] from comment #5)
Good catch about not releasing the lock. However joinTask() will release the lock when running a task on the main thread, so in the worst case one task will run on the main thread and the others will start running on helper threads at that point.

I added some printf debugging to confirm this:

js> gc()
Starting ParallelWorker for unmarkCollectedZones on thread 140013731993792
Starting ParallelWorker for unmarkCollectedZones on thread 140013677328128
Starting ParallelWorker for unmarkCollectedZones on thread 140013679421184
Starting ParallelWorker for unmarkCollectedZones on thread 140013685700352
Starting ParallelWorker for sweepWeakCaches on thread 140013731993792
Starting ParallelWorker for sweepWeakCaches on thread 140013673142016
Starting ParallelWorker for sweepWeakCaches on thread 140013685700352
Starting ParallelWorker for sweepWeakCaches on thread 140013671048960
Starting ParallelWorker for sweepWeakCaches on thread 140013683607296
Starting ParallelWorker for sweepWeakCaches on thread 140013675235072
Starting ParallelWorker for sweepWeakCaches on thread 140013673142016
Starting ParallelWorker for sweepWeakCaches on thread 140013681514240
"before 638976, after 638976\n"

This is probably a signature change following bug 1618638.

Priority: P1 → P5
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.