Closed Bug 1735284 Opened 3 years ago Closed 3 years ago

Avoid a potential race between thread self-shutdown and external shutdown resulting in hang

Tracking

()

Status:

RESOLVED FIXED

Milestone:

95 Branch

Tracking Flags:

Tracking

Status

firefox95

---

fixed

People

(Reporter: jstutte, Assigned: jstutte)

References

Details

Attachments

(2 files)

Bug 1735284: Add a check in nsThread::ShutdownInternal if the shutdown event could be dispatched. r?#xpcom-reviewers 3 years ago Jens Stutte [:jstutte] 48 bytes, text/x-phabricator-request		Details \| Review
data collection review 3 years ago Chris H-C :chutten 2.21 KB, text/plain	chutten : data-review+	Details

Jens Stutte [:jstutte]

Assignee

Description

•

3 years ago

•

Edited

From bug 1629669:

The false returns from PutEventInternal() are pretty limited - mEventsAreDoomed or no mQueue in the target sink. mEventsAreDoomed can happen if we try to send an event to a thread after it gets past nsThread.cpp:414, where we process events at shutdown time until the event queue is empty. At that point we set mEventsAreDoomed and start cleanup, but an attempt to send us a message could come in there I think.
However for this to happen, we'd have to have already started shutdown for the thread, and ShutdownInternal() doesn't let itself be called twice. If a thread exited on it's own, and Shutdown() was called on it while it was in the process of shutting down, perhaps this sequence could happen. Seems unlikely, but in theory possible for threads that commit suicide.
We could add a MOZ_RELEASE_ASSERT to flag this case, and see if it's being hit.

We can probably directly add a return on error here. It is clear that if PushEvent fails, we will not be able to ever receive a shutdown ack from the thread. Even if we do not know yet if those conditions ever arise, it seems the only logical thing to do here to avoid a potential hang.

Jens Stutte [:jstutte]

Assignee

Updated

•

3 years ago

Blocks: 1629669

Jens Stutte [:jstutte]

Assignee

Comment 1

•

3 years ago

Adding the signatures from bug 1629669 also here to directly see if this moves the needle.

Crash Signature: [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown ] [@ @ shutdownhang | nsAppShell::ProcessNextNativeEvent ]

Jens Stutte [:jstutte]

Assignee

Comment 2

•

3 years ago

Attached file Bug 1735284: Add a check in nsThread::ShutdownInternal if the shutdown event could be dispatched. r?#xpcom-reviewers — Details

Phabricator Automation

Updated

•

3 years ago

Assignee: nobody → jstutte

Status: NEW → ASSIGNED

Jens Stutte [:jstutte]

Assignee

Updated

•

3 years ago

Nika Layzell [:nika] (ni? for response)

Comment 3

•

3 years ago

I was a bit curious about this bug, so I looked into it a bit before reviewing the patch:

Because of this atomic: https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/xpcom/threads/nsThread.cpp#759, only one call to Shutdown() should be able to enter ShutdownInternal and start the shutdown process through nsThread::ShutdownInternal. The only way that MessageLoop::Run should return here: https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/xpcom/threads/nsThread.cpp#390 is if MessagePumpForNonMainThreads::keep_running_ returns false https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/ipc/glue/MessagePump.cpp#301-303,311-313,321-323. This only happens when MessagePumpDefault::Quit is called: https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/ipc/chromium/src/base/message_pump_default.cc#79, which is only called if MessageLoop::state_->quit_received is true: https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/ipc/chromium/src/base/message_loop.cc#570. quit_received is only set to true by MessageLoop::Quit(), which has only 5 callers:

QuitTask::Run https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/ipc/chromium/src/base/message_loop.h#151
- Only used on macOS for a UI thread: https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/toolkit/xre/nsEmbedFunctions.cpp#910
ThreadQuitTask::Run https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/ipc/chromium/src/base/thread.cc#26
- base::Thread only. I believe these threads are not shut down by XPCOM, as they are not XPCOM threads and therefore shouldn't be shut down in the hanging codepath.
XRE_ShutdownChildProcess() https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/toolkit/xre/nsEmbedFunctions.cpp#932
- Child-process main thread only
nsBaseAppShell::Exit https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/widget/nsBaseAppShell.cpp#148
- I believe this is UI-thread-only
nsThreadShutdownEvent::Run https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/xpcom/threads/nsThread.cpp#238
- Only called from ShutdownInternal()

This makes me think that the shutdown hangs which we're running into in this bug are probably not caused by this potential race, but rather by unresponsive background threads. I took a look at a few randomly selected crash reports (https://crash-stats.mozilla.org/report/index/bd55a828-5f6a-401d-8f7f-4d0600211012#allthreads, https://crash-stats.mozilla.org/report/index/054a6ae0-992c-4d2f-b3f9-15fe20211012#allthreads, https://crash-stats.mozilla.org/report/index/994ea721-2ccf-4a50-913c-2c8aa0211012#allthreads, https://crash-stats.mozilla.org/report/index/fc7dea1e-d3ff-464d-bbb1-7b2920211012#allthreads, https://crash-stats.mozilla.org/report/index/a06b3dc0-320d-426c-98fc-fe3fa0211012#allthreads), and I noticed that every crash report had a running BitsCommander thread with a deep stack. BitsCommander is a nsThread, so would be blocked on in nsThreadManager::Shutdown. I'm guessing that we're hanging waiting for the BitsCommander thread to exit, and the BitsCommander thread is not responding due to performing some form of sync Windows RPC.

Not sure how tricky it'd be to do, but it might be nice to hold on to the thread name which is being waited on here, so it's more clear which thread is being waited on when we get hangs like this.

Flags: needinfo?(jstutte)

Jens Stutte [:jstutte]

Assignee

Comment 4

•

3 years ago

(In reply to Nika Layzell [:nika] (ni? for response) from comment #3)

Not sure how tricky it'd be to do, but it might be nice to hold on to the thread name which is being waited on here, so it's more clear which thread is being waited on when we get hangs like this.

That goes a bit in the same direction as bug 1735129, I'd say? I assume there is a way to retrieve the name being inside the nsThread object ?

Flags: needinfo?(jstutte) → needinfo?(nika)

Jens Stutte [:jstutte]

Assignee

Comment 5

•

3 years ago

(I just proposed a patch for this in bug 1735129, so ignore the question here about retrieving the name.)

Nika Layzell [:nika] (ni? for response)

Updated

•

3 years ago

Flags: needinfo?(nika)

Jens Stutte [:jstutte]

Assignee

Comment 6

•

3 years ago

We do not expect this to have great influence on crash stats, and in case we would see the diagnostic asserts, anyway.

Crash Signature: [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown ]

Jens Stutte [:jstutte]

Assignee

Comment 7

•

3 years ago

•

Edited

Comment on attachment 9245407 [details]
Bug 1735284: Add a check in nsThread::ShutdownInternal if the shutdown event could be dispatched. r?#xpcom-reviewers

Request for data collection review form

All questions are mandatory. You must receive review from a data steward peer on your responses to these questions before shipping new data collection.

What questions will you answer with this data?
We want to attribute unexpected errors to specific threads in order to see eventual patterns.
Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements?
To help reducing unexpected errors in thread handling.
What alternative methods did you consider to answer these questions? Why were they not sufficient?
We could live without this, but if it happens it would be nice to see which thread caused this.
Can current instrumentation answer these questions?
Yes, it is just a MOZ_CRASH_UNSAFE.
List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories found on the Mozilla wiki.

Note that the data steward reviewing your request will characterize your data collection based on the highest (and most sensitive) category.

Measurement Description	Data Collection Category	Tracking Bug
Thread Name	Category 1 “Technical data”	1735284

Please provide a link to the documentation for this data collection which describes the ultimate data set in a public, complete, and accurate way.
This collection is using the existing crash-stats and falls under its descriptions.
How long will this data be collected? Choose one of the following:

I want to permanently monitor this data. (jstutte@mozilla.com)

What populations will you measure?
All.
If this data collection is default on, what is the opt-out mechanism for users?
The normal opt-out for telemetry.
Please provide a general description of how you will analyze this data.
We hope to never see this crash. If we do, we'll look at the core dump.
Where do you intend to share the results of your analysis?
As fixes in the source code (if any).
Is there a third-party tool (i.e. not Telemetry) that you are proposing to use for this data collection?
No.

Attachment #9245407 - Flags: data-review?(chutten)

Chris H-C :chutten

Comment 8

•

3 years ago

Attached file data collection review — Details

Please attach future data collection reviews as their own text attachment to better integrate with Data Steward processes.

Attachment #9246502 - Flags: data-review?(chutten)

Chris H-C :chutten

Updated

•

3 years ago

Attachment #9245407 - Flags: data-review?(chutten)

Chris H-C :chutten

Comment 9

•

3 years ago

Comment on attachment 9246502 [details]
data collection review

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Jens Stutte is responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does the data collection use a third-party collection tool?

No.

Result: datareview+

Attachment #9246502 - Flags: data-review?(chutten) → data-review+

Pulsebot

Comment 10

•

3 years ago

Pushed by rjesup@wgate.com: https://hg.mozilla.org/integration/autoland/rev/66532c424290 Add a check in nsThread::ShutdownInternal if the shutdown event could be dispatched. r=xpcom-reviewers,nika

Sandor Molnar[:smolnar]

Comment 11

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/66532c424290

Status: ASSIGNED → RESOLVED

Closed: 3 years ago

status-firefox95: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 95 Branch

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Quick Search

Avoid a potential race between thread self-shutdown and external shutdown resulting in hang

Categories

(Core :: XPCOM, defect)

Tracking

()

People

(Reporter: jstutte, Assigned: jstutte)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(2 files)

Description

Updated

Comment 1

Comment 2

Updated

Updated

Comment 3

Comment 4

Comment 5

Updated

Comment 6

Comment 7

Request for data collection review form

Comment 8

Updated

Comment 9

Comment 10

Comment 11

Attachment

General

Description

File Name

Content Type