Closed Bug 1735284 Opened 2 months ago Closed 2 months ago

Avoid a potential race between thread self-shutdown and external shutdown resulting in hang

Categories

(Core :: XPCOM, defect)

defect

Tracking

()

RESOLVED FIXED
95 Branch
Tracking Status
firefox95 --- fixed

People

(Reporter: jstutte, Assigned: jstutte)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

From bug 1629669:

The false returns from PutEventInternal() are pretty limited - mEventsAreDoomed or no mQueue in the target sink. mEventsAreDoomed can happen if we try to send an event to a thread after it gets past nsThread.cpp:414, where we process events at shutdown time until the event queue is empty. At that point we set mEventsAreDoomed and start cleanup, but an attempt to send us a message could come in there I think.
However for this to happen, we'd have to have already started shutdown for the thread, and ShutdownInternal() doesn't let itself be called twice. If a thread exited on it's own, and Shutdown() was called on it while it was in the process of shutting down, perhaps this sequence could happen. Seems unlikely, but in theory possible for threads that commit suicide.
We could add a MOZ_RELEASE_ASSERT to flag this case, and see if it's being hit.

We can probably directly add a return on error here. It is clear that if PushEvent fails, we will not be able to ever receive a shutdown ack from the thread. Even if we do not know yet if those conditions ever arise, it seems the only logical thing to do here to avoid a potential hang.

Blocks: 1629669

Adding the signatures from bug 1629669 also here to directly see if this moves the needle.

Crash Signature: [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown ] [@ @ shutdownhang | nsAppShell::ProcessNextNativeEvent ]
Assignee: nobody → jstutte
Status: NEW → ASSIGNED
Crash Signature: [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown ] [@ @ shutdownhang | nsAppShell::ProcessNextNativeEvent ] → [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown ]

I was a bit curious about this bug, so I looked into it a bit before reviewing the patch:

Because of this atomic: https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/xpcom/threads/nsThread.cpp#759, only one call to Shutdown() should be able to enter ShutdownInternal and start the shutdown process through nsThread::ShutdownInternal. The only way that MessageLoop::Run should return here: https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/xpcom/threads/nsThread.cpp#390 is if MessagePumpForNonMainThreads::keep_running_ returns false https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/ipc/glue/MessagePump.cpp#301-303,311-313,321-323. This only happens when MessagePumpDefault::Quit is called: https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/ipc/chromium/src/base/message_pump_default.cc#79, which is only called if MessageLoop::state_->quit_received is true: https://searchfox.org/mozilla-central/rev/2c4b830b924f42283632b70f39a60fd36433dd4d/ipc/chromium/src/base/message_loop.cc#570. quit_received is only set to true by MessageLoop::Quit(), which has only 5 callers:

This makes me think that the shutdown hangs which we're running into in this bug are probably not caused by this potential race, but rather by unresponsive background threads. I took a look at a few randomly selected crash reports (https://crash-stats.mozilla.org/report/index/bd55a828-5f6a-401d-8f7f-4d0600211012#allthreads, https://crash-stats.mozilla.org/report/index/054a6ae0-992c-4d2f-b3f9-15fe20211012#allthreads, https://crash-stats.mozilla.org/report/index/994ea721-2ccf-4a50-913c-2c8aa0211012#allthreads, https://crash-stats.mozilla.org/report/index/fc7dea1e-d3ff-464d-bbb1-7b2920211012#allthreads, https://crash-stats.mozilla.org/report/index/a06b3dc0-320d-426c-98fc-fe3fa0211012#allthreads), and I noticed that every crash report had a running BitsCommander thread with a deep stack. BitsCommander is a nsThread, so would be blocked on in nsThreadManager::Shutdown. I'm guessing that we're hanging waiting for the BitsCommander thread to exit, and the BitsCommander thread is not responding due to performing some form of sync Windows RPC.

Not sure how tricky it'd be to do, but it might be nice to hold on to the thread name which is being waited on here, so it's more clear which thread is being waited on when we get hangs like this.

Flags: needinfo?(jstutte)

(In reply to Nika Layzell [:nika] (ni? for response) from comment #3)

Not sure how tricky it'd be to do, but it might be nice to hold on to the thread name which is being waited on here, so it's more clear which thread is being waited on when we get hangs like this.

That goes a bit in the same direction as bug 1735129, I'd say? I assume there is a way to retrieve the name being inside the nsThread object ?

Flags: needinfo?(jstutte) → needinfo?(nika)

(I just proposed a patch for this in bug 1735129, so ignore the question here about retrieving the name.)

Flags: needinfo?(nika)

We do not expect this to have great influence on crash stats, and in case we would see the diagnostic asserts, anyway.

Crash Signature: [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown ]

Comment on attachment 9245407 [details]
Bug 1735284: Add a check in nsThread::ShutdownInternal if the shutdown event could be dispatched. r?#xpcom-reviewers

Request for data collection review form

All questions are mandatory. You must receive review from a data steward peer on your responses to these questions before shipping new data collection.

  1. What questions will you answer with this data?
    We want to attribute unexpected errors to specific threads in order to see eventual patterns.

  2. Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements?
    To help reducing unexpected errors in thread handling.

  3. What alternative methods did you consider to answer these questions? Why were they not sufficient?
    We could live without this, but if it happens it would be nice to see which thread caused this.

  4. Can current instrumentation answer these questions?
    Yes, it is just a MOZ_CRASH_UNSAFE.

  5. List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories found on the Mozilla wiki.

Note that the data steward reviewing your request will characterize your data collection based on the highest (and most sensitive) category.

Measurement Description Data Collection Category Tracking Bug
Thread Name Category 1 “Technical data” 1735284
  1. Please provide a link to the documentation for this data collection which describes the ultimate data set in a public, complete, and accurate way.
    This collection is using the existing crash-stats and falls under its descriptions.

  2. How long will this data be collected? Choose one of the following:

  1. What populations will you measure?
    All.

  2. If this data collection is default on, what is the opt-out mechanism for users?
    The normal opt-out for telemetry.

  3. Please provide a general description of how you will analyze this data.
    We hope to never see this crash. If we do, we'll look at the core dump.

  4. Where do you intend to share the results of your analysis?
    As fixes in the source code (if any).

  5. Is there a third-party tool (i.e. not Telemetry) that you are proposing to use for this data collection?
    No.

Attachment #9245407 - Flags: data-review?(chutten)
Attached file data collection review

Please attach future data collection reviews as their own text attachment to better integrate with Data Steward processes.

Attachment #9246502 - Flags: data-review?(chutten)
Attachment #9245407 - Flags: data-review?(chutten)

Comment on attachment 9246502 [details]
data collection review

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Jens Stutte is responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does the data collection use a third-party collection tool?

No.


Result: datareview+

Attachment #9246502 - Flags: data-review?(chutten) → data-review+
Pushed by rjesup@wgate.com:
https://hg.mozilla.org/integration/autoland/rev/66532c424290
Add a check in nsThread::ShutdownInternal if the shutdown event could be dispatched. r=xpcom-reviewers,nika
Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 95 Branch
You need to log in before you can comment on or make changes to this bug.