Closed Bug 1895110 Opened 2 months ago Closed 14 days ago

Crash in [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout]

Categories

(Firefox :: Address Bar, defect, P1)

Firefox 126
Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
129 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox127 --- wontfix
firefox128 --- wontfix
firefox129 --- verified

People

(Reporter: worcester12345, Assigned: adw)

References

Details

(Keywords: crash, topcrash, Whiteboard: [sng])

Crash Data

Attachments

(2 files)

Crash report: https://crash-stats.mozilla.org/report/index/9a060de6-0aad-43e1-a28c-ba2c10240505

MOZ_CRASH Reason: Shutdown hanging at step XPCOMShutdownThreads. Something is blocking the main-thread.

Top 9 frames:

0  xul.dll  MOZ_Crash(char const*, int, char const*)  mfbt/Assertions.h:317
0  xul.dll  mozilla::(anonymous namespace)::RunWatchdog(void*)  toolkit/components/terminator/nsTerminator.cpp:244
1  nss3.dll  _PR_NativeRunThread(void*)  nsprpub/pr/src/threads/combined/pruthr.c:399
2  nss3.dll  pr_root(void*)  nsprpub/pr/src/md/windows/w95thred.c:139
3  ucrtbase.dll  thread_start<unsigned int (__cdecl*)(void*), 1>
4  kernel32.dll  BaseThreadInitThunk
5  mozglue.dll  mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mo...  toolkit/xre/dllservices/mozglue/nsWindowsDllInterceptor.h:150
5  mozglue.dll  patched_BaseThreadInitThunk(int, void*, void*)  toolkit/xre/dllservices/mozglue/WindowsDllBlocklist.cpp:562
6  ntdll.dll  RtlUserThreadStart

I see a SQL query executed from Rust, and it seems to come from the Suggest rust component. Drew, I'm not sure if you're the right person to ping, but maybe you can help figuring this out. See Thread 24.

Component: General → Address Bar
Flags: needinfo?(adw)

I wonder if https://www.sqlite.org/c3ref/interrupt.html may be useful here?

Severity: -- → S3
Keywords: crash
Priority: -- → P3
Whiteboard: [sng]

Thanks Marco, this looks like the same problem that prompted the creation of bug 1897264. The PR in that bug was merged but it hasn't been vendored into m-c yet. Once it is, we can add a shutdown hook in desktop to call interrupt() on the Rust component with the new InterruptKind::ReadWrite arg (looks like).

That bug was created out of this Slack thread. I didn't read all of it but it looks like iOS was seeing exceptions due to shutdown being blocked by Suggest ingest.

Depends on: 1897264
Flags: needinfo?(adw)

Bug 1892800 seems similar to this except it happened to ingestion. Interrupting both read/write tasks in the shutdown blocker should largely mitigate this.

That bug looks like an exact dupe of this one? I'll mark it. Thanks for calling that out.

Duplicate of this bug: 1892800
Duplicate of this bug: 1897967
Crash Signature: [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout ] [@ shutdownhang | kernelbase.dll | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout ]

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on release
  • Top 20 desktop browser crashes on beta
  • Top 10 desktop browser crashes on nightly
  • Top 5 desktop browser crashes on Windows on beta

:adw, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(adw)
Keywords: topcrash

What is the difference between this [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout ] bug and bug 1866944?

I've hit this crash ten times (starting 2024-05-29), typically when I launch Nightly from the Windows PowerShell and then exit Nightly.

https://crash-stats.mozilla.org/report/index/790156a1-0eb5-4579-bd28-842c90240603

See Also: → 1866944

(In reply to Chris Peterson [:cpeterson] from comment #9)

What is the difference between this [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout ] bug and bug 1866944?

The problem is that crash signature is a general symptom that may indicate many different issues, it's only by checking all the stack traces for the various threads that one can tell which specific code path may be causing it.

Bug 1897264 has been vendored into m-c so we can fix this on desktop now.

Assignee: nobody → adw
Severity: S3 → S2
Status: NEW → ASSIGNED
Flags: needinfo?(adw)
Priority: P3 → P1

This adds a profileBeforeChange async shutdown blocker that interrupts the
Suggest store. It calls interrupt(InterruptKind.READ_WRITE) so that both
ingests and queries are interrupted. The new interrupt() API is defined in
suggest.udl here:

https://searchfox.org/mozilla-central/rev/b11735b86bb4d416c918e2b2413456561beff50c/third_party/rust/suggest/src/suggest.udl#149-153

InterruptKind is defined in the generated UniFFI JS bindings here:

https://searchfox.org/mozilla-central/rev/b11735b86bb4d416c918e2b2413456561beff50c/toolkit/components/uniffi-bindgen-gecko-js/components/generated/RustSuggest.sys.mjs#1115-1119

This also modifies the one existing interrupt() call (when queries are
canceled) to pass InterruptKind.READ. According to the interrupt() doc,
passing nothing is deprecated.

I simplified the first task in test_rust_ingest.js. Now that Rust is enabled by
default, there's no need to test a first run where Rust starts out disabled but
then becomes enabled. Now the task only tests disabling it and then re-enabling
it. This also means Rust is enabled for the rest of the test, which makes the
test easier to work with.

Pushed by dwillcoxon@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d3346aa58eb8
Add a profileBeforeChange shutdown blocker to interrupt the Rust Suggest store. r=daisuke

STR for QA

We want to verify that quitting Firefox in various scenarios works successfully and as quickly as usual.

Set up

Enable Firefox Suggest if it's not already enabled. You can use whatever method you normally use, or start Firefox and set the following prefs to true:

browser.urlbar.suggest.quicksuggest.nonsponsored
browser.urlbar.suggest.quicksuggest.sponsored
browser.urlbar.quicksuggest.enabled

Scenario 1

  1. Start Firefox
  2. Immediately quit

Scenario 2

  1. Start Firefox
  2. Wait 3 seconds (doesn't need to be exact)
  3. Quit

Scenario 3

  1. Start Firefox
  2. Wait 10 seconds (doesn't need to be exact)
  3. Quit

Scenario 4

  1. Start Firefox
  2. Wait a few seconds
  3. Open about:config and confirm browser.urlbar.quicksuggest.rustEnabled is true. Trigger a new ingest by toggling it to false and then back to true (don't worry about doing this quickly, just false and then true is fine)
  4. Immediately quit

Scenario 5

  1. Start Firefox
  2. Wait a few seconds
  3. Trigger a new ingest by toggling browser.urlbar.quicksuggest.rustEnabled to false and then back to true
  4. Wait 3 seconds
  5. Quit

Scenario 6

  1. Start Firefox
  2. Wait a few seconds
  3. Trigger a new ingest by toggling browser.urlbar.quicksuggest.rustEnabled to false and then back to true
  4. Wait 10 seconds
  5. Quit
Crash Signature: [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout ] [@ shutdownhang | kernelbase.dll | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout ]
Flags: qe-verify+
Flags: in-testsuite+

I don't know why that last comment cleared the crash signature!

Crash Signature: [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout ] [@ shutdownhang | kernelbase.dll | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout ]
Attached patch Beta 128 PatchSplinter Review

There's a trivial patch conflict on Beta due to bug 1880183.

Approval Request Comment
[Feature/Bug causing the regression]: Firefox Suggest implemented in Rust
[User impact if declined]: This is a top crash. Since Suggest is enabled only for US users, it affects US users.
[Is this code covered by automated tests?]: Yes
[Has the fix been verified in Nightly?]: No
[Needs manual test from QE? If yes, steps to reproduce]: See comment 14
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: Medium risk
[Why is the change risky/not risky?]: This adds a shutdown blocker to force the Suggest Rust component to stop ingest, but hypothetically if there's a bug in the interrupt routine or the blocker itself, it could end up causing another problem on shutdown, which might happen more or less frequently than the crash it's trying to fix.
[String changes made/needed]:

Attachment #9409022 - Flags: approval-mozilla-beta?
Attachment #9409022 - Attachment is patch: true
Attachment #9409022 - Attachment mime type: application/octet-stream → text/plain
Status: ASSIGNED → RESOLVED
Closed: 14 days ago
Resolution: --- → FIXED
Target Milestone: --- → 129 Branch

@adw I'm slightly worried that the code will be stuck in drop_suggestions for too long, where it doesn't check for interruption. I think it will probably work since this is only an issue for the amp/wikipedia data and by my count there's about 40 records there, so the new worst-case scenario should be < 1/40th the time of the previous one. However, I'm not totally sure about this one. Have you tested it personally?

Flags: needinfo?(adw)

I haven't been able to reproduce the original crash, probably because it's so timing dependent. With the patch applied, I also haven't been able to reproduce a crash or any other problem. I guess I could try to somehow artificially increase the amount of time suggest spends executing SQL statements during ingest, to make it eaiser to quit in the middle of them.

What's the solution to drop_suggestions? It should use self.scope.err_if_interrupted() like the other functions that write to the DB, right? And that's a problem that applies to mobile too, right?

Flags: needinfo?(adw)
Whiteboard: [sng] → [sng][qa-triaged]

Unfortunately I wasn't able to reproduce the issue either using the scenarios from Comment 14 with Firefox 126.0 on Windows10 x64.
@adw please let me know if your suggestion from the last comment will help in reproducing this issue, so I can give it another try. Thanks.

Flags: needinfo?(adw)

Adding a NI to :bdk for the questions in Comment 19.

:adw/:bdk we are near end of the beta cycle for Fx128. Tomorrow is the last day of uplifts.
Though this has an uplift request I've been waiting on a resolution for your discussion.
If concerns are taking this, should this ride the train with Fx129?

Flags: needinfo?(bdeankawamura)

What's the solution to drop_suggestions? It should use self.scope.err_if_interrupted() like the other functions that write to the DB, right? And that's a problem that applies to mobile too, right?

There's a patch to make it run faster. I think that should be enough, since it reduces the time to something reasonable (around a second or so), then we will check for interruption right after.

:adw/:bdk we are near end of the beta cycle for Fx128. Tomorrow is the last day of uplifts.
Though this has an uplift request I've been waiting on a resolution for your discussion.
If concerns are taking this, should this ride the train with Fx129?

I'm not sure if this will fix the issue or not, but there's a good chance it might and it's is definitely an improvement. IMO, it's worth an uplift.

Flags: needinfo?(bdeankawamura)

I'm also going to hoping to merge https://github.com/mozilla/application-services/pull/6267 into the app-services 128 branch and uplift it to Android. If we vendored that in to moz-central it should fix the underlying performance issue. I got behind on this because I've been waiting for RyanVM to get back to me, but I just realized he's OOO until the end of the weak. If we uplifted that one, then I think we could skip uplifting this one.

Summary of some offline with :bdk, I will uplift Bug 1900837 (https://github.com/mozilla/application-services/pull/6267) for Fx128.
:bdk thinks having Bug 1900837 in Fx128 means that we don't need to uplift this bug for Fx129 i.e. Bug 1895110 should ride the train with Fx129.
:adw any objections?

Riding the trains sounds good to me, thanks. I'll cancel the request.

Ben, I see what you mean, you're right that the problem is drop_suggestions. I didn't look closely enough at the crash reports. All the reports with Suggest-related crashes are crashing in drop_suggestions. That includes the report in comment 0, comment 9, bug 1892800, and others I've checked with the same signature. The patch in this bug doesn't help with that.

Flags: needinfo?(adw)
Attachment #9409022 - Flags: approval-mozilla-beta?

(In reply to Ina Popescu, Desktop QA from comment #20)

Unfortunately I wasn't able to reproduce the issue either using the scenarios from Comment 14 with Firefox 126.0 on Windows10 x64.
@adw please let me know if your suggestion from the last comment will help in reproducing this issue, so I can give it another try. Thanks.

I think for verification, we should just make sure quitting Firefox in the scenarios I outlined completes successfully and without any unusual delay. The patch here added a shutdown blocker (to perform some cleanup on shutdown), which could potentially cause Firefox to take longer to quit or even cause it to hang, and we want to verify that doesn't happen.

I don't think it's important to reproduce the crash. And from the recent discussion, we've determined that this patch doesn't even fix it anyway.

Unfortunately the idea I had about making ingest take longer would require modifying the code, so it's not something that can easily be done.

@adw, thanks for the clarification. I've tested the scenarios from Comment 14 using the latest Nightly 129.0a1 on Windows 10, macOS 13 and Ubuntu 22.04. I didn't notice any issues when quitting the window. I've updated the flag accordingly.

Flags: qe-verify+
Whiteboard: [sng][qa-triaged] → [sng]

The patch landed in nightly and beta is affected.
:adw, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox128 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(adw)

Setting 128 to wontfix, see Comment 25

Flags: needinfo?(adw)

Ben, I see what you mean, you're right that the problem is drop_suggestions. ... The patch in this bug doesn't help with that.

For completeness: This patch might have helped things, since we call drop_suggestions for each record and there's about 40 records for the AMP/Wikipedia data. So, even though drop_suggestions doesn't check for interrupting and all the crashes are inside that function, I think this would cause an interrupt in the short time in-between the calls. That said, I don't think it's worth an uplift, since the other patch should fix the shutdown hangs. It'll be great to have the two patches together once they both ride the trains though.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: