Closed Bug 1873229 Opened 1 year ago Closed 1 year ago

Crash [@ LateWriteObserver::Observe]

Categories

(Core :: SQLite and Embedded Database Bindings, defect)

x86_64
Windows
defect

Tracking

()

VERIFIED FIXED
125 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox121 --- disabled
firefox122 --- disabled
firefox123 --- disabled
firefox124 --- wontfix
firefox125 --- verified

People

(Reporter: jkratzer, Assigned: saschanaz)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression, testcase, Whiteboard: [bugmon:bisected,confirmed])

Crash Data

Attachments

(3 files)

Testcase found while fuzzing mozilla-central rev 9ea90dc23395 (built with: --enable-debug --enable-fuzzing).

Testcase can be reproduced using the following commands:

$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch --build 9ea90dc23395 --debug --fuzzing  -n firefox
$ python -m grizzly.replay.bugzilla ./firefox/firefox <bugid>
[@ LateWriteObserver::Observe]

    r10 = 0x00007ffd79c10000	r11 = 0x000000c0249be1a0	r12 = 0x0000000000000000
    r13 = 0x0000000000000000	r14 = 0x000001ed80b1d1b8	r15 = 0x000000c0249bee40
     r8 = 0x000000000000000e	 r9 = 0x00007ffd79d9ebf8	rax = 0x00007ffd264f261d
    rbp = 0x000001ed80334000	rbx = 0x000001ed85886160	rcx = 0x00007ffd36ed4d48
    rdi = 0x000001ed80b1d1b0	rdx = 0x0000000000000000	rip = 0x00007ffd1ce76791
    rsi = 0x000000c0249bed10	rsp = 0x000000c0249bebf0
    OS|Windows NT|10.0.22621
    CPU|amd64|family 6 model 186 stepping 2|6
    Crash|EXCEPTION_BREAKPOINT|0x00007ffd1ce76791|26
    26|0|xul.dll|LateWriteObserver::Observe(mozilla::IOInterposeObserver::Observation&)|hg:hg.mozilla.org/mozilla-central:xpcom/build/LateWriteChecks.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|119|0x91
    26|1|xul.dll|mozilla::IOInterposer::Report(mozilla::IOInterposeObserver::Observation&)|hg:hg.mozilla.org/mozilla-central:xpcom/build/IOInterposer.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|479|0x1f1
    26|2|xul.dll|mozilla::IOInterposeObserver::Observation::Report()|hg:hg.mozilla.org/mozilla-central:xpcom/build/IOInterposer.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|386|0x43
    26|3|xul.dll|(anonymous namespace)::InterposedNtWriteFile(void*, void*, void (*)(void*, _IO_STATUS_BLOCK*, unsigned long), void*, _IO_STATUS_BLOCK*, void*, unsigned long, _LARGE_INTEGER*, unsigned long*)|hg:hg.mozilla.org/mozilla-central:xpcom/build/PoisonIOInterposerWin.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|358|0x1ec
    26|4|xul.dll|std::sys::windows::handle::Handle::synchronous_write()|git:github.com/rust-lang/rust:library/std/src/sys/windows/handle.rs:82e1608dfa6e0b5569232559e3d385fea5a93112|286|0x89
    26|5|xul.dll|std::fs::write::inner()|git:github.com/rust-lang/rust:library/std/src/fs.rs:82e1608dfa6e0b5569232559e3d385fea5a93112|333|0xc6
    26|6|xul.dll|rkv::backend::impl_safe::environment::EnvironmentImpl::write_to_disk()|hg:hg.mozilla.org/mozilla-central:third_party/rust/rkv/src/backend/impl_safe/environment.rs:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|223|0x331
    26|7|xul.dll|rkv::backend::impl_safe::transaction::impl$4::commit(rkv::backend::impl_safe::transaction::RwTransactionImpl)|hg:hg.mozilla.org/mozilla-central:third_party/rust/rkv/src/backend/impl_safe/transaction.rs:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|193|0x264
    26|8|xul.dll|kvstore::task::impl$11::run(kvstore::task::DeleteTask*)|hg:hg.mozilla.org/mozilla-central:toolkit/components/kvstore/src/task.rs:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|574|0x2af
    26|9|xul.dll|async_task::raw::RawTask<enum2$<moz_task::impl$2::dispatch_with_options::async_block_env$0>,alloc::boxed::Box<dyn$<moz_task::Task,core::marker::Send,core::marker::Sync>,alloc::alloc::Global>,moz_task::executor::impl$5::spawn_onto::closure_env$0<enum2$<moz_task::impl$2::dispatch_with_options::async_block_env$0> > >::run<enum2$<moz_task::impl$2::dispatch_with_options::async_block_env$0>,alloc::boxed::Box<dyn$<moz_task::Task,core::marker::Send,core::marker::Sync>,alloc::alloc::Global>,moz_task::executor::impl$5::spawn_onto::closure_env$0<enum2$<moz_task::impl$2::dispatch_with_options::async_block_env$0> > >(tuple$<>*)|hg:hg.mozilla.org/mozilla-central:third_party/rust/async-task/src/raw.rs:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|511|0x56
    26|10|xul.dll|moz_task::dispatcher::impl$4::allocate::Run<moz_task::executor::schedule::closure_env$1>(xpcom::interfaces::idl::nsIRunnable*)|hg:hg.mozilla.org/mozilla-central:xpcom/rust/moz_task/src/dispatcher.rs:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|16|0xd1
    26|11|xul.dll|mozilla::TaskQueue::Runner::Run()|hg:hg.mozilla.org/mozilla-central:xpcom/threads/TaskQueue.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|257|0x367
    26|12|xul.dll|nsThreadPool::Run()|hg:hg.mozilla.org/mozilla-central:xpcom/threads/nsThreadPool.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|341|0x7df
    26|13|xul.dll|nsThread::ProcessNextEvent(bool, bool*)|hg:hg.mozilla.org/mozilla-central:xpcom/threads/nsThread.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|1193|0xa68
    26|14|xul.dll|NS_ProcessNextEvent(nsIThread*, bool)|hg:hg.mozilla.org/mozilla-central:xpcom/threads/nsThreadUtils.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|480|0x44
    26|15|xul.dll|mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*)|hg:hg.mozilla.org/mozilla-central:ipc/glue/MessagePump.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|300|0xad
    26|16|xul.dll|MessageLoop::RunHandler()|hg:hg.mozilla.org/mozilla-central:ipc/chromium/src/base/message_loop.cc:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|363|0x4f
    26|17|xul.dll|MessageLoop::Run()|hg:hg.mozilla.org/mozilla-central:ipc/chromium/src/base/message_loop.cc:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|345|0x6e
    26|18|xul.dll|nsThread::ThreadFunc(void*)|hg:hg.mozilla.org/mozilla-central:xpcom/threads/nsThread.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|370|0x15a
    26|19|nss3.dll|_PR_NativeRunThread(void*)|hg:hg.mozilla.org/mozilla-central:nsprpub/pr/src/threads/combined/pruthr.c:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|399|0x120
    26|20|nss3.dll|pr_root(void*)|hg:hg.mozilla.org/mozilla-central:nsprpub/pr/src/md/windows/w95thred.c:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|139|0x10
    26|21|ucrtbase.dll||||
    26|22|KERNELBASE.dll||||
    26|23|kernel32.dll||||
    26|24|ucrtbase.dll||||
    26|25|mozglue.dll|patched_BaseThreadInitThunk(int, void*, void*)|hg:hg.mozilla.org/mozilla-central:toolkit/xre/dllservices/mozglue/WindowsDllBlocklist.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|561|0x74
    26|26|ntdll.dll||||
    26|27|KERNELBASE.dll||||
Attached file Testcase

It looks like this is a late write from some code using the Rust RKV data store. Maybe this is from Glean because that's what seems to have motivated the most recent updates to this library.

I'll speculatively move this to Telemetry.

Component: XPCOM → Telemetry
Product: Core → Toolkit

Glean doesn't use kvstore to talk to rkv (we go direct). Notification db, settings sync, and Extensions (Scripting and Permissions) use it (though the latter doesn't use it on all channels (bug 1646182))... Not sure which is the most likely culprit, but I remember from previous adventures (bug 1807010, bug 1805427) that kvstore itself is in Toolkit :: Storage, so that's where I choose to send this next.

Component: Telemetry → Storage

Verified bug as reproducible on mozilla-central 20240109162901-b29d6ace45f4.
The bug appears to have been introduced in the following build range:

Start: 93db8f4d488c7abd7f9dc663ae8061964260d028 (20230501062357)
End: a316042804c7811afb12c9fa73aa35c08c70f587 (20230501093846)
Pushlog: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=93db8f4d488c7abd7f9dc663ae8061964260d028&tochange=a316042804c7811afb12c9fa73aa35c08c70f587

Keywords: regression
Whiteboard: [bugmon:confirm] → [bugmon:bisected,confirmed]

This bug has been marked as a regression. Setting status flag for Nightly to affected.

Marco, can we get a Severity applied to this, please?

Also, soft freeze for Fx123 is this Thursday. Is this something we feel comfortable letting go to Beta?

Flags: needinfo?(mak)

I'd love to be able to give better hints, but kvstore is in Storage mostly for historical reasons, thus I may not be the best person to evaluate this.

This is marked as a regression, but it points to an empty pushlog, making it hard to evaluate which component is this related to, and whether something effectively regressed, or if this is just an intermittent failure that happens more often (different code scheduling due to other changes?).
Jason, Is that empty pushlog expected?

In lack of such data, there's also no data suggesting this would be particularly problematic in Beta.

The first thing I can suggest here, is we need a way to tell which kvstore consumer is writing when a crash happens, by annotating the crash accordingly, that sounds a bit similar to Bug 1834046, where I started a discussion with :gsvelto, and after some crash reporter annotations refactorings (that I think are happening these days) it should be feasible.

Jens do you have any additional insight? Do you know who may still be developing kvstore after Porof left?
It may also be worth thinking about the direction, if we don't have owners of this code, is it still something we suggest to use? Is any of the current consumers interested into taking over kvstore maintenance? Or should we start investigation over alternatives (e.g. a SQLite based kvstore, redb, bolt)?

Maybe :saschanaz can help, being one of the rkv create owners.

Flags: needinfo?(mak)
Flags: needinfo?(krosylight)
Flags: needinfo?(jstutte)
Flags: needinfo?(jkratzer)

we need a way to tell which kvstore consumer is writing when a crash happens

Let's see if pernosco can help us.

Flags: needinfo?(jstutte)
Keywords: pernosco-wanted

Jens do you have any additional insight? Do you know who may still be developing kvstore after Porof left?

We took over maintenance for rkv but are not doing any active development right now. Given that it mostly works, not sure if we would want to spend the effort to think about alternatives, though.

Pernosco is only supported for Linux x86_64 bugs.

Keywords: pernosco-wanted

Ah yes, InterposedNtWriteFile is Windows only.

I can confirm the crash from mozregression, let me bisect to get some real range.

Flags: needinfo?(krosylight)

Manual bisect says bug 1826224, but not 100% sure as this is a shutdown phase crash that depends a lot on timing; closing too early or closing too late does not trigger crash. I'll double check before adding the regressor.

The second bisection on a different device also points to bug 1826224.

Regressed by: 1826224

The method:

  1. Download the testcase of comment #2 and serve the folder by either npm serve package or python -m http.server.
  2. Run Firefox debug build and open the page
  3. You'll see:
2024-01-16T22:03:30.598000: INFO : b'JavaScript error: http://localhost:3000/sw.js, line 13: TypeError: ServiceWorkerRegistration.showNotification: No active worker for scope http://localhost:3000/.'
2024-01-16T22:03:30.601000: INFO : b'[WARN  rkv::backend::impl_safe::environment] `load_ratio()` is irrelevant for this storage backend.'

(... which means we should probably remove load_ratio call from kvstore`. But anyway.)

  1. Wait ~10 seconds and you'll see another load_ratio warning. Then close the page and Firefox.
  2. You'll see the crash

Bug 1826224 added a pref timer.auto_increase_timer_resolution but disabling it does not prevent the crash.

The testcase repeats showNotification via setInterval which never really appears to the UI, but does it go to the DB? 🤔

Edit: Yes, getNotification() returns a lot of notifications. What's the point when it never appears at all, though? Chrome behaves same if the permission is given, hmm. showNotification should not put anything to DB without permission, so at least that's a problem, but nothing to do with the crash itself.

See Also: → 1874958
See Also: → 1874960

Set release status flags based on info from the regressing bug 1826224

The timers resolution fix is Nightly only, but according to comment 17, disabling the pref doesn't revert this crash, sounds like timing issue.
It looks like the notifications component is not using a shutdown blocker to ensure writes are complete? I can't see any shutdown handling in https://searchfox.org/mozilla-central/source/dom/notification/new/NotificationDB.sys.mjs, apart from a check in init()
I also don't see shudown handling in kvstore?

(In reply to Marco Bonardo [:mak] from comment #20)

The timers resolution fix is Nightly only, but according to comment 17, disabling the pref doesn't revert this crash, sounds like timing issue.
It looks like the notifications component is not using a shutdown blocker to ensure writes are complete? I can't see any shutdown handling in https://searchfox.org/mozilla-central/source/dom/notification/new/NotificationDB.sys.mjs, apart from a check in init()
I also don't see shudown handling in kvstore?

Good hint. I think there is some form of shutdown handling here but IIUC unregisterListeners does not wait for an eventual queue of tasks to be drained before we let continue the shutdown?

An easy thing to do might be to add some check for _shutdownInProgress inside runNextTask, if we are not interested in saving them anymore. Otherwise we need to wait somewhere and the cleanest way to do so would be to use an async shutdown blocker, IMHO. While we are here we could also think about using an earlier phase for that blocker than XPCOMShutdown.

This is marked as a regression, but it points to an empty pushlog, making it hard to evaluate which component is this related to, and whether something effectively regressed, or if this is just an intermittent failure that happens more often (different code scheduling due to other changes?).
Jason, Is that empty pushlog expected?

This is actually an issue with the way hg.mozilla.org shows the commit ranges. The full range is:

changeset:   662060:93db8f4d488c
summary:     Bug 1823135 - Initial support for NVIDIA RTX Video Super Resolution r=gfx-reviewers,sotaro

changeset:   662061:d12f178c408f
summary:     no bug - Bumping Firefox l10n changesets r=release a=l10n-bump DONTBUILD CLOSED TREE

changeset:   662062:a316042804c7
summary:     Backed out changeset 93db8f4d488c (bug 1823135) for causing assertion failures on StaticPrefList_gfx.h. CLOSED TREE

I suspect rev a316042804c7 caused issues during bisection.

Flags: needinfo?(jkratzer)

:jlink, since you are the author of the regressor, bug 1826224, could you take a look? Also, could you set the severity field?

For more information, please visit BugBot documentation.

Flags: needinfo?(jlink)

(In reply to Jens Stutte [:jstutte] from comment #22)

While we are here we could also think about using an earlier phase for that blocker than XPCOMShutdown.

Yes, "xpcom-shutdown" is arguably way too late. "profile-before-change" is probably the last appropriate phase for a subsystem like this.

(In reply to Jason Kratzer [:jkratzer] from comment #23)

This is actually an issue with the way hg.mozilla.org shows the commit ranges. The full range is:

changeset:   662060:93db8f4d488c
summary:     Bug 1823135 - Initial support for NVIDIA RTX Video Super Resolution r=gfx-reviewers,sotaro

This would be even more surprising than the previous range...

Set release status flags based on info from the regressing bug 1826224

This regression was marked as being caused by my change (bug 1826224) and I was NI'd but my interpretation of the above discussion is that this is not necessarily the case as there was a problem during the bisection. Is my understanding on that correct?
(It's also interesting to see that the bug doesn't go away when the pref is used to disable the functionality added in that bug - if that's the case but yet this is caused by my change, that would seem like a very useful piece of information.)

Flags: needinfo?(jlink)

(In reply to Justin Link from comment #28)

This regression was marked as being caused by my change (bug 1826224) and I was NI'd but my interpretation of the above discussion is that this is not necessarily the case as there was a problem during the bisection. Is my understanding on that correct?
(It's also interesting to see that the bug doesn't go away when the pref is used to disable the functionality added in that bug - if that's the case but yet this is caused by my change, that would seem like a very useful piece of information.)

There was a problem for the bot but I don't think there was a problem in my bisection. That said, it's probably fair to say this was a hidden existing problem that is discovered by some timing change (not sure why it persists even without the pref though).

Thanks for the update. I'm going to take a quick second look at my change to try and see if I can see anything that seems like it could be more than just a tiny timing difference when disabled. But yeah, this otherwise seems like a pre-existing issue that was probably bound to be run into at some point.

I can take the notification side issue.

Assignee: nobody → krosylight

So, it turns out that spamming notification is a great way to freeze the main process:

for (const i of new Array(100000).fill(0)) {
  new Notification("foo");
}

Running this will instantly make the main process unresponsive. Adding shutdown blocker is needed anyway, but I think the core issue is that we allow this kind of IPC spam. But then perhaps this is not specific to notification, maybe IDB can do the same spam.

I think the patch should at least make some difference but couldn't confirm it with python -m grizzly.replay.bugzilla D:\gecko\obj-dbg\dist\bin\firefox.exe 1873229, somehow it just say "No results detected" even without any fix. I'll try tomorrow.

This goes away when nightly-only MOZ_NEW_NOTIFICATION_STORE is disabled, which gives the same impl for stable/beta. That gives a question, should we just disable this? As argued in bug 1777973 I don't see a good reason to keep this one.

We have two very similar modules right now but with some different behavior:

  1. With MOZ_NEW_NOTIFICATION_STORE=True we save the DB via kvstore, which has no shutdown blocker and causes crash when being too late.
  2. With MOZ_NEW_NOTIFICATION_STORE=False we save the DB via IOUtils, which does have its shutdown blocker.

Either way there's no shutdown blocker that ensures processing the whole task queue, which poses risk of losing notification data, which is required when notification is clicked by user. This patch adds that.

Unfortunately we don't have a test here as I don't think we have infra for shutdown blocking test.

Pushed by krosylight@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/306a8e5f5de0 Add a shutdown blocker for NotificationDB r=asuth
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 125 Branch

Verified bug as fixed on rev mozilla-central 20240229044455-f21a5f54b6ec.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Status: RESOLVED → VERIFIED
Keywords: bugmon

The patch landed in nightly and beta is affected.
:saschanaz, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox124 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(krosylight)
  1. This is pretty edge case
  2. The crash itself only affects the nightly-only implementation
  3. Even if we upstream this there are many more issues that prevents proper notification DB handling and thus it won't make much difference

Thus I say wontfix.

Flags: needinfo?(krosylight)
Product: Toolkit → Core
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: