Crash [@ LateWriteObserver::Observe]
Categories
(Core :: SQLite and Embedded Database Bindings, defect)
Tracking
()
People
(Reporter: jkratzer, Assigned: saschanaz)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: regression, testcase, Whiteboard: [bugmon:bisected,confirmed])
Crash Data
Attachments
(3 files)
Testcase found while fuzzing mozilla-central rev 9ea90dc23395 (built with: --enable-debug --enable-fuzzing).
Testcase can be reproduced using the following commands:
$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch --build 9ea90dc23395 --debug --fuzzing -n firefox
$ python -m grizzly.replay.bugzilla ./firefox/firefox <bugid>
[@ LateWriteObserver::Observe]
r10 = 0x00007ffd79c10000 r11 = 0x000000c0249be1a0 r12 = 0x0000000000000000
r13 = 0x0000000000000000 r14 = 0x000001ed80b1d1b8 r15 = 0x000000c0249bee40
r8 = 0x000000000000000e r9 = 0x00007ffd79d9ebf8 rax = 0x00007ffd264f261d
rbp = 0x000001ed80334000 rbx = 0x000001ed85886160 rcx = 0x00007ffd36ed4d48
rdi = 0x000001ed80b1d1b0 rdx = 0x0000000000000000 rip = 0x00007ffd1ce76791
rsi = 0x000000c0249bed10 rsp = 0x000000c0249bebf0
OS|Windows NT|10.0.22621
CPU|amd64|family 6 model 186 stepping 2|6
Crash|EXCEPTION_BREAKPOINT|0x00007ffd1ce76791|26
26|0|xul.dll|LateWriteObserver::Observe(mozilla::IOInterposeObserver::Observation&)|hg:hg.mozilla.org/mozilla-central:xpcom/build/LateWriteChecks.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|119|0x91
26|1|xul.dll|mozilla::IOInterposer::Report(mozilla::IOInterposeObserver::Observation&)|hg:hg.mozilla.org/mozilla-central:xpcom/build/IOInterposer.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|479|0x1f1
26|2|xul.dll|mozilla::IOInterposeObserver::Observation::Report()|hg:hg.mozilla.org/mozilla-central:xpcom/build/IOInterposer.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|386|0x43
26|3|xul.dll|(anonymous namespace)::InterposedNtWriteFile(void*, void*, void (*)(void*, _IO_STATUS_BLOCK*, unsigned long), void*, _IO_STATUS_BLOCK*, void*, unsigned long, _LARGE_INTEGER*, unsigned long*)|hg:hg.mozilla.org/mozilla-central:xpcom/build/PoisonIOInterposerWin.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|358|0x1ec
26|4|xul.dll|std::sys::windows::handle::Handle::synchronous_write()|git:github.com/rust-lang/rust:library/std/src/sys/windows/handle.rs:82e1608dfa6e0b5569232559e3d385fea5a93112|286|0x89
26|5|xul.dll|std::fs::write::inner()|git:github.com/rust-lang/rust:library/std/src/fs.rs:82e1608dfa6e0b5569232559e3d385fea5a93112|333|0xc6
26|6|xul.dll|rkv::backend::impl_safe::environment::EnvironmentImpl::write_to_disk()|hg:hg.mozilla.org/mozilla-central:third_party/rust/rkv/src/backend/impl_safe/environment.rs:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|223|0x331
26|7|xul.dll|rkv::backend::impl_safe::transaction::impl$4::commit(rkv::backend::impl_safe::transaction::RwTransactionImpl)|hg:hg.mozilla.org/mozilla-central:third_party/rust/rkv/src/backend/impl_safe/transaction.rs:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|193|0x264
26|8|xul.dll|kvstore::task::impl$11::run(kvstore::task::DeleteTask*)|hg:hg.mozilla.org/mozilla-central:toolkit/components/kvstore/src/task.rs:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|574|0x2af
26|9|xul.dll|async_task::raw::RawTask<enum2$<moz_task::impl$2::dispatch_with_options::async_block_env$0>,alloc::boxed::Box<dyn$<moz_task::Task,core::marker::Send,core::marker::Sync>,alloc::alloc::Global>,moz_task::executor::impl$5::spawn_onto::closure_env$0<enum2$<moz_task::impl$2::dispatch_with_options::async_block_env$0> > >::run<enum2$<moz_task::impl$2::dispatch_with_options::async_block_env$0>,alloc::boxed::Box<dyn$<moz_task::Task,core::marker::Send,core::marker::Sync>,alloc::alloc::Global>,moz_task::executor::impl$5::spawn_onto::closure_env$0<enum2$<moz_task::impl$2::dispatch_with_options::async_block_env$0> > >(tuple$<>*)|hg:hg.mozilla.org/mozilla-central:third_party/rust/async-task/src/raw.rs:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|511|0x56
26|10|xul.dll|moz_task::dispatcher::impl$4::allocate::Run<moz_task::executor::schedule::closure_env$1>(xpcom::interfaces::idl::nsIRunnable*)|hg:hg.mozilla.org/mozilla-central:xpcom/rust/moz_task/src/dispatcher.rs:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|16|0xd1
26|11|xul.dll|mozilla::TaskQueue::Runner::Run()|hg:hg.mozilla.org/mozilla-central:xpcom/threads/TaskQueue.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|257|0x367
26|12|xul.dll|nsThreadPool::Run()|hg:hg.mozilla.org/mozilla-central:xpcom/threads/nsThreadPool.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|341|0x7df
26|13|xul.dll|nsThread::ProcessNextEvent(bool, bool*)|hg:hg.mozilla.org/mozilla-central:xpcom/threads/nsThread.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|1193|0xa68
26|14|xul.dll|NS_ProcessNextEvent(nsIThread*, bool)|hg:hg.mozilla.org/mozilla-central:xpcom/threads/nsThreadUtils.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|480|0x44
26|15|xul.dll|mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*)|hg:hg.mozilla.org/mozilla-central:ipc/glue/MessagePump.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|300|0xad
26|16|xul.dll|MessageLoop::RunHandler()|hg:hg.mozilla.org/mozilla-central:ipc/chromium/src/base/message_loop.cc:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|363|0x4f
26|17|xul.dll|MessageLoop::Run()|hg:hg.mozilla.org/mozilla-central:ipc/chromium/src/base/message_loop.cc:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|345|0x6e
26|18|xul.dll|nsThread::ThreadFunc(void*)|hg:hg.mozilla.org/mozilla-central:xpcom/threads/nsThread.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|370|0x15a
26|19|nss3.dll|_PR_NativeRunThread(void*)|hg:hg.mozilla.org/mozilla-central:nsprpub/pr/src/threads/combined/pruthr.c:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|399|0x120
26|20|nss3.dll|pr_root(void*)|hg:hg.mozilla.org/mozilla-central:nsprpub/pr/src/md/windows/w95thred.c:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|139|0x10
26|21|ucrtbase.dll||||
26|22|KERNELBASE.dll||||
26|23|kernel32.dll||||
26|24|ucrtbase.dll||||
26|25|mozglue.dll|patched_BaseThreadInitThunk(int, void*, void*)|hg:hg.mozilla.org/mozilla-central:toolkit/xre/dllservices/mozglue/WindowsDllBlocklist.cpp:9ea90dc23395fc3e2d9d2bd18598cc49434eeff1|561|0x74
26|26|ntdll.dll||||
26|27|KERNELBASE.dll||||
Reporter | ||
Comment 1•1 year ago
|
||
Reporter | ||
Comment 2•1 year ago
|
||
Comment 3•1 year ago
|
||
It looks like this is a late write from some code using the Rust RKV data store. Maybe this is from Glean because that's what seems to have motivated the most recent updates to this library.
Comment 4•1 year ago
|
||
I'll speculatively move this to Telemetry.
Comment 5•1 year ago
|
||
Glean doesn't use kvstore
to talk to rkv
(we go direct). Notification db, settings sync, and Extensions (Scripting and Permissions) use it (though the latter doesn't use it on all channels (bug 1646182))... Not sure which is the most likely culprit, but I remember from previous adventures (bug 1807010, bug 1805427) that kvstore
itself is in Toolkit :: Storage
, so that's where I choose to send this next.
Comment 6•1 year ago
|
||
Verified bug as reproducible on mozilla-central 20240109162901-b29d6ace45f4.
The bug appears to have been introduced in the following build range:
Start: 93db8f4d488c7abd7f9dc663ae8061964260d028 (20230501062357)
End: a316042804c7811afb12c9fa73aa35c08c70f587 (20230501093846)
Pushlog: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=93db8f4d488c7abd7f9dc663ae8061964260d028&tochange=a316042804c7811afb12c9fa73aa35c08c70f587
Comment 7•1 year ago
|
||
This bug has been marked as a regression. Setting status flag for Nightly to affected
.
Comment 8•1 year ago
|
||
Marco, can we get a Severity applied to this, please?
Also, soft freeze for Fx123 is this Thursday. Is this something we feel comfortable letting go to Beta?
Comment 9•1 year ago
•
|
||
I'd love to be able to give better hints, but kvstore is in Storage mostly for historical reasons, thus I may not be the best person to evaluate this.
This is marked as a regression, but it points to an empty pushlog, making it hard to evaluate which component is this related to, and whether something effectively regressed, or if this is just an intermittent failure that happens more often (different code scheduling due to other changes?).
Jason, Is that empty pushlog expected?
In lack of such data, there's also no data suggesting this would be particularly problematic in Beta.
The first thing I can suggest here, is we need a way to tell which kvstore consumer is writing when a crash happens, by annotating the crash accordingly, that sounds a bit similar to Bug 1834046, where I started a discussion with :gsvelto, and after some crash reporter annotations refactorings (that I think are happening these days) it should be feasible.
Jens do you have any additional insight? Do you know who may still be developing kvstore after Porof left?
It may also be worth thinking about the direction, if we don't have owners of this code, is it still something we suggest to use? Is any of the current consumers interested into taking over kvstore maintenance? Or should we start investigation over alternatives (e.g. a SQLite based kvstore, redb, bolt)?
Maybe :saschanaz can help, being one of the rkv create owners.
Comment 10•1 year ago
•
|
||
we need a way to tell which kvstore consumer is writing when a crash happens
Let's see if pernosco can help us.
Comment 11•1 year ago
|
||
Jens do you have any additional insight? Do you know who may still be developing kvstore after Porof left?
We took over maintenance for rkv but are not doing any active development right now. Given that it mostly works, not sure if we would want to spend the effort to think about alternatives, though.
Comment 13•1 year ago
|
||
Ah yes, InterposedNtWriteFile
is Windows only.
Assignee | ||
Comment 14•1 year ago
|
||
I can confirm the crash from mozregression, let me bisect to get some real range.
Assignee | ||
Comment 15•1 year ago
|
||
Manual bisect says bug 1826224, but not 100% sure as this is a shutdown phase crash that depends a lot on timing; closing too early or closing too late does not trigger crash. I'll double check before adding the regressor.
Assignee | ||
Comment 16•1 year ago
|
||
The second bisection on a different device also points to bug 1826224.
Assignee | ||
Comment 17•1 year ago
•
|
||
The method:
- Download the testcase of comment #2 and serve the folder by either npm
serve
package orpython -m http.server
. - Run Firefox debug build and open the page
- You'll see:
2024-01-16T22:03:30.598000: INFO : b'JavaScript error: http://localhost:3000/sw.js, line 13: TypeError: ServiceWorkerRegistration.showNotification: No active worker for scope http://localhost:3000/.'
2024-01-16T22:03:30.601000: INFO : b'[WARN rkv::backend::impl_safe::environment] `load_ratio()` is irrelevant for this storage backend.'
(... which means we should probably remove load_ratio call from kvstore`. But anyway.)
- Wait ~10 seconds and you'll see another load_ratio warning. Then close the page and Firefox.
- You'll see the crash
Bug 1826224 added a pref timer.auto_increase_timer_resolution
but disabling it does not prevent the crash.
Assignee | ||
Comment 18•1 year ago
•
|
||
The testcase repeats showNotification via setInterval which never really appears to the UI, but does it go to the DB? 🤔
Edit: Yes, getNotification()
returns a lot of notifications. What's the point when it never appears at all, though? Chrome behaves same if the permission is given, hmm. showNotification should not put anything to DB without permission, so at least that's a problem, but nothing to do with the crash itself.
Comment 19•1 year ago
|
||
Set release status flags based on info from the regressing bug 1826224
Comment 20•1 year ago
•
|
||
The timers resolution fix is Nightly only, but according to comment 17, disabling the pref doesn't revert this crash, sounds like timing issue.
It looks like the notifications component is not using a shutdown blocker to ensure writes are complete? I can't see any shutdown handling in https://searchfox.org/mozilla-central/source/dom/notification/new/NotificationDB.sys.mjs, apart from a check in init()
I also don't see shudown handling in kvstore?
Comment 21•1 year ago
|
||
(In reply to Marco Bonardo [:mak] from comment #20)
The timers resolution fix is Nightly only, but according to comment 17, disabling the pref doesn't revert this crash, sounds like timing issue.
It looks like the notifications component is not using a shutdown blocker to ensure writes are complete? I can't see any shutdown handling in https://searchfox.org/mozilla-central/source/dom/notification/new/NotificationDB.sys.mjs, apart from a check in init()
I also don't see shudown handling in kvstore?
Good hint. I think there is some form of shutdown handling here but IIUC unregisterListeners
does not wait for an eventual queue of tasks
to be drained before we let continue the shutdown?
Comment 22•1 year ago
|
||
An easy thing to do might be to add some check for _shutdownInProgress
inside runNextTask
, if we are not interested in saving them anymore. Otherwise we need to wait somewhere and the cleanest way to do so would be to use an async shutdown blocker, IMHO. While we are here we could also think about using an earlier phase for that blocker than XPCOMShutdown
.
Reporter | ||
Comment 23•1 year ago
•
|
||
This is marked as a regression, but it points to an empty pushlog, making it hard to evaluate which component is this related to, and whether something effectively regressed, or if this is just an intermittent failure that happens more often (different code scheduling due to other changes?).
Jason, Is that empty pushlog expected?
This is actually an issue with the way hg.mozilla.org shows the commit ranges. The full range is:
changeset: 662060:93db8f4d488c
summary: Bug 1823135 - Initial support for NVIDIA RTX Video Super Resolution r=gfx-reviewers,sotaro
changeset: 662061:d12f178c408f
summary: no bug - Bumping Firefox l10n changesets r=release a=l10n-bump DONTBUILD CLOSED TREE
changeset: 662062:a316042804c7
summary: Backed out changeset 93db8f4d488c (bug 1823135) for causing assertion failures on StaticPrefList_gfx.h. CLOSED TREE
I suspect rev a316042804c7 caused issues during bisection.
Comment 24•1 year ago
|
||
:jlink, since you are the author of the regressor, bug 1826224, could you take a look? Also, could you set the severity field?
For more information, please visit BugBot documentation.
Comment 25•1 year ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #22)
While we are here we could also think about using an earlier phase for that blocker than
XPCOMShutdown
.
Yes, "xpcom-shutdown" is arguably way too late. "profile-before-change" is probably the last appropriate phase for a subsystem like this.
Comment 26•1 year ago
|
||
(In reply to Jason Kratzer [:jkratzer] from comment #23)
This is actually an issue with the way hg.mozilla.org shows the commit ranges. The full range is:
changeset: 662060:93db8f4d488c summary: Bug 1823135 - Initial support for NVIDIA RTX Video Super Resolution r=gfx-reviewers,sotaro
This would be even more surprising than the previous range...
Updated•1 year ago
|
Comment 27•1 year ago
|
||
Set release status flags based on info from the regressing bug 1826224
Comment 28•1 year ago
|
||
This regression was marked as being caused by my change (bug 1826224) and I was NI'd but my interpretation of the above discussion is that this is not necessarily the case as there was a problem during the bisection. Is my understanding on that correct?
(It's also interesting to see that the bug doesn't go away when the pref is used to disable the functionality added in that bug - if that's the case but yet this is caused by my change, that would seem like a very useful piece of information.)
Updated•1 year ago
|
Assignee | ||
Comment 29•1 year ago
|
||
(In reply to Justin Link from comment #28)
This regression was marked as being caused by my change (bug 1826224) and I was NI'd but my interpretation of the above discussion is that this is not necessarily the case as there was a problem during the bisection. Is my understanding on that correct?
(It's also interesting to see that the bug doesn't go away when the pref is used to disable the functionality added in that bug - if that's the case but yet this is caused by my change, that would seem like a very useful piece of information.)
There was a problem for the bot but I don't think there was a problem in my bisection. That said, it's probably fair to say this was a hidden existing problem that is discovered by some timing change (not sure why it persists even without the pref though).
Comment 30•1 year ago
|
||
Thanks for the update. I'm going to take a quick second look at my change to try and see if I can see anything that seems like it could be more than just a tiny timing difference when disabled. But yeah, this otherwise seems like a pre-existing issue that was probably bound to be run into at some point.
Assignee | ||
Comment 31•1 year ago
|
||
I can take the notification side issue.
Assignee | ||
Comment 32•1 year ago
|
||
So, it turns out that spamming notification is a great way to freeze the main process:
for (const i of new Array(100000).fill(0)) {
new Notification("foo");
}
Running this will instantly make the main process unresponsive. Adding shutdown blocker is needed anyway, but I think the core issue is that we allow this kind of IPC spam. But then perhaps this is not specific to notification, maybe IDB can do the same spam.
Assignee | ||
Comment 33•1 year ago
|
||
I think the patch should at least make some difference but couldn't confirm it with python -m grizzly.replay.bugzilla D:\gecko\obj-dbg\dist\bin\firefox.exe 1873229
, somehow it just say "No results detected" even without any fix. I'll try tomorrow.
Assignee | ||
Comment 34•1 year ago
|
||
This goes away when nightly-only MOZ_NEW_NOTIFICATION_STORE is disabled, which gives the same impl for stable/beta. That gives a question, should we just disable this? As argued in bug 1777973 I don't see a good reason to keep this one.
Assignee | ||
Comment 35•1 year ago
|
||
We have two very similar modules right now but with some different behavior:
- With MOZ_NEW_NOTIFICATION_STORE=True we save the DB via kvstore, which has no shutdown blocker and causes crash when being too late.
- With MOZ_NEW_NOTIFICATION_STORE=False we save the DB via IOUtils, which does have its shutdown blocker.
Either way there's no shutdown blocker that ensures processing the whole task queue, which poses risk of losing notification data, which is required when notification is clicked by user. This patch adds that.
Unfortunately we don't have a test here as I don't think we have infra for shutdown blocking test.
Comment 36•1 year ago
|
||
Comment 37•1 year ago
|
||
bugherder |
Comment 38•1 year ago
|
||
Verified bug as fixed on rev mozilla-central 20240229044455-f21a5f54b6ec.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.
Comment 39•1 year ago
|
||
The patch landed in nightly and beta is affected.
:saschanaz, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- If no, please set
status-firefox124
towontfix
.
For more information, please visit BugBot documentation.
Assignee | ||
Comment 40•1 year ago
|
||
- This is pretty edge case
- The crash itself only affects the nightly-only implementation
- Even if we upstream this there are many more issues that prevents proper notification DB handling and thus it won't make much difference
Thus I say wontfix.
Updated•8 months ago
|
Description
•