Crash in [@ mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList]
Categories
(Core :: DOM: Workers, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr91 | --- | unaffected |
firefox-esr102 | --- | unaffected |
firefox104 | --- | unaffected |
firefox105 | --- | unaffected |
firefox106 | --- | wontfix |
firefox107 | --- | fixed |
People
(Reporter: aryx, Assigned: yulia)
References
(Regression)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(3 files, 1 obsolete file)
3 crashes from 2+ machines, first affected build is 106.0a1 20220822190304.
Crash report: https://crash-stats.mozilla.org/report/index/0491de6e-32e3-4aa5-b89b-1bf550220823
Reason: EXCEPTION_ACCESS_VIOLATION_READ
Top 10 frames of crashing thread:
0 xul.dll mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList dom/workers/ScriptLoader.cpp:897
1 xul.dll mozilla::dom::workerinternals::loader::NetworkLoadHandler::OnStreamComplete dom/workers/loader/NetworkLoadHandler.cpp:57
2 xul.dll mozilla::net::nsStreamLoader::OnStopRequest netwerk/base/nsStreamLoader.cpp:85
3 xul.dll mozilla::net::nsStreamListenerTee::OnStopRequest netwerk/base/nsStreamListenerTee.cpp:51
4 xul.dll mozilla::net::nsHTTPCompressConv::OnStopRequest netwerk/streamconv/converters/nsHTTPCompressConv.cpp:181
5 xul.dll std::_Func_impl_no_alloc<`lambda at /builds/worker/checkouts/gecko/netwerk/protocol/http/HttpChannelChild.cpp:786:13', void>::_Do_call
6 xul.dll mozilla::net::ChannelEventQueue::FlushQueue netwerk/ipc/ChannelEventQueue.cpp:94
7 xul.dll mozilla::net::ChannelEventQueue::ResumeInternal::CompleteResumeRunnable::Run netwerk/ipc/ChannelEventQueue.cpp:152
8 xul.dll mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:851
9 xul.dll mozilla::TaskController::ProcessPendingMTTask xpcom/threads/TaskController.cpp:461
Updated•2 years ago
|
Assignee | ||
Comment 1•2 years ago
|
||
This is likely due to the promises not being rejected, and everything being cleaned up before they complete. I have a patch for this in bug 1783190, I will see if this fixes the problem.
Updated•2 years ago
|
Comment 2•2 years ago
|
||
The bug is marked as tracked for firefox106 (nightly). However, the bug still isn't assigned.
:jstutte, could you please find an assignee for this tracked bug? Given that it is a regression and we know the cause, we could also simply backout the regressor. If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit auto_nag documentation.
Assignee | ||
Updated•2 years ago
|
Updated•2 years ago
|
Comment 3•2 years ago
|
||
Yulia, will you have a patch up before we hit beta? Thanks
Assignee | ||
Comment 4•2 years ago
|
||
If this is related to faulty cancellation (which I believe it might be), then hopefully it will be fixed by https://phabricator.services.mozilla.com/D154382. However I am not certain, and I am waiting on review there.
Assignee | ||
Comment 5•2 years ago
|
||
It has landed. I will monitor it over the next few days to see if it resolves the crashes as expected.
Comment 6•2 years ago
|
||
Yulia, there was no change on crash volume since the patch in bug 1783190 landed.
Comment 7•2 years ago
|
||
Apparently mWorkerRef
is null when we enter WorkerScriptLoader::DispatchMaybeMoveToLoadedList
which leads to a ' nullptr' deref here when executing aScriptLoader.mWorkerRef->Private()
. Be aware that we may access this also earlier here. And in general I see the pattern mWorkerRef->Private()->Xxxx()
quite often, I assume we should be more careful here.
Comment 8•2 years ago
|
||
Please note that the volume of crashes is significant and that we merge to beta next Monday, would be nice to have a fix by then :)
Comment 9•2 years ago
|
||
Just noting that Yulia is on PTO. Andrew, can you give this a look? I can plaster the code with null-checks, but I am not sure if that is what we actually want...
Comment 10•2 years ago
|
||
I think we're looking at the problem where cancellation causes a bunch of state machines to slowly grind to a stop, still generating callbacks, and one of those callbacks doesn't have an idempotent "are we already cancelled? then let's just ignore this" because it's very much not obvious that this is a thing that could happen.
So for NetworkLoadHandler::DataReceivedFromNetwork we have the important IsCancelled
check that will return true exactly when mWorkerRef
is gone.
nsresult NetworkLoadHandler::DataReceivedFromNetwork(nsIStreamLoader* aLoader,
nsresult aStatus,
uint32_t aStringLen,
const uint8_t* aString) {
AssertIsOnMainThread();
if (mLoader->IsCancelled()) {
return mLoader->mCancelMainThread.ref();
}
But for NetworkLoadHandler::OnStreamComplete just above it we do not:
NS_IMETHODIMP
NetworkLoadHandler::OnStreamComplete(nsIStreamLoader* aLoader,
nsISupports* aContext, nsresult aStatus,
uint32_t aStringLen,
const uint8_t* aString) {
nsresult rv = DataReceivedFromNetwork(aLoader, aStatus, aStringLen, aString);
return mLoader->OnStreamComplete(mLoadContext->mRequest, rv);
}
I think we probably want a similar check there in OnStreamComplete or in WorkerScriptLoader::OnStreamComplete which it calls:
nsresult WorkerScriptLoader::OnStreamComplete(ScriptLoadRequest* aRequest,
nsresult aStatus) {
AssertIsOnMainThread();
LoadingFinished(aRequest, aStatus);
return NS_OK;
}
I would favor doing so in NetworkLoadHandler as a first try as it makes a more consistent/clear boundary, but the reality is that the cache loader calls it to per this searchfox search. However, that's not this crash, so a the network handler is potentially a good first try.
If you want to try your hand at a patch, it would be appreciated as I'm going to turn in now, but I can also try and formulate that patch and do various try runs tomorrow morning, although I may end up doing some TPAC juggling.
Comment 11•2 years ago
|
||
Comment 12•2 years ago
|
||
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b246c998fb8f Add IsCancelled checks to NetworkLoadHandler::OnStreamComplete and have a GetCancelResult r=dom-worker-reviewers,asuth
Comment 13•2 years ago
|
||
Backed out changeset b246c998fb8f (Bug 1786571) for causing ScriptLoader failures CLOSED TREE
Log: https://treeherder.mozilla.org/logviewer?job_id=390551940&repo=autoland&lineNumber=4348
Backout: https://hg.mozilla.org/integration/autoland/rev/126037147ea0aeea7376786603b4884ef8cc5dae
Comment 14•2 years ago
|
||
Hi Andrew, we remove a cancel and get a hang, it seems.
Comment 15•2 years ago
•
|
||
So from the pernosco session we see that there is a strong worker ref that is never unset. For the scriptloader in question we never called WorkerScriptLoader::TryShutdown
or ::ShutdownScriptLoader
which seems to be the only place where we unset the worker ref. Instead we called WorkerScriptLoader::CancelMainThreadWithBindingAborted
.
Comment 16•2 years ago
|
||
Pushed by bugmail@asutherland.org: https://hg.mozilla.org/integration/autoland/rev/374cb9c09095 Add IsCancelled checks to NetworkLoadHandler::OnStreamComplete and have a GetCancelResult r=dom-worker-reviewers,asuth
Comment 17•2 years ago
|
||
Backed out changeset for causing several service_worker related regressions.
Failure logs:
- xpcshell: https://treeherder.mozilla.org/logviewer?job_id=390648925&repo=autoland
- devtools crashes: https://treeherder.mozilla.org/logviewer?job_id=390675417&repo=autoland and https://treeherder.mozilla.org/logviewer?job_id=390670496&repo=autoland
Backout link: https://hg.mozilla.org/integration/autoland/rev/4cf5fa9a53b93e62252d9fa378d3cd1227043d92
Comment 18•2 years ago
•
|
||
(In reply to Cosmin Sabou [:CosminS] from comment #17)
So the devtools crashes seem to be something quite obvious: If we kick off successful some LoadScript
and then fail on the next, we never initialize mCacheCreator
for any of them. Actually I assume that the current order of things can even lead to a race in the success case if CacheLoadHandler::OnStreamComplete
is called earlier than we reach to loadContext->SetCacheCreator(cacheCreator);
. It seems to me we should initialize the Edit: We never initialize it when CacheCreator
earlier and set it for each loadContext
before calling LoadScript
.!mWorkerRef->Private()->IsServiceWorker() || IsDebuggerScript()
, it seems.
The xpcshell failure is a timeout where the log does not tell us much, but it happens with networking on socket process, such that it would not affect release for now.
Comment 19•2 years ago
|
||
Argh, right, the first block of code is specialized for the non-SW case so we'd probably need some conceptually similar additions for the SW cases.
I'll do a deeper dive next week. On the upside, we now have a lot of material for our specific list of load permutations with specific tests (sometimes in sequence) that can cause them!
Comment 20•2 years ago
|
||
FWIW I added two paper-over checks for the cancel case, try run: https://treeherder.mozilla.org/jobs?repo=try&revision=14d68daa0594eaccff0ba39c6d9c145a7a61e8fc
Reporter | ||
Updated•2 years ago
|
Comment 21•2 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #20)
FWIW I added two paper-over checks for the cancel case, try run: https://treeherder.mozilla.org/jobs?repo=try&revision=14d68daa0594eaccff0ba39c6d9c145a7a61e8fc
Looks not bad, but I wait for your analysis before making further steps.
Comment 22•2 years ago
|
||
This is the new signature for this crash w/ inlined frames.
Comment 23•2 years ago
•
|
||
(In reply to Jens Stutte [:jstutte] from comment #21)
(In reply to Jens Stutte [:jstutte] from comment #20)
FWIW I added two paper-over checks for the cancel case, try run: https://treeherder.mozilla.org/jobs?repo=try&revision=14d68daa0594eaccff0ba39c6d9c145a7a61e8fc
Looks not bad, but I wait for your analysis before making further steps.
Can you update the patch in phabricator to make it easier to see the changes? I think I've manually/mentally diffed the changes to just the guards on the cache creator which makes sense for the first linked backout crash, but I'm not sure I understand what changes would address the second linked backout crash where the CachePromiseHandler::ResolvedCallback calls into MaybeExecuteFinishedScripts. I presume if we do something to avoid getting to CachePromiseHandler's creation in NetworkLoadHandler::PrepareForRequest then that will fix it, but I'm not sure what changed to ensure that. If you could clarify, that would be great, thank you! I've triggered a bunch more of the dt2 tests on try to give some extra confidence about the replication on that.
Comment 24•2 years ago
|
||
Thanks for having updated the phab patch; it was as I'd mentally diffed. I re-triggered a truly ridiculous number of the dt2 jobs and they all seemed okay... maybe we should just try and land the revised version? I'll re-approve the phabricator patch for clarity.
Comment 25•2 years ago
|
||
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/49661cdd3938 Add IsCancelled checks to NetworkLoadHandler::OnStreamComplete and have a GetCancelResult r=dom-worker-reviewers,asuth
Updated•2 years ago
|
Comment 26•2 years ago
|
||
bugherder |
Comment 27•2 years ago
|
||
The patch landed in nightly and beta is affected.
:jstutte, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.Also, don't forget to request an uplift for the patches in the regression caused by this fix.
- If no, please set
status-firefox106
towontfix
.
For more information, please visit auto_nag documentation.
Comment 28•2 years ago
|
||
Probably yes, but I think we should wait a few days in nightly at least?
Reporter | ||
Comment 29•2 years ago
|
||
New Nightly builds still submit crash reports with this signature, e.g. bp-732014ba-a732-4c13-88c9-213350220922
Comment 30•2 years ago
|
||
(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #29)
New Nightly builds still submit crash reports with this signature, e.g. bp-732014ba-a732-4c13-88c9-213350220922
IIUC that stack shows a promise rejection as consequence of a cycle collection. I think we should just null check the worker ref here?
Comment 31•2 years ago
|
||
Comment 32•2 years ago
|
||
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 10 desktop browser crashes on nightly
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 33•2 years ago
|
||
This largely keeps in tact what jstutte did. The initial crash was fixed by eagerly calling
LoadingFinished. The second crash is caused because we call it twice, and only in the service worker
case, where we call it once the promise rejects. Now, we check if we have cancelled, and if we have
then we don't call the scriptLoader methods from inside of the load handlers. LoadHandlers now only
use OnStreamComplete if they are "successful" -- that is, if they were not cancelled.
OnStreamComplete retains its assertion error in the case that something was cancelled and we somehow
ended up there. In a follow up, I will clean up the friend classes of the ScriptLoader so you can't
easily access these methods from the LoadHandlers.
Assignee | ||
Comment 34•2 years ago
|
||
I was always a bit wary of the number of friend classes that ScriptLoader had, and I am finally
going ahead and removing them, as they should only use methods that are "safe" from an external
perspective. If there is a better way to do this I am quite open to that.
Depends on D158262
Assignee | ||
Comment 35•2 years ago
|
||
I've posted two patches that should resolve the new crash. They replace jstutte's second patch.
Comment 36•2 years ago
|
||
Pushed by ystartsev@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b9c694a6b63f Do not call LoadingFinished from handlers when cancelled; r=asuth
Updated•2 years ago
|
Comment 37•2 years ago
|
||
bugherder |
Updated•2 years ago
|
Comment 38•2 years ago
|
||
FWIW, I don't see any change on crash volume since the fix landed on Nightly.
Assignee | ||
Comment 39•2 years ago
|
||
Yes, this should be reopened.
Updated•2 years ago
|
Assignee | ||
Comment 40•2 years ago
|
||
The cancellation work seems to have actually made this worse.. Looking into it again.
Comment 41•2 years ago
|
||
Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 42•1 year ago
|
||
It looks like this is, at long last, fixed.
Comment 43•11 months ago
|
||
:yoshi, do you happen to know if the remaining patch here is still meaningful after bug 1800496 has landed? In case I'd propose to move it to a new bug and de-bitrot it.
https://phabricator.services.mozilla.com/D158263 looks like a nice clean-up (remove friend classes), we should move it to a new file to land it.
Description
•