Closed Bug 1786571 Opened 2 years ago Closed 1 year ago

Crash in [@ mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList]

Categories

(Core :: DOM: Workers, defect)

defect

Tracking

()

RESOLVED FIXED
107 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox-esr102 --- unaffected
firefox104 --- unaffected
firefox105 --- unaffected
firefox106 --- wontfix
firefox107 --- fixed

People

(Reporter: aryx, Assigned: yulia)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(3 files, 1 obsolete file)

3 crashes from 2+ machines, first affected build is 106.0a1 20220822190304.

Crash report: https://crash-stats.mozilla.org/report/index/0491de6e-32e3-4aa5-b89b-1bf550220823

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0 xul.dll mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList dom/workers/ScriptLoader.cpp:897
1 xul.dll mozilla::dom::workerinternals::loader::NetworkLoadHandler::OnStreamComplete dom/workers/loader/NetworkLoadHandler.cpp:57
2 xul.dll mozilla::net::nsStreamLoader::OnStopRequest netwerk/base/nsStreamLoader.cpp:85
3 xul.dll mozilla::net::nsStreamListenerTee::OnStopRequest netwerk/base/nsStreamListenerTee.cpp:51
4 xul.dll mozilla::net::nsHTTPCompressConv::OnStopRequest netwerk/streamconv/converters/nsHTTPCompressConv.cpp:181
5 xul.dll std::_Func_impl_no_alloc<`lambda at /builds/worker/checkouts/gecko/netwerk/protocol/http/HttpChannelChild.cpp:786:13', void>::_Do_call 
6 xul.dll mozilla::net::ChannelEventQueue::FlushQueue netwerk/ipc/ChannelEventQueue.cpp:94
7 xul.dll mozilla::net::ChannelEventQueue::ResumeInternal::CompleteResumeRunnable::Run netwerk/ipc/ChannelEventQueue.cpp:152
8 xul.dll mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:851
9 xul.dll mozilla::TaskController::ProcessPendingMTTask xpcom/threads/TaskController.cpp:461
Flags: needinfo?(ystartsev)

This is likely due to the promises not being rejected, and everything being cleaned up before they complete. I have a patch for this in bug 1783190, I will see if this fixes the problem.

The bug is marked as tracked for firefox106 (nightly). However, the bug still isn't assigned.

:jstutte, could you please find an assignee for this tracked bug? Given that it is a regression and we know the cause, we could also simply backout the regressor. If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)
Assignee: nobody → ystartsev
Flags: needinfo?(ystartsev)
Flags: needinfo?(jstutte)

Yulia, will you have a patch up before we hit beta? Thanks

Flags: needinfo?(ystartsev)

If this is related to faulty cancellation (which I believe it might be), then hopefully it will be fixed by https://phabricator.services.mozilla.com/D154382. However I am not certain, and I am waiting on review there.

It has landed. I will monitor it over the next few days to see if it resolves the crashes as expected.

Yulia, there was no change on crash volume since the patch in bug 1783190 landed.

Apparently mWorkerRef is null when we enter WorkerScriptLoader::DispatchMaybeMoveToLoadedList which leads to a ' nullptr' deref here when executing aScriptLoader.mWorkerRef->Private(). Be aware that we may access this also earlier here. And in general I see the pattern mWorkerRef->Private()->Xxxx() quite often, I assume we should be more careful here.

Please note that the volume of crashes is significant and that we merge to beta next Monday, would be nice to have a fix by then :)

Just noting that Yulia is on PTO. Andrew, can you give this a look? I can plaster the code with null-checks, but I am not sure if that is what we actually want...

Flags: needinfo?(ystartsev) → needinfo?(bugmail)

I think we're looking at the problem where cancellation causes a bunch of state machines to slowly grind to a stop, still generating callbacks, and one of those callbacks doesn't have an idempotent "are we already cancelled? then let's just ignore this" because it's very much not obvious that this is a thing that could happen.

So for NetworkLoadHandler::DataReceivedFromNetwork we have the important IsCancelled check that will return true exactly when mWorkerRef is gone.

nsresult NetworkLoadHandler::DataReceivedFromNetwork(nsIStreamLoader* aLoader,
                                                     nsresult aStatus,
                                                     uint32_t aStringLen,
                                                     const uint8_t* aString) {
  AssertIsOnMainThread();

  if (mLoader->IsCancelled()) {
    return mLoader->mCancelMainThread.ref();
  }

But for NetworkLoadHandler::OnStreamComplete just above it we do not:

NS_IMETHODIMP
NetworkLoadHandler::OnStreamComplete(nsIStreamLoader* aLoader,
                                     nsISupports* aContext, nsresult aStatus,
                                     uint32_t aStringLen,
                                     const uint8_t* aString) {
  nsresult rv = DataReceivedFromNetwork(aLoader, aStatus, aStringLen, aString);
  return mLoader->OnStreamComplete(mLoadContext->mRequest, rv);
}

I think we probably want a similar check there in OnStreamComplete or in WorkerScriptLoader::OnStreamComplete which it calls:

nsresult WorkerScriptLoader::OnStreamComplete(ScriptLoadRequest* aRequest,
                                              nsresult aStatus) {
  AssertIsOnMainThread();

  LoadingFinished(aRequest, aStatus);
  return NS_OK;
}

I would favor doing so in NetworkLoadHandler as a first try as it makes a more consistent/clear boundary, but the reality is that the cache loader calls it to per this searchfox search. However, that's not this crash, so a the network handler is potentially a good first try.

If you want to try your hand at a patch, it would be appreciated as I'm going to turn in now, but I can also try and formulate that patch and do various try runs tomorrow morning, although I may end up doing some TPAC juggling.

Flags: needinfo?(bugmail)
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b246c998fb8f
Add IsCancelled checks to NetworkLoadHandler::OnStreamComplete and have a GetCancelResult r=dom-worker-reviewers,asuth

Hi Andrew, we remove a cancel and get a hang, it seems.

https://pernos.co/debug/58optICDQn5Xuoe7am5Dig/index.html

Flags: needinfo?(jstutte) → needinfo?(bugmail)

So from the pernosco session we see that there is a strong worker ref that is never unset. For the scriptloader in question we never called WorkerScriptLoader::TryShutdown or ::ShutdownScriptLoader which seems to be the only place where we unset the worker ref. Instead we called WorkerScriptLoader::CancelMainThreadWithBindingAborted.

Pushed by bugmail@asutherland.org:
https://hg.mozilla.org/integration/autoland/rev/374cb9c09095
Add IsCancelled checks to NetworkLoadHandler::OnStreamComplete and have a GetCancelResult r=dom-worker-reviewers,asuth
Regressions: 1791187

(In reply to Cosmin Sabou [:CosminS] from comment #17)

So the devtools crashes seem to be something quite obvious: If we kick off successful some LoadScript and then fail on the next, we never initialize mCacheCreator for any of them. Actually I assume that the current order of things can even lead to a race in the success case if CacheLoadHandler::OnStreamComplete is called earlier than we reach to loadContext->SetCacheCreator(cacheCreator);. It seems to me we should initialize the CacheCreator earlier and set it for each loadContext before calling LoadScript. Edit: We never initialize it when !mWorkerRef->Private()->IsServiceWorker() || IsDebuggerScript(), it seems.

The xpcshell failure is a timeout where the log does not tell us much, but it happens with networking on socket process, such that it would not affect release for now.

Flags: needinfo?(jstutte)

Argh, right, the first block of code is specialized for the non-SW case so we'd probably need some conceptually similar additions for the SW cases.

I'll do a deeper dive next week. On the upside, we now have a lot of material for our specific list of load permutations with specific tests (sometimes in sequence) that can cause them!

Flags: needinfo?(bugmail)
Crash Signature: [@ mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList] → [@ mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList] [@ RefPtr<T>::get | RefPtr<T>::operator-> | mozilla::dom::ThreadSafeWorkerRef::Private ]

(In reply to Jens Stutte [:jstutte] from comment #20)

FWIW I added two paper-over checks for the cancel case, try run: https://treeherder.mozilla.org/jobs?repo=try&revision=14d68daa0594eaccff0ba39c6d9c145a7a61e8fc

Looks not bad, but I wait for your analysis before making further steps.

Crash Signature: [@ mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList] [@ RefPtr<T>::get | RefPtr<T>::operator-> | mozilla::dom::ThreadSafeWorkerRef::Private ] → [@ mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList]
Flags: needinfo?(bugmail)

This is the new signature for this crash w/ inlined frames.

Crash Signature: [@ mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList] → [@ mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList] [@ RefPtr<T>::get | RefPtr<T>::operator-> | mozilla::dom::ThreadSafeWorkerRef::Private]

(In reply to Jens Stutte [:jstutte] from comment #21)

(In reply to Jens Stutte [:jstutte] from comment #20)

FWIW I added two paper-over checks for the cancel case, try run: https://treeherder.mozilla.org/jobs?repo=try&revision=14d68daa0594eaccff0ba39c6d9c145a7a61e8fc

Looks not bad, but I wait for your analysis before making further steps.

Can you update the patch in phabricator to make it easier to see the changes? I think I've manually/mentally diffed the changes to just the guards on the cache creator which makes sense for the first linked backout crash, but I'm not sure I understand what changes would address the second linked backout crash where the CachePromiseHandler::ResolvedCallback calls into MaybeExecuteFinishedScripts. I presume if we do something to avoid getting to CachePromiseHandler's creation in NetworkLoadHandler::PrepareForRequest then that will fix it, but I'm not sure what changed to ensure that. If you could clarify, that would be great, thank you! I've triggered a bunch more of the dt2 tests on try to give some extra confidence about the replication on that.

Flags: needinfo?(bugmail) → needinfo?(jstutte)

Thanks for having updated the phab patch; it was as I'd mentally diffed. I re-triggered a truly ridiculous number of the dt2 jobs and they all seemed okay... maybe we should just try and land the revised version? I'll re-approve the phabricator patch for clarity.

Flags: needinfo?(jstutte)
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/49661cdd3938
Add IsCancelled checks to NetworkLoadHandler::OnStreamComplete and have a GetCancelResult r=dom-worker-reviewers,asuth
Assignee: ystartsev → jstutte
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 107 Branch

The patch landed in nightly and beta is affected.
:jstutte, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.Also, don't forget to request an uplift for the patches in the regression caused by this fix.
  • If no, please set status-firefox106 to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)

Probably yes, but I think we should wait a few days in nightly at least?

Flags: needinfo?(jstutte) → needinfo?(bugmail)

New Nightly builds still submit crash reports with this signature, e.g. bp-732014ba-a732-4c13-88c9-213350220922

Status: RESOLVED → REOPENED
Flags: needinfo?(jstutte)
Resolution: FIXED → ---

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #29)

New Nightly builds still submit crash reports with this signature, e.g. bp-732014ba-a732-4c13-88c9-213350220922

IIUC that stack shows a promise rejection as consequence of a cycle collection. I think we should just null check the worker ref here?

Flags: needinfo?(jstutte)

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

For more information, please visit auto_nag documentation.

Keywords: topcrash

This largely keeps in tact what jstutte did. The initial crash was fixed by eagerly calling
LoadingFinished. The second crash is caused because we call it twice, and only in the service worker
case, where we call it once the promise rejects. Now, we check if we have cancelled, and if we have
then we don't call the scriptLoader methods from inside of the load handlers. LoadHandlers now only
use OnStreamComplete if they are "successful" -- that is, if they were not cancelled.

OnStreamComplete retains its assertion error in the case that something was cancelled and we somehow
ended up there. In a follow up, I will clean up the friend classes of the ScriptLoader so you can't
easily access these methods from the LoadHandlers.

I was always a bit wary of the number of friend classes that ScriptLoader had, and I am finally
going ahead and removing them, as they should only use methods that are "safe" from an external
perspective. If there is a better way to do this I am quite open to that.

Depends on D158262

I've posted two patches that should resolve the new crash. They replace jstutte's second patch.

Pushed by ystartsev@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b9c694a6b63f
Do not call LoadingFinished from handlers when cancelled; r=asuth
Attachment #9295776 - Attachment is obsolete: true
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED

FWIW, I don't see any change on crash volume since the fix landed on Nightly.

Yes, this should be reopened.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee: jstutte → ystartsev

The cancellation work seems to have actually made this worse.. Looking into it again.

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash
Depends on: 1800496
See Also: → 1800496

It looks like this is, at long last, fixed.

Status: REOPENED → RESOLVED
Closed: 2 years ago1 year ago
Resolution: --- → FIXED

:yoshi, do you happen to know if the remaining patch here is still meaningful after bug 1800496 has landed? In case I'd propose to move it to a new bug and de-bitrot it.

Flags: needinfo?(bugmail) → needinfo?(allstars.chh)

https://phabricator.services.mozilla.com/D158263 looks like a nice clean-up (remove friend classes), we should move it to a new file to land it.

Flags: needinfo?(allstars.chh)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: