
Updated

•

3 years ago

Keywords: regression

Assignee

Comment 1

•

3 years ago

This is likely due to the promises not being rejected, and everything being cleaned up before they complete. I have a patch for this in bug 1783190, I will see if this fixes the problem.

Updated

•

3 years ago

tracking-firefox106: --- → +

Comment 2

•

3 years ago

The bug is marked as tracked for firefox106 (nightly). However, the bug still isn't assigned.

:jstutte, could you please find an assignee for this tracked bug? Given that it is a regression and we know the cause, we could also simply back out the regressor. If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)

Assignee

Updated

•

3 years ago

Assignee: nobody → ystartsev

Flags: needinfo?(ystartsev)

Updated

•

3 years ago

Flags: needinfo?(jstutte)

Comment 3

•

3 years ago

Yulia, will you have a patch up before we hit beta? Thanks

Flags: needinfo?(ystartsev)

Assignee

Comment 4

•

3 years ago

If this is related to faulty cancellation (which I believe it might be), then hopefully it will be fixed by https://phabricator.services.mozilla.com/D154382. However I am not certain, and I am waiting on review there.

Assignee

Comment 5

•

3 years ago

It has landed. I will monitor it over the next few days to see if it resolves the crashes as expected.

Comment 6

•

3 years ago

Yulia, there was no change on crash volume since the patch in bug 1783190 landed.

Comment 7

•

3 years ago

Apparently mWorkerRef is null when we enter WorkerScriptLoader::DispatchMaybeMoveToLoadedList, which leads to a nullptr dereference here when executing aScriptLoader.mWorkerRef->Private(). Be aware that we may also access it earlier, here. In general, I see the pattern mWorkerRef->Private()->Xxxx() quite often, so I assume we should be more careful with it.
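To illustrate the kind of care meant here, the following is a minimal sketch of guarding the dereference. All types (`FakeWorkerPrivate`, `FakeWorkerRef`, `FakeLoader`) are hypothetical stand-ins written for this example, not the Gecko classes, and whether a plain null check is the right fix is exactly what is still open in this bug.

```cpp
#include <memory>

// Hypothetical stand-ins for illustration only; these are NOT the Gecko types.
struct FakeWorkerPrivate {
  bool dispatched = false;
  void Dispatch() { dispatched = true; }
};

struct FakeWorkerRef {
  FakeWorkerPrivate priv;
  FakeWorkerPrivate* Private() { return &priv; }
};

struct FakeLoader {
  // May be null once shutdown or cancellation has released the worker.
  std::shared_ptr<FakeWorkerRef> mWorkerRef;

  // Returns true if work was dispatched, false if the ref was already gone.
  bool DispatchMaybeMoveToLoadedList() {
    if (!mWorkerRef) {
      return false;  // Bail out instead of dereferencing a null ref.
    }
    mWorkerRef->Private()->Dispatch();
    return true;
  }
};
```

A check like this avoids the crash but can hide a lifecycle bug, since it does not explain why the callback still fires after the ref was released.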

Comment 8

•

3 years ago

Please note that the volume of crashes is significant and that we merge to beta next Monday. It would be nice to have a fix by then :)

Comment 9

•

3 years ago

Just noting that Yulia is on PTO. Andrew, can you give this a look? I can plaster the code with null-checks, but I am not sure if that is what we actually want...

Flags: needinfo?(ystartsev) → needinfo?(bugmail)

Comment 10

•

3 years ago

I think we're looking at the problem where cancellation causes a bunch of state machines to slowly grind to a stop, still generating callbacks, and one of those callbacks doesn't have an idempotent "are we already cancelled? then let's just ignore this" because it's very much not obvious that this is a thing that could happen.

So for NetworkLoadHandler::DataReceivedFromNetwork we have the important IsCancelled check that will return true exactly when mWorkerRef is gone.

nsresult NetworkLoadHandler::DataReceivedFromNetwork(nsIStreamLoader* aLoader,
                                                     nsresult aStatus,
                                                     uint32_t aStringLen,
                                                     const uint8_t* aString) {
  AssertIsOnMainThread();

  if (mLoader->IsCancelled()) {
    return mLoader->mCancelMainThread.ref();
  }

But for NetworkLoadHandler::OnStreamComplete just above it we do not:

NS_IMETHODIMP
NetworkLoadHandler::OnStreamComplete(nsIStreamLoader* aLoader,
                                     nsISupports* aContext, nsresult aStatus,
                                     uint32_t aStringLen,
                                     const uint8_t* aString) {
  nsresult rv = DataReceivedFromNetwork(aLoader, aStatus, aStringLen, aString);
  return mLoader->OnStreamComplete(mLoadContext->mRequest, rv);
}

I think we probably want a similar check there in OnStreamComplete or in WorkerScriptLoader::OnStreamComplete which it calls:

nsresult WorkerScriptLoader::OnStreamComplete(ScriptLoadRequest* aRequest,
                                              nsresult aStatus) {
  AssertIsOnMainThread();

  LoadingFinished(aRequest, aStatus);
  return NS_OK;
}

I would favor doing so in NetworkLoadHandler as a first try, as it makes for a more consistent/clear boundary, but the reality is that the cache loader calls it too, per this searchfox search. However, that's not this crash, so the network handler is potentially a good first try.
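A minimal sketch of the check proposed above, assuming simplified stand-in types: `FakeScriptLoader`, `FakeNetworkLoadHandler`, the `Result` alias, and the result constants are hypothetical and only mirror the names used in this comment. It is not the patch itself, just the shape of an early return when a late callback arrives after cancellation.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-ins; names echo the discussion but this is not Gecko code.
using Result = int;
constexpr Result kOk = 0;
constexpr Result kBindingAborted = -1;

struct FakeScriptLoader {
  std::optional<Result> mCancelMainThread;  // Set once cancellation happened.
  int finishedCalls = 0;

  bool IsCancelled() const { return mCancelMainThread.has_value(); }

  void Cancel(Result aReason) {
    if (!mCancelMainThread) {
      mCancelMainThread = aReason;  // Keep the first cancel reason.
    }
  }

  Result OnStreamComplete(Result aStatus) {
    (void)aStatus;
    assert(!IsCancelled());  // Handlers must not reach this after a cancel.
    ++finishedCalls;
    return kOk;
  }
};

struct FakeNetworkLoadHandler {
  FakeScriptLoader* mLoader;

  Result OnStreamComplete(Result aStatus) {
    // Late callback after cancellation: report the cancel result and stop.
    if (mLoader->IsCancelled()) {
      return *mLoader->mCancelMainThread;
    }
    return mLoader->OnStreamComplete(aStatus);
  }
};
```

Placing the check in the handler keeps the loader's own OnStreamComplete free to assert that it is never entered after cancellation.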

If you want to try your hand at a patch, it would be appreciated as I'm going to turn in now, but I can also try and formulate that patch and do various try runs tomorrow morning, although I may end up doing some TPAC juggling.

Flags: needinfo?(bugmail)

Comment 11

•

3 years ago

Attached file Bug 1786571 - Add IsCancelled checks to NetworkLoadHandler::OnStreamComplete and have a GetCancelResult r?#dom-worker-reviewers — Details

Push with failures: https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&collapsedPushes=842382&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel&revision=b246c998fb8f72242bb2179d989bd443e2b3d18f&selectedTaskRun=PzVT2qPGQPqPAWY1ZPXUHg.0

Comment 12

•

3 years ago

Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b246c998fb8f Add IsCancelled checks to NetworkLoadHandler::OnStreamComplete and have a GetCancelResult r=dom-worker-reviewers,asuth

Noemi Erli[:noemi_erli]

Comment 13

•

3 years ago

Backed out changeset b246c998fb8f (Bug 1786571) for causing ScriptLoader failures CLOSED TREE

Log: https://treeherder.mozilla.org/logviewer?job_id=390551940&repo=autoland&lineNumber=4348

Backout: https://hg.mozilla.org/integration/autoland/rev/126037147ea0aeea7376786603b4884ef8cc5dae

Flags: needinfo?(jstutte)

https://pernos.co/debug/58optICDQn5Xuoe7am5Dig/index.html

Comment 14

•

3 years ago

Hi Andrew, we remove a cancel and get a hang, it seems.

Flags: needinfo?(jstutte) → needinfo?(bugmail)

Comment 15

•

3 years ago

•

Edited

So from the pernosco session we see that there is a strong worker ref that is never unset. For the scriptloader in question we never called WorkerScriptLoader::TryShutdown or ::ShutdownScriptLoader which seems to be the only place where we unset the worker ref. Instead we called WorkerScriptLoader::CancelMainThreadWithBindingAborted.

Push with failures: https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel&revision=374cb9c090957305c1f22ff8f00ae8e9fc6ee781&selectedTaskRun=JEXjp5iBRkuo9Hr76Pc4QA.0

Comment 16

•

3 years ago

Pushed by bugmail@asutherland.org: https://hg.mozilla.org/integration/autoland/rev/374cb9c09095 Add IsCancelled checks to NetworkLoadHandler::OnStreamComplete and have a GetCancelResult r=dom-worker-reviewers,asuth

Sandor Molnar[:smolnar]

Updated

•

3 years ago

Regressions: 1791187

Cosmin Sabou [:CosminS]

Comment 17

•

3 years ago

Backed out changeset for causing several service_worker related regressions.

Failure logs:

Backout link: https://hg.mozilla.org/integration/autoland/rev/4cf5fa9a53b93e62252d9fa378d3cd1227043d92

Flags: needinfo?(jstutte)

Comment 18

•

3 years ago

•

Edited

(In reply to Cosmin Sabou [:CosminS] from comment #17)

devtools crashes: https://treeherder.mozilla.org/logviewer?job_id=390675417&repo=autoland and https://treeherder.mozilla.org/logviewer?job_id=390670496&repo=autoland

So the devtools crashes seem to be something quite obvious: if we successfully kick off some LoadScript calls and then fail on the next one, we never initialize mCacheCreator for any of them. Actually, I assume that the current order of things can even lead to a race in the success case if CacheLoadHandler::OnStreamComplete is called before we reach loadContext->SetCacheCreator(cacheCreator);. ~~It seems to me we should initialize the CacheCreator earlier and set it for each loadContext before calling LoadScript.~~ Edit: It seems we never initialize it when !mWorkerRef->Private()->IsServiceWorker() || IsDebuggerScript().
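The condition in the edit above suggests any cancel path has to tolerate a missing cache creator. A rough sketch of such a guard, under the assumption that the creator only exists for non-debugger service worker loads; `FakeCacheCreator`, `FakeLoader`, and their members are made up for this example and do not reflect the real class layout.

```cpp
#include <memory>

// Hypothetical stand-ins for illustration; not the Gecko classes.
struct FakeCacheCreator {
  int deleted = 0;
  void DeleteCache() { ++deleted; }
};

struct FakeLoader {
  bool isServiceWorker = false;
  bool isDebuggerScript = false;
  std::shared_ptr<FakeCacheCreator> mCacheCreator;

  // The creator is only set up for non-debugger service worker loads,
  // i.e. the negation of !IsServiceWorker() || IsDebuggerScript().
  void StartLoading() {
    if (isServiceWorker && !isDebuggerScript) {
      mCacheCreator = std::make_shared<FakeCacheCreator>();
    }
  }

  // Returns true if cache cleanup ran, false if there was no creator.
  bool CancelMainThread() {
    if (!mCacheCreator) {
      return false;  // Nothing to clean up for this kind of load.
    }
    mCacheCreator->DeleteCache();
    return true;
  }
};
```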

The xpcshell failure is a timeout where the log does not tell us much, but it happens with networking on the socket process, so it would not affect release for now.

Flags: needinfo?(jstutte)

Comment 19

•

3 years ago

Argh, right, the first block of code is specialized for the non-SW case so we'd probably need some conceptually similar additions for the SW cases.

I'll do a deeper dive next week. On the upside, we now have a lot of material for our specific list of load permutations with specific tests (sometimes in sequence) that can cause them!

Flags: needinfo?(bugmail)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 20

•

3 years ago

FWIW I added two paper-over checks for the cancel case, try run: https://treeherder.mozilla.org/jobs?repo=try&revision=14d68daa0594eaccff0ba39c6d9c145a7a61e8fc

Reporter

Updated

•

3 years ago

Crash Signature: [@ mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList] → [@ mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList] [@ RefPtr<T>::get | RefPtr<T>::operator-> | mozilla::dom::ThreadSafeWorkerRef::Private ]

Gabriele Svelto [:gsvelto]

Comment 21

•

3 years ago

(In reply to Jens Stutte [:jstutte] from comment #20)

FWIW I added two paper-over checks for the cancel case, try run: https://treeherder.mozilla.org/jobs?repo=try&revision=14d68daa0594eaccff0ba39c6d9c145a7a61e8fc

Looks okay, but I will wait for your analysis before taking further steps.

Crash Signature: [@ mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList] [@ RefPtr<T>::get | RefPtr<T>::operator-> | mozilla::dom::ThreadSafeWorkerRef::Private ] → [@ mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList]

Flags: needinfo?(bugmail)

Comment 22

•

3 years ago

This is the new signature for this crash w/ inlined frames.

Crash Signature: [@ mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList] → [@ mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList] [@ RefPtr<T>::get | RefPtr<T>::operator-> | mozilla::dom::ThreadSafeWorkerRef::Private]

Comment 23

•

3 years ago

•

Edited

(In reply to Jens Stutte [:jstutte] from comment #21)

(In reply to Jens Stutte [:jstutte] from comment #20)

FWIW I added two paper-over checks for the cancel case, try run: https://treeherder.mozilla.org/jobs?repo=try&revision=14d68daa0594eaccff0ba39c6d9c145a7a61e8fc

Looks okay, but I will wait for your analysis before taking further steps.

Can you update the patch in phabricator to make it easier to see the changes? I think I've manually/mentally diffed the changes to just the guards on the cache creator which makes sense for the first linked backout crash, but I'm not sure I understand what changes would address the second linked backout crash where the CachePromiseHandler::ResolvedCallback calls into MaybeExecuteFinishedScripts. I presume if we do something to avoid getting to CachePromiseHandler's creation in NetworkLoadHandler::PrepareForRequest then that will fix it, but I'm not sure what changed to ensure that. If you could clarify, that would be great, thank you! I've triggered a bunch more of the dt2 tests on try to give some extra confidence about the replication on that.

Flags: needinfo?(bugmail) → needinfo?(jstutte)

Comment 24

•

3 years ago

Thanks for having updated the phab patch; it was as I'd mentally diffed. I re-triggered a truly ridiculous number of the dt2 jobs and they all seemed okay... maybe we should just try and land the revised version? I'll re-approve the phabricator patch for clarity.

Flags: needinfo?(jstutte)

Comment 25

•

3 years ago

Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/49661cdd3938 Add IsCancelled checks to NetworkLoadHandler::OnStreamComplete and have a GetCancelResult r=dom-worker-reviewers,asuth

https://hg.mozilla.org/mozilla-central/rev/49661cdd3938

Updated

•

3 years ago

Assignee: ystartsev → jstutte

Iulian Moraru

Comment 26

•

3 years ago

bugherder

Status: NEW → RESOLVED

Closed: 3 years ago

status-firefox107: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 107 Branch

Comment 27

•

3 years ago

The patch landed in nightly and beta is affected.
:jstutte, is this bug important enough to require an uplift?

If yes, please nominate the patch for beta approval. Also, don't forget to request an uplift for the patches in the regression caused by this fix.
If no, please set status-firefox106 to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 28

•

3 years ago

Probably yes, but I think we should wait a few days in nightly at least?

Flags: needinfo?(jstutte) → needinfo?(bugmail)

Reporter

Comment 29

•

3 years ago

New Nightly builds still submit crash reports with this signature, e.g. bp-732014ba-a732-4c13-88c9-213350220922

Status: RESOLVED → REOPENED

status-firefox107: fixed → affected

Flags: needinfo?(jstutte)

Resolution: FIXED → ---

Comment 30

•

3 years ago

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #29)

New Nightly builds still submit crash reports with this signature, e.g. bp-732014ba-a732-4c13-88c9-213350220922

IIUC that stack shows a promise rejection as a consequence of a cycle collection. I think we should just null check the worker ref here?

Flags: needinfo?(jstutte)

Comment 31

•

3 years ago

Attached file Bug 1786571 - Do not DispatchMaybeMoveToLoadedList on failed LoadingFinished. r?#dom-worker-reviewers (obsolete) — Details

Comment 32

•

3 years ago

The bug is linked to a topcrash signature, which matches the following criterion:

Top 10 desktop browser crashes on nightly

For more information, please visit auto_nag documentation.

Keywords: topcrash

Assignee

Comment 33

•

3 years ago

Attached file Bug 1786571 - Do not call LoadingFinished from handlers when cancelled; r=asuth — Details

This largely keeps intact what jstutte did. The initial crash was fixed by eagerly calling
LoadingFinished. The second crash is caused because we call it twice, and only in the service worker
case, where we call it once the promise rejects. Now, we check if we have cancelled, and if we have
then we don't call the scriptLoader methods from inside of the load handlers. LoadHandlers now only
use OnStreamComplete if they are "successful" -- that is, if they were not cancelled.

OnStreamComplete retains its assertion error in the case that something was cancelled and we somehow
ended up there. In a follow up, I will clean up the friend classes of the ScriptLoader so you can't
easily access these methods from the LoadHandlers.
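As a rough illustration of the double-call problem described above: cancellation finishes loading eagerly, so a rejection handler that fires later must not finish it again. The types below (`FakeScriptLoader`, `FakeCachePromiseHandler`) are hypothetical and heavily simplified to a single request; the actual patch may be structured differently.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; not the Gecko classes.
struct FakeScriptLoader {
  bool cancelled = false;
  int loadingFinishedCalls = 0;

  bool IsCancelled() const { return cancelled; }

  void LoadingFinished() {
    assert(loadingFinishedCalls == 0);  // Must run at most once per request.
    ++loadingFinishedCalls;
  }

  // Cancellation finishes loading eagerly, exactly once.
  void CancelMainThread() {
    if (cancelled) {
      return;
    }
    cancelled = true;
    LoadingFinished();
  }
};

struct FakeCachePromiseHandler {
  FakeScriptLoader* mLoader;

  // Invoked when the cache promise rejects, possibly after a cancel.
  void RejectedCallback() {
    if (mLoader->IsCancelled()) {
      return;  // The cancel path already called LoadingFinished.
    }
    mLoader->LoadingFinished();
  }
};
```

Without the IsCancelled check, the sequence cancel then reject would reach LoadingFinished twice and trip the assertion.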

Assignee

Comment 34

•

3 years ago

Attached file Bug 1786571 - Cleanup friend classes; r=asuth — Details

I was always a bit wary of the number of friend classes that ScriptLoader had, and I am finally
going ahead and removing them, as they should only use methods that are "safe" from an external
perspective. If there is a better way to do this I am quite open to that.
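One possible way to get this kind of boundary, sketched under assumptions: instead of friend declarations, handlers receive a narrow interface exposing only the "safe" methods. `LoaderHandle`, `FakeScriptLoader`, and `FakeLoadHandler` are hypothetical names for this example, and the patch here may simply adjust member visibility rather than introduce an interface.

```cpp
#include <cassert>

// Hypothetical narrow interface: the only loader surface handlers can see.
class LoaderHandle {
 public:
  virtual ~LoaderHandle() = default;
  virtual bool IsCancelled() const = 0;
  virtual void OnStreamComplete(int aStatus) = 0;
};

class FakeScriptLoader final : public LoaderHandle {
 public:
  bool IsCancelled() const override { return mCancelled; }

  void OnStreamComplete(int aStatus) override {
    assert(!mCancelled);  // Only "successful" handlers should get here.
    LoadingFinished(aStatus);
  }

  void Cancel() { mCancelled = true; }
  int FinishedCount() const { return mFinished; }

 private:
  // No friend declarations: handlers cannot call this directly.
  void LoadingFinished(int) { ++mFinished; }

  bool mCancelled = false;
  int mFinished = 0;
};

class FakeLoadHandler {
 public:
  explicit FakeLoadHandler(LoaderHandle& aLoader) : mLoader(aLoader) {}

  void Complete(int aStatus) {
    if (!mLoader.IsCancelled()) {
      mLoader.OnStreamComplete(aStatus);
    }
  }

 private:
  LoaderHandle& mLoader;
};
```

With this shape, a handler that tries to call LoadingFinished directly fails to compile, which is the property the friend cleanup is after.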

Depends on D158262

Assignee

Comment 35

•

3 years ago

I've posted two patches that should resolve the new crash. They replace jstutte's second patch.

https://hg.mozilla.org/mozilla-central/rev/b9c694a6b63f

Comment 36

•

3 years ago

Pushed by ystartsev@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b9c694a6b63f Do not call LoadingFinished from handlers when cancelled; r=asuth

Phabricator Automation

Updated

•

3 years ago

Attachment #9295776 - Attachment is obsolete: true

Noemi Erli[:noemi_erli]

Comment 37

•

3 years ago

bugherder

Status: REOPENED → RESOLVED

Closed: 3 years ago → 3 years ago

status-firefox107: affected → fixed

Resolution: --- → FIXED

Updated

•

3 years ago

status-firefox106: affected → fix-optional

tracking-firefox106: + → ---

Comment 38

•

3 years ago

FWIW, I don't see any change on crash volume since the fix landed on Nightly.

status-firefox106: fix-optional → wontfix

Assignee

Comment 39

•

3 years ago

Yes, this should be reopened.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Updated

•

3 years ago

Assignee: jstutte → ystartsev

Assignee

Comment 40

•

3 years ago

The cancellation work seems to have actually made this worse. Looking into it again.

Comment 41

•

3 years ago

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

Assignee

Updated

•

3 years ago

Depends on: 1800496


Comment 42

•

3 years ago

It looks like this is, at long last, fixed.

Status: REOPENED → RESOLVED

Closed: 3 years ago → 3 years ago

Resolution: --- → FIXED