Closed Bug 1603484 Opened 1 year ago Closed 1 year ago

Walmart Grocery service worker fails to launch after initial run

Categories

(Core :: DOM: Service Workers, defect, P2)

73 Branch
defect

Tracking

()

VERIFIED FIXED
mozilla75
Tracking Status
firefox-esr68 --- unaffected
firefox72 --- unaffected
firefox73 + verified disabled
firefox74 + verified
firefox75 --- verified

People

(Reporter: chris.cushman, Assigned: perry)

References

Details

(Keywords: regression, reproducible)

Attachments

(3 files)

Attached image walmartgrocery.png

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0

Steps to reproduce:

  1. Go to: https://grocery.walmart.com/
  2. Click the picture of the first item you see.
  3. You will be navigated to that product. Close out the page and try to go back to the website.
    (to fix the problem you will have to clear the cookies for that website: https://support.mozilla.org/en-US/kb/clear-cookies-and-site-data-firefox#w_clear-cookies-for-any-website)

Actual results:

The website will not load. You will see a blank page with the address https://grocery.walmart.com/ in the URL bar.

Expected results:

The page should load.

I didn't know about corrupt cookies until my question was answered by the exceptional support volunteers: https://support.mozilla.org/en-US/questions/1273210

Hi Chris,

Thanks for contacting us. Given that your question was already answered then, is it ok if we mark this issue as resolved?
Regards, have a great day!

Clara

Status: UNCONFIRMED → RESOLVED
Closed: 1 year ago
Resolution: --- → WORKSFORME

Hi Clara,

This is a temporary fix and I was hoping that you guys could find out the actual issue and solve it so that I don't have to keep clearing the cookies from the specific website each time. I asked the question because I couldn't get the site to work at all and they told me how to work around it, I mentioned this so it would give you guys a better idea of how to fix it.

Status: RESOLVED → UNCONFIRMED
Resolution: WORKSFORME → ---

Hi Chris, my apologies!

I've chosen a component for this bug in hope that someone with more expertise may look at it. We'll await their answer. If you consider that there's another component that's more proper for this case you may change it.

Regards, Clara.

Status: UNCONFIRMED → NEW
Component: Untriaged → Networking: Cookies
Ever confirmed: true
Product: Firefox → Core

Ehsan, could you take a look?

Flags: needinfo?(ehsan)

This bug doesn't have a reproducible set of steps to reproduce. Without that I'm afraid there isn't anything that I or anyone else can really do here.

Has STR: --- → no
Flags: needinfo?(ehsan)
Keywords: steps-wanted

I managed to reproduce it with comment 0 very intermittently. Once I got into a state that it would fail to load the page, I tried to see which cookies were "corrupted", and it seems it's not the cookies at all here, but the service worker that's messing things up.

Steps that always reproduce for me:

Child 14018, Main Thread] WARNING: '!window', file /home/icecold/mozilla-central/dom/cache/CacheStorage.cpp, line 573
[Child 14018, Main Thread] WARNING: '!window', file /home/icecold/mozilla-central/dom/cache/CacheStorage.cpp, line 573
[Child 14018, Main Thread] WARNING: 'mIsWorkerScript || (mState != ServiceWorkerState::Parsed && mState != ServiceWorkerState::Installing)', file /home/icecold/mozilla-central/dom/workers/ScriptLoader.cpp, line 1717
[Parent 13857, IPDL Background] WARNING: 'aOwner->mState == RemoteWorkerController::eTerminated', file /home/icecold/mozilla-central/dom/workers/remoteworkers/RemoteWorkerController.cpp, line 399
[Parent 13857, IPDL Background] WARNING: 'aResult.IsReject()', file /home/icecold/mozilla-central/dom/workers/remoteworkers/RemoteWorkerControllerParent.cpp, line 108
No longer blocks: cookie
Has STR: no → yes
Component: Networking: Cookies → DOM: Service Workers
Priority: -- → P2

Possible regression caused by parent interception?

Blocks: 1588154

Calling this tracking+ for now, but it sounds to me like this could be a blocker for sw-e10s if confirmed.

Assignee: nobody → ytausky
Summary: Walmart Grocery website saves corrupt cookies. After entering the site and clicking an option. You will not be able to go back to the website unless you clear the cookies. → Walmart Grocery service worker fails to launch after initial run

The service worker imports scripts:

importScripts(
  '/sw-data.js?cb=' + Date.now(),
  'https://storage.googleapis.com/workbox-cdn/releases/3.4.1/workbox-sw.js'
);

The first script fails to load since this is a second run of the service worker, where we don't allow it to load new scripts, and that script is always new due to the use of Date.now(). I'd like to know how it worked with parent intercept off; we should probably also rethink the policy regarding loading new scripts in subsequent runs.

We used to forbid service worker from importing new scripts after
installation, restricting them to scripts they already imported.
This restriction never actually worked before the inception of
parent intercept mode, and now it turns out that the Walmart
Grocery site depends on this ability to function properly. This
commit removes the restriction altogether in order to avoid
breaking sites that already depend on it.

Does Chrome allow these sorts of second-run imports? Does the spec?

The spec forbids them entirely. See https://phabricator.services.mozilla.com/D60698#1851619 for a little more discussion.

I see. What does Chrome do here?

Pushed by ytausky@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/92d629854868
Allow new script imports in subsequent service worker runs r=dom-workers-and-storage-reviewers,asuth
Backout by csabou@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/36440af6bb5a
Backed out changeset 92d629854868 for wpt failures on service-worker/import-scripts-updated-flag.https.html.

So we have actually an explicit WPT test that ensures we do not allow these second-run imports? Then I wonder how we ever passed this test (or has it been enabled recently?).

If the spec is such clear, probably we must break the site then. Given that we have some time until SW e10s is enabled in release, maybe we can just warn them?

Flags: needinfo?(bugmail)

(In reply to Jens Stutte [:jstutte] from comment #22)

So we have actually an explicit WPT test that ensures we do not allow these second-run imports? Then I wonder how we ever passed this test (or has it been enabled recently?).

The failure of installing.https.htmlis a TEST-UNEXPECTED-PASS, it is configured to expect FAIL at https://searchfox.org/mozilla-central/rev/ad6234148892bc29bf08d736604ac71925138040/testing/web-platform/meta/service-workers/service-worker/installing.https.html.ini

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=286957357&resultStatus=testfailed%2Cbusted%2Cexception&revision=92d62985486856c0fb0b5a7b940816e19b1faf64&searchStr=wpt

Failure log: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&classifiedState=unclassified&revision=641d24ccb2f69c69b13aff168d983ec057857807&selectedJob=286975206
https://treeherder.mozilla.org/logviewer.html#?job_id=286957357&repo=autoland

Backout link: https://hg.mozilla.org/integration/autoland/rev/92d629854868

There is:
TEST-UNEXPECTED-FAIL | /service-workers/service-worker/import-scripts-updated-flag.https.html | import script not previously imported - assert_equals: expected (string) "NetworkError" but got (object) null
TEST-UNEXPECTED-PASS | /service-workers/service-worker/installing.https.html | installing is set - expected FAIL

Flags: needinfo?(ytausky)

There's a test that ensures that we don't allow new scripts to be imported, and it's 3 years old. Since it passed with child intercept, it seems that it had the same behavior -- at least in some cases. I'll try to find out why it doesn't stop this website from running its service worker, and also why Chrome lets it run. Anyhow, we should probably drop this patch.

So far I didn't manage to get in touch with anyone at Walmart; any ideas would be helpful.

Flags: needinfo?(ytausky)

:mconca, can you try to find a way to reach out to Walmart? Thank you!

Flags: needinfo?(mconca)

(In reply to Jens Stutte [:jstutte] from comment #22)

If the spec is such clear, probably we must break the site then. Given that we have some time until SW e10s is enabled in release, maybe we can just warn them?

The rationale for this fix that we attempted to land (previously discussed at https://phabricator.services.mozilla.com/D60698#1851619) was that we don't want to change this behavior between child-intercept and parent-intercept because it confuses the decision process in switching to parent intercept. Especially since broken Service Workers will manifest as broken sites until we learn how to mark ServiceWorkers as broken and automatically bypass them. (We do have a bug on that at least for corrupted content purposes, but there are variants on this failure mode that wouldn't result in an explicit corrupted content error page that would still be worth reloading the page to bypass the ServiceWorker when this happens.)

If we land this patch/another variant that maintains the child-intercept status quo, then we would want a follow-up bug to fix this behavior and improve the WPT tests to cover this exact scenario. We still pass the WPT test added (https://searchfox.org/mozilla-central/source/dom/serviceworkers/test/importscript_worker.js#31) which tests new loads after the top-level script execution has completed, at least. Haven't looked at the regressions on this bug yet.

Flags: needinfo?(bugmail)

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #19)

I see. What does Chrome do here?

It appears that what happens is:

  • The ServiceWorker throws an error.
  • Intercept is aborted and the network connection is reset to go to the network.
    • navigator.serviceWorker.controller ends up null.
    • The network panel shows that the connection went to the network.
  • They constantly end up reinstalling the ServiceWorker with every visit, which is interesting because the bytewise checks should think things are stable and the soft update should only run once per day.

I will see if we have a way to contact Walmart. I would not be comfortable landing a patch that breaks previous Firefox behavior, even if it is more correct according to the spec. Walmart is the case we know about, there are likely other sites that do this that we are unaware of, especially if Chrome has been allowing this behavior.

(In reply to Mike Conca [:mconca] from comment #29)

I will see if we have a way to contact Walmart. I would not be comfortable landing a patch that breaks previous Firefox behavior, even if it is more correct according to the spec. Walmart is the case we know about, there are likely other sites that do this that we are unaware of, especially if Chrome has been allowing this behavior.

I had this page cached, so didn't see asuth's comment on Chrome behavior before commenting. Looks like Chrome doesn't permit new script import, but the site works because it is reinstalling the worker on each visit. That makes me feel better about this patch, and most sites wouldn't allow new imports because Chrome hasn't allowed it. I will continue to contact Walmart about this issue.

The fallback seems to be non-standard behavior. I filed an issue with the spec to have it specified.

Just to be clear: does this mean, we will not land the patch?

Flags: needinfo?(ytausky)

I think the best solution in this case is to go to the network like Chrome does, and if possible, standardize this behavior. This means we won't land the current patch, and a new one is required.

Flags: needinfo?(ytausky)

Fx73 was reverted to the previous behavior. Chris, can you confirm that current Fx73 beta builds are working OK for you?

Flags: needinfo?(chris.cushman)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #34)

Fx73 was reverted to the previous behavior. Chris, can you confirm that current Fx73 beta builds are working OK for you?

Yes version 73.0 (64-bit) is working for me! Nightly version 74.0a1 (2020-02-04) (64-bit) still doesn't though.

Flags: needinfo?(chris.cushman)

Thank you for the confirmation. It is expected to not work with the current Nightly.

(In reply to Yaron Tausky [:ytausky] from comment #33)

I think the best solution in this case is to go to the network like Chrome does, and if possible, standardize this behavior. This means we won't land the current patch, and a new one is required.

It seems like we have a good path forward for this issue. In parallel, we've reached out to Walmart via two different channels to let them know 1) we are working on this, and 2) the design of their site and use of service workers is not optimal for users of any browser.

Flags: needinfo?(mconca)

Summarizing my current understanding of what's going on and what should happen (but currently doesn't):

  1. The service worker is installed.
  2. The service worker shuts down.
  3. The user navigates to the website, triggering the Handle Fetch algorithm through step 3.1 of the HTTP fetch algorithm.
  4. In step 18 of the algorithm the Run Service Worker algorithm is called.
  5. In step 7.12 the worker's script is evaluated. During the evaluation the script calls importScripts, and fails the check in step 4.1, so a NetworkError is thrown.
  6. In step 7.13 of the Run Service Worker algorithm we erroneously set startFailed to true and abort, instead of continuing.
  7. Back in step 18 of the Handle Fetch algorithm we erroneously set handleFetchFailed to true and skip step 19.
  8. In step 23 we erroneously return a network error instead of continuing to step 24 and returning response, which is null.
  9. Back in the HTTP fetch algorithm we don't fall back to the network, as we should.

So this situation is actually specified correctly and our implementation deviates from the spec starting at point 6 above.

Adding to comment 38, in particular, step 7.12 of the Run Service Worker algorithm states:

Let evaluationStatus be the result of running the classic script script if script is a classic script, otherwise, the result of running the module script script if script is a module script.

The run a classic script algorithm is run with rethrow errors not given, so its value is false. Additionally, script's error to rethrow is null; this value is only non-null in the event of a script parse error, which couldn't have happened because the script is known to have been evaluated when the ServiceWorker installs, and script evaluation implies no parse error. So, we definitely reach step 7 of run a classic script:

Otherwise, set evaluationStatus to ScriptEvaluation(script's record).

We also definitely reach step 12.a of the ScriptEvaluation algorithm:

Set result to the result of evaluating scriptBody.

Because evaluating the ServiceWorker script in its activated state throws an uncaught exception, result.[[Type]] is "throw" and evaluationStatus from run a classic script is an abrupt completion. Then, because rethrow errors is false, step 8.3.1 of run a classic script returns the evaluationStatus. Finally, because evaluationStatus.[[Value]] is not empty (a NetworkError was thrown) and the [Terminate Service Worker] algorithm doesn't run, neither steps 7.13 nor 7.14 of Run Service Worker sets startFailed to true. Rather, startFailed remains false throughout Run Service Worker and the algorithm returns the evaluationStatus of the ServiceWorker script (which is the serviceWorker's start status) and not failure.

The final result is that, in this case, the Handle Fetch's step 18

If the result of running the Run Service Worker algorithm with activeWorker is failure, then set handleFetchFailed to true.

does not set handleFetchFailed to true because the result of Run Service Worker is not failure.

Assignee: ytausky → perry
Status: NEW → ASSIGNED
Attachment #9122387 - Attachment is obsolete: true
Attachment #9122387 - Attachment is obsolete: false
Pushed by pjiang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fb7235b2a1b3
worker script evaluation resulting in abrupt completion shouldn't be a start failure r=dom-workers-and-storage-reviewers,asuth
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/21760 for changes under testing/web-platform/tests
Upstream web-platform-tests status checks passed, PR will merge once commit reaches central.
Flags: qe-verify?
Status: ASSIGNED → RESOLVED
Closed: 1 year ago1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla75
Upstream PR merged by moz-wptsync-bot

Backed out for permafailing on shared-worker-import-data-url.any.html

backout: https://hg.mozilla.org/integration/autoland/rev/ff928c66923e98f671f44592a1f179c780ca7a75

push: https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=03d5db3bdc732d62f6eef6354caa7c711f813fca&tochange=f0f39d5db933ca61a22e48b336702ee48aa9d47b&fromchange=f710854f4d46464196d9db682a8353c8013cf7b9&selectedJob=288713695

failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288713695&repo=autoland&lineNumber=9737

[task 2020-02-13T13:34:03.867Z] 13:34:03 INFO - TEST-START | /workers/modules/shared-worker-import-data-url.any.html
[task 2020-02-13T13:34:03.872Z] 13:34:03 INFO - Closing window 53
[task 2020-02-13T13:34:14.152Z] 13:34:14 INFO - TEST-UNEXPECTED-ERROR | /workers/modules/shared-worker-import-data-url.any.html | SyntaxError: import declarations may only appear at top level of a module
[task 2020-02-13T13:34:14.152Z] 13:34:14 INFO - data:text/javascript,import "http://web-platform.test:8000/workers/modules/resources/static-import-worker.js?pipe=header(Access-Control-Allow-Origin, *)";:1:0
[task 2020-02-13T13:34:14.152Z] 13:34:14 INFO - TEST-INFO expected TIMEOUT | took 10287ms

Flags: needinfo?(perry)
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/21779 for changes under testing/web-platform/tests
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/21780 for changes under testing/web-platform/tests
Upstream web-platform-tests status checks passed, PR will merge once commit reaches central.
Upstream web-platform-tests status checks passed, PR will merge once commit reaches central.
Upstream PR merged by moz-wptsync-bot
Upstream PR merged by moz-wptsync-bot
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla75 → ---

The failure is "TEST-UNEXPECTED-ERROR | /workers/modules/shared-worker-import-data-url.any.html | SyntaxError: import declarations may only appear at top level of a module", but the expectation described in shared-worker-import-data-url.any.js.ini is "[TIMEOUT, ERROR]" for the test case, so I'm not sure why it was reported as an unexpected error. I'll re-run the WPTs.

Flags: needinfo?(perry)

(In reply to Perry Jiang [:perry] from comment #54)

The failure is "TEST-UNEXPECTED-ERROR | /workers/modules/shared-worker-import-data-url.any.html | SyntaxError: import declarations may only appear at top level of a module", but the expectation described in shared-worker-import-data-url.any.js.ini is "[TIMEOUT, ERROR]" for the test case, so I'm not sure why it was reported as an unexpected error. I'll re-run the WPTs.

It's a frequent intermittent: bug 1612699.

See Also: → 1612699

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression
Pushed by pjiang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/85bb15ccb2d1
worker script evaluation resulting in abrupt completion shouldn't be a start failure r=dom-workers-and-storage-reviewers,asuth
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/21861 for changes under testing/web-platform/tests
Upstream web-platform-tests status checks passed, PR will merge once commit reaches central.
Status: REOPENED → RESOLVED
Closed: 1 year ago1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla75
Upstream PR merged by moz-wptsync-bot

I assume, we want to uplift this to beta 74?

Flags: needinfo?(perry)

Comment on attachment 9125694 [details]
Bug 1603484 - worker script evaluation resulting in abrupt completion shouldn't be a start failure r?#dom-workers-and-storage,asuth

Beta/Release Uplift Approval Request

  • User impact if declined: Workers that throw an uncaught exception during initial script evaluation will be considered closed, when they should be considered running (according to the web/shared workers specifications).
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: https://bugzilla.mozilla.org/show_bug.cgi?id=1603484#c0
  • List of other uplifts needed: https://phabricator.services.mozilla.com/D63348
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The change is just allowing the worker to continue running if it throws an uncaught exception.
  • String changes made/needed:
Flags: needinfo?(perry)
Attachment #9125694 - Flags: approval-mozilla-beta?
Flags: qe-verify? → qe-verify+

Verified the issue with Firefox 75.0a1 (20200219215002) on Windows 10x64, macOS 10.15 and Ubuntu 18.04. The https://grocery.walmart.com/ website is working after following the STR from comment 0 and comment 7.
The only problem is that after loading https://grocery.walmart.com/ in a new tab Firefox tries to register a new service worker for the page in about:debugging#/runtime/this-firefox and for a brief period of time. Is this the expected behavior? Thanks!

Flags: needinfo?(perry)

AFAIU, yes. The site is behaving in a strange and unexpected but not forbidden way. See comment 16 and comment 39 for details. We added also an explicit WPT test.

QA Whiteboard: [qa-triaged]

Comment on attachment 9125694 [details]
Bug 1603484 - worker script evaluation resulting in abrupt completion shouldn't be a start failure r?#dom-workers-and-storage,asuth

Fix for a top site, patch has tests, uplift approved for 74.0b6, thanks.
Note to Sheriffs: there is an additional patch to uplift to beta mentioned in the uplift request.

Attachment #9125694 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

(In reply to Alexandru Trif, QA [:atrif] from comment #65)

Verified the issue with Firefox 75.0a1 (20200219215002) on Windows 10x64, macOS 10.15 and Ubuntu 18.04. The https://grocery.walmart.com/ website is working after following the STR from comment 0 and comment 7.
The only problem is that after loading https://grocery.walmart.com/ in a new tab Firefox tries to register a new service worker for the page in about:debugging#/runtime/this-firefox and for a brief period of time. Is this the expected behavior? Thanks!

Yes, this looks like the expected behavior. The second service worker that appears to be registering has the same URL as the one already registered (at least it does for me when I try), so the second service worker should disappear from about:debugging.

Flags: needinfo?(perry)

(In reply to Perry Jiang [:perry] from comment #68)

(In reply to Alexandru Trif, QA [:atrif] from comment #65)

Verified the issue with Firefox 75.0a1 (20200219215002) on Windows 10x64, macOS 10.15 and Ubuntu 18.04. The https://grocery.walmart.com/ website is working after following the STR from comment 0 and comment 7.
The only problem is that after loading https://grocery.walmart.com/ in a new tab Firefox tries to register a new service worker for the page in about:debugging#/runtime/this-firefox and for a brief period of time. Is this the expected behavior? Thanks!

Yes, this looks like the expected behavior. The second service worker that appears to be registering has the same URL as the one already registered (at least it does for me when I try), so the second service worker should disappear from about:debugging.

Yep is disappearing after a period of time. That being said the issue is verified with Firefox 75.0a1 (20200219215002) on Windows 10x64, macOS 10.15 and Ubuntu 18.04.

Verified with Firefox 74.0b6 (20200221001238) on Windows 10x64, macOS 10.15 and Ubuntu 18.04. The website is working after following STR from comment 0 and comment 7.

Status: RESOLVED → VERIFIED
QA Whiteboard: [qa-triaged]
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.