Open Bug 1752655 Opened 5 months ago Updated 9 days ago

Slack issues on session restore: custom font not loaded, sidebar with thread conversation broken or: "For some reason, Slack couldn’t load"

Categories

(Core :: DOM: Service Workers, defect, P2)

defect

Tracking

()

ASSIGNED
Webcompat Priority P1
Tracking Status
firefox-esr91 --- ?
firefox98 + wontfix
firefox99 + wontfix
firefox100 + wontfix
firefox101 + wontfix
firefox102 + wontfix
firefox103 --- affected

People

(Reporter: aryx, Assigned: asuth)

References

Details

(Keywords: regression)

Attachments

(2 files)

Firefox 98.0a1 20220128094730 on Windows 8.1

Observed 4x over the last 2 days, RyanVM also observed the issue.

Slack has issues on session restore after Firefox got closed and launched again: the custom font doesn't get loaded (the default, serif browser font gets used), the sidebar a conversation thread is broken.

Edit: Hard reload fixes the issue for that session.

Severity: -- → S3
Priority: -- → P2

[Tracking Requested - why for this release]: Seems to be common, :jrconlin also reported this. On this machine, it's observed for >50% of the Nightly restarts.

It started to use the prettified "For some reason, Slack couldn’t load" error message again.

Summary: Slack issues on session restore: custom font not loaded, sidebar with thread conversation broken → Slack issues on session restore: custom font not loaded, sidebar with thread conversation broken or: "For some reason, Slack couldn’t load"

Andrew, does this bug seem familiar to you?

Flags: needinfo?(bugmail)

There could be a bunch of things going wrong ServiceWorkers-wise, and the logging in bug 1506892 will help us understand this better and save people (including me) a lot of work. :jesup has recently de-bitrotted my patches there so it's fairly feasible for me to clean those up and land them and then get feedback from those experiencing this over the course of the next week.

Note that there could also potentially be weird SessionStore things going on where SessionStore might be poking some state into the page that results in some weirdness, but a ServiceWorker-related problem is very possible. In particular, bug 1668743 can very possibly occur for slack but should immediately trigger mitigations. But the mitigations only operate on navigation fetches, where it sounds like what's being witnessed here involves subresource fetches failing.

Component: Storage: IndexedDB → DOM: Service Workers
See Also: → 1506892
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Flags: needinfo?(bugmail)

For what it's worth, this is happening to me very frequently.

The last failures on this machine had initially no visual indication of the issue until the user tries to open a Slack thread which fails to load its messages.

I can confirm Comment 6. Fonts are OK, but threads fail to load.

Should we reassess the severity here? If so many Mozillians are hitting this, it must be super common.

Flags: needinfo?(jstutte)

Andrew, IIUC you still have this on your radar. I would assume that we'd see more reports from external users, but this might just indicate that the usage of slack inside the browser in general is not too common given the alternative desktop app.

Flags: needinfo?(jstutte) → needinfo?(bugmail)

Not sure, but bug 1759938 may be a dupe.

Duplicate of this bug: 1759938

I run into the font issue mentioned here every once in a while, but it always sorts itself out over time. A refresh will reset things usually.

I also have a problem when accessing threads - about 50% of the time the panel is blank. It's pretty easy to reproduce.

I'm using Nightly, session restore, and Slack is in a pinned tab.

Happy to help debug if we think any of this is on the Firefox side of things.

When I see this, the following is logged in the browser console (minus "info" logs because there's about 500 of them):


Mar-30 13:26:30.889 [SERVICE-WORKER] normalizing request URL for cache https://app.slack.com/client/T027LFU12 service-worker.js:1:56162
Mar-30 13:26:30.911 [SERVICE-WORKER] responding from cache (normalized) /boot/client.html service-worker.js:1:56162
NS_ERROR_INVALID_CONTENT_ENCODING: Component returned failure code: 0x804b001b (NS_ERROR_INVALID_CONTENT_ENCODING) [nsIStreamListener.onDataAvailable] 2 network-response-listener.js:590
Content Security Policy: Couldn't parse invalid host 'wasm-eval'
Content Security Policy: Couldn't parse invalid host 'wasm-unsafe-eval'
Content Security Policy: Couldn't parse invalid host 'wasm-eval'
Content Security Policy: Couldn't parse invalid host 'wasm-unsafe-eval'
Mar-30 13:26:31.117 [SERVICE-WORKER] responding from cache https://a.slack-edge.com/bv1-9/slack-icons-v2-fe043a5.woff2 service-worker.js:1:56162
Mar-30 13:26:31.131 [SERVICE-WORKER] responding from cache https://a.slack-edge.com/bv1-9/lato-regular-d9ce515.woff2 service-worker.js:1:56162
Mar-30 13:26:31.144 [SERVICE-WORKER] responding from cache https://a.slack-edge.com/bv1-9/lato-black-b64f5e4.woff2 service-worker.js:1:56162
Mar-30 13:26:31.157 [SERVICE-WORKER] responding from cache https://a.slack-edge.com/bv1-9/lato-bold-4b1dc11.woff2 service-worker.js:1:56162
Mar-30 13:26:31.177 [SERVICE-WORKER] responding from cache https://a.slack-edge.com/bv1-9/slack-kit-styles.02eca50.css service-worker.js:1:56162
Mar-30 13:26:31.198 [SERVICE-WORKER] responding from cache https://a.slack-edge.com/bv1-9/client-boot-styles.614ea87.css service-worker.js:1:56162
Mar-30 13:26:31.217 [SERVICE-WORKER] responding from cache https://a.slack-edge.com/bv1-9/helper-styles.6e85b79.css service-worker.js:1:56162
Mar-30 13:26:31.231 [SERVICE-WORKER] responding from cache https://a.slack-edge.com/bv1-9/gantry-vendors.b83fc3b.min.js service-worker.js:1:56162
Mar-30 13:26:31.243 [SERVICE-WORKER] responding from cache https://a.slack-edge.com/bv1-9/gantry-shared.7241864.min.js service-worker.js:1:56162
Mar-30 13:26:31.255 [SERVICE-WORKER] responding from cache https://a.slack-edge.com/bv1-9/client-boot.aba29b2.min.js service-worker.js:1:56162
[WD] All assets have reported back; s: 6; f: 0 T027LFU12:1:945
Mar-30 13:26:31.883 [SERVICE-WORKER] responding from cache https://a.slack-edge.com/bv1-9/gantry-vendors-async.58fd38d.min.js service-worker.js:1:56162
Mar-30 13:26:31.891 [SERVICE-WORKER] responding from cache https://a.slack-edge.com/bv1-9/gantry-vendors-async-client.3c603c7.min.js service-worker.js:1:56162
Mar-30 13:26:31.911 [SERVICE-WORKER] responding from cache https://a.slack-edge.com/bv1-9/client-boot-imports.28a6804.min.js service-worker.js:1:56162
[WD] All assets have reported back; s: 9; f: 0 T027LFU12:1:945
Mar-30 13:26:34.120 [SERVICE-WORKER] responding with fetch https://mozilla.slack.com/api/api.features?_x_gantry=true&_x…f3f496e04e71aec7959795457752ed3f1cdbe0965b40e04a337e13bcc8f8 service-worker.js:1:56162
Mar-30 13:26:34.120 [SERVICE-WORKER] responding with fetch https://slack-imgs.com/ 3 service-worker.js:1:56162
Mar-30 13:26:34.544 [SERVICE-WORKER] responding with fetch https://mozilla.slack.com/api/experiments.getByUser?_x_gantr…f3f496e04e71aec7959795457752ed3f1cdbe0965b40e04a337e13bcc8f8 service-worker.js:1:56162
Use of Mutation Events is deprecated. Use MutationObserver instead. gantry-vendors-async.58fd38d.min.js:25:17177
Mar-30 13:26:35.455 [SERVICE-WORKER] responding with fetch https://slack-imgs.com/ service-worker.js:1:56162
Mar-30 13:26:35.457 [SERVICE-WORKER] responding with fetch https://slack-imgs.com/ 2 service-worker.js:1:56162
Mar-30 13:26:35.987 Couldn't complete metrics measurement for start mark 'membership-update-mark' gantry-shared.7241864.min.js:1:567182
Mar-30 13:26:36.214 Couldn't complete metrics measurement for start mark 'membership-update-mark' gantry-shared.7241864.min.js:1:567182
Mar-30 13:26:36.223 Couldn't complete metrics measurement for start mark 'membership-update-mark' gantry-shared.7241864.min.js:1:567182
Mar-30 13:26:36.310 Couldn't complete metrics measurement for start mark 'membership-update-mark' gantry-shared.7241864.min.js:1:567182
Mar-30 13:26:36.704 Couldn't complete metrics measurement for start mark 'membership-update-mark' gantry-shared.7241864.min.js:1:567182
Mar-30 13:26:36.782 Couldn't complete metrics measurement for start mark 'membership-update-mark' gantry-shared.7241864.min.js:1:567182
Mar-30 13:26:36.854 Couldn't complete metrics measurement for start mark 'membership-update-mark' gantry-shared.7241864.min.js:1:567182
The resource at “https://a.slack-edge.com/bv1-9/lato-regular-d9ce515.woff2” preloaded with link preload was not used within a few seconds. Make sure all attributes of the preload tag are set correctly. C029R6D5VUN-1648599697.812909
The resource at “https://a.slack-edge.com/bv1-9/lato-black-b64f5e4.woff2” preloaded with link preload was not used within a few seconds. Make sure all attributes of the preload tag are set correctly. C029R6D5VUN-1648599697.812909
The resource at “https://a.slack-edge.com/bv1-9/lato-bold-4b1dc11.woff2” preloaded with link preload was not used within a few seconds. Make sure all attributes of the preload tag are set correctly. C029R6D5VUN-1648599697.812909
Some cookies are misusing the recommended “SameSite“ attribute 2
Attempt to set a forbidden header was denied: Connection 1747225210-client_js_prod_kix_core.js:2441:471
Persisting sync data up to s95364055_276459_74114_84166880_14553710_60233_2779397_64730169_3715 bdc4736dc6bebfc09f05.worker.js:1:62668
​

If I compare a "good" slack tab with a "bad" slack tab, the good one shows that fonts are set by a font-family rule:

[lang] body {
  font-family: Slack-Lato,Slack-Fractions,appleLogo,sans-serif;
}

from https://a.slack-edge.com/bv1-9/slack-kit-styles.02eca50.css?cacheKey=gantry-1648624050 . Unclear if that URL is the same across users. Anyway, that rule is completely missing in the broken tab. The devtools style editor does indicate that the stylesheet was loaded, but claims it has 0 rules - but when selecting it in the style editor, compressed and correctly syntax-highlighted CSS shows up (I'm guessing, because it gets re-fetched). If I then add a space at the very start of the CSS content in the style editor, the content in the tab suddenly gets styled correctly.

So ISTM the service worker fetch for some CSS files (and perhaps JS files, given the broken thread views) is returning nothing or broken content, for some reason. At the moment this is happening a lot for me locally (as others have said, a refresh fixes it for that tab, but the next slack tab I open has the same issue) - I'm happy to help debug further but I don't know what information would be helpful.

Thank you :Gijs! The NS_ERROR_INVALID_CONTENT_ENCODING is most interesting as that could explain how there could be apparent corruption despite the design of fetch and cache.add making it hard to get corrupt data into storage. This provides a good indication of the type of error we can also prioritize logging for the about:serviceworkers logging since in general the sub-resource loads would be overly voluminous!

I believe Eden is looking into this now and I will be out on PTO for a bit, so I'm changing the assignee. Note that there are a lot of ongoing high priority Service Worker efforts with deadlines associated and Eden is already responsible for many of them, so in-depth investigations like in comment 13 are VERY appreciated!

Assignee: bugmail → echuang
Flags: needinfo?(bugmail)

Also ran into this problem today, refreshing the page doesn't help. Here is a profiled result, hope it would help.
[update] Ctrl + F5 solves this issue, but Ctrl + R doesn't. But if you run Ctrl + F5 (back to normal) then Ctrl + R, the layout would be broken again.

(In reply to Alastor Wu [:alwu] from comment #16)

Also ran into this problem today, refreshing the page doesn't help. Here is a profiled result, hope it would help.
[update] Ctrl + F5 solves this issue, but Ctrl + R doesn't. But if you run Ctrl + F5 (back to normal) then Ctrl + R, the layout would be broken again.

I experienced this problem today (with Firefox Nightly 100) and, like Alastor reports, Ctrl + R didn't fix the problem, but Ctrl + F5 did.

I understand ctrl + f5 to be the same as ctrl + shift + r which will bypass the ServiceWorker (and attempt to revalidate all cached resources). This is consistent with the problem being understood to involve ServiceWorkers (which is always good confirmation to have and to help make people aware that they can use these mechanisms to bypass ServiceWorker problems!).

Are people experiencing this potentially using any web-extensions that might potentially manipulate or otherwise introspect things via the webRequest API surface?

I have an extension that I understand uses webRequest.

I'm going to look asap in a bit more details into the webRequest side of the known STR, but in the meantime I though to share this details related to the content of the cached responses that seems to be triggering the issue visible from a user perspective (but I got the feeling that the underlying issue may be triggered before it actually presents to the user by making the slack thread sidebar to stay blank when the slack tab gets restored).

"rejected by sanitizer" errors being logged in the console

I was curious about what was in the font request responses that seems to be triggering those "rejected by sanitizer" errors logged in the browser console when I was reproducing this issue issue, and so I tried to record it with rr.

Based on what I've been able to dig into in the recorded session, it seems that the "rejected by sanitizer errors" are raised because the length of the response for the font being preloaded did have a 0 length.

I wanted to confirm if the response in the cache for that .woff2 font request did have an empty body and so I did put a breakpoint in the slack service worker to be sure to be looking to the cache storage with the same cacheKey that slack service worker was using internally.

Using that cacheKey I confirmed from the webconsole tab opened on the same slack tab where the issue was triggered (and the slack Thread sidebar was blank) that those response related to the .woff2 requests that triggered the "reject by sanitizer" errors did have a zero-length body.

I tried to get some more details into how an extension using webRequest (e.g. ublock) could be able to trigger the actual underlying issue, but unfortunately I've not been able to recreate the same issue.

Based on a quick look it looks that ublock may be suspending those .woff2 request when it does intercept them, but it should then resume them without blocking them (and without changing their body using something like webRequest.filterResponseData).

At least for the issue as I can reproduce it locally:

  • once the font are cached with zero-length body, disabling the extension (even all extensions) didn't change the outcome (the slack thread sidebar is blank)
  • clearing from the devtools storage panel the data stored by slack in the cache API storage did fix the issue on all next reloads or on restoring that tab of browser restart (the fonts got loaded and the slack thread sidebar rendered as expected)
  • re-enabling the extensions didn't seem to be able to trigger the underlying issue again anymore (at least it didn't do it over some restarts and after using that profile for a short while)

And so it seems that we may be missing some steps of the actual STR (one that would be consistently re-introducing the underlying issue).

I'm wondering if we may have introduced the actual underlying issue in a range of nightly versions for a short while, which got the zero-length response body cached for the Nightly users affected, and after that the corrupted cache entries are staying in the cache and they keep triggerring the visible issue for the Nightly users that got the unexpected zero-length responses cached.

Do we know if this has ever been reported on another release channel?
or if some of us has been able to reproduce it in a brand new Firefox profile?

or if some of us has been able to reproduce it in a brand new Firefox profile?

I wasn't able to reproduce in a new Firefox Nightly profile.

I was seeing this bug every time I restarted Firefox Nightly starting on Tuesday (after I rebooted my machine to apply a Windows 11 OS update). Clearing "Cached Web Content" in Firefox's Cookies and Site Data settings today seemed to have stopped the problem for now.

Correction: I now see the font problem every I restart Nightly again. Ctrl + F5 still fixes the problem.

I disabled all my extensions and disabled ETP, but I still experience this bug every time I restart Firefox Nightly, even after reloading the page with Ctrl + F5 to "fix" the fonts before each restart.

Chris, can you try clearing the storage for the slack.com domain as documented at https://support.mozilla.org/en-US/kb/clear-cookies-and-site-data-firefox#w_clear-cookies-for-any-website and continue to run for extensions disabled for a bit? The presumption is that the damage to the Cache API storage in your profile is already done and there's no self-healing mechanism in slack to address this (or in Firefox at this time, although this is obviously something we will be thinking about adding to our suite of interventions).

It should be fine to have ETP on, I don't think we currently view that as a likely cause of problems since I don't think slack hosts anything that would be classified as a tracker by ETP?

Flags: needinfo?(cpeterson)

Hey Chris,
before clearing all the data stored for the app.slack.com domain, would you mind to check if you got something inside the storage/default/https+++app.slack.com/cache/morgue/ directory in that profile?

I did notice that I had 1.7G morgue directory for app.slack.com, and even just renaming the morgue directory but leaving it there (or removing) did fix the issue in the test profile I'm using to keep triggering this issue as clearing all the data stored for app.slack.com, and I'd like to confirm if other user affected have the cache morgue directory in a similar state (e.g. with an amount of data higher than usual).

I believe that may just be part of the side-effects of the actual underlying issue, but it may be part of the symptoms and may also help to get some ideas about what conditions may we getting us into that corrupted state.

About the morgue directory, after looking a bit into the code that is managing that morgue subdir I started to suspect that the name may be a bit misleading and that its role for the Cache API storage is similar to the role of the files subdir that is used by the IndexedDB storage.

I had a brief chat with :asuth and he confirmed that suspicious. And so moving or removing the morgue dir is in practice similar to removing all the data stored in the Cache API from the devtools storage panel (with the downside of likely being living the Cache API storage in an inconsistent state).

Chris looked to the size of his morgue directory for the app.slack.domain and it was ~4G.

My current suspicious (about the relationship between the extension APIs and triggering this issue) is that:

  • when the size of the storage directory becomes pretty big
  • then that may slow down the startup of the extensions that are using webRequest to suspend the intercepted web requests if they are also waiting for a browser.storage.local API call to complete (which will not be completed until the Quota Manager Service is fully initialize, which if I recall correctly can take longer when the storage directory grows in size and number of files to go through to estimate the amount of storage in use)
  • and maybe when that happens the extension may have a chance to introduce some issue for the request that the service worker does and then store in the cache (e.g. some timeout not handled correctly by the slack service worker?).

I collected some logs (using a couple of log points added in the WebExtensions internals for the WebRequest API) which shows that the extension (uBlock in this case) would be intercepting from the webRequest API both the request triggered by the webpage to load the font (which is going to be handled by the service worker) and then the one triggered by the service worker if the font was not already cached:

The logpoints with tabId -1 are the one related to the service worker request intercepted.

webRequest.filterResponseData could be another one way that an extension may directly be able to change the body of the intercepted request and make it a 0-length body (maybe just by mistake, e.g. because of an unexpected exception or rejection not handled), I did add a logpoint for when webRequest.filterReponseData is called, but it seems that it wasn't triggered in the tests I did locally.

Flags: needinfo?(cpeterson)

The attached patch includes a mochitest-browser test file that recreates the conditions described below and confirms that it would lead to a response stored in the cache API storage with a 0-length body.

The purpose was to confirm that an extension would be able to trigger the issue by:

  • intercepting the requests that the service worker does to fetch the responses added into the cache
  • then calling webRequest.filterResponseData to be able to intercept the body data, but without doing anything at all in the filterResponseData events ("start", "stop", "data")

e.g. like an extension that may be misbehaving by mistake (like failing to handle some error condition and never actually handle any on the filterResponseData events as a side effect of that).

The attached extension includes a small content script that scans all the cache API data on app.slack.com (as soon as there is a tab open on it) and removes any response with an empty body.

I'm attaching the extension here in case it may be useful to someone else to fixup the issue in their profile (without having to clear all the data and cookies stored by slack), it just need to be loaded temporarily from about:debugging (it does log some console messages in the webconsole associated to the slack tabs being cleared).


I was curious what other resources were also cached with an empty response body, because I was assuming that the font was just what was making the issue visible in the Browser Console (due to the error logged internally on the font sanity checks) but that the cached responses that were likely being breaking the slack thread sidebar were actually other ones (likely .js files storage into the cache as empty), and so I created this extension as a quick way to look into that.

Locally (in one of my profile affected by this issue that I have archived last week to run some more tests) I also tweaked this extension to clear only the .js files with an empty body (instead of any response with an empty body, as in the version of the extension I'm attaching) and on the next browser startup the slack thread sidebar was not empty anymore.

Since it is caused by ServiceWorker calls respondWith() with a corrupt response, I am developing a patch to verify the source response and cached response should be the same to make sure Cache API doesn't save a corrupt response.

But I think it does not help the case if an extension modifies the source response which Luca shows in comment 27, Cache API has no idea whether this source response is corrupt or not.

Webcompat Priority: --- → ?

(In reply to Eden Chuang[:edenchuang] from comment #29)

Since it is caused by ServiceWorker calls respondWith() with a corrupt response, I am developing a patch to verify the source response and cached response should be the same to make sure Cache API doesn't save a corrupt response.

But I think it does not help the case if an extension modifies the source response which Luca shows in comment 27, Cache API has no idea whether this source response is corrupt or not.

I agree with Eden that the Cache API can't really determine that the source response is "corrupted", from its perspective the Response is successful, it got an empty body but also a status 200.

I wanted to confirm if instead the StreamFilter could determine that it got into a broken status (and if that is the case then look if StreamFilter could make sure that the request fails explicitly instead of resolving successfully).

Based on an initial look it seems that that it may be actually the case, and so I filed Bug 1766053 to look more into that.

In Bug 1766053 comment 0 I tried to describe in a bit more detail the sequence that leads to the "corrupted" responses, and the fact that as part of that sequence StreamFilterParent::Broken is actually being called, as a proof that StreamFilter internals aren't in the expected states when it is being destroyed (or at least StreamFilterParent isn't in the expected state when it gets destroyed).

Setting WebCompat Priority P1: Slack is a major site, and the breakage is very significant.

Webcompat Priority: ? → P1

The severity field for this bug is set to S3. However, the bug is marked as tracked for firefox100 (release).
:edenchuang, could you consider increasing the severity of this tracked bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(echuang)

This is also a Web-Compat P1, which we would generally consider to mean it meets the "high impact" criteria of an S2.

Severity: S3 → S2
Assignee: echuang → bugmail

Removing Eden's ni? , in case questions go to :asuth, thanks.

Flags: needinfo?(echuang)
You need to log in before you can comment on or make changes to this bug.