Open Bug 1756471 Opened 3 years ago Updated 2 hours ago

Intermittent error loading js chunk seemingly related to caching

Categories

(Core :: Networking: HTTP, defect, P2)

Firefox 96
defect

Tracking


UNCONFIRMED

People

(Reporter: danielkarp, Assigned: valentin)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file)

Attached image combined2.png

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36

Steps to reproduce:

Intermittently, Firefox users (and only Firefox users) of our Angular site (which lazy-loads chunks via webpack) get ChunkLoadErrors that prevent the site from loading properly; the errors clear up when the user clears their cache. This has been maddeningly intermittent and I have not been able to reproduce it myself, but I've collected some HAR files from users with details that may point in the right direction. I can't completely rule out an issue with plugins/add-ons/extensions, but as far as I can tell there is no extension common to all the users who have reported it, and some say they have seen the issue running vanilla Firefox.

Basically, what happens is that users load our site, one js chunk does not load properly, and so the site fails to run. They report that clearing their cache solves the issue.
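For context, the lazy routes use the standard Angular/webpack dynamic-import pattern, roughly like the sketch below (illustrative only, not our exact code; the module names are made up):

import { Routes } from '@angular/router';

const routes: Routes = [
  {
    path: 'videos',
    // webpack splits this dynamic import into a separately fetched js chunk;
    // if that fetch fails at runtime, the loader throws a ChunkLoadError and
    // the route never renders.
    loadChildren: () => import('./videos/videos.module').then(m => m.VideosModule),
  },
];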

Actual results:

Here is an example from one user. I have additional HAR files as well.

I'm not super adept at decoding HAR files, but it is interesting that when chunks fail to load, there appears to be something odd with the protocol.

I've attached three screenshots (combined into one image) from the HAR viewer, pointing out some specific issues. I'm reluctant to upload the whole files, because there is sensitive information in them, but I might be able to get permission to share them privately.
In screenshot 1, I've focused on a chunk that (as far as I can tell) loaded properly.

Screenshot 2, from the same HAR, shows a chunk that did not load properly (notice the status code 0 and -1 B transferred). Notice also that the GET line in the raw request omits HTTP/2. I believe this file was the missing chunk that broke the page.

Screenshot 3 shows a request that seems to have loaded properly, but for some reason also has status code 0. This one could just be loading from cache, but I'm not sure.

The key point is that when the user clears their cache, the file that previously would not load then loads properly (same filename, etc). Again, this occurs only for Firefox users.

For more discussion of this: I first asked about this on StackOverflow, here:
https://stackoverflow.com/questions/71070943/what-is-the-source-of-this-intermittent-webpack-chunkloaderror-in-firefox (the only person to answer thought this was a Firefox bug).

I also brought it up as a possible Webpack issue, here:
https://github.com/webpack/webpack/discussions/15422

Finally, you can see other user reports of the same issue in our forums:
https://boardgamegeek.com/thread/2788693/specific-videos-page-immediately-redirects-me-empt
https://boardgamegeek.com/thread/2800974/cant-open-forum-threads-firefox

Expected results:

We shouldn't be getting those chunk load errors.

I was one of the users experiencing this bug (very intermittently; twice in months of heavy usage). I have seen a similar issue using Adobe Assets through Firefox: reviewing a particular file fails, and clearing the site cache resolves the issue.

Since the issue is intermittent it's hard for me to reproduce in a short time, but it's clear that many users are facing it.
I'm going to assign the "Core :: JavaScript Engine" component; it could be a good starting point for the engineering team to take a look at this issue.

Component: Untriaged → JavaScript Engine
Product: Firefox → Core

Thank you! Anyone working on this, feel free to reach out to me, and I will, with the users' permission, share any HAR files that might be of use.

Component: JavaScript Engine → Networking: HTTP

Thank you for the report.
Check out our guide for reporting networking bugs
Specifically here, the HAR file and preferably an HTTP log would help. Thanks!

Flags: needinfo?(geoff)
Flags: needinfo?(danielkarp)

(In reply to Valentin Gosu [:valentin] (he/him) from comment #4)

Thank you for the report.
Check out our guide for reporting networking bugs
Specifically here, the HAR file and preferably an HTTP log would help. Thanks!

I've sent some HAR files to you directly by email, and will talk with affected users about HTTP logging.

Flags: needinfo?(danielkarp)
Flags: needinfo?(geoff)

I sent some HAR files directly to Valentin, but I’m not sure if he is the person actually working on this. Again, because the HAR files may contain sensitive information, I do not want to post them here, but I have permission to send them to whomever might be working on the bug—just let me know!

(In reply to danielkarp from comment #6)

I sent some HAR files directly to Valentin, but I’m not sure if he is the person actually working on this. Again, because the HAR files may contain sensitive information, I do not want to post them here, but I have permission to send them to whomever might be working on the bug—just let me know!

The networking team will work on this bug.
I've also seen the HAR files, but unfortunately I can't figure out the root cause from them.
Could you try to get an HTTP log?
You could send the log to necko at mozilla dot com or kershaw at mozilla dot com. Thanks.

Flags: needinfo?(danielkarp)

I will try to get one of the users who can reproduce the bug to try getting an HTTP log. Does the networking team also handle HTTP caching? From the outside looking in, with no idea what the code is actually doing, the behavior they report makes me think the problem could be at the caching layer.

Flags: needinfo?(danielkarp)

It is also possible this could be related to bug 1640212.
Could you also ask the reporters if setting the network.http.rcwn.enabled pref to false improves things?

Keeping the needinfo flag to indicate that we are still waiting for more information.
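For anyone trying this: the pref can be flipped in about:config, or set from a user.js file in the profile directory with a line like the one below (the pref name is the one mentioned above; the rest is standard pref syntax).

// user.js -- disables race-cache-with-network (RCWN)
user_pref("network.http.rcwn.enabled", false);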

Flags: needinfo?(danielkarp)

I have asked the users to set that flag to false, and also asked them to try to get HTTP logs. Still waiting to hear back.

Flags: needinfo?(danielkarp)

D'oh! Didn't mean to clear the needinfo. I guess it was automatically checked when I replied?

Flags: needinfo?(danielkarp)

(In reply to danielkarp from comment #12)

D'oh! Didn't mean to clear the needinfo. I guess it was automatically checked when I replied?

Yes, it's automatically checked. No worries.
Thanks for your help!

So far, it seems that setting the network.http.rcwn.enabled pref to false is working for people. No one has been able to provide an HTTP log yet.

OK, I have privately sent along an HTTP log captured by one of our users. I'll forward any other HTTP logs users are able to capture as well. Is there anything else that would be helpful?

Oh, one more thing I forgot to mention. Until recently, the domain that has been giving us trouble, cf.geekdo-static.com, did not support IPv6. We added IPv6 support last week, but it does not seem to have affected whether this bug occurs.

FYI, the user who provided that HTTP log also reported that setting network.http.rcwn.enabled to false didn't fix the problem for him. That calls into question whether he is having the same issue as the other users, since for most of them that setting seems to fix the problem.

From the log, I can only tell that one js request failed.

2022-03-10 08:47:17.535000 UTC - [Parent 12576: Main Thread]: D/nsHttp HttpChannelParent::RecvCancel [this=2231b371440]
2022-03-10 08:47:17.535000 UTC - [Parent 12576: IPDL Background]: V/nsHttp HttpBackgroundChannelParent::ActorDestroy [this=2231decbee0]
2022-03-10 08:47:17.535000 UTC - [Parent 12576: Main Thread]: V/nsHttp nsHttpChannel::Cancel [this=2231b16ed00 status=804b001b]
2022-03-10 08:47:17.535000 UTC - [Parent 12576: Main Thread]: V/nsHttp nsHttpChannel::CancelInternal [this=2231b16ed00]

The failed reason is NS_ERROR_INVALID_CONTENT_ENCODING.

FYI, it continues to be the case that for most users reporting this issue, setting network.http.rcwn.enabled to false completely solves the problem for them. That suggests that the underlying problem here might be the same as Bug 1640212, which also seems to be solved by changing that setting to false.

Flags: needinfo?(danielkarp)

I have spent some time trying to reproduce the issue and I was not able to. I tried without logging in. The resource that was failing in the log is not accessible without a login (I am getting 403 responses).

The log only shows the failing request. The request fails because it cannot be decoded, i.e. gzip decoding is failing. I am wondering why that is. One possible cause would be that we store corrupted content, but that is unlikely, because changing the network.http.rcwn.enabled pref would then not fix the problem but make it worse.
Maybe our RCWN feature is corrupting the content.

How hard is it to reproduce the issue? Is it possible to record a log for a longer period? I would need one successful load followed by a failed load; I am interested to see what is happening in between.

Flags: needinfo?(danielkarp)

So unfortunately, I've never been able to reproduce it myself. All of the recordings come from other users, but some of them DO seem to be able to reproduce it semi-reliably, though by that I mean at most a couple of times a day. Still, I'll see if I can get one of them to record a log for a longer period.

Flags: needinfo?(danielkarp)

Oh, also, you can create a free account on BGG (or I can provide you with one) if you want to try further to reproduce this, if not having an account is part of the problem.

So the users who are willing to try to create the log have asked the following:

And then they want us to be recording a SUCCESSFUL one followed by a failure as the next one, literally? Since it's apparently random and impossible to predict in advance when the next one is going to be a failure, it's unclear how feasible that is, unless we just continually log and log and log until a failure happens. (Maybe that's feasible to do; I have no idea how big these logs get...)

I'd love to help, but as Russ mentioned, this is pretty hard. It's pretty random. Or at least, I could not notice a pattern.
Two questions:

  • does it have to be the exact same page?
  • can the log be on all the time, so that when it happens we just have to grab it? Or does it require some specific manipulation on our part each time we want to get a log?

So how big are these logs? How long is it feasible to record a log? Do you need logs that have a success followed immediately by a failure?

Flags: needinfo?(dd.mozilla)

(In reply to danielkarp from comment #23)

So the users who are willing to try to create the log have asked the following:

And then they want us to be recording a SUCCESSFUL one followed by a failure as the next one, literally? Since it's apparently random and impossible to predict in advance when the next one is going to be a failure, it's unclear how feasible that is, unless we just continually log and log and log until a failure happens. (Maybe that's feasible to do; I have no idea how big these logs get...)

I'd love to help, but as Russ mentioned, this is pretty hard. It's pretty random. Or at least, I could not notice a pattern.
Two questions:

  • does it have to be the exact same page?
  • can the log be on all the time, so that when it happens we just have to grab it? Or does it require some specific manipulation on our part each time we want to get a log?

So how big are these logs? How long is it feasible to record a log? Do you need logs that have a success followed immediately by a failure?

You can limit the size of a log by adding "rotate:200" to the list of log modules; this means the maximum size will be 200MB.
You can record a log for a very long time, but please add the limit, otherwise the files will get too big.
I need a log that contains both a successful and a failed load; it is not important how far apart they are.
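For reference, the environment-variable setup from the networking logging guide looks roughly like this (the log file path is only an example):

MOZ_LOG=timestamp,rotate:200,nsHttp:5,cache2:5,nsSocketTransport:5,nsHostResolver:5
MOZ_LOG_FILE=/tmp/firefox_http_log.txt

With rotate:200 in the module list, logging can safely stay on until a failure happens.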

Flags: needinfo?(dd.mozilla)

Setting the ni flag to indicate that we are waiting for more information.

Flags: needinfo?(danielkarp)

Redirecting a needinfo that is pending on an inactive user to the triage owner.
:valentin, since the bug has recent activity, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(danielkarp) → needinfo?(valentin.gosu)

This seems like a real bug, so I'm taking it even though the report is incomplete.

(In reply to Kershaw Chang [:kershaw] from comment #18)

From the log, I can only tell there's one js request failed.

2022-03-10 08:47:17.535000 UTC - [Parent 12576: Main Thread]: D/nsHttp HttpChannelParent::RecvCancel [this=2231b371440]
2022-03-10 08:47:17.535000 UTC - [Parent 12576: IPDL Background]: V/nsHttp HttpBackgroundChannelParent::ActorDestroy [this=2231decbee0]
2022-03-10 08:47:17.535000 UTC - [Parent 12576: Main Thread]: V/nsHttp nsHttpChannel::Cancel [this=2231b16ed00 status=804b001b]
2022-03-10 08:47:17.535000 UTC - [Parent 12576: Main Thread]: V/nsHttp nsHttpChannel::CancelInternal [this=2231b16ed00]

The failed reason is NS_ERROR_INVALID_CONTENT_ENCODING.

Hmm, this makes me think that we have the wrong headers. The only way this could happen is if we mixed network and cached responses.
I think I have an idea where to look. Might also be able to write a test for it.
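To illustrate the failure mode outside of Necko (a Node.js sketch, not Firefox code): decoding a body that does not match the declared Content-Encoding fails immediately, which is the same symptom as the NS_ERROR_INVALID_CONTENT_ENCODING cancellation above.

// The stored body is plain text, but we decode it as if the headers had said
// Content-Encoding: gzip, so the decoder rejects it right away.
const zlib = require("zlib");
const plainBody = Buffer.from("console.log('already-decoded js chunk');");
try {
  zlib.gunzipSync(plainBody);
} catch (e) {
  console.log("decode failed:", e.message); // e.g. "incorrect header check"
}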

Assignee: nobody → valentin.gosu
Blocks: RCWN
Severity: -- → S3
Flags: needinfo?(valentin.gosu)
Priority: -- → P2
Whiteboard: [necko-triaged]

Sorry I haven't been able to come through with a log. The problem is that the users who reported this issue (I've never seen it myself) have mostly discovered, in our forums, the trick of setting network.http.rcwn.enabled to false, and are mostly no longer concerned with it. It is a lot to ask them to change their settings again, start logging, and hope that the bug reoccurs--I had no takers the last time I asked. But I can try again.

No worries, I totally understand. :) I'll get back to you in a few days, it's possible I've actually found the bug.

(In reply to Valentin Gosu [:valentin] (he/him) from comment #29)

No worries, I totally understand. :) I'll get back to you in a few days, it's possible I've actually found the bug.

In the meantime, one of the users who was experiencing the bug actually got the logs you were asking for, I believe. I have emailed you links to the files, but you should download them soon, as those links will expire shortly.

I'd like to reengage on this issue, if possible.

Our Firefox users (and only our Firefox users) continue to have problems loading js chunks on our site. I reported the issue to Angular, but they believe the problem is not on their end: https://github.com/angular/angular-cli/issues/28361

One user recently provided the following from his console:

Loading failed for the module with source “https://cf.geekdo-static.com/frontend/chunk-7VTT4A6L.js”. boardgamegeek.com
Loading failed for the module with source “https://cf.geekdo-static.com/frontend/chunk-KRXMVMZV.js”. boardgamegeek.com
Loading failed for the module with source “https://cf.geekdo-static.com/frontend/chunk-UOJLIIXE.js”. boardgamegeek.com
Loading failed for the module with source “https://cf.geekdo-static.com/frontend/chunk-UKMIOM3P.js”. boardgamegeek.com
Loading failed for the module with source “https://cf.geekdo-static.com/frontend/chunk-JIBBZ5SB.js”. boardgamegeek.com
Loading failed for the module with source “https://cf.geekdo-static.com/frontend/chunk-SFTBLGAN.js”. boardgamegeek.com
Loading failed for the module with source “https://cf.geekdo-static.com/frontend/chunk-LYULK75U.js”. boardgamegeek.com

I immediately checked the files--all of them were still available from the CDN, and after he went into his settings and cleared his cache for the last day, the problem went away for him. He mentioned that reloading with Ctrl-F5 did NOT fix the problem.

So it seems that this continues to be an issue.
