Closed Bug 1700968 Opened 3 years ago Closed 3 years ago

Problems loading images on youtube.com, wikipedia.org, and some images opened from discord

Categories

(Core :: Networking, defect, P2)

Firefox 89
ARM64
macOS
defect

Tracking


VERIFIED FIXED
90 Branch
Tracking Status
firefox90 --- verified

People

(Reporter: AwesomeSheep48, Assigned: dragana)

Details

(Whiteboard: [not-a-fission-bug][necko-triaged])

Attachments

(12 files, 1 obsolete file)

1.29 MB, video/quicktime
854.24 KB, video/mp4
2.61 MB, video/mp4
375.25 KB, image/png
194.30 KB, image/png
6.77 MB, video/quicktime
1.10 MB, image/png
979.16 KB, image/png
3.51 MB, video/quicktime
8.01 MB, application/x-7z-compressed
48 bytes, text/x-phabricator-request
48 bytes, text/x-phabricator-request

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0

Steps to reproduce:

Open https://github.com/pietervanheijningen/clickbait-remover-for-youtube or https://forum.paradoxplaza.com/forum/forums/stellaris.900/

Actual results:

Pages load for ~3 minutes, then stop loading.

Expected results:

Pages should load properly, as they do when Fission is turned off.

This only happens after using Firefox for a while; restarting fixes it.

The Bugbug bot thinks this bug should belong to the 'Core::DOM: Navigation' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → DOM: Navigation
Product: Firefox → Core

Comment on attachment 9211831 [details]
Screen Recording 2021-03-26 at 11.53.27 AM.mov

It only loads after Cmd+Shift+R.

On further testing, it seems that Fission might be unrelated; I'll do some more testing.

I wonder if this is related to http3.
See bug 1695717.

(In reply to Olli Pettay [:smaug] from comment #7)

I wonder if this is related to http3.
See bug 1695717.

I think it might be

Strike that, something just happened

Opening a file from the discord app is when I notice it the most

(In reply to daviswill048 from comment #4)

It only loads after Cmd+Shift+R.

Opening a file from the discord app is when I notice it the most

If you have seen this problem on both YouTube and Discord and the problem seems to go away after Cmd+Shift+R, then this bug might be related to sites trying to update their Service Workers while the Service Worker is in use.

When you next reproduce this bug, can you please record a performance profile using the Firefox Profiler? This will include information about Service Workers and what code Firefox is running at the time. Here are instructions for enabling the Firefox Profiler ahead of time:

https://profiler.firefox.com/

I recommend changing these profiler settings:

  1. Click the profiler toolbar button's down arrow.
  2. Change Settings to "Custom".
  3. Click "Edit Settings".
  4. Then search for and check the "DOM Worker" and "IPC Messages" checkboxes.

William Davis, please see comment 12 for some additional info that we need. Thanks!

Flags: needinfo?(daviswill048)

Seems to be fixed now. I'll comment and try to send the performance profile if it happens again.

Status: UNCONFIRMED → RESOLVED
Closed: 3 years ago
Flags: needinfo?(daviswill048)
Resolution: --- → WORKSFORME

Command Shift R isn't working either https://share.firefox.dev/3v1NoeT

For images to load, I need to open them in a new tab, then command shift r

Randell, please use the profile recording to figure out the cause of this issue. Thanks!

Status: RESOLVED → REOPENED
Ever confirmed: true
Flags: needinfo?(rjesup)
Resolution: WORKSFORME → ---

The profile shows a bunch of images that have been requested, but apparently not received from the server.

I don't see anything fission-related here; forwarding to the necko team to look into why we're not receiving the images. Do we have any info on http3 use in this instance?

William: The necko team may want you to collect network logs; you can turn them on via about:networking.

Component: DOM: Navigation → Networking
Flags: needinfo?(rjesup) → needinfo?(dd.mozilla)

I am using HTTP3, but it was still happening without it, as seen in comment 10. It isn't happening at the moment, but I'll get network logs if it starts again.

This sounds like bug 1703934.
But probably it is not. The other bug only shows up if HTTP3 is enabled, the pref network.dns.use_https_rr_as_altsvc is true, DoH is used, and probably network.http.speculative-parallel-limit is 0.
But you see the bug without HTTP3, so this is not the same bug.

Can you disable HTTP3, to narrow down where things can go wrong, and make an HTTP log for me:

https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

The log may contain cookies so please log out of the site to invalidate cookies that are in the log.

Flags: needinfo?(dd.mozilla)
Flags: needinfo?(daviswill048)

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #24)

I don't see anything fission-related here; forwarding to the necko team to look into why we're not receiving the images. Do we have any info on http3 use in this instance?

In that case, I will remove "Fission" from the bug summary.

@Andrew, the Fission bug triage team wonders if this might be a Service Worker issue. Are there any additional instructions for debugging Service Worker issues that you'd like to add to what I already suggested in comment 12?

Flags: needinfo?(bugmail)
Whiteboard: [not-a-fission-bug]

ServiceWorkers could perhaps be involved in some of the cases? But it seems like some of these profiles are for wiktionary.org and wikipedia.org, neither of which seems to serve me a ServiceWorker?

:julienw is currently implementing/driving/landing a bunch of enhancements to how the profiler logs what ServiceWorkers are doing in bug 1567222 which should ideally make it easier to further figure out what's going on from just the profiler.

:julienw, could you quickly take a look at some of the profiles here and help me better understand whether they indicate that ServiceWorkers are involved? One meta-issue is that I don't know if the profiler runs are started early enough to catch the relevant events. For example, the profiler run in comment 18 only has a favicon fetch in the network tab.

Flags: needinfo?(bugmail) → needinfo?(felash)

I looked especially at the profiler URL in comment 18.

First, note that this URL selects the webextensions thread, but there are more threads. The parent process is at the top, and the web process at the bottom. There are a lot more network requests in those. The requests in the parent process look odd (lots of unfinished requests, but that may be normal with tracking stuff, OR anti-tracking exposes a bug in our profiler code), but the requests in the isolated web process look about right.

Then I wanted to see if a service worker was involved, so I checked all "DOM Worker" threads from the blue button at the top and inspected them all. I switched the call tree "implementation" option to "JavaScript" (instead of "All Stacks") and looked at each of them. Indeed, there seems to be a related one in the "Isolated Web Process" (see direct link https://share.firefox.dev/3ekLij8), because the name of the file includes "serviceworker".
Another way, maybe easier, is going to the Marker Chart, and looking at the DOMEvents specific to service workers (especially "fetch"). In this case I see no "fetch" DOMEvents, so I'd be cautious. But being a notification-only service worker (from the name of the script) it's probably normal we don't have these events.

Lastly, I went a step further and curl-ed the script URL https://www.youtube.com/s/desktop/e84fb691/jsbin/serviceworker-notifications.vflset/serviceworker-notifications.js. It's minified, of course, but I do see occurrences of self.registration.pushManager.getSubscription(). So that seems to be it.

To help with analyzing using the profiler, we could:

  • have a different name for service worker threads, besides just "DOM Worker". That should be reasonably easy, but I don't know how all this works :-) In the world of Fission the Worker would be in the same process as the content page, so that makes things easier too.
  • show in the network requests that they're being handled by a service worker -- that's more on our team.

Hope this helps, please needinfo again if you want me to look at some of the other profiles!

Flags: needinfo?(felash)

(In reply to Dragana Damjanovic [:dragana] from comment #26)

This sounds like bug 1703934.
But probably it is not. The other bug only shows up if HTTP3 is enabled, the pref network.dns.use_https_rr_as_altsvc is true, DoH is used, and probably network.http.speculative-parallel-limit is 0.
But you see the bug without HTTP3, so this is not the same bug.

Can you disable HTTP3, to narrow down where things can go wrong, and make an HTTP log for me:

https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

The log may contain cookies so please log out of the site to invalidate cookies that are in the log.

https://share.firefox.dev/33cN2Gk

Flags: needinfo?(daviswill048)
Summary: Fission is slowing down loading times or stopping pages from loading → Problems loading images on youtube.com and maybe discord
OS: Unspecified → macOS
Hardware: Unspecified → ARM64
Summary: Problems loading images on youtube.com and maybe discord → Problems loading images on youtube.com, wikipedia.org and some images opened from discord
Summary: Problems loading images on youtube.com, wikipedia.org and some images opened from discord → Problems loading images on youtube.com, wikipedia.org, and some images opened from discord
Assignee: nobody → dd.mozilla
Severity: -- → S4
Status: REOPENED → ASSIGNED
Priority: -- → P2
Whiteboard: [not-a-fission-bug] → [not-a-fission-bug][necko-triaged]

When TRR is used in mode 3 (only in mode 3), AsyncResolveNative can fail immediately (synchronously), and we need to make sure that we retry it in the same way as in DnsAndConnectSocket::TransportSetup::OnLookupComplete. If RESOLVE_IP_HINT is set, we already retry the lookup, but we do not retry if mRetryWithDifferentIPFamily is set.
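The retry rule described above can be modeled in isolation. The following is a minimal, hypothetical C++ sketch: `LookupState`, `StartLookup`, `Resolver`, `ResolveStatus`, and `IpFamily` are illustrative names invented here, not Necko's actual types or signatures, and `retryWithDifferentIPFamily` only stands in for `mRetryWithDifferentIPFamily`. It assumes the lookup starts on one IP family and shows a synchronous failure triggering the same one-time retry with the other family that the asynchronous completion path would perform.

```cpp
#include <cassert>
#include <functional>

// Hypothetical model; these are not Necko's real types or nsresult codes.
enum class ResolveStatus { Ok, FailedSync };
enum class IpFamily { V4, V6 };

struct LookupState {
  IpFamily family = IpFamily::V6;           // family tried first (assumed)
  bool retryWithDifferentIPFamily = false;  // stands in for mRetryWithDifferentIPFamily
  int attempts = 0;                         // resolver calls made by the last lookup
};

// A resolver that either succeeds or fails synchronously for a given family.
using Resolver = std::function<ResolveStatus(IpFamily)>;

inline IpFamily OtherFamily(IpFamily f) {
  return f == IpFamily::V4 ? IpFamily::V6 : IpFamily::V4;
}

// Starts a lookup. If the resolver fails synchronously and a retry with the
// other IP family is allowed, retry immediately, mirroring what the
// asynchronous completion handler would do for a failed lookup.
inline ResolveStatus StartLookup(LookupState& state, const Resolver& resolve) {
  state.attempts = 1;
  ResolveStatus status = resolve(state.family);
  if (status == ResolveStatus::FailedSync && state.retryWithDifferentIPFamily) {
    state.retryWithDifferentIPFamily = false;  // retry at most once
    state.family = OtherFamily(state.family);
    ++state.attempts;
    status = resolve(state.family);
  }
  assert(state.attempts <= 2);  // invariant: one initial try plus at most one retry
  return status;
}
```

With a resolver that only succeeds for IPv4, a lookup that starts on IPv6 with the retry flag set succeeds after two attempts; without the flag it stops after the first synchronous failure. Clearing the flag before retrying keeps the sketch from bouncing between families indefinitely.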


Attachment #9220689 - Attachment is obsolete: true
Pushed by ddamjanovic@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/52530d6a6a2b
Make API to start a HTTP test server on a IPv6 address. r=necko-reviewers,kershaw
https://hg.mozilla.org/integration/autoland/rev/1d222151411a
Make sure to retry if AsyncResolveNative fails r=necko-reviewers,kershaw

It looks like on Mac, a server listening on a local IPv6 address is not reachable. This patch introduces a server on an IPv6 address; we did not have those before. It works locally. I will disable the new test on Mac.

Flags: needinfo?(dd.mozilla)
Pushed by ddamjanovic@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/090bf8bb67b2
Make API to start a HTTP test server on a IPv6 address. r=necko-reviewers,kershaw
https://hg.mozilla.org/integration/autoland/rev/d693453c66ad
Make sure to retry if AsyncResolveNative fails r=necko-reviewers,kershaw

Backed out 2 changesets (Bug 1700968) for causing xpcshell failures in test_prefer_address_version_fail_trr3_1
Backout link: https://hg.mozilla.org/integration/autoland/rev/db63fba98dabc8d6af1e09d9d1da486576588460
Push with failures, failure log.

Flags: needinfo?(dd.mozilla)
Pushed by ddamjanovic@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fec6e7e6278f
Make API to start a HTTP test server on a IPv6 address. r=necko-reviewers,kershaw
https://hg.mozilla.org/integration/autoland/rev/06495cee5ce2
Make sure to retry if AsyncResolveNative fails r=necko-reviewers,kershaw
Flags: needinfo?(dd.mozilla)

Disabling the test on the socket process. This will be fixed in a separate bug, since the socket process project is on hold now.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 90 Branch
Flags: qe-verify+

Hello,

I can't seem to reproduce this on Mac OS 11.4 ARM on Fx Nightly (BuildID: 20210325085523).

@WilliamDavis, can you please confirm that this issue is fixed for you? Here is a link to the 90.0b5 build.

Flags: needinfo?(daviswill048)

The issue seems to be fixed for me

Flags: needinfo?(daviswill048)

Marking the bug as verified per reporter.

Status: RESOLVED → VERIFIED
Flags: qe-verify+