Closed Bug 1786242 Opened 2 years ago Closed 2 years ago

Intermittent loss of connectivity and UI failures on current nightly.

Categories

(Core :: Performance, defect)

Firefox 105
defect

Tracking

()

VERIFIED FIXED
106 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox-esr102 --- unaffected
firefox104 --- unaffected
firefox105 + disabled
firefox106 --- verified

People

(Reporter: syskin2, Unassigned)

References

(Regression)

Details

(Keywords: hang, regression)

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0

Steps to reproduce:

Ever since the 20220813 nightly, the browser stops working somewhere between 0 and 30 minutes after startup. What I see is this:

  • a specific domain no longer has any networking (things just spin forever). Other domains still work.
  • all open tabs of that domain have their favicon blank
  • the URL autocomplete either stops showing at all, or shows but stops updating as I type

In addition, several other UI elements no longer work

  • the Downloads panel button no longer opens the panel
  • the message "Youtube is now fullscreen" appears but never goes away

All of it is intermittent, and sometimes clears itself up after a few minutes.

I see it on two different PCs (both on Windows 10). It happens without any extensions present. There is one other person on the MozillaZine forums who reports the same problems, but others do not.

I hate bugs like this but I can't not file it....

I would appreciate any suggestions for how to debug this further.

Component: Untriaged → Performance
Product: Firefox → Core

Small correction: after the problem appears, all new tabs have their favicon either spinning or blank for minutes.

If I try to observe the traffic using F12, the panel does not update (even if I successfully refresh the page).

However, after a few minutes, the panel sometimes recovers and then it always shows favicon requests, attributed to FaviconLoader.jsm:186, with the result NS_BINDING_ABORTED.
To clarify, this is a symptom of a wider problem, but maybe it will help. Or maybe this is normal and I'm barking up the wrong tree.

Attached image youtube-example.PNG

This is an example of what happens once a domain is "broken". Here, the YouTube subscriptions page is doing its best to download the thumbnails of all the videos, but every request transfers nothing and the thumbnails remain blank. The tab itself has a spinner and has been spinning like that for the last 20 minutes.

If I copy the URL of any thumbnail and paste it into a new tab, I get the picture immediately. So the domain is not completely dead across all tabs -- but it is dead across all youtube.com tabs. It's as if requests were being blocked based on the combination of the page's domain and the target domain.

At this point I am in Troubleshoot Mode, so definitely no extensions. Also, earlier today I moved my browser cache aside as a backup, so this is a completely fresh cache database.

QA Whiteboard: [qa-regression-triage]

The bug has a release status flag that shows some version of Firefox is affected, thus it will be considered confirmed.

Status: UNCONFIRMED → NEW
Ever confirmed: true

Does this resemble the issues we've been seeing intermittently in CI?

Flags: needinfo?(hskupin)

It's as if it was blocked based on both website's domain and target domain

Heh, sounds like cookie isolation logic, doesn't it? I should have noticed this sooner.
Although the problem is intermittent, I can no longer see it after switching off cookie isolation.

So, to reiterate:
STR:

  1. Activate Cookie Isolation
  2. Use the browser for more than 10 minutes (very random)

Result: pain and suffering and lots of spinners
Expected: all works

Regression range:
20220812 works
20220813 broken

My apologies. After disabling cookie isolation it took an unprecedented 2 hours for problems to appear, but appear they did. It must have been random chance :(

I'm also experiencing this issue. I've been discussing it extensively on Slack, but we have yet to learn much about the problem. It doesn't reproduce on my machine with a clean profile, so I'm in the process of reproducing my old profile in a new one. If I'm eventually able to reproduce the problem in the new profile, I'll post an update.

Tracking this for 105, though it isn't clear to me that it's affecting anything other than Nightly builds. It will be very interesting to see if we get more reports as Beta 105 rolls out more widely.

I've also experienced a whole bunch of other symptoms that I'll list. They are all intermittent.

  • When I hit Alt to show the menu bar (this is Windows) and then click something else or press Alt again, the menu bar doesn't hide itself again afterwards.
  • The awesome bar doesn't update or doesn't show at all.
  • Shutdown Hangs: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
  • The browser will just refuse to navigate. Often I'm trying to navigate to an about: page, but it won't even try to load it. Sometimes switching to another tab and switching back resolves the issue.
  • Unfinished search terms will be sent to the search engine from the URL bar. For example, I'll type "githib<backspace><backspace>ub rebase" and hit enter and I'll end up with a search for "githib".
  • I'll mouse over a link but no status panel appears.
  • The sound indicator on tabs will break. For example, a Slack notification usually makes the audio indicator on that tab appear only briefly. But lately it often appears and never leaves.
  • If I can get the Download panel open, clicking on the downloaded file doesn't open the file.
  • Tabs will appear to me to be done loading, but the loading indicator will be displayed forever.
  • I use the Gmail feature that tells you how many unread emails via the favicon, and it frequently just gets stuck.

:mconley was looking at a profile that I made when I was experiencing the issue at one point and made this interesting observation:

So one kind bizarre thing kinda jumps out at me from the profile so far - a ton of requests for this gmail thing called cleardot.gif all seem to resolve around the same time ... some of those requests are over 1000 seconds old.

Things that I've tried:

  • Disabling all extensions.
  • Memory check. No errors detected.
  • mozregression. The issue is intermittent enough that I'm not 100% sure that I got the right commit, but this is what I ended up with.

The intermittency of the issue makes tracking the problem down really annoying. I think I've been experiencing it more frequently than anyone else I know of (just the bug reporter and :aminomancer). But even so, sometimes the issue will just decide that it's done for the day and won't reproduce again until tomorrow. I do shut down every day, but a simple reboot does not necessarily make the issues come back.

I made an "anonymized" copy of my profile to send to :mconley and after purging addons, history, cache, bookmarks, etc, I was still able to reproduce the problem, but it didn't seem as bad. Like when I would double tap Alt and the menu bar would "stick", but would eventually hide itself properly.

When trying to reproduce the issue (say, for mozregression), I had pretty good success with the following steps:

  1. Launch browser.
  2. Hit Alt a few times and see if the menu bar gets "stuck".
  3. Start a youtube video playing. (If it ends while I'm still testing, I'd start a new one)
  4. Open some sites that link to lots of other sites (ex: News Aggregator like Hacker News) and repeat the following for maybe 10 minutes:
    a. Open a dozen tabs
    b. Tap Alt a few times to see if the menu bar "sticks".
    c. Open each of the new tabs, scroll down a bit, close it, repeat until all new tabs are closed.
    d. Tap Alt a few times to see if the menu bar "sticks".

If I ever saw the menu bar stick, I would consider the issue reproduced. If I went for 10 minutes and it never stuck, I would assume I was on a "good" version.

Oh, I should also note that the problems typically happen together. For example, if I saw a favicon fail to load, I would immediately try tapping Alt, and it would always stick at that point.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #4)

Does this resemble the issues we've been seeing intermittently in CI?

I don't know. Maybe? I just saw the same in Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:106.0) Gecko/20100101 Firefox/106.0 ID:20220825222149, right after I updated from an older Firefox Nightly build. When I tried to file a new issue on GitHub, Firefox hung for a while and then stopped updating the UI. Once the hang was over I was able to switch tabs, but the UI didn't get updated anymore - only the title of the page got updated in the title bar.

It feels like something is definitely horribly broken in Nightly these days, but I'm not sure if this is related to bug 1784591. At the least, let's add this bug to the See Also field.

Flags: needinfo?(hskupin)
See Also: → 1784591
See Also: → 1786388

I'm linking my bug 1785209, because I think it's actually the same issue and my bug has no dev activity.

I know it can be hard to find, but a regression range would be really useful here.
(https://mozilla.github.io/mozregression/ is a possibly useful tool)

I know it can be hard to find, but a regression range would be really useful here.

My regression window is still:
20220812093714 good
20220813092239 bad

This is consistent with the tighter window from :bytesized in comment 9.

If there are builds between 20220812093714 and 20220812214215 that would narrow it down further, I will gladly test them.
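
For reference, bisecting this window by build date with mozregression should look roughly like the sketch below (the dates come from the good/bad nightlies above; since the bug is intermittent, each build mozregression offers needs the ~10-minute manual check from comment 9 before answering good or bad):

  mozregression --good 2022-08-12 --bad 2022-08-13

mozregression then narrows the range into individual autoland pushes on its own, prompting for a good/bad verdict after each build it launches.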

OK, it took only a few minutes and I can 100% confirm that 20220812214215 already fails.
I can also 100% confirm that 20220812093714 is good, because it's my fallback whenever I need a working browser.

So, I concur with :bytesized's comment 9.

(In reply to Radek 'sysKin' Czyz from comment #14)

My regression window is still:
20220812093714 good
20220813092239 bad

This corresponds to

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=9ce1bc0acf1544186bb85b062417d2a0a65efdb3&tochange=cbd753d186199d816e1d097631573f601932b96e

(In reply to Kirk Steuber (he/him) [:bytesized] from comment #9)

Bug 1777497 doesn't back out cleanly, but the other bugs in this range do. Here are some Try pushes to test:

I guess that if all 3 of these are still broken, that kinda points the finger at bug 1777497 by process of elimination.

If desired, these builds can be tested directly with mozregression: mozregression --repo try --launch <rev>

I have been using the first build (Bug 1486949 backed out, SourceStamp=e394995277) since it was built, with no issues so far. I don't want to jump to any conclusions and will test the other two builds for sanity later today.

Mostly writing to say that (1) testing is happening and (2) there's some evidence it's the TextStreams bug.

I have bad news :(

All three test builds are OK. In desperation I got the latest nightly instead (20220829094551) and got the problem in minutes.

Is there any common difference between nightlies and those three builds?

mozregression doesn't give a more precise regression range?

Kirk's range starts with a busted push and ends with its backout, so it's not surprising that it couldn't bisect further than that.

One notable difference between shippable builds created on Try and "real" ones from m-c is that Try builds have the update channel set to nightly-try instead of nightly. If we have code checking the update channel, that could potentially explain why Try builds don't reproduce.

This default can be changed by editing the file at firefox/defaults/pref/channel-prefs.js. So please download such a try build, install/unpack it, modify the file to refer to just nightly, and then start Firefox. Maybe this helps to get it reproduced?
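
For reference, a sketch of what that edit would look like - channel-prefs.js is a plain prefs file, and the exact channel string shipped in a given try build may differ from this example:

  // firefox/defaults/pref/channel-prefs.js (sketch; shipped value may differ)
  // A try build ships something like:
  //   pref("app.update.channel", "nightly-try");
  // Change it to mimic a real Nightly:
  pref("app.update.channel", "nightly");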

I'll try the same for bug 1784591.

Does this still reproduce if you set toolkit.content-background-hang-monitor.disabled to true in about:config? Note that a restart will be needed for the change to take effect.

Flags: needinfo?(syskin2)
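
(For anyone following along: the same pref can also be set ahead of time by dropping a user.js into the profile directory - a sketch, equivalent to flipping it in about:config:)

  // user.js in the profile directory (sketch; equivalent to setting it in about:config)
  user_pref("toolkit.content-background-hang-monitor.disabled", true);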

I have discovered a way to reproduce this issue in a clean profile which, as I mentioned in Comment 7, I could not do before.

  1. Launch browser.
  2. Navigate to about:config and change privacy.resistFingerprinting to true.
  3. Restart the browser.
  4. Hit Alt a few times and see if the menu bar gets "stuck".
  5. Start a youtube video playing. (If it ends while still testing, start a new one)
  6. Open some sites that link to lots of other sites (ex: News Aggregator like Hacker News) and repeat the following for maybe 10 minutes:
    a. Open a dozen external links in tabs.
    b. Tap Alt a few times to see if the menu bar "sticks".
    c. Open each of the new tabs, scroll down a bit, close it, repeat until all new tabs are closed.
    d. Tap Alt a few times to see if the menu bar "sticks".

Do the others that can reproduce this have privacy.resistFingerprinting enabled?

(In reply to Ryan VanderMeulen [:RyanVM] from comment #24)

Does this still reproduce if you set toolkit.content-background-hang-monitor.disabled to true in about:config? Note that a restart will be needed for the change to take effect.

Yes, yes it does. Tested with 20220829094551.

(In reply to Kirk Steuber (he/him) [:bytesized] from comment #25)

Do the others that can reproduce this have privacy.resistFingerprinting enabled?

Actually it seems to be false for me, on both affected PCs.

Flags: needinfo?(syskin2)

What about testing the shippable builds from when the revisions in the regression range landed on autoland?

Here's the Treeherder link for the regression range from Kirk (a subset of the range from sysKin): https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&fromchange=a764edc8c73da2b182e11a9970090e70c8ed3f26&tochange=eecfa46043c20f261e0f6e825609fcd658d49ea3&searchStr=windows%2Cshippable%2Cbuild

And you can also run them via mozregression --repo autoland --launch <rev>

What about testing the shippable builds from when the revisions in the regression range landed on autoland?

I am doing all I can, but I can't see the bug with any Treeherder builds - not builds in that range, and not random builds from today either. I even switched the update channel in firefox/defaults/pref/channel-prefs.js to nightly, with no effect.

I seriously don't understand how this is possible.

Out of morbid curiosity I switched to win32 builds, and it's the same thing: the nightly (20220829094551) shows the bug in the first minute of running, while the Treeherder build (20220831045044, SourceStamp=11e997d3cf) was running fine for hours.

Since bug 1784591 started around the same time, does https://bugzilla.mozilla.org/show_bug.cgi?id=1784591#c53 make any difference here?

Playing Frankenstein, I started sewing together Nightly builds and very close Treeherder builds, and concluded that in order to reproduce the bug I definitely need omni.ja/modules/AppConstants.jsm to contain the line MOZ_UPDATE_CHANNEL: "nightly".

If it says nightly-autoland or nightly-try, like it does in the Treeherder builds, I have no problems at all.

This means that comment 22 was on to something big, while comment 23 - which asked to change the update channel using firefox/defaults/pref/channel-prefs.js - was not related. Somehow.

The good news is that this means I can start testing the Treeherder builds. The bad news is that I already tested all three test builds from comment 17 and they all failed (i.e. they all have the problem).
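
As an aside for anyone repeating this kind of build surgery: a quick way to check which channel a given build was compiled with, without unpacking omni.ja, is to read AppConstants from the Browser Console (a sketch, assuming the JSM import path used by 2022-era builds):

  // Browser Console (Ctrl+Shift+J): check the compiled-in update channel
  const { AppConstants } = ChromeUtils.import("resource://gre/modules/AppConstants.jsm");
  console.log(AppConstants.MOZ_UPDATE_CHANNEL); // e.g. "nightly", "nightly-try", "nightly-autoland"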

(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #30)

Since bug 1784591 started around the same time, does https://bugzilla.mozilla.org/show_bug.cgi?id=1784591#c53 make any difference here?

I believe this was asked in comment 24, with a negative result in comment 26.

Heh everyone, got it!

BAD: 9ef37c6da56b Bug 965392 - Use a timer instead of a condvar to run the BHMgr Monitor thread, r=dthayer.
GOOD: 18fce6e2b6f6 Bug 1783416 - Skip test_basic.js on coverage builds due to permafailure. r=intermittent-reviewers,jmaher DONTBUILD

This means that:

  1. It is the BHMgr Monitor thing that everyone is asking about
  2. toolkit.content-background-hang-monitor.disabled does NOT prevent the problem
  3. omni.ja/modules/AppConstants.jsm needs to say MOZ_UPDATE_CHANNEL: "nightly", or you don't see the bug (and you need to wipe the startupCache for the change to take effect)

Good news: the backout of bug 965392 appears to have resolved the timeouts we were seeing in CI in bug 1784591. Radek, this backout will be in the next scheduled nightlies shipping in 7-8 hours, but if you want to get a jump start on testing, the build below has the backout:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/XdZ1AQN6Spe3g-lFRUbf-w/runs/0/artifacts/public/build/target.zip

Thanks for all your help testing!

Flags: needinfo?(syskin2)

I believe that privacy.resistFingerprinting=true has some significance here. I noticed that tests with fingerprinting resistance enabled kept hanging on Windows 10. I did a git bisect, testing a random test (browser/base/content/test/general/browser_documentnavigation.js) that should pass with it enabled and disabled. The same commit (https://hg.mozilla.org/mozilla-central/rev/9ef37c6da56b) was the first bad revision where the test started to time out with fingerprinting resistance enabled, but not when disabled. Reverting that change reliably solved this issue for me.
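
For anyone wanting to repeat that locally, running just that test with the pref flipped should look roughly like the line below (a sketch; it assumes the mochitest harness's --setpref option):

  ./mach mochitest --setpref privacy.resistFingerprinting=true browser/base/content/test/general/browser_documentnavigation.js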

(In reply to Ryan VanderMeulen [:RyanVM] from comment #33)

Radek, this backout will be in the next scheduled nightlies shipping in 7-8hr,

After using it for a full working day, I can't see any problems :)

Thanks for all your help testing!

Thank you and everyone for finding the culprit!

Flags: needinfo?(syskin2)

This is great to hear, Radek, and thanks for your help! As such, I'm going to mark this bug as fixed now.

(In reply to matc from comment #34)

I believe that privacy.resistFingerprinting=true has some significance here. I noticed that tests with fingerprinting resistance enabled kept hanging on Windows 10. I did a git bisect, testing a random test (browser/base/content/test/general/browser_documentnavigation.js) that should pass with it enabled and disabled. The same commit (https://hg.mozilla.org/mozilla-central/rev/9ef37c6da56b) was the first bad revision where the test started to time out with fingerprinting resistance enabled, but not when disabled. Reverting that change reliably solved this issue for me.

This is also pretty helpful information, and I used the steps as identified to create a small Marionette test that easily reproduces this problem. See bug 965392 comment 21 for details.

Status: NEW → RESOLVED
Closed: 2 years ago
Regressed by: 965392
Resolution: --- → FIXED
Target Milestone: --- → 106 Branch

Set release status flags based on info from the regressing bug 965392

Flags: qe-verify+

I was able to consistently reproduce this issue on Firefox 105.0a1 (2022-08-22) on Windows 11 by following the info provided in comment 9 and comment 25. I also tried reproducing on Ubuntu 22, but it seems that it is not affected at all.

The issue is fixed on Firefox 106.0b9 on the same system.

Status: RESOLVED → VERIFIED
Flags: qe-verify+