Open Bug 1687675 Opened 3 years ago Updated 10 months ago

setTimeout function stops working in tab, does not work even after reload or switching to another webpage

Categories

(Toolkit :: Content Prompts, defect)

Firefox 84
defect

Tracking

()

ASSIGNED
Tracking Status
firefox-esr78 --- wontfix
firefox86 --- wontfix
firefox87 --- fix-optional

People

(Reporter: czechowski, Assigned: enndeakin)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0

Steps to reproduce:

I maintain https://jwinf.de/. A few days ago, users started reporting that the tasks on https://jwinf.de/ occasionally stop working as intended and remain unusable until Firefox is restarted. Apparently the bug occurs when you spend a long time (more than an hour) in a contest / training contest. However, I am not sure whether a long time with an open contest is actually necessary or just raises the probability of the bug occurring.

Yesterday this bug occurred to me too, so I investigated.

Actual results:

It turned out that the tasks were no longer working because a call to setTimeout stopped setting a timeout – or the timeout event was never fired – or the event handler was never called.

A basic test page I set up can show that the setTimeout function is no longer working: https://webtest.bwinf.de/timeouttest/
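A probe along the following lines is one way to check a tab for this symptom. It is a minimal sketch, not the code behind the test page above; `createTimeoutProbe` and its parameters are hypothetical names. Because a second timer cannot be trusted in an affected tab, the status is meant to be read from an event handler such as a button click.

```javascript
// Hypothetical helper: schedules one timer and lets the caller ask later
// whether it ever fired. `schedule` and `now` are injectable for testing.
function createTimeoutProbe(delayMs, schedule = setTimeout, now = Date.now) {
  const startedAt = now();
  let fired = false;

  schedule(() => {
    fired = true;
  }, delayMs);

  // Call this from a click handler, not from another timer.
  return function status(graceMs = 1000) {
    if (fired) return "fired";
    const elapsed = now() - startedAt;
    return elapsed > delayMs + graceMs ? "overdue" : "pending";
  };
}
```

On a page, one would create the probe once with `createTimeoutProbe(500)` and display the result of the returned function when a button is clicked. A result of "overdue" in a tab, together with "fired" in a fresh tab, would match the behaviour described here.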

I made a screen recording showing the use of this test page in the broken tab. It can be found here: https://www.youtube.com/watch?v=a8uwab0kDMU

Apparently, the "brokenness" is restricted to the tab. Opening the page in a new tab does not show the problem. The bug persists even after reloading the page, reloading the page without caching and after changing to other pages.

Expected results:

setTimeout should not stop working.

Even if there is a bug that makes setTimeout unusable, all functionality should work again after a page reload or a switch to another page.

Component: Untriaged → DOM: Core & HTML
Product: Firefox → Core

Smaug, any ideas?

Flags: needinfo?(bugs)

A basic test page I set up can show that the setTimeout function is no longer working: https://webtest.bwinf.de/timeouttest/

How is one supposed to reproduce the issue with that? It isn't quite clear from the video.
Is there any other information you could give?

Flags: needinfo?(bugs) → needinfo?(czechowski)

No, sorry, the test page serves only as an indicator / proof that the bug is currently occurring.

It is rather hard to reproduce. (Apparently you need to spend a long time on a contest on https://jwinf.de.)

If that would help, I could try to generate a memory image of the Firefox process when I get to reproduce this. Is there a certain format for memory images that would help you dissect the issue? (I.e. would just a gcore dump be sufficient?)

By accident, I now found a quite easy way to reproduce the bug:

Step 1: Go to https://jwinf.de/contest/1

Step 2: Select any task, e.g. https://jwinf.de/task/1

Step 3: Now a bit of timing is required: You'll see the task environment appear for a short moment and then disappear for about one second. During this second, if you click on any link in the top bar (one of "Einführungsaufgaben", "🡅 Übersicht", "🡆 Nächste Aufgabe", "🡄 Vorherige Aufgabe"), you will very likely trigger the bug. (Works every time for me.)

You will briefly see an alert box ("Laden fehlgeschlagen", meaning "load failed") that is created by the webpage, before the new page is loaded. After that, the tab is broken. The tasks on our website no longer work. You can use the page https://webtest.bwinf.de/timeouttest/ to confirm that the tab is now indeed broken and will no longer fire any timers, even from other web pages and even after reloading.

Flags: needinfo?(czechowski)

(In reply to Robert Czechowski from comment #4)

By accident, I now found a quite easy way to reproduce the bug:

Step 1: Go to https://jwinf.de/contest/1

Step 2: Select any task, e.g. https://jwinf.de/task/1

Step 3: Now a bit of timing is required: You'll see the task environment appear for a short moment and then disappear for about one second. During this second, if you click on any link in the top bar (one of "Einführungsaufgaben", "🡅 Übersicht", "🡆 Nächste Aufgabe", "🡄 Vorherige Aufgabe"), you will very likely trigger the bug. (Works every time for me.)

You will briefly see an alert box ("Laden fehlgeschlagen", meaning "load failed") that is created by the webpage, before the new page is loaded. After that, the tab is broken. The tasks on our website no longer work. You can use the page https://webtest.bwinf.de/timeouttest/ to confirm that the tab is now indeed broken and will no longer fire any timers, even from other web pages and even after reloading.

I've tried 10-20 times to reproduce this, and managed once. The timing of the click seems to be critical. It seems one has even less than a second. I did, very briefly, see an alert box, but couldn't even read the text because it disappeared after less than a second.

Happened with Firefox 85.0.1 on Ubuntu 18.04.

Status: UNCONFIRMED → NEW
Ever confirmed: true

:Robert Czechowski: it would be interesting to know if you could ever reproduce this with other browsers.

If this is a regression, whoever can reproduce, could you please try to find the regression range. That would be super helpful.
https://mozilla.github.io/mozregression/ can be useful.

In the original testcase are there possibly synchronous XMLHttpRequests somewhere in the page, or alert()?
And is the 'alert-box' in the comment 4 browser's alert() ?

Flags: needinfo?(czechowski)

(In reply to Mirko Brodesser (:mbrodesser) from comment #5)

(In reply to Robert Czechowski from comment #4)

Step 3: Now a bit of timing is required: You'll see the task environment appear for a short moment and then disappear for about one second. During this second, if you click on any link in the top bar (one of "Einführungsaufgaben", "🡅 Übersicht", "🡆 Nächste Aufgabe", "🡄 Vorherige Aufgabe"), you will very likely trigger the bug. (Works every time for me.)

You will briefly see an alert box ("Laden fehlgeschlagen", meaning "load failed") that is created by the webpage, before the new page is loaded. After that, the tab is broken. The tasks on our website no longer work. You can use the page https://webtest.bwinf.de/timeouttest/ to confirm that the tab is now indeed broken and will no longer fire any timers, even from other web pages and even after reloading.

I've tried 10-20 times to reproduce this, and managed once. The timing of the click seems to be critical. It seems one has even less than a second. I did, very briefly, see an alert box, but couldn't even read the text because it disappeared after less than a second.

Ah, yes, apparently the one-second delay was caused by a bug in our backend that made one request take a long time to load. That backend bug has now been fixed, so it is a bit harder again to provoke the Firefox bug.

However, now that I know how to provoke it, I will be setting up a test page that does it. This will make it much easier to make the bug appear. Probably won't manage to do that before Monday, though.

(In reply to Mirko Brodesser (:mbrodesser) from comment #6)

:Robert Czechowski: it would be interesting to know if you could ever reproduce this with other browsers.

So far I only tested Chromium / Chrome, there the bug does not seem to appear.

(In reply to Olli Pettay [:smaug] from comment #8)

In the original testcase are there possibly synchronous XMLHttpRequests somewhere in the page, or alert()?
And is the 'alert-box' in the comment 4 browser's alert() ?

As far as I know, there are no synchronous XMLHttpRequests, but I will look that up to be sure! There is an alert(), yes, and the 'alert-box' is the browser alert().

(In reply to Olli Pettay [:smaug] from comment #7)

If this is a regression, whoever can reproduce, could you please try to find the regression range. That would be super helpful.
https://mozilla.github.io/mozregression/ can be useful.

Ah, thanks for the link! I did not know this page existed. (Probably should have looked for this better!) I will test this! But again, probably won't manage to do that before Monday.

Okay, I now set up a test page that can cause the bug quite reliably:

Step 1: Go to https://timeout.test.bwinf.de/contest/39

Step 2: Go to https://timeout.test.bwinf.de/task/204 (or any task from the previous page)

Step 3: Wait about 1 second for the page to load completely. (You will see some graphics appear briefly and then almost immediately disappear again. As soon as the graphics have disappeared, step 4 will work.)

Step 4: Click on the link with the text "Click me within about 10 seconds." on the top of the page within 10 seconds after loading the page.

Using mozregression, I found that the last good build is nightly 2019-07-30 and the first bad build is nightly 2019-07-31:

good: https://archive.mozilla.org/pub/firefox/nightly/2019/07/2019-07-30-21-53-16-mozilla-central/firefox-70.0a1.en-US.linux-x86_64.tar.bz2
bad: https://archive.mozilla.org/pub/firefox/nightly/2019/07/2019-07-31-21-55-44-mozilla-central/firefox-70.0a1.en-US.linux-x86_64.tar.bz2

Flags: needinfo?(czechowski)

mozregression should report the regression range, basically a link to the changesets.
Something like
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=...
Could you perhaps paste that link? It would be very helpful.

And thanks a ton for running mozregression!

I can reproduce the issue with the STR from comment #10 in Nightly 87.0a1 on Windows 10 and Ubuntu 20.04.

Regression window(slightly narrowed from comment#12):
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=1e64b8a0c546a49459d404aaf930d5b1f621246a&tochange=b0124f06562982dce60b820d95aad23afd5cec90
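A link of this shape can also be assembled by hand from the two changeset hashes. The following is a small sketch in which `pushlogUrl` is a hypothetical helper; the URL layout is simply copied from the link above.

```javascript
// Hypothetical helper: builds a pushlog link for a regression range.
// repoPath is the repository path on hg.mozilla.org, e.g. "mozilla-central".
function pushlogUrl(repoPath, fromChange, toChange) {
  const url = new URL(`https://hg.mozilla.org/${repoPath}/pushloghtml`);
  url.searchParams.set("fromchange", fromChange);
  url.searchParams.set("tochange", toChange);
  return url.toString();
}
```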

Has Regression Range: --- → yes
Has STR: --- → yes
Keywords: regression

Thanks all!
Bug 1555711 looks very suspicious.

I assume Window Actor is destroyed while timeouts are still suspended.

Flags: needinfo?(mconley)
Flags: needinfo?(enndeakin)
Regressed by: 1555711
Severity: S2 → --
Component: DOM: Core & HTML → Notifications and Alerts
Product: Core → Toolkit

I built Firefox on ubuntu locally.
And I confirm that this is regressed by Bug 1555711.

Regression window(w/ local builds)
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=9ddbcd39c9c74fbfc37fc9d745d5e136bf9b0672&tochange=5332d6ce3bb5a92d244ea58812e86f02dff9e1fe

Assignee: nobody → enndeakin
Status: NEW → ASSIGNED
Flags: needinfo?(enndeakin)
Flags: needinfo?(mconley)

The page here has a child frame that calls alert() which shows the dialog box. It then redirects the top-level to another page which then has a setTimeout call. The loading of the new page should close the alert dialog box, but it doesn't and the window believes the modal state to be active.

enterModalState is called two times. The first time is at https://searchfox.org/mozilla-central/rev/26330a08b1f9d06938faa0aa5e0f8c7a58064aa2/toolkit/components/prompts/src/Prompter.jsm#1146 on the iframe's window. The second time is at https://searchfox.org/mozilla-central/rev/26330a08b1f9d06938faa0aa5e0f8c7a58064aa2/browser/actors/PromptParent.jsm#278 for the iframe's parent window.

leaveModalState is only called once, for the iframe.
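The imbalance can be illustrated with a toy model. This is a simplified sketch that assumes the modal state behaves like a depth counter, that timers stay suspended while the depth is above zero, and that both enterModalState calls end up on the same top-level window. `ToyWindow` is a hypothetical class, not Gecko code.

```javascript
// Toy model (assumption): modal state as a depth counter on one window.
class ToyWindow {
  constructor() {
    this.modalDepth = 0;
  }
  enterModalState() {
    this.modalDepth += 1;
  }
  leaveModalState() {
    if (this.modalDepth > 0) this.modalDepth -= 1;
  }
  // In this model, timers only run when no modal state is active.
  get timersSuspended() {
    return this.modalDepth > 0;
  }
}

const topWindow = new ToyWindow();
topWindow.enterModalState(); // first call, made for the iframe's window
topWindow.enterModalState(); // second call, made for the iframe's parent window
topWindow.leaveModalState(); // the only matching leave call

console.log(topWindow.timersSuspended); // true: one level is never left
```

In this model a second `leaveModalState()` call brings the depth back to zero and timers resume.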

PromptParent.jsm seems to support two types of prompts: content prompts and tab-dialog-box prompts. The former adds dialog boxes that are opened to a map gBrowserPrompts and then forces them closed when the prompt actor is destroyed. The latter type of prompt does not do this. If it did, I think it should cancel the prompt and leave the modal state when the page is unloaded.

However, fixing that won't be enough. When nsGlobalWindowOuter::LeaveModalState() is called during actor destruction (actor's didDestroy), it fails because topWin is null, so it seems that it is too late to call leaveModalState then.

Going to ask Gijs if he knows whether both prompts should be cancelling the modal state when the actor is destroyed.

And, going to ask smaug about topWin being null and whether we should be calling LeaveModalState at some other time.

Flags: needinfo?(gijskruitbosch+bugs)
Flags: needinfo?(bugs)

(In reply to Neil Deakin from comment #17)

Going to ask Gijs if he knows whether both prompts should be cancelling the modal state when the actor is destroyed.

Can you clarify what you mean by "both prompts"? There are 5 dialog implementations at this point:

  1. window modal prompts
  2. "internal" window modal prompts (behind a pref, on for nightly)
  3. tab modal dialogs (things like the print dialog, http auth dialog, and the insecure form submission warning) that are shown by Firefox rather than directly via webpage APIs
  4. content prompts (like alert and beforeunload) implemented using similar architecture as (3) (behind a pref, on for early beta or earlier right now)
  5. content prompts (ditto), but implemented using tabmodalprompt / tabprompts.jsm (default dialog type on release and late beta)

From comment #16 I can't tell if you're talking about 3, 4 or both. Please clarify, and then bounce the needinfo to :pbz for the answer to your question. :-)

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(enndeakin)

I am referring to the popups opened by openPromptWithTabDialogBox at https://searchfox.org/mozilla-central/rev/63fcc3f1a2cc73488d8986f4cf91fce2cd4b7564/browser/actors/PromptParent.jsm#251

I'm not clear on what the difference is between all the prompts, so I'm not sure how to proceed here.

The prompts opened with openContentPrompt seem to handle the actor being destroyed, but the ones opened with openPromptWithTabDialogBox do not. pbz, see comment 16 for more details.

Flags: needinfo?(enndeakin) → needinfo?(pbz)

When was nsGlobalWindowOuter::LeaveModalState() called before the change?
didDestroy() does sound very late.

Flags: needinfo?(bugs)

I'd expect the onLocationChange handler to clean up the dialog: https://searchfox.org/mozilla-central/rev/63fcc3f1a2cc73488d8986f4cf91fce2cd4b7564/browser/base/content/browser.js#9058
We could keep track of the dialogs opened in PromptParent and close them in didDestroy, but ideally the TabDialogBox which owns the dialogs should clean up.
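The bookkeeping option could look roughly like the following. This is only a sketch of the idea, assuming each dialog exposes a `close()` method; `ToyPromptActor`, `openDialog` and `dialogClosed` are hypothetical names and do not mirror the real PromptParent API.

```javascript
// Toy actor (assumption): remembers the dialogs it opened and closes
// any that are still open when the actor goes away.
class ToyPromptActor {
  constructor() {
    this.openDialogs = new Set();
  }
  openDialog(dialog) {
    this.openDialogs.add(dialog);
    return dialog;
  }
  // Called when a dialog is dismissed normally.
  dialogClosed(dialog) {
    this.openDialogs.delete(dialog);
  }
  // Called when the actor is torn down, e.g. because the page was unloaded.
  didDestroy() {
    for (const dialog of this.openDialogs) {
      dialog.close();
    }
    this.openDialogs.clear();
  }
}
```

As noted earlier in this bug, closing dialogs at this point may already be too late for leaving the modal state, so the sketch only covers the tracking part.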

Flags: needinfo?(pbz)

(In reply to Olli Pettay [:smaug] from comment #20)

When was nsGlobalWindowOuter::LeaveModalState() called before the change?
didDestroy() does sound very late.

I can't get a build working from back then, so I'm not fully sure. However, looking at the patch from bug 1555711, it only calls enterModalState once, for the child iframe. Post-patch, enterModalState is called twice, once for the iframe and once for the top-level browser. leaveModalState is called twice, as well, but fails one of those times with a null top window, so the modal state is never exited.

Note that the child iframe is in the same process as its parent window, so the top-level window that goes into the modal state is the same in both cases. If I manually just call leaveModalState two times in a row, the bug goes away.

The question is: should the prompt code simply check whether the document is already in a modal state before entering it again? That also fixes the bug, but is there a problem with doing this? It looks like only the prompt code calls browser.leaveModalState().

This patch just doesn't enter the modal state again. Seems to work, but it feels a bit off to do it this way.
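In terms of a simplified depth-counter model, the approach amounts to a guard like the one below. The sketch assumes the modal state behaves like a counter; `ToyBrowser` and `enterModalStateOnce` are hypothetical names and this is not the attached patch.

```javascript
// Toy model (assumption): modal state as a depth counter.
class ToyBrowser {
  constructor() {
    this.modalDepth = 0;
  }
  get inModalState() {
    return this.modalDepth > 0;
  }
  enterModalState() {
    this.modalDepth += 1;
  }
  leaveModalState() {
    if (this.modalDepth > 0) this.modalDepth -= 1;
  }
}

// Returns true only when this call actually entered the modal state,
// so the caller knows whether it owes a matching leaveModalState().
function enterModalStateOnce(browser) {
  if (browser.inModalState) return false;
  browser.enterModalState();
  return true;
}
```

With the guard, two attempts to enter leave the depth at one, so a single `leaveModalState()` call is enough to return to the non-modal state.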

I can no longer reproduce this bug in Firefox 89. Was this to be expected? Then this bug report can be closed … :)

Ah, I'm sorry, I made a mistake when testing. The bug can still be reproduced!

Anyway, are there plans to fix this in the next months? Otherwise I will start on a workaround for our contests …

Severity: -- → S3
Component: Notifications and Alerts → Content Prompts