Some common form of e10s testing behavior results in opening infinite tabs

RESOLVED INCOMPLETE

Status

()

Core
General
RESOLVED INCOMPLETE
a year ago
5 months ago

People

(Reporter: philor, Unassigned)

Tracking

(Blocks: 1 bug)

unspecified
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(e10s-)

Details

(Whiteboard: [ele:1b])

(Reporter)

Description

a year ago
I'm suspicious of https://hg.mozilla.org/mozilla-central/annotate/7e873393cc11d326338779e5a3ed2da031e30936/browser/base/content/test/general/head.js#l617 both because it seems to be used by the example failures I have, and because any use of setTimeout with a magic number makes me see so much red I can't see anything else.

Examples I know of so far are bug 1264427, bug 1280561, bug 1273375, bug 997635, bug 1183302, bug 1285593, and to have at least one which isn't a plugin test, bug 1282079.

Windows 7 seems to be very very strongly associated with it (I didn't ever notice that these failures were literally "timed out, followed by finding an infinite number of tabs" until bug 1262336 pointed out that the reason releng is now unable to complete reconfigs on the Windows test masters is because we switched to running Win7 tests on AWS VMs, and we're not able to kill the process on them when the infinite tab open/close hits the 3 hour maximum time for a test run), which triggers a vague memory of something about time precision on some versions of Windows. The rare cases where we're still reporting failures on Win7 are cases like https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-win32/1472217917/mozilla-central_win7_vm_test-mochitest-e10s-browser-chrome-5-bm139-tests1-windows-build0.txt.gz where "remote browser crashed while on about:blank" giving us a chance to regain control from the infinite spawning - otherwise as far as I know the AWS instance continues opening tabs and having them closed infinitely, until something like us being outbid for it terminates it.

https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=997635&entireHistory=true&tree=trunk and https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1183302&entireHistory=true&tree=trunk both interest me because they started out as cross-platform failures, and then in March/April became Windows 7 only.
(Reporter)

Comment 1

a year ago
Clipboard is hard: bug 1295446 is the releng bug which pointed out to me that we were spawning infinite tabs.
Blocks: 1295446

Updated

a year ago
tracking-e10s: --- → ?

Updated

a year ago
tracking-e10s: ? → -
Blake, looks like you've been involved in the code that philor's pointing at in #c0. This bug is particularly ugly because it leaves the test machines in a broken state until AWS terminates the spot instance, so it has ramification beyond just the specific test failure itself. Can you please take a look and/or suggest a better owner?
Flags: needinfo?(mrbkap)
I filed bug 1314812 on the setTimeout (and the few copy/pasted versions of it scattered around the tree). That being said, while looking through logs for this bug, the setTimeout only appears to be responsible for other problems (namely that if the load doesn't happen for some reason, we fail the test properly and then during some random later test, the setTimeout fires causing a second test to fail).

I don't think that we're creating an infinite number of tabs. Instead, it appears that we end up with a single tab open at the end of a test that refuses to close. I've stared at the tab-closing code for a long time now and I don't know how it would happen, but if we try to close a tab and fail to for any reason, subsequent calls to close it (that is gBrowser.removeTab) would be no-ops and the code at [1] would spin forever. Could that possibly be the cause of the failed reconfigs? If so, we could probably work around this by detecting that we're not succeeding in closing the unexpected tab and bailing out of the suite entirely. We should also figure out why these tabs aren't closing.

CC'ing Gijs and Felipe who might have insight here as well.

[1] http://searchfox.org/mozilla-central/rev/f5c9e9a249637c9abd88754c8963ecb3838475cb/testing/mochitest/browser-test.js#279-284
Flags: needinfo?(mrbkap)
I'll look my tomorrow, it's 11.20pm now.
Flags: needinfo?(gijskruitbosch+bugs)
Do we have logs of cases where this head.js code treads water that *don't* end with a dead content process? The logs linked from e.g. bug 1280561 don't seem to have either the head.js code treading water or the dead content process - just any old timeouts? It also looks like a significant portion of them have the "received tab load" message so presumably the issues weren't with the setTimeout, which is cleared when the event is received. :-\

Off the top of my head, reasons for the log linked in comment #0 and/or removeTab failing include:

1) actual large number of tabs (seems unlikely, we'd probably have crashed out before then)
2) _beginRemoveTab returns false (indicating "don't close this tab"), so it's a no-op
3) content process is unresponsive so we hang in _beginRemoveTab trying to figure out if we're OK to close the tab.... but that should only last 5 seconds (see bug 1238856) and the loop shouldn't advance while that's happening, so it doesn't really explain these kinds of logs.

Ways of making this potential cause of problems go away / diagnosing include:

- make the failure log message output number of remaining tabs (ie gBrowser.tabs.length)
- pass skipPermitUnload in the removeTab options (arguably a good idea here, but on the flip side, I kind of want to know *why* we're not closing tabs correctly and not just fix it for this automation case leaving Real Users to keep running into it, or something).
Flags: needinfo?(gijskruitbosch+bugs)
(I could potentially say more if I had a better understanding of how to get logs for the cases comment #0 cares about and/or if bug 1314812 fixed these issues - but some of the bugs in comment #0 seem to have stopped being frequent intermittents months ago anyway. So I'm struggling to understand what the "good" and "bad" cases are that we're trying to fix - like, is there a "bad slave" state we get into that breaks infra but doesn't show up in any of those plugin bugs because they die mid-test-run somehow?)
I have noticed a related problem while working on the Responsive Design Mode in DevTools, which may or may not be happening here.  Looking at the general flow of removing tabs:

1. `_beginRemoveTab` is called
 a. The tab is added to `this._removingTabs` array
 b. `TabClose` event is emitted
 c. `aTab._endRemoveArgs` is set
2. `_endRemoveTab` is called
 a. If `!aTab || !aTab._endRemoveArgs`, abort
 b. Remove the tab from `this._removingTabs` array
 c. If `aCloseWindow`, loop over all tabs in `this._removingTabs` array and call `_endRemoveTab`

The issue I saw is that listeners for `TabClose` can trigger arbitrary behavior, including removing a second tab, leading to follow timeline of actions:

1. `_beginRemoveTab` is called for tab 1
2. Tab 1 is added to `this._removingTabs` array
  * NOTE: `_endRemoveArgs` has _not_ been set for tab 1 yet
3. `TabClose` event emitted for tab 1
4. Some `TabClose` listener calls `removeTab(tab2)`
5. `_beginRemoveTab` is called for tab 2
6. `_endRemoveTab` is called for tab 2, close window is true
7. Loop over all tabs in `this._removingTabs` array and call `_endRemoveTab`
  * Here we loop forever, because the `this._removingTabs` contains a tab without `_endRemoveArgs` so it just returns immediately

I tried to adjust the timing of when `_endRemoveArgs` is set, but I couldn't find the precise point to do so without its own negative effects.

(If this is unrelated, sorry for the noise, I'll move it elsewhere.)
(Reporter)

Comment 8

a year ago
I stopped seeing the Win7 runs that just kept going forever some time back, I think when releng started just terminating instances much more quickly than they used to, so I don't have any useful current failures to point out. The bulk of it was plugins, which are pretty much gone. https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1285593&entireHistory=true&tree=all agrees with my gut feeling about it, this problem is falling off the back car of the train now.
(In reply to J. Ryan Stinnett [:jryans] (use ni?) from comment #7)
> I have noticed a related problem while working on the Responsive Design Mode
> in DevTools, which may or may not be happening here.  Looking at the general
> flow of removing tabs:
> 
> 1. `_beginRemoveTab` is called
>  a. The tab is added to `this._removingTabs` array
>  b. `TabClose` event is emitted
>  c. `aTab._endRemoveArgs` is set
> 2. `_endRemoveTab` is called
>  a. If `!aTab || !aTab._endRemoveArgs`, abort
>  b. Remove the tab from `this._removingTabs` array
>  c. If `aCloseWindow`, loop over all tabs in `this._removingTabs` array and
> call `_endRemoveTab`
> 
> The issue I saw is that listeners for `TabClose` can trigger arbitrary
> behavior, including removing a second tab, leading to follow timeline of
> actions:
> 
> 1. `_beginRemoveTab` is called for tab 1
> 2. Tab 1 is added to `this._removingTabs` array
>   * NOTE: `_endRemoveArgs` has _not_ been set for tab 1 yet
> 3. `TabClose` event emitted for tab 1
> 4. Some `TabClose` listener calls `removeTab(tab2)`
> 5. `_beginRemoveTab` is called for tab 2
> 6. `_endRemoveTab` is called for tab 2, close window is true
> 7. Loop over all tabs in `this._removingTabs` array and call `_endRemoveTab`
>   * Here we loop forever, because the `this._removingTabs` contains a tab
> without `_endRemoveArgs` so it just returns immediately
> 
> I tried to adjust the timing of when `_endRemoveArgs` is set, but I couldn't
> find the precise point to do so without its own negative effects.
> 
> (If this is unrelated, sorry for the noise, I'll move it elsewhere.)

I don't know if it's related or not, but it feels like we should fix it anyway - can you file a separate bug?
Flags: needinfo?(jryans)
See Also: → bug 1316126
(In reply to :Gijs Kruitbosch from comment #9)
> I don't know if it's related or not, but it feels like we should fix it
> anyway - can you file a separate bug?

Sure, filed bug 1316126.
Flags: needinfo?(jryans)
Hello, I'm asking your help with an experiment with making decisions on bugs. You've been needinfo'ed on this bug. I'd like you to take one action to help this bug make progress toward a decision. The things you can do include:

* If you know or have a good guess of which product and component this bug belongs to, change the product and component of the bug
* If you know of the right person to ask about this bug, redirect the needinfo to them
* If you cannot reproduce the bug, close it

All we need you to do is one thing that will help us make a decision on the bug or resolve it.

Thank you for your help with this. If you have questions, please contact emma@mozilla.com.
Flags: needinfo?(philip)
Whiteboard: [ele:1b]
(Reporter)

Comment 12

5 months ago
Sigh.
Status: NEW → RESOLVED
Last Resolved: 5 months ago
Flags: needinfo?(philip)
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.