Closed Bug 1218266 Opened 4 years ago Closed 4 years ago

Hardening against AsyncShutdown timeouts in Experiments.jsm

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect, P1, critical)

43 Branch
Unspecified
All
defect

Tracking

(firefox43+ wontfix, firefox44+ wontfix, firefox45 fixed, firefox46 fixed)

RESOLVED FIXED
Firefox 46
Tracking Status
firefox43 + wontfix
firefox44 + wontfix
firefox45 --- fixed
firefox46 --- fixed

People

(Reporter: alice0775, Assigned: gfritzsche)

References

Details

(Keywords: crash, reproducible, testcase, Whiteboard: [measurement:client])

Crash Data

Attachments

(3 files, 1 obsolete file)

[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is 
report bp-3f8fce45-3ea2-487c-9f43-eb8702151026.
=============================================================

Build Identifier:
https://hg.mozilla.org/releases/mozilla-aurora/rev/455307d2ea2cc4ac12249b799ce11102757ad245
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0 ID:20151025004028

On DevEdition, It always crash at shut down with test profile since 2-3 days ago.
And I found a minimum profile to crash DevEdition.

Reproducible: always

Steps To Reproduce:
1. Create New empty profile folder
2. Copy attached files to the profile
3. Start DevEdition with the profile, wait for ~10sec(CPU settles)
4. Close DevEdition with [x] buttun of title bar

Actual Results:
Crash reporter comes up after for a while.

Expected Results:
No crash.
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort]
The crash does not occur on Aurora hourly tinderbox build(i.e., app.update.channel=default). 
It happens only on Aurora nighty build(i.e., app.update.channel=aurora).
Does Telemetry Experiments problem?
Component: Untriaged → Client: Desktop
Product: Core → Firefox Health Report
This looks like an AsyncShutdown abort caused by timing out on "yield this._mainTask", it's very likely to be related to the experiment in bug 1193089
Benjamin: have you seen this AsyncShutdown crash in experiments before?
Flags: needinfo?(benjamin)
Blocks: 1193089
I can reproduce the crash with the STR from comment #0, basically:

- create a new profile using firefox.exe -p
- dump the two files from the 7zip file into the profile
- launch fx, wait 10 seconds or so and close the browser
- wait about 1 minute and the crash reporter will appear

Crashes:
* bp-62498f53-388c-481f-ba06-db1142151026
* bp-64af3973-3eb8-4254-b08d-cb8232151026
* bp-ac554aa4-4248-41a9-b6f0-3bf5b2151026
I reproduced the crash too. I compared the TelemetryExperiment logs generated when this user.js is present in the profile dir vs absent. It seems that the Telemetry Experiment does not finish initialization correctly with this user.js file present.

Healthy initialization (no user.js):
==============================================================
Browser.Experiments.Experiments	TRACE	ExperimentEntry #0::reconcileAddonState()
Browser.Experiments.Experiments	INFO	ExperimentEntry #0::Activating add-on: e10s-enabled-aurora-20151020@experiments.mozilla.org
Browser.Experiments.Experiments	INFO	ExperimentEntry #0::Add-on has been enabled: e10s-enabled-aurora-20151020@experiments.mozilla.org
Browser.Experiments.Experiments	TRACE	Experiments #0::_saveToCache
Browser.Experiments.Experiments	DEBUG	Experiments #0::_saveToCache saved to c:\vladan\compost\as_debug1\experiments.json
Browser.Experiments.Experiments	TRACE	Experiments #0::_main finished, scheduling next run
Browser.Experiments.Experiments	TRACE	Experiments #0::scheduleExperimentEvaluation() - scheduling for 1446336000000, now: 1445898686953
addons.xpi	WARN	Add-on e10s-enabled-aurora-20151020@experiments.mozilla.org is missing bootstrap method shutdown
addons.xpi-utils	WARN	Error: XPI database modified after shutdown began (resource://gre/modules/addons/XPIProvider.jsm -> resource://gre/modules/addons/XPIProviderUtils.js:451:17) JS Stack trace: this.XPIDatabase.saveChanges@XPIProviderUtils.js:451:17 < XPIDB_updateActiveAddons@XPIProviderUtils.js:1398:9 < XPI_shutdown@XPIProvider.jsm:2639:7 < _startProvider/AMProviderShutdown/<@AddonManager.jsm:843:21 < Promise@Promise-backend.js:385:5 < _startProvider/AMProviderShutdown@AddonManager.jsm:841:16 < Barrier/this.client.addBlocker/promise</trigger@AsyncShutdown.jsm:686:23 < Barrier.prototype<._wait@AsyncShutdown.jsm:830:7 < Barrier.prototype<.wait@AsyncShutdown.jsm:814:28 < AddonManagerInternal.shutdownManager<@AddonManager.jsm:1233:15


Unhealthy log (with user.js, causes AsyncShutdown abort):
==============================================================
Browser.Experiments.Experiments	TRACE	ExperimentEntry #0::reconcileAddonState()
Browser.Experiments.Experiments	INFO	ExperimentEntry #0::Activating add-on: e10s-enabled-aurora-20151020@experiments.mozilla.org
Alice: how did you end up with extensions.checkCompatibility set to False? Do you use Add-on Compatibility Reporter in your main profile?

Info on the pref: http://kb.mozillazine.org/Extensions.checkCompatibility
Flags: needinfo?(alice0775)
Additionally, with this pref, my about:addons shows the warnings "Add-on comapatibility checking is disabled. You may have incompatible add-ons" and the Experiment is marked "Complete NaN days ago".

Alice: do you see this as well?
I think what's happening here is that this extensions.checkCompatibility pref prevents the addon from getting properly installed, so the Experiments skeleton never gets a callback from the Addon Manager that the addon was installed:

http://hg.mozilla.org/mozilla-central/annotate/28068d907290/browser/experiments/Experiments.jsm#l1953

The Experiments code then gets stuck waiting on this callback during shutdown:

http://hg.mozilla.org/mozilla-central/annotate/28068d907290/browser/experiments/Experiments.jsm#l467

Mossop: why do these prefs prevent an experiment from getting installed properly?

user_pref("extensions.checkCompatibility.43.0", false);
user_pref("extensions.checkCompatibility.43.0a", false);
Flags: needinfo?(dtownsend)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #9)
> Mossop: why do these prefs prevent an experiment from getting installed
> properly?

I'm not seeing any reason why they would assuming the XPI in use is the one in bug 1193089. Can you enable extensions.logging.enabled and reproduce as you did in comment 6 and get the log that shows up?
Flags: needinfo?(dtownsend) → needinfo?(vladan.bugzilla)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #7)
> Alice: how did you end up with extensions.checkCompatibility set to False?
> Do you use Add-on Compatibility Reporter in your main profile?
> 
> Info on the pref: http://kb.mozillazine.org/Extensions.checkCompatibility

I do not use Add-on Compatibility Reporter.

(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #8)
> Additionally, with this pref, my about:addons shows the warnings "Add-on
> comapatibility checking is disabled. You may have incompatible add-ons" and
> the Experiment is marked "Complete NaN days ago".
> 
> Alice: do you see this as well?

Yes, I can see the warning.
Flags: needinfo?(alice0775)
Mossop: See attached log. 

How to reproduce this log:

1. Create a new Aurora profile and set following prefs in about:config

extensions.logging.enabled = true
Create new bool pref "extensions.checkCompatibility.43.0" = false
Create new bool pref "extensions.checkCompatibility.43.0a" = false

2. Restart Firefox and wait for experiment to install, or just "tickle" experiments.manifest.uri by adding an extra slash somewhere (this will cause experiment to install immediately)

3. Open about:addons

Expected: Experiment installed successfully
Actual: Experiment looks like it's in an odd state in about:addons, Aurora hangs on shutdown hang because the experiment code forever waits for addon manager to notify it that the experiment addon was enabled
Flags: needinfo?(vladan.bugzilla)
Attachment #8679177 - Flags: feedback?(dtownsend)
Ok I see the issue.

When the checkCompatibility preference is true (the default) an add-on will only be compatible if the Firefox version or Toolkit version matches one of the ranges in the add-on's targetApplications listed in install.rdf.

When the checkCompatibility preference is false though we don't check the version is in the range, we just check that the add-on has a targetApplication entry for Firefox or Toolkit: https://hg.mozilla.org/mozilla-central/annotate/28068d907290d1f5138a0b9e59ae2233a1c1b7a3/toolkit/mozapps/extensions/internal/XPIProvider.jsm#l684.

I wouldn't have expected that to be an issue because the experiment in question has a Firefox targetApplication in its install.rdf however it turns out that when experiments are installed the add-ons manager throws away all of those: https://hg.mozilla.org/mozilla-central/annotate/28068d907290d1f5138a0b9e59ae2233a1c1b7a3/toolkit/mozapps/extensions/internal/XPIProvider.jsm#l1158

So this makes the add-ons manager think the experiment isn't compatible with the application when the pref is set to false and so when the experiments service tries to enable it it refuses.

We should patch isUsableAddon to always assume compatibility for experiments.
Attachment #8679177 - Flags: feedback?(dtownsend)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #4)
> Benjamin: have you seen this AsyncShutdown crash in experiments before?

I've seen similar, very low-volume, shutdown timeouts before but didn't have anything to go on.
Flags: needinfo?(benjamin)
(In reply to Georg Fritzsche [:gfritzsche] from comment #14)
> (In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #4)
> > Benjamin: have you seen this AsyncShutdown crash in experiments before?
> 
> I've seen similar, very low-volume, shutdown timeouts before but didn't have
> anything to go on.

(and the AddonManager async callback paths are one of the major pain points for the AsyncShutdown paths, we already found a few bugs before)
Some questions here Dave:

I'm trying to follow the addon manager code path for .userDisabled here:
https://hg.mozilla.org/mozilla-central/annotate/28068d907290d1f5138a0b9e59ae2233a1c1b7a3/toolkit/mozapps/extensions/internal/XPIProvider.jsm#l6886
Do we need to worry about the |!aAddon.inDatabase| path? If yes, is there a setter defined somewhere for addon.userDisabled that would trigger a callback?

Are there any similar paths like this one that don't call any callbacks:
https://hg.mozilla.org/mozilla-central/annotate/28068d907290d1f5138a0b9e59ae2233a1c1b7a3/toolkit/mozapps/extensions/internal/XPIProvider.jsm#l4461
Can we protect against those issues from Experiments.jsm in any way?

Also, from this part:
https://hg.mozilla.org/mozilla-central/annotate/28068d907290d1f5138a0b9e59ae2233a1c1b7a3/toolkit/mozapps/extensions/internal/XPIProvider.jsm#l4498
... i assume we need to implement more listeners to avoid waiting for callbacks infinitely?
What is the minimal required set here?
Flags: needinfo?(dtownsend)
Tracking for 43 and 44 since this is a recent regression. Hoping we can fix this by Friday since we go to build on Monday Nov. 2.
Priority: -- → P2
Whiteboard: [measurement:client]
(In reply to Georg Fritzsche [:gfritzsche] from comment #16)
> Some questions here Dave:
> 
> I'm trying to follow the addon manager code path for .userDisabled here:
> https://hg.mozilla.org/mozilla-central/annotate/
> 28068d907290d1f5138a0b9e59ae2233a1c1b7a3/toolkit/mozapps/extensions/internal/
> XPIProvider.jsm#l6886
> Do we need to worry about the |!aAddon.inDatabase| path? If yes, is there a
> setter defined somewhere for addon.userDisabled that would trigger a
> callback?

You're only dealing with installed experiments so no you don't need to worry about that case.

> Are there any similar paths like this one that don't call any callbacks:
> https://hg.mozilla.org/mozilla-central/annotate/
> 28068d907290d1f5138a0b9e59ae2233a1c1b7a3/toolkit/mozapps/extensions/internal/
> XPIProvider.jsm#l4461
> Can we protect against those issues from Experiments.jsm in any way?

The thing to do would be to check whether addon.permissions includes PERM_CAN_ENABLE before attempting to enable. In theory that should indicate whether setting userDisabled to false will work. We could also add a new notification for these cases.

> Also, from this part:
> https://hg.mozilla.org/mozilla-central/annotate/
> 28068d907290d1f5138a0b9e59ae2233a1c1b7a3/toolkit/mozapps/extensions/internal/
> XPIProvider.jsm#l4498
> ... i assume we need to implement more listeners to avoid waiting for
> callbacks infinitely?
> What is the minimal required set here?

https://developer.mozilla.org/en-US/Add-ons/Add-on_Manager/AddonListener documents all of the events we use.
Flags: needinfo?(dtownsend)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #17)
> Tracking for 43 and 44 since this is a recent regression. Hoping we can fix
> this by Friday since we go to build on Monday Nov. 2.

The bug is not a recent regression, but the bug was exposed on Aurora 43 as a result of an active Telemetry Experiment. Once the Telemetry Experiment ends on Friday, the spike in shutdown hang signatures will disappear from top-crash stats.
Mossop: can you fix the addon manager bug? We'll be re-running this e10s Telemetry Experiment on Beta pretty soon
Flags: needinfo?(dtownsend)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #20)
> Mossop: can you fix the addon manager bug? We'll be re-running this e10s
> Telemetry Experiment on Beta pretty soon

I'll fix it in bug 1220198
Flags: needinfo?(dtownsend)
Points: --- → 3
Priority: P2 → P1
Summary: crash at shutdown on DevEdition with test profile → AsyncShutdown timeouts in Experiments.jsm
Assignee: nobody → gfritzsche
(In reply to Dave Townsend [:mossop] from comment #18)
> > Are there any similar paths like this one that don't call any callbacks:
> > https://hg.mozilla.org/mozilla-central/annotate/
> > 28068d907290d1f5138a0b9e59ae2233a1c1b7a3/toolkit/mozapps/extensions/internal/
> > XPIProvider.jsm#l4461
> > Can we protect against those issues from Experiments.jsm in any way?
> 
> The thing to do would be to check whether addon.permissions includes
> PERM_CAN_ENABLE before attempting to enable. In theory that should indicate
> whether setting userDisabled to false will work. We could also add a new
> notification for these cases.

So, checking PERM_CAN_ENABLE doesn't help now because of this:
https://dxr.mozilla.org/mozilla-central/rev/9f69202d82752e093a653a8f15b0274e347db33a/toolkit/mozapps/extensions/internal/XPIProvider.jsm#6527

I think we will:
* need bug 1220198 to fix things on the trains
* have to do further follow-ups for handling this properly after bug 1220911
* ... unless there are other options to protect here?
Blocks: 1222894
No longer blocks: e10s-measurement
No longer blocks: 1222894
Summary: AsyncShutdown timeouts in Experiments.jsm → Hardening against AsyncShutdown timeouts in Experiments.jsm
FWIW: crashes pointing to this bug also occur on Intel G41 chipset using 8.15.10.2869 driver that suffers insufficient contiguous vmem (bug 1172351) after exiting FF because of that.
Bug 1174937 has STR for a potential timeout from the issues here:

(In reply to Kamil Jozwiak [:kjozwiak] from bug 1174937, comment #98)
> While investigating the shutdown issue with the experiment mentioned in
> comment # 75 & comment # 76, I ran into the following crash:
> 
> *
> https://crash-stats.mozilla.com/report/index/daeb70e0-0f48-4bee-bb99-
> c00702151110
> *
> https://crash-stats.mozilla.com/report/index/c23a7162-4c38-4698-9f13-
> d8ea52151110
> *
> https://crash-stats.mozilla.com/report/index/ab616dca-75ea-4cdb-87b4-
> 04ba32151110
> 
> STR (time sensitive... need to "command + q" at the right time during the
> installation process)
> 
> - install fx beta
> - change all the correct prefs to get the experiment installed
> - once the installation process for the experiment begins, quickly "command
> + q"
> 
> At this point, FX will become "non-responsive" and will end up crashing
> after about a minute.
No longer blocks: 1193089
Depends on: 1220911, 1220198
(In reply to Georg Fritzsche [:gfritzsche] from comment #22)
> (In reply to Dave Townsend [:mossop] from comment #18)
> > > Are there any similar paths like this one that don't call any callbacks:
> > > https://hg.mozilla.org/mozilla-central/annotate/
> > > 28068d907290d1f5138a0b9e59ae2233a1c1b7a3/toolkit/mozapps/extensions/internal/
> > > XPIProvider.jsm#l4461
> > > Can we protect against those issues from Experiments.jsm in any way?
> > 
> > The thing to do would be to check whether addon.permissions includes
> > PERM_CAN_ENABLE before attempting to enable. In theory that should indicate
> > whether setting userDisabled to false will work. We could also add a new
> > notification for these cases.
> 
> So, checking PERM_CAN_ENABLE doesn't help now because of this:
> https://dxr.mozilla.org/mozilla-central/rev/
> 9f69202d82752e093a653a8f15b0274e347db33a/toolkit/mozapps/extensions/internal/
> XPIProvider.jsm#6527

Checking PERM_CAN_ENABLE should behave correctly now that bug 1220911 is fixed.
(In reply to Georg Fritzsche [:gfritzsche] from comment #23)
> Created attachment 8682481 [details] [diff] [review]
> AsyncShutdown timeouts in Experiments.jsm
> 
> WIP

Georg, this tracked bug is associated with a crash on DevEd44 with signature [mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort]. Is there a patch that we might want to uplift to Aurora44 to address this? Currently the associated crash is at #20 and usually we try to chase the top 25 crashes.
Flags: needinfo?(gfritzsche)
This crash signature is still showing up  significantly in 43.0b8 and beta 9. 
 2159 crashes on release in the last week, and a few hundred in the later 43 betas. 

We are about to go to build for the release candidate for 43, tomorrow. If there is a simple fix for this we could potentially still take this in an RC2 build later in the week.   If not, wontfix for 43.
Sorry, i'm a bit swamped currently.
There is no trivial fix here, this will need some investigation first.

I think this should currently only show up significantly on 43 beta because of active experiments (and the e10s experiment using all or most of the beta population).
On release this should not be a problem (few or only small-scale experiments run there AFAIK).
Flags: needinfo?(gfritzsche)
Comment on attachment 8682481 [details] [diff] [review]
AsyncShutdown timeouts in Experiments.jsm

Review of attachment 8682481 [details] [diff] [review]:
-----------------------------------------------------------------

This passes now after bug 1220911 was fixed.
Dave, does this look sufficient to you?
Attachment #8682481 - Flags: review?(dtownsend)
Comment on attachment 8682481 [details] [diff] [review]
AsyncShutdown timeouts in Experiments.jsm

Review of attachment 8682481 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/experiments/Experiments.jsm
@@ +1942,5 @@
>  
> +    // Check permissions to see if we can enable the addon.
> +    if (!(addon.permissions & AddonManager.PERM_CAN_ENABLE)) {
> +      throw new Error("Don't have permission to enable addon " + addon.id + ", perm=" + addon.permission);
> +    }

Lets also check addon.operationsRequiringRestart against AddonManager.OP_NEEDS_RESTART_ENABLE. In theory experiments should never need a restart to enable but might as well check to be sure.

@@ +1967,5 @@
> +
> +        AddonManager.removeAddonListener(listener);
> +        deferred.reject("Failed to enable addon " + addon.id + " due to: " + handler);
> +      };
> +    }

I don't know if this block is necessary, I can't see any case where we'd see one of these events without seeing onEnabled first except for the case above. That said there is no harm leaving it in if you want to be extra cautious.
Attachment #8682481 - Flags: review?(dtownsend) → review+
Attachment #8700080 - Flags: review+
Keywords: checkin-needed
OS: Windows 7 → All
https://hg.mozilla.org/mozilla-central/rev/d599eda7ee9f
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 46
Georg, Katie: Should we uplift this patch to Beta44? If yes, please nominate for uplift.
Flags: needinfo?(kparlante)
Attachment #8682481 - Attachment is obsolete: true
Comment on attachment 8700080 [details] [diff] [review]
Harden against AsyncShutdown timeouts in Experiments.jsm

Approval Request Comment
[Feature/regressing bug #]: Telemetry Experiments.
[User impact if declined]: Higher potential for AsyncShutdownTimeouts with Telemetry Experiments.
[Describe test coverage new/current, TreeHerder]: Automation coverage, baked on m-c for a bit.
[Risks and why]: Low-risk, this should just add handling for unexpected failure scenarios.
[String/UUID change made/needed]: None.
Flags: needinfo?(kparlante)
Attachment #8700080 - Flags: approval-mozilla-aurora?
Comment on attachment 8700080 [details] [diff] [review]
Harden against AsyncShutdown timeouts in Experiments.jsm

Taking it to aurora. Don't know if we want to that beta (AFAIK, we don't do an experiment on this channel currently).
Attachment #8700080 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
>> Taking it to aurora. Don't know if we want to that beta
>> (AFAIK, we don't do an experiment on this channel currently).

There was the "multi-process firefox A/B" experiment (Bug # 1229104) that ran on the beta channel which actually ended today. I believe the "unified urlbar-searchbar" experiment (Bug # 1219505) which should start sometime this week (I believe it's Wed Jan 06) will also be running under the beta channel.
Comment on attachment 8700080 [details] [diff] [review]
Harden against AsyncShutdown timeouts in Experiments.jsm

There should be no risks with this patch and this may help with ongoing experiments, so requesting beta approval too.

Approval Request Comment
[Feature/regressing bug #]: Telemetry Experiments.
[User impact if declined]: Higher potential for AsyncShutdownTimeouts with Telemetry Experiments.
[Describe test coverage new/current, TreeHerder]: Automation coverage, baked on m-c for a bit.
[Risks and why]: Low-risk, this should just add handling for unexpected failure scenarios.
[String/UUID change made/needed]: None.
Attachment #8700080 - Flags: approval-mozilla-beta?
Comment on attachment 8700080 [details] [diff] [review]
Harden against AsyncShutdown timeouts in Experiments.jsm

The patch seems small'ish and contained, Beta44+
Attachment #8700080 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Sorry, that would be due to this depending on bug 1220911, which is on 45 and is probably not something we want to uplift to beta.
Flags: needinfo?(gfritzsche)
Can we get a patch to work around not having 1220911, or clear the beta approval on the current patch?
Flags: needinfo?(gfritzsche)
Without 1220911 this patch is not useful, clearing the approval.
Flags: needinfo?(gfritzsche)
Attachment #8700080 - Flags: approval-mozilla-beta+
Marking it wontfix based on comment 47
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in before you can comment on or make changes to this bug.