Closed Bug 1012924 Opened 6 years ago Closed 5 years ago

Intermittent ERROR: AsyncShutdown timeout in profile-before-change Conditions: [{"name":"Experiments.jsm shutdown","state":"(none)","filename":"resource://app/modules/experiments/Experiments.jsm","lineNumber":408},...

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect)

x86_64
Linux
defect
Not set

Tracking

(firefox30- wontfix, firefox31+ wontfix, firefox32+ wontfix, firefox33+ fixed, firefox34 fixed, firefox-esr24 unaffected, firefox-esr31 wontfix)

RESOLVED FIXED
Firefox 34
Tracking Status
firefox30 - wontfix
firefox31 + wontfix
firefox32 + wontfix
firefox33 + fixed
firefox34 --- fixed
firefox-esr24 --- unaffected
firefox-esr31 --- wontfix

People

(Reporter: KWierso, Assigned: benjamin)

References

Details

(Keywords: assertion, intermittent-failure, topcrash-win)

Crash Data

Attachments

(6 files, 3 obsolete files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=39970804&tree=Fx-Team
Ubuntu VM 12.04 x64 fx-team pgo test mochitest-browser-chrome-3 on 2014-05-19 14:23:23 PDT for push 324ce72cc5e3

slave: tst-linux64-spot-792



14:33:59     INFO -  INFO TEST-START | Shutdown
14:33:59     INFO -  Browser Chrome Test Summary
14:33:59     INFO -  	Passed: 13940
14:33:59     INFO -  	Failed: 0
14:33:59     INFO -  	Todo: 30
14:33:59     INFO -  *** End BrowserChrome Test Results ***
14:33:59     INFO -  1400535239765	Browser.Experiments.Experiments	TRACE	PreviousExperimentProvider #4::shutdown()
14:33:59     INFO -  1400535239769	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: started
14:33:59     INFO -  1400535239770	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: finished with _loadTask
14:33:59     INFO -  1400535239770	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: no previous shutdown
14:33:59     INFO -  1400535239771	Browser.Experiments.Experiments	TRACE	Experiments #0::Unregistering instance with Addon Manager.
14:33:59     INFO -  1400535239772	Browser.Experiments.Experiments	TRACE	Experiments #0::Unregistering previous experiment add-on provider.
14:33:59     INFO -  1400535239773	Browser.Experiments.Experiments	TRACE	PreviousExperimentProvider #4::shutdown()
14:33:59     INFO -  1400535239774	addons.manager	ERROR	Exception calling provider shutdown: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIObserverService.removeObserver]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: resource://app/modules/experiments/Experiments.jsm :: this.Experiments.PreviousExperimentProvider.prototype<.shutdown :: line 2091"  data: no] Stack trace: this.Experiments.PreviousExperimentProvider.prototype<.shutdown()@resource://app/modules/experiments/Experiments.jsm:2091 < callProvider()@resource://gre/modules/AddonManager.jsm:194 < AMI_unregisterProvider()@resource://gre/modules/AddonManager.jsm:849 < AMP_unregisterProvider()@resource://gre/modules/AddonManager.jsm:2322 < Experiments.Experiments.prototype._unregisterWithAddonManager()@resource://app/modules/experiments/Experiments.jsm:500 < Experiments.Experiments.prototype.uninit<()@resource://app/modules/experiments/Experiments.jsm:446 < TaskImpl_run()@resource://gre/modules/Task.jsm:282 < TaskImpl_handleResultValue()@resource://gre/modules/Task.jsm:338 < TaskImpl_run()@resource://gre/modules/Task.jsm:290 < TaskImpl()@resource://gre/modules/Task.jsm:247 < createAsyncFunction/asyncFunction()@resource://gre/modules/Task.jsm:224 < Barrier.prototype<._wait()@resource://gre/modules/AsyncShutdown.jsm:552 < Barrier.prototype<.wait()@resource://gre/modules/AsyncShutdown.jsm:518 < Spinner.prototype.observe()@resource://gre/modules/AsyncShutdown.jsm:337 < <file:unknown>
14:33:59     INFO -  1400535239777	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: waiting on _mainTask
14:33:59     INFO -  1400535239779	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: started
14:33:59     INFO -  1400535239780	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: finished with _loadTask
14:33:59     INFO -  1400535239781	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: waiting on _mainTask
14:33:59     INFO -  1400535239782	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: started
14:33:59     INFO -  1400535239782	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: finished with _loadTask
14:33:59     INFO -  1400535239783	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: waiting on _mainTask
14:33:59     INFO -  1400535239784	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: started
14:33:59     INFO -  1400535239785	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: finished with _loadTask
14:33:59     INFO -  1400535239785	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: waiting on _mainTask
14:33:59     INFO -  1400535239786	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: started
14:33:59     INFO -  1400535239787	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: finished with _loadTask
14:33:59     INFO -  1400535239788	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: waiting on _mainTask
14:34:09     INFO -  WARNING: At least one completion condition is taking too long to complete. Conditions: Error getting state: TypeError: state is not a function at safeGetState@resource://gre/modules/AsyncShutdown.jsm:118:5
14:34:09     INFO -  Barrier.prototype<._wait/<@resource://gre/modules/AsyncShutdown.jsm:634:1
14:34:09     INFO -  Spinner.prototype.observe@resource://gre/modules/AsyncShutdown.jsm:344:7
14:34:09     INFO -   Barrier: profile-before-change
14:35:00     INFO -  ERROR: At least one completion condition failed to complete within a reasonable amount of time. Causing a crash to ensure that we do not leave the user with an unresponsive process draining resources. Conditions: [{"name":"Experiments.jsm shutdown","state":"(none)"},{"name":"Experiments.jsm shutdown","state":"(none)"},{"name":"Experiments.jsm shutdown","state":"(none)"},{"name":"Experiments.jsm shutdown","state":"(none)"},{"name":"Experiments.jsm shutdown","state":"(none)"}] Barrier: profile-before-change
14:35:00     INFO -  [Parent 2416] ###!!! ABORT: file resource://gre/modules/AsyncShutdown.jsm, line 691
14:35:00     INFO -  [Parent 2416] ###!!! ABORT: file resource://gre/modules/AsyncShutdown.jsm, line 691
14:35:01     INFO -  [2995] ###!!! ABORT: Aborting on channel error.: file /builds/slave/fx-team-l64-pgo-00000000000000/build/ipc/glue/MessageChannel.cpp, line 1531
14:35:01     INFO -  [2995] ###!!! ABORT: Aborting on channel error.: file /builds/slave/fx-team-l64-pgo-00000000000000/build/ipc/glue/MessageChannel.cpp, line 1531
14:35:01     INFO -  TEST-INFO | Main app process: killed by SIGSEGV
14:35:01  WARNING -  TEST-UNEXPECTED-FAIL | Shutdown | application terminated with exit code 11
14:35:01     INFO -  INFO | runtests.py | Application ran for: 0:09:44.013159
14:35:01     INFO -  INFO | zombiecheck | Reading PID log: /tmp/tmpwFnDfqpidlog
14:35:01     INFO -  ==> process 2416 launched child process 2473
14:35:01     INFO -  ==> process 2416 launched child process 2477
14:35:01     INFO -  ==> process 2416 launched child process 2501
14:35:01     INFO -  ==> process 2416 launched child process 2516
14:35:01     INFO -  ==> process 2416 launched child process 2531
14:35:01     INFO -  ==> process 2416 launched child process 2546
14:35:01     INFO -  ==> process 2416 launched child process 2561
14:35:01     INFO -  ==> process 2416 launched child process 2576
14:35:01     INFO -  ==> process 2416 launched child process 2591
14:35:01     INFO -  ==> process 2416 launched child process 2606
14:35:01     INFO -  ==> process 2416 launched child process 2621
14:35:01     INFO -  ==> process 2416 launched child process 2636
14:35:01     INFO -  ==> process 2416 launched child process 2651
14:35:01     INFO -  ==> process 2416 launched child process 2666
14:35:01     INFO -  ==> process 2416 launched child process 2682
14:35:01     INFO -  ==> process 2416 launched child process 2697
14:35:01     INFO -  ==> process 2416 launched child process 2712
14:35:01     INFO -  ==> process 2416 launched child process 2727
14:35:01     INFO -  ==> process 2416 launched child process 2742
14:35:01     INFO -  ==> process 2416 launched child process 2757
14:35:01     INFO -  ==> process 2416 launched child process 2772
14:35:01     INFO -  ==> process 2416 launched child process 2787
14:35:01     INFO -  ==> process 2416 launched child process 2803
14:35:01     INFO -  ==> process 2416 launched child process 2818
14:35:01     INFO -  ==> process 2416 launched child process 2835
14:35:01     INFO -  ==> process 2416 launched child process 2851
14:35:01     INFO -  ==> process 2416 launched child process 2866
14:35:01     INFO -  ==> process 2416 launched child process 2882
14:35:01     INFO -  ==> process 2416 launched child process 2898
14:35:01     INFO -  ==> process 2416 launched child process 2916
14:35:01     INFO -  ==> process 2416 launched child process 2933
14:35:01     INFO -  ==> process 2416 launched child process 2949
14:35:01     INFO -  ==> process 2416 launched child process 2995
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2473
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2477
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2501
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2516
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2531
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2546
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2561
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2576
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2591
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2606
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2621
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2636
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2651
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2666
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2682
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2697
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2712
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2727
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2742
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2757
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2772
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2787
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2803
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2818
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2835
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2851
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2866
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2882
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2898
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2916
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2933
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2949
14:35:01     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2995
14:35:01     INFO -  mozcrash INFO | Downloading symbols from: https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/fx-team-linux64-pgo/1400499006/firefox-32.0a1.en-US.linux-x86_64.crashreporter-symbols.zip
14:35:11  WARNING -  PROCESS-CRASH | Shutdown | application crashed [@ mozalloc_abort(char const*)]
14:35:11     INFO -  Crash dump filename: /tmp/tmpoNUJ8o/minidumps/6c2a4496-de73-2bba-764c476e-5a964095.dmp
14:35:11     INFO -  Operating system: Linux
14:35:11     INFO -                    0.0.0 Linux 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64
14:35:11     INFO -  CPU: amd64
14:35:11     INFO -       family 6 model 45 stepping 7
14:35:11     INFO -       1 CPU
14:35:11     INFO -  Crash reason:  SIGSEGV
14:35:11     INFO -  Crash address: 0x0
14:35:11     INFO -  Thread 0 (crashed)
14:35:11     INFO -   0  libmozalloc.so!mozalloc_abort(char const*) [mozalloc_abort.cpp:324ce72cc5e3 : 30 + 0x0]
14:35:11     INFO -      rbx = 0x00007f2475819828   r12 = 0x0000000000000003
14:35:11     INFO -      r13 = 0x00007f2475819828   r14 = 0x0000000000000000
14:35:11     INFO -      r15 = 0x0000000000000000   rip = 0x00007f2473e78530
14:35:11     INFO -      rsp = 0x00007fffdd519e40   rbp = 0x00007fffdd519e50
14:35:11     INFO -      Found by: given as instruction pointer in context
14:35:11     INFO -   1  libxul.so!NS_DebugBreak [nsDebugImpl.cpp:324ce72cc5e3 : 435 + 0x7]
14:35:11     INFO -      rbx = 0x00007fffdd519ea0   r12 = 0x0000000000000003
14:35:11     INFO -      r13 = 0x00007f2475819828   r14 = 0x0000000000000000
14:35:11     INFO -      r15 = 0x0000000000000000   rip = 0x00007f2470657468
14:35:11     INFO -      rsp = 0x00007fffdd519e60   rbp = 0x00007fffdd51a2c0
14:35:11     INFO -      Found by: call frame info
14:35:11     INFO -   2  libxul.so!nsDebugImpl::Abort(char const*, int) [nsDebugImpl.cpp:324ce72cc5e3 : 118 + 0x4]
14:35:11     INFO -      rbx = 0x00007f2471fb4618   r12 = 0x00007f2470693b58
14:35:11     INFO -      r13 = 0x00007f2475358360   r14 = 0x00007fffdd51a5b0
14:35:11     INFO -      r15 = 0x0000000000000002   rip = 0x00007f2470657648
14:35:11     INFO -      rsp = 0x00007fffdd51a2d0   rbp = 0x00007fffdd51a2d0
14:35:11     INFO -      Found by: call frame info
14:35:11     INFO -   3  libxul.so!NS_InvokeByIndex [xptcinvoke_x86_64_unix.cpp:324ce72cc5e3 : 164 + 0x38]
14:35:11     INFO -      rbx = 0x00007f2471fb4618   r12 = 0x00007f2470693b58
14:35:11     INFO -      r13 = 0x00007f2475358360   r14 = 0x00007fffdd51a5b0
14:35:11     INFO -      r15 = 0x0000000000000002   rip = 0x00007f2470693a60
14:35:11     INFO -      rsp = 0x00007fffdd51a2e0   rbp = 0x00007fffdd51a370
14:35:11     INFO -      Found by: call frame info
14:35:11     INFO -   4  libxul.so!XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) [XPCWrappedNative.cpp:324ce72cc5e3 : 2397 + 0xd]
14:35:11     INFO -      rbx = 0x0000000000000002   r12 = 0x00007fffdd51b180
14:35:11     INFO -      r13 = 0x00007f2475358360   r14 = 0x00007fffdd51a5b0
14:35:11     INFO -      r15 = 0x0000000000000002   rip = 0x00007f247041c4f5
14:35:11     INFO -      rsp = 0x00007fffdd51a380   rbp = 0x00007fffdd51a740
14:35:11     INFO -      Found by: call frame info
Flags: needinfo?(dteller)
« 14:34:09     INFO -  WARNING: At least one completion condition is taking too long to complete. Conditions: Error getting state: TypeError: state is not a function at safeGetState@resource://gre/modules/AsyncShutdown.jsm:118:5 » – filed as bug 1014473, but that's not the cause of the issue.
Flags: needinfo?(dteller)
Depends on: 1014473
(Tweaking summary to avoid mis-stars)
Summary: Intermittent Shutdown | application crashed [@ mozalloc_abort(char const*)] | after ABORT: file AsyncShutdown.jsm, line 691 and ABORT: Aborting on channel error.: file MessageChannel.cpp, line 15 → Intermittent Shutdown "ABORT: file resource://gre/modules/AsyncShutdown.jsm, line 691" and more [@ mozalloc_abort(char const*)]
Tweaking further.
Summary: Intermittent Shutdown "ABORT: file resource://gre/modules/AsyncShutdown.jsm, line 691" and more [@ mozalloc_abort(char const*)] → Intermittent AsyncShutdown crash Conditions: [{"name":"Experiments.jsm shutdown","state":"(none)"}
(In reply to David Rajchenbach Teller [:Yoric] from comment #5)
> Tweaking further.

This change means that TBPL won't match against this bug; was this intentional?
Flags: needinfo?(dteller)
(In reply to Ed Morley [:edmorley UTC+0] from comment #6)
> (In reply to David Rajchenbach Teller [:Yoric] from comment #5)
> > Tweaking further.
> 
> This change means that TBPL won't match against this bug; was this
> intentional?

I'm not sure about best practice.

The problem I'm trying to solve is that all AsyncShutdown crashes – which are caused by misbehaving clients – have that same line « Intermittent Shutdown "ABORT: file resource://gre/modules/AsyncShutdown.jsm, line 691" and more [@ mozalloc_abort(char const*)] ». Putting them all in the same bug wastes my time, as has happened so far, as I need to look at the log (the name of the clients appears in the lines immediately above) and redispatch.

I realize that my tweak is not optimal, so if you have a better idea, I'll be glad to adopt it.
Flags: needinfo?(dteller)
Component: Async Tooling → Client: Desktop
Product: Toolkit → Firefox Health Report
(In reply to David Rajchenbach Teller [:Yoric] from comment #7)
> I realize that my tweak is not optimal, so if you have a better idea, I'll
> be glad to adopt it.

Ah in which case we need to make the failure output TBPL parsable errors, so we can file bugs with summaries that are specific to each case. At present the log output isn't conducive to that, so dumping ground bugs are inevitable. I've filed bug 1018895 for this :-)
https://tbpl.mozilla.org/php/getParsedLog.php?id=42361239&tree=B2g-Inbound

(Tweaking summary so once bug 1018910 is fixed, this will be TBPL starrable).
Keywords: crashassertion
Summary: Intermittent AsyncShutdown crash Conditions: [{"name":"Experiments.jsm shutdown","state":"(none)"} → Intermittent ERROR: AsyncShutdown timeout in profile-before-change Conditions: [{"name":"Experiments.jsm shutdown","state":"(none)","filename":"resource://app/modules/experiments/Experiments.jsm","lineNumber":408},...
We're seeing instances of this crash in the wild:
 aa8dd210-18b2-49ad-b8fd-9856d2140627
 65f9489b-c086-493a-812e-40b942140627
 550003dd-8f82-49db-bc6a-b99d02140628
 af9dad9b-a546-41c2-b701-2b3032140627

See bug 944873 comment 51 for more details.
Note, you can see this is being reported quite frequently:

1. Load https://crash-stats.mozilla.com/report/list?signature=mozalloc_abort%28char%20const*%20const%29%20|%20NS_DebugBreak%20|%20nsDebugImpl%3A%3AAbort%28char%20const*%2C%20int%29#tab-reports
2. Click "Change columns in this table" or scroll to the bottom of the page
3. Select "AsyncShutdownTimeout" from the "Available columns" list 
4. Click ">>" to add it to the list of "Chosen columns"
5. Click "Save and Reload"
6. Find in page "Experiments.jsm shutdown" and you'll see a lot of these crashes.

I don't know a good way to search crashstats to show only these crashes but based on what I'm seeing with Find In Page, I would guess it's in the thousands of crashes over the last week.
I wonder if those could be an instance of bug 1012466 (with the order being the other way around).
Yoric, any ideas on additional details about what could happen here?
Flags: needinfo?(dteller)
I don't know that code enough to be sure.
However, as I mentioned somewhere, if you wish to have more details about *when* the crash takes place, you should take advantage of the last argument of `addBlocker` to attach some state information upon crash.
Flags: needinfo?(dteller)
Flags: firefox-backlog+
On a sample of 900+ crashes from the last 7 days, that's 93% of the crashes in Firefox Beta 8. Not good.
Flags: needinfo?(benjamin)
Huge top crash. Tracking
Comment #5: you meant beta 7 ;)

Can someone prepare a patch to disable Experiments.jsm in beta 8? Go to build today.
Keywords: topcrashtopcrash-win
This should give us a better visibility regarding at which step the shutdown freezes.
Attachment #8451596 - Flags: review?(georg.fritzsche)
n-i to make sure that Georg sees it
Flags: needinfo?(georg.fritzsche)
Patch to flip the pref to disable experiments in case we need it.
However we probably can't disable it on Beta now because we have important experiments running.
Flags: needinfo?(georg.fritzsche)
I will leave Benjamin make the call on this but it is adding some noise to the beta cycle.

By the way, why don't you rely on EARLY_BETA_OR_EARLIER to have experiments running until beta 5?
Comment on attachment 8451596 [details] [diff] [review]
Instrumenting the shutdown of Experiments.jsm

Review of attachment 8451596 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me with the comments addressed.

::: browser/experiments/Experiments.jsm
@@ +357,5 @@
> +
> +  // At the time of this writing, Experiments.jsm has severe
> +  // crashes. For forensics purposes, keep the last few log
> +  // messages in memory and upload them in case of crash.
> +  this._forensicsLogs = [,,];

|new Array(3)|? This pattern seems counter-intuitive and gives me an array of length 2.

@@ +422,5 @@
>      gPrefsTelemetry.observe(PREF_TELEMETRY_ENABLED, this._telemetryStatusChanged, this);
>  
>      AsyncShutdown.profileBeforeChange.addBlocker("Experiments.jsm shutdown",
> +      this.uninit.bind(this),
> +      () => {

Can we move this into a separate method?
Attachment #8451596 - Flags: review?(georg.fritzsche) → review+
EARLY_BETA_ doesn't make much sense for this: we'd be flipping users in and out of experiments based on our place in the cycle, which would invalidate pretty much all experimental results.
Flags: needinfo?(benjamin)
ISTM that this might be a duplicate or caused by bug 1012466, which Georg is actively working on.
Comment on attachment 8451645 [details] [diff] [review]
Instrumenting the shutdown of Experiments.jsm, v2

I believe it would be useful to uplift this to Aurora to get more crash data.

Approval Request Comment
[Feature/regressing bug #]: Experiments
[User impact if declined]: This patch adds crash-time logging. I expect it will help us fix the current bug.
[Describe test coverage new/current, TBPL]: Just landed on m-c. Passes TBPL.
[Risks and why]: None.
[String/UUID change made/needed]: None.
Attachment #8451645 - Flags: approval-mozilla-aurora?
I reproduced bug 1012466 (removeObserver exception) when running tests against Mulet.
Adding try catch around two throwing removeObserver allowed me to stop the abort/crash during tests shutdown.
I have no idea if that's what are facing users.
Wontfix in 31 because it won't be activated in the release.
Comment on attachment 8451645 [details] [diff] [review]
Instrumenting the shutdown of Experiments.jsm, v2

Taking the investigation in 32. Thanks Sherlock
Attachment #8451645 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Per bug 1012466, comment 52, this appears fixed in 33. That bug was uplifted to Aurora, so let's check back on this on 32 in 3 days or so.
Depends on: 1012466
Depends on: 1040761
Attachment #8451614 - Attachment description: Disable experiments → Disable experiments [DO NOT LAND YET]
Ah, obviously, we forgot to land this on Nightly.
Keywords: checkin-needed
Apparently, the source of Experiments.jsm has changed too much.
Keywords: checkin-needed
The AsyncShutdown signature is now #1 in 32 beta with ~20% of all crashes, see https://crash-stats.mozilla.com/topcrasher/products/Firefox/versions/32.0b?days=7
Without looking into more details, I'm pretty sure it's this one. Can we please get this fixed on beta soon?
Depends on: 1046750
I've dug into the data i got from bug 1046750 and apparently there are still Experiments.jsm issues not resolved by bug 1012466 and Experiments.jsm is leading the timeout list by far.

Interesting is the changes so far from 32.0b1 to 32.0b2 - i wonder if something is skewed or if we really landed fixes for the timeouts inbetween.
Depends on: 1046910
(In reply to Robert Kaiser (:kairo@mozilla.com, vacation until Aug 10) from comment #33)
> The AsyncShutdown signature is now #1 in 32 beta with ~20% of all crashes,
> see
> https://crash-stats.mozilla.com/topcrasher/products/Firefox/versions/32.
> 0b?days=7
> Without looking into more details, I'm pretty sure it's this one. Can we
> please get this fixed on beta soon?

It's #5 on 32.0b2 BTW - is it too early to rely on the numbers for that?
Flags: needinfo?(kairo)
(KaiRo is away)

The beta2 stats ought to be reliable by this point. Looking at the numbers, it seems this signature has indeed gone down in volume.
Flags: needinfo?(kairo)
This seems to have stalled. Georg, are you still working on this?
Flags: needinfo?(georg.fritzsche)
I'm not actively on it right now - this wasn't #1 anymore and won't affect release as we don't have experiments turned on there.
Flags: needinfo?(georg.fritzsche)
FWIW, the data i pulled pointed out only one obvious low-occurence bug.
For further steps i would have to spend some time to get something out of the data or think of what else could help.
It's not #1, that's true, but it's #2 in the newest beta: https://crash-stats.mozilla.com/topcrasher/products/Firefox/versions/32.0b6
Recent reports from Aurora that contain the instrumentation patch:

https://crash-stats.mozilla.com/report/index/622e99ec-0287-4ef5-8168-e60742140813
 	{"phase":"AddonManager: Waiting for clients to shut down.","conditions":[{"name":"Experiments.jsm shutdown","state":{"isShutdown":true,"isEnabled":true,"isRefresh":true,"isDirty":false,"isFirstEvaluate":true,"hasLoadTask":false,"hasMainTask":true,"hasTimer":false,"hasAddonProvider":false,"latestLogs":[null,null,null],"experiments":{},"terminateReason":null},"filename":"resource://app/modules/experiments/Experiments.jsm","lineNumber":427}]}

It appears that the latestLogs instrumentation isn't working. There certainly should be recent trace entries in the log.
Benjamin - My understanding is that experiments will run for the entire beta cycle. Do we need to manually disable experiments before release? If not, do you have a way to confirm (perhaps this has already been done) that the experiments code, if not the experiments themselves, will not continue to be active in release resulting in the crash reported here?
Flags: needinfo?(benjamin)
(In reply to Lawrence Mandel [:lmandel] from comment #44)
> Benjamin - My understanding is that experiments will run for the entire beta
> cycle. Do we need to manually disable experiments before release? If not, do
> you have a way to confirm (perhaps this has already been done) that the
> experiments code, if not the experiments themselves, will not continue to be
> active in release resulting in the crash reported here?

Sylvestre confirmed that experiments are based on the channel and do not need to be manually disabled. He also confirmed that this bug was not present in 31 once it hit release.

If comment 30 is correct and this has been fixed in 33, I think we can resolve this bug.
Flags: needinfo?(benjamin)
The bugfix from comment 30 (bug 1012466) is in Firefox 32.0b5 and we're still seeing the issue. I don't think this is resolved. But I also don't think there's any safe change that we'd take in 32 for this anyway, given that we don't know what's going wrong.
Depends on: 1053836
Getting some logs from bug 1012466 now, see https://crash-stats.mozilla.com/report/index/5e9e1768-34ca-4267-bb4b-c99792140818 for an example. The logs aren't long enough, though:

"latestLogs":["10: Finished unregistering with addon manager.","10: Unregistering previous experiment add-on provider.","10: uninit: waiting on _mainTask"]

That's only the spew from unint() itself, when what we really want is to have the latest log message from _mainTask. Going to attach a patch to increase the log length.
Attachment #8476014 - Flags: review?(georg.fritzsche)
Attachment #8476014 - Flags: review?(georg.fritzsche) → review+
From https://crash-stats.mozilla.com/report/index/ebb36fb4-a0dc-4059-b894-69a372140821

"latestLogs":["10: httpGetRequest(https://telemetry-experiment.cdn.mozilla.net/manifest/v1/firefox/34.0a1/nightly)","10: uninit: started","10: uninit: finished with _loadTask","10: uninit: no previous shutdown","10: Unregistering instance with Addon Manager.","10: Removing install listener from add-on manager.","10: Removing addon listener from add-on manager.","10: Finished unregistering with addon manager.","10: Unregistering previous experiment add-on provider.","10: uninit: waiting on _mainTask"]

So the error here is that we're blocking Firefox shutdown on a network load. That's obviously incorrect; we designed this under the assumption that the network was going to be shut down by the time we hit experiment shutdown, but either the original assumption was incorrect or the refactoring to shut down via the addon manager instead of via profile-before-change.

Yoric/georg/irving, do you know: a) whether the network shuts down before AsyncShutdown.profileBeforeChange b) whether the network is supposed to shut down before AddonManager.shutdown ?

I think part of this is fixable by explicitly cancelling the pending network request from uninit(), but I don't know whether that will catch all the cases; in particular, we also have network activity from AddonManager.getInstallForURL which could have a similar network-effect, unless the addon manager explicitly cancels pending installs at shutdown.

Also, does XHR have an implicit reasonable timeout, or do we need to add one?
Flags: needinfo?(irving)
Flags: needinfo?(georg.fritzsche)
Flags: needinfo?(dteller)
Yay, some results here. Was this already standing out or should we bucket/group the new reports again to get an idea of the leading causes?

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #51)
> Also, does XHR have an implicit reasonable timeout, or do we need to add one?

Per the experiences with the wrong/missing test dummy URL in bug 1042161 we can't rely on XHR timing out reasonably.
Checking it now, both MDN and nsIXMLHttpRequest.idl agree on XMLHttpRequest.timeout:
> The number of milliseconds a request can take before automatically being terminated. A value of 0 (which is the default) means there is no timeout.

No timeout at all is weird and unexpected, but that means we could have problems with that in at least Experiments.jsm & GMPInstallManager.jsm.
Unless we are missing something here that would lead to proper request cancellation? Irving, Yoric?
Flags: needinfo?(georg.fritzsche)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #51)
> So the error here is that we're blocking Firefox shutdown on a network load.
> That's obviously incorrect; we designed this under the assumption that the
> network was going to be shut down by the time we hit experiment shutdown,
> but either the original assumption was incorrect or the refactoring to shut
> down via the addon manager instead of via profile-before-change.

The add-on manager also shuts down in profile-before-change, so you shouldn't see a difference in which of your dependencies are still available (unless you're racing against them also shutting down in p-b-c).

> Yoric/georg/irving, do you know: a) whether the network shuts down before
> AsyncShutdown.profileBeforeChange b) whether the network is supposed to shut
> down before AddonManager.shutdown ?

According to https://developer.mozilla.org/en-US/docs/Observer_Notifications, networking should shut down before profile-before-change, but I'm not sure how that would affect in-progress XHR requests.

Also, because the toolkit/mozapps/extensions (AddonManager et. al.) xpcshell test suite simulates shutdown/restart without going through all the notifications, tests can't rely on the network shutting down first (though this isn't your problem here).

> I think part of this is fixable by explicitly cancelling the pending network
> request from uninit(), but I don't know whether that will catch all the cases;
> in particular, we also have network activity from AddonManager.getInstallForURL
> which could have a similar network-effect, unless the addon manager explicitly
> cancels pending installs at shutdown.

XPIProvider tries to keep track of pending operations and cancel at shutdown. Any cases where it doesn't I'd consider to be a bug (see XPIProvider.doing(), XPIProvider.done(), etc).

> Also, does XHR have an implicit reasonable timeout, or do we need to add one?

XHR defaults to OS-level timeouts (DNS lookups, tcp connections). If you have an established connection and are waiting for the other side to send, this could be arbitrarily long.
Flags: needinfo?(irving)
ok, I'm going to use this bug to deal with both cancelling the XHR on shutdown and adding a reasonable timeout.

I'm going to file a separate bug on making sure that if we're installing an experiment during shutdown, that it both doesn't hang and tries again rather than permanently failing.
Assignee: nobody → benjamin
Points: --- → 5
Flags: needinfo?(dteller)
We have encountered problems with XHR not terminating during shutdown. This was with Worker XHR, not main thread XHR, but I believe it's worth investigating.
Status: NEW → ASSIGNED
Iteration: --- → 34.3
Flags: qe-verify?
Hi Benjamin, can you mark this bug as qe-verify+ or qe-verify- for verification.
Flags: needinfo?(benjamin)
We should definitely verify via crash-stats with the beta audience. I'm not sure if there's any need for manual QA before that, so I'm not sure what the correct value of the flag should be.
Flags: needinfo?(benjamin)
Flags: qe-verify? → qe-verify-
Attached patch experiments-net-shutdown (obsolete) — Splinter Review
Woot. This is the patch and test, and I've verified that the test fails (times out) before the fix and passes after. Georg assuming this is ok or has trivial review nits, please push this for me Monday so it hits this iteration.
Attachment #8481625 - Flags: review?(georg.fritzsche)
Comment on attachment 8481625 [details] [diff] [review]
experiments-net-shutdown

Review of attachment 8481625 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/experiments/Experiments.jsm
@@ +991,4 @@
>          return;
>        }
>  
>        deferred.resolve(xhr.responseText);

Should null out the request here as well.

::: browser/experiments/test/xpcshell/test_nethang_bug1012924.js
@@ +37,5 @@
> +
> +  let experiments = Experiments.instance();
> +  experiments.updateManifest().then(
> +    () => {
> +      ok(true, "updateManifest finished successfully");

Lets use Assert.jsm like the other tests do already.
Attachment #8481625 - Flags: review?(georg.fritzsche) → review+
Attachment #8451614 - Attachment is obsolete: true
Attachment #8481625 - Attachment is obsolete: true
Attachment #8482276 - Flags: review+
All trees are still closed.
Keywords: checkin-needed
Iteration: 34.3 → 35.1
Whoops, didn't want leave-open for this. This made it in pre-branch for 34.
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → Firefox 34
Is this something we're going to want on Beta33?
Flags: needinfo?(benjamin)
After verification on nightly, yes.
Flags: needinfo?(benjamin)
This bug appears to have helped on nightly, but the problem is still present in significant volume. I'm going to file a followup to track that, since I think we should uplift this regardless and the tracking/approval flags will start to go insane if it stays all in one bug.
Comment on attachment 8482276 [details] [diff] [review]
experiments-net-shutdown, v2

Approval Request Comment
[Feature/regressing bug #]: Experiments feature
[User impact if declined]: Shutdown hangs
[Describe test coverage new/current, TBPL]: Landed, checked via crash-stats
[Risks and why]: This is not completely low-risk, but the risk should be entirely localized to the experiment system, which is only enabled with telemetry.
[String/UUID change made/needed]: None
Attachment #8482276 - Flags: approval-mozilla-beta?
Attachment #8482276 - Flags: approval-mozilla-aurora?
Blocks: 1064470
Comment on attachment 8482276 [details] [diff] [review]
experiments-net-shutdown, v2

I'm dropping the Aurora approval request as Aurora is now 34, in which this bug has already been fixed. Although there is some risk, this has been a pita during the last two beta cycles. Let's see how this fix does in beta2.

Benjamin - As this doesn't completely resolve the issue, can you please mark your followup bug for tracking?
Attachment #8482276 - Flags: approval-mozilla-beta?
Attachment #8482276 - Flags: approval-mozilla-beta+
Attachment #8482276 - Flags: approval-mozilla-aurora?
Flags: needinfo?(benjamin)
Flags: needinfo?(benjamin)
Blocks: 1114567
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in before you can comment on or make changes to this bug.