Closed Bug 1255745 Opened 4 years ago Closed 11 months ago

Async shutdown crash with "Experiments.jsm shutdown"

Categories

(Toolkit :: Telemetry, defect, P4, critical)

defect
Points:
3

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox45 --- wontfix
firefox46 + wontfix
firefox47 + fixed
firefox48 + wontfix
firefox49 + wontfix
firefox50 + wontfix
firefox51 + wontfix

People

(Reporter: mak, Unassigned)

Details

(Keywords: crash, regression, Whiteboard: [measurement:client] [check in 47 beta])

Crash Data

Attachments

(1 file, 4 obsolete files)

We are crashing on shutdown for some users due to the async shutdown timeout abort.
That indicates this code can block shutdown for a minute.

A couple example crashes from Nightly:
https://crash-stats.mozilla.com/report/index/71405002-5e97-4b21-9c6a-855bd2160310
https://crash-stats.mozilla.com/report/index/90fe93c0-6a8b-423b-b03a-af9b72160305

I've seen this crash in all the recent versions while investigating bug 1250424.
This is the third most common cause of shutdown aborts (19% of shutdown aborts).

It would be worth fixing this and uplifting a fix, crashing on shutdown after blocking firefox for one minute is not a pleasant experience for our users.

From a quick look, sounds like if _saveToCache fails constantly, sounds like we could be trapped into the _main loop forever due to 
while (this._refresh || this._terminateReason || this._dirty);
where this_dirty is always true.
Severity: normal → critical
Priority: -- → P1
Whiteboard: [measurement:client]
I wonder if that's somehow related to bug 1245544.
(In reply to Alessio Placitelli [:Dexter] from comment #1)
> I wonder if that's somehow related to bug 1245544.

well, if we stop seeing signatures of this after bug 1245544, it may be. it's early to tell.
But crash-stats should help figuring that in the next days.
this is on yesterday's nightly
https://crash-stats.mozilla.com/report/index/c3b3b880-79d6-4e53-8dbc-218d52160311
addon manager is waiting for experiments.jsm to shutdown, please check AsyncShutdownTimeout metadata.
The recent failures with _dirty state sound worth investigating.

For other failures, we do know of AsyncShutdown timeouts with active experiments (see e.g. bug 1218266 et al) - i do recall that we identified some scenarios where AddonManager might not trigger callbacks etc., and for those we have limited abilities for fixes in Experiments.jsm.
Assignee: nobody → gfritzsche
Tracking for 46+ since this is a significant crash on 46.0b1 and is also appearing on 47.
(In reply to Marco Bonardo [::mak] from comment #0)
> From a quick look, sounds like if _saveToCache fails constantly, sounds like
> we could be trapped into the _main loop forever due to 
> while (this._refresh || this._terminateReason || this._dirty);
> where this_dirty is always true.

This case actually is sound:
* on failure, _saveToCache throws:
  https://dxr.mozilla.org/mozilla-central/rev/f0c0480732d36153e8839c7f17394d45f679f87d/browser/experiments/Experiments.jsm#1006
* _main doesn't catch CacheWriteError, so we break out of it:
  https://dxr.mozilla.org/mozilla-central/rev/f0c0480732d36153e8839c7f17394d45f679f87d/browser/experiments/Experiments.jsm#799
* _run catches it and goes on to reschedule trying again for later:
  https://dxr.mozilla.org/mozilla-central/rev/f0c0480732d36153e8839c7f17394d45f679f87d/browser/experiments/Experiments.jsm#765
* ... but because we are in shutdown, we break out of that:
  https://dxr.mozilla.org/mozilla-central/rev/f0c0480732d36153e8839c7f17394d45f679f87d/browser/experiments/Experiments.jsm#774

I put quite a bit of time with Yoric into chasing AsyncShutdown timeouts for Experiments before.
We know that this spikes with experiments being active, so given that and shutdown crashes being not very problematic for users, i don't see this as critical unless there was a recent spike (other than experiments being activated).

I see "DisplayPort Size Tuning" in the first two metadata entries, on beta we currently have regular e10s experiments running:
https://telemetry-experiment.cdn.mozilla.net/
The two crashes from comment 0 actually have identical metadata:
https://crash-stats.mozilla.com/report/index/71405002-5e97-4b21-9c6a-855bd2160310
https://crash-stats.mozilla.com/report/index/90fe93c0-6a8b-423b-b03a-af9b72160305

...
"20: evaluateExperiments() - activating experiment displayport-tuning-nightly@experiments.mozilla.org",
"10: uninit: started",
"10: uninit: finished with _loadTask",
"10: uninit: no previous shutdown",
"10: Unregistering instance with Addon Manager.",
"10: Removing install listener from add-on manager.",
"10: Removing addon listener from add-on manager.",
"10: Finished unregistering with addon manager.",
"10: Unregistering previous experiment add-on provider.",
"10: uninit: waiting on _mainTask"

* "activating experiment ..." is here:
  https://dxr.mozilla.org/mozilla-central/rev/f0c0480732d36153e8839c7f17394d45f679f87d/browser/experiments/Experiments.jsm#1285
* from that i expect to see the log entry from experiment.start() here:
  https://dxr.mozilla.org/mozilla-central/rev/f0c0480732d36153e8839c7f17394d45f679f87d/browser/experiments/Experiments.jsm#1756
* ... but oddly there is no "start() for ..." entry in the latestLogs
Adding the log level to the log entries is not really useful for anything here. I'm more interested in how close those events happen together.
Attachment #8730319 - Flags: review?(alessio.placitelli)
Attachment #8730319 - Flags: review?(alessio.placitelli) → review+
This improves the AsyncShutdown state info by fixing the experiments listing and adding the active experiment.
Attachment #8730832 - Flags: review?(alessio.placitelli)
Attachment #8730319 - Attachment is obsolete: true
Attachment #8730832 - Flags: review?(alessio.placitelli) → review+
Points: --- → 3
We were not adding log entries of ExperimentEntry and other instances before, we need at least ExperimentEntry to be able to narrow this down more.
This also increases the forensic log a bit to account for the additional entries and removes some unneeded big data addition to the logging.
Attachment #8731335 - Flags: review?(alessio.placitelli)
The last state we know those clients are in is here:
https://dxr.mozilla.org/mozilla-central/rev/f0c0480732d36153e8839c7f17394d45f679f87d/browser/experiments/Experiments.jsm#1285

So, experiment.start() is entered and doesn't reject/throw (otherwise we'd see "evaluateExperiments() - Unable to start experiment ..." in the log).
So most likely this is stuck in reconcileAddonState() waiting for an AddonManager callback which doesn't happen because AddonManager drops it because it's shutting down.

I've improved the forensics log a bit more to narrow down / confirm that.
Comment on attachment 8731335 [details] [diff] [review]
Make Experiments forensics log contain ExperimentEntry messages

Review of attachment 8731335 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good. Both this patch and the previous one (" Make Experiments AsyncShutdown state add the time to the log entries") touch the same code. Would it make sense to merge them?
Attachment #8731335 - Flags: review?(alessio.placitelli) → review+
Assignee: gfritzsche → nobody
Priority: P1 → P2
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] → [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ]
This may be worth uplift at least to 47 since I think we plan to run beta 47 experiments. 
Would this help with the crashes/hangs, or just help diagnose them better? 
I don't think we should dismiss shutdown crashes as less important/annoying to users.
Flags: needinfo?(gfritzsche)
This only improves diagnosis, with new experiments on Beta an uplift would definitely be helpful.
Flags: needinfo?(gfritzsche)
Approval Request Comment
[Feature/regressing bug #]: Telemetry Experiments
[User impact if declined]: Not enough info on Experiments shutdown timeouts.
[Describe test coverage new/current, TreeHerder]: Baked fine on Nightly.
[Risks and why]: Low risk, this just has some diagnostics changes and improvements.
[String/UUID change made/needed]: None.
Attachment #8730832 - Attachment is obsolete: true
Attachment #8730833 - Attachment is obsolete: true
Attachment #8731335 - Attachment is obsolete: true
Attachment #8736280 - Flags: review+
Attachment #8736280 - Flags: approval-mozilla-beta?
Attachment #8736280 - Flags: approval-mozilla-aurora?
Comment on attachment 8736280 [details] [diff] [review]
Improve Experiments shutdown timeout diagnostics

OK for aurora uplift, will help with diagnostics once 47 hits beta. 
Too late for 46 though.
Attachment #8736280 - Flags: approval-mozilla-beta?
Attachment #8736280 - Flags: approval-mozilla-beta-
Attachment #8736280 - Flags: approval-mozilla-aurora?
Attachment #8736280 - Flags: approval-mozilla-aurora+
Whiteboard: [measurement:client] → [measurement:client] [check in 47 beta]
Priority: P2 → P3
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] → [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ Abort | mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ]
Looks like this diagnostic patch landed in 48. We still have a lot of crash reports for https://crash-stats.mozilla.com/signature/?signature=Abort%20%7C%20mozalloc_abort%20%7C%20NS_DebugBreak%20%7C%20nsDebugImpl%3A%3AAbort%20%7C%20NS_InvokeByIndex, but ther are also several open bugs. 

Georg should we be tracking one of the other open bugs? Should we count this bug as fixed now or do you prefer to keep it open?
Flags: needinfo?(gfritzsche)
Flags: needinfo?(dbolter)
We should investigate this from the additional diagnostics and see if there is something to do here.
I don't really mind in which bug this happens.
Currently i don't know how soon i can get to this or if there is someone else who can.
Flags: needinfo?(gfritzsche)
Crash volume for signature 'mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex':
 - Ni.  (version 51): 0 crashes from 2016-08-01.
 - Aur. (version 50): 0 crashes from 2016-08-01.
 - Beta (version 49): 0 crashes from 2016-08-02.
 - Rel. (version 48): 0 crashes from 2016-07-25.
 - Esr  (version 45): 33 crashes from 2016-04-20.

Crash volume on the last weeks (Week N is from 08-08 to 08-14):
              W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - Esr  (45)       0       0       0       0       0       0       3

Affected platform: Windows

Crash volume for signature 'mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex':
 - Ni.  (version 51): 0 crashes from 2016-08-01.
 - Aur. (version 50): 0 crashes from 2016-08-01.
 - Beta (version 49): 0 crashes from 2016-08-02.
 - Rel. (version 48): 0 crashes from 2016-07-25.
 - Esr  (version 45): 940 crashes from 2016-04-20.

Crash volume on the last weeks (Week N is from 08-08 to 08-14):
              W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - Esr  (45)       0       0       0       0       0       0     125

Affected platforms: Windows, Linux

Crash volume for signature 'Abort | mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex':
 - Ni.  (version 51): 95 crashes from 2016-08-01.
 - Aur. (version 50): 140 crashes from 2016-08-01.
 - Beta (version 49): 3282 crashes from 2016-08-02.
 - Rel. (version 48): 2424 crashes from 2016-07-25.
 - Esr  (version 45): 1919 crashes from 2016-04-20.

Crash volume on the last weeks (Week N is from 08-08 to 08-14):
              W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - Ni.  (51)      55
 - Aur. (50)      37
 - Beta (49)    1001
 - Rel. (48)    1078       2
 - Esr  (45)     261     259     203     225     300     333     121

Affected platforms: Windows, Linux
On beta: "3282 crashes from 2016-08-02."

With this volume, I think we should start to invest more on this bug. Georg, do you know who could help with it? Thanks
Flags: needinfo?(gfritzsche)
I'm scheduling it for our next iteration.
Flags: needinfo?(gfritzsche)
Priority: P3 → P2
(In reply to Georg Fritzsche [:gfritzsche] from comment #25)
> I'm scheduling it for our next iteration.

Hi Georg, just trying to understand what to expect -- when is the next iteration?
Mossop do you know who might to able to jump on this one?
Flags: needinfo?(gfritzsche)
Flags: needinfo?(dtownsend)
Flags: needinfo?(dbolter)
I would try to start looking into it in the next two weeks.
I'm currently a bit swamped though and not sure if i will make that.
Flags: needinfo?(gfritzsche)
Georg probably has better answers than me here.
Flags: needinfo?(dtownsend)
What experiments are you currently running? Do you think this is affecting only users who are part of the experiments?  Why are we seeing these crashes on ESR?
Flags: needinfo?(gfritzsche)
I am not running experiments, bsmedberg should know about that.
I don't see any on ESR though looking here: https://telemetry-experiment.cdn.mozilla.net/
From previous investigations we decided the bug investigated here should only affect users who ran experiments.

Looking at ESR, i can see 277 reports for one signature:
https://crash-stats.mozilla.com/signature/?release_channel=esr&signature=Abort%20%7C%20mozalloc_abort%20%7C%20NS_DebugBreak%20%7C%20nsDebugImpl%3A%3AAbort%20%7C%20NS_InvokeByIndex&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&page=1#reports
... however, only 8 of these are related to Experiments:
https://crash-stats.mozilla.com/signature/?release_channel=esr&signature=Abort%20%7C%20mozalloc_abort%20%7C%20NS_DebugBreak%20%7C%20nsDebugImpl%3A%3AAbort%20%7C%20NS_InvokeByIndex&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&page=1#reports

Sampling the first reports for AsyncShutdownTimeout i see:
* OS.File & CrashMonitor
* Media shutdown
* TelemetryController
* Places DatabaseShutdown

I think these bugs need further breakdown into AsyncShutdownTimeout states to give us useful prioritization.
Flags: needinfo?(gfritzsche)
We have several current bugs reported with the signature Abort | mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex, and right now that is our #3 top crash for beta 4 and 5 with around 1200 crashes per week on each version. That's pretty bad and I'd like someone to try to improve the situation before 49 release. 

Nicholas, an I get help from the Uptime team to sort out what's going on in beta? As we are heading into beta 7, I am increasingly pressed for time.   My worry here is that this crash rate may turn into a much higher number on 49 release, than it is currently in 48 or in 49 beta.
Flags: needinfo?(n.nethercote)
Flags: needinfo?(dbolter)
Benjamin, can you shed any light on what is happening with this crash ? I am more worried about beta/release than I am about ESR here, but still puzzled why this would show up in ESR at all.
Flags: needinfo?(benjamin)
hi liz, regarding the [@ Abort | mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex] signature on 49.0b: 50% of its volume is coming from bug 1258350 and 30% from bug 1295663 (regressing in 49), so those are the most promising starting points. the rest is scattered among multiple smaller issues...
I've been looking a little at these crashes on Nightly over in bug 1273770 for Uptime, though it sounds like philipp has done a better analysis already for beta.
Flags: needinfo?(n.nethercote)
For what it is worth, these aren't quite real crashes. I believe what is happening is that we're shutting down, but some component or another takes too long, so we just give up and exit. I suppose this could cause some data loss if we're in the middle of saving something down on exit and that fails.
This bug is specifically about Experiments.jsm async shutdowns, which appear to not be a problem on beta/release/esr. I'd like to remove the rest of the noise about the general crash signature into the appropriate other bugs. Liz if you need help breaking down the various facets of this signature, can you email me and philipp off the bug?
Flags: needinfo?(dbolter)
Flags: needinfo?(benjamin)
Also, bug 1035144 would mostly have solved this problem long ago. Liz you could ask the socorro team to re-prioritize that (I may take a stab at it, but can't commit to that right now).
Wontfix for now for 49 and 50. I'll comment in the Socorro bug too; cutting through this kind of confusion sounds like a good thing.
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ Abort | mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] → [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ Abort | mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ AsyncShutd…
Nice, the new signatures are a great help. I'm limitting the signatures on this bug to relevant ones.
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ Abort | mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ AsyncShutd… → [@ AsyncShutdownTimeout | profile-before-change | Experiments.jsm shutdown ] [@ AsyncShutdownTimeout | AddonManager: Waiting for providers to shut down. | Experiments.jsm shutdown ]
Priority: P2 → P3
Very low volume of crash in 51. Mark 51 as fix-optional.
50 is in RC mode, too late to fix this in 50, the crash value is very low and I think we should be fine.
Priority: P3 → P4
The leave-open keyword is there and there is no activity for 6 months.
:gfritzsche, maybe it's time to close this bug?
Flags: needinfo?(gfritzsche)
Telemetry experiments were removed in 61 per bug 1420908, closing this.
Status: NEW → RESOLVED
Closed: 11 months ago
Flags: needinfo?(gfritzsche)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.