Closed Bug 1017706 Opened 10 years ago Closed 10 years ago

Intermittent HealthReport.jsm Shutdown | application crashed [@ mozalloc_abort(char const*)] after "ABORT: file resource://gre/modules/HealthReport.jsm, line 4356"

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect)

defect
Not set
normal

Tracking

(firefox31 unaffected, firefox32 fixed, firefox33 fixed, firefox-esr24 unaffected)

RESOLVED FIXED
Firefox 33
Tracking Status
firefox31 --- unaffected
firefox32 --- fixed
firefox33 --- fixed
firefox-esr24 --- unaffected

People

(Reporter: Yoric, Assigned: Yoric)

References

(Blocks 1 open bug, )

Details

(Keywords: intermittent-failure)

Attachments

(2 files, 4 obsolete files)

      No description provided.
Summary: At least one completion condition is taking too long to complete. Conditions: [{"name":"FHR: Flushing storage shutdown","state":"(none)"}] Barrier: Metrics Storage Backend – followed by AsyncShutdown.jsm crash → Crash in AsyncShutdown.jsm, with warning « At least one completion condition is taking too long to complete. Conditions: [{"name":"FHR: Flushing storage shutdown","state":"(none)"}] Barrier: Metrics Storage Backend »
Basically, this means that there is a problem in or around healthreporter.jsm, with `this._promiseShutdown` never being resolved.
Previous JS exception:
System JS : ERROR resource://gre/components/nsUrlClassifierListManager.js:342 - NS_ERROR_NOT_INITIALIZED: Component returned failure code: 0xc1f30001 (NS_ERROR_NOT_INITIALIZED) [nsIUrlClassifierDBService.getTables]

I don't know offhand if FHR has a dependency on url-classifier; gps do you know? Otherwise does anyone know if/how we could get a JS and maybe even c++ stack for that JS exception?
Flags: needinfo?(gps)
I don't think FHR has a direct dependency on the URL classifier service. Although, something that FHR pulls from (like Places) might.

I wasn't paying attention when bug 917883 finally landed. IIRC we experienced weird shutdown behavior (possibly a crash) the first time it landed. Perhaps it should be backed out?
Flags: needinfo?(gps)
There is clearly a relationship with bug 917883. However, what's not clear is whether landing it added a new bug or just made it more visible.
Attached patch Fix a typo, add logging (obsolete) — Splinter Review
Unfortunately, we have already landed a few bugs that depend on bug 917883 (obviously, I was too optimistic because I didn't see any errors), so I'd like to give a try to fixing the issue before we go the pains of reverting everything.

This patch does two things:
- it takes advantage of AsyncShutdown logging to tell us at which stage of shutdown we have failed;
- it fixes a (preexistent) typo that might possibly have contributed to intermittent issues during shutdown.
Assignee: nobody → dteller
Attachment #8432430 - Flags: review?(rnewman)
Attachment #8432430 - Flags: review?(gps)
(forgot a qref)
Attachment #8432430 - Attachment is obsolete: true
Attachment #8432430 - Flags: review?(rnewman)
Attachment #8432430 - Flags: review?(gps)
Attachment #8432447 - Flags: review?(rnewman)
Attachment #8432447 - Flags: review?(gps)
Comment on attachment 8432447 [details] [diff] [review]
Fix a typo, add logging

Review of attachment 8432447 [details] [diff] [review]:
-----------------------------------------------------------------

::: services/healthreport/healthreporter.jsm
@@ +581,5 @@
>  
>      if (this._initializeHadError) {
>        this._log.warn("Initialization had error. Shutting down immediately.");
>      } else {
> +      if (this._providerManagerInProgress) {

Oof.
Attachment #8432447 - Flags: review?(rnewman) → review+
Comment on attachment 8432447 [details] [diff] [review]
Fix a typo, add logging

Review of attachment 8432447 [details] [diff] [review]:
-----------------------------------------------------------------

::: services/healthreport/healthreporter.jsm
@@ +581,5 @@
>  
>      if (this._initializeHadError) {
>        this._log.warn("Initialization had error. Shutting down immediately.");
>      } else {
> +      if (this._providerManagerInProgress) {

And this is why I've been screaming for JavaScript code coverage for years.

At one time, I /thought/ I had 100% initialization state test coverage. I was wrong. Too bad I didn't have the data to tell me so.
Looks like Tomcat already pushed this but forgot to mark the bug.
https://hg.mozilla.org/integration/fx-team/rev/57a76b26c953
Keywords: checkin-needed
Whiteboard: [fixed-in-fx-team]
https://hg.mozilla.org/mozilla-central/rev/57a76b26c953
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Whiteboard: [fixed-in-fx-team]
Target Milestone: --- → Firefox 32
Bug is still present as bug 1022198.
Obviously, the typo was not the reason.
Resolution: FIXED → DUPLICATE
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Changing topic as bug 1018895 changed the output of AsyncShutdown.
Summary: Crash in AsyncShutdown.jsm, with warning « At least one completion condition is taking too long to complete. Conditions: [{"name":"FHR: Flushing storage shutdown","state":"(none)"}] Barrier: Metrics Storage Backend » → Intermittent OSX HealthReport.jsm Shutdown | application crashed [@ mozalloc_abort(char const*)] after "ABORT: file resource://gre/modules/HealthReport.jsm, line 4356"
https://tbpl.mozilla.org/php/getParsedLog.php?id=41272501&tree=Mozilla-Inbound

Rev5 MacOSX Mountain Lion 10.8 mozilla-inbound debug test mochitest-other on 2014-06-07 03:14:30 PDT for push 49abfe360b14
slave: talos-mtnlion-r5-014

03:49:05     INFO -  System JS : ERROR resource://gre/components/nsUrlClassifierListManager.js:342 - NS_ERROR_NOT_INITIALIZED: Component returned failure code: 0xc1f30001 (NS_ERROR_NOT_INITIALIZED) [nsIUrlClassifierDBService.getTables]
03:49:05     INFO -  WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"FHR: Flushing storage shutdown","state":{"shutdownInitiated":false,"initialized":true,"shutdownRequested":false,"initializeHadError":false,"providerManagerInProgress":false,"storageInProgress":false,"hasProviderManager":true,"hasStorage":true},"filename":"resource://gre/modules/HealthReport.jsm","lineNumber":4356}] Barrier: Metrics Storage Backend
03:49:05     INFO -  WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"Metrics Storage: Shutting down","state":"(none)","filename":"resource://gre/modules/HealthReport.jsm","lineNumber":2453}] Barrier: Sqlite.jsm: wait until all clients have completed their task
03:49:05     INFO -  WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"Sqlite.jsm shutdown blocker","state":{"description":"Waiting for the barrier to be lifted"},"filename":"resource://gre/modules/Sqlite.jsm","lineNumber":84}] Barrier: profile-before-change
03:49:56     INFO -  ERROR: undefined
03:49:56     INFO -  [1485] ###!!! ABORT: file resource://gre/modules/HealthReport.jsm, line 4356
Attached patch Fixing a typo in AsyncShutdown (obsolete) — Splinter Review
Fixing a second typo, this time in AsyncShutdown logging. This typo is responsible for the `ERROR: undefined` in the log, but not for the actual bug. Self-reviewing, as this is just a one-char change.
Attachment #8436360 - Flags: review+
Comment on attachment 8436360 [details] [diff] [review]
Fixing a typo in AsyncShutdown

Ah, it seems that there is a race condition between RyanVM and myself. This has already been landed as part of bug 1022198 before I could mark it obsolete over there.
Attachment #8436360 - Attachment is obsolete: true
No longer blocks: 1022198
According to the logs:
1. both `shutdownInitiated` and `shutdownRequested` are `false`, which means that we have never entered `_initiateShutdown`;
2. `storageInProgress` is `false`, `initialized` is `true`, `providerManagerInProgress` is `false`, which means that we are not caught during initialization;
3. `initializeHadError` is `false`, so initialization was satisfying.

...and we do not get the information on `_shutdownComplete` because I forgot an underscore in the logging code. The kind of errors that makes one wish JS were a statically typed language. Fortunately, the information would be redundant in this case.

Barring any error, 1. means that we never receive "quit-application". This is worrying. gps, is my interpretation of 1. correct? ALso, do you have any idea what could be causing this? I imagine that it is related to

System JS : ERROR resource://gre/components/nsUrlClassifierListManager.js:342 - NS_ERROR_NOT_INITIALIZED: Component returned failure code: 0xc1f30001 (NS_ERROR_NOT_INITIALIZED) [nsIUrlClassifierDBService.getTables]
Flags: needinfo?(gps)
My current hypothesis is the following: some client of quit-application sitting before healthreporter.jsm spins the event loop. In some unknown circumstances, this spinning completely messes up with our shutdown sequence, causing nsUrlClassifierListManager to request updates too late during shutdown, and also causes healthreporter.jsm to never receive quit-application (or at least to receive it after profile-before-change).
Attached patch Bandaid patch (obsolete) — Splinter Review
Attaching a band-aid patch based on my hypothesis above. I will also investigate clients of quit-application, but I don't know how long this will take.
Attachment #8436382 - Flags: review?(rnewman)
Attachment #8436382 - Flags: review?(gps)
Benjamin, does comment 21 make sense to you and do you know of any component with such behavior?
Note that the issue is probably not caused by a test, as it has also been reported by a user of Nightly.
Flags: needinfo?(benjamin)
Summary: Intermittent OSX HealthReport.jsm Shutdown | application crashed [@ mozalloc_abort(char const*)] after "ABORT: file resource://gre/modules/HealthReport.jsm, line 4356" → Intermittent HealthReport.jsm Shutdown | application crashed [@ mozalloc_abort(char const*)] after "ABORT: file resource://gre/modules/HealthReport.jsm, line 4356"
Attached patch 2. Bandaid patch, v2 (obsolete) — Splinter Review
Let's not add new typos, we have seen enough in this bug.
Attachment #8436382 - Attachment is obsolete: true
Attachment #8436382 - Flags: review?(rnewman)
Attachment #8436382 - Flags: review?(gps)
Attachment #8436541 - Flags: review?(rnewman)
Attachment #8436541 - Flags: review?(gps)
Comment on attachment 8436541 [details] [diff] [review]
2. Bandaid patch, v2

Review of attachment 8436541 [details] [diff] [review]:
-----------------------------------------------------------------

This looks sane.
Attachment #8436541 - Flags: review?(rnewman)
Attachment #8436541 - Flags: review?(gps)
Attachment #8436541 - Flags: review+
Flags: needinfo?(gps)
Added reviewer.
Attachment #8436541 - Attachment is obsolete: true
Attachment #8437113 - Flags: review+
Target Milestone: Firefox 32 → ---
https://hg.mozilla.org/mozilla-central/rev/330b60fb2530
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 33
Please request Aurora approval for this when you get a chance.
Flags: needinfo?(dteller)
I am not aware of a situation where quit-application would not be fired if we entered into nsAppStartup::run, except perhaps if we never ran the event loop at all in some restart scenario.
Flags: needinfo?(benjamin)
Benjamin, what if some client of quit-application sitting before HealthReporter.jsm decided to spin the event loop. Look at our shutdown code, I have the impression that this could cause profile-before-change to be fired before quit-application reaches HealthReporter.jsm, wouldn't it?
Flags: needinfo?(dteller)
In normal shutdown,  quit-application is typically fired here: http://hg.mozilla.org/mozilla-central/annotate/c482c28b35b6/toolkit/components/startup/nsAppStartup.cpp#l456

and the profile notifications don't happen until the stack unwinds all the way to here: http://hg.mozilla.org/mozilla-central/annotate/7297dedf2416/toolkit/xre/nsAppRunner.cpp#l4108

So it sounds unlikely that spinning an event loop would change anything there. A completely missing notification sounds more likely.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4][PTO June 19-22] from comment #32)
> Please request Aurora approval for this when you get a chance.

Please? :)
Flags: needinfo?(dteller)
Comment on attachment 8437113 [details] [diff] [review]
2. Bandaid patch, v3

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 917883 (although I suspected it existed with different symptoms form beforehand).
User impact if declined: Occasional shutdown freeze/crashes.
Testing completed (on m-c, etc.): Has been on m-c for one week.
Risk to taking this patch (and alternatives if risky): Can't think of any.
String or IDL/UUID changes made by this patch: None.
Attachment #8437113 - Flags: approval-mozilla-aurora?
Flags: needinfo?(dteller)
Comment on attachment 8437113 [details] [diff] [review]
2. Bandaid patch, v3

Aurora approval granted.
Attachment #8437113 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: