Bug 828149 (Closed) · Opened 7 years ago · Closed 7 years ago

Firefox Health Report causes crash during application shutdown [on Talos Linux64 opt]

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect)


Tracking

(firefox19 unaffected, firefox20+ fixed, firefox21+ fixed)

RESOLVED FIXED
Firefox 21

People

(Reporter: gps, Assigned: gps)

References

Details

Attachments

(1 file, 1 obsolete file)

We're seeing weird hangs and crashes on Talos with Firefox Health Report. We hit this 100% of the time on Linux64 opt builds, and other builders also turn Talos red more frequently with FHR enabled.

From https://tbpl.mozilla.org/?tree=Try&rev=9f41fae3cc07:

NOISE: 1357666153772	Services.HealthReport.HealthReporter	INFO	Initializing health reporter instance against healthreport.
NOISE: 1357666153864	Sqlite.ConnectionOpener	INFO	Opening database: /tmp/tmpy-6Gx6/profile/healthreport.sqlite (healthreport.sqlite#1)
NOISE: 1357666153866	Sqlite.Connection.healthreport.sqlite	INFO	Conn #1: Opened
NOISE: 1357666153866	Sqlite.Connection.healthreport.sqlite	DEBUG	Conn #1: Beginning transaction
NOISE: 1357666153866	Services.Metrics.MetricsStorage	DEBUG	Database schema up to date.
NOISE: 1357666153866	Sqlite.Connection.healthreport.sqlite	DEBUG	Conn #1: Transaction committed.
NOISE: 1357666153867	Sqlite.Connection.healthreport.sqlite	TRACE	Conn #1: Stmt #0 SELECT * FROM types
NOISE: 1357666153871	Sqlite.Connection.healthreport.sqlite	DEBUG	Conn #1: Stmt #0 finished
NOISE: 1357666153872	Sqlite.Connection.healthreport.sqlite	TRACE	Conn #1: Stmt #1 SELECT * FROM v_measurements
NOISE: 1357666153874	Sqlite.Connection.healthreport.sqlite	DEBUG	Conn #1: Stmt #1 finished
NOISE: 1357666153874	Sqlite.Connection.healthreport.sqlite	TRACE	Conn #1: Stmt #2 SELECT * FROM v_fields
NOISE: 1357666153877	Sqlite.Connection.healthreport.sqlite	DEBUG	Conn #1: Stmt #2 finished
NOISE: 1357666153877	Services.HealthReport.HealthReporter	INFO	Storage initialized.
NOISE: 1357666153877	Services.HealthReport.HealthReporter	INFO	Initializing collector.
NOISE: 1357666153877	Services.HealthReport.HealthReporter	INFO	Registering providers from category: healthreport-js-provider
NOISE: 1357666153877	Services.HealthReport.HealthReporter	INFO	Attempting to load provider from category manager: AddonsProvider from resource://gre/modules/services/healthreport/providers.jsm
NOISE: 1357666153879	Services.Metrics.Collector	INFO	Initializing provider with storage: org.mozilla.addons
NOISE: 1357666153880	Services.Metrics.MetricsStorage	TRACE	Enqueueing operation.
NOISE: 1357666153880	Services.Metrics.MetricsStorage	TRACE	Performing queued operation.
NOISE: 1357666153880	Sqlite.Connection.healthreport.sqlite	TRACE	Conn #1: Stmt #3 INSERT INTO providers (name) VALUES (:provider) - {"provider":"org.mozilla.addons"}
NOISE: 1357666153880	Services.HealthReport.HealthReporter	INFO	Attempting to load provider from category manager: AppInfoProvider from resource://gre/modules/services/healthreport/providers.jsm
NOISE: 1357666153881	Services.HealthReport.HealthReporter	INFO	Attempting to load provider from category manager: CrashesProvider from resource://gre/modules/services/healthreport/providers.jsm
NOISE: 1357666153881	Services.HealthReport.HealthReporter	INFO	Attempting to load provider from category manager: ProfileMetadataProvider from resource://gre/modules/services/healthreport/profile.jsm
NOISE: 1357666153884	Services.HealthReport.HealthReporter	INFO	Attempting to load provider from category manager: SessionsProvider from resource://gre/modules/services/healthreport/providers.jsm
NOISE: 1357666153885	Services.HealthReport.HealthReporter	INFO	Attempting to load provider from category manager: SysInfoProvider from resource://gre/modules/services/healthreport/providers.jsm
NOISE: 1357666153997	Sqlite.Connection.healthreport.sqlite	DEBUG	Conn #1: Stmt #3 finished
NOISE: 1357666153997	Sqlite.Connection.healthreport.sqlite	TRACE	Conn #1: Stmt #4 SELECT id FROM providers WHERE name = :provider - {"provider":"org.mozilla.addons"}
NOISE: 1357666153998	Services.Metrics.MetricsStorage	WARN	Failure when performing queued operation: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [mozIStorageAsyncStatement.executeAsync] Stack trace: resource://gre/modules/Sqlite.jsm:600 < resource://gre/modules/Sqlite.jsm:373 < createMeasurement()@resource://gre/modules/services/metrics/storage.jsm:991 < TaskImpl_run()@resource://gre/modules/Task.jsm:192 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < resolveDeferred()@resource://gre/modules/commonjs/promise/core.js:125 < then()@resource://gre/modules/commonjs/promise/core.js:34 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < resource://gre/modules/Sqlite.jsm:608 < <file:unknown>
NOISE: 1357666153999	Services.Metrics.Collector	WARN	Provider initialization failed: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [mozIStorageAsyncStatement.executeAsync] Stack trace: resource://gre/modules/Sqlite.jsm:600 < resource://gre/modules/Sqlite.jsm:373 < createMeasurement()@resource://gre/modules/services/metrics/storage.jsm:991 < TaskImpl_run()@resource://gre/modules/Task.jsm:192 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < resolveDeferred()@resource://gre/modules/commonjs/promise/core.js:125 < then()@resource://gre/modules/commonjs/promise/core.js:34 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < resource://gre/modules/Sqlite.jsm:608 < <file:unknown>
NOISE: 1357666153999	Services.HealthReport.HealthReporter	ERROR	Error during initialization: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [mozIStorageAsyncStatement.executeAsync] Stack trace: resource://gre/modules/Sqlite.jsm:600 < resource://gre/modules/Sqlite.jsm:373 < createMeasurement()@resource://gre/modules/services/metrics/storage.jsm:991 < TaskImpl_run()@resource://gre/modules/Task.jsm:192 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < resolveDeferred()@resource://gre/modules/commonjs/promise/core.js:125 < then()@resource://gre/modules/commonjs/promise/core.js:34 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < resource://gre/modules/Sqlite.jsm:608 < <file:unknown>
NOISE: 1357666153999	Services.HealthReport.HealthReporter	INFO	Request to shut down.
NOISE: 1357666153999	Services.HealthReport.HealthReporter	WARN	Collector is in progress of initializing. Waiting to finish.
NOISE: 1357666153999	Services.Metrics.Collector	INFO	Initializing provider with storage: org.mozilla.appInfo
NOISE: 1357666154000	Services.Metrics.MetricsStorage	TRACE	Enqueueing operation.
NOISE: 1357666154000	Services.Metrics.MetricsStorage	TRACE	Performing queued operation.
NOISE: 1357666154000	Sqlite.Connection.healthreport.sqlite	TRACE	Conn #1: Stmt #5 INSERT INTO providers (name) VALUES (:provider) - {"provider":"org.mozilla.appInfo"}
NOISE: 1357666154002	Services.Metrics.MetricsStorage	WARN	Failure when performing queued operation: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [mozIStorageAsyncStatement.executeAsync] Stack trace: resource://gre/modules/Sqlite.jsm:600 < resource://gre/modules/Sqlite.jsm:373 < createMeasurement()@resource://gre/modules/services/metrics/storage.jsm:989 < TaskImpl_run()@resource://gre/modules/Task.jsm:192 < TaskImpl()@resource://gre/modules/Task.jsm:163 < Task_spawn()@resource://gre/modules/Task.jsm:135 < createMeasurementOperation()@resource://gre/modules/services/metrics/storage.jsm:985 < resource://gre/modules/services/metrics/storage.jsm:1277 < resource://gre/modules/services/metrics/storage.jsm:1260 < resource://gre/modules/services/metrics/storage.jsm:984 < init()@resource://gre/modules/services/metrics/dataprovider.jsm:429 < TaskImpl_run()@resource://gre/modules/Task.jsm:192 < TaskImpl()@resource://gre/modules/Task.jsm:163 < Task_spawn()@resource://gre/modules/Task.jsm:135 < resource://gre/modules/services/metrics/dataprovider.jsm:421 < resource://gre/modules/services/metrics/collector.jsm:92 < onError()@resource://gre/modules/services/metrics/collector.jsm:122 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < rejectDeferred()@resource://gre/modules/commonjs/promise/core.js:126 < then()@resource://gre/modules/commonjs/promise/core.js:43 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < reject()@resource://gre/modules/commonjs/promise/core.js:190 < TaskImpl_run()@resource://gre/modules/Task.jsm:223 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < rejectDeferred()@resource://gre/modules/commonjs/promise/core.js:126 < then()@resource://gre/modules/commonjs/promise/core.js:43 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < reject()@resource://gre/modules/commonjs/promise/core.js:190 < onError()@resource://gre/modules/services/metrics/storage.jsm:1308 < 
effort()@resource://gre/modules/commonjs/promise/core.js:53 < rejectDeferred()@resource://gre/modules/commonjs/promise/core.js:126 < then()@resource://gre/modules/commonjs/promise/core.js:43 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < reject()@resource://gre/modules/commonjs/promise/core.js:190 < TaskImpl_run()@resource://gre/modules/Task.jsm:223 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < resolveDeferred()@resource://gre/modules/commonjs/promise/core.js:125 < then()@resource://gre/modules/commonjs/promise/core.js:34 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < resource://gre/modules/Sqlite.jsm:608 < <file:unknown>
NOISE: 1357666154003	Services.Metrics.Collector	WARN	Provider initialization failed: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [mozIStorageAsyncStatement.executeAsync] Stack trace: resource://gre/modules/Sqlite.jsm:600 < resource://gre/modules/Sqlite.jsm:373 < createMeasurement()@resource://gre/modules/services/metrics/storage.jsm:989 < TaskImpl_run()@resource://gre/modules/Task.jsm:192 < TaskImpl()@resource://gre/modules/Task.jsm:163 < Task_spawn()@resource://gre/modules/Task.jsm:135 < createMeasurementOperation()@resource://gre/modules/services/metrics/storage.jsm:985 < resource://gre/modules/services/metrics/storage.jsm:1277 < resource://gre/modules/services/metrics/storage.jsm:1260 < resource://gre/modules/services/metrics/storage.jsm:984 < init()@resource://gre/modules/services/metrics/dataprovider.jsm:429 < TaskImpl_run()@resource://gre/modules/Task.jsm:192 < TaskImpl()@resource://gre/modules/Task.jsm:163 < Task_spawn()@resource://gre/modules/Task.jsm:135 < resource://gre/modules/services/metrics/dataprovider.jsm:421 < resource://gre/modules/services/metrics/collector.jsm:92 < onError()@resource://gre/modules/services/metrics/collector.jsm:122 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < rejectDeferred()@resource://gre/modules/commonjs/promise/core.js:126 < then()@resource://gre/modules/commonjs/promise/core.js:43 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < reject()@resource://gre/modules/commonjs/promise/core.js:190 < TaskImpl_run()@resource://gre/modules/Task.jsm:223 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < rejectDeferred()@resource://gre/modules/commonjs/promise/core.js:126 < then()@resource://gre/modules/commonjs/promise/core.js:43 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < reject()@resource://gre/modules/commonjs/promise/core.js:190 < onError()@resource://gre/modules/services/metrics/storage.jsm:1308 < 
effort()@resource://gre/modules/commonjs/promise/core.js:53 < rejectDeferred()@resource://gre/modules/commonjs/promise/core.js:126 < then()@resource://gre/modules/commonjs/promise/core.js:43 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < reject()@resource://gre/modules/commonjs/promise/core.js:190 < TaskImpl_run()@resource://gre/modules/Task.jsm:223 < effort()@resource://gre/modules/commonjs/promise/core.js:53 < resolveDeferred()@resource://gre/modules/commonjs/promise/core.js:125 < then()@resource://gre/modules/commonjs/promise/core.js:34 < resolve()@resource://gre/modules/commonjs/promise/core.js:167 < resource://gre/modules/Sqlite.jsm:608 < <file:unknown>

The order of operations is most weird. We see an INSERT performed; immediately afterward, a SELECT fails with NS_ERROR_UNEXPECTED; and immediately after that, the log shows a shutdown request arriving while collector initialization is still in progress.

It certainly appears that shutdown is triggering some kind of assertion. However, we've taken steps to ensure that the FHR shutdown and database termination logic is sound: we have unit test coverage that injects shutdown requests at various phases of the FHR service lifecycle, and everything appears to work. Furthermore, logs from other Talos runners show that shutdown is handled properly on other builders.

I've talked with Marco for the Storage perspective and with Jeff Hammel for the Talos perspective, and nothing obvious stands out so far. Talos seems to initiate shutdown the same way Firefox does, and it isn't employing any aggressive process killing that would cause database corruption. The current theories are:

1) The database is getting corrupted somehow. Possible check: push a try build with "PRAGMA integrity_check" on connection open and see if corruption is detected.
2) We're running into a Storage bug (try to repro with a debug build and get storage logs)
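Theory 1 could be probed with a one-statement diagnostic issued when the connection opens (a sketch of the proposed try-build check, not code that exists in the tree):

```sql
-- Diagnostic for theory 1: run on connection open in a try build.
-- Returns a single row containing "ok" if no corruption is detected,
-- otherwise rows describing the problems found.
PRAGMA integrity_check;
```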

I'll be actively chasing down these and other minor theories.

There is a potential for this to introduce a general crasher if landed in its current state. I've talked to Alex Keybl about this and he has blessed the idea of landing FHR on m-c enabled everywhere except Linux64. If we see crashes on other platforms in the wild, we'll assess the severity and back out if needed.
(In reply to Gregory Szorc [:gps] from comment #0)
> The order of operations is most weird. We're seeing an INSERT performed.
> Immediately afterward, a SELECT fails with NS_ERROR_UNEXPECTED. Immediately
> thereafter, we see a log entry for shutdown initialization.

It sounds like your read might be failing with SQLITE_BUSY (which gets converted to NS_ERROR_UNEXPECTED). With the default rollback journal mode, your reads will fail if a write operation is occurring simultaneously.

http://www.sqlite.org/draft/lockingv3.html
Interesting. We should only ever have 1 process consuming the db file and only 1 connection per process to the database. So, there should be no contention on the database. My reading of that document is that if this holds, we should have no trouble with SQLITE_BUSY unless there is a hot journal.

Marco: I see from section 4.1 of the linked document that you are supposed to take special actions when dealing with hot journals. Is this something Storage does for us or must all SQLite consumers implement this functionality?

Also, I'm wondering if there are some PRAGMA commands we should issue upon opening the database. Currently we use none (unless Storage is issuing some automatically).
Flags: needinfo?(mak77)
(In reply to Gregory Szorc [:gps] from comment #2)
> Interesting. We should only ever have 1 process consuming the db file and
> only 1 connection per process to the database. So, there should be no
> contention on the database. My reading of that document is that if this
> holds, we should have no trouble with SQLITE_BUSY unless there is a hot
> journal.

Didn't we discuss this afternoon what might happen if the DB is replaying the journal as you open it?
(In reply to Gregory Szorc [:gps] from comment #2)
> Interesting. We should only ever have 1 process consuming the db file and
> only 1 connection per process to the database. So, there should be no
> contention on the database. My reading of that document is that if this
> holds, we should have no trouble with SQLITE_BUSY unless there is a hot
> journal.

That's right, if you only have 1 thread in 1 process accessing the database there should be no problem with busy since everything is just serialized.

> Marco: I see from section 4.1 of the linked document that you are supposed
> to take special actions when dealing with hot journals. Is this something
> Storage does for us or must all SQLite consumers implement this
> functionality?

AFAIK the only thing Storage does is use a busy callback on the async thread to wait for the main thread. No special treatment for hot journals is provided, though; as the documentation states, hot journals are a special case and I don't think we are hitting that (and even if we were, I'd not expect issues with a single connection).

> Also, I'm wondering if there are some PRAGMA commands we should issue upon
> opening the database. Currently we use none (unless Storage is issuing some
> automatically).

You may do some experiments, and maybe these could give us a hint.
1. change journal mode: PRAGMA journal_mode = truncate
2. if you use temp tables: PRAGMA temp_store = MEMORY
3. change synchronous: PRAGMA synchronous = FULL
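As a connection-setup fragment, the three suggested experiments would look like this (values taken from the list above; whether Storage already issues any of these automatically is the open question from comment 2):

```sql
-- Experiment 1: switch from the default DELETE journal to TRUNCATE
PRAGMA journal_mode = TRUNCATE;
-- Experiment 2: keep temp tables in memory (only relevant if temp tables are used)
PRAGMA temp_store = MEMORY;
-- Experiment 3: strongest durability setting
PRAGMA synchronous = FULL;
```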

Did you try to handle that problem by cancelling all statements on shutdown? I wonder if there's a relation with what you see.
Flags: needinfo?(mak77)
(In reply to Marco Bonardo [:mak] from comment #4)
> did you try to handle that problem with cancelling all statements on
> shutdown? I wonder if there's a relation with what you see.

This cannot be the issue because we only cancel statements in the Sqlite.jsm .close() routine and we aren't seeing the log message indicating that method is anywhere close to being called.
Blocks: 829887
No longer blocks: 718066
Assignee: nobody → gps
Ehsan provided me a log that clearly indicated FHR being initialized *after* quit-application was notified. He also confirmed that nsITimer instances aren't cancelled as part of shutdown. I'm reasonably confident we found the source of our shutdown crashes.
Attachment #702078 - Flags: review?(rnewman)
Attachment #702078 - Flags: feedback?(ehsan)
Comment on attachment 702078 [details] [diff] [review]
Cancel initialization timer on shutdown, v1

rs=me. I did test a build with this patch and didn't observe the crash, and I promised gps I'd keep an eye on this once the patch lands, in case the aborts reappear.
Attachment #702078 - Flags: feedback?(ehsan) → feedback+
Comment on attachment 702078 [details] [diff] [review]
Cancel initialization timer on shutdown, v1

>+        // Shutdown doesn't clear pending timers. So, we need to explicitly
>+        // cancel our health reporter initialization timer or else it will
>+        // attempt initialization after shutdown has commenced. This would
>+        // likely lead to stalls or crashes.
>+        if (this.timer) {
>+          this.timer.cancel();
>+        }
>+

Is this racy? I.e., could the timer fire and enqueue the callback (the initialization function) before we call timer.cancel()? If so, we could still sometimes get these crashes.
Comment on attachment 702078 [details] [diff] [review]
Cancel initialization timer on shutdown, v1

Review of attachment 702078 [details] [diff] [review]:
-----------------------------------------------------------------

You should make a note somewhere that we're quitting, and check for it in `get healthReporter()`. If the timer has fired but the contents haven't run, we'll still die. (This is a narrower race.)

I would also suggest that HealthReporter install a quit listener in its constructor, and do a similar thing in init. Its initialization is asynchronous, so we can hit the point of fetching an instance on timer (constructor finished), but have the quit occur prior to execution of all the delayed init work.

I'll accept a proof that this is impossible instead.
Attachment #702078 - Flags: review?(rnewman) → feedback+
(In reply to Vladan Djeric (:vladan) from comment #8)
> Is this racy? I.e., could the timer fire and enqueue the callback
> (the initialization function) before we call timer.cancel()? If so, we
> could still sometimes get these crashes.

I'm not sure! While I might be able to test it, I don't want to risk it. I'll revise the patch to record that "quit-application" has fired and check for it in the timer callback.

Regarding rnewman's point about HealthReporter installing a quit listener: it does! However, it's only installed just before storage is initialized, not in the ctor (https://hg.mozilla.org/services/services-central/file/6d3cb2d6873e/services/healthreport/healthreporter.jsm#l251). Same difference as far as I'm concerned.
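The "record that quit-application fired and check it in the timer callback" approach can be sketched in plain JavaScript (a minimal illustration only; the class and method names here are invented for the example and are not from the actual patch):

```javascript
// Hypothetical sketch: guard a delayed-initialization timer with a
// shutdown flag, so a callback that was already queued when shutdown
// began becomes a no-op instead of initializing after shutdown.
class DelayedInitializer {
  constructor(initFn, delayMs) {
    this._shuttingDown = false;
    this._timer = setTimeout(() => {
      // Closes the race window between the timer firing and cancel():
      // even if this callback was queued before shutdown() ran, the
      // flag check stops it from doing any work.
      if (this._shuttingDown) {
        return;
      }
      initFn();
    }, delayMs);
  }

  shutdown() {
    this._shuttingDown = true;  // checked by an already-queued callback
    clearTimeout(this._timer);  // cancels the timer if it has not fired
  }
}
```

The point, matching rnewman's review comment, is that cancelling the timer alone is not enough once the callback has been queued; the flag check covers that narrower race.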
Attachment #702078 - Attachment is obsolete: true
Attachment #702102 - Flags: review?(rnewman)
Comment on attachment 702102 [details] [diff] [review]
Cancel initialization timer on shutdown, v2

Review of attachment 702102 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM.
Attachment #702102 - Flags: review?(rnewman) → review+
Landed on Greg's behalf.

https://hg.mozilla.org/services/services-central/rev/ad87f1e26c5a

Ehsan, please verify after merge when you're happy that this is working! Marking you with NEEDINFO accordingly.

Thanks!
Status: NEW → ASSIGNED
Flags: needinfo?(ehsan)
Whiteboard: [fixed in services]
Enabled FHR on Linux:

https://hg.mozilla.org/services/services-central/rev/450f0995c9ad

If it doesn't fix the Talos issue, we'll back out before merging.
No need for the needinfo flag.  I'll complain if I see bug 830145 or bug 830448 again when this gets merged.

(And note that I don't actually know if this patch fixes this bug.  I think Gregory posted the patch here because he strongly believes this is a dupe of bug 830145, but I'd have expected to see this patch there...)
Flags: needinfo?(ehsan)
I posted the patch here because this is the oldest bug, it was called out in my dev-planning post (and thus has a number of interested followers), and I strongly feel your two shutdown-related bugs are dupes.
https://hg.mozilla.org/mozilla-central/rev/ad87f1e26c5a
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [fixed in services]
Target Milestone: --- → mozilla21
Comment on attachment 702102 [details] [diff] [review]
Cancel initialization timer on shutdown, v2

[Approval Request Comment]

Blanket FHR approval request. This patch seems to fix our Linux crasher!
Attachment #702102 - Flags: approval-mozilla-aurora?
Duplicate of this bug: 830448
Duplicate of this bug: 830145
Attachment #702102 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Component: Metrics and Firefox Health Report → Client: Desktop
Product: Mozilla Services → Firefox Health Report
Target Milestone: mozilla21 → Firefox 21
Product: Firefox Health Report → Firefox Health Report Graveyard