Closed Bug 961526 Opened 10 years ago Closed 10 years ago

crash in android.database.sqlite.SQLiteConstraintException: foreign key constraint failed (code 19) at android.database.sqlite.SQLiteConnection.nativeExecuteForLastInsertedRowId(Native Method)

Categories

(Firefox Health Report Graveyard :: Client: Android, defect)

27 Branch
All
Android
defect
Not set
critical

Tracking

(firefox26 unaffected, firefox27+ wontfix, firefox28+ fixed, firefox29 fixed, firefox30 fixed, b2g-v1.3 fixed, fennec27+)

RESOLVED FIXED
Firefox 30
Tracking Status
firefox26 --- unaffected
firefox27 + wontfix
firefox28 + fixed
firefox29 --- fixed
firefox30 --- fixed
b2g-v1.3 --- fixed
fennec 27+ ---

People

(Reporter: aaronmt, Assigned: mcomella)

References

Details

(Keywords: crash, regression, verifyme)

Crash Data

Attachments

(3 files)

This bug was filed from the Socorro interface and is 
report bp-40d00a1a-b33e-4dfa-95a5-37dac2140118.
=============================================================
android.database.sqlite.SQLiteConstraintException: foreign key constraint failed (code 19)
	at android.database.sqlite.SQLiteConnection.nativeExecuteForLastInsertedRowId(Native Method)
	at android.database.sqlite.SQLiteConnection.executeForLastInsertedRowId(SQLiteConnection.java:782)
	at android.database.sqlite.SQLiteSession.executeForLastInsertedRowId(SQLiteSession.java:788)
	at android.database.sqlite.SQLiteStatement.executeInsert(SQLiteStatement.java:86)
	at android.database.sqlite.SQLiteDatabase.insertWithOnConflict(SQLiteDatabase.java:1469)
	at android.database.sqlite.SQLiteDatabase.insertOrThrow(SQLiteDatabase.java:1365)
	at org.mozilla.gecko.background.healthreport.HealthReportDatabaseStorage.recordDailyDiscrete(HealthReportDatabaseStorage.java:1156)
	at org.mozilla.gecko.background.healthreport.HealthReportDatabaseStorage.recordDailyDiscrete(HealthReportDatabaseStorage.java:1169)
	at org.mozilla.gecko.health.BrowserHealthRecorder$5.run(BrowserHealthRecorder.java:830)
	at android.os.Handler.handleCallback(Handler.java:733)
	at android.os.Handler.dispatchMessage(Handler.java:95)
	at android.os.Looper.loop(Looper.java:136)
	at org.mozilla.gecko.util.GeckoBackgroundThread.run(GeckoBackgroundThread.java:32)
Component: General → Client: Android
Product: Firefox for Android → Firefox Health Report
Version: Trunk → 27 Branch
First appeared on Jan 13th on Fx27. Mixed number of devices according to the summary.
One crash comment implies that this is recording a search from a tapped suggestion.
tracking-fennec: --- → ?
This is probably the result of Bug 925517 -- that is, we now track all searches, and somewhere along the way we're trying to track one that's not valid.
Blocks: 925517
Given that this is a foreign key constraint, it implies that either the environment or the field doesn't exist in the DB.

We have an explicit check for the environment being -1, so that points the finger at the field. But the field is the location, and I don't see anywhere we might provide an invalid value for that.
Furthermore, the field can't be invalid -- otherwise storage.getField().getID() would throw.
Assignee: nobody → rnewman
Status: NEW → ASSIGNED
This bug doesn't make any sense. The exact line that's reported as throwing:

    try {
      db.insertOrThrow(table, null, v);
    } catch (SQLiteConstraintException e) {
      throw new IllegalStateException("Event did not reference existing an environment or field.", e);
    }

This reported exception should have been an IllegalStateException, not an unwrapped SQLiteConstraintException.

We validate the environment before touching the DB.

A different exception would arise if the field wasn't registered.

The environment shouldn't have been pruned, because it should be the current one (and in any case, we only prune the oldest).

So... what's going on?

I have a patch that adds some validation, but nothing I add is really any stronger than what's already present!
Attachment #8362724 - Flags: review?(michael.l.comella)
I'm also interested in your analysis of this, mcomella.
(In reply to Richard Newman [:rnewman] from comment #8)
> This reported exception should have been an IllegalStateException, not an
> unwrapped SQLiteConstraintException.

I agree: I'm not sure why this is.

> A different exception would arise if the field wasn't registered.

Unless the fieldID changes from the last time we cached it. [1]

> The environment shouldn't have been pruned, because it should be the current
> one (and in any case, we only prune the oldest).

It could, if we somehow delete enough events to cascade delete the current environment - but that's *highly* unlikely.

But deleting the field from cascading deletions after pruning events is more likely. We cache the fieldIDs in memory, so it's possible pruning could have occurred in the background. Most of the uptimes (in seconds, confirmed w/ mfinkle) are quite large, meaning there is more time for this race to occur.

I will continue to investigate.

[1]: https://mxr.mozilla.org/mozilla-beta/source/mobile/android/base/background/healthreport/HealthReportDatabaseStorage.java?rev=6fe8a2c4841e#536
(In reply to Michael Comella (:mcomella) from comment #11)

> But deleting the field from cascading deletions after pruning events is more
> likely. We cache the fieldIDs in memory, so it's possible pruning could have
> occurred in the background. Most of the uptimes (in seconds, confirmed w/
> mfinkle) are quite large, meaning there is more time for this race to occur.

I didn't explore whether the field ID could change -- I assumed it could not, because so much code expects it to stay static once defined, and because there's basically no point in pruning fields. There aren't many, and they don't change often.

If it's possible for the pruning code to delete an active measurement's field, then we should fix that, because yeah, it could theoretically cause this bug (as well as causing more work later).

Nice sleuthing!
Attached file Forced failure logcat
(In reply to Richard Newman [:rnewman] from comment #8)
> This reported exception should have been an IllegalStateException, not an
> unwrapped SQLiteConstraintException.

I tested this by inserting `env + 1337` for env at [1]. It caught the Exception when it recorded the session data (and thus threw the IllegalStateException), but it crashed (without catching) when I tapped a search suggestion. The same thing happens with `field + 1337`.

[1]: https://mxr.mozilla.org/mozilla-central/source/mobile/android/base/background/healthreport/HealthReportDatabaseStorage.java?rev=6fe8a2c4841e#1149
(In reply to Michael Comella (:mcomella) from comment #13)

> I tested this by inserting `env + 1337` for env at [1]. It caught the
> Exception when it recorded the session data (and thus threw the
> IllegalStateException), but it crashed (without catching) when I tapped a
> search suggestion. The same thing happens with `field + 1337`.

wat

Does this happen with Proguard disabled? Just to check…
Keywords: regression
If mcomella's theory is correct, this would be a regression from Bug 870171.
Seems like a plausible theory.

Crash first seen:     2013-09-28
Bug 870171 fixed on : 2013-09-19
(In reply to Richard Newman [:rnewman] from comment #14)
> (In reply to Michael Comella (:mcomella) from comment #13)
> 
> > I tested this by inserting `env + 1337` for env at [1]. It caught the
> > Exception when it recorded the session data (and thus threw the
> > IllegalStateException), but it crashed (without catching) when I tapped a
> > search suggestion. The same thing happens with `field + 1337`.
> 
> wat
> 
> Does this happen with Proguard disabled? Just to check…

With `export MOZ_DISABLE_PROGUARD=1`, the problem still occurs.
Comment on attachment 8362724 [details] [diff] [review]
Potential patch. v1

Review of attachment 8362724 [details] [diff] [review]:
-----------------------------------------------------------------

As you mentioned, while these don't solve the bug, they are good safety improvements. r+ after the enum question is answered.

::: mobile/android/base/health/BrowserHealthRecorder.java
@@ +808,5 @@
>          }
>  
> +        final int env = this.env;
> +
> +        if (env == -1) {

nit: I'd prefer not whitespace above this line.

@@ +809,5 @@
>  
> +        final int env = this.env;
> +
> +        if (env == -1) {
> +            Log.d(LOG_TAG, "No environment: not recording search.");

Just to double-check, I assume you don't want to throw here so we don't crash? We do throw for `location == null`, but I guess that makes more sense because it's an arg.

@@ +820,5 @@
>  
> +        // Ensure that we don't throw when trying to look up the field for an
> +        // unknown location. If you add a search location, you must extend the
> +        // list of search locations *and update the measurement version*.
> +        if (!SEARCH_LOCATIONS.contains(location)) {

Why don't we use an enum here instead? :)
Attachment #8362724 - Flags: review?(michael.l.comella) → feedback+
(In reply to Michael Comella (:mcomella) from comment #18)

> Just to double-check, I assume you don't want to throw here so we don't
> crash? We do throw for `location == null`, but I guess that makes more sense
> because it's an arg.

Because not yet having an environment is a routine occurrence: we'll only have an env after startup has finished. The pattern for recording stuff is basically "not set up yet? oh well, we'll drop this event on the floor".


> Why don't we use an enum here instead? :)

Urgh, lowercase enum keys! (valueOf converts the key name, not its value, to the enum.)

Also, we're not using it as an enum, we're using it as a set -- testing membership and enumerating members. Pity Java doesn't have syntax for static sets!
(In reply to Richard Newman [:rnewman] from comment #19)
> Urgh, lowercase enum keys! (valueOf converts the key name, not its value, to
> the enum.)

You can pass in the value, set it as a member variable of the enum, and write a `getName()` method.

> Also, we're not using it as an enum, we're using it as a set -- testing
> membership and enumerating members. Pity Java doesn't have syntax for static
> sets!

I'm not sure of the semantics on the traditional use of enums, but this seems better as a compile-time type than a String argument (though there is always the unfortunate null case).
These locations also come from JS, so there'd have to be a translation/validation step anyway -- which would be more verbose than simply checking for set membership. And then back to the string each time it's used. I'm not really seeing the benefit of an enum here :D
Please note, that the final beta candidate for mobile goes to build on Monday(1/27). Unless the upcoming patch is absolutely low risk and well tested, we should not be risking our final beta or RC build which goes to build right after beta..

I would rather prefer a backout option is that can be easily done to be in the same state as Fx26.
(In reply to bhavana bajaj [:bajaj] from comment #22)
> Please note, that the final beta candidate for mobile goes to build on
> Monday(1/27).

Thanks for the warning!


> Unless the upcoming patch is absolutely low risk and well
> tested, we should not be risking our final beta or RC build which goes to
> build right after beta..
> 
> I would rather prefer a backout option is that can be easily done to be in
> the same state as Fx26.

Just to reiterate: the patch in this bug is not a known fix for this bug; it was me noodling on areas of potential risk.

I'm very disinclined to blindly back out pruning: it might limit a small number of crashes, but would trade that for OOMs and disk space problems. (And I don't know how cleanly it would back out.)

Assuming that Michael's analysis of the problem is correct, it should be straightforward to limit pruning in a manner that we consider 100% safe and would restrict the situation we think is causing this exception, and have that landed and uplifted by Friday. I'd call that a "micro-backout".

Neither the safety improvements in this patch, nor the patch that I hope Michael can put together, are likely to have any risk at all.

Michael, thoughts?
Flags: needinfo?(michael.l.comella)
tracking-fennec: ? → 27+
(In reply to Richard Newman [:rnewman] from comment #23)
> I'm very disinclined to blindly back out pruning: it might limit a small
> number of crashes, but would trade that for OOMs and disk space problems.
> (And I don't know how cleanly it would back out.)

A full back out would probably be dirty but directly disabling pruning should be easy: we stop the PruneService from running by disabling the broadcasts that start it.

But I agree that it's better to leave it in. If my hypothesis was correct, these crashes happen only when the background Prune service prunes with Fennec running and all particular events of a certain type are too old or too numerous (i.e. presumably not often, though the probability increases with browser uptime). The browser should be fine on restart.

On the other hand, removing the pruning service could potential cause Fennec to take up too much disk space on devices, hindering general phone use and possibly causing an uninstall (or, at minimum, a "Clear data" and no one wants to do that).

> Assuming that Michael's analysis of the problem is correct, it should be
> straightforward to limit pruning in a manner that we consider 100% safe and
> would restrict the situation we think is causing this exception, and have
> that landed and uplifted by Friday. I'd call that a "micro-backout".

Some solutions are fairly complex while others are low-risk and reasonable. I see two approaches:

(tldr: I like #1-ii, it's by far the most reasonable - see the conclusion for reasons. The other solutions are poor, and thus unnecessary to read through, but I left them in so you can see what they look like, how complicated they are, and to save my time rewriting this. :)

--- Approach #1:  Stop the deletion of the measurements and fields values

Note for the following that I'm not sure how to detect if Fennec is in memory (i.e. running) or not but I assume it's possible.

i) Do not prune when Fennec is in memory. In the long run, this is not desirable because if Fennec always runs, this data will never be deleted. However, this may be used to prevent the crash, so it is a form of micro-backout (at the cost of potentially harming any users who have their browser in memory for a long period of time).

ii) Do not run cascading deletions, leaving orphaned measurements and fields. This may also fix the crash but adds DB maintenance costs for checking for and removing orphans (most reasonably when Fennec is not in memory or is idle). The DB maintenance can be done in a followup bug, making this fix fairly low risk (if a simple DB schema upgrade is considered low risk) and another possible micro-backout. The con is we may have orphaned measurements/fields lying around for long periods of time (e.g. if the browser is always in memory), however, these values should get orphaned infrequently so this is less of an issue than the potentially never deleted data of #1-i. 

--- Approach #2: Invalidate both the measurements and fields cache

I feel this is the more correct approach because it generally preserves the current DB performance, caching scheme, and pruning schedule. However, these solutions are fairly complicated: they require the background PruneService to access the running Fennec instance and access the cache, which breaks the current abstractions (and may have unknown reprocussions). Note that since these solutions are more complicated, I have not been able to think through the reprocussions as thoroughly as approach #1.

i) Ensure the measurements and fields are not invalidated before every event insertion. This can be done through an invalidation flag set by the PruneService, provided we are able to synchronize the prune and recordEvent events correctly. Additionally, the cache may have been designed with the assumption that it won't be invalidated after initialization, so there may be hard-to-spot side effects to invalidating and restoring the cache.

ii) Copy the field and measurements cache data into the database after pruning. We may delete some values while pruning, however, if we reinsert them back into the DB the way they were before, the cache is still valid. This would have to occur at the end of each pruning transaction (at most two per PruningService run). Note, however, that field uses the autoincremented ID field as an identifier so we'd be custom inserting these values (which *appears* to run correctly in local sqlite3 testing on desktop, but might have unknown side-effects). Also, I'm not sure if the measurement and fields caches can be invalidated while Fennec is running (it doesn't not seem possible), however, if they ever are, we're hosed - including in future revisions.

--- Conclusion
While #2 has the most correct approaches, I think #1-ii is a shipping approach: it takes less engineering effort, has the fewest changes to the existing codebase, and is only slightly less performant (in terms of disk usage and DB maintenance costs). Note that this is assuming we are able to determine when fennec is in memory (i.e. running). The added method, akin to pruneOrphans, is easily testable, fairly simple, and can be done in a followup bug.
Flags: needinfo?(michael.l.comella)
rnewman and I discussed this on irc and determined that my hypothesis in comment 11 that the fields were getting deleted from the DB is incorrect because `ON DELETE CASCADE` goes the opposite way that I thought it did. It is possible that the cached field ID is being changed elsewhere in the code (comment 12), but it's more likely that this is related to the PruneService (comment 16) given the first date of the crash's occurance.

We are continuing to investigate.
Given a straight backout was not really possible, looks like we a=have to eat this crasher before we build our RC candidate and final mobile beta, hoping to have a forward fix in the Fx28 cycle.
Attached file Patch
This should fix the crashes we've been having, though we never definitively found the cause of the problem.

We suspect it is because the prune service may remove various environments from the DB. These environment ids, however, are cached by the components that record events which means they may be using invalid IDs (particularly when an event recording occurs on the background thread and is blocked by the prune service's transaction). This patch helps to mitigate future issues by invalidating these caches and fixing this crash by no longer throwing when we hit a constraint exception.

Note that I still need to look into automated testing (it fails testMeasurementRecordingConstraintViolation in the integration test suite).

rnewman, is this what you were expecting?
Attachment #8368351 - Flags: feedback?(rnewman)
Attachment #8368351 - Flags: feedback?(rnewman) → feedback+
Comment on attachment 8368351 [details] [review]
Patch

Added parts 5 - 9 which fix bugs, tests, and adds a test.

I neglected to add deleteMeasurements cache clearance tests (there are no test methods for that method already existing and it does not seem to be used in practice [1]) as well as a test to PrunePolicyDatabaseStorage to ensure the current environment ID is reset. I did this because `getCurrentEnvironmentID` in MockPrunePolicyDatabaseStorage is overridden to return any int value [2], presumably because it was difficult to properly mock the method (alas, I cannot remember the true reason I did that). I figure making proper tests for this are not worth my time.

[1]: https://mxr.mozilla.org/mozilla-central/search?string=deletemeasurements&find=&findi=&filter=^[^\0]*%24&hitlimit=&tree=mozilla-central
[2]: https://mxr.mozilla.org/mozilla-central/source/mobile/android/tests/background/junit3/src/healthreport/prune/TestPrunePolicyDatabaseStorage.java?rev=469b3bd4a91d#31
Attachment #8368351 - Flags: feedback+ → review?(rnewman)
Assignee: rnewman → michael.l.comella
Comment on attachment 8368351 [details] [review]
Patch

Feel free to roll in my part1 if it's still relevant.
Attachment #8368351 - Flags: review?(rnewman) → review+
Comment on attachment 8362724 [details] [diff] [review]
Potential patch. v1

Review of attachment 8362724 [details] [diff] [review]:
-----------------------------------------------------------------

Moving to r+ since enum comment was resolved.

I'm going to roll this in with my patch.
Attachment #8362724 - Flags: feedback+ → review+
https://hg.mozilla.org/mozilla-central/rev/517ee3c1fba5
https://hg.mozilla.org/mozilla-central/rev/c23d29bb38fd
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 30
Letting it bake before flagging for uplift.
Flags: needinfo?(michael.l.comella)
Comment on attachment 8368351 [details] [review]
Patch

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 
  FHR pruning, most likely.

User impact if declined: 
  Some rate of crashes. Browser will crash on insert, probably until restarted.

Testing completed (on m-c, etc.): 
  On m-c for a few days. Tested locally. Integration tests.

Risk to taking this patch (and alternatives if risky): 
  Should be slim. I'm interested in getting this on Aurora to double our coverage before requesting approval for Beta.

String or IDL/UUID changes made by this patch:
  None.
Attachment #8368351 - Flags: approval-mozilla-aurora?
Attachment #8368351 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Letting it bake before uplifting to Beta.
Flags: needinfo?(michael.l.comella)
Michael - it's been a few days, can you nom this for Beta so we can get it into next week's builds and see the impact there with more users?
Comment on attachment 8368351 [details] [review]
Patch

See comment 35, with exception to:

Testing completed (on m-c, etc.): 
  Tested locally. Integration tests. m-c since 2/5. aurora since 2/10.
Attachment #8368351 - Flags: approval-mozilla-beta?
Flags: needinfo?(michael.l.comella)
Comment on attachment 8368351 [details] [review]
Patch

Excellent, thank you - let's see this in Monday's mobile beta and get some data over the next few weeks.
Attachment #8368351 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Keywords: verifyme
(In reply to Thomas Stache from comment #43)
> Just hit this twice in Nightly :
> https://crash-stats.mozilla.com/report/index/fd77e865-7696-4537-90c7-
> aa4032140715
> https://crash-stats.mozilla.com/report/index/0542af43-25fb-4189-80f3-
> e1bb02140715

Thanks, Thomas, but those are actually a totally different bug (which I'll file shortly):

android.database.sqlite.SQLiteConstraintException: foreign key constraint failed (code 19)
	at android.database.sqlite.SQLiteConnection.nativeExecuteForLastInsertedRowId(Native Method)
	at android.database.sqlite.SQLiteConnection.executeForLastInsertedRowId(SQLiteConnection.java:782)
	at android.database.sqlite.SQLiteSession.executeForLastInsertedRowId(SQLiteSession.java:788)
	at android.database.sqlite.SQLiteStatement.executeInsert(SQLiteStatement.java:86)
	at android.database.sqlite.SQLiteDatabase.insertWithOnConflict(SQLiteDatabase.java:1469)
	at android.database.sqlite.SQLiteDatabase.insertOrThrow(SQLiteDatabase.java:1365)
	at org.mozilla.gecko.db.BrowserProvider.insertBookmark(BrowserProvider.java:942)
	at org.mozilla.gecko.db.BrowserProvider.updateInTransaction(BrowserProvider.java:541)
	at org.mozilla.gecko.db.AbstractTransactionalProvider.update(AbstractTransactionalProvider.java:269)
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: