Crash in java.lang.NullPointerException: Attempt to read from field ''java.lang.String android.accounts.Account.name'' on a null object reference at android.os.Parcel.readException(Parcel.java)

RESOLVED FIXED in Firefox 57

Status

()

defect
--
critical
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: Grisha, Assigned: Grisha)

Tracking

({crash})

unspecified
Firefox 57
Unspecified
Android
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox55 unaffected, firefox56 unaffected, firefox57 fixed, firefox58 fixed)

Details

(crash signature)

Attachments

(2 attachments)

This bug was filed from the Socorro interface and is 
report bp-e3bb7cfd-f51d-4089-9e09-970ed0170919.
=============================================================

Observed this in Sync Fest after changing primary email on an API21+ device.
java.lang.NullPointerException: Attempt to read from field 'java.lang.String android.accounts.Account.name' on a null object reference
	at android.os.Parcel.readException(Parcel.java:1699)
	at android.os.Parcel.readException(Parcel.java:1646)
	at android.accounts.IAccountManager$Stub$Proxy.renameAccount(IAccountManager.java:1535)
	at android.accounts.AccountManager$5.doWork(AccountManager.java:804)
	at android.accounts.AccountManager$BaseFutureTask.startTask(AccountManager.java:2073)
	at android.accounts.AccountManager$Future2Task.start(AccountManager.java:2132)
	at android.accounts.AccountManager.renameAccount(AccountManager.java:801)
	at org.mozilla.gecko.fxa.authenticator.AndroidFxAccount.doOptionalProfileRename21Plus(AndroidFxAccount.java:1112)
	at org.mozilla.gecko.fxa.authenticator.AndroidFxAccount.renameAccountIfNecessary(AndroidFxAccount.java:1100)
	at org.mozilla.gecko.fxa.authenticator.AndroidFxAccount.access$300(AndroidFxAccount.java:69)
	at org.mozilla.gecko.fxa.authenticator.AndroidFxAccount$ProfileResultReceiver.onReceiveResult(AndroidFxAccount.java:1023)
	at android.os.ResultReceiver$MyRunnable.run(ResultReceiver.java:50)
	at android.os.Handler.handleCallback(Handler.java:751)
	at android.os.Handler.dispatchMessage(Handler.java:95)
	at android.os.Looper.loop(Looper.java:154)
	at org.mozilla.gecko.util.GeckoBackgroundThread.run(GeckoBackgroundThread.java:43)
Somehow we're passing null account into a call to accountManager.renameAccount.
Actually, no. First thing that renameAccount does is checks if the account==null, and throws. And so, account becomes null sometime later.
Duplicate of this bug: 1401388
The patch above seems to do the trick - I can't reproduce the issue locally. But I don't really understand how Comment 3 can happen :(
Comment on attachment 8910517 [details]
Bug 1401318 - Fix some of the 'shared-state' access problems around Account

https://reviewboard.mozilla.org/r/181938/#review187290
Attachment #8910517 - Flags: review?(nalexander) → review+
Assignee: nobody → gkruglov
Status: NEW → ASSIGNED
Pushed by gkruglov@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/dd9b01ec0df2
Make sure account won't be null when it's processed by Android APIs r=nalexander
https://hg.mozilla.org/mozilla-central/rev/dd9b01ec0df2
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 57
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #9)
> https://hg.mozilla.org/mozilla-central/rev/dd9b01ec0df2

Sebastian, you've marked this as fixed in 57, but the push page says that the first build with this patch is 20170921220243, which it says is 58. Can you clarify which one it is? This needs to be in 57, so it'd be good to uplift this soon if we need to.

Same question for Bug 1400742.
Flags: needinfo?(aryx.bugmail)
It's in 57, see e.g. the milestone for https://hg.mozilla.org/mozilla-central/rev/dd9b01ec0df2 It was part of the last autoland merge before m-c stopped syncing to beta. The situation is similar for 1400742. First _Nightly_ builds with the pushes are 58.0a1 because they ran >8 hours later (with the version bump added).
Flags: needinfo?(aryx.bugmail)
Well, the patch that didn't quite make sense also didn't quite fixed this (surprise!). I'm still seeing these crashes on my device after changing the email on a different device (specifically, in the emulator).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Not yet sure if this is related, but I'm seeing now in the emulator that the accounts.db file is locked:

10-05 19:59:30.378  1548  2172 W SQLiteConnectionPool: The connection pool for database '/data/system_de/0/accounts_de.db' has been unable to grant a connection to thread 180 (Binder:1548_B) with flags 0x2 for 5550.1 seconds.

generic_x86:/data/system_de/0 # sqlite3 accounts_de.db                                                                                                                                                                                                                                   
sqlite> .tables
Error: database is locked

The strange thing is that I never did manage to crash the emulator, so not sure how it got to this state yet - but it did seem to happen after changing an email on the physical device and triggering a manual sync on the emulator after not observing an email change. Restarting the emulator seems to have reset things back to normal - and the email change was picked up.
As a back-up plan if we can't figure this out in time, I've confirmed that the manual pre-21 API flow appears to works well on the new emulator. Next up, try it on a device.
Switching to 'doOptionalProfileRenamePre21' as the primary code path appears to work well on-device (running API26). I can't reproduce the crashes anymore. Given the tight 57 uplift deadline, this might be our best bet - same code path for all API levels.
Let's read the source code of the crashing API to understand this better.

(This seems like a reasonable guess at a version my device is running)

We might be crashing in one of three places:
1) https://android.googlesource.com/platform/frameworks/base/+/android-7.0.0_r34/services/core/java/com/android/server/accounts/AccountManagerService.java?autodive=0%2F%2F#1376
2) https://android.googlesource.com/platform/frameworks/base/+/android-7.0.0_r34/services/core/java/com/android/server/accounts/AccountManagerService.java?autodive=0%2F%2F#1425
3) https://android.googlesource.com/platform/frameworks/base/+/android-7.0.0_r34/services/core/java/com/android/server/accounts/AccountManagerService.java?autodive=0%2F%2F#1456

There's a check for 'accountToRename' not being null, and so it's likely that we're crashing in (1).

Notice how a call to renameAccountInternal might actually return 'null' if the isSuccessful flag isn't set. Given our somewhat ambiguous stack trace, my guess is that this happens because 'getAccountIdLocked' call returns <0 (see https://android.googlesource.com/platform/frameworks/base/+/android-7.0.0_r34/services/core/java/com/android/server/accounts/AccountManagerService.java?autodive=0%2F%2F#1418).

There is no error checking or sanity checks in place if that happens, and it ends up crashing via an NPE in the caller.

Here's the getAccountIdLocked method impl: https://android.googlesource.com/platform/frameworks/base/+/android-7.0.0_r34/services/core/java/com/android/server/accounts/AccountManagerService.java?autodive=0%2F%2F#3741

So if the above reading is correct, it's failing to actually find the "account to rename" by name & type in an internal accounts db. Not sure why yet!
Reading android services source code, I can't quite see where this might be going wrong. However, one theory that seems plausible: we might be performing renameAccount call twice for a given 'account' object. The second call will execute normally until it tries to lookup the account by name&type tuple, and fails (since it's already been renamed).
After some trial and error with debug logging in place, replicated the crash in a way that now makes sense:

10-05 16:49:32.993 30398 30430 I GeckoLogger: fennec_grisha :: GRIGRI :: RENAMING ACCOUNT! CURRENT NAME: grigory.kruglov@gmail.com NEW NAME: gkruglov@mozilla.com
10-05 16:49:33.319 30398 30430 I GeckoLogger: fennec_grisha :: GRIGRI :: RENAMING ACCOUNT! CURRENT NAME: grigory.kruglov@gmail.com NEW NAME: gkruglov@mozilla.com

... and it blows up.

Oddly enough, I couldn't replicate the usual "change email on device A, it crashes o device B" scenario. The above is simply "change email on device A, it crashes on device A".

Most likely we're racing to change the email twice via two channels: via a push message and via a webchannel. Timestamp difference above is roughly how long it'll take for the device to get a push message.
Here's what a bad race looks like:

10-05 17:20:09.069  7625  7680 I GRIGRI  : WEBCHANNEL: Accounts:ProfileUpdated
10-05 17:20:09.093  7625 12860 I GRIGRI  : Fetching https://profile.accounts.firefox.com/v1/profile
10-05 17:20:09.289  7625 12862 D GeckoPushGCM: Message received.  Processing on background thread.
10-05 17:20:09.296  7625  7799 I GeckoPushService: Google Play Services GCM message received; delivering.
10-05 17:20:09.296  7625  7799 I GRIGRI  : GOT PUSH!
10-05 17:20:09.488  7625  7799 I GRIGRI  : HANDLING PROFILE UPDATE PUSH
10-05 17:20:09.584  7625 12860 I GRIGRI  : Fetching https://profile.accounts.firefox.com/v1/profile
10-05 17:20:09.584  7625  7865 I GRIGRI  : 200 response! JSON: {"email":"grigory.kruglov@gmail.com","uid":"****","displayName":"grisha","avatar":"https:\/\/firefoxusercontent.com\/***"}
10-05 17:20:09.590  7625  7799 I FxAccounts: fennec_grisha :: GRIGRI :: RENAMING ACCOUNT! CURRENT NAME: gkruglov@mozilla.com NEW NAME: grigory.kruglov@gmail.com
10-05 17:20:09.692  7625  7865 I GRIGRI  : 200 response! JSON: {"email":"grigory.kruglov@gmail.com","uid":"****","displayName":"grisha","avatar":"https:\/\/firefoxusercontent.com\/***"}
10-05 17:20:09.984  7625  7799 I FxAccounts: fennec_grisha :: GRIGRI :: RENAMING ACCOUNT! CURRENT NAME: gkruglov@mozilla.com NEW NAME: grigory.kruglov@gmail.com

The simplest thing to do is probably to synchronize profile renaming.
The higher order problem here is that we have many different representations of the shared state, and they're not kept in-sync; i.e. it's possible to have many instances of AndroidFxAccount present with their internal 'account' object out-of-sync with reality. The problem is pretty narrow - we tend to create these object right before using them - but it's clear in this bug that in presence of variable-time IO and exacerbating circumstances (two things happening at once) it very much is a problem. I vaguely suspect we've seen this in other "strange" accounts crashes as well. I've started addressing this a while back in some exploratory patches by moving towards a shared singleton holder for the accounts data out of perf considerations, but now it's clear that there are correctness benefits as well.

Smallest possible patch upcoming...
See Also: → 1407316
Comment on attachment 8910517 [details]
Bug 1401318 - Fix some of the 'shared-state' access problems around Account

https://reviewboard.mozilla.org/r/181938/#review193242

I have seen strange shared account locking log messages that I have never investigated, but hopefully we don't see deadlocks in the wild.
Pushed by gkruglov@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e464d1cc05a6
Fix some of the 'shared-state' access problems around Account r=nalexander
Approval Request Comment
[Feature/Bug causing the regression]: Bug 1368147

[User impact if declined]: A crash might occur after user changes primary email associated with the account.

[Is this code covered by automated tests?]: No.

[Has the fix been verified in Nightly?]: Not yet, just landed now. Bug is reproducible locally, verified that crashes do not occur.

[Needs manual test from QE? If yes, steps to reproduce]: Wouldn't hurt. Crash won't happen every time. On a logged-in device, open FxA management page, append &canChangeEmail=true to the URI, and under Secondary Email change the primary email a bunch of times. Similarly, change the primary email on a different device a bunch of times. And finally, change primary email a bunch of times on a different device, and wait until device-under-test misses a push notification and doesn't update the primary email. Then, in Sync Prefs, trigger a sync. None of these cases should crash, and a sensible thing should always happen (primary email is changed).

[List of other uplifts needed for the feature/fix]: Already uplited!

[Is the change risky?]: Not very.

[Why is the change risky/not risky?]: The primary fix is simple - refreshing internal state. The necessary locking is also simple, but it's possible it'll cause unforeseen issues. In my tests, it does not, and other users of the lock behave correctly when interrupted.

[String changes made/needed]: N/A.
Attachment #8917106 - Flags: approval-mozilla-beta?
https://hg.mozilla.org/mozilla-central/rev/e464d1cc05a6
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Comment on attachment 8917106 [details] [diff] [review]
account-shared-state.patch

Fixes a new crash issue, Beta57+
Attachment #8917106 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment on attachment 8917106 [details] [diff] [review]
account-shared-state.patch

Actually on second thoughts if the crash volume is not large, is this a must fix for 57? It doesn't look like a simple not-null check to me, the size of the patch is usually directly proportional to the risk to quality.

Hi Snorp, could I get a please second opinion from you on this one?
Flags: needinfo?(snorp)
Attachment #8917106 - Flags: approval-mozilla-beta+ → approval-mozilla-beta?
The crash volume is low for this particular signature because the feature is currently behind the flag on FxA side and must be explicitly enabled (via 'canChangeEmail' GET param). The crashes you currently see are all from devs, testers or early adopters. Once the feature is enabled, crash volume will rise proportionally to its usage.

The aim is to ship this feature with 57, so the uplift is unfortunately necessary if we are to follow through with that plan.

I also expect that this patch will fix Bug 1349147, which is unrelated to the "email change" feature, and has a considerable volume of crashes already.
Comment on attachment 8917106 [details] [diff] [review]
account-shared-state.patch

Thanks Grisha for the additional details. That makes sense and justifies the uplift. beta57+
Flags: needinfo?(snorp)
Attachment #8917106 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
See Also: → 1349147
You need to log in before you can comment on or make changes to this bug.