Crash in java.lang.NullPointerException: Attempt to invoke interface method ''java.lang.Object java.util.Map.get(java.lang.Object)'' on a null object reference at org.mozilla.gecko.sync.GlobalSession.getSyncStageByName(GlobalSession.java)

RESOLVED FIXED in Firefox 57

Status

()

Firefox for Android
Android Sync
P1
critical
RESOLVED FIXED
4 months ago
9 days ago

People

(Reporter: marcia, Assigned: Grisha)

Tracking

({crash, regression})

54 Branch
Firefox 58
Unspecified
Android
crash, regression
Points:
---

Firefox Tracking Flags

(firefox54 wontfix, firefox55 wontfix, firefox56 wontfix, firefox57 fixed, firefox58 fixed)

Details

(crash signature)

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(2 attachments, 1 obsolete attachment)

This bug was filed from the Socorro interface and is 
report bp-255563d3-079c-4d85-be06-3d8840170615.
=============================================================

Bucketing this in general for now, but looks to be related to Sync: http://bit.ly/2seySkA. So far on 54 we have 55 crashes/62 installations.

Happens from 53 to 56, so adding those versions as affected.

Who should I engage from the sync team?
Flags: needinfo?(gkruglov)
(Assignee)

Updated

4 months ago
Assignee: nobody → gkruglov
Status: NEW → ASSIGNED
Component: General → Android Sync
Flags: needinfo?(gkruglov)
Priority: -- → P1
Product: Firefox for Android → Android Background Services
(Assignee)

Comment 1

4 months ago
Seems like we have some threading issues.

Some crashes happen while one thread is trying to 'advance' a session after a problem occurs (412, storage failure...) , but 'stages' map which is being read has already been reset to null, indicating that another thread either aborted or completed a syn.

Other (most) crashes happen after successfully finishing a stage, and trying to advance further - also indicating that another thread already either aborted or completed a sync in the meantime.
(Assignee)

Comment 2

4 months ago
Essentially this bug means that sync stages may overlap. This seems to happen infrequently, likely during failures. One obvious place where this may happen is in PayloadDispatcher, note how it calls *failed methods on the store delegate directly from its current thread:

https://dxr.mozilla.org/mozilla-central/source/mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/PayloadDispatcher.java?q=PayloadDispatcher.java&redirect_type=direct#138

This by itself should account for most of the crashes. However, PayloadDispatcher landed in 54, and so there got to be some other places where we got this wrong (since we seem to have these crashes for 53+).
This doesn't seem high volume enough to warrant a point release but we might still take a patch for 55/56.
status-firefox54: affected → wontfix
status-firefox55: affected → fix-optional
(Assignee)

Comment 4

4 months ago
(In reply to :Grisha Kruglov from comment #2)
> One obvious place where this
> may happen is in PayloadDispatcher, note how it calls *failed methods on the
> store delegate directly from its current thread:

Actually, PayloadDispatcher already ensures that methods run on the store worker queue - I've missed my own explanation of this in the class comment. Looking elsewhere!
(Assignee)

Comment 5

2 months ago
Will get to this soon.
Assignee: gkruglov → nobody
Status: ASSIGNED → NEW
Assignee: nobody → gkruglov
(Assignee)

Comment 6

a month ago
A plausible explanation for what's happening here is that we're trying to finish a remote (or local?) session more than once. One of these attempts will succeed, the other will fail (session will be in DONE state instead of ACTIVE), at which point they'll race to advance the global session, and eventually crash.
There are 64 crashes in nightly 57.
status-firefox57: --- → affected
status-firefox55: fix-optional → wontfix
status-firefox56: affected → fix-optional
Comment hidden (mozreview-request)
(Assignee)

Comment 9

a month ago
Upon further reflection, I think theory outlined in Comment 6 still applies, but with an added layer of a "larger problem":

ServerLocalSynchronizerSession overrides SynchronizerSessions's on*FlowCompleted methods, and determines if a synchronization should be considered as "failed" depending on failed counts or recorded exceptions in RecordsChannel, as well as which flow is being processed. But, when it determines that a synchronization should be considered as failed (e.g. if we hit a 412 on upload), it forgets to finish the current session first!

This means two things:
- we'll never hit the "incorrect state" condition postulated in Comment 6, and so the crash logs don't line up with that theory
- but we'll also never actually shut down the storeWorkQueue on which we might have pending scheduled "upload" tasks

storeWorkQueue is single-threaded. Pending tasks will wait for their turn - that is, they will wait until we go through the regular motions of failing a synchronization of a stage, advancing GlobalSession, etc. Supposedly afterwards, they will run, leading to strange results.

The last stages of this are a bit murky for me, but it seems relatively clear that this isn't correct.

IIRC at some point uploader behavior changed from "skip POSTing if we hard-failed" to "invoke onStoreFailed if we hard-failed". That change seems like a good contender for surfacing this incorrect behaviour.
(Assignee)

Comment 10

29 days ago
Upon further reading of this code, I now realize that "finishing" a repository session implies "synchronization success", while "aborting" a session indicates "synchronization failure". We even have tests in place which make these assumptions.

We abort in a variety of cases. One example is after encountering fetch problems in BatchingDownloader. However, mistakenly BatchingUploader does not abort a session when it encounters store problems.

Aborting remote session in case of upload failures will fit better into the current design than trying to finish it.

A note on this design: I think it duplicates decision making across different layers. On the very edges - in the downloader and uploader - we need to know how to treat certain types of failure, and if they should abort a session or not. We then also decide if a "synchronization" should be treated as successful or not. This decision is made based on artifacts collected during the flow of records - fetch and store failures, any exceptions encountered. So, in theory we might get into a situation where the outer layer aborted a session, while the inner layers might treat still treat this flow as successful. This might happen in practice if we're not careful, but IIRC doesn't currently. What does happen in practice is that the outer layer might behave incorrectly, but the problem is hard to spot and doesn't always manifest itself (... so, is it a problem? ;) )

ISTM that the minimally useful thing to do in the short-term is to ensure that we:
- abort the session in case of uploader failures; this will shutdownNow our work queues 
- given parallelism of the current design, simply ensure that our onStoreFailed calls are idempotent. The currently attached patch sort of does this in an ad-hoc fashion and from the point of view of the caller. Perhaps that's enough of a fix for now

In the medium-term, we should aim to drastically simplify this whole system to be sequential, do away with the chains of delegates, etc.
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
(Assignee)

Comment 13

26 days ago
Also, as a follow-up: we should probably remove "recordUploadFailed" flag. The behaviour is nearly identical to just aborting a session - except for the fact that we're doing a ton of unnecessary work. Processing all of the records following the one that "failed", firing off onRecordStoreFailed delegate calls, doing lots of logging, etc. The worst case here is pretty bad - for large history collections pending an upload, this might mean we'll sit there logging tenths of thousands of errors for a few minutes (as I've just observed on my device, it took about 10 minutes to fail upload of about 50k history records...).

We use this flag in two cases: if we hit a 412, and if we encounter a record in the "failed" list in the server response. 

Hitting a 412, we'll abort entirely following these patches, so it's fine to just fold the logic.

In theory, we should never see anything in the "failed" list if the server accepted our payload. According to tcsc, desktop ignores whatever it sees in "failed", and doesn't abort or fail consequent uploads.

It does seem to me that aborting if we see anything in "failed" is probably the right thing to do for bookmarks - we've encountered an unexpected state, should we risk possibly corruption the structure of bookmarks? Whereas for other record types it should be safe to proceed.

And once again we see that this type of decision making is spread across different layers...

Comment 14

26 days ago
mozreview-review
Comment on attachment 8910554 [details]
Bug 1373254 - Ensure onStoreFailed won't be called twice

https://reviewboard.mozilla.org/r/182002/#review188538
Attachment #8910554 - Flags: review?(rnewman) → review+
(Assignee)

Updated

26 days ago
Blocks: 1403022
Comment hidden (mozreview-request)
(Assignee)

Updated

26 days ago
Attachment #8911429 - Attachment is obsolete: true
Attachment #8911429 - Flags: review?(rnewman)

Comment 16

26 days ago
Pushed by gkruglov@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/82a3a8ca311e
Ensure onStoreFailed won't be called twice r=rnewman
https://hg.mozilla.org/mozilla-central/rev/82a3a8ca311e
Status: NEW → RESOLVED
Last Resolved: 26 days ago
status-firefox58: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → Firefox 58
(Assignee)

Comment 18

25 days ago
Created attachment 8912357 [details] [diff] [review]
npe-sync-g-sess-crash.patch

[Feature/Bug causing the regression]: One of descendants of Bug 1253111

[User impact if declined]: Possible crash while syncing if a certain error conditions are hit.

[Is this code covered by automated tests?]: Broad overall coverage via unit tests of Sync

[Has the fix been verified in Nightly?]: It's on Nightly now; nightly crash volume is fairly low, so it will take some time to ensure we're not seeing any consequent crashes.

[Needs manual test from QE? If yes, steps to reproduce]: no

[List of other uplifts needed for the feature/fix]: n/a

[Is the change risky?]: no

[Why is the change risky/not risky?]: A certain non-idempotent method is ensured not to be called multiple times. Patch itself is very simple, and the consequences shouldn't go beyond ensuring we don't crash.

[String changes made/needed]: n/a
Attachment #8912357 - Flags: approval-mozilla-beta?
Comment on attachment 8912357 [details] [diff] [review]
npe-sync-g-sess-crash.patch

Fix a crash, taking it.
Should be in 47b4, gtb tomorrow Thursday
Attachment #8912357 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
status-firefox56: fix-optional → wontfix

Comment 20

24 days ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/e84c811afec6
status-firefox57: affected → fixed

Updated

9 days ago
Product: Android Background Services → Firefox for Android
You need to log in before you can comment on or make changes to this bug.