This bug was filed from the Socorro interface and is report bp-255563d3-079c-4d85-be06-3d8840170615.

Bucketing this in General for now, but it looks to be related to Sync: http://bit.ly/2seySkA. So far on 54 we have 55 crashes across 62 installations. This happens from 53 to 56, so adding those versions as affected. Who should I engage from the Sync team?
Seems like we have some threading issues. Some crashes happen while one thread is trying to 'advance' a session after a problem occurs (412, storage failure, ...), but the 'stages' map being read has already been reset to null, indicating that another thread has already aborted or completed a sync. Other (most) crashes happen after successfully finishing a stage and trying to advance further, also indicating that another thread aborted or completed a sync in the meantime.
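A minimal, hypothetical reduction of the race described above (class and field names are illustrative, not the actual Sync classes): one thread resets the shared 'stages' map on abort/completion while another thread is still advancing, so the reader hits an NPE.

```java
import java.util.HashMap;
import java.util.Map;

class SessionRaceSketch {
    // Shared, unguarded state, analogous to the session's stages map.
    private Map<String, Runnable> stages = new HashMap<>();

    void advance(String currentStage) {
        // NPE here if another thread called reset() in the meantime:
        // either 'stages' itself is null, or the stage entry is gone.
        stages.get(currentStage).run();
    }

    void reset() {
        // The abort/completion path clears the map.
        stages = null;
    }
}
```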
Essentially this bug means that sync stages may overlap. This seems to happen infrequently, likely during failures. One obvious place where this may happen is in PayloadDispatcher; note how it calls *failed methods on the store delegate directly from its current thread: https://dxr.mozilla.org/mozilla-central/source/mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/PayloadDispatcher.java?q=PayloadDispatcher.java&redirect_type=direct#138 This by itself should account for most of the crashes. However, PayloadDispatcher landed in 54, so there must be some other places where we got this wrong (since we see these crashes on 53+).
This doesn't seem high volume enough to warrant a point release but we might still take a patch for 55/56.
(In reply to :Grisha Kruglov from comment #2)
> One obvious place where this may happen is in PayloadDispatcher; note how it calls *failed methods on the store delegate directly from its current thread:

Actually, PayloadDispatcher already ensures that these methods run on the store worker queue - I missed my own explanation of this in the class comment. Looking elsewhere!
Will get to this soon.
A plausible explanation for what's happening here is that we're trying to finish a remote (or local?) session more than once. One of these attempts will succeed and the other will fail (the session will be in the DONE state instead of ACTIVE), at which point they'll race to advance the global session and eventually crash.
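The state check described above can be sketched as a tiny state machine; this is an illustrative reduction (not the real session classes), where a second finish() observes DONE instead of ACTIVE and is rejected rather than racing ahead:

```java
class RepoSessionSketch {
    enum State { ACTIVE, DONE }

    private State state = State.ACTIVE;

    // Returns true only for the first caller; the real code would raise
    // something like an invalid-transition error for the second one.
    synchronized boolean finish() {
        if (state != State.ACTIVE) {
            return false; // already finished; reject the duplicate attempt
        }
        state = State.DONE;
        return true;
    }
}
```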
There are 64 crashes in nightly 57.
Upon further reflection, I think the theory outlined in Comment 6 still applies, but with an added layer of a "larger problem": ServerLocalSynchronizerSession overrides SynchronizerSession's on*FlowCompleted methods, and determines whether a synchronization should be considered "failed" based on failure counts or recorded exceptions in RecordsChannel, as well as which flow is being processed. But when it determines that a synchronization should be considered failed (e.g. if we hit a 412 on upload), it forgets to finish the current session first! This means two things:
- we'll never hit the "incorrect state" condition postulated in Comment 6, so the crash logs don't line up with that theory
- but we'll also never actually shut down the storeWorkQueue, on which we might have pending scheduled "upload" tasks

storeWorkQueue is single-threaded. Pending tasks will wait their turn - that is, they will wait until we go through the regular motions of failing a synchronization of a stage, advancing GlobalSession, etc. Presumably they will then run, leading to strange results. The last stages of this are a bit murky for me, but it seems relatively clear that this isn't correct.

IIRC at some point uploader behaviour changed from "skip POSTing if we hard-failed" to "invoke onStoreFailed if we hard-failed". That change seems like a good contender for surfacing this incorrect behaviour.
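The storeWorkQueue behaviour above can be demonstrated with a plain single-threaded executor (a hypothetical stand-in, not the actual Sync queue): if the failure path never calls shutdownNow(), the pending "upload" tasks still drain and run later, whereas shutdownNow() drops them.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class StoreQueueSketch {
    // Queues 5 "upload" tasks, then simulates a hard failure (e.g. a 412).
    // Returns how many tasks actually ran to completion.
    static int pendingRunAfterFailure(boolean shutdownOnFailure) {
        ExecutorService storeWorkQueue = Executors.newSingleThreadExecutor();
        AtomicInteger ran = new AtomicInteger();
        for (int i = 0; i < 5; i++) {
            storeWorkQueue.execute(() -> {
                try { Thread.sleep(50); } catch (InterruptedException e) { return; }
                ran.incrementAndGet();
            });
        }
        if (shutdownOnFailure) {
            // What finishing/aborting the session should do: interrupt the
            // running task and discard everything still queued.
            storeWorkQueue.shutdownNow();
        } else {
            // What happens today: pending tasks still drain and run later.
            storeWorkQueue.shutdown();
        }
        try { storeWorkQueue.awaitTermination(5, TimeUnit.SECONDS); } catch (InterruptedException e) { }
        return ran.get();
    }
}
```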
Upon further reading of this code, I now realize that "finishing" a repository session implies "synchronization success", while "aborting" a session indicates "synchronization failure". We even have tests in place which make these assumptions. We abort in a variety of cases - one example is after encountering fetch problems in BatchingDownloader. However, BatchingUploader mistakenly does not abort a session when it encounters store problems. Aborting the remote session in case of upload failures will fit the current design better than trying to finish it.

A note on this design: I think it duplicates decision making across different layers. At the very edges - in the downloader and uploader - we need to know how to treat certain types of failure, and whether they should abort a session or not. We then also decide whether a "synchronization" should be treated as successful. This decision is made based on artifacts collected during the flow of records: fetch and store failures, and any exceptions encountered. So in theory we might get into a situation where the outer layer aborted a session while the inner layers still treat the flow as successful. This might happen in practice if we're not careful, but IIRC it currently doesn't. What does happen in practice is that the outer layer might behave incorrectly, but the problem is hard to spot and doesn't always manifest itself (... so, is it a problem? ;) )

ISTM that the minimally useful thing to do in the short term is to ensure that we:
- abort the session in case of uploader failures; this will shutdownNow our work queues
- given the parallelism of the current design, simply ensure that our onStoreFailed calls are idempotent

The currently attached patch sort of does this in an ad-hoc fashion, from the point of view of the caller. Perhaps that's enough of a fix for now.

In the medium term, we should aim to drastically simplify this whole system to be sequential, do away with the chains of delegates, etc.
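One way to make onStoreFailed idempotent, sketched with an AtomicBoolean guard (the wrapper class and field names here are hypothetical, not the actual delegate):

```java
import java.util.concurrent.atomic.AtomicBoolean;

class IdempotentStoreDelegate {
    private final AtomicBoolean storeFailed = new AtomicBoolean(false);

    // Visible for the sketch: counts how many times the failure path ran.
    int failureCount = 0;

    void onStoreFailed(Exception e) {
        // compareAndSet guarantees the failure path runs at most once,
        // even if several threads report failures concurrently.
        if (!storeFailed.compareAndSet(false, true)) {
            return; // already failed; ignore subsequent calls
        }
        handleFailure(e);
    }

    private void handleFailure(Exception e) {
        failureCount++; // real code would abort the session here
    }
}
```

The advantage over the caller-side guard is that every caller gets the idempotence for free, rather than each call site having to remember to check.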
Also, as a follow-up: we should probably remove the "recordUploadFailed" flag. The behaviour is nearly identical to just aborting a session - except that we're doing a ton of unnecessary work: processing all of the records following the one that "failed", firing off onRecordStoreFailed delegate calls, doing lots of logging, etc. The worst case here is pretty bad - for large history collections pending an upload, this might mean we'll sit there logging tens of thousands of errors for a few minutes (as I've just observed on my device, it took about 10 minutes to fail an upload of about 50k history records...).

We use this flag in two cases: if we hit a 412, and if we encounter a record in the "failed" list in the server response. If we hit a 412, we'll abort entirely following these patches, so it's fine to just fold the logic. In theory, we should never see anything in the "failed" list if the server accepted our payload. According to tcsc, desktop ignores whatever it sees in "failed", and doesn't abort or fail subsequent uploads. It does seem to me that aborting if we see anything in "failed" is probably the right thing to do for bookmarks - we've encountered an unexpected state; should we risk possibly corrupting the structure of bookmarks? Whereas for other record types it should be safe to proceed. And once again we see that this type of decision making is spread across different layers...
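The cost difference between the flag and aborting early can be sketched like this (illustrative method names, not the actual BatchingUploader code): the flag path still visits every remaining record and fires a per-record delegate call, which is where the tens of thousands of log lines come from; aborting stops at the first hard failure.

```java
import java.util.List;

class UploadLoopSketch {
    // Current behaviour: once recordUploadFailed is set, every remaining
    // record still triggers an onRecordStoreFailed-style delegate call.
    static int processWithFlag(List<String> records) {
        boolean recordUploadFailed = false;
        int delegateCalls = 0;
        for (String r : records) {
            if (recordUploadFailed || r.equals("bad")) {
                recordUploadFailed = true;
                delegateCalls++; // per-record failure call + logging
            }
        }
        return delegateCalls;
    }

    // Proposed behaviour: a single failure call, then abort the session.
    static int processWithAbort(List<String> records) {
        int delegateCalls = 0;
        for (String r : records) {
            if (r.equals("bad")) {
                delegateCalls++; // one onStoreFailed, then stop
                break;
            }
        }
        return delegateCalls;
    }
}
```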
Comment on attachment 8910554 [details] Bug 1373254 - Ensure onStoreFailed won't be called twice https://reviewboard.mozilla.org/r/182002/#review188538
Pushed by email@example.com: https://hg.mozilla.org/integration/autoland/rev/82a3a8ca311e Ensure onStoreFailed won't be called twice r=rnewman
Created attachment 8912357 [details] [diff] [review] npe-sync-g-sess-crash.patch [Feature/Bug causing the regression]: One of the descendants of Bug 1253111 [User impact if declined]: Possible crash while syncing if certain error conditions are hit. [Is this code covered by automated tests?]: Broad overall coverage via unit tests of Sync [Has the fix been verified in Nightly?]: It's on Nightly now; nightly crash volume is fairly low, so it will take some time to ensure we're not seeing any subsequent crashes. [Needs manual test from QE? If yes, steps to reproduce]: no [List of other uplifts needed for the feature/fix]: n/a [Is the change risky?]: no [Why is the change risky/not risky?]: A certain non-idempotent method is ensured not to be called multiple times. The patch itself is very simple, and the consequences shouldn't go beyond ensuring we don't crash. [String changes made/needed]: n/a
Comment on attachment 8912357 [details] [diff] [review] npe-sync-g-sess-crash.patch Fix a crash, taking it. Should be in 47b4, gtb tomorrow Thursday