Closed Bug 615284 Opened 9 years ago Closed 9 years ago

download chunking needs to be more resilient against app shutdowns

Categories

(Firefox :: Sync, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- betaN+
fennec 2.0+ ---

People

(Reporter: mconnor, Assigned: philikon)

References

Details

(Whiteboard: [softblocker][has patch][has review])

Attachments

(1 file, 1 obsolete file)

I've had an Android device running Fennec for four days, and haven't finished a sync yet.

Android kills background processes, and we may or may not be hitting OOM due to lack of GC inside of the sync call (some conflicting reports).

We need to:

* sync more records per GET (I have 17k history records, so it's 340 gets to sync everything, as it stands).
* periodically add remaining IDs to the tracker so we don't start from scratch on interrupt.
* Consider only fetching some cap of records on mobile (i.e. limit the query to 5k records, same as the initial 5k upload cap?)

These two things should help us catch up much quicker, without really hurting our disk I/O story.
We need this ASAP.
blocking2.0: --- → beta8+
tracking-fennec: --- → 2.0b3+
same thing has happened on one of my test accounts on the nexus one.  it syncs
fine on a desktop build, but once i switch to mobile it remains "in progress"
indefinitely.
(In reply to comment #0)
> * sync more records per GET (I have 17k history records, so it's 340 gets to
> sync everything, as it stands).

Yes, that's painful. Need to watch memory here, though (see bug 615167)

> * periodically add remaining IDs to the tracker so we don't start from scratch
> on interrupt.

I think we should probably do it the other way around. We should get the changedIDs from the engine (which probably uses the tracker, but it might not in the future) and add them to the tracker at the beginning of the Sync. Then, as we fetch items in batches, we can remove successfully fetched items from the tracker (in batches, too)
Assignee: philipp → rnewman
blocking2.0: beta8+ → beta9+
tracking-fennec: 2.0b3+ → 2.0b4+
Dropping this from me, as I'm not actively working on the contents of this bug.
Assignee: rnewman → nobody
(In reply to comment #4)
> Dropping this from me, as I'm not actively working on the contents of this bug.

No probs, I already started messing around with this anyway (see comment 3).
Assignee: nobody → philipp
As per today's meeting, beta 9 will be a time-based release. Marking these all betaN+. Please move it back to beta9+ if you believe it MUST be in the next beta (ie: trunk is in an unshippable state without this)
blocking2.0: beta9+ → betaN+
Fixing fields my automated script accidentally blanked. Apologies for the bugspam
Whiteboard: [softblocker]
Summary: sync chunking needs to be more resilient against app shutdowns → Upload chunking needs to be more resilient against app shutdowns
tracking-fennec: 2.0b4+ → 2.0+
mconnor, can you comment what the plan for this bug is?  from what i read on irc, this wasn't needing beta testers, and there are other bugs tracking perf fixes that are more important for testing.

also, tying bug dependencies to those bugs would be helpful.  Thanks.
Effectively, this is about backing off on some of the changes to no longer write IDs we're fetching to disk (bug 609398).  However, instead of wholesale reverting that, we'll just periodically flush things to the toFetch file, so that we don't have to redownload and dupe-check everything from scratch.

This is 100% unrelated to other perf work.
Summary: Upload chunking needs to be more resilient against app shutdowns → download chunking needs to be more resilient against app shutdowns
Oh, this is about *down*load chunking? Your mentioning the tracker in comment 0 threw me off and I thought this was about upload (separate issue, but my comment 3 about upload still holds for that case.)

(In reply to comment #9)
> Effectively, this is about backing off on some of the changes to no longer
> write IDs we're fetching to disk (bug 609398).  However, instead of wholesale
> reverting that, we'll just periodically flush things to the toFetch file, so
> that we don't have to redownload and dupe-check everything from scratch.

Bringing back toFetch would also help us with bug 622762.

Then again, we might not need to bring back toFetch at all. We could sync items down in the order of their timestamp and update our lastSync timestamp as we process individual WBOs. That way we could restart exactly where we left off the last time. This might put a bit more stress on the server, though mostly the records should already be in input order so the sort operation shouldn't be too expensive.

Thoughts?
Not using sort order means we lose the advantage of syncing folders first, so we don't need to reparent on a large sync.
(In reply to comment #11)
> Not using sort order means we lose the advantage of syncing folders first, so
> we don't need to reparent on a large sync.

Good point! (Not that we do this right now on the desktop... bug 625295)
Blocks: 622762
Attached patch wip v1 (obsolete) — Splinter Review
Bring back the toFetch file that was removed in bug 609398. We still sync all batches in one sync on mobile, so bug 609398 wasn't backed out entirely.

Unlike before, toFetch is now saved on the next event loop tick so that we don't do synchronous I/O while syncing stuff down. We still have bug 608757 for making Utils.json{Load|Save} async. I have a WIP for that so if we decide the synchronous I/O (now at least not interfering with places I/O and the like) is bad on mobile, I can always resurrect that patch.
Attachment #506862 - Flags: review?(mconnor)
Whiteboard: [softblocker] → [softblocker][has patch][needs review mconnor]
Comment on attachment 506862 [details] [diff] [review]
wip v1

Hmm, we don't update lastSync until the end of processIncoming when all records have been downloaded. So if the app gets shutdown in the middle processIncoming, we still refetch all records.

So once we're in the batching mode and save to toFetch, we should update lastSync and not lastModified.
Attachment #506862 - Attachment description: v1 → wip v1
Attachment #506862 - Flags: review?(mconnor)
Attached patch v2Splinter Review
This patch now also ensures that lastSync is fast-forwarded when in batch mode. That way on the next sync we resume where we left off rather than refetching everything.
Attachment #506862 - Attachment is obsolete: true
Attachment #506915 - Flags: review?(mconnor)
Sorry to rain on this parade, but loadToFetch calls Utils.jsonLoad, which does sync disk I/O.  If we are doing all this work to avoid synchronous disk I/O in the first place, it seems rather silly to end up adding synchronous disk I/O.
Although it shouldn't be hard to make Utils.jsonLoad async.  It would probably be a lot like bug 627772.
(In reply to comment #16)
> Sorry to rain on this parade, but loadToFetch calls Utils.jsonLoad, which does
> sync disk I/O.  If we are doing all this work to avoid synchronous disk I/O in
> the first place, it seems rather silly to end up adding synchronous disk I/O.

We're talking about much different scales here, though. Syncing down 1000 history items in batches of 50 will have us do a tiny bit of synchronous I/O 20 times, rather than a N*1000 times (for some number of N) like now.

(In reply to comment #17)
> Although it shouldn't be hard to make Utils.jsonLoad async.

It's not, it just hasn't been a priority up until now. As I said, we could decide to nominate bug bug 608757 if it turns out to be a problem. It's a small fix that can land between beta and RC.
(In reply to comment #18)
> We're talking about much different scales here, though. Syncing down 1000
> history items in batches of 50 will have us do a tiny bit of synchronous I/O 20
> times, rather than a N*1000 times (for some number of N) like now.
Unfortunately, it looks like any synchronous I/O on mobile ends up having fairly bad pauses (for example, bug 627635).  This is probably partly because on android prior to gingerbread (which uses ext4), the file system serializes all access to itself.  If we are already doing I/O on a background thread, the main thread cannot do anything because it's blocked by the background thread.
(In reply to comment #19)
> fairly bad pauses (for example, bug 627635).  This is probably partly because
> on android prior to gingerbread (which uses ext4), the file system serializes
> all access to itself.  If we are already doing I/O on a background thread, the
> main thread cannot do anything because it's blocked by the background thread.

F-I-N-E. Have it your way! :p

I uploaded patches for bug 608757 just now. Pls to be stopping with the rain on my parade nao. kthxbai.
Comment on attachment 506915 [details] [diff] [review]
v2

Looks good here, woo.
Attachment #506915 - Flags: review?(mconnor) → review+
Whiteboard: [softblocker][has patch][needs review mconnor] → [softblocker][has patch][has review]
https://hg.mozilla.org/services/fx-sync/rev/d020b67d5dcf
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 627511
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox
You need to log in before you can comment on or make changes to this bug.