download chunking needs to be more resilient against app shutdowns

RESOLVED FIXED

Status

Product: Cloud Services
Component: Firefox Sync: Backend
Reported: 7 years ago
Modified: 7 years ago

People

(Reporter: mconnor, Assigned: philikon)

Tracking

Version: unspecified
Points: ---

Firefox Tracking Flags

(blocking2.0 betaN+, fennec2.0+)

Details

(Whiteboard: [softblocker][has patch][has review])

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

7 years ago
I've had an Android device running Fennec for four days, and haven't finished a sync yet.

Android kills background processes, and we may or may not be hitting OOM due to lack of GC inside of the sync call (some conflicting reports).

We need to:

* Sync more records per GET (I have 17k history records, so it's 340 GETs to sync everything as it stands).
* Periodically add remaining IDs to the tracker so we don't start from scratch on interrupt.
* Consider only fetching some cap of records on mobile (i.e. limit the query to 5k records, same as the initial 5k upload cap?).

These changes should help us catch up much more quickly, without really hurting our disk I/O story.
(Reporter)

Comment 1

7 years ago
We need this ASAP.
blocking2.0: --- → beta8+
tracking-fennec: --- → 2.0b3+

Comment 2

7 years ago
The same thing has happened on one of my test accounts on the Nexus One. It syncs fine on a desktop build, but once I switch to mobile it remains "in progress" indefinitely.
Comment 3

7 years ago
(In reply to comment #0)
> * sync more records per GET (I have 17k history records, so it's 340 gets to
> sync everything, as it stands).

Yes, that's painful. Need to watch memory here, though (see bug 615167)

> * periodically add remaining IDs to the tracker so we don't start from scratch
> on interrupt.

I think we should probably do it the other way around: get the changedIDs from the engine (which probably uses the tracker, but might not in the future) and add them to the tracker at the beginning of the sync. Then, as we fetch items in batches, we can remove successfully fetched items from the tracker (in batches, too).
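The flow proposed above can be sketched roughly as follows. All names here (`Tracker`, `getChangedIDs`, `BATCH_SIZE`, `syncDown`) are illustrative stand-ins, not the actual Sync module API:

```javascript
// Hypothetical sketch: seed the tracker with the engine's changed IDs when
// the sync starts, then clear IDs from the tracker only after their batch
// has actually been fetched.
const BATCH_SIZE = 50;

class Tracker {
  constructor() { this.changedIDs = new Set(); }
  addChangedIDs(ids) { for (const id of ids) this.changedIDs.add(id); }
  removeChangedIDs(ids) { for (const id of ids) this.changedIDs.delete(id); }
}

function syncDown(engine, tracker, fetchBatch) {
  // Seed the tracker at the beginning of the sync.
  tracker.addChangedIDs(engine.getChangedIDs());

  let pending = [...tracker.changedIDs];
  while (pending.length) {
    const batch = pending.slice(0, BATCH_SIZE);
    fetchBatch(batch);                // download these records
    tracker.removeChangedIDs(batch);  // drop them only once fetched
    pending = pending.slice(BATCH_SIZE);
  }
  // If the app is killed mid-sync, the tracker still holds every ID that
  // was not yet fetched, so the next sync resumes instead of starting over.
}
```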
(Reporter)

Updated

7 years ago
Assignee: philipp → rnewman
(Reporter)

Updated

7 years ago
blocking2.0: beta8+ → beta9+
tracking-fennec: 2.0b3+ → 2.0b4+
Comment 4

7 years ago
Dropping this from me, as I'm not actively working on the contents of this bug.
Assignee: rnewman → nobody
Comment 5

7 years ago
(In reply to comment #4)
> Dropping this from me, as I'm not actively working on the contents of this bug.

No probs, I already started messing around with this anyway (see comment 3).
Assignee: nobody → philipp

Comment 6

7 years ago
As per today's meeting, beta 9 will be a time-based release. Marking these all betaN+. Please move it back to beta9+ if you believe it MUST be in the next beta (i.e., trunk is in an unshippable state without this).
blocking2.0: beta9+ → betaN+

Comment 7

7 years ago
Fixing fields my automated script accidentally blanked. Apologies for the bugspam.
Whiteboard: [softblocker]
(Assignee)

Updated

7 years ago
Summary: sync chunking needs to be more resilient against app shutdowns → Upload chunking needs to be more resilient against app shutdowns

Updated

7 years ago
tracking-fennec: 2.0b4+ → 2.0+

Comment 8

7 years ago
mconnor, can you comment on what the plan for this bug is? From what I read on IRC, this isn't in need of beta testers, and there are other bugs tracking perf fixes that are more important for testing.

Also, tying bug dependencies to those bugs would be helpful. Thanks.
(Reporter)

Comment 9

7 years ago
Effectively, this is about backing off some of the changes that made us no longer write the IDs we're fetching to disk (bug 609398). However, instead of wholesale reverting that, we'll just periodically flush things to the toFetch file so that we don't have to redownload and dupe-check everything from scratch.

This is 100% unrelated to other perf work.
Summary: Upload chunking needs to be more resilient against app shutdowns → download chunking needs to be more resilient against app shutdowns
Comment 10

7 years ago
Oh, this is about *down*load chunking? Your mentioning the tracker in comment 0 threw me off and I thought this was about upload (separate issue, but my comment 3 about upload still holds for that case).

(In reply to comment #9)
> Effectively, this is about backing off on some of the changes to no longer
> write IDs we're fetching to disk (bug 609398).  However, instead of wholesale
> reverting that, we'll just periodically flush things to the toFetch file, so
> that we don't have to redownload and dupe-check everything from scratch.

Bringing back toFetch would also help us with bug 622762.

Then again, we might not need to bring back toFetch at all. We could sync items down in the order of their timestamp and update our lastSync timestamp as we process individual WBOs. That way we could restart exactly where we left off the last time. This might put a bit more stress on the server, though mostly the records should already be in input order so the sort operation shouldn't be too expensive.

Thoughts?
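The timestamp-resume alternative floated above can be sketched like this. The names (`processIncoming`, `state.lastSync`, `apply`) are hypothetical stand-ins used only for illustration:

```javascript
// Illustrative sketch: process records oldest-first and fast-forward
// lastSync after each WBO, so an interrupted sync restarts at the last
// record that was actually applied rather than refetching everything.
function processIncoming(records, state, apply) {
  // The server would return records sorted by modified time; sort locally
  // here to keep the sketch self-contained.
  records.sort((a, b) => a.modified - b.modified);
  for (const wbo of records) {
    apply(wbo);
    state.lastSync = wbo.modified;  // resume point survives an interrupt
  }
}
```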
(Reporter)

Comment 11

7 years ago
Not using sort order means we lose the advantage of syncing folders first, so we don't need to reparent on a large sync.
Comment 12

7 years ago
(In reply to comment #11)
> Not using sort order means we lose the advantage of syncing folders first, so
> we don't need to reparent on a large sync.

Good point! (Not that we do this right now on the desktop... bug 625295)
(Assignee)

Updated

7 years ago
Blocks: 622762
Comment 13

7 years ago
Created attachment 506862 [details] [diff] [review]
wip v1

Bring back the toFetch file that was removed in bug 609398. We still sync all batches in one sync on mobile, so bug 609398 wasn't backed out entirely.

Unlike before, toFetch is now saved on the next event loop tick so that we don't do synchronous I/O while syncing stuff down. We still have bug 608757 for making Utils.json{Load|Save} async. I have a WIP for that so if we decide the synchronous I/O (now at least not interfering with places I/O and the like) is bad on mobile, I can always resurrect that patch.
Attachment #506862 - Flags: review?(mconnor)
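The "save toFetch on the next tick" idea above can be sketched as follows. `writeFile` stands in for `Utils.jsonSave` and `defer` for scheduling on the next event loop tick; both are assumptions, not the actual Sync helpers:

```javascript
// Minimal sketch: saves requested within the same tick are coalesced into
// one deferred write, so the sync loop never blocks on disk between batches.
function makeToFetchSaver(writeFile, defer) {
  let pending = null;
  let scheduled = false;
  return function saveToFetch(ids) {
    pending = ids;
    if (scheduled) return;  // coalesce repeated saves into one write
    scheduled = true;
    defer(() => {
      scheduled = false;
      writeFile(JSON.stringify(pending));
    });
  };
}
```

With `defer = cb => setTimeout(cb, 0)` this matches the next-tick behaviour described above; injecting `defer` just keeps the sketch deterministic and testable.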
(Assignee)

Updated

7 years ago
Whiteboard: [softblocker] → [softblocker][has patch][needs review mconnor]
Comment 14

7 years ago
Comment on attachment 506862 [details] [diff] [review]
wip v1

Hmm, we don't update lastSync until the end of processIncoming, when all records have been downloaded. So if the app gets shut down in the middle of processIncoming, we still refetch all records.

So once we're in the batching mode and save to toFetch, we should update lastSync and not lastModified.
Attachment #506862 - Attachment description: v1 → wip v1
Attachment #506862 - Flags: review?(mconnor)
Comment 15

7 years ago
Created attachment 506915 [details] [diff] [review]
v2

This patch now also ensures that lastSync is fast-forwarded when in batch mode. That way on the next sync we resume where we left off rather than refetching everything.
Attachment #506862 - Attachment is obsolete: true
Attachment #506915 - Flags: review?(mconnor)
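A hedged sketch of the v2 behaviour described above: while in batch mode, persist the still-outstanding IDs and fast-forward lastSync after every fetched batch. `fetchRecords` and `saveToFetch` are stand-ins for the real Sync internals, not actual API:

```javascript
// After each batch: rewrite toFetch with what remains, and advance lastSync
// so the next sync resumes at the last completed batch, not at lastModified.
function batchedDownload(toFetch, batchSize, fetchRecords, saveToFetch, state) {
  while (toFetch.length) {
    const batch = toFetch.slice(0, batchSize);
    const records = fetchRecords(batch);
    toFetch = toFetch.slice(batchSize);
    saveToFetch(toFetch);  // persist what is still left to download
    // Fast-forward to the newest record we have fully processed.
    state.lastSync = Math.max(state.lastSync, ...records.map(r => r.modified));
  }
}
```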
Comment 16

7 years ago
Sorry to rain on this parade, but loadToFetch calls Utils.jsonLoad, which does sync disk I/O. If we are doing all this work to avoid synchronous disk I/O in the first place, it seems rather silly to end up adding synchronous disk I/O.
Comment 17

7 years ago
Although it shouldn't be hard to make Utils.jsonLoad async. It would probably be a lot like bug 627772.
Comment 18

7 years ago
(In reply to comment #16)
> Sorry to rain on this parade, but loadToFetch calls Utils.jsonLoad, which does
> sync disk I/O.  If we are doing all this work to avoid synchronous disk I/O in
> the first place, it seems rather silly to end up adding synchronous disk I/O.

We're talking about very different scales here, though. Syncing down 1000 history items in batches of 50 will have us do a tiny bit of synchronous I/O 20 times, rather than N*1000 times (for some number N) like now.

(In reply to comment #17)
> Although it shouldn't be hard to make Utils.jsonLoad async.

It's not, it just hasn't been a priority up until now. As I said, we could decide to nominate bug 608757 if it turns out to be a problem. It's a small fix that can land between beta and RC.
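As a quick sanity check of the batching arithmetic in the comment above (one toFetch rewrite per fetched batch):

```javascript
// 1000 history items fetched in batches of 50 means the toFetch file is
// rewritten once per batch, i.e. 20 times per sync.
const items = 1000;
const batchSize = 50;
const toFetchWrites = Math.ceil(items / batchSize);
```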
Comment 19

7 years ago
(In reply to comment #18)
> We're talking about much different scales here, though. Syncing down 1000
> history items in batches of 50 will have us do a tiny bit of synchronous I/O 20
> times, rather than a N*1000 times (for some number of N) like now.
Unfortunately, it looks like any synchronous I/O on mobile ends up having fairly bad pauses (for example, bug 627635). This is probably partly because on Android prior to Gingerbread (which uses ext4), the file system serializes all access to itself. If we are already doing I/O on a background thread, the main thread cannot do anything because it's blocked by the background thread.
Comment 20

7 years ago
(In reply to comment #19)
> fairly bad pauses (for example, bug 627635).  This is probably partly because
> on android prior to gingerbread (which uses ext4), the file system serializes
> all access to itself.  If we are already doing I/O on a background thread, the
> main thread cannot do anything because it's blocked by the background thread.

F-I-N-E. Have it your way! :p

I uploaded patches for bug 608757 just now. Pls to be stopping with the rain on my parade nao. kthxbai.
(Reporter)

Comment 21

7 years ago
Comment on attachment 506915 [details] [diff] [review]
v2

Looks good here, woo.
Attachment #506915 - Flags: review?(mconnor) → review+
(Assignee)

Updated

7 years ago
Whiteboard: [softblocker][has patch][needs review mconnor] → [softblocker][has patch][has review]
Comment 22

7 years ago
https://hg.mozilla.org/services/fx-sync/rev/d020b67d5dcf
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
(Assignee)

Updated

7 years ago
Blocks: 627511