Bug 1378567 (Closed) · Opened 7 years ago · Closed 7 years ago

[meta] Enable batch API v3

Categories

(Cloud Services Graveyard :: Server: Sync, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: markh, Unassigned)

References

Details

(Whiteboard: [Sync Q3 OKR])

This is a meta-bug to re-enable the batch API on the sync server.

Best I can tell, the commit with the fix we need is https://github.com/mozilla-services/server-syncstorage/commit/f3cadd8d335a8becf01099b0a7b93223c5b4b974

There will probably be subordinate bugs for the deployment to stage and QA, then for the incremental deployment to prod.
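For context, the batch API groups several POSTs to a collection into a single atomically committed batch, opened with `?batch=true` and finished with `?commit=true`. Below is a minimal client-side sketch of that flow, based on my reading of the Sync 1.5 storage API; the node URL, uid, collection, and auth handling are purely illustrative, not taken from this bug.

```python
from urllib.parse import quote

import requests

# Illustrative values only -- a real client gets its node URL and uid from the
# token server and signs every request with Hawk auth, which is omitted here.
NODE = "https://sync-node.example.com"
BASE = f"{NODE}/1.5/12345/storage/bookmarks"


def post_bsos(url, bsos, auth=None):
    """POST a JSON array of BSOs and return the parsed response body."""
    resp = requests.post(url, json=bsos, auth=auth)
    resp.raise_for_status()
    return resp.json()


def batch_upload(chunks, auth=None):
    """Upload several chunks of BSOs as one batch (assumes >= 2 chunks)."""
    # The first POST opens the batch; the server hands back a batch id.
    first = post_bsos(f"{BASE}?batch=true", chunks[0], auth)
    batch_id = quote(str(first["batch"]), safe="")

    # Middle chunks accumulate server-side but aren't visible to readers yet.
    for chunk in chunks[1:-1]:
        post_bsos(f"{BASE}?batch={batch_id}", chunk, auth)

    # The commit=true request makes the whole batch visible atomically.
    return post_bsos(f"{BASE}?batch={batch_id}&commit=true", chunks[-1], auth)
```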
See Also: → 1378569
For whatever reason I can't add this as a blocker or a See Also, but: Bug 1373105, Bug 1370136
See Also: → 1370136, 1373105
FWIW I don't think we should block on Bug 1370136 which AFAICT is an unrelated change.  Bob, thoughts?

I've made a v1.6.8 tag for this deployment, so I think it's over to bob for the rollout at this stage.
(In reply to Ryan Kelly [:rfkelly] from comment #2)
> FWIW I don't think we should block on Bug 1370136 which AFAICT is an
> unrelated change.  Bob, thoughts?
Sounds good to me.

> I've made a v1.6.8 tag for this deployment, so I think it's over to bob for
> the rollout at this stage.
Built from v1.6.8 tag and deployed to stage.  Over to QA for testing.  Perhaps we should create a separate deployment ticket.
Flags: needinfo?(rbillings)
I'm available for testing, but yes a separate deployment ticket would be good.
Flags: needinfo?(rbillings)
(In reply to Rebecca Billings [:rbillings] from comment #4)
> I'm available for testing, but yes a separate deployment ticket would be
> good.

https://bugzilla.mozilla.org/show_bug.cgi?id=1384702
Depends on: 1385138
Depends on: 1378569
Blocks: 1378569
No longer depends on: 1378569
Blocks: 1250747
So next steps from here:

* Roll out v1.6.8 to production with batch uploads disabled
* Enable batch uploads on 10% of nodes; monitor server and client telemetry for weirdness
* Enable batch uploads on 50% of nodes; monitor server and client telemetry for weirdness
* Enable batch uploads on all nodes; monitor server and client telemetry for weirdness

Since the new version has changed the performance characteristics of the `?commit=true` request, I think it makes sense to still do the rollout in several stages like this, and to carefully watch for any performance degradation on the server.
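As a concrete illustration of the "monitor server telemetry for weirdness" steps above, here's a rough sketch of the comparison I'd want at each stage, assuming per-request server metrics can be exported with a flag for whether the node has batch uploads enabled. The file and column names are hypothetical, not the schema of any real dashboard or log.

```python
import pandas as pd

# Hypothetical export of per-request metrics; none of these file or column
# names correspond to a real dashboard or log schema.
df = pd.read_csv("sync_request_metrics.csv")
# expected columns: node, batch_enabled, endpoint, status, latency_ms

# Latency of the commit request, batch-enabled vs. disabled nodes.
commits = df[df["endpoint"].str.contains("commit=true", regex=False)]
latency = commits.groupby("batch_enabled")["latency_ms"].describe(
    percentiles=[0.5, 0.95, 0.99]
)
print(latency[["50%", "95%", "99%"]])

# 503 rate per group, to catch overload that latency alone might hide.
rate_503 = (
    df.assign(is_503=df["status"].eq(503))
      .groupby("batch_enabled")["is_503"]
      .mean()
)
print(rate_503)
```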

Bob, can you suggest a series of dates for the above steps in Q3 based on your availability?
Flags: needinfo?(bobm)
Whiteboard: [Sync Q3 OKR]
> * Enable batch uploads on 10% of nodes; monitor server and client telemetry for weirdness

:bobm tells me that we're already at some percentage of rollout with the batch API enabled, so let's do a quick check-in on the metrics.

Bob, any noticeable differences in request latency, throughput, IOPS, or CPU usage between the servers with the batch API enabled vs. disabled?
Mark, is https://nbviewer.jupyter.org/gist/mhammond/01907397057deb00edec1d6616c0a17c/Success-Failure-Error%20rates%20per%20engine.ipynb still the place to look to monitor for any client-side failures?
Flags: needinfo?(markh)
(In reply to Ryan Kelly [:rfkelly] from comment #8)
> Mark, is https://nbviewer.jupyter.org/gist/mhammond/01907397057deb00edec1d6616c0a17c/Success-Failure-Error%20rates%20per%20engine.ipynb
> still the place to look to monitor for any client-side failures?

That's a snapshot, so it isn't updated automatically. I'm re-running the analysis now, although the notebook now seems too large to display, so you'll probably need to take my verbal results :(
Flags: needinfo?(markh)
Depends on: 1388963
(In reply to Ryan Kelly [:rfkelly] from comment #6)

> Bob, can you suggest a series of dates for the above steps in Q3 based on
> your availability?

Due to an issue with the 1.6.8 rollout, batch uploads were enabled on roughly 1/3 of our server fleet. We should close out bug 1388963 before enabling additional servers.
https://nbviewer.jupyter.org/gist/mhammond/01907397057deb00edec1d6616c0a17c/Success-Failure-Error%20rates%20per%20engine.ipynb has a recent analysis - the 503 uptick is noticeable for a few engines (the specific error is record_upload_fail), most notably clients, but it gets lost in the noise for many others. It appears to have dropped in the last day, but that's likely to be a lie. I can't see any other issues, but I'll run it again next week.
I updated https://nbviewer.jupyter.org/gist/mhammond/01907397057deb00edec1d6616c0a17c/Success-Failure-Error%20rates%20per%20engine.ipynb - it shows the 503s dropping for a couple of days, then jumping back up again over the last couple of days, for multiple engines. I'll run it again next week, but it might be worth seeing if the server stats show a similar recent re-spike.
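For anyone who can't load the notebook, the underlying analysis is essentially per-engine, per-day failure-rate counting over the sync ping data. A minimal sketch of that idea follows, using a hypothetical flattened table rather than the real telemetry schema.

```python
import pandas as pd

# Hypothetical flattened sync-ping data: one row per engine outcome per sync.
# Real telemetry column names differ; these are placeholders for illustration.
pings = pd.read_parquet("sync_engine_outcomes.parquet")
# expected columns: date, engine, outcome  (e.g. "success", "record_upload_fail")

counts = (
    pings.groupby(["date", "engine", "outcome"])
         .size()
         .unstack("outcome", fill_value=0)
)
counts["total"] = counts.sum(axis=1)
if "record_upload_fail" in counts:
    counts["upload_fail_rate"] = counts["record_upload_fail"] / counts["total"]
else:
    counts["upload_fail_rate"] = 0.0

# One line per engine over time makes an uptick like the recent 503s obvious.
counts["upload_fail_rate"].unstack("engine").plot(
    figsize=(10, 4), title="record_upload_fail rate per engine"
)
```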
Depends on: 1397357
Depends on: 1397553
Quick update on this based on today's standup: we're at 50% enablement of the batch API and things seem to be pretty stable. It flushed out a server bug that only affected iOS beta, but otherwise we're not seeing any signs of server overload or other such badness.
Flags: needinfo?(bobm)
We've been at 100%, and stable on release 1.6.11.  I think we're clear to close this one out. :rfkelly agree?
Flags: needinfo?(rfkelly)
Sounds *great* to me!
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(rfkelly)
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard