Closed
Bug 1378567
Opened 7 years ago
Closed 7 years ago
[meta] Enable batch API v3
Categories
(Cloud Services Graveyard :: Server: Sync, enhancement)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: markh, Unassigned)
References
Details
(Whiteboard: [Sync Q3 OKR])
This is a meta-bug to re-enable the batch API on the sync server. Best I can tell, the commit with the fix we need is https://github.com/mozilla-services/server-syncstorage/commit/f3cadd8d335a8becf01099b0a7b93223c5b4b974 There will probably be subordinate bugs for the deployment to stage and QA, then for the incremental deployment to prod.
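For context, the batch API being re-enabled here lets clients stage many POSTs and apply them atomically. A minimal sketch of how a client builds the relevant URLs, following the documented Sync storage v1.5 batch semantics (the node host, user id, and batch id below are hypothetical, for illustration only):

```python
def batch_url(base, collection, batch=None, commit=False):
    """Build the POST URL for one step of a Sync storage batch upload.

    batch="true"             -> start a new batch (server returns a batch id)
    batch=<id>               -> append more records to that batch
    batch=<id>, commit=True  -> apply the whole batch atomically
    """
    url = "%s/storage/%s" % (base, collection)
    params = []
    if batch is not None:
        params.append("batch=%s" % batch)
    if commit:
        params.append("commit=true")
    return url + ("?" + "&".join(params) if params else "")

# Hypothetical storage node and batch id:
BASE = "https://sync-1-us-west1.example.com/1.5/12345"
start = batch_url(BASE, "history", batch="true")
commit = batch_url(BASE, "history", batch="MTIzNDU", commit=True)
```

The `?commit=true` request is the expensive step, which is why the rollout below is staged and watched for performance regressions.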
Comment 1•7 years ago
For whatever reason I can't add this as a blocker or a See Also, but: Bug 1373105, Bug 1370136
Updated•7 years ago
Comment 2•7 years ago
FWIW I don't think we should block on Bug 1370136 which AFAICT is an unrelated change. Bob, thoughts? I've made a v1.6.8 tag for this deployment, so I think it's over to bob for the rollout at this stage.
Comment 3•7 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #2)
> FWIW I don't think we should block on Bug 1370136 which AFAICT is an unrelated change. Bob, thoughts?

Sounds good to me.

> I've made a v1.6.8 tag for this deployment, so I think it's over to bob for the rollout at this stage.

Built from v1.6.8 tag and deployed to stage. Over to QA for testing. Perhaps we should create a separate deployment ticket.
Flags: needinfo?(rbillings)
Comment 4•7 years ago
I'm available for testing, but yes a separate deployment ticket would be good.
Flags: needinfo?(rbillings)
Comment 5•7 years ago
(In reply to Rebecca Billings [:rbillings] from comment #4)
> I'm available for testing, but yes a separate deployment ticket would be good.

https://bugzilla.mozilla.org/show_bug.cgi?id=1384702
Updated•7 years ago
Comment 6•7 years ago
So next steps from here:

* Roll out v1.6.8 to production with batch uploads disabled
* Enable batch uploads on 10% of nodes; monitor server and client telemetry for weirdness
* Enable batch uploads on 50% of nodes; monitor server and client telemetry for weirdness
* Enable batch uploads on all nodes; monitor server and client telemetry for weirdness

Since the new version has changed the performance characteristics of the `?commit=true` request, I think it makes sense to still do the rollout in several stages like this, and to carefully watch for any performance degradation on the server. Bob, can you suggest a series of dates for the above steps in Q3 based on your availability?
Flags: needinfo?(bobm)
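The fleet presumably gates this with a per-node configuration flag, but staged percentage rollouts like the plan above are commonly made deterministic by hashing a stable node identifier, so the set of enabled nodes at 10% is a subset of the set at 50%. A sketch only, with hypothetical node names:

```python
import hashlib

def batch_enabled(node_id: str, rollout_pct: int) -> bool:
    """Bucket a node into [0, 100) by hashing its id; a node enabled
    at 10% stays enabled when the rollout widens to 50% or 100%."""
    digest = hashlib.sha256(node_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_pct

nodes = ["sync-node-%d" % i for i in range(1000)]
enabled_at_10 = {n for n in nodes if batch_enabled(n, 10)}
enabled_at_50 = {n for n in nodes if batch_enabled(n, 50)}
assert enabled_at_10 <= enabled_at_50  # rollout only ever widens
```

The subset property is what makes the "monitor for weirdness" step meaningful: any regression seen at 10% is on nodes that remain enabled at each later stage.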
Updated•7 years ago
Whiteboard: [Sync Q3 OKR]
Comment 7•7 years ago
> * Enable batch uploads on 10% of nodes; monitor server and client telemetry for weirdness
:bobm tells me that we're already at some percentage of rollout with the batch API enabled, so let's do a quick check-in on the metrics.
Bob, any noticeable differences in request latency, throughput, IOPS or CPU usage between the servers with the batch API enabled vs disabled?
Comment 8•7 years ago
Mark, is https://nbviewer.jupyter.org/gist/mhammond/01907397057deb00edec1d6616c0a17c/Success-Failure-Error%20rates%20per%20engine.ipynb still the place to look to monitor for any client-side failures?
Flags: needinfo?(markh)
Reporter
Comment 9•7 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #8)
> Mark, is https://nbviewer.jupyter.org/gist/mhammond/01907397057deb00edec1d6616c0a17c/Success-Failure-Error%20rates%20per%20engine.ipynb still the place to look to monitor for any client-side failures?

That's a snapshot, so it isn't updated automatically. I'm re-running the analysis now (although the notebook now seems too large to display, so you'll probably need to take my verbal results :().
Flags: needinfo?(markh)
Comment 10•7 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #6)
> Bob, can you suggest a series of dates for the above steps in Q3 based on your availability?

Due to an issue with the 1.6.8 rollout, batch uploads were enabled on roughly 1/3 of our server fleet. We should close out bug 1388963 before enabling additional servers.
Reporter | ||
Comment 11•7 years ago
https://nbviewer.jupyter.org/gist/mhammond/01907397057deb00edec1d6616c0a17c/Success-Failure-Error%20rates%20per%20engine.ipynb has a recent analysis - the 503 uptick is noticeable for a few engines (the specific error is record_upload_fail), most notably clients, but it gets lost in the noise for many others. It appears to have dropped in the last day, but that's likely to be a lie. I can't see any other issues, but I'll run it again next week.
Reporter
Comment 12•7 years ago
I updated https://nbviewer.jupyter.org/gist/mhammond/01907397057deb00edec1d6616c0a17c/Success-Failure-Error%20rates%20per%20engine.ipynb - it shows the 503s dropping for a couple of days, then jumping back up again over the last couple of days, for multiple engines. I'll run it again next week, but it might be worth seeing if the server stats show a similar recent re-spike.
Comment 13•7 years ago
Quick update on this based on today's standup: we're at 50% enablement of the batch API and things seem to be pretty stable. It flushed out a server bug that only affected iOS beta, but otherwise we're not seeing any signs of server overload or other such badness.
Updated•7 years ago
Flags: needinfo?(bobm)
Comment 14•7 years ago
We've been at 100%, and stable on release 1.6.11. I think we're clear to close this one out. :rfkelly agree?
Flags: needinfo?(rfkelly)
Comment 15•7 years ago
Sounds *great* to me!
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(rfkelly)
Resolution: --- → FIXED
Updated•1 year ago
Product: Cloud Services → Cloud Services Graveyard