1378567 - [meta] Enable batch API v3

Reporter

Description

•

7 years ago

This is a meta-bug to re-enable the batch API on the sync server.

Best I can tell, the commit with the fix we need is https://github.com/mozilla-services/server-syncstorage/commit/f3cadd8d335a8becf01099b0a7b93223c5b4b974

There will probably be subordinate bugs for the deployment to stage and QA, then for the incremental deployment to prod.

Mark Hammond [:markh] [:mhammond]

Reporter

Updated

•

7 years ago

Comment 1

•

7 years ago

For whatever reason I can't add this as a blocker or a See Also, but: Bug 1373105, Bug 1370136

Richard Newman [:rnewman]

Updated

•

7 years ago

Comment 2

•

7 years ago

FWIW I don't think we should block on Bug 1370136 which AFAICT is an unrelated change.  Bob, thoughts?

I've made a v1.6.8 tag for this deployment, so I think it's over to bob for the rollout at this stage.

Bob Micheletto [:bobm]

Comment 3

•

7 years ago

(In reply to Ryan Kelly [:rfkelly] from comment #2)
> FWIW I don't think we should block on Bug 1370136 which AFAICT is an
> unrelated change.  Bob, thoughts?
Sounds good to me.

> I've made a v1.6.8 tag for this deployment, so I think it's over to bob for
> the rollout at this stage.
Built from v1.6.8 tag and deployed to stage.  Over to QA for testing.  Perhaps we should create a separate deployment ticket.

Flags: needinfo?(rbillings)

Rebecca Billings [:rbillings]

Comment 4

•

7 years ago

I'm available for testing, but yes a separate deployment ticket would be good.

Flags: needinfo?(rbillings)

Bob Micheletto [:bobm]

Comment 5

•

7 years ago

(In reply to Rebecca Billings [:rbillings] from comment #4)
> I'm available for testing, but yes a separate deployment ticket would be
> good.

https://bugzilla.mozilla.org/show_bug.cgi?id=1384702

Bob Micheletto [:bobm]

Updated

•

7 years ago

Depends on: 1385138

Ryan Kelly [:rfkelly]

Updated

•

7 years ago

Depends on: 1378569

Ryan Kelly [:rfkelly]

Updated

•

7 years ago

Blocks: 1378569

No longer depends on: 1378569

Ryan Kelly [:rfkelly]

Updated

•

7 years ago

Blocks: 1250747

Ryan Kelly [:rfkelly]

Comment 6

•

7 years ago

So next steps from here:

* Roll out v1.6.8 to production with batch uploads disabled
* Enable batch uploads on 10% of nodes; monitor server and client telemetry for weirdness
* Enable batch uploads on 50% of nodes; monitor server and client telemetry for weirdness
* Enable batch uploads on all nodes; monitor server and client telemetry for weirdness

Since the new version has changed the performance characteristics of the `?commit=true` request, I think it makes sense to still do the rollout in several stages like this, and to carefully watch for any performance degradation on the server.

Bob, can you suggest a series of dates for the above steps in Q3 based on your availability?

Flags: needinfo?(bobm)

Julie McCracken (:julie)

Updated

•

7 years ago

Whiteboard: [Sync Q3 OKR]

Ryan Kelly [:rfkelly]

Comment 7

•

7 years ago

> * Enable batch uploads on 10% of nodes; monitor server and client telemetry for weirdness

:bobm tells me that we're already at some percentage of rollout with batch API enabled, so let's do a quick checkin on the metrics.

Bob, any noticeable differences in request latency, throughout, IOPS or CPU usage between the servers with batch API enabled vs disabled?

Ryan Kelly [:rfkelly]

Comment 8

•

7 years ago

Mark, is https://nbviewer.jupyter.org/gist/mhammond/01907397057deb00edec1d6616c0a17c/Success-Failure-Error%20rates%20per%20engine.ipynb still the place to look to monitor for any client-side failures?

Flags: needinfo?(markh)

Mark Hammond [:markh] [:mhammond]

Reporter

Comment 9

•

7 years ago

(In reply to Ryan Kelly [:rfkelly] from comment #8)
> Mark, is
> https://nbviewer.jupyter.org/gist/mhammond/01907397057deb00edec1d6616c0a17c/
> Success-Failure-Error%20rates%20per%20engine.ipynb still the place to look
> to monitor for any client-side failures?

That's a snapshot so isn't updated automatically. I'm re-running the analysis now (although the notebook now seems too large to display, so you'll probably need to take my verbal results :(

Flags: needinfo?(markh)

Bob Micheletto [:bobm]

Updated

•

7 years ago

Depends on: 1388963

Bob Micheletto [:bobm]

Comment 10

•

7 years ago

(In reply to Ryan Kelly [:rfkelly] from comment #6)

> Bob, can you suggest a series of dates for the above steps in Q3 based on
> your availability?

Due to an issue with the 1.6.8 rollout batch uploads were enabled on roughly 1/3 of our server fleet.  We should close out bug 1388963 before enabling additional servers.

Mark Hammond [:markh] [:mhammond]

Reporter

Comment 11

•

7 years ago

https://nbviewer.jupyter.org/gist/mhammond/01907397057deb00edec1d6616c0a17c/Success-Failure-Error%20rates%20per%20engine.ipynb has a recent analysis - the 503 uptick is noticeable for a few engines (the specific error is record_upload_fail), most notably clients, but it gets lost in the noise for many others. It appears to have dropped in the last day, but that's likely to be a lie. I can't see any other issues, but I'll run it again next week.

Mark Hammond [:markh] [:mhammond]

Reporter

Comment 12

•

7 years ago

I updated https://nbviewer.jupyter.org/gist/mhammond/01907397057deb00edec1d6616c0a17c/Success-Failure-Error%20rates%20per%20engine.ipynb - it shows the 503s dropping for a couple of days, then jumping back up again over the last couple of days, for multiple engines. I'll run it again next week, but it might be worth seeing if the server stats show a similar recent re-spike.

Richard Newman [:rnewman]

Updated

•

7 years ago

Depends on: 1397357

Ryan Kelly [:rfkelly]

Updated

•

7 years ago

Depends on: 1397553

Ryan Kelly [:rfkelly]

Comment 13

•

7 years ago

Quick update on this based on today's standup, we're at 50% enablement of the batch API and things seems to be pretty stable.  It flushed out a server bug that only affected iOS beta, but otherwise we're not seeing any signs of server overload or other such badness.

Ryan Kelly [:rfkelly]

Updated

•

7 years ago

Flags: needinfo?(bobm)

Bob Micheletto [:bobm]

Comment 14

•

7 years ago

We've been at 100%, and stable on release 1.6.11.  I think we're clear to close this one out. :rfkelly agree?

Flags: needinfo?(rfkelly)

Ryan Kelly [:rfkelly]

Comment 15

•

7 years ago

Sounds *great* to me!

Status: NEW → RESOLVED

Closed: 7 years ago

Flags: needinfo?(rfkelly)

Resolution: --- → FIXED

BMO Automation

Updated

•

1 year ago

Product: Cloud Services → Cloud Services Graveyard