Closed Bug 1366820 Opened 7 years ago Closed 2 years ago

Ship-it v2 should implement a workflow where Fennec is staged rollout at 10% (first 2 days), then 99.99%, then 100%

Categories

(Release Engineering :: Release Automation: Other, defect, P3)

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: ritu, Unassigned)

References

Details

Fennec release staged rollout should be as follows:

1. On go-live day, set Google Play Fennec Release staged rollout to 10%
2. Keep as is for 2 days,
3. If there are no release blocking issues, update staged rollout to 99.99%
4. Keep as is for a week,
5. If there are no release blockers, update staged rollout to 100%
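
For illustration only, the schedule above can be written down as data. This is a minimal sketch, not code from pushapk_scriptworker or mozapkpublisher: `RolloutStep`, `FENNEC_SCHEDULE` and `planned_dates` are hypothetical names, and the dates it returns are the earliest possible ones, assuming no release blocker delays a step.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RolloutStep:
    """One step of the staged rollout (hypothetical structure)."""
    percentage: float          # share of users targeted, in percent
    hold_days: Optional[int]   # days to hold before the next step; None for the last step


# Schedule from the description: 10% for 2 days, 99.99% for a week, then 100%.
FENNEC_SCHEDULE: List[RolloutStep] = [
    RolloutStep(10.0, 2),
    RolloutStep(99.99, 7),
    RolloutStep(100.0, None),
]


def planned_dates(go_live: date, schedule: List[RolloutStep]) -> List[Tuple[date, float]]:
    """Return (earliest date, percentage) pairs, assuming no blockers delay any step."""
    plan = []
    current = go_live
    for step in schedule:
        plan.append((current, step.percentage))
        if step.hold_days is not None:
            current += timedelta(days=step.hold_days)
    return plan
```

Calling `planned_dates(go_live, FENNEC_SCHEDULE)` yields the go-live day at 10%, two days later at 99.99%, and a further week later at 100%.
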
Hi Johan, it would be great if we can put this in place before 54 goes live (i.e. June 13th). Thanks!
Flags: needinfo?(jlorenzo)
I'm sorry, delayed automatic upgrades are not as trivial as they sound.

* We need to allow pushapk_scriptworker to support a new type of task
* We have to find a way to spawn this task so that Chain of Trust (a feature of scriptworker) validates the task graph. This means we have to create a new type of decision task, for this particular use case.
* Moreover, we need to find a way to prevent an automatic bump to 100% when we have manually throttled down updates. I think the Google Play APIs expose that, but that's a new feature to add to mozapkpublisher.

At first glance, this deadline seems like a stretch. I'll talk to :catlee about it.
Flags: needinfo?(jlorenzo)
Another note about 99.99%. The current implementation of pushapk_scriptworker only allows integers, so we can go up to 99%. Is it good enough?
Flags: needinfo?(rkothari)
(In reply to Johan Lorenzo [:jlorenzo] from comment #3)
> Another note about 99.99%. The current implementation of
> pushapk_scriptworker only allows integers, so we can go up to 99%. Is it
> good enough?

Yes this should be fine.
Flags: needinfo?(rkothari)
Circling back after discussing this first with :sylvestre and :jcristau, then with :catlee.

1. On the technical side

* pushapk_scriptworker has to support a new type of task. At some point we considered creating a new worker, but Chris pointed out this is unnecessary because the credentials would be the same on both workers.

* Making Chain of Trust happy remains the biggest unknown. The best solution we have at the moment is to create a new type of graph via the in-tree taskgraph. Spawning CoT-compatible tasks via another method (for example, directly through the Taskcluster API) is unnecessarily risky.

* Bumps cannot be part of the same release graph as the initial publication. This is due to the 5-day limit of TC. A better way to track where we are with the percentage: make these steps part of the ship-it v2 workflow (suggested by :catlee).


2. On the workflow side

* Having the percentage changed automatically without any human confirmation is risky. :jcristau said the "happy path" case occurs less often than the case where release management has to stop or throttle down updates. This makes automation less valuable. Moreover, the heuristic for the bot to detect that we're in good shape and can go to 100% is not trivial. For instance, let's consider:

> Heuristic: If, one week after step 2., the current percentage is 99% then publish to 100%.
> Edge case scenario: Step 2 happened. A few hours later, Relman had to reduce updates to 0% for 3 days. Updates are back at 10% for 3 more days. Then 99% is restored. Moments later, the bot publishes at 100%, leaving no way back.

* Letting pushapk_scriptworker increase the percentage doesn't remove the need for humans to hold Google Play credentials. Turning down updates isn't part of the automated workflow and, as said in the previous point, this does happen.
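
To make the edge case above concrete, here is a minimal sketch of the heuristic next to a stricter variant. Everything in it is an assumption made for illustration: the `(timestamp, percentage)` history format, the function names, and the reading of "step 2" as the first bump to 99%. The stricter variant is not something proposed in this bug; it only shows that the naive check fires in the scenario described, while a check on an uninterrupted week at 99% does not.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# (timestamp, percentage) entries, oldest first. Hypothetical log format.
History = List[Tuple[datetime, int]]

HOLD = timedelta(days=7)


def naive_should_go_to_100(history: History, now: datetime) -> bool:
    """Heuristic from the comment: a week after the first bump to 99%,
    the current percentage is 99%."""
    first_99 = next((ts for ts, pct in history if pct == 99), None)
    if first_99 is None:
        return False
    current_pct = history[-1][1]
    return current_pct == 99 and now - first_99 >= HOLD


def continuous_should_go_to_100(history: History, now: datetime) -> bool:
    """Stricter variant (an assumption, not from the bug): the rollout must
    have stayed at 99% for a full, uninterrupted week."""
    if not history or history[-1][1] != 99:
        return False
    # Walk back to the start of the latest unbroken run at 99%.
    run_start = history[-1][0]
    for ts, pct in reversed(history):
        if pct != 99:
            break
        run_start = ts
    return now - run_start >= HOLD
```

With a history of 99%, then 0% for 3 days, then 10% for 3 days, then 99% again shortly before the one-week mark, `naive_should_go_to_100` returns `True` while `continuous_should_go_to_100` returns `False`. Even the stricter check says nothing about whether the release is actually healthy, which is why a human confirmation is still suggested.
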


3. In a glorious future, how can we have fewer Google Play creds?

Here's a proposed solution:
a. pushapk_scriptworker should be *the* technical way to configure Google Play.
b. Ship-it v2 should implement the happy path scenario and ask a human to confirm rollout increases to 99% and 100%. Once the human has signed off, ship-it v2 should schedule a CoT graph and let pushapk_scriptworker change the percentage.
c. Ship-it v2 should allow reducing the percentage, by exceptionally adding a step to the workflow.
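
As a rough sketch of points b. and c., the workflow could be modelled as a small state machine in which every change needs a named sign-off. This is not ship-it v2 code: `RolloutWorkflow`, its methods and the `scheduled` list (a stand-in for "schedule a CoT graph") are hypothetical, and the rule that nothing can be reduced after 100% is an assumption based on the "no turning back" remark above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

HAPPY_PATH = (10, 99, 100)  # integer percentages, as discussed in this bug


@dataclass
class RolloutWorkflow:
    current: int = HAPPY_PATH[0]
    # Percentages handed over for scheduling; stands in for "schedule a CoT graph".
    scheduled: List[int] = field(default_factory=list)

    def next_target(self) -> Optional[int]:
        """Smallest happy-path percentage above the current one, or None at 100%."""
        return next((p for p in HAPPY_PATH if p > self.current), None)

    def _schedule(self, percentage: int, signed_off_by: str) -> None:
        if not signed_off_by:
            raise PermissionError("a human must sign off on every change")
        self.scheduled.append(percentage)
        self.current = percentage

    def confirm_increase(self, signed_off_by: str) -> int:
        """Happy path: move to the next percentage once a human has signed off."""
        target = self.next_target()
        if target is None:
            raise ValueError("rollout is already at 100%")
        self._schedule(target, signed_off_by)
        return target

    def reduce(self, percentage: int, signed_off_by: str) -> None:
        """Exceptional step: throttle down to a lower percentage."""
        if self.current == 100:
            raise ValueError("cannot reduce after reaching 100%")
        if not 0 <= percentage < self.current:
            raise ValueError("reduction must be below the current percentage")
        self._schedule(percentage, signed_off_by)
```

After a reduction, `confirm_increase` resumes from the lowest happy-path value above the current percentage, so a throttle-down to 0% is followed by 10%, 99%, then 100%, each with its own sign-off.
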


4. Solution, in the meantime

Considering that:
* this depends on ship-it v2,
* a heuristic without human confirmation is potentially dangerous,
* humans will still have to interact with Google Play, because of the throttle-down scenario,

I'd suggest we let release management manually change the rollout percentage, until release engineering implements the necessary bits in ship-it v2.

Does that sound okay to you, Ritu?
Flags: needinfo?(rkothari)
Hi Johan, that is a good write-up. Thank you! If there is a ship-it v2 meta bug, I think we should attach this bug to it and keep it around as a requirement for a GP staged rollout workflow.
Flags: needinfo?(rkothari)
Renamed the bug to sound like a requirement. Attached to ship-it meta-bug.
Blocks: shipit-v2
Summary: Update Fennec release APK scripts to do staged rollout of 10% (first 2 days), then 99.99% → Ship-it v2 should implement a workflow where Fennec is staged rollout at 10% (first 2 days), then 99.99%, then 100%
Priority: -- → P3
Bulk change of QA Contact to :jlund, per https://bugzilla.mozilla.org/show_bug.cgi?id=1428483
QA Contact: catlee → jlund
Component: Release Requests → Release Automation: Pushapk
Component: Release Automation: PushApk → Release Automation: Other

Fennec is EOL; I think we have a Fenix solution.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INCOMPLETE