Closed Bug 1552485 Opened 5 years ago Closed 5 years ago

[tracking] Reduce release Graph End-to-End times

Categories

(Release Engineering :: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mtabara, Assigned: mtabara)

References

(Depends on 2 open bugs)

Details

While doing some cleanup for bug 1530728, I was glancing over our release artifacts and noticed there are no consumers at this point for the individual checksums' detached signatures.

To recap:

  • beetmover submits files X, Y to S3 and generates a target.checksums at the end
  • checksums-signing consumes that target.checksums and signs it, producing the target.checksums plus a detached signature, target.checksums.asc
  • beetmover-checksums-signing consumes the two artifacts above and transfers them under the beetmover-checksums folder at the root of candidates (e.g. for 66.0.5), slightly pretty-named:
    target.checksums -> firefox-{version}.beet
    target.checksums.asc -> firefox-{version}.checksums.asc

As a follow-up, the release-generate-checksums job iterates over the public S3 folder, reads the contents of the .beet files, and concatenates them together into a big-fat SHA{256,512}SUMS. That later gets signed and beetmoved into the root of the ~candidates directory.
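For illustration, a minimal sketch of that concatenation step (not the actual mozharness code), assuming each .beet line carries digest, hash type, size and filename, which is how target.checksums entries typically look; adjust if the real artifacts differ:

import glob

def parse_beet(path):
    """Yield (hash_type, digest, filename) tuples from one .beet file."""
    with open(path) as fh:
        for line in fh:
            digest, hash_type, _size, filename = line.split(None, 3)
            yield hash_type, digest, filename.strip()

def write_sums(beet_dir="beetmover-checksums"):
    # Collect per-file digests and emit the combined SHA256SUMS / SHA512SUMS.
    sums = {"sha256": [], "sha512": []}
    for beet in sorted(glob.glob(f"{beet_dir}/*.beet")):
        for hash_type, digest, filename in parse_beet(beet):
            if hash_type in sums:
                sums[hash_type].append(f"{digest}  {filename}\n")
    for hash_type, lines in sums.items():
        with open(f"{hash_type.upper()}SUMS", "w") as out:
            out.writelines(lines)

if __name__ == "__main__":
    write_sums()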

But at this point there is no regular use of the .asc files from S3.
There are three options:

  1. We drop the .asc files as they are currently unused.
  2. We add verification for those in release-generate-checksums to ensure the files have not been tampered with in S3 (a small verification sketch follows after this list).
  3. We rewrite the existing mozharness script as a scriptworker that performs CoT verification and downloads the files from upstream tasks instead. However, that might result in a task with a huge payload, since there are currently 992 individual checksums files for 66.0.5, for example.
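Option 2 could look roughly like the sketch below: before concatenating, check each .beet file against its detached .asc signature. This assumes gpg is on PATH, the release signing public key is already imported into the keyring, and the file naming from the recap above; it is illustrative, not the actual release-generate-checksums code.

import subprocess
from pathlib import Path

def verify_detached(signed_file: Path, signature: Path) -> bool:
    """Return True if `signature` is a valid detached signature over `signed_file`."""
    result = subprocess.run(
        ["gpg", "--verify", str(signature), str(signed_file)],
        capture_output=True,
    )
    return result.returncode == 0

def verify_all(beet_dir="beetmover-checksums"):
    for beet in sorted(Path(beet_dir).glob("*.beet")):
        # firefox-{version}.beet pairs with firefox-{version}.checksums.asc (per the recap).
        asc = beet.parent / (beet.stem + ".checksums.asc")
        if not verify_detached(beet, asc):
            raise RuntimeError(f"signature verification failed for {beet}")

if __name__ == "__main__":
    verify_all()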
Blocks: 1530728
See Also: 1530728
Type: defect → task

Re-using this bug as this conversation happened at the all-hands. This is to track all the work that's to be done, including but not limited to:

  • clear plan with measurable small wins
  • beetmover and checksums improvements work
  • balrog improvements work

More bugs are to be filed against this, but for now, dropping some of the ideas:

  • Stop producing target.checksums files from BM jobs (maybe? or we could use these to grab both sha512/sha256 hashes for the checksum scriptworker below)

  • Remove checksums-signing and beetmover-checksums-signing jobs

  • Have beetmover and balrog chunking match l10n chunking (see the chunking sketch after this list)
    -- we discovered that balrog jobs take over 5 hours in cumulative run time to complete!

  • create a new checksum scriptworker type to generate the signed SHASUMS files by inspecting all the CoT artifacts for the release (or target.checksums)
    -- alternatively, create the SHASUMS file in a generic worker that knows how to do CoT verification, and do regular signing / beetmoving
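For the chunking bullet above, the idea is just to apply the same deterministic split everywhere, so each beetmover/balrog chunk depends on exactly one l10n chunk instead of the whole set. A toy sketch follows; the chunkify helper here is hypothetical, not the one taskgraph ships.

def chunkify(items, this_chunk, total_chunks):
    """Return the slice of `items` belonging to chunk `this_chunk` (1-based)."""
    per_chunk, remainder = divmod(len(items), total_chunks)
    start = per_chunk * (this_chunk - 1) + min(this_chunk - 1, remainder)
    end = start + per_chunk + (1 if this_chunk <= remainder else 0)
    return items[start:end]

# Made-up locale list; the same assignment would drive the l10n, beetmover and balrog task labels.
locales = ["ach", "af", "an", "ar", "ast", "az", "be", "bg", "bn", "br"]
for chunk in range(1, 4):
    print(chunk, chunkify(locales, chunk, 3))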

Component: Release Automation: Other → General
QA Contact: sfraser → catlee
Summary: investigate the usage of individual checksums signing jobs → [tracking] Reduce release Graph End-to-End times

To recap, since I grabbed this bug earlier this week: the ideas sketched out in Whistler and so far in this bug largely relate to:

  1. checksums
    a) save time

  2. chunkification
    a) save time

Expanding on 1), this includes removing the so-far useless signing of individual checksums to save computational time, but also enhancing security by ensuring the big-fat SHASUMS files are CoT protected.

Expanding on 2), we want to chunkify to reduce the number of jobs, and hence the dependency edges in the graph. But this might not necessarily translate into time saved, as some of these jobs might still block each other.

re: 2), I think we also wanted to have beetmover and balrog chunking match the l10n chunking

So far the ideas that were floating around were mainly related to checksums and chunkification. Since we might discover other potential improvements along the way, I suggest we keep this bug as a meta bug and file individual ones for each chapter we want to optimize.

Depends on: 1567429
Depends on: 1567431

Two ideas to help move this investigation forward:

  1. we need to create the perfect benchmark - the end-to-end time that we know we can't beat (not taking into consideration hardware upgrades or more parallelization). So basically do a breadth-first pass deep in the graph, offset each task to its latest-finishing dependency - as if there was no waiting time - and compute the end-to-end boundary. We know for sure that's a limit we can't beat. Once we have that, we can better measure the extent to which we can optimize the graph (a sketch of this computation follows after this list)

  2. we need to look again at the layout of tasks (similarly to what we did in Whistler around the checksums conversation) - the graph layout - to see if there's room to reorganize, reoptimize and redo some of the nodes/edges so that we can get better timings. A good example is checksums: if we redo the logic, we may end up with a similar cost but a smaller graph. There could be similar cases.
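To make idea 1 concrete, here's a minimal sketch (not our actual tooling) of that lower-bound computation: given per-task durations and the dependency edges, assume every task starts the moment its latest dependency finishes and take the maximum finish time. The task names and durations below are made up for illustration.

from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical durations (minutes) and dependency edges, for illustration only.
durations = {"build": 60, "signing": 10, "beetmover": 15, "balrog": 5, "final": 2}
depends_on = {
    "signing": {"build"},
    "beetmover": {"signing"},
    "balrog": {"beetmover"},
    "final": {"balrog", "signing"},
}

def lower_bound(durations, depends_on):
    finish = {}
    # Walk tasks in dependency order; each starts when its slowest dependency ends.
    for task in TopologicalSorter(depends_on).static_order():
        start = max((finish[dep] for dep in depends_on.get(task, ())), default=0)
        finish[task] = start + durations[task]
    return max(finish.values())

print(lower_bound(durations, depends_on))  # 92 minutes for this toy graph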

Depends on: 1572102
Group: mozilla-employee-confidential
Depends on: 1579067
Assignee: nobody → mtabara
Depends on: 1533337
Blocks: 1590054

Conclusions - https://mihaitabara.github.io/2019/11/21/release-end-to-end-reduced-by-40-percent.html
Follow-ups will be addressed in dedicated bugs.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED