[tracking] Reduce release Graph End-to-End times
Categories
(Release Engineering :: General, task)
Tracking
(Not tracked)
People
(Reporter: mtabara, Assigned: mtabara)
References
(Depends on 2 open bugs)
Details
While doing some cleanup for bug 1530728, I was glancing over our release artifacts and realized there are currently no consumers for the individual checksums' detached signatures.
To recap:
- beetmover submits files X, Y to S3 and generates a target.checksums at the end
- checksums-signing consumes that target.checksums and signs it, providing a target.checksums and a detached signature target.checksums.asc
- beetmover-checksums-signing consumes the two from above and transfers them under the beetmover-checksums folder in the root of the candidates directory (e.g. for 66.0.5), slightly pretty-named:
  - target.checksums -> firefox-{version}.beet
  - target.checksums.asc -> firefox-{version}.checksums.asc
As a follow-up, the release-generate-checksums job iterates over the public S3 folder, reads the contents of the beet files, and concatenates them into big SHA{256,512}SUMS files. Those later get signed and beetmoved into the root of the candidates directory.
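To make that follow-up concrete, here is a minimal sketch of what the concatenation step amounts to. The function name and local paths are hypothetical, the real job reads the beet files from S3, and the exact target.checksums line format may differ:

```python
import os

def generate_shasums(beet_dir, hash_type, output_path):
    # Each beet file is assumed to contain lines of the form
    # "<digest> <hash_type> <size> <filename>"; keep only the
    # requested hash type and emit "<digest>  <filename>" lines.
    entries = []
    for name in sorted(os.listdir(beet_dir)):
        if not name.endswith(".beet"):
            continue
        with open(os.path.join(beet_dir, name)) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 4 and parts[1] == hash_type:
                    entries.append(f"{parts[0]}  {parts[3]}")
    with open(output_path, "w") as out:
        out.write("\n".join(entries) + "\n")

# generate_shasums("beetmover-checksums/", "sha256", "SHA256SUMS")
# generate_shasums("beetmover-checksums/", "sha512", "SHA512SUMS")
```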
But at this point there is no regular use of the .asc files from S3.
There are three options:
- We drop the .asc files, as they are currently unused.
- We add verification for them in release-generate-checksums, to ensure the files have not been tampered with in S3 (see the sketch after this list).
- We rewrite the existing mozharness script as a scriptworker that performs CoT verification and downloads the files from upstream tasks instead. However, that might result in a huge task payload, since there are currently 992 individual checksums files for 66.0.5, for example.
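For the second option, the verification could look roughly like the following sketch, assuming the release signing public key is already imported into the local keyring and the beet/.asc pairs have been downloaded from S3 (paths and naming simplified to match the pretty-named files above):

```python
import glob
import subprocess

def verify_detached_signatures(checksums_dir):
    # For every firefox-*.beet file, verify its matching
    # firefox-*.checksums.asc detached signature with gpg;
    # check=True raises CalledProcessError on any bad signature.
    for beet in sorted(glob.glob(f"{checksums_dir}/firefox-*.beet")):
        asc = beet[: -len(".beet")] + ".checksums.asc"
        subprocess.run(["gpg", "--verify", asc, beet], check=True)
```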
Assignee | Comment 1 • 5 years ago
Reusing this bug, as this conversation happened at the all-hands. This is to track all the work that's to be done, including but not limited to:
- a clear plan with measurable small wins
- beetmover and checksums improvements work
- balrog improvements work
More bugs are to be filed against this, but for now, dropping some of the ideas:
- Stop producing target.checksums files from BM jobs (maybe? or we could use these to grab both the sha512/sha256 hashes for the checksums scriptworker below)
- Remove the checksums-signing and beetmover-checksums-signing jobs
- Have beetmover and balrog chunking match l10n chunking (see the sketch after this list) -- we discovered that balrog jobs take over 5 hours of cumulative run time to complete!
- Create a new checksums scriptworker type to generate the signed SHASUMS files by inspecting all the CoT artifacts for the release (or the target.checksums files)
  -- alternatively, create the SHASUMS file in a generic worker that knows how to do CoT verification, and do regular signing / beetmoving
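To illustrate the chunking idea from the third bullet, a minimal sketch of splitting a locale list into a fixed number of chunks, so that beetmover/balrog tasks could mirror the l10n chunks. The function name, locale list, and chunk count are made up for the example:

```python
def chunkify(items, n_chunks):
    # Split items into n_chunks contiguous slices of near-equal size,
    # the same way l10n repacks are split across chunked tasks.
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

locales = ["ach", "af", "an", "ar", "ast", "az", "be", "bg", "bn", "br"]
for i, chunk in enumerate(chunkify(locales, 4), start=1):
    print(f"balrog-{i}: {chunk}")
```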
Assignee | Comment 2 • 5 years ago
To recap, as I grabbed this bug earlier this week: the ideas that have been sketched in Whistler and so far in this bug largely relate to:
1) checksums - save time
2) chunkification - save time
To extrapolate on 1): this includes removing the so-far useless signing of individual checksums to save computational time, but also enhancing security by ensuring the big SHASUMS files are CoT protected.
To extrapolate on 2): we want to chunkify to reduce the number of jobs, and hence the dependency edges in the graph. But this does not necessarily mean time saved, as some of these jobs might still block each other.
Comment 3 • 5 years ago
Re: 2), I think we also wanted to have beetmover and balrog chunking match the l10n chunking.
Assignee | Comment 4 • 5 years ago
So far, the ideas floating around have mainly related to checksums and chunkification. Since we might discover other potential improvements along the way, I suggest we keep this bug as a meta bug and file individual ones for each area we want to optimize.
Assignee | Comment 5 • 5 years ago
Two ideas to improve our search here:
- We need to establish the perfect benchmark: the end-to-end time we know we can't beat (not taking into consideration hardware upgrades or more parallelization). So basically, walk the graph, offset each task to start right after its latest-finishing dependency - as if there were no waiting time - and compute the resulting end-to-end bound (see the sketch after this list). We know for sure that's a limit we can't go below. Once we have that, we can better measure the extent to which we can optimize the graph.
- We need to look again at the layout of tasks (similar to what we did in Whistler around the checksums conversation) - the graph layout - to see if there's room to reorganize, reoptimize, and redo some of the nodes/edges so that we can get better timings. A good example is checksums: if we redo the logic, we may keep a similar cost but end up with a smaller graph. There could be similar cases.
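A minimal sketch of that lower-bound computation, assuming we already have each task's run time and dependencies. The task data below is made up; in practice it would come from the actual release graph:

```python
from functools import lru_cache

# Hypothetical fragment of a release graph:
# task name -> (run time in minutes, list of dependencies)
TASKS = {
    "build": (60, []),
    "l10n-repack": (30, ["build"]),
    "beetmover": (10, ["l10n-repack"]),
    "checksums": (5, ["beetmover"]),
    "balrog": (5, ["beetmover"]),
}

@lru_cache(maxsize=None)
def earliest_finish(task):
    # Earliest possible finish if every task starts the instant its
    # latest dependency completes, i.e. zero scheduling/wait time.
    run_time, deps = TASKS[task]
    return run_time + max((earliest_finish(d) for d in deps), default=0)

# The end-to-end lower bound is the length of the critical path.
print(max(earliest_finish(t) for t in TASKS), "minutes")
```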
Assignee | Comment 6 • 5 years ago
Conclusions - https://mihaitabara.github.io/2019/11/21/release-end-to-end-reduced-by-40-percent.html
Follow-ups will be addressed in dedicated bugs.