Open Bug 1567429 Opened 4 years ago Updated 3 years ago

redesign the way we do checksums in the release graphs

Categories

(Release Engineering :: General, task)

Tracking

(firefox70 fixed)

REOPENED

People

(Reporter: mtabara, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: leave-open)

Attachments

(1 file)

This mainly focuses on two things:
a) saving time
b) fixing a security hole

To elaborate: this includes removing the so-far useless signing of individual checksums to save computational time, but also fixing the security hole we currently have, namely that the big SHASUMS checksums files are generated by a docker-worker that reads the public beet files we already published to archive.mozilla.org. If anyone maliciously replaced one of those files, we wouldn't catch it, since the signatures are not actually used anywhere in the release-generate-checksums job that runs in docker-worker.

More conversations have happened, but I'll paste the Q3 plan here in a bit, for consistency.

Stage 1 (optimize):

What’s the goal?
  • Trim the graph NOW of unwanted/unneeded tasks. Conclude whether chunkification is possible and, if so, how. Measure things.
What does it mean for me?
  • get familiar with pandas so we can measure things better
  • measure the compute time spent running checksums-signing in non-nightly graphs
  • remove checksums-signing from the taskgraph for non-nightly graphs
  • measure again with pandas to confirm the savings (see the sketch after this list)
  • explore the chunkification process and its implications for manifest.json and the beetmover payload
When should it be ready?
  • End of July.
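
For the "measure with pandas" bullets above, a minimal sketch of the kind of script involved (this is not the gist linked further down; the Queue URL, the task group id, and the use of the "kind" tag are assumptions):

# Rough sketch: pull all runs of a task group into pandas so we can measure
# per-kind runtimes. The endpoint and task group id are placeholders.
import pandas as pd
import requests

QUEUE = "https://firefox-ci-tc.services.mozilla.com/api/queue/v1"
TASK_GROUP_ID = "REPLACE_WITH_DECISION_TASK_GROUP_ID"

def iter_task_runs(task_group_id):
    """Yield (kind, started, resolved) for every run that started and resolved."""
    params = {}
    while True:
        resp = requests.get(f"{QUEUE}/task-group/{task_group_id}/list", params=params)
        resp.raise_for_status()
        data = resp.json()
        for entry in data["tasks"]:
            kind = entry["task"].get("tags", {}).get("kind", "unknown")
            for run in entry["status"].get("runs", []):
                if "started" in run and "resolved" in run:
                    yield kind, run["started"], run["resolved"]
        if "continuationToken" not in data:
            break
        params = {"continuationToken": data["continuationToken"]}

df = pd.DataFrame(iter_task_runs(TASK_GROUP_ID), columns=["kind", "started", "resolved"])
df["started"] = pd.to_datetime(df["started"])
df["resolved"] = pd.to_datetime(df["resolved"])
df["runtime"] = df["resolved"] - df["started"]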

Stage 2 (partial replacement):

What’s the goal?
  • move the logic from the docker-worker release-generate-checksums task into a separate behavior in beetmoverscript.
What does it mean for me?
  • move the logic from release-generate-checksums into a behavior in beetmoverscript (see the sketch after this list)
  • retire the aforementioned mozharness script that runs in docker-worker
  • keep the existing signing pieces and the beetmoving to S3 buckets
When should it be ready?
  • Mid-August.
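
As a rough illustration of what that beetmoverscript behavior boils down to (the function name, input shape and output names are made up; the real behavior would read the upstream artifacts from the task payload rather than the published files):

# Illustrative only: build SHA256SUMS / SHA512SUMS content from local artifact
# paths and their destination names on archive.mozilla.org.
import hashlib

def generate_big_checksums(artifacts):
    """artifacts: iterable of (local_path, destination_name) pairs."""
    lines = {"sha256": [], "sha512": []}
    for local_path, dest_name in artifacts:
        with open(local_path, "rb") as fh:
            data = fh.read()
        for algo in ("sha256", "sha512"):
            lines[algo].append(f"{hashlib.new(algo, data).hexdigest()}  {dest_name}")
    return {
        "SHA256SUMS": "\n".join(sorted(lines["sha256"])) + "\n",
        "SHA512SUMS": "\n".join(sorted(lines["sha512"])) + "\n",
    }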

Stage 3 (full replacement):

What’s the goal?
  • move signing into beetmover. Fully consolidate all checksums work into a single task.
What does it mean?
  • move the Autograph signing bits into beetmover (a signing sketch follows this list)
  • add signing credentials to beetmover, per product
  • kill release-generate-checksums-signing and its beetmover counterpart
  • roll out to all release branches
When should it be ready?
  • Mid-September
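
The actual signing would go through Autograph with per-product credentials; as a stand-in for the output we need (a detached, ASCII-armored SHA512SUMS.asc next to the checksums file), here is a sketch using the gpg CLI, with a placeholder key id:

# Sketch only: produce SHA512SUMS.asc as a detached armored signature.
# Production would call Autograph instead of shelling out to gpg.
import subprocess

def sign_checksums_file(path, key_id="RELEASE_SIGNING_KEY_PLACEHOLDER"):
    subprocess.run(
        ["gpg", "--local-user", key_id, "--armor", "--detach-sign",
         "--output", f"{path}.asc", path],
        check=True,
    )

sign_checksums_file("SHA512SUMS")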

Stage 4 (optimizations for Q4/2020 and the future):

At some point we can get checksums from CoT (Chain of Trust) artifacts. However, to do this:

  • we need to translate from simple names to pretty names (declarative); alternatively, we can actually build with pretty names, which is also possible with declarative artifacts.
  • we need to know which task to use for which artifact (unsigned vs signed vs repackaged vs repackage-signed), also helped by declarative artifacts
  • we need sha512sums in the CoT artifacts, or we drop sha512sums; guessing the former would be better. This requires patches in all worker implementations.
  • we need file sizes in the CoT artifacts, or we drop file sizes; again, guessing the former is better. This also requires patches in all worker implementations. (A sketch of the per-artifact data this implies follows this list.)
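
To illustrate the last two bullets, this is roughly the per-artifact metadata a worker would have to record (field names are illustrative, not the real chain-of-trust schema):

# Illustrative only: what a worker would need to record per artifact so that
# checksums could later be derived from CoT data alone.
import hashlib
import os

def describe_artifact(path):
    h256, h512 = hashlib.sha256(), hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h256.update(chunk)
            h512.update(chunk)
    return {
        "sha256": h256.hexdigest(),
        "sha512": h512.hexdigest(),     # missing from CoT artifacts today
        "size": os.path.getsize(path),  # likewise missing today
    }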
Depends on: 1569220

I started working on this last week. The plan so far is:
a) prep patches to remove checksums-signing as a kind altogether and link beetmover-checksums directly to beetmover or similar (this would work on all release branches except nightly)
b) use redash to measure whether the beet files and checksums signatures on nightly are actually used. If they are not, we stop pushing them as well. Instead we add the big SHA512SUMS files.
c) for now, we push the big checksums files in both the l10n and the en-US dirs, but going forward we want to centralize them in a single home, as tracked by bug 1569220.
d) explore what we do with the latest dirs in this case and whether we need to upload files there too

Remove individual checksums signing in favor of big fat checksums

Stats time:

Did some measuring with https://gist.github.com/MihaiTabara/8fcf36f4132428b2dbe873753439bf37 to work out how much time we spend on checksums-signing. The goal is to measure the following metrics before and after we remove checksums-signing (a sketch of the computation follows this list):

a) end-to-end time of a promote graph
b) runtime savings
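
For reference, the computation behind the numbers below is essentially the following (assuming the DataFrame from the earlier sketch; this approximates what the gist does, it is not its actual code):

# df has columns kind, started, resolved, runtime (see the earlier sketch).
end_to_end = df["resolved"].max() - df["started"].min()   # a) end-to-end graph time
total_runtime = df["runtime"].sum()                       # total compute across all tasks
checksums = df[df["kind"] == "checksums-signing"]["runtime"].sum()
print(f"Graph took {end_to_end} end-to-end to run")
print(f"All graph tasks runtime: {total_runtime}")
print(f"Checksums-signing runtime: {checksums} <-- represents {checksums / total_runtime * 100} %")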

Use cases

  1. Firefox nightly central graph - using this Nd (nightly decision task) from the 1st of August.
Graph took 4:21:02.320000 end-to-end to run
All graph tasks runtime: 8 days 09:15:00.851000
Checksums-signing runtime: 0 days 03:27:45.307000 <-- represents 1.720537247512495 %
  2. Firefox Beta 68.0b14 build 1 - using this graph
Graph took 4:22:34.764000 end-to-end to run
All graph tasks runtime: 16 days 11:50:07.365000
Checksums-signing runtime: 0 days 02:42:30.811000 <-- represents 0.6842639020325344 %
  3. Firefox release 68.0 build 1 - using this graph
Graph took 3 days, 20:41:47.732000 end-to-end to run
All graph tasks runtime: 17 days 23:26:59.481000
Checksums-signing runtime: 0 days 02:41:12.245000 <-- represents 0.6227223594808879%
  4. Firefox ESR 68.0.1esr build 1 - using this graph
All graph tasks runtime: 10 days 13:56:13.077000
Checksums-signing runtime: 0 days 02:26:31.688000 <-- represents 0.961709354737429%
  5. Firefox 69.0b8 build 1 - using this graph
Graph took 13:47:32.609000 end-to-end to run
All graph tasks runtime: 18 days 05:59:39.186000
Checksums-signing runtime: 0 days 02:39:06.745000 <-- represents 0.6054585882896099%
Group: mozilla-employee-confidential, releng-security

Beta Releases

Firefox 69.0b11 build1

Graph took 5:01:08.463000 end-to-end to run
All graph tasks runtime: 16 days 10:22:02.959000
checksums-signing runtime: 0 days 02:59:35.592000 <-- represents 0.7589925859612727%

69.0b10 build1

Graph took 5:14:02.655000 end-to-end to run
All graph tasks runtime: 16 days 16:34:44.867000
checksums-signing runtime: 0 days 03:00:34.260000 <-- represents 0.7512914286756744%

69.0b9 build4

Graph took: unfinished graph ----
All graph tasks runtime: 18 days 06:38:19.203000
checksums-signing runtime: 0 days 02:39:54.577000 <-- represents 0.6075981155441061%

69.0b8 build1

Graph took 13:47:32.609000 end-to-end to run
All graph tasks runtime: 18 days 05:59:39.186000
checksums-signing runtime: 0 days 02:39:06.745000 <-- represents 0.6054585882896099%

69.0b7 build1

Graph took 4:21:24.885000 end-to-end to run
All graph tasks runtime: 11 days 08:04:47.344000
checksums-signing runtime: 0 days 02:30:20.376000 <-- represents 0.9209282851131969%

68.0b3 build 2

Graph took 7:11:56.910000 end-to-end to run
All graph tasks runtime: 10 days 22:45:47.680000
checksums-signing runtime: 0 days 02:08:05.061000 <-- represents 0.8124192450051783%

68.0b6 build 1

Graph took 3:22:02.059000 end-to-end to run
All graph tasks runtime: 10 days 21:30:11.511000
checksums-signing runtime: 0 days 02:34:10.745000 <-- represents 0.98264625956969%

68.0b9 build 1

Graph took 4:12:35.742000 end-to-end to run
All graph tasks runtime: 15 days 11:09:12.185000
checksums-signing runtime: 0 days 02:38:26.054000 <-- represents 0.711449946100264%

68.0b13

Graph took 4:56:10.238000 end-to-end to run
All graph tasks runtime: 15 days 15:24:04.886000
checksums-signing runtime: 0 days 02:47:54.187000 <-- represents 0.7454382420149984%

68.0b11

Graph took 7:32:10.382000 end-to-end to run
All graph tasks runtime: 15 days 01:12:11.379000
checksums-signing runtime: 0 days 02:43:46.162000 <-- represents 0.7556659908920033%

67.0b3 build 1

Graph took 6:23:28.315000 end-to-end to run
All graph tasks runtime: 11 days 06:03:42.940000
checksums-signing runtime: 0 days 03:28:48.512000 <-- represents 1.2886459971824982%

67.0b5 build 1

Graph took 17:10:15.097000 end-to-end to run
All graph tasks runtime: 27 days 12:54:36.297000
checksums-signing runtime: 1 days 11:11:34.240000 <-- represents 5.32490657599318%

67.0b12 build 1

Graph took 4:18:26.685000 end-to-end to run
All graph tasks runtime: 15 days 20:09:14.322000
checksums-signing runtime: 0 days 03:29:19.307000 <-- represents 0.9177061369143082%

67.0b9 build 1

Graph took 21:57:32.372000 end-to-end to run
All graph tasks runtime: 15 days 23:50:45.966000
checksums-signing runtime: 0 days 03:31:39.037000 <-- represents 0.9189907784555489%

67.0b6 build 1

Graph took 3:56:08.415000 end-to-end to run
All graph tasks runtime: 11 days 07:02:12.204000
checksums-signing runtime: 0 days 03:37:07.414000 <-- represents 1.3351423624837129%

Releases

68.0 build3

Graph took 3:14:07.620000 end-to-end to run
All graph tasks runtime: 17 days 18:29:01.958000
checksums-signing runtime: 0 days 02:46:52.638000 <-- represents 0.6521438398676264%

67.0 build 1

Graph took 2 days, 23:02:22.947000 end-to-end to run
All graph tasks runtime: 18 days 03:35:00.649000
checksums-signing runtime: 0 days 03:18:46.087000 <-- represents 0.7605434643245276%

67.0.1 build 1

Graph took 3:01:24.811000 end-to-end to run
All graph tasks runtime: 16 days 02:08:02.140000
checksums-signing runtime: 0 days 03:19:50.580000 <-- represents 0.8625806817430228%

66.0.5 build 1

Graph took 3:23:18.731000 end-to-end to run
All graph tasks runtime: 17 days 18:58:22.114000
checksums-signing runtime: 0 days 03:08:05.162000 <-- represents 0.7341842742400911%

I played again with the graphs and data today and found some interesting facts. Using :nthomas's awesome graphs, I computed, per kind (see the sketch after this list):

  • the earliest time a job starts
  • the latest it finishes
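
Assuming the same task DataFrame as in the earlier sketch, that per-kind computation is roughly:

# Per kind: earliest start and latest finish (df from the earlier sketch).
per_kind = df.groupby("kind").agg(
    earliest_start=("started", "min"),
    latest_finish=("resolved", "max"),
)
per_kind["span"] = per_kind["latest_finish"] - per_kind["earliest_start"]
print(per_kind.sort_values("latest_finish"))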

Looking at the data, it turns out signing is the biggest factor slowing us down, followed by beetmover.

The goal is to get the promotion phase into QA's hands earlier, and looking at the graphs, there are two main threads of action slowing down the promotion phase:

  1. the eme-free jobs - which start late because nightly-l10n-signing takes a lot of time
  2. the beetmover-repackage action thread, which is also slow because of nightly-l10n-signing.

Towards the end of both of those threads, the beetmover jobs are also backed up, so we'd need more capacity there.

Conclusions:

  1. Signing and beetmover currently weigh the most in slowing down the promotion phase overall.
  2. It's difficult to measure how much any single job's removal would save, because jobs run in parallel per chunk. But removing things like checksums-signing would definitely free up resources so that other jobs running in parallel go faster. So it's definitely worth doing, even without knowing the impact in advance.
  3. Chunking could give us a slight advantage too, as it would reduce the time spent polling the TC queue and claiming tasks. It's small per task, but all of those added up could amount to a decent win.

Other questions that I have in mind for now:

  1. Do we need beefier docker-workers for the repackage jobs? (both eme-free and normal ones)
  2. The priorities are set across the graph per level. That means, though, that on-push eme-free jobs, for example, have higher priority than their beta release counterparts. Is that something we would like to look into changing?
  3. eme-free jobs are currently taking slightly longer than the beetmover-checksums/repackage action thread. Do we actually need to block on them during the promotion phase? I know they block the big checksums files, but overall, do we actually need them at all?

Status update: just spoke with :Rail in a 1:1. A few weeks/months back he set up some measurements to test which jobs spend the most time pending in the queue. His results found the same thing: signing and beetmover are the major culprits. Optimizing the others is nice to have too, but doesn't solve the major problem.

Agreed with :rail to set those measurements up again so that we can follow and trace our progress over the next weeks, with each new improvement we land in the release graphs (be it autoscaling for docker-workers, trimming unused jobs like checksums-signing, or optimizations in the chunkification process).

The data we collect will give us a good baseline going forward.

(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #7)

  1. the eme-free jobs - which start late because nightly-l10n-signing takes a lot of time

I agree this is generally true, but there was also that regression in 68.0.1 and between 69.0b8 and 69.0b10 for mac signing to watch out for when crunching the data. It might be possible to depend on nightly-l10n for some platforms, but we do so many types of signing in a single job that it would need some careful investigation to see what is carried forward from vanilla en-US, vanilla localised, etc., and how that affects partial updates.

Other questions that I have in mind for now:

  1. Do we need beefier docker-workers for the repackage jobs? (both eme-free and normal ones)

Maaaybe. Repackage uses gecko-3-b-linux, the same as builds, which is c5/c5d/m4/m5/c4.4xlarge. It might be interesting to know how peaky the load is (how many times we hit a "hot" instance vs waiting for capacity to spin up and clone gecko the first time) and what we spend the time on in those jobs.

  2. The priorities are set across the graph per level. That means, though, that on-push eme-free jobs, for example, have higher priority than their beta release counterparts. Is that something we would like to look into changing?

Did you mean repackage instead of eme-free there? If you've found on-push eme-free jobs, can you point me to one? For anyone who hasn't seen it already, see also bug 1572102 for eme-free vs partner priority changes.

  3. eme-free jobs are currently taking slightly longer than the beetmover-checksums/repackage action thread. Do we actually need to block on them during the promotion phase? I know they block the big checksums files, but overall, do we actually need them at all?

Perhaps not. There are minor considerations, like 12 checksums files vs 6 if we split eme-free out into its own checksums, but delivering builds to QE faster would likely trump that.

(In reply to Nick Thomas [:nthomas] (UTC+12) from comment #9)

Did you mean repackage instead of eme-free there? If you've found on-push eme-free jobs, can you point me to one? For anyone who hasn't seen it already, see also bug 1572102 for eme-free vs partner priority changes.

Ah, mea culpa - I saw them in https://treeherder.mozilla.org/#/jobs?repo=mozilla-release&revision=7ece03f6971968eede29275477502309bbe399da but those are filled in by the dot release action task.

Pushed by mtabara@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b57a45d3b476
remove checksums signing from release automation. r=tomprince

:nthomas and I met today to chat about a couple of ideas described in this bug and in the document.

Note to self, conclusions so far, in a nutshell:

  1. signing and beetmover auto-scaling will definitely be the major wins, with real impact on the end-to-end release times
  2. upgrading the docker-workers to beefier instances is not recommended without a proper understanding of what they require. For example, in the repackage jobs, would we need more bandwidth, more CPU, more RAM, or a combination of those? We need to understand very well what takes so long in the jobs' runtime before we upgrade there.
  3. Priority-wise, as bug 1572102 describes, we could lower the priority of some of the partner repacks during a chemspill, for example, to free up resources for the other releases running in parallel (beta, esr60, nightly, etc.). eme-free repacks need to stay as prioritized as the tree, given they block the promotion phase, but the partner repacks can be lowered. Nick started working on this in bug 1572102 \o/
  4. TIL why eme-free repacks are important. I initially thought we could make some optimizations around them to delay their run or remove them from the promotion phase dependency chain, but I was wrong. We need them there.
  5. Chunkification needs to be done top-down and we should start with repackage. This is tricky, as it was initially implemented to expect a single locale, and that propagated to the beetmover and balrog jobs. If we manage to solve this, the downstream kinds will probably follow more easily.
  6. We need to double-check with QA which notification they mostly care about. Do they refresh the S3 URLs? Do they follow the TC graph? How do they know when to start testing? If we learn that, we could possibly optimize notify-promote, or variations of it, to fire earlier in the graph.
  7. Related to 6, Nick had an interesting idea around potentially prioritizing locales instead of simple chunking. Some locales are used more than others, so maybe it would make sense to prioritize those before the rest?
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Pushed by thunderbird@calypsoblue.org:
https://hg.mozilla.org/comm-central/rev/a117aceeecb2
Port Bug 1567429 - Remove checksums-signing from Thunderbird CI. rs=bustage-fix
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---

Removing checksums-signing in the nightly graph saved around 10% of the beetmover-checksums runtime.

beetmover-checksums runtime: 0 days 03:25:54.418000 <-- represents 1.6477520692116214% (after removal)

vs

beetmover-checksums runtime: 0 days 03:43:22.533000 <-- represents 1.8353060331211775% (before removal)

The removal may not be the sole cause of this improvement, though. There are many moving parts, parallelization and competition for resources, so I can't assess it properly yet. We'll need to watch this on beta too, for a couple of releases, until we can measure things properly.

Nevertheless, it's a good step forward.

For the record:
https://hg.mozilla.org/comm-central/rev/a324b7aef2aa
Revert part of port of bug 1567429: Remove accidentally pushed beetmover files. rs=jorgk

Since the previous patch got published, how do we ensure that the bytes on archive.mozilla.org were produced by us?

Previously, the checksums files were signed, which ensured that the checksums were trusted and, transitively, the files listed in them.

Now, it seems that only the *.tar.bz2 files are signed (going by the content of [0]). Could this cause a security issue for the content provided by the other files distributed from archive.mozilla.org?

related issue: https://github.com/mozilla/nixpkgs-mozilla/issues/199

[0] https://archive.mozilla.org/pub/firefox/nightly/2019/08/2019-08-21-21-55-24-mozilla-central/
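
For context, once the signed big checksums files exist for nightly too, a consumer-side check looks roughly like this (a minimal sketch; file names are examples, and SHA512SUMS.asc is assumed to have already been verified with gpg --verify SHA512SUMS.asc SHA512SUMS):

# Sketch: verify a downloaded file against a gpg-verified SHA512SUMS file.
import hashlib

def verify(sha512sums_path, file_path, name_in_sums):
    expected = {}
    with open(sha512sums_path) as fh:
        for line in fh:
            digest, name = line.split(maxsplit=1)
            expected[name.strip()] = digest
    h = hashlib.sha512()
    with open(file_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected[name_in_sums]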

:nbp and I chatted today. It turns out nightly is missing the SHA{256,512}SUMS files[1]. Beta[2] and release do provide them. To me, the fix is to have these files on nightly too. What do you think, Mihai?

[1] https://archive.mozilla.org/pub/firefox/nightly/2019/08/2019-08-21-21-55-24-mozilla-central/
[2] https://archive.mozilla.org/pub/firefox/releases/69.0b16/

See Also: → 1580146

Found in triaging. Not actively working on this, returning to the pool.

Assignee: mtabara → nobody