Bug 1773091 (Open) - Opened 2 years ago, Updated 5 months ago

Expand the time that we keep autoland builds around for so that mozregression still works for regressions older than 1 year

Categories

(Release Engineering :: General, enhancement)


Tracking

(Not tracked)

People

(Reporter: jrmuizel, Assigned: gbrown)

References

Details

bug 1771837 and bug 1772225 are some examples of regressions that fall outside of the 1 year autoland range. Without the autoland builds, it is painful to narrow down the regression window, which makes it much harder and more time-consuming to act on these bugs.

Sylvestre suggests that it would not be too costly to expand the time that we keep the autoland builds around for.

(sorry, obsoleting comment 1 [regarding preserving android/gve builds] since it turns out that's tracked via bug 1763040 and planned followup work there.)

I suspect jrmuizel's request here (preserving autoland builds) falls under the RelEng component, based on ~similar "old-build-preserving" bug 1763040.

--> Reclassifying.

Component: mozregression → General
Product: Testing → Release Engineering
QA Contact: aki
Version: Default → unspecified

What is the proposed new expiration? 366 days? 10 years? What set of binaries/tasks do we need to keep for longer?
I imagine we're hitting this block where we default to 1y for tasks where expires-after isn't specified.

Severity: -- → S3
Flags: needinfo?(jmuizelaar)
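
(For illustration only: a minimal Python sketch of how a 1-year default expiry could be applied when expires-after isn't specified. The real logic lives in taskcluster/gecko_taskgraph/transforms/task.py and differs in detail; the function and key names here are assumptions.)

```python
from datetime import datetime, timedelta, timezone

DEFAULT_EXPIRY = timedelta(days=365)  # the 1y default discussed above

def apply_default_expiry(task_def, expires_after=None):
    """Give a task an absolute 'expires' timestamp.

    Illustrative sketch only; not the actual gecko_taskgraph transform.
    """
    lifespan = expires_after if expires_after is not None else DEFAULT_EXPIRY
    expires = datetime.now(timezone.utc) + lifespan
    # Serialize as an ISO 8601-style UTC timestamp
    task_def["expires"] = expires.strftime("%Y-%m-%dT%H:%M:%SZ")
    return task_def

# Example: a build task with no explicit expires-after gets the 1y default.
print(apply_default_expiry({"metadata": {"name": "build-linux64/opt"}}))
```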

How much do the different lengths of time cost?

We only need to keep the binaries that mozregression uses. Perhaps Zeid can describe more precisely which ones those are.

Flags: needinfo?(jmuizelaar) → needinfo?(zeid)

I believe it's standard S3 pricing until/unless we move to GCP.
https://aws.amazon.com/s3/pricing/ -- I'm guessing 0.021 USD per GB per month? Aiui we dropped from ~4.44 petabytes to ~1.35 petabytes recently after we fixed our cleanup processes, so we're well into the 500 TB+ tier.
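
(Back-of-the-envelope only: converting those totals to a monthly bill at a flat $0.021/GB-month, ignoring S3's tiered pricing and request/transfer costs.)

```python
PRICE_PER_GB_MONTH = 0.021  # flat rate; real S3 pricing is tiered

def monthly_cost_usd(petabytes):
    gigabytes = petabytes * 1024 * 1024  # PB -> GB, binary units
    return gigabytes * PRICE_PER_GB_MONTH

print(f"4.44 PB: ~${monthly_cost_usd(4.44):,.0f}/month")  # ~$97,769
print(f"1.35 PB: ~${monthly_cost_usd(1.35):,.0f}/month")  # ~$29,727
```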

Is that the total amount of S3 storage Mozilla pays for or just the amount of storage for the current year of mozregression used builds?

Flags: needinfo?(aki)

That's the total for Taskcluster artifacts for the FirefoxCI cluster.

Flags: needinfo?(aki)

Another alternative: we could look at adding a set of beetmover tasks to Autoland on-push graphs, possibly to push the minimum set of files needed to a directory in https://archive.mozilla.org/pub/firefox/nightly/ or the like. We'd need to teach mozregression how to find those files.

Flags: needinfo?(zeid)
See Also: → 1234029, 1763040
See Also: → 1773343
Blocks: 1773355

We could expand to two years right now and then take another year to decide what the optimal number is. Every day we wait, more builds are lost.

One build is around 80 MB; if we have three builds per push (do we? Or do we have more?), it adds up to 240 MB per push. If we have around 100 pushes per day, we have 24 GB per day, which is 8750 GB per year. If we multiply that by $0.021 per GB per month, it adds up to around $200 per month.

Do you think my estimation is correct? If so, the cost seems relatively low given the potential benefit.

Flags: needinfo?(cbond)
Flags: needinfo?(aki)
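
(A quick check of the arithmetic in the comment above; all inputs are the assumptions stated there. Note that $0.021/GB is a monthly rate, so the result is the steady-state monthly cost once a full extra year of builds is being retained.)

```python
# Inputs are the assumptions from the comment above.
BUILD_SIZE_GB = 0.080    # ~80 MB per build
BUILDS_PER_PUSH = 3
PUSHES_PER_DAY = 100
PRICE_PER_GB_MONTH = 0.021

per_day_gb = BUILD_SIZE_GB * BUILDS_PER_PUSH * PUSHES_PER_DAY   # 24 GB/day
per_year_gb = per_day_gb * 365                                  # 8,760 GB/year
monthly_cost = per_year_gb * PRICE_PER_GB_MONTH                 # ~$184/month

print(f"{per_day_gb:.0f} GB/day, {per_year_gb:,.0f} GB/year, ~${monthly_cost:.0f}/month")
```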

mozregression also supports debug and asan builds, fwiw. And mobile IIRC.

Only keeping debug and asan builds for a year seems ok if we want to keep the costs down. I've never run into an old regression that needed one of those builds.

(In reply to Marco Castelluccio [:marco] from comment #11)

One build is around 80 MB; if we have three builds per push (do we? Or do we have more?), it adds up to 240 MB per push. If we have around 100 pushes per day, we have 24 GB per day, which is 8750 GB per year. If we multiply that by $0.021 per GB per month, it adds up to around $200 per month.

Do you think my estimation is correct? If so, the cost seems relatively low given the potential benefit.

If we ditch all non-opt builds, we have 4 builds per push (build-linux64, build-macosx64, build-win32, build-win64).
We always keep all artifacts for a given task for the lifespan of the task [1].
I ran a test script to download all the artifacts from those builds from this push and came up with 4 GB of build artifacts per push.

If we have around 100 pushes per day, we have 400 GB per day, which is 146,000 GB per year. If we multiply that by $0.021 per GB per month, it adds up to around $3,000 per month, assuming we're keeping the builds around for an extra year. If we choose some other number, like 10 years or forever, that number will increase.

Also, I'm not entirely sure if we can keep some tasks around for longer than, say, their decision task, so we may need to increase the lifespan of a number of tasks, depending.

Because of some of the taskgraph complexity involved here, I'm somewhat leaning towards adding an autoland beetmover task to move the subset of artifacts we want to keep to https://archive.mozilla.org/pub/firefox/ somewhere -- we may be able to significantly reduce that 4GB number. This will take some engineering from Releng to add the beetmover task, and some engineering from the mozregression team to know how to look at that location.

Flags: needinfo?(aki)
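
(The same arithmetic with the measured ~4 GB/push figure, and how the steady-state monthly cost scales with the extra retention period; inputs come from the comment above.)

```python
# Inputs from the comment above: 4 builds/push, ~4 GB of artifacts per push.
GB_PER_PUSH = 4
PUSHES_PER_DAY = 100
PRICE_PER_GB_MONTH = 0.021

per_year_gb = GB_PER_PUSH * PUSHES_PER_DAY * 365   # 146,000 GB per extra year retained
for extra_years in (1, 2, 10):
    monthly = per_year_gb * extra_years * PRICE_PER_GB_MONTH
    print(f"+{extra_years}y retention: ~${monthly:,.0f}/month")
# +1y: ~$3,066/month; +2y: ~$6,132/month; +10y: ~$30,660/month
```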

(In reply to Aki Sasaki [:aki] (he/him) (UTC-6) from comment #14)

(In reply to Marco Castelluccio [:marco] from comment #11)

One build is around 80 MB; if we have three builds per push (do we? Or do we have more?), it adds up to 240 MB per push. If we have around 100 pushes per day, we have 24 GB per day, which is 8750 GB per year. If we multiply that by $0.021 per GB per month, it adds up to around $200 per month.

Do you think my estimation is correct? If so, the cost seems relatively low given the potential benefit.

If we ditch all non-opt builds, we have 4 builds per push (build-linux64, build-macosx64, build-win32, build-win64).
We always keep all artifacts for a given task for the lifespan of the task [1].

Couldn't we change that to have different expirations for different artifacts and only keep the actual build artifact (e.g. target.tar.bz2 on Linux)?

Hey there! As far as I remember, artifact expiration is defined per task. In other words, if we wanted to implement different expiration, then we would need to get the build artifacts generated in one task, and the rest in another one.

Would that be a lot of work? Or significant tech debt?

:jlorenzo there seems to be an expires value for each artifact (see https://hg.mozilla.org/mozilla-central/file/eda93d9c342fcb5d7c774e3bf1d391bd977172fc/taskcluster/gecko_taskgraph/transforms/task.py#l526); it's also part of the API (https://docs.taskcluster.net/docs/reference/platform/queue/api#createArtifact and https://docs.taskcluster.net/docs/reference/platform/object/api#createUpload).
Maybe the expiration of an artifact can't be longer than the expiration of its task? If so, we could set a 2-year expiration for tasks and the build artifacts, and set a shorter one (1 year or even less) for the other artifacts.

For example, in bugbug, I have a task that is expiring in one year, with one artifact expiring in one month and another artifact expiring in one year: https://github.com/mozilla/bugbug/blob/ebf06e9c18ff45a6719f8e4f959af4569d52a32b/infra/data-pipeline.yml#L559.
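
(For illustration, a rough Python rendering of a task definition with per-artifact expirations, in the spirit of the bugbug example above. The exact artifact schema depends on the worker type, so the key names and paths here are assumptions, not a verified Taskcluster payload.)

```python
from datetime import datetime, timedelta, timezone

def in_days(days):
    """Absolute UTC expiry timestamp, `days` from now."""
    return (datetime.now(timezone.utc) + timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%SZ")

# Hypothetical build-task fragment: the task and the installer live for 2 years,
# while the bulkier supporting artifact expires after 1 year.
task_fragment = {
    "expires": in_days(730),
    "payload": {
        "artifacts": [
            {"name": "public/build/target.tar.bz2", "path": "build/target.tar.bz2",
             "type": "file", "expires": in_days(730)},
            {"name": "public/build/target.crashreporter-symbols.zip",
             "path": "build/symbols.zip", "type": "file", "expires": in_days(365)},
        ],
    },
}
```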

I do think it will open the gates to a) a lot of complexity and b) other bugs.
If we set the task and subset of artifacts to live for 2 years, we will have some valid, unexpired tasks in indexes, but anything other than mozregression will be missing upstream artifacts.
Is there something about the beetmover proposal that you're pushing back against?

(In reply to Aki Sasaki [:aki] (he/him) (UTC-6) from comment #20)

I do think it will open the gates to a) a lot of complexity and b) other bugs.
If we set the task and subset of artifacts to live for 2 years, we will have some valid, unexpired tasks in indexes, but anything other than mozregression will be missing upstream artifacts.
Is there something about the beetmover proposal that you're pushing back against?

Oh no, I'm not pushing back against it, I don't even know exactly what it entails in terms of implementation. I was only asking about the other option of having different expirations for different artifacts, since in your initial comment you didn't mention that as a possibility and so I didn't know if you had considered it. From a high level perspective it seemed like a quicker/simpler solution (not requiring an additional task and additional mozregression code), but I trust your judgement and that's why I asked you :)

As a side note, I just remembered there were some discussions as part of the cost reduction project we ran a couple of years ago to define a policy around artifacts; see https://docs.google.com/document/d/1QC2pj5Y1aK95SdA2nHCSmRrT8aXz9BjZVZarElvdZz0/edit. Among other things, it proposed different expirations for different artifacts. Bug 1649987 implemented that, but it was backed out because of two problems; I'm not sure how easy it would be to fix them. I imagine we could use the beetmover solution in that case too and move not only builds but also the other artifacts that require a longer expiration than normal, but then we'd need to update all downstream consumers to be able to retrieve the information from some other place. Anyway, it's a problem for another day.

(In reply to Marco Castelluccio [:marco] from comment #21)

(In reply to Aki Sasaki [:aki] (he/him) (UTC-6) from comment #20)

I do think it will open the gates to a) a lot of complexity and b) other bugs.
If we set the task and subset of artifacts to live for 2 years, we will have some valid, unexpired tasks in indexes, but anything other than mozregression will be missing upstream artifacts.
Is there something about the beetmover proposal that you're pushing back against?

Oh no, I'm not pushing back against it, I don't even know exactly what it entails in terms of implementation. I was only asking about the other option of having different expirations for different artifacts, since in your initial comment you didn't mention that as a possibility and so I didn't know if you had considered it. From a high level perspective it seemed like a quicker/simpler solution (not requiring an additional task and additional mozregression code), but I trust your judgement and that's why I asked you :)

Ok. Yeah, it's definitely more work for two teams to implement, maybe a few weeks on the Releng side to set up the beetmover manifests, set up the task and scopes properly, and coordinate with the product delivery team about a new directory structure on archive.m.o with a separate artifact retention policy. I have fewer long-term worries about this approach, however.

As a side note, I just remembered there were some discussions as part of the cost reduction project we ran a couple of years ago to define a policy around artifacts; see https://docs.google.com/document/d/1QC2pj5Y1aK95SdA2nHCSmRrT8aXz9BjZVZarElvdZz0/edit. Among other things, it proposed different expirations for different artifacts. Bug 1649987 implemented that, but it was backed out because of two problems; I'm not sure how easy it would be to fix them. I imagine we could use the beetmover solution in that case too and move not only builds but also the other artifacts that require a longer expiration than normal, but then we'd need to update all downstream consumers to be able to retrieve the information from some other place. Anyway, it's a problem for another day.

Yes, the issues mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1649987#c17 are precisely what I'm worried about. We may solve one set of issues and create another set that won't hurt us until a year later or more when we've forgotten about what changes we've made to create those issues. And even if the set of artifacts+tasks that we choose for this bug don't cause those problems, the fact that we have a precedent to customize longer expiration times for a subset of artifacts is likely going to lead to a future change that will cause those problems.

(In reply to Aki Sasaki [:aki] (he/him) (UTC-6) (away, back Mon Jul 11) from comment #22)

Yes, the issues mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1649987#c17 are precisely what I'm worried about. We may solve one set of issues and create another set that won't hurt us until a year later or more when we've forgotten about what changes we've made to create those issues. And even if the set of artifacts+tasks that we choose for this bug don't cause those problems, the fact that we have a precedent to customize longer expiration times for a subset of artifacts is likely going to lead to a future change that will cause those problems.

Wild idea: we could run a controlled experiment where we temporarily make the artifacts for which we are planning to have a shorter lifespan unavailable, and see what breaks. It wouldn't be complete, but it could reduce the risk of unexpected future breakages.

We'd likely see bustage in however many months it takes to expire the shorter-lived artifacts, and we wouldn't have a way to fix it. Probably not a great situation.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #13)

Only keeping debug and asan builds for a year seems ok if we want to keep the costs down. I've never run into an old regression that needed one of those builds.

FWIW I just ran into one bug where I would love to have older debug builds available (possibly an exponential-decay sparse set of them). But I agree this is pretty rare.

(In my case, I'm looking at a fuzzer bug with a fatal assertion, which of course only trips in debug builds. It doesn't reproduce in current builds, nor in debug builds from 1 year back which is the oldest I can get via mozregression / artifacts. The bug has notes that suggest it was reproducible 2 years ago, so I'm wishing I could run a ~2-year-old debug build to double-check that I can indeed repro the issue in that older build. This would help me validate that I'm actually testing properly and give me a bit of confidence that the issue has in fact gone away.)

Yeah, we are planning to improve the situation here.
We are currently analyzing what we are storing, what is useful and what is not, and will then define a new storage policy.

(In reply to Aki Sasaki [:aki] (he/him) (UTC-6) from comment #9)

Another alternative: we could look at adding a set of beetmover tasks to Autoland on-push graphs, possibly to push the minimum set of files needed to a directory in https://archive.mozilla.org/pub/firefox/nightly/ or the like. We'd need to teach mozregression how to find those files.

We have indeed years of data (from 2004 here). We should probably make sure that mozregression can use it (maybe already the case):
https://archive.mozilla.org/pub/firefox/nightly/

(In reply to Sylvestre Ledru [:Sylvestre] from comment #28)

We have indeed years of data (from 2004 here). We should probably make sure that mozregression can use it (maybe already the case):
https://archive.mozilla.org/pub/firefox/nightly/

Those are a subset of mozilla-central builds, not autoland builds, which mozregression also needs.

Sure, I was just highlighting this in case folks need to do long regression ranges.

(In reply to Sylvestre Ledru [:Sylvestre] from comment #28)

We have indeed years of data (from 2004 here). We should probably make sure that mozregression can use it (maybe already the case):
https://archive.mozilla.org/pub/firefox/nightly/

Sylvestre has a point. I (officially non-Mozilla) use these builds, as there's a .txt file from each build that shows the revision hash it was built from. This revision hash (since it's m-c) can be mapped to both autoland and mozilla-central.

Do a date bisection on them and you will eventually get a daily range of changes; it may not be as granular as the autoland/mozregression ranges, but for the devs it may just be good enough given the age of the build.

Any bisection range is generally better than none at all.

(In reply to Sylvestre Ledru [:Sylvestre] from comment #28)

(In reply to Aki Sasaki [:aki] (he/him) (UTC-6) from comment #9)

Another alternative: we could look at adding a set of beetmover tasks to Autoland on-push graphs, possibly to push the minimum set of files needed to a directory in https://archive.mozilla.org/pub/firefox/nightly/ or the like. We'd need to teach mozregression how to find those files.

We have indeed years of data (from 2004 here). We should probably make sure that mozregression can use it (maybe already the case):
https://archive.mozilla.org/pub/firefox/nightly/

Aiui mozregression already supports nightlies. If this is sufficient, we can RESO WORKSFORME this bug.
Comment #0 appears to suggest that without the granular autoland per-push information, we can only bisect to nightlies, which can be separated by many dozens of commits... so if we want to support more granular bisection for over a year, I suggest we beetmove more binaries from autoland to archive.m.o with a lifespan of >1y. Otherwise, WORKSFORME.

(In reply to Aki Sasaki [:aki] (he/him) (UTC-6) from comment #32)

Aiui mozregression already supports nightlies. If this is sufficient, we can RESO WORKSFORME this bug.

Just using nightlies is not sufficient.

A few things were overlooked; here are the shippable builds we have:
linux64 shippable (linux32 is central only)
osx cross shippable
osx aarch64 shippable
win32 shippable
win64 shippable
win aarch64 shippable
android lite shippable
android lite arm7 shippable
android lite aarch64 shippable
android arm7 shippable
android aarch64 shippable
android x86 shippable
android x86_64 shippable

That is 12 builds, not 4, so our math needs to be 3x. BUT...

We do not do builds on every push; in fact, shippable builds seem to run every few pushes, and we don't do all shippable builds per push, so each platform will have different revisions.

Autoland introduces variability: assuming we have 100 pushes/day, how many are landings + backouts + relandings? How many are test-only, or DONTBUILD? In the end we are looking at a small subset of builds that meet our definition. I would recommend taking backstop pushes; these are where we schedule all tasks (and all builds). I counted 7 of those for August 31st.

Now our math looks more manageable, and going with backstop pushes we seem to have all our builds, so our mozregression story should be consistent.

One other consideration: do we need ALL 12 shippable builds? Do we need shippable only, or other builds too? I think this could lead to a more productive discussion (outside this bug) about what is tier 1 vs tier 2. For example, linux32 is only run on central; should win/aarch64 be? What about some of the android permutations? I would vote we keep this bug about shippable builds on autoland being available for a duration > 1 year and leave discussions about specific builds/types to another bug.
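
(A purely illustrative recalculation under the backstop-based framing above. The ~100 MB installer size is an assumption, not a measured figure; the build count and backstop frequency come from the comment.)

```python
# Illustrative only: installer size is assumed, other inputs are from the comment above.
INSTALLER_GB = 0.1          # assumed ~100 MB per shippable installer
BUILDS_PER_PUSH = 12        # the shippable matrix listed above
BACKSTOPS_PER_DAY = 7       # counted for August 31st
PRICE_PER_GB_MONTH = 0.021

per_year_gb = INSTALLER_GB * BUILDS_PER_PUSH * BACKSTOPS_PER_DAY * 365
monthly = per_year_gb * PRICE_PER_GB_MONTH
print(f"~{per_year_gb:,.0f} GB per extra year retained, ~${monthly:,.0f}/month")
# roughly 3,066 GB and ~$64/month for each extra year retained
```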

Summary: Expand the time that we keep autoland builds around for so that mozregression still work for regresssion older than 1 year → Expand the time that we keep autoland builds around for so that mozregression still work for regressions older than 1 year

I think mozregression will use opt non-shippable builds from autoland, so it's not just shippable builds we are considering.

(In reply to Joel Maher ( :jmaher ) (UTC -0800) from comment #34)

One other consideration: do we need ALL 12 shippable builds? Do we need shippable only, or other builds too? I think this could lead to a more productive discussion (outside this bug) about what is tier 1 vs tier 2. For example, linux32 is only run on central; should win/aarch64 be? What about some of the android permutations? I would vote we keep this bug about shippable builds on autoland being available for a duration > 1 year and leave discussions about specific builds/types to another bug.

android lite shippable
android lite arm7 shippable
android lite aarch64 shippable

We don't need to keep any Android "lite" builds. They're a test configuration that we don't use in Fenix or Focus. (We stopped building and testing the debug "lite" builds in bug 1778172.)

android arm7 shippable
android x86 shippable

I don't think we need to keep the arm7 and (32-bit) x86 builds.

So in the end, the only Android builds we need to keep are:

android aarch64 shippable
android x86_64 shippable

Flags: needinfo?(jmaher)

Thanks for the info, :cpeterson. The list is reduced to shippable (opt if not available):
linux64
osx cross
osx aarch64
win32
win64
win aarch64
android aarch64
android x86_64

The target is 2 years of retention for autoland.

Questions:

  1. I assume we only need the installer package, nothing else related to these build configurations.
  2. Is there agreement that we should do backstop-only pushes? If not, is it ok to do some subset of pushes (i.e. !DONTBUILD, !TESTONLY, !backed out, !test/tooling-only changes)?
  3. Who would be the appropriate people to sign off on these changes?
Flags: needinfo?(jmaher)

Who would be the appropriate people to sign off on these changes?

I can (probably)

QA Contact: mozilla → jlorenzo

Joel, can you provide an update on the status of fixing this?

Flags: needinfo?(jmaher)

I had overlooked this. I believe the solution for doing this is one of two things:

  1. Use beetmover to move only the builds to archive.mozilla.org (I don't understand all of that).
  2. Extend the task expiration to 2 years for builds on autoland and ensure all artifacts except the builds expire sooner.

From earlier bug conversations, beetmover seemed to be the ideal path. :jlorenzo, is this something you can look at to determine how to do it, what it would take, and what issues we might have?

Flags: needinfo?(jmaher) → needinfo?(jlorenzo)

I had a look at RELENG-942 (mentioned in comment 23). Aki highlighted the big steps with regard to option #1. I agree with these steps:

Tl;dr: Increasing taskcluster expiration times is dangerous, and may bite us in a year+.

We likely want to solve this by:

  1. Releng + Mozregression team hash out details
  2. At Releng’s request, Product Delivery adds another directory with a custom artifact expiration period
  3. Releng adds an autoland beetmover task + manifest to upload the minimum set of needed Autoland binaries to this new location. We need to know how to upload: by revision-named directory? by datestring-named directory? (If the latter, we may need to upload the info.txt to know what revision this build is from)
  4. Mozregression team updates mozregression to look at this new location for builds.
Flags: needinfo?(jlorenzo)

Is there anything blocking "Releng + Mozregression team hash out details" from happening? Can someone own making that happen?

Flags: needinfo?(jlorenzo)

Just wanted to chime in and say that, as someone who investigates platform regressions from time to time, ending up with regression windows that are older than 1 year, and thus only have nightly granularity (i.e. the window often includes on the order of 50 pushes), is a recurring pain point. Increasing the retention period, such as from 1 to 2 years, would be a significant improvement.

The suggestion to limit the retained artifacts to a subset of platforms and build types listed in comment 37 sounds fine to me as a way to reduce the storage costs involved.

Comment 37 and comment 41 seem very helpful; I'd like to try to sort this out.

Zeid: Can you help clarify the mozregression needs? (Maybe just comment 37, "...We need to know how to upload: by revision-named directory? by datestring-named directory?")

Assignee: nobody → gbrown
Flags: needinfo?(jlorenzo) → needinfo?(zeid)

Currently, we find a changeset from a date range via the pushlog, and then we do a lookup on taskcluster by branch, build type, and changeset, before fetching the artifacts from that task. I think naming directories after changesets would mean that we could continue with the pushlog approach, and instead of searching for a task and artifacts on taskcluster, we would fetch the builds directly from that directory.

We do, however, also use the build date, which we currently fetch from the task (for integration builds), so we will still need a way to get this info. Assuming that the tasks may expire and become unavailable, it would make sense to store this as part of the upload itself. So perhaps we should include both the changeset and the build date in the directory name, or nest the two.

Here is a proposal as an example (task for reference):

/edac68e2456cc823720ee9c6915784191d82ad2e/2022-11-01-00-06-51-764000/build-macosx64/opt/target.dmg

Hopefully this helps?

Flags: needinfo?(zeid)
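
(A sketch of how mozregression could resolve builds under the proposed layout. The archive location, directory layout, and helper are hypothetical; only the pushlog query reflects how changesets are found today.)

```python
import requests

ARCHIVE_BASE = "https://archive.mozilla.org/pub/firefox/autoland-builds"  # hypothetical location
PUSHLOG = "https://hg.mozilla.org/integration/autoland/json-pushes"

def candidate_build_urls(start_date, end_date, build_dir="build-macosx64/opt", filename="target.dmg"):
    """Map autoland pushes in a date range to hypothetical archive URLs.

    The /<revision>/<build-date>/ layout mirrors the proposal above; in practice
    the build date would have to be discovered from the upload itself.
    """
    pushes = requests.get(PUSHLOG, params={"startdate": start_date, "enddate": end_date}).json()
    urls = []
    for push in pushes.values():
        revision = push["changesets"][-1]  # tip changeset of the push
        urls.append(f"{ARCHIVE_BASE}/{revision}/<build-date>/{build_dir}/{filename}")
    return urls

# Example: candidate_build_urls("2022-11-01", "2022-11-02")
```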

That definitely helps. Thanks.

Minor concerns:

  1. That's quite different from the existing pattern for, say, nightlies:
https://archive.mozilla.org/pub/firefox/nightly/2023/10/2023-10-30-16-49-30-mozilla-central/firefox-121.0a1.en-US.mac.dmg
  2. In time, we'll end up with a very large top-level directory for /<revision>.

Both issues are mostly concerns from the perspective of manual browsing convenience...maybe not important since we are doing this just for mozregression.

(In reply to Geoff Brown [:gbrown] from comment #46)

  2. In time, we'll end up with a very large top-level directory for /<revision>.

Both issues are mostly concerns from the perspective of manual browsing convenience...maybe not important since we are doing this just for mozregression.

This sounds similar to what we had with Tinderbox builds, right? Here is an example for mozilla-central builds for Linux:

https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-linux/

  1. That's quite different from the existing pattern for, say, nightlies:
https://archive.mozilla.org/pub/firefox/nightly/2023/10/2023-10-30-16-49-30-mozilla-central/firefox-121.0a1.en-US.mac.dmg

I think it would be good to make it more consistent with existing patterns, as much as possible. So something like this may be better then.

https://archive.mozilla.org/pub/firefox/integration/edac68e2456cc823720ee9c6915784191d82ad2e-autoland/2022-11-01-00-06-51/<filename>.dmg

Where <filename> would contain any other information (e.g., OS, build type, etc.) that we need.

We could also make the date/time part of the filename but that seems more redundant. I can also check what exactly we use the build date/time for, and if it is only displayed for reference purposes then perhaps we don't have to include it (but possibly include push time instead, or something else) -- will get back to you on this.

  2. In time, we'll end up with a very large top-level directory for /<revision>.

If this is a significant problem (though based on :whimboo's comment perhaps it isn't?) we could figure out how to group them by something (maybe the first two digits of the changeset ID or something -- though not sure if those would be uniformly distributed). I don't think grouping by year and/or year and month would work since there are some edge cases where a push date is not going to match the build date.

Since this support has yet to be implemented in mozregression, the exact details from my perspective don't matter too much, as long as we are able to easily search by changeset and also have access to the full date/time, build type, and OS info. So whichever is the most conventional and sensible way you think that this is doable is good with me!
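
(And the equivalent lookup under the alternative, nightly-style layout proposed above; everything below /pub/firefox/integration/ is hypothetical, including the example filename.)

```python
ARCHIVE_BASE = "https://archive.mozilla.org/pub/firefox/integration"  # hypothetical prefix

def integration_build_url(revision, build_datetime, filename):
    """Build the URL for the alternative layout sketched above.

    e.g. integration_build_url("edac68e2456cc823720ee9c6915784191d82ad2e",
                               "2022-11-01-00-06-51",
                               "firefox-108.0a1.en-US.mac.dmg")  # filename carries OS/build-type info
    """
    return f"{ARCHIVE_BASE}/{revision}-autoland/{build_datetime}/{filename}"
```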

No longer blocks: 1773355