Bug 1773091 (Open) - Opened 2 years ago, Updated 5 months ago

Expand the time that we keep autoland builds around for so that mozregression still works for regressions older than 1 year

Categories

(Release Engineering :: General, enhancement)


Tracking

(Not tracked)

People

(Reporter: jrmuizel, Assigned: gbrown)

References

Details

bug 1771837 and bug 1772225 are some examples of regressions that fall outside of the 1 year autoland range. Without the autoland builds, it is painful to narrow down the regression window, which makes it much harder and more time-consuming to act on these bugs.

Sylvestre suggests that it would not be too costly to expand the time that we keep the autoland builds around for.

(sorry, obsoleting comment 1 [regarding preserving android/gve builds] since it turns out that's tracked via bug 1763040 and planned followup work there.)

I suspect jrmuizel's request here (preserving autoland builds) falls under the RelEng component, based on ~similar "old-build-preserving" bug 1763040.

--> Reclassifying.

Component: mozregression → General
Product: Testing → Release Engineering
QA Contact: aki
Version: Default → unspecified

What is the proposed new expiration? 366 days? 10 years? What set of binaries/tasks do we need to keep for longer?
I imagine we're hitting this block where we default to 1y for tasks where expires-after isn't specified.

Severity: -- → S3
Flags: needinfo?(jmuizelaar)
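
(For illustration only: a minimal Python sketch of how a 1-year default expiry could be applied when expires-after isn't specified. The real logic lives in taskcluster/gecko_taskgraph/transforms/task.py and differs in detail; the function and key names here are assumptions.)

```python
from datetime import datetime, timedelta, timezone

DEFAULT_EXPIRY = timedelta(days=365)  # the 1y default discussed above

def apply_default_expiry(task_def, expires_after=None):
    """Give a task an absolute 'expires' timestamp.

    Illustrative sketch only; not the actual gecko_taskgraph transform.
    """
    lifespan = expires_after if expires_after is not None else DEFAULT_EXPIRY
    expires = datetime.now(timezone.utc) + lifespan
    # Serialize as an ISO 8601-style UTC timestamp
    task_def["expires"] = expires.strftime("%Y-%m-%dT%H:%M:%SZ")
    return task_def

# Example: a build task with no explicit expires-after gets the 1y default.
print(apply_default_expiry({"metadata": {"name": "build-linux64/opt"}}))
```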

How much do the different lengths of time cost?

We only need to keep the binaries that mozregression uses. Perhaps Zeid can describe more precisely which ones those are.

Flags: needinfo?(jmuizelaar) → needinfo?(zeid)

I believe it's standard S3 pricing until/unless we move to GCP.
https://aws.amazon.com/s3/pricing/ -- I'm guessing 0.021 USD per GB per month? Aiui we dropped from ~4.44 petabytes to ~1.35 petabytes recently after we fixed our cleanup processes, so we're well into the 500 TB+ tier.
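
(Back-of-the-envelope only: converting those totals to a monthly bill at a flat $0.021/GB-month, ignoring S3's tiered pricing and request/transfer costs.)

```python
PRICE_PER_GB_MONTH = 0.021  # flat rate; real S3 pricing is tiered

def monthly_cost_usd(petabytes):
    gigabytes = petabytes * 1024 * 1024  # PB -> GB, binary units
    return gigabytes * PRICE_PER_GB_MONTH

print(f"4.44 PB: ~${monthly_cost_usd(4.44):,.0f}/month")  # ~$97,769
print(f"1.35 PB: ~${monthly_cost_usd(1.35):,.0f}/month")  # ~$29,727
```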

Is that the total amount of S3 storage Mozilla pays for or just the amount of storage for the current year of mozregression used builds?

Flags: needinfo?(aki)

That's the total for Taskcluster artifacts for the FirefoxCI cluster.

Flags: needinfo?(aki)

Another alternative: we could look at adding a set of beetmover tasks to Autoland on-push graphs, possibly to push the minimum set of files needed to a directory in https://archive.mozilla.org/pub/firefox/nightly/ or the like. We'd need to teach mozregression how to find those files.

Flags: needinfo?(zeid)
See Also: → 1234029, 1763040
See Also: → 1773343
Blocks: 1773355

We could expand to two years right now and then take another year to decide what the optimal number is. Every day we wait, more builds are lost.

One build is around 80 MB; if we have three builds per push (do we? Or do we have more?), it adds up to 240 MB per push. If we have around 100 pushes per day, we have 24 GB per day, which is 8750 GB per year. If we multiply that by $0.021 per GB per month, it adds up to around $200 per month.

Do you think my estimation is correct? If so, the cost seems relatively low given the potential benefit.

Flags: needinfo?(cbond)
Flags: needinfo?(aki)
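
(A quick check of the arithmetic in the comment above; all inputs are the assumptions stated there. Note that $0.021/GB is a monthly rate, so the result is the steady-state monthly cost once a full extra year of builds is being retained.)

```python
# Inputs are the assumptions from the comment above.
BUILD_SIZE_GB = 0.080    # ~80 MB per build
BUILDS_PER_PUSH = 3
PUSHES_PER_DAY = 100
PRICE_PER_GB_MONTH = 0.021

per_day_gb = BUILD_SIZE_GB * BUILDS_PER_PUSH * PUSHES_PER_DAY   # 24 GB/day
per_year_gb = per_day_gb * 365                                  # 8,760 GB/year
monthly_cost = per_year_gb * PRICE_PER_GB_MONTH                 # ~$184/month

print(f"{per_day_gb:.0f} GB/day, {per_year_gb:,.0f} GB/year, ~${monthly_cost:.0f}/month")
```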

mozregression also supports debug and asan builds, fwiw. And mobile IIRC.

Only keeping debug and asan builds for a year seems ok if we want to keep the costs down. I've never run into an old regression that needed one of those builds.

(In reply to Marco Castelluccio [:marco] from comment #11)

One build is around 80 MB; if we have three builds per push (do we? Or do we have more?), it adds up to 240 MB per push. If we have around 100 pushes per day, we have 24 GB per day, which is 8750 GB per year. If we multiply that by $0.021 per GB per month, it adds up to around $200 per month.

Do you think my estimation is correct? If so, the cost seems relatively low given the potential benefit.

If we ditch all non-opt builds, we have 4 builds per push (build-linux64, build-macosx64, build-win32, build-win64).
We always keep all artifacts for a given task for the lifespan of the task [1].
I ran a test script to download all the artifacts from those builds from this push and came up with 4 GB of build artifacts per push.

If we have around 100 pushes per day, we have 400 GB per day, which is 146,000 GB per year. If we multiply that by $0.021 per GB per month, it adds up to around $3,000 per month, assuming we're keeping the builds around for an extra year. If we choose some other number, like 10 years or forever, that number will increase.

Also, I'm not entirely sure if we can keep some tasks around for longer than, say, their decision task, so we may need to increase the lifespan of a number of tasks, depending.

Because of some of the taskgraph complexity involved here, I'm somewhat leaning towards adding an autoland beetmover task to move the subset of artifacts we want to keep to https://archive.mozilla.org/pub/firefox/ somewhere -- we may be able to significantly reduce that 4GB number. This will take some engineering from Releng to add the beetmover task, and some engineering from the mozregression team to know how to look at that location.

Flags: needinfo?(aki)
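
(The same arithmetic with the measured ~4 GB/push figure, and how the steady-state monthly cost scales with the extra retention period; inputs come from the comment above.)

```python
# Inputs from the comment above: 4 builds/push, ~4 GB of artifacts per push.
GB_PER_PUSH = 4
PUSHES_PER_DAY = 100
PRICE_PER_GB_MONTH = 0.021

per_year_gb = GB_PER_PUSH * PUSHES_PER_DAY * 365   # 146,000 GB per extra year retained
for extra_years in (1, 2, 10):
    monthly = per_year_gb * extra_years * PRICE_PER_GB_MONTH
    print(f"+{extra_years}y retention: ~${monthly:,.0f}/month")
# +1y: ~$3,066/month; +2y: ~$6,132/month; +10y: ~$30,660/month
```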

(In reply to Aki Sasaki [:aki] (he/him) (UTC-6) from comment #14)

(In reply to Marco Castelluccio [:marco] from comment #11)

One build is around 80 MB; if we have three builds per push (do we? Or do we have more?), it adds up to 240 MB per push. If we have around 100 pushes per day, we have 24 GB per day, which is 8750 GB per year. If we multiply that by $0.021 per GB per month, it adds up to around $200 per month.

Do you think my estimation is correct? If so, the cost seems relatively low given the potential benefit.

If we ditch all non-opt builds, we have 4 builds per push (build-linux64, build-macosx64, build-win32, build-win64).
We always keep all artifacts for a given task for the lifespan of the task [1].

Couldn't we change that to have different expirations for different artifacts and only keep the actual build artifact (e.g. target.tar.bz2 on Linux)?

Hey there! As far as I remember, artifact expiration is defined per task. In other words, if we wanted to implement different expiration, then we would need to get the build artifacts generated in one task, and the rest in another one.

Would that be a lot of work? Or significant tech debt?

:jlorenzo there seems to be an expires value for each artifact (see https://hg.mozilla.org/mozilla-central/file/eda93d9c342fcb5d7c774e3bf1d391bd977172fc/taskcluster/gecko_taskgraph/transforms/task.py#l526); it's also part of the API (https://docs.taskcluster.net/docs/reference/platform/queue/api#createArtifact and https://docs.taskcluster.net/docs/reference/platform/object/api#createUpload).
Maybe the expiration of an artifact can't be longer than the expiration of its task? If so, we could set a 2-year expiration for tasks and the build artifacts, and set a shorter one (1 year or even less) for the other artifacts.

For example, in bugbug, I have a task that is expiring in one year, with one artifact expiring in one month and another artifact expiring in one year: https://github.com/mozilla/bugbug/blob/ebf06e9c18ff45a6719f8e4f959af4569d52a32b/infra/data-pipeline.yml#L559.
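
(For illustration, a rough Python rendering of a task definition with per-artifact expirations, in the spirit of the bugbug example above. The exact artifact schema depends on the worker type, so the key names and paths here are assumptions, not a verified Taskcluster payload.)

```python
from datetime import datetime, timedelta, timezone

def in_days(days):
    """Absolute UTC expiry timestamp, `days` from now."""
    return (datetime.now(timezone.utc) + timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%SZ")

# Hypothetical build-task fragment: the task and the installer live for 2 years,
# while the bulkier supporting artifact expires after 1 year.
task_fragment = {
    "expires": in_days(730),
    "payload": {
        "artifacts": [
            {"name": "public/build/target.tar.bz2", "path": "build/target.tar.bz2",
             "type": "file", "expires": in_days(730)},
            {"name": "public/build/target.crashreporter-symbols.zip",
             "path": "build/symbols.zip", "type": "file", "expires": in_days(365)},
        ],
    },
}
```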

I do think it will open the gates to a) a lot of complexity and b) other bugs.
If we set the task and subset of artifacts to live for 2 years, we will have some valid, unexpired tasks in indexes, but anything other than mozregression will be missing upstream artifacts.
Is there something about the beetmover proposal that you're pushing back against?

(In reply to Aki Sasaki [:aki] (he/him) (UTC-6) from comment #20)

I do think it will open the gates to a) a lot of complexity and b) other bugs.
If we set the task and subset of artifacts to live for 2 years, we will have some valid, unexpired tasks in indexes, but anything other than mozregression will be missing upstream artifacts.
Is there something about the beetmover proposal that you're pushing back against?

Oh no, I'm not pushing back against it, I don't even know exactly what it entails in terms of implementation. I was only asking about the other option of having different expirations for different artifacts, since in your initial comment you didn't mention that as a possibility and so I didn't know if you had considered it. From a high level perspective it seemed like a quicker/simpler solution (not requiring an additional task and additional mozregression code), but I trust your judgement and that's why I asked you :)

As a side note, I just remembered there were some discussions as part of the cost reduction project we ran a couple of years ago to define a policy around artifacts; see https://docs.google.com/document/d/1QC2pj5Y1aK95SdA2nHCSmRrT8aXz9BjZVZarElvdZz0/edit. Among other things, it proposed different expirations for different artifacts. Bug 1649987 implemented that, but it was backed out because of two problems; I'm not sure how easy it would be to fix them. I imagine we could use the beetmover solution in that case too and move not only builds but also the other artifacts that require a longer expiration than normal, but then we'd need to update all downstream consumers to be able to retrieve the information from some other place. Anyway, it's a problem for another day.

(In reply to Marco Castelluccio [:marco] from comment #21)

(In reply to Aki Sasaki [:aki] (he/him) (UTC-6) from comment #20)

I do think it will open the gates to a) a lot of complexity and b) other bugs.
If we set the task and subset of artifacts to live for 2 years, we will have some valid, unexpired tasks in indexes, but anything other than mozregression will be missing upstream artifacts.
Is there something about the beetmover proposal that you're pushing back against?

Oh no, I'm not pushing back against it, I don't even know exactly what it entails in terms of implementation. I was only asking about the other option of having different expirations for different artifacts, since in your initial comment you didn't mention that as a possibility and so I didn't know if you had considered it. From a high level perspective it seemed like a quicker/simpler solution (not requiring an additional task and additional mozregression code), but I trust your judgement and that's why I asked you :)

Ok. Yeah, it's definitely more work for two teams to implement, maybe a few weeks on the Releng side to set up the beetmover manifests, set up the task and scopes properly, and coordinate with the product delivery team about a new directory structure on archive.m.o with a separate artifact retention policy. I have fewer long-term worries about this approach, however.

As a side note, I just remembered there were some discussions as part of the cost reduction project we ran a couple of years ago to define a policy around artifacts; see https://docs.google.com/document/d/1QC2pj5Y1aK95SdA2nHCSmRrT8aXz9BjZVZarElvdZz0/edit. Among other things, it proposed different expirations for different artifacts. Bug 1649987 implemented that, but it was backed out because of two problems; I'm not sure how easy it would be to fix them. I imagine we could use the beetmover solution in that case too and move not only builds but also the other artifacts that require a longer expiration than normal, but then we'd need to update all downstream consumers to be able to retrieve the information from some other place. Anyway, it's a problem for another day.

Yes, the issues mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1649987#c17 are precisely what I'm worried about. We may solve one set of issues and create another set that won't hurt us until a year later or more when we've forgotten about what changes we've made to create those issues. And even if the set of artifacts+tasks that we choose for this bug don't cause those problems, the fact that we have a precedent to customize longer expiration times for a subset of artifacts is likely going to lead to a future change that will cause those problems.

(In reply to Aki Sasaki [:aki] (he/him) (UTC-6) (away, back Mon Jul 11) from comment #22)

Yes, the issues mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1649987#c17 are precisely what I'm worried about. We may solve one set of issues and create another set that won't hurt us until a year later or more when we've forgotten about what changes we've made to create those issues. And even if the set of artifacts+tasks that we choose for this bug don't cause those problems, the fact that we have a precedent to customize longer expiration times for a subset of artifacts is likely going to lead to a future change that will cause those problems.

Wild idea: we could run a controlled experiment where we temporarily make the artifacts for which we are planning to have a shorter lifespan unavailable, and see what breaks. It wouldn't be complete, but it could reduce the risk of unexpected future breakages.

We'd likely see bustage in however many months it takes to expire the shorter-lived artifacts, and we wouldn't have a way to fix it. Probably not a great situation.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #13)

Only keeping debug and asan builds for a year seems ok if we want to keep the costs down. I've never run into an old regression that needed one of those builds.

FWIW I just ran into one bug where I would love to have older debug builds available (possibly an exponential-decay sparse set of them). But I agree this is pretty rare.

(In my case, I'm looking at a fuzzer bug with a fatal assertion, which of course only trips in debug builds. It doesn't reproduce in current builds, nor in debug builds from 1 year back which is the oldest I can get via mozregression / artifacts. The bug has notes that suggest it was reproducible 2 years ago, so I'm wishing I could run a ~2-year-old debug build to double-check that I can indeed repro the issue in that older build. This would help me validate that I'm actually testing properly and give me a bit of confidence that the issue has in fact gone away.)

Yeah, we are planning to improve the situation here.
We are currently analyzing what we are storing, what is useful and what is not, and will then define a new storage policy.

(In reply to Aki Sasaki [:aki] (he/him) (UTC-6) from comment #9)

Another alternative: we could look at adding a set of beetmover tasks to Autoland on-push graphs, possibly to push the minimum set of files needed to a directory in https://archive.mozilla.org/pub/firefox/nightly/ or the like. We'd need to teach mozregression how to find those files.

We have indeed years of data (from 2004 here). We should probably make sure that mozregression can use it (maybe already the case):
https://archive.mozilla.org/pub/firefox/nightly/

(In reply to Sylvestre Ledru [:Sylvestre] from comment #28)

We have indeed years of data (from 2004 here). We should probably make sure that mozregression can use it (maybe already the case):
https://archive.mozilla.org/pub/firefox/nightly/

Those are a subset of mozilla-central builds, not autoland builds, which mozregression also needs.

Sure, I was just highlighting this in case folks need to do long regression ranges.

(In reply to Sylvestre Ledru [:Sylvestre] from comment #28)

We have indeed years of data (from 2004 here). We should probably make sure that mozregression can use it (maybe already the case):
https://archive.mozilla.org/pub/firefox/nightly/

Sylvestre has a point. I (officially non-Mozilla) use these builds, as there's a .txt file from each build that shows the revision hash it was built from. This revision hash (since it's m-c) can be mapped to both autoland and mozilla-central.

Do a date bisection on them and you will eventually get a daily range of changes; it may not be as granular as the autoland/mozregression ranges, but for the devs it may just be good enough given the age of the build.

Any bisection range is generally better than none at all.

(In reply to Sylvestre Ledru [:Sylvestre] from comment #28)

(In reply to Aki Sasaki [:aki] (he/him) (UTC-6) from comment #9)

Another alternative: we could look at adding a set of beetmover tasks to Autoland on-push graphs, possibly to push the minimum set of files needed to a directory in https://archive.mozilla.org/pub/firefox/nightly/ or the like. We'd need to teach mozregression how to find those files.

We have indeed years of data (from 2004 here). We should probably make sure that mozregression can use it (maybe already the case):
https://archive.mozilla.org/pub/firefox/nightly/

Aiui mozregression already supports nightlies. If this is sufficient, we can RESO WORKSFORME this bug.
Comment #0 appears to suggest that without the granular autoland per-push information, we can only bisect to nightlies, which can be separated by many dozens of commits... so if we want to support more granular bisection for over a year, I suggest we beetmove more binaries from autoland to archive.m.o with a lifespan of >1y. Otherwise, WORKSFORME.

(In reply to Aki Sasaki [:aki] (he/him) (UTC-6) from comment #32)

Aiui mozregression already supports nightlies. If this is sufficient, we can RESO WORKSFORME this bug.

Just using nightlies is not sufficient.

A few things were overlooked; here are the shippable builds we have:
linux64 shippable (linux32 is central only)
osx cross shippable
osx aarch64 shippable
win32 shippable
win64 shippable
win aarch64 shippable
android lite shippable
android lite arm7 shippable
android lite aarch64 shippable
android arm7 shippable
android aarch64 shippable
android x86 shippable
android x86_64 shippable

That is 12 builds, not 4, so our math needs to be 3x. BUT...

We do not do builds on every push; in fact, shippable builds seem to run every few pushes, and we don't do all shippable builds per push, so each platform will have different revisions.

Autoland introduces variability: assuming we have 100 pushes/day, how many are landings + backouts + relandings? How many are test-only, or DONTBUILD? In the end we are looking at a small subset of builds that meet our definition. I would recommend taking backstop pushes; these are where we schedule all tasks (and all builds). I counted 7 of those for August 31st.

Now our math looks more manageable, and going with backstop pushes we seem to have all our builds, so our mozregression story should be consistent.

One other consideration: do we need ALL 12 shippable builds? Do we need shippable only, or other builds too? I think this could lead to a more productive discussion (outside this bug) about what is tier 1 vs tier 2. For example, linux32 is only run on central; should win/aarch64 be? What about some of the android permutations? I would vote we keep this bug about shippable builds on autoland being available for a duration > 1 year and leave discussions about specific builds/types to another bug.
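
(A purely illustrative recalculation under the backstop-based framing above. The ~100 MB installer size is an assumption, not a measured figure; the build count and backstop frequency come from the comment.)

```python
# Illustrative only: installer size is assumed, other inputs are from the comment above.
INSTALLER_GB = 0.1          # assumed ~100 MB per shippable installer
BUILDS_PER_PUSH = 12        # the shippable matrix listed above
BACKSTOPS_PER_DAY = 7       # counted for August 31st
PRICE_PER_GB_MONTH = 0.021

per_year_gb = INSTALLER_GB * BUILDS_PER_PUSH * BACKSTOPS_PER_DAY * 365
monthly = per_year_gb * PRICE_PER_GB_MONTH
print(f"~{per_year_gb:,.0f} GB per extra year retained, ~${monthly:,.0f}/month")
# roughly 3,066 GB and ~$64/month for each extra year retained
```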

Summary: Expand the time that we keep autoland builds around for so that mozregression still work for regresssion older than 1 year → Expand the time that we keep autoland builds around for so that mozregression still work for regressions older than 1 year

I think mozregression will use opt non-shippable builds from autoland, so it's not just shippable builds we are considering.

(In reply to Joel Maher ( :jmaher ) (UTC -0800) from comment #34)

One other consideration: do we need ALL 12 shippable builds? Do we need shippable only, or other builds too? I think this could lead to a more productive discussion (outside this bug) about what is tier 1 vs tier 2. For example, linux32 is only run on central; should win/aarch64 be? What about some of the android permutations? I would vote we keep this bug about shippable builds on autoland being available for a duration > 1 year and leave discussions about specific builds/types to another bug.

android lite shippable
android lite arm7 shippable
android lite aarch64 shippable

We don't need to keep any Android "lite" builds. They're a test configuration that we don't use in Fenix or Focus. (We stopped building and testing the debug "lite" builds in bug 1778172.)

android arm7 shippable
android x86 shippable

I don't think we need to keep the arm7 and (32-bit) x86 builds.

So in the end, the only Android builds we need to keep are:

android aarch64 shippable
android x86_64 shippable

Flags: needinfo?(jmaher)

Thanks for the info, :cpeterson. The list is reduced to shippable (opt if not available):
linux64
osx cross
osx aarch64
win32
win64
win aarch64
android aarch64
android x86_64

The target is 2 years of retention for autoland.

Questions:

  1. I assume we only need the installer package, nothing else related to these build configurations.
  2. Is there agreement that we should do backstop-only pushes? If not, is it ok to do some subset of pushes (i.e. !DONTBUILD, !TESTONLY, !backed out, !test/tooling-only changes)?
  3. Who would be the appropriate people to sign off on these changes?
Flags: needinfo?(jmaher)

Who would be the appropriate people to sign off on these changes?

I can (probably)

QA Contact: mozilla → jlorenzo

Joel, can you provide an update on the status of fixing this?

Flags: needinfo?(jmaher)

I had overlooked this. I believe the solution for doing this is one of two things:

  1. Use beetmover to move only the builds to archive.mozilla.org (I don't understand all of that).
  2. Extend the task expiration to 2 years for builds on autoland and ensure all artifacts except the builds expire sooner.

From earlier bug conversations, beetmover seemed to be the ideal path. :jlorenzo, is this something you can look at to determine how to do it, what it would take, and what issues we might have?

Flags: needinfo?(jmaher) → needinfo?(jlorenzo)

I had a look at RELENG-942 (mentioned in comment 23). Aki highlighted the big steps with regard to option #1. I agree with these steps:

Tl;dr: Increasing taskcluster expiration times is dangerous, and may bite us in a year+.

We likely want to solve this by:

  1. Releng + Mozregression team hash out details
  2. At Releng’s request, Product Delivery adds another directory with a custom artifact expiration period
  3. Releng adds an autoland beetmover task + manifest to upload the minimum set of needed Autoland binaries to this new location. We need to know how to upload: by revision-named directory? by datestring-named directory? (If the latter, we may need to upload the info.txt to know what revision this build is from)
  4. Mozregression team updates mozregression to look at this new location for builds.
Flags: needinfo?(jlorenzo)

Is there anything blocking "Releng + Mozregression team hash out details" from happening? Can someone own making that happen?

Flags: needinfo?(jlorenzo)

Just wanted to chime in and say that, as someone who investigates platform regressions from time to time, ending up with regression windows that are older than 1 year, and thus only have nightly granularity (i.e. the window often includes on the order of 50 pushes), is a recurring pain point. Increasing the retention period, such as from 1 to 2 years, would be a significant improvement.

The suggestion to limit the retained artifacts to a subset of platforms and build types listed in comment 37 sounds fine to me as a way to reduce the storage costs involved.

Comment 37 and comment 41 seem very helpful; I'd like to try to sort this out.

Zeid: Can you help clarify the mozregression needs? (Maybe just comment 37, "...We need to know how to upload: by revision-named directory? by datestring-named directory?")

Assignee: nobody → gbrown
Flags: needinfo?(jlorenzo) → needinfo?(zeid)

Currently, we find a changeset from a date range via the pushlog, and then we do a lookup on taskcluster by branch, build type, and changeset, before fetching the artifacts from that task. I think naming directories after changesets would mean that we could continue with the pushlog approach, and instead of searching for a task and artifacts on taskcluster, we would fetch the builds directly from that directory.

We do, however, also use the build date, which we currently fetch from the task (for integration builds), so we will still need a way to get this info. Assuming that the tasks may expire and become unavailable, it would make sense to store this as part of the upload itself. So perhaps we should include both the changeset and the build date in the directory name, or nest the two.

Here is a proposal as an example (task for reference):

/edac68e2456cc823720ee9c6915784191d82ad2e/2022-11-01-00-06-51-764000/build-macosx64/opt/target.dmg

Hopefully this helps?

Flags: needinfo?(zeid)
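
(A sketch of how mozregression could resolve builds under the proposed layout. The archive location, directory layout, and helper are hypothetical; only the pushlog query reflects how changesets are found today.)

```python
import requests

ARCHIVE_BASE = "https://archive.mozilla.org/pub/firefox/autoland-builds"  # hypothetical location
PUSHLOG = "https://hg.mozilla.org/integration/autoland/json-pushes"

def candidate_build_urls(start_date, end_date, build_dir="build-macosx64/opt", filename="target.dmg"):
    """Map autoland pushes in a date range to hypothetical archive URLs.

    The /<revision>/<build-date>/ layout mirrors the proposal above; in practice
    the build date would have to be discovered from the upload itself.
    """
    pushes = requests.get(PUSHLOG, params={"startdate": start_date, "enddate": end_date}).json()
    urls = []
    for push in pushes.values():
        revision = push["changesets"][-1]  # tip changeset of the push
        urls.append(f"{ARCHIVE_BASE}/{revision}/<build-date>/{build_dir}/{filename}")
    return urls

# Example: candidate_build_urls("2022-11-01", "2022-11-02")
```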

That definitely helps. Thanks.

Minor concerns:

  1. That's quite different from the existing pattern for, say, nightlies:
https://archive.mozilla.org/pub/firefox/nightly/2023/10/2023-10-30-16-49-30-mozilla-central/firefox-121.0a1.en-US.mac.dmg
  2. In time, we'll end up with a very large top-level directory for /<revision>.

Both issues are mostly concerns from the perspective of manual browsing convenience...maybe not important since we are doing this just for mozregression.

(In reply to Geoff Brown [:gbrown] from comment #46)

  2. In time, we'll end up with a very large top-level directory for /<revision>.

Both issues are mostly concerns from the perspective of manual browsing convenience...maybe not important since we are doing this just for mozregression.

This sounds similar to what we had with Tinderbox builds, right? Here is an example for mozilla-central builds for Linux:

https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-linux/

  1. That's quite different from the existing pattern for, say, nightlies:
https://archive.mozilla.org/pub/firefox/nightly/2023/10/2023-10-30-16-49-30-mozilla-central/firefox-121.0a1.en-US.mac.dmg

I think it would be good to make it more consistent with existing patterns, as much as possible. So something like this may be better then.

https://archive.mozilla.org/pub/firefox/integration/edac68e2456cc823720ee9c6915784191d82ad2e-autoland/2022-11-01-00-06-51/<filename>.dmg

Where <filename> would contain any other information (e.g., OS, build type, etc.) that we need.

We could also make the date/time part of the filename but that seems more redundant. I can also check what exactly we use the build date/time for, and if it is only displayed for reference purposes then perhaps we don't have to include it (but possibly include push time instead, or something else) -- will get back to you on this.

  2. In time, we'll end up with a very large top-level directory for /<revision>.

If this is a significant problem (though based on :whimboo's comment perhaps it isn't?) we could figure out how to group them by something (maybe the first two digits of the changeset ID or something -- though not sure if those would be uniformly distributed). I don't think grouping by year and/or year and month would work since there are some edge cases where a push date is not going to match the build date.

Since this support has yet to be implemented in mozregression, the exact details from my perspective don't matter too much, as long as we are able to easily search by changeset and also have access to the full date/time, build type, and OS info. So whichever is the most conventional and sensible way you think that this is doable is good with me!
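
(And the equivalent lookup under the alternative, nightly-style layout proposed above; everything below /pub/firefox/integration/ is hypothetical, including the example filename.)

```python
ARCHIVE_BASE = "https://archive.mozilla.org/pub/firefox/integration"  # hypothetical prefix

def integration_build_url(revision, build_datetime, filename):
    """Build the URL for the alternative layout sketched above.

    e.g. integration_build_url("edac68e2456cc823720ee9c6915784191d82ad2e",
                               "2022-11-01-00-06-51",
                               "firefox-108.0a1.en-US.mac.dmg")  # filename carries OS/build-type info
    """
    return f"{ARCHIVE_BASE}/{revision}-autoland/{build_datetime}/{filename}"
```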

No longer blocks: 1773355