Closed Bug 1650208 Opened 5 months ago Closed 4 months ago

run opt builds by default on autoland to save some money and time

Categories

(Testing :: General, task)

Tracking

(firefox81 fixed)

RESOLVED FIXED
Future
Tracking Status
firefox81 --- fixed

People

(Reporter: jmaher, Assigned: bhearsum)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

:erahm asked why we do shippable builds on autoland and not opt. opt builds are faster (and cheaper), but perf tests are on shippable.

I calculated the savings we would get by running opt builds instead of shippable builds to be $1.14/push.

I then calculated the cost of running a full set of shippable builds to be $2.50/push.

the average push load on autoland the last 3 months is ~3000 pushes/month.

Querying the number of talos and raptor jobs that we have on autoland over that same period of time, it turns out that we would need enough builds to cover ~1/3 of the pushes.

so $1.14 * 3000 - $2.50 * 1000 = ~$1K/month.

I say $1K as we would run shippable builds on the backstop pushes already, so those 10 pushes/day would not be subtracted.
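The estimate above can be reproduced with a short sketch. The names are illustrative, and reading the formula as $1.14 times ~3000 pushes minus $2.50 times ~1000 pushes (the roughly one third of pushes that still need shippable builds for perf tests) is an assumption based on the surrounding text.

```python
# Figures quoted in this comment; all names are illustrative only.
SAVINGS_PER_PUSH = 1.14         # USD saved per push by building opt instead of shippable
SHIPPABLE_COST_PER_PUSH = 2.50  # USD for a full set of shippable builds on one push
PUSHES_PER_MONTH = 3000         # approximate autoland push load
PERF_PUSH_FRACTION = 1 / 3      # share of pushes that still need shippable builds


def monthly_savings(savings_per_push, shippable_cost, pushes, perf_fraction):
    """Gross opt savings minus the cost of shippable builds still needed for perf."""
    gross = savings_per_push * pushes
    perf_builds_cost = shippable_cost * pushes * perf_fraction
    return gross - perf_builds_cost


estimate = monthly_savings(
    SAVINGS_PER_PUSH, SHIPPABLE_COST_PER_PUSH, PUSHES_PER_MONTH, PERF_PUSH_FRACTION
)
print(f"~${estimate:,.0f}/month")
```

With these inputs the result is about $920/month, which rounds to the ~$1K quoted.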

  1. I do wonder why not just optimize our instr and run tasks to be faster? That is unknown in terms of feasibility and engineering effort- worth answering though.
  2. how difficult would it be to adjust our scheduling to be opt by default on autoland, but ensure that perf tests are only run on shippable and every backstop we only run shippable.
  3. what is the risk of a shippable only unittest regression showing up and adding a lot of load (this could be confusing for the code sheriffs on the backstop pushes when backfilling)
  4. what is the risk for the perf sheriffs, as builds wouldn't be available and backfilling would have an extra 1+ hour delay.

:glandium can you help answer #1
:marco/:bhearsum can you help answer #2
:aryx: can you help answer #3
:igoldan can you help answer #4

Flags: needinfo?(mh+mozilla)
Flags: needinfo?(mcastelluccio)
Flags: needinfo?(igoldan)
Flags: needinfo?(aryx.bugmail)

how difficult would it be to adjust our scheduling to be opt by default on autoland, but ensure that perf tests are only run on shippable and every backstop we only run shippable.

I'm going to defer this to Tom.

Flags: needinfo?(mozilla)

Somewhat related bugs: https://bugzilla.mozilla.org/show_bug.cgi?id=1650083, https://bugzilla.mozilla.org/show_bug.cgi?id=1648292#c14. We probably only want to do one of the three of them.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #0)

:erahm asked why we do shippable builds on autoland and not opt. opt builds are faster (and cheaper), but perf tests are on shippable.

I calculated the savings we would get by running opt builds instead of shippable builds to be $1.14/push.

I then calculated the cost of running a full set of shippable builds to be $2.50/push.

These numbers don't feel right. Looking at a recent m-c landing I can see a "Linux x64 shippable opt" build taking:

  • instr - 36 min
  • run - 7 min
  • B - 50 min

So 93 minutes total build time.

For a "Linux x64 opt" build we have:

  • B - 36 min

So 36 minutes total build time.

If shippable are $2.50/push I'd expect opt to be $0.96/push. That puts us closer to saving ~5K/month. Of course I'm just looking at 1 data point and platform so it could be the averages are different. Do we have raw numbers on total time for all shippable (instr + run + B) builds over June vs opt (just B)? I might be completely missing something here of course, if you could elaborate on how you came to your numbers that might be helpful.
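A rough sketch of the proportional estimate above, assuming cost scales linearly with total build minutes. The task durations come from the single m-c push cited here; applying the per-push saving to all ~3000 monthly pushes is an assumption about how the ~5K figure was reached, and all names are illustrative.

```python
# Durations (minutes) from the one "Linux x64" data point in this comment.
SHIPPABLE_MINUTES = {"instr": 36, "run": 7, "B": 50}
OPT_MINUTES = {"B": 36}

SHIPPABLE_COST_PER_PUSH = 2.50  # USD, from comment 0
PUSHES_PER_MONTH = 3000         # approximate autoland push load, from comment 0


def scaled_opt_cost(shippable_cost, shippable_minutes, opt_minutes):
    """Estimate opt cost by scaling shippable cost by the build-time ratio."""
    ratio = sum(opt_minutes.values()) / sum(shippable_minutes.values())
    return shippable_cost * ratio


opt_cost = scaled_opt_cost(SHIPPABLE_COST_PER_PUSH, SHIPPABLE_MINUTES, OPT_MINUTES)
monthly = (SHIPPABLE_COST_PER_PUSH - opt_cost) * PUSHES_PER_MONTH
print(f"opt: ~${opt_cost:.2f}/push, savings: ~${monthly:,.0f}/month")
```

This ignores the shippable builds that would still be needed for perf tests, so it is an upper bound under the linear-cost assumption.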

Flags: needinfo?(jmaher)

Before reaching for this, we should finish bug 1637544, and improve the build times for the shippable build itself (the new dump_syms will help a lot, and there are other things in the pipe that will improve things too).

Flags: needinfo?(mh+mozilla)

:erahm, I didn't realize opt-B vs shippable-B were different; I will look at the difference- the savings is not $5K, it is $1K on my numbers, with 14 more minutes saved, that might be $1.2K/month.

ok, looking more, we have a lot of opt builds that were included in the total cost which do not have shippable partners. I adjusted the parameters to focus on specific builds ( https://gist.github.com/jmaher/1b77b25928bcad397b978939828a1bb3 in bigquery )

average for Q2:
opt: $0.47
ship: $1.14

so that is a $0.67 savings on just the 'B' job and $1.14 for instr/run for a total of: $1.81/push.

This puts the savings closer to $3K/month.
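The revised figure can be checked with the sketch below. It assumes the same model as comment 0 (about 3000 pushes/month, with about 1000 of them still needing a full set of shippable builds at $2.50 each); that carry-over is an assumption, and the names are illustrative.

```python
# Q2 averages quoted in this comment (USD per push).
OPT_B_COST = 0.47
SHIPPABLE_B_COST = 1.14
INSTR_RUN_COST = 1.14  # instr + run tasks, only needed for shippable

# Carried over from comment 0 (assumed to still apply here).
PUSHES_PER_MONTH = 3000
PERF_PUSHES_PER_MONTH = 1000
SHIPPABLE_COST_PER_PUSH = 2.50

# Per-push saving: cheaper 'B' job plus no instr/run tasks.
per_push = (SHIPPABLE_B_COST - OPT_B_COST) + INSTR_RUN_COST

# Monthly saving after paying for the shippable builds perf tests still need.
monthly = (
    per_push * PUSHES_PER_MONTH
    - SHIPPABLE_COST_PER_PUSH * PERF_PUSHES_PER_MONTH
)
print(f"${per_push:.2f}/push, ~${monthly:,.0f}/month")
```

Under these assumptions the result is about $2,930/month, consistent with "closer to $3K/month".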

Flags: needinfo?(jmaher)
Duplicate of this bug: 1650083
Flags: needinfo?(mcastelluccio)
See Also: → 1637544

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #0)

  2. how difficult would it be to adjust our scheduling to be opt by default on autoland, but ensure that perf tests are only run on shippable and every backstop we only run shippable.

For the smart scheduler, I just need to be aware if the change is actually going to happen, and I need to make it so that in the training data shippable is considered the same as opt (otherwise we'll lose the entire training data) and to schedule opt instead of shippable on autoland.
It's a relatively easy change, we just need to coordinate.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #0)

  3. what is the risk of a shippable only unittest regression showing up and adding a lot of load (this could be confusing for the code sheriffs on the backstop pushes when backfilling)

Based on experience with previous switches from opt to shippable, the risk should be low. If autoland has shippable for backstop pushes and opt for the other pushes, this will make tracing issues across pushes in Treeherder problematic with current tooling. The 'Similar Jobs' tab won't show the other config, and sheriffs might request backfills and be confused about why they are not shown in 'Similar Jobs' or a filtered view.
Flags: needinfo?(aryx.bugmail)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #0)

:erahm asked why we do shippable builds on autoland and not opt. opt builds are faster (and cheaper), but perf tests are on shippable.
[...]
4) what is the risk of the perf sheriffs as builds wouldn't be available and backfilling would have an extra 1+ hour delay.

I'm deferring this to Bebe & Alexandru, as they can provide the most precise answers.

Flags: needinfo?(igoldan)
Flags: needinfo?(fstrugariu)
Flags: needinfo?(aionescu)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #0)

:erahm asked why we do shippable builds on autoland and not opt. opt builds are faster (and cheaper), but perf tests are on shippable.
...
4) what is the risk of the perf sheriffs as builds wouldn't be available and backfilling would have an extra 1+ hour delay.

I'm not very clear on what "1+ hour delay" means. Most of the time, to identify a regression we have to do some backfills, filling a gap that averages roughly 5 revisions. If the entire backfill is delayed by about 1h, I don't see a problem. But if every build is delayed by that amount of time, then, given that we sometimes have to backfill several times, it could increase our performance metrics (time-to's).

Flags: needinfo?(aionescu)

:alexandrui, every time you would request a backfill on a revision with no existing perf jobs, you would have to wait for builds to be generated. It doesn't mean that if you backfill 10 jobs it would take 10 hours, it would take 1-2 hours for the jobs to be run in parallel, then the perf tests would start. If you retrigger an existing job, it would just run the job as normal.

How often do you have to backfill multiple times? I assume this means for a single alert, you backfill, then have to repeat to get more data points? Is this because it didn't backfill enough revisions to start with?

Flags: needinfo?(aionescu)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #13)

How often do you have to backfill multiple times? I assume this means for a single alert, you backfill, then have to repeat to get more data points? Is this because it didn't backfill enough revisions to start with?

Not too often. Usually I do backfill+retrigger. But when I do backfill+backfill it means that the culprit wasn't covered by the initial backfill.

Flags: needinfo?(aionescu)

I think the risk here is low, and tweaking our tools for backfill range will be able to help perf sheriffs out.

It sounds like we really need to wait for bug 1637544 to land before deciding on this- also determining which big picture route we want to take (:bhearsum's comment 3) between this, bug 1650083, and bug 1648292

Flags: needinfo?(mozilla)
Flags: needinfo?(fstrugariu)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #15)

I think the risk here is low, and tweaking our tools for backfill range will be able to help perf sheriffs out.

It sounds like we really need to wait for bug 1637544 to land before deciding on this- also determining which big picture route we want to take (:bhearsum's comment 3) between this, bug 1650083, and bug 1648292

I'm not sure what we need to wait on. It might reduce the time of the "instr" build, but wouldn't affect the "run" or "B" task that are also required for shippable and take up a majority of the build time.

I was echoing :glandium's comment, maybe we don't need to wait on that.

Rereading, it sounds like there's no reason we've come up with not to go ahead and run opt for all pushes on autoland, and only run shippable for backstop and backfilling (please speak up if this is wrong). I'm going to work on this.
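As a minimal sketch of the rule described above (not the actual taskgraph change), the decision could look like the following. The function name, its parameters, and the project strings are hypothetical; the "shippable by default on central" branch follows the attachment description on this bug.

```python
def build_type_for_push(project, is_backstop=False, is_backfill=False):
    """Pick the build type for a push under the proposed policy.

    Hypothetical helper for illustration only; the real scheduling logic
    lives in the taskgraph and is not reproduced here.
    """
    if project == "mozilla-central":
        # Central keeps shippable builds by default.
        return "shippable"
    if project == "autoland":
        # Backstop pushes and backfills still get shippable builds,
        # so perf tests have something to run against.
        if is_backstop or is_backfill:
            return "shippable"
        return "opt"
    # Other branches are not covered by this discussion.
    raise ValueError(f"policy not defined for project: {project}")


print(build_type_for_push("autoland"))
print(build_type_for_push("autoland", is_backstop=True))
```

Raising for other projects keeps the sketch from implying a policy the thread does not state.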

Assignee: nobody → bhearsum

Looks good to me!
many thanks!

Attachment #9162922 - Attachment description: Bug 1650208: run opt builds by default on autoland to save some money and time → bug 1650208: run opt builds by default on autoland; shippable builds by default on central
Attachment #9162922 - Attachment description: bug 1650208: run opt builds by default on autoland; shippable builds by default on central → Bug 1650208: run opt builds by default on autoland; shippable builds by default on central

It seems like this didn't actually reduce the number of shippable builds on autoland. bhearsum, can you double-check that it's working as intended?

Flags: needinfo?(bhearsum)

I don't think this has landed yet

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #22)

I don't think this has landed yet

That'd explain it! Not sure how I missed that, I think I confused the dup'd bug being resolved with landing.

Flags: needinfo?(bhearsum)

(In reply to Eric Rahm [:erahm] from comment #23)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #22)

I don't think this has landed yet

That'd explain it! Not sure how I missed that, I think I confused the dup'd bug being resolved with landing.

Soon! Really!

Pushed by bhearsum@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a706413a07d0
run opt builds by default on autoland; shippable builds by default on central r=tomprince,ahal,marco
Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → Future
Duplicate of this bug: 1648292