1259412 - adjust talos scheduling/SETA to not be as aggressive

Reporter

Description

10 years ago

I have found that with the new Windows XP scheduling, tracking down a regression is hard. A lot of work goes into backfilling, waiting, more backfilling, more waiting, and then finally getting to a root cause.

In addition, I think it would be more beneficial to run all talos jobs every 3rd push, guaranteed, rather than coalescing certain jobs. We can make an exception for XP, maybe every 6th push.

Another thought is to automatically run every PGO talos job twice: we get so few PGO data points that this would help us correlate failures more easily. I find it more beneficial to have ALL the data than some of it, even if that means skipping revisions.

To move this forward we would need to adjust the buildbot schedulers. Ideally SETA would have additional data for each job indicating how frequently to run it; that might be more of a future role for taskcluster.
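For illustration only, the "all talos every 3rd push, XP every 6th" idea could be sketched roughly as below. This is not buildbot or SETA code: the constants, the `winxp` platform label, and both function names are hypothetical, and the sketch assumes a shared 1-based push index is available to the scheduler.

```python
from typing import List

# Hypothetical intervals taken from the proposal above; not real buildbot settings.
DEFAULT_INTERVAL = 3  # run every talos job on every 3rd push
XP_INTERVAL = 6       # Windows XP exception: every 6th push


def runs_full_talos(push_index: int, platform: str) -> bool:
    """Return True if all talos jobs should run on this push (1-based index)."""
    interval = XP_INTERVAL if platform == "winxp" else DEFAULT_INTERVAL
    return push_index % interval == 0


def full_talos_pushes(num_pushes: int, platform: str) -> List[int]:
    """List the pushes (1..num_pushes) on which the full talos set would run."""
    return [i for i in range(1, num_pushes + 1) if runs_full_talos(i, platform)]
```

Because XP's interval is a multiple of the default one, every push that runs XP talos also runs talos everywhere else, so all platforms have data on the same revisions.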

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 1

10 years ago

:catlee, you recently did the work to get SETA working with talos and buildbot. Do you have thoughts here?

Flags: needinfo?(catlee)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 2

10 years ago

ping...catlee

Chris AtLee [:catlee]

Comment 3

10 years ago

Sorry, I'm not clear on exactly what you're asking for here. I see 3 or 4 different ideas.

Flags: needinfo?(catlee)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 4

10 years ago

I would like to ensure all talos tests get run together, say every 3rd push; to save on XP we could run all XP tests on every 6th push. Is that reasonable?

Flags: needinfo?(catlee)

Chris AtLee [:catlee]

Comment 5

10 years ago

I think that would require a significant change to how the SETA scheduler works.

Flags: needinfo?(catlee)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 6

10 years ago

Currently we define 2 tiers of SETA scheduling:
http://mxr.mozilla.org/build/source/buildbot-configs/mozilla-tests/config_seta.py#34

every 7th push / 1 hour
for XP: every 14th push / 2 hours

I was thinking I could adjust SETA to run on ALL talos jobs (not just opt vs e10s); then we would have talos run on every 7th push and, for XP, on every 14th push. The problem is that we would then want to adjust these to be 3/6, or something more like 3/7. That would require hacking config_seta.py to use different skipconfig settings. Maybe we could do that for talos by examining the job name? What do you think, :catlee?

/me needs to consider what it might look like if we ran all talos every 7th push instead of partially every push.
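A rough sketch of what a job-name-based override might look like. This does not reproduce config_seta.py: the `SkipTier` class, the `TIERS` table, the job-name matching, and the reading of "7th/1hour" as "run after 7 skipped pushes or 1 hour, whichever comes first" are all assumptions. The talos timeouts are simply copied from the existing tiers, since the comment above only proposes new push counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkipTier:
    skip_count: int      # run once this many pushes have accumulated ...
    skip_timeout_s: int  # ... or once this many seconds have passed


# Keys are (is_talos, is_xp). Values are illustrative, not real settings.
TIERS = {
    (False, False): SkipTier(7, 3600),   # current default tier
    (False, True): SkipTier(14, 7200),   # current XP tier
    (True, False): SkipTier(3, 3600),    # proposed talos tier
    (True, True): SkipTier(6, 7200),     # proposed talos-on-XP tier
}


def tier_for(job_name: str) -> SkipTier:
    """Pick a tier by examining the job name (hypothetical matching rules)."""
    name = job_name.lower()
    return TIERS[("talos" in name, "windows xp" in name)]


def should_run(job_name: str, pushes_since_run: int, seconds_since_run: int) -> bool:
    """Run the job if either the push count or the timeout has been reached."""
    tier = tier_for(job_name)
    return (pushes_since_run >= tier.skip_count
            or seconds_since_run >= tier.skip_timeout_s)
```

Switching the XP talos entry from 6 to 7 gives the "3/7" variant without touching the non-talos tiers.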

Flags: needinfo?(catlee)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 7

10 years ago

One other thought here after thinking more on this:
* winxp should be every 7th push. We are having to do too much backfilling, and getting results an extra day later is harder to sheriff and not as useful for developers.
* If we did all talos every 7th push, I would like to run each job twice. This gives us the benefit of getting all relevant data at once, and for the most part within 24 hours.
* PGO will still be random; possibly for PGO we could run two talos jobs at a time.
* How would this work when backfilling, or when manually doing pgo/builds on a revision? In those cases we wouldn't require the duplicate runs, but more data is always helpful.

:catlee, with the current SETA scheduling it is difficult to change winxp to every 7th push; it would require hacking as I outlined in comment 6. For running duplicate talos jobs on PGO, or ideally on every coalesced SETA job, what would it take to do that?

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 8

10 years ago

Another frustration is that I see win7 e10s (SETA-skipped) jobs running on rev X, and on rev X+1 I see win8. It is as if the skip counters are not starting at identical places, which causes more confusion.

Chris AtLee [:catlee]

Comment 9

10 years ago

The problem is that we don't have a good way of aligning the skip counters.
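To make the symptom from comment 8 concrete, here is a small simulation of independent per-job skip counters. It is only a model of the behaviour described in this thread, not the actual SETA scheduler; the function name and the starting offsets are hypothetical.

```python
from typing import List


def pushes_where_job_runs(interval: int, start_offset: int, num_pushes: int) -> List[int]:
    """Simulate one per-job skip counter.

    The counter starts at `start_offset`, is incremented on every push,
    and the job runs (and the counter resets) once it reaches `interval`.
    """
    counter = start_offset
    runs = []
    for push in range(1, num_pushes + 1):
        counter += 1
        if counter >= interval:
            runs.append(push)
            counter = 0
    return runs


# Two jobs with the same interval but counters that started in different places.
win7_runs = pushes_where_job_runs(interval=7, start_offset=0, num_pushes=21)
win8_runs = pushes_where_job_runs(interval=7, start_offset=6, num_pushes=21)
```

In this model win7 runs on pushes 7, 14 and 21 while win8 runs on pushes 1, 8 and 15, so win8 always lands one push after win7. The two jobs never share a revision unless their counters are started from the same value.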

Flags: needinfo?(catlee)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 10

10 years ago

I think the most realistic, actionable thing to do right now would be to run each talos job twice for PGO runs. That will get us alerts much faster and make them actionable. Over the last few months we have seen more and more PGO-only regressions. :rail, would you have some suggestions or ideas on how we could do that?

Flags: needinfo?(rail)

Rail Aliiev [:rail]

Comment 11

10 years ago

Scheduling tests twice sounds like the easiest solution here. I'm not sure how to implement this properly though.

Flags: needinfo?(rail)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 12

10 years ago

:armen, do you think this is something we could solve with mozci/pulse_actions?

Flags: needinfo?(armenzg)

Armen [:armenzg]

Comment 13

10 years ago

This would be easy with pulse_actions: we can watch for PGO jobs (through the Treeherder exchange, as trigger-bot does) and make sure that two jobs are scheduled for those pushes.
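The "make sure that two jobs are scheduled" step could be sketched as the pure function below. It deliberately does not call pulse_actions, mozci, or Treeherder: the job dictionaries, their `name` and `build_type` keys, and the function name are hypothetical stand-ins for whatever the real message payload provides.

```python
from collections import Counter
from typing import Dict, Iterable

TARGET_RUNS = 2  # desired number of runs per PGO talos job on a push


def missing_talos_runs(jobs: Iterable[dict], target: int = TARGET_RUNS) -> Dict[str, int]:
    """Return how many extra runs each PGO talos job on a push still needs.

    `jobs` is every job seen for one push; each item is a dict with the
    hypothetical keys 'name' and 'build_type'.
    """
    counts = Counter(
        job["name"]
        for job in jobs
        if job["build_type"] == "pgo" and "talos" in job["name"]
    )
    # Only jobs seen at least once appear here; a job that never ran is not detected.
    return {name: target - seen for name, seen in counts.items() if seen < target}


example_jobs = [
    {"name": "talos chromez", "build_type": "pgo"},
    {"name": "talos svgr", "build_type": "pgo"},
    {"name": "talos svgr", "build_type": "pgo"},
    {"name": "talos chromez", "build_type": "opt"},
]
extra = missing_talos_runs(example_jobs)
```

Here `extra` would say that the PGO `talos chromez` job needs one more run, while `talos svgr` already has two. Whatever listens on the exchange would then request that many additional triggers.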

Flags: needinfo?(armenzg)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 14

10 years ago

:catlee, I would like to move forward with the pulse_actions approach that :armenzg mentions. Do you have concerns?

Flags: needinfo?(catlee)

Chris AtLee [:catlee]

Comment 15

10 years ago

Sounds like a great idea - let's do it!

Flags: needinfo?(catlee)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 16

10 years ago

Filed a pulse_actions issue:
https://github.com/mozilla/pulse_actions/issues/70

Let's see where that gets us; if it works, we can look at addressing winxp depending on the load.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 17

9 years ago

I don't think there is anything actionable here, although I do think there is a longer-term need to pull hacky scheduling out of the different systems and get it into taskcluster proper.

Armen [:armenzg]

Updated

9 years ago

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → INVALID

Nobody; OK to take it and work on it

Assignee

Updated

7 years ago

Component: General Automation → General

Categories

(Release Engineering :: General, defect)

Tracking

(Not tracked)

People

(Reporter: jmaher, Unassigned)
