Closed
Bug 1259412
Opened 10 years ago
Closed 9 years ago
adjust talos scheduling/SETA to not be as aggressive
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: jmaher, Unassigned)
Details
I have found that with the new Windows XP scheduling, tracking down a regression is hard. A lot of work goes into backfilling, waiting, more backfilling, more waiting, then finally getting to a root cause.
In addition, I think it would be more beneficial to run all talos jobs on every 3rd push, guaranteed, rather than coalescing certain jobs. We can make an exception for XP, maybe every 6th push.
Another thought is to automatically run every PGO talos job twice. We get so few PGO data points that this would help us correlate failures more easily.
I find it more beneficial to have ALL the data than some of it, even if that means skipping revisions.
To move this forward we would need to adjust the buildbot schedulers. Ideally SETA would have additional data for each job to indicate how frequently to run it, though that might be more of a future role for taskcluster.
Comment 1 (Reporter) • 10 years ago
:catlee, you recently did the work to get SETA working with talos and buildbot; do you have thoughts here?
Flags: needinfo?(catlee)
Comment 2 (Reporter) • 10 years ago
ping...catlee
Comment 3 • 10 years ago
Sorry, I'm not clear on exactly what you're asking for here. I see 3 or 4 different ideas.
Flags: needinfo?(catlee)
Comment 4 (Reporter) • 10 years ago
I would like to ensure all talos tests get run together, say on every 3rd push; and to save on XP we could run all XP tests on every 6th push. Is that reasonable?
Flags: needinfo?(catlee)
Comment 5 • 10 years ago
I think that would require a significant change to how the SETA scheduler works.
Flags: needinfo?(catlee)
Comment 6 (Reporter) • 10 years ago
Currently we define 2 tiers of SETA scheduling:
http://mxr.mozilla.org/build/source/buildbot-configs/mozilla-tests/config_seta.py#34
default: every 7th push / 1 hour
XP: every 14th push / 2 hours
I was thinking I could adjust SETA to run on ALL talos jobs (not just opt vs e10s); then we would have talos run on every 7th push, and for XP on every 14th push.
The problem is that we would then want to adjust these to 3/6, or something more like 3/7. That would require hacking config_seta.py so that the skipconfig settings differ for talos.
Maybe we could do that for talos via examining the job name?
What do you think :catlee?
/me needs to consider what it might look like if we ran all talos on every 7th push instead of partially on every push.
Flags: needinfo?(catlee)
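A minimal sketch of the "examine the job name" idea from comment 6, not the actual config_seta.py code. It assumes a skipconfig is a (skipcount, skiptimeout-in-seconds) pair and that a job runs once either threshold is reached. The function names, the talos tier values, and the job-name patterns are all hypothetical; only the 7/1 hour and 14/2 hour tiers and the proposed 3/7 push counts come from this bug.

```python
import re

# (skipcount, skiptimeout in seconds). The first two mirror the tiers quoted
# in comment 6; the talos tiers use the proposed 3/7 push counts, and their
# timeouts are assumptions made for illustration only.
DEFAULT_SKIP = (7, 3600)
XP_SKIP = (14, 7200)
TALOS_SKIP = (3, 3600)
TALOS_XP_SKIP = (7, 7200)


def skipconfig_for(job_name):
    """Pick a (skipcount, skiptimeout) pair by inspecting the job name."""
    is_talos = "talos" in job_name.lower()
    is_xp = re.search(r"windows xp|winxp", job_name, re.IGNORECASE) is not None
    if is_talos:
        return TALOS_XP_SKIP if is_xp else TALOS_SKIP
    return XP_SKIP if is_xp else DEFAULT_SKIP


def should_run(job_name, pushes_since_last_run, seconds_since_last_run):
    """Run once either the push-count or the timeout threshold is reached."""
    skipcount, skiptimeout = skipconfig_for(job_name)
    return (pushes_since_last_run >= skipcount
            or seconds_since_last_run >= skiptimeout)
```

With these assumed values, a talos job on XP would be picked up on every 7th push (or after 2 hours), and every other talos job on every 3rd push (or after 1 hour), without touching the non-talos tiers.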
Comment 7 (Reporter) • 10 years ago
One other thought here, after thinking more about this:
* winxp should run on every 7th push. We are having to do too much backfilling, and getting results an extra day later is harder to sheriff and less useful for developers.
* If we did all talos on every 7th push, I would like to run each job twice. This gives us the benefit of getting all relevant data at once, and for the most part within 24 hours.
* pgo will still be random; possibly we could run two talos jobs at a time for pgo.
* How would this work when backfilling, or when manually doing pgo/builds on a revision? In those cases we wouldn't require the duplicate runs, but more data is always helpful.
:catlee, with the current SETA scheduling it is difficult to change winxp to every 7th push; it would require the hacking I outlined in comment 6. For running duplicate talos jobs on pgo, or ideally on every coalesced SETA job, what would it take to do that?
Comment 8 (Reporter) • 10 years ago
Another frustration is that I see win7 e10s (SETA-skipped) jobs running on rev X, while on rev X+1 I see win8. It is as if the skip counters are not starting at identical places, which causes more confusion.
Comment 9 • 10 years ago
The problem is that we don't have a good way of aligning the skip counters.
Flags: needinfo?(catlee)
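To illustrate the misalignment described in comments 8 and 9, here is a small sketch under simplifying assumptions: it ignores skip timeouts, numbers pushes from 0, and supposes a shared push index exists for the aligned case, which comment 9 says is not readily available. Both function names are hypothetical.

```python
def runs_with_independent_counters(num_pushes, skipcount, start_offset):
    """Pushes on which a job runs when its private skip counter had already
    counted start_offset pushes before push 0."""
    return [p for p in range(num_pushes) if (p + start_offset) % skipcount == 0]


def runs_aligned_to_push_index(num_pushes, skipcount):
    """Pushes on which a job runs when the decision depends only on a push
    index shared by every job."""
    return [p for p in range(num_pushes) if p % skipcount == 0]


# Two jobs with the same skipcount but counters that started at different
# points land on neighbouring pushes (rev X vs rev X+1).
win7_runs = runs_with_independent_counters(15, 7, 0)
win8_runs = runs_with_independent_counters(15, 7, 6)
```

Here `win7_runs` covers pushes 0, 7 and 14 while `win8_runs` covers pushes 1 and 8, so the two platforms never report on the same revision. Deciding from a shared push index would put both on the same pushes.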
Comment 10 (Reporter) • 9 years ago
I think the most realistic, actionable thing to do right now would be to run each talos job twice for pgo runs. That will get us alerts much faster and make them actionable. Over the last few months we have seen more and more pgo-only regressions.
:rail, would you have some suggestions or ideas on how we could do that?
Flags: needinfo?(rail)
Comment 11 • 9 years ago
Scheduling tests twice sounds like the easiest solution here. I'm not sure how to implement this properly though.
Flags: needinfo?(rail)
Comment 12 (Reporter) • 9 years ago
:armen, do you think this is something we could solve with mozci/pulse_actions?
Flags: needinfo?(armenzg)
Comment 13 • 9 years ago
This would be easy with pulse_actions: we can watch for PGO jobs (through the Treeherder exchange, as trigger-bot does) and make sure that two jobs are scheduled for those pushes.
Flags: needinfo?(armenzg)
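A rough sketch of the "make sure two jobs are scheduled" check, kept independent of pulse_actions and Treeherder: it does not use either project's API. The function name, the input shape (a plain list of job-name strings for one push), and the name matching are hypothetical; a real listener would receive richer messages and would still have to trigger the extra runs.

```python
from collections import Counter

TARGET_RUNS = 2  # each PGO talos job should run twice, per comment 10


def extra_triggers(scheduled_jobs, target=TARGET_RUNS):
    """Return {job_name: additional_runs_needed} for PGO talos jobs on a push.

    scheduled_jobs is a list of job-name strings already scheduled on the
    push (an assumed input shape, for illustration only).
    """
    counts = Counter(
        name for name in scheduled_jobs
        if "pgo" in name.lower() and "talos" in name.lower()
    )
    return {name: target - n for name, n in counts.items() if n < target}
```

Jobs that already have two runs, and non-PGO jobs, are left out of the result, so re-processing the same push after the extra runs are scheduled would request nothing further.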
Comment 14 (Reporter) • 9 years ago
:catlee, I would like to move forward with the pulse_actions approach that :armenzg mentions; do you have concerns?
Flags: needinfo?(catlee)
Comment 16 (Reporter) • 9 years ago
Filed a pulse_actions issue:
https://github.com/mozilla/pulse_actions/issues/70
Let's see where that gets us; if it works, we can look at addressing winxp depending on the load.
Comment 17 (Reporter) • 9 years ago
I don't think there is anything actionable here, although I do see a longer-term need to pull hacky scheduling out of different systems and get it into taskcluster proper.
Updated • 9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID
Updated (Assignee) • 7 years ago
Component: General Automation → General