Linux 32 and Mac shippable test schedules not SETA optimized, high test loads

RESOLVED FIXED in Firefox 68

Status

defect
--
blocker
RESOLVED FIXED
3 months ago
2 months ago

People

(Reporter: RyanVM, Assigned: Callek)

Tracking

({regression})

Version 3
mozilla68

Firefox Tracking Flags

(firefox-esr60 unaffected, firefox66 unaffected, firefox67 unaffected, firefox68 fixed)

Details

Attachments

(2 attachments)

Reporter

Description

3 months ago

I'm guessing this is fallout from the shippable builds work. Can't be good for our backlogs.

Flags: needinfo?(bugspam.Callek)

Linux 32 and macOS are not SETA optimized, which causes this for all tests on those platforms. The macOS pool can't keep up with the load, and autoland will be closed to reduce the backlog.

Severity: normal → blocker
Summary: Mac Talos tests are running on every push to integration branches → Linux 32 and Max shippable test schedules not SETA optimized, high test loads
Assignee

Comment 2

3 months ago

So, it looks like there are zero linux32 opt tests in LOW VALUE SETA right now, and I'm not sure how that is possible...

https://treeherder.mozilla.org/api/project/mozilla-inbound/seta/job-priorities/?build_system_type=taskcluster&priority=5

So there is no way that https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/util/seta.py#98 ends up matching it.

OSX similarly has no macosx64/opt in low value seta.

Neither low value SETA nor high value SETA has any shippable jobs specified. I'm passing this over the wall to Joel for this part of the investigation.

:jmaher, can you help identify what's going on here?

Flags: needinfo?(bugspam.Callek) → needinfo?(jmaher)

We need to hack the SETA database inside of Treeherder. There is a table there you can query from Redash (select the treeherder database):
select * from seta_jobpriority where expiration_date<'2019-05-01';

If we remove all these entries, I believe we will be good, though I am not 100% sure. The bulk of the shippable tests expire on April 13th and would then default to low value, but that is a long time from now :)

There is a 1 hour cache inside of Treeherder, so once the entries are removed the changes will not be immediate and can take up to 1 hour to show up.
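To illustrate why a change can take up to an hour to become visible, here is a minimal sketch of a time-to-live cache. The `TTLCache` class and its `loader` callback are hypothetical and are not Treeherder's actual caching code; only the 1 hour lifetime comes from the comment above.

```python
import time


class TTLCache:
    """Tiny time-to-live cache; a hypothetical stand-in, not Treeherder code."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._value = None
        self._stored_at = None

    def get(self, loader):
        """Return the cached value, calling loader() only when it is stale."""
        now = self.clock()
        if self._stored_at is None or now - self._stored_at >= self.ttl_seconds:
            self._value = loader()
            self._stored_at = now
        return self._value


# Assumed lifetime of 1 hour, as described in the comment above.
job_priority_cache = TTLCache(ttl_seconds=3600)
```

A reader that hits the cache 30 minutes after the rows were deleted would still see the old priorities; only a read after the entry expires reloads from the database.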

Flags: needinfo?(jmaher)

:kthiessen, while I have pinged you in IRC, this is the bug where we need to solve the problem. I would like to solve this today so we can go into next week with cleaned-up jobs.

Flags: needinfo?(kthiessen)

I can look at this sometime late morning Pacific today, but I'd recommend pulling :camd in as back-up, in case I can't make progress right away. I'm also uncomfortable with making direct database queries on prod without some sort of "what if" plan.

Flags: needinfo?(kthiessen)

Yeah, I'll run the cleanup on Treeherder. We came up with this process:

  1. Use db query to select jobs that will be deleted:
    SELECT * FROM seta_jobpriority WHERE expiration_date < '2019-05-01';
  2. Verify results with jmaher
  3. Create delete query based off that:
    DELETE FROM seta_jobpriority WHERE expiration_date < '2019-05-01';
  4. Test deleting from prototype
    a. Run select query
    b. Save data off to .csv in MySQLWorkbench
    c. Run delete query
    d. Test re-importing the records from the .csv
  5. Test deleting from stage
  6. Verify with jmaher that stage is working as expected
  7. Save entries locally from production to .csv
  8. Make a snapshot on RDS for production data
  9. Delete records from production
  10. Monitor
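The select, save, delete, and re-import cycle in the steps above can be rehearsed against a throwaway database first. The sketch below uses an in-memory SQLite table as a stand-in for the MySQL database; the column list is a simplified, assumed schema and the rows are made up, so this is not the real `seta_jobpriority` layout or data.

```python
import csv
import io
import sqlite3

CUTOFF = "2019-05-01"
# Simplified, assumed column set; the real table may differ.
COLUMNS = ["id", "testtype", "platform", "priority", "expiration_date"]


def make_demo_db():
    """Build an in-memory table with made-up rows (not real Treeherder data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seta_jobpriority ("
        "id INTEGER PRIMARY KEY, testtype TEXT, platform TEXT, "
        "priority INTEGER, expiration_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO seta_jobpriority VALUES (?, ?, ?, ?, ?)",
        [
            (1, "mochitest-1", "linux32-shippable", 1, "2019-04-13"),
            (2, "talos-g1", "macosx64-shippable", 1, "2019-04-13"),
            (3, "xpcshell-1", "linux64", 5, None),
        ],
    )
    return conn


def backup_rows(conn, cutoff):
    """Select the rows that would be deleted and serialize them as CSV text."""
    rows = conn.execute(
        "SELECT * FROM seta_jobpriority WHERE expiration_date < ?", (cutoff,)
    ).fetchall()
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buf.getvalue()


def delete_rows(conn, cutoff):
    """Delete the selected rows and return how many were removed."""
    cur = conn.execute(
        "DELETE FROM seta_jobpriority WHERE expiration_date < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount


def restore_rows(conn, csv_text):
    """Re-import previously saved rows from the CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    rows = [(int(r[0]), r[1], r[2], int(r[3]), r[4]) for r in reader]
    conn.executemany("INSERT INTO seta_jobpriority VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Rows with a NULL expiration date never match the `<` comparison, so the backup and the delete touch exactly the same set of rows.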

These records have been deleted. I have them saved locally, so they can be restored if need be. I'll attach them here, too.

It appears that we still have these job names in the jobpriority field. There looks to be a system that takes all unknown jobs which are not in the table and adds an expiration_date of 2 weeks; I had assumed that deleting the rows would solve the problem.
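A rough sketch of the behaviour described above, showing why deleting rows did not stick. The function name, the dictionary layout, and the assumption that re-added jobs start as high value (priority 1) are all hypothetical; only the 2 week expiration and priority 5 for low value come from this bug.

```python
from datetime import date, timedelta

HIGH_VALUE = 1  # assumed value for "high value"; not stated in this bug
LOW_VALUE = 5   # matches the priority=5 low value query above
NEW_JOB_GRACE = timedelta(weeks=2)


def sync_unknown_jobs(table, seen_jobs, today):
    """Add every job missing from the table with a 2 week expiration date."""
    for job in seen_jobs:
        if job not in table:
            table[job] = {
                "priority": HIGH_VALUE,
                "expiration_date": today + NEW_JOB_GRACE,
            }
    return table


# Hypothetical job names, for illustration only.
table = {"linux64 xpcshell": {"priority": LOW_VALUE, "expiration_date": None}}
seen = ["linux64 xpcshell", "linux32-shippable mochitest"]
sync_unknown_jobs(table, seen, today=date(2019, 3, 30))
```

Under these assumptions a deleted job is simply treated as unknown on the next sync and comes back with a fresh expiration date, which is consistent with what was observed.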

Unfortunately, I don't know the best solution for this. I would bet that we could do something like:
update seta_jobpriority set expiration_date=null, priority=5 where expiration_date<'2019-05-01'
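The proposed update can be tried on a scratch table before touching production. This is a sketch against in-memory SQLite with an assumed, simplified schema and made-up rows; it assumes the target table is `seta_jobpriority`, as in the earlier select and delete queries.

```python
import sqlite3


def make_scratch_db():
    """Create a scratch table with made-up rows (assumed, simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seta_jobpriority ("
        "id INTEGER PRIMARY KEY, testtype TEXT, "
        "priority INTEGER, expiration_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO seta_jobpriority VALUES (?, ?, ?, ?)",
        [
            (1, "mochitest-1", 1, "2019-04-13"),
            (2, "talos-g1", 1, "2019-04-13"),
            (3, "xpcshell-1", 1, "2019-06-01"),
        ],
    )
    return conn


def pin_low_value(conn, cutoff):
    """Clear the expiration date and set priority 5 on rows expiring before cutoff."""
    cur = conn.execute(
        "UPDATE seta_jobpriority SET expiration_date = NULL, priority = 5 "
        "WHERE expiration_date < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```

Unlike the delete, the rows stay in the table, so a process that re-adds unknown jobs has nothing to re-add.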

:camd, do you think we can chat today and do this work cycle one more time?

Flags: needinfo?(cdawson)

Joel and I met over Vidyo, reset the expiration date to null on those jobs, and then reran analyze_failures. It appears to have fixed the issue.

Flags: needinfo?(cdawson)

Comment 12

3 months ago
Pushed by jwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a024c4370f28
Mac Talos tests are running on every push to integration branches. r=jmaher

Comment 13

3 months ago
bugherder
Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
Summary: Linux 32 and Max shippable test schedules not SETA optimized, high test loads → Linux 32 and Mac shippable test schedules not SETA optimized, high test loads