SETA had a 3+ hour break in results causing a longer tree closure and bigger headache for sheriffs

RESOLVED WONTFIX

People

(Reporter: jmaher, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Description

3 years ago
SETA currently forces a skipped job to run after 60 minutes or on every 7th push. Nigel saw an instance this morning on the trees where we had 3+ hours with no data for a given test.
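
To make the expected behaviour concrete, here is a minimal sketch of the rule as stated above. It is not the actual SETA or buildbot code: the function name, the parameter names, and the reading that whichever limit is hit first forces the job are assumptions for illustration only.

```python
from datetime import datetime, timedelta

# Limits taken from the description above; the constant names are hypothetical.
MAX_SKIP_TIME = timedelta(minutes=60)
EVERY_NTH_PUSH = 7


def should_force_run(pushes_since_last_run, last_run_time, now):
    """Return True when a job that would otherwise be skipped must run.

    pushes_since_last_run counts pushes since the job last ran,
    including the current push (assumed interpretation of "every 7th push").
    """
    waited_too_long = now - last_run_time >= MAX_SKIP_TIME
    too_many_pushes = pushes_since_last_run >= EVERY_NTH_PUSH
    return waited_too_long or too_many_pushes


t0 = datetime(2015, 10, 7, 9, 0, 0)  # arbitrary example timestamp
print(should_force_run(3, t0, t0 + timedelta(minutes=30)))  # neither limit hit
print(should_force_run(3, t0, t0 + timedelta(minutes=61)))  # time limit hit
print(should_force_run(7, t0, t0 + timedelta(minutes=10)))  # push limit hit
```

Under this reading, a gap of several hours with no run of a given job should not occur, which is why the observed gap looks like a bug rather than expected skipping.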

We need to figure out why this is the case.

If we can get the jobname and the original data points we can look to see if there is a specific reason why this happened or find a bug in code somewhere.

Comment 1

3 years ago
Nigel: What is the jobname and branch where the test did not run?
Flags: needinfo?(nigelbabu)

Comment 2

3 years ago
Sorry about the delay in responding. I kinda went afk after sheriffing yesterday.

First thing I want to make sure is that it's actually SETA and not me mistaking the usual coalescing for SETA.

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&tochange=005264192a61&filter-searchStr=linux%20%28bc7%29&fromchange=1d27bae5a64e

a23e9cbf415d (Wed Oct 7, 9:28:12 (my timezone, sorry)) has a linux mochitest bc7 job.

The next instances of this job happening without my intervention are 3a4fb0ededfd (Wed Oct 7, 11:44:35) and e303cb8adc20 (Wed Oct 7, 12:06:31). There were 5 pushes in between without a mochitest bc7 job on Linux, including the one which caused the bustage. This had us close the tree for several hours so backfill could give us an idea of where to look for the bustage. The gap is actually about 2 hours 15 minutes. I couldn't math yesterday.
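
For reference, the gap can be checked from the two timestamps quoted above. This is only a quick arithmetic sketch; it assumes both times are on the same day and in the same timezone, as given in this comment, and the variable names are made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"

# Timestamps copied from this comment (Wed Oct 7, local time).
last_run = datetime.strptime("09:28:12", FMT)  # a23e9cbf415d
next_run = datetime.strptime("11:44:35", FMT)  # 3a4fb0ededfd

gap = next_run - last_run
print(gap)  # 2:16:23

# Compare against the 60-minute limit mentioned in the description.
print(gap > timedelta(minutes=60))  # True
```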
Flags: needinfo?(nigelbabu)
(Reporter)

Comment 3

3 years ago
I see what you mean; still, 2+ hours is a long time. Luckily bc7 takes only ~15 minutes, which makes finding the root cause faster, and it was failing really fast in this specific case. If only our builds were faster, we could have verified the backout in a shorter amount of time!

bc7 is not required now for SETA, because in:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&tochange=005264192a61&filter-searchStr=linux%20%28bc7%29&fromchange=1d27bae5a64e

we see that bc6 and bc1-e10s are also related (according to the stars), so we should have caught this as linux32 opt+debug should always be running bc1-e10s, but looking at treeherder it didn't run:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&tochange=005264192a61&filter-searchStr=linux%20opt%20mochitest%20e10s%20browser%20chrome%20m-e10s%28bc1%29&fromchange=1d27bae5a64e

This seems like a bug in the coalescing code. :kmoir, can you offer some insight into the buildbot side of things?
Flags: needinfo?(kmoir)

Comment 4

3 years ago
So I looked at the scheduler logs and the scheduler db. I don't know what caused the problem, and I can't see where it occurred. There are some known issues with SETA and weird scheduling that I've seen before; see bug 1174870 for another example.
Flags: needinfo?(kmoir)
(Reporter)

Comment 5

2 years ago
As we haven't had more issues like this for the last 14 months, I am going to mark this as resolved/wontfix.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WONTFIX