Closed Bug 731014 Opened 12 years ago Closed 6 years ago

Add a timeline based view to see infra / time based failures

Categories

(Tree Management :: Treeherder, defect, P5)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: armenzg, Unassigned)

References

(Blocks 1 open bug)

Details

Currently, tbpl shows jobs grouped by changeset, which prevents us from finding certain information.

Sometimes we have network/hg/stage hiccups that make several jobs burn in the same timeframe.
With the current view, those jobs can be scattered across different changesets and different branches.
A view that grouped jobs by the interval in which they finished (5 minutes by default) would make this easy to see.
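To make the idea concrete, here is a minimal sketch of that kind of bucketing; the job record shape ('finish_time', 'result') and the result values are hypothetical for illustration, not tbpl's or buildapi's actual schema:

from collections import defaultdict
from datetime import datetime, timedelta

BUCKET = timedelta(minutes=5)  # the default interval proposed above

def group_jobs_by_finish_time(jobs, bucket=BUCKET):
    # Group job dicts by the interval in which they finished, regardless of
    # changeset or branch.  'finish_time' is assumed to be a datetime.
    buckets = defaultdict(list)
    step = int(bucket.total_seconds())
    for job in jobs:
        ts = int(job["finish_time"].timestamp())
        start = datetime.fromtimestamp(ts - ts % step)
        buckets[start].append(job)
    return buckets

def suspicious_intervals(buckets, min_failures=5):
    # Intervals where several jobs burned at once are likely infra hiccups.
    return {
        start: jobs
        for start, jobs in buckets.items()
        if sum(1 for j in jobs if j["result"] in ("busted", "exception")) >= min_failures
    }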

Sometimes a regression is introduced in a changeset but developers stop paying attention to it after a while. Currently it is not obvious, since there are 30-50 boxes (each one representing a job). If we could view jobs on a per-"builder name" basis, we could see the T1 job for changeset N next to the T1 job for changeset N-1 (a rough sketch follows the example below).
* e.g. Linux mochitests-1: G G G O O O <- we can see that the last 3 changesets have gone orange
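
A rough sketch of the pivot this view would need, assuming a hypothetical flat job list with 'buildername', 'changeset' and 'result' fields (again, not the real schema):

from collections import defaultdict

def builder_rows(jobs, changesets):
    # Pivot a flat job list into one row per builder name with one cell per
    # changeset (oldest first), so rows like "Linux mochitests-1: G G G O O O"
    # can be rendered.  Field names and result values are assumptions.
    symbols = {"success": "G", "testfailed": "O", "busted": "R", "exception": "P"}
    cells = defaultdict(dict)
    for job in jobs:
        cells[job["buildername"]][job["changeset"]] = symbols.get(job["result"], "?")
    return {
        builder: [per_changeset.get(cs, "-") for cs in changesets]
        for builder, per_changeset in cells.items()
    }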

Everything I have mentioned up to this point would, I believe, make it very simple for developers: they could just switch the tbpl view rather than follow URLs to yet another releng dashboard and apply filters [1].

There is also a third thing developers should be able to get to, though it is not necessarily a modification of the tbpl view:
a view showing the recent jobs of all slaves.
Right now they can load up all jobs [2] and filter, or load the jobs of a specific slave [3].
The problem is that your mind has to go into hunting mode rather than "oh, it is so obvious this slave is misbehaving".
A view that showed slaves sorted by recency of failures would make it much easier to spot a misbehaving slave (e.g. 6 of its last 8 jobs have burned).
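
A minimal sketch of how such a ranking could be computed from recent job data (the 'slave', 'finish_time' and 'result' field names are assumptions about the record shape, not buildapi's real one):

from collections import defaultdict

def misbehaving_slaves(jobs, window=8, threshold=0.5):
    # Rank slaves by how many of their most recent `window` jobs failed,
    # worst offenders first, e.g. a slave whose last 6 out of 8 jobs burned.
    by_slave = defaultdict(list)
    for job in jobs:
        by_slave[job["slave"]].append(job)

    ranked = []
    for slave, slave_jobs in by_slave.items():
        recent = sorted(slave_jobs, key=lambda j: j["finish_time"])[-window:]
        failures = sum(1 for j in recent if j["result"] not in ("success", "warnings"))
        if recent and failures / len(recent) >= threshold:
            ranked.append((slave, failures, len(recent)))
    return sorted(ranked, key=lambda r: r[1], reverse=True)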

This last request might just require modifying the buildapi recent view.

Feel free to argue about where this would make the most sense to live (briar-patch/webtools).
I can put this proposal on the mailing lists and see if people agree with this approach (I know philor would be very happy).

[1] https://build.mozilla.org/buildapi/recent
[2] https://build.mozilla.org/buildapi/recent
[3] https://build.mozilla.org/buildapi/recent/tegra-269
Product: Webtools → Tree Management
Product: Tree Management → Tree Management Graveyard
No longer blocks: 729548
Component: TBPL → Treeherder
OS: Mac OS X → All
Product: Tree Management Graveyard → Tree Management
Hardware: x86 → All
Summary: Add extra views to see failures that tbpl does not show easily → Add a timeline based view to see infra / time based failures
Version: Trunk → ---
Priority: -- → P5
With the move to putting everything in tree this would be at best a nice-to-have. Closing as WONTFIX, as we won't be doing this any time soon.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
There are two main purposes for this:
1. Check if something (mass) fails across trees.
2. Retrigger mass failures across trees and pushes.
Blocks: 1178522
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #2)
> There are two main purposes for this:
> 1. Check if something (mass) fails across trees.

Since everything is in tree, can you give an example of how this would happen that wasn't an external system failure (e.g. AWS)?

> 2. Retrigger mass failures across trees and pushes.

Is this a nice-to-have, or is it really something that the sheriffs would do? How often do you do this currently?
Flags: needinfo?(aryx.bugmail)
queue.taskcluster.net, cloud-mirror.taskcluster.net, auth.taskcluster.net, the provisioner, plus all the parts of taskcluster which haven't failed in the last week so I don't know their names; hg.m.o; hg hooks; tooltool; all the other non-taskcluster things that haven't failed in the last week so I don't remember their names, or things mostly hidden behind the IT infra-wall like bug 1430670 so we don't ever really learn names or details. One of my absolute favorites for this sort of thing is a set of tests which do not touch the network, no no that would be bad, but which do rely on having a working DNS server, so when our DNS in AWS (which is ours, run by us, not AWS's) fails, you maybe get a hint from slightly clearer failures if you have some builds in the right place to admit that they failed to resolve a hostname to upload something, but mostly you get test failures that don't look the least bit like infra until and unless you see that they started failing on inbound and autoland and beta without sharing a recent common parent.

I don't see why it matters that there are non-external-system failures that can go across trees, though, since for at least all of the taskcluster bits, finding out whether it is AWS's fault or our own fault generally happens once someone is about 60-80% of the way through debugging the failure.
What really seems to be wanted is a status page for TC so that we can see if there is an issue. I suggest raising a bug against them. Unfortunately, using Treeherder as a status page for them seems like overkill, as this view would be a non-trivial amount of work.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Flags: needinfo?(aryx.bugmail)
Resolution: --- → WONTFIX