Closed Bug 852357 Opened 8 years ago Closed 4 years ago

Reporting System for Build/Infrastructure Issues

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: k0scist, Unassigned)

Details

(Whiteboard: [buildfaster:?])

CC: :bhearsum, :edmorley, :RyanVM, :jgriffin, :jmaher

As part of a build (in the buildbot sense used throughout, so this
could also be a test run), the slave environment is inspected and
modified to set up (and sanity-check) the build steps. For example, an
existing hg clone could be updated or, given insufficient disk space or
a bad repo state, cloned afresh.

With our existing infrastructure, if there are issues with setup
steps, there are basically two (easy) possibilities:

1. turn the job orange - you broke it!
2. print to TinderboxPrint - but it is likely no one would see it

Some cases let the build proceed safely: a non-preferred fallback was
used, an actual result differed from the expected one, or statistics
(e.g. timings) are gathered for later action. Turning the job orange
may be overkill for these, but it may still be desirable to note the
issue somewhat more visibly than via TinderboxPrint.

Disclaimer: this is a blue sky idea; it is also not (necessarily) a
trivial project.  I also don't know to what extent this already exists
in the build system or to what extent anyone cares about it.  This is
a rough proposal at best, if it's useful, not a request.

Possible use cases:
* noting issues that may require machine reimaging/maintenance
  (e.g. disk space)
* noting systematic timings/slowdowns (and other machine stats)
* noting prevalence of non-fatal problems or potential problems
* noting (excessive?) number of retries (e.g. hg.m.o timeouts)
* noting slow downloads
* noting when a fallback method is used that takes longer than the
  preferred case or is otherwise less desirable

The no-tech (or at least no-infra) solution is to have each particular
piece that is cared about generate and send its own notification.  A
more complete solution would entail a universal way of noting that
there is an issue (and what it is) as well as a place to put it. Note
that while the no-tech solution is easy for a particular case, multiple
cases will involve copy+pasting code and will probably discourage
notifying on a particular issue, since each is roll-your-own.  At the
other end of the spectrum, a precisely tailored solution will involve
an excessive amount of time to spec and craft.  Both extremes give
perfect elasticity, though some middle ground is likely more pragmatic
in terms of overall gain.
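
To make the "universal way of noting that there is an issue" a little more concrete, here is a minimal sketch in Python. Nothing in it exists in the build system as far as I know: the names `BuildIssue` and `IssueLog`, the field names, and the slave name `example-slave-01` are all hypothetical and chosen only for illustration.

```python
import json
import time
from dataclasses import dataclass, field, asdict


@dataclass
class BuildIssue:
    """One non-fatal problem noticed during build setup (hypothetical schema)."""
    kind: str        # short category, e.g. "low-disk-space", "hg-retry"
    message: str     # human-readable description of what happened
    slave: str       # machine the build ran on
    data: dict = field(default_factory=dict)        # free-form stats (timings, counts)
    timestamp: float = field(default_factory=time.time)


class IssueLog:
    """Collects issues during a build without affecting the build result."""

    def __init__(self):
        self.issues = []

    def note(self, kind, message, slave, **data):
        # Record the issue and hand it back; never raise, never turn the job orange.
        issue = BuildIssue(kind=kind, message=message, slave=slave, data=data)
        self.issues.append(issue)
        return issue

    def to_json_lines(self):
        # One JSON object per line, so records are easy to append and harvest later.
        return "\n".join(json.dumps(asdict(i), sort_keys=True) for i in self.issues)


if __name__ == "__main__":
    log = IssueLog()
    log.note("fallback-used", "hg update failed; cloned afresh",
             slave="example-slave-01", seconds=412)
    print(log.to_json_lines())
```

The point of the sketch is only that each setup step calls one shared function instead of rolling its own notification; where the records end up is a separate question.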

Noting the issue could be done in any number of ways, for example
scanning the logs, POSTing to some service (e.g. bugzilla), emailing
some parties, uploading a file (somewhere), publishing to pulse, or
leveraging TinderboxPrint or similar and/or the TBPL 2.0 equivalent
thereof via an additional piece.
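
Since several delivery mechanisms are plausible, one option is to keep them pluggable. The sketch below is an assumption-laden illustration, not an existing API: `log_sink`, `file_sink`, and `report` are hypothetical names, an issue is just a plain dict, and the `build-issue` text after the `TinderboxPrint:` prefix is an invented convention. Only a log line and a local file are shown; POSTing to a service or emailing would be further sinks of the same shape.

```python
import json
import sys


def log_sink(issue, stream=None):
    """Write the issue as a TinderboxPrint-style log line for later log scanning."""
    stream = stream if stream is not None else sys.stdout
    stream.write("TinderboxPrint: build-issue %s: %s\n"
                 % (issue["kind"], issue["message"]))


def file_sink(path):
    """Return a sink that appends each issue to `path` as one JSON line."""
    def sink(issue):
        with open(path, "a") as f:
            f.write(json.dumps(issue, sort_keys=True) + "\n")
    return sink


def report(issue, sinks):
    """Send one issue to every sink; a failing sink must never break the build."""
    failures = []
    for sink in sinks:
        try:
            sink(issue)
        except Exception as exc:
            # Remember the failure but keep going: reporting is best-effort.
            failures.append((sink, exc))
    return failures


if __name__ == "__main__":
    issue = {"kind": "slow-download", "message": "tooltool fetch took 300s"}
    report(issue, [log_sink, file_sink("build-issues.jsonl")])
```

Swallowing sink errors is deliberate here: the whole premise is that these issues should not turn the job orange, so the reporting path should not be able to do so either.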

The place to put it could be bugzilla (likely, since it is our issue
tracker, whether or not it is ideal for this particular purpose, and
this class of bugs could be harvested to make an additional
dashboard), a mailing list, nagios, or a yet-to-exist web service.

IFF this is something worth doing, first steps would be prioritizing
based on added value and deciding the actual form of the solution
based on a convolution of (need) and (bang for the buck).

Idea from https://bugzilla.mozilla.org/show_bug.cgi?id=851270#c31
The new treeherder generic metadata fields sound like a good place to store this; all we then need is a UI for it, separate from the normal treeherder-ui view.
Product: mozilla.org → Release Engineering
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE
Component: General Automation → General