Closed Bug 1188363 Opened 7 years ago Closed 7 years ago

Provide error messages that treeherder can use for display and bug suggestions

Categories

(Taskcluster :: Services, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: garndt, Assigned: garndt)

Attachments

(1 file)

53 bytes, text/x-github-pull-request
garndt
: review+
There are some scenarios where Taskcluster will resolve a task run in a way that results in no failure summary within Treeherder [1].  

I spoke to Ed last week to understand more about how errors are currently reported to Treeherder.  This bug documents the ways that taskcluster can resolve a task that result in no usable output, and serves as a tracking bug for resolving those issues.  The solutions could live in taskcluster, libraries/utilities used by tasks, and/or treeherder.

Also, this is an opportunity to either discuss here or spawn another bug to bikeshed ideas of how we could integrate treeherder and taskcluster in a better way moving forward.  Anything proposed below is really an interim solution, and we should come up with a more robust way of reporting errors and other information to treeherder.  My understanding is that treeherder was engineered on the assumption that either there is always a log present that can be parsed, or a parsed log summary is provided by including a text_log_summary artifact when posting the job collection.  There are situations where log parsing might not be possible or ideal, and hopefully we can find ways of providing this information without relying on the logs. This could also allow taskcluster components other than the worker to post information about a job if they know something.

Current known scenarios:
1. Task was terminated prior to the task container being started (canceled, malformed payload, scope issues, features not starting).  Logging is enabled, and an error is written to the log and uploaded; however, because there are no buildbot steps, Treeherder does not parse the errors out.
    * Treeherder could fake buildbot steps for clients that are not buildbot and do not log in a buildbot manner.
    * Docker-worker could make sure that the error message being provided can be parsed by one of the treeherder regexes

2. Task started but was terminated because max runtime was exceeded.  Error message is put into the logs but either no buildbot steps are present (because the task did not reach that point yet), or a finished buildbot step is not present (because the task container was killed).
   * Same solutions as #1 I believe.

3. Task for some reason has no log at all and was resolved because of a failure/exception.
   * Ensure that docker-worker is attempting to have logs as soon as possible so this doesn't happen on the worker side
   * mozilla-taskcluster knows of a few scenarios that could result in this behavior (such as canceled while pending, deadline exceeded before any run claimed) and will fake a log and include errors in the all_errors section of the text_log_summary.  This also needs to set parsed_status to "failed" in the text_log_summary so treeherder doesn't try to do any kind of parsing.  My understanding is that parsing of the log is an either/or type thing (either the client does it or treeherder, but not both).
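
A minimal sketch of the "fake summary" idea in scenario 3. The field names `all_errors` and `parsed_status` come from the description above; everything else (the `build_fake_summary` helper, the `line` key, the reason strings and the `[taskcluster:error]` prefix) is hypothetical and is not the real text_log_summary schema:

```python
from typing import Any, Dict

# Hypothetical mapping from a resolution reason to a readable message.
REASON_MESSAGES = {
    "canceled": "Task was canceled before any run started",
    "deadline-exceeded": "Task deadline exceeded before any run was claimed",
}


def build_fake_summary(reason: str) -> Dict[str, Any]:
    """Build a summary for a task that was resolved without producing a log."""
    message = REASON_MESSAGES.get(reason, f"Task resolved without a log: {reason}")
    return {
        # Per the discussion above, "failed" is meant to stop treeherder from
        # attempting its own parse; the best value is still an open question.
        "parsed_status": "failed",
        # Assumed shape: one entry per error, with no log line behind it.
        "all_errors": [{"line": f"[taskcluster:error] {message}"}],
    }


summary = build_fake_summary("canceled")
print(summary["all_errors"][0]["line"])
```

Unknown reasons fall back to a generic message, so a summary is always produced even for scenarios that have not been enumerated yet.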




[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1147867
Ed, I think I summarized what we spoke about previously but perhaps you could provide some clarity if I wasn't correct or clear enough.

Also, perhaps we could start discussing the possibility of adding endpoints or some other mechanism for reporting errors or information about jobs that does not rely on text_log_summary only.  I believe it was mentioned the text_log_summary was created for the use of a client providing log and error information, but that relies on the fact that there is a log and that log needs to be parsed in some treeherder compatible way.  We did discuss moving the log parsing logic into the client libraries to help facilitate this but we were looking for a way of providing these errors if we know about them without needing to parse a log.

Thanks for your help with this!
Flags: needinfo?(emorley)
Depends on: 1174557
Blocks: 1147867
Depends on: 1188444
(In reply to Greg Arndt [:garndt] from comment #0)
>     * Treeherder could fake buildbot steps for clients that are not buildbot
> and do not log in a buildbot manner.

Yeah, we can add support to the Treeherder log parser, for handling error lines that fall outside of a "step".

I've filed bug 1188363.
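
As a rough sketch of what "handling error lines that fall outside of a step" could look like: the parser below tracks the current step, but still records an error when no step is open. The step markers and the error regex are made up for illustration and are not Treeherder's actual patterns; `collect_errors` is a hypothetical name.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical markers: real buildbot step headers and Treeherder regexes differ.
STEP_START = re.compile(r"^=+ Started (?P<name>.+?) =+$")
STEP_END = re.compile(r"^=+ Finished (?P<name>.+?) =+$")
ERROR_LINE = re.compile(r"\[taskcluster:error\]|^ERROR\b")


def collect_errors(lines: List[str]) -> List[Tuple[Optional[str], int, str]]:
    """Return (step name or None, 1-based line number, text) for each error line."""
    current_step: Optional[str] = None
    errors = []
    for number, line in enumerate(lines, start=1):
        start = STEP_START.match(line)
        if start:
            current_step = start.group("name")
            continue
        if STEP_END.match(line):
            current_step = None
            continue
        if ERROR_LINE.search(line):
            # Errors are kept even when current_step is None, i.e. outside any step.
            errors.append((current_step, number, line))
    return errors
```

With this shape, a worker error logged before the task container starts (scenario 1) and a timeout that kills a step before its end marker is written (scenario 2) are both reported, with the step name left as `None` or as the unfinished step respectively.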

>    * mozilla-taskcluster knows of a few scenarios that could result in this
> behavior (such as canceled while pending, deadline exceeded before any run
> claimed) and will fake a log and include errors in the all_errors section of
> the text_log_summary.  This also needs to set parsed_status to "failed" in
> the text_log_summary so treeherder doesn't try to do any kind of parsing. 
> My understanding is that parsing of the log is an either/or type thing
> (either the client does it or treeherder, but not both).

Yeah this roughly sounds right - we'll need to check what the best parse_status value is to use here - or even add a new status of "missing" or something.

(In reply to Greg Arndt [:garndt] from comment #1)
> Also, perhaps we could start discussing the possibility of adding endpoints
> or some other mechanism for reporting errors or information about jobs that
> does not rely on text_log_summary only.  I believe it was mentioned the
> text_log_summary was created for the use of a client providing log and error
> information, but that relies on the fact that there is a log and that log
> needs to be parsed in some treeherder compatible way.  

So text_log_summary is precisely _for_ this purpose. (Perhaps a less confusing name would be "text_job_summary"; note: the "text" is there to differentiate it from json structured log formats).
And as such, there is already an API for providing this (the artifacts endpoint; see bug 1080760).

What may require tweaking (beyond the changes to the schema of text_log_summary since it's awful at the moment; see bug 1078450) is making the "line number" parts of the text_log_summary artifact optional.
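
To illustrate the "optional line number" tweak, here is a small sketch of an error entry whose line number may be absent. The `ErrorEntry` type, its field names and `to_artifact_entry` are assumptions for illustration only; the real text_log_summary schema is not shown in this bug.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class ErrorEntry:
    line: str
    # None means the error was reported directly, with no log line behind it.
    linenumber: Optional[int] = None


def to_artifact_entry(entry: ErrorEntry) -> Dict[str, Any]:
    """Serialize an entry, omitting the line number when there is none."""
    data = asdict(entry)
    if data["linenumber"] is None:
        del data["linenumber"]
    return data
```

Omitting the key, rather than sending a placeholder such as 0, keeps "no line number" distinguishable from a real position in a log.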

> We did discuss moving
> the log parsing logic into the client libraries to help facilitate this but
> we were looking for a way of providing these errors if we know about them
> without needing to parse a log.

Yeah - ideally Treeherder does not want to parse logs *at all* in the future - since (a) scraping logs sucks, we should use the json logs, and (b) it's much more scalable to have the individual taskcluster workers do this work, and not have to have treeherder be able to deal with spikes in load.

Once more and more test harnesses switch to using "structured" logs (i.e. machine-readable json logs), treeherder can use those directly, with the help of packages such as mozlog. This just leaves the infra-specific parts of the log (i.e. outside the test harness), which are better handled by taskcluster, or whatever else is submitting the job, itself. It may end up still being the case that some parts need to parse logs, and for those parts to share some common library for doing so (that may or may not be the log parser broken out of Treeherder).
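
A sketch of why structured logs remove the need for scraping: with one json object per line, unexpected results can be picked out by field rather than by regex. The field names used here (`action`, `test`, `status`, `expected`) are an assumption about the structured log format, and `unexpected_results` is a hypothetical helper that only uses the standard `json` module rather than mozlog itself.

```python
import json
from typing import Any, Dict, Iterable, List


def unexpected_results(raw_lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Collect test results whose status differs from the expected one."""
    failures = []
    for raw in raw_lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Not structured output; left to infra-level handling.
        if not isinstance(record, dict):
            continue
        if record.get("action") not in ("test_status", "test_end"):
            continue
        expected = record.get("expected")
        # Assumption: "expected" is only present when it differs from "status".
        if expected is not None and expected != record.get("status"):
            failures.append(
                {"test": record.get("test"), "status": record.get("status"), "expected": expected}
            )
    return failures
```

Lines that are not valid json are skipped rather than treated as errors, which matches the split described above: the harness output is consumed directly, and the infra-specific remainder is handled by whoever submits the job.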
Flags: needinfo?(emorley)
Rail, looping you in on this to see if you have any opinions...
Flags: needinfo?(rail)
Sorry for the long silence, but TBH I still haven't got any better idea for this. One thing that I'd like to mention is that we shouldn't be too treeherder-specific, the solution should be generic enough to be used by treeherder or something else (tbpl?! :) )
Flags: needinfo?(rail)
Blocks: 1182491
Attached file GH PR 163
Was reviewed on github and r+ by jonasfj
Attachment #8656160 - Flags: review+
Assignee: nobody → garndt
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Integration → Services