Closed Bug 1060339 — Opened 7 years ago, Closed 7 years ago

"errors" is empty for successful jobs whose logs contained failure lines

Categories

(Tree Management :: Treeherder, defect, P2)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

Attachments

(1 file)

In an ideal world, we would never have a successful job whose log contained error lines that our log parser regexes matched. Unfortunately this is not the case in practice, since:

1) There is sometimes spammy log output that isn't intended to make the job fail, but which false-positive matches against our log regexes.
2) Sometimes the test/harness/buildbot/... is broken, the log error is real, and it should have caused the run to fail.

In the case of #1, hiding these errors in successful runs means that when someone looks at the error summary for a failed run, they don't realise that a proportion of the failures they can see are actually present in _every_ job, not just the failed ones.

In the case of #2, hiding the errors means we don't realise <foo> is broken and needs fixing.

Now in TBPL, whilst we didn't parse every log by default, you could request one-off log parsing by just loading the brief log for a successful job.

As I understand it, Treeherder parses every log anyway (for things like buildstep times), so parsing for errors shouldn't be too much extra infra load.
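For illustration, here is a minimal sketch of the kind of regex matching the log parser does. The patterns below are hypothetical stand-ins, not Treeherder's real pattern set:

```python
import re

# Hypothetical error-line patterns, standing in for the parser's real set.
ERROR_PATTERNS = [
    re.compile(r"^TEST-UNEXPECTED-FAIL"),
    re.compile(r"\bTraceback \(most recent call last\)"),
    re.compile(r"^\d+ ERROR "),
]

def find_error_lines(log_lines):
    """Return (line_number, line) pairs matching any error pattern."""
    return [
        (n, line)
        for n, line in enumerate(log_lines, start=1)
        if any(p.search(line) for p in ERROR_PATTERNS)
    ]

# Case #1 above: a job can emit matching lines yet still end up green.
log = [
    "Started build step",
    "TEST-UNEXPECTED-FAIL | test_foo.py | assertion failed",
    "Build finished: success",
]
errors = find_error_lines(log)
```

With `check_errors=False` for green jobs, lines like the one matched here are simply never surfaced, which is exactly the problem described above.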
Summary: "errors" is empty for successful jobs that contained failure log lines → "errors" is empty for successful jobs whose logs contained failure lines
https://github.com/mozilla/treeherder-service/blob/2da76b9d0db770190e60356c17d105a22dbb5536/treeherder/model/derived/jobs.py#L1959

    if result != 'success':
        task['check_errors'] = True
        task['routing_key'] = 'parse_log.failures'
    else:
        task['check_errors'] = False
        task['routing_key'] = 'parse_log.success'

https://github.com/mozilla/treeherder-service/blob/2ef3b13cb1ec73d6aabc7f0d3638e75d4b22fec7/treeherder/webapp/api/job_log_url.py#L73

    def parse(self, request, project, jm, pk=None):
        """
        Trigger an async task to parse this log. This can be requested by the ui
        in case the log parsing had an intermittent failure
        """
...
        has_failed = job["result"] in jm.FAILED_RESULTS
...
        parse_log.delay(project, log_obj["url"],
                        job["job_guid"], job["resultset_id"],
                        check_errors=has_failed)
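The proposed change boils down to no longer keying `check_errors` off the job result, while keeping the routing-key split so failed-job logs still go to their own queue. A hedged sketch (the function name is hypothetical, not Treeherder code):

```python
# Hypothetical sketch of the fix: always check for errors, but keep the
# per-result routing split from jobs.py quoted above.
def build_parse_task(result):
    task = {"check_errors": True}  # previously: result != 'success'
    if result != "success":
        task["routing_key"] = "parse_log.failures"
    else:
        task["routing_key"] = "parse_log.success"
    return task
```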
Assignee: nobody → emorley
Status: NEW → ASSIGNED
Cameron, how would you like me to handle this? We currently have check_errors defaulting to both false and true at various points - shall I make them all default to true? Or should I just remove the pref entirely?

https://github.com/mozilla/treeherder-service/search?q=check_errors
Flags: needinfo?(cdawson)
Ed-- Yeah, this was a performance tuning choice so that we didn't waste time looking for valid errors where there would be none.  But your cases above prove that assumption wrong.  We could set check_errors to true everywhere, if that's more correct.  And, like you say, if it's always going to be true, we don't need it.  So it could be removed if we're pretty sure we'll never want to skip error checking.

This will place a heavier load on the workers, since matching the error regexes is the bulk of the log-parsing time. And earlier on, it was overloading us. But now that log parsing is a separate worker, perhaps we could handle that load. Or I guess I should say: we need to figure out a way to handle that load. :)
Flags: needinfo?(cdawson)
This makes us parse error lines even if the job was successful. I'll leave any cleanup (removal of check_errors) to later, when we're sure we don't need it any more.

Is my understanding correct that we already parse the build step times (durations) for successful logs? If not, we need to do that too. And if we do, I'm not sure this really increases load that much more, given that we'll already have had to download the gzipped log, extract it, etc.

Another thought - I seem to recall us talking about this before (or maybe it was with Mauro) and saying that to ease load, we could make the successful log parsing take a lower priority than the failures. It seems like we're already doing this...?

https://github.com/mozilla/treeherder-service/blob/f53fd6ba0a9ebec92d8d1cca77820fe44bb96626/treeherder/model/derived/jobs.py#L1970
                if result != 'success':
                    task['routing_key'] = 'parse_log.failures'
                else:
                    task['routing_key'] = 'parse_log.success'

https://github.com/mozilla/treeherder-service/blob/14b3d317936e92eaccf38211334334aa052fb04e/treeherder/settings/base.py#L186

    CELERY_QUEUES = (
        Queue('default', Exchange('default'), routing_key='default'),
        # queue for failed jobs/logs
        Queue('log_parser_fail', Exchange('default'), routing_key='parse_log.failures'),
        # queue for successful jobs/logs
        Queue('log_parser', Exchange('default'), routing_key='parse_log.success'),
    )
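That queue split is what gives failed-job logs effective priority: a dedicated worker can drain `log_parser_fail` independently of `log_parser`. A minimal stand-alone sketch of the routing-key-to-queue mapping (plain Python, no Celery required; queue names mirror the setting above):

```python
# Map routing keys to queue names, mirroring the CELERY_QUEUES setting.
ROUTING_TABLE = {
    "default": "default",
    "parse_log.failures": "log_parser_fail",
    "parse_log.success": "log_parser",
}

def queue_for(routing_key):
    """Return the queue a task with this routing key lands on."""
    return ROUTING_TABLE.get(routing_key, "default")
```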
Attachment #8484086 - Flags: review?(cdawson)
Comment on attachment 8484086 [details] [review]
service PR198 - Parse logs for error lines even if the job was successful

Adding jeads as a reviewer to spread the load a bit, whichever of you gets to this first can cancel the other person's review.

See comment 5 for PR overview.
Attachment #8484086 - Flags: review?(jeads)
Attachment #8484086 - Flags: review?(cdawson)
Attachment #8484086 - Flags: review+
Blocks: 1078396