Closed Bug 1060339 Opened 7 years ago Closed 7 years ago
"errors" is empty for successful jobs whose logs contained failure lines
In an ideal world, we would never have a successful job that contained errors which our log parser regex matched against. Unfortunately this is not the case in practice, since: 1) There is sometimes spammy log output that isn't intended to make the job fail, but which false-positive matches against our log regex. 2) Sometimes the test/harness/buildbot/... is broken and the log error is real, and it should have caused the run to fail. In the case of #1, hiding these errors in successful runs means that when someone looks at the error summary for a failed run, they don't realise that a proportion of the failures they can see are actually present in _every_ job, not just the failed ones. In the case of #2, hiding the errors means we don't realise <foo> is broken and needs fixing. Now in TBPL, whilst we didn't parse every log by default, you could request one-off log parsing by just loading the brief log for a successful job. Aiui treeherder parses every log anyway (for things like buildstep times), so parsing for errors shouldn't be too much extra infra load.
Summary: "errors" is empty for successful jobs that contained failure log lines → "errors" is empty for successful jobs whose logs contained failure lines
Example: https://treeherder.mozilla.org/api/project/mozilla-central/artifact/?job_id=306570&name=Structured+Log Which is for: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64/1409256833/mozilla-central-linux64-valgrind-bm72-build1-build83.txt.gz Which contains multiple "TEST-UNEXPECTED-FAIL"s, which show in TBPL as: https://tbpl.mozilla.org/php/getParsedLog.php?id=46971648&tree=Mozilla-Central
https://github.com/mozilla/treeherder-service/blob/2da76b9d0db770190e60356c17d105a22dbb5536/treeherder/model/derived/jobs.py#L1959 if result != 'success': task['check_errors'] = True task['routing_key'] = 'parse_log.failures' else: task['check_errors'] = False task['routing_key'] = 'parse_log.success' https://github.com/mozilla/treeherder-service/blob/2ef3b13cb1ec73d6aabc7f0d3638e75d4b22fec7/treeherder/webapp/api/job_log_url.py#L73 def parse(self, request, project, jm, pk=None): """ Trigger an async task to parse this log. This can be requested by the ui in case the log parsing had an intermittent failure """ ... has_failed = job["result"] in jm.FAILED_RESULTS ... parse_log.delay(project, log_obj["url"], job["job_guid"], job["resultset_id"], check_errors=has_failed)
Assignee: nobody → emorley
Status: NEW → ASSIGNED
Cameron, how would you like me to handle this? We currently have check_errors defaulting to both false and true at various points - shall I make them all default to true? Or should I just remove the pref entirely? https://github.com/mozilla/treeherder-service/search?q=check_errors
Ed-- Yeah, this was a performance tuning choice so that we didn't waste time looking for valid errors where there would be none. But your cases above prove that assumption wrong. We could set check_errors to true everywhere, if that's more correct. And, like you say, if it's always going to be true, we don't need it. So it could be removed if we're pretty sure we'll never want to skip error checking. This will place a heavier load on the worker parsing the regexes for errors are the bulk of the log-parsing time. And earlier on, it was overloading us. But now that it's a separate worker, perhaps we could handle that load. Or I guess I should say: we need to figure out a way to handle that load. :)
This makes us parse error lines even if the job was successful. I'll leave any cleanup (removal of check_errors) to later, when we're sure we don't need it any more. Is my understanding correct that we already parse the build step times (durations) for successful logs? If not, we need to do that too. And if we do that, I'm not sure if this really increases load that much more - given that we'll have already have had to download the gzipped log, extract it etc. Another thought - I seem to recall us talking about this before (or maybe it was with Mauro) and saying that to ease load, we could make the successful log parsing take a lower priority than the failures. It seems like we're already doing this...? https://github.com/mozilla/treeherder-service/blob/f53fd6ba0a9ebec92d8d1cca77820fe44bb96626/treeherder/model/derived/jobs.py#L1970 if result != 'success': task['routing_key'] = 'parse_log.failures' else: task['routing_key'] = 'parse_log.success' https://github.com/mozilla/treeherder-service/blob/14b3d317936e92eaccf38211334334aa052fb04e/treeherder/settings/base.py#L186 CELERY_QUEUES = ( Queue('default', Exchange('default'), routing_key='default'), # queue for failed jobs/logs Queue('log_parser_fail', Exchange('default'), routing_key='parse_log.failures'), # queue for successful jobs/logs Queue('log_parser', Exchange('default'), routing_key='parse_log.success'),
Comment on attachment 8484086 [details] [review] service PR198 - Parse logs for error lines even if the job was successful Adding jeads as a reviewer to spread the load a bit, whichever of you gets to this first can cancel the other person's review. See comment 5 for PR overview.
https://github.com/mozilla/treeherder-service/commit/a623a1fa32a7738f0cf1279e397d647c5c4823fatre This is working, eg: https://treeherder.mozilla.org/ui/logviewer.html#?job_id=360243&repo=mozilla-central
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.