TC jobs occasionally fail and don't get a parsed failure in Treeherder.

Status: RESOLVED FIXED (defect)
(Reporter: KWierso, Assigned: garndt)

Attachments

(1 attachment)

This job failed: https://public-artifacts.taskcluster.net/OxYnpNZFSmyTF3ZKRwJpyw/0/public/logs/live_backing.log

All treeherder sees is "Failure summary is empty" in the failure summary pane.

Looking in the log, the very last line is:
[taskcluster] Unsuccessful task run with exit code: 141 completed in 312.984 seconds

I'm not seeing the actual failure in the log, but it really should print something that treeherder can pick up and use as a failure summary.
Not sure if this should block switching TC stuff to tier-1, but I'll throw it on the pile for someone else to look at...
Blocks: 1243024
Then there's this one: https://public-artifacts.taskcluster.net/fUFD14ZfTWKxg0xyup3Mww/0/public/logs/live_backing.log

No summary gets shown in Treeherder. Bottom of the log has a line with:
92  657k   92  611k    0     0    115      0  1:37:37  1:29:56  0:07:41     0[taskcluster:error] Task timeout after 5400 seconds. Force killing container.

Followed by:
[taskcluster] Unsuccessful task run with exit code: -1 completed in 5402.56 seconds


Should that first bit have been split into two separate lines so that "[taskcluster:error]" was at the start of a new line, and would that have made it get parsed on Treeherder?
This seems pretty important; we should either block on it or fix it ASAP after we switch (i.e., not Q2).

I assume the problem here is how status gets reported to Treeherder.
Looking at the task in the bug description, it looks like the task failed because curl couldn't download the file.  As far as getting an error into the logs, that should be handled by the script running in the task.  If we want that error to be parsed and displayed in Treeherder, it must conform to one of Treeherder's error-parsing regexes (prefixing a line with [<component>:error] will get it parsed).
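To illustrate, a task script could route failure messages through a small helper that emits them in that form.  This is a hypothetical helper, not existing task code; only the `[<component>:error]` prefix convention comes from the comment above:

```python
import sys

def log_error(message, component="task"):
    # Hypothetical helper: write the message with the "[<component>:error]"
    # prefix at the start of the line, which is the form Treeherder's
    # failure-summary parser is described as picking up.
    sys.stdout.write("[%s:error] %s\n" % (component, message))

# e.g. after a failed download:
log_error("curl: (18) transfer closed with 187040 bytes remaining to read")
```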

As for comment 2, it does look like there was a race between writing the last line from the task and writing the line when the task was killed.  We probably could/should write a newline before that error.  Usually the force-kill message appears on its own line and gets parsed, but clearly that didn't happen in this case.
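A sketch of that fix (the actual log writer in the worker is not Python; this only shows the shape of the change):

```python
def write_task_error(stream, message):
    # Lead with "\n" so the error starts a fresh line even if the task's
    # last write (e.g. curl's progress meter) had no trailing newline.
    # Without this, "[taskcluster:error]" can land mid-line and never
    # match a start-of-line error regex.
    stream.write("\n[taskcluster:error] %s\n" % message)
```

With the extra newline, the force-kill message from comment 1 would have started its own line instead of being appended to curl's progress output.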
I had thought this was an intermittent/perma failure for a test case/build.  If this is core harness/infra, I would vote this down from must have to should have- as in we can fix the harness issues, but core sheriffing would not be affected the majority of the time.  :kwierso, do chime in if you feel I am off base.
Bug 1231200 covers converting some more of the shell-script errors into things treeherder will show.
(In reply to Joel Maher (:jmaher) from comment #5)
> I had thought this was an intermittent/perma failure for a test case/build. 
> If this is core harness/infra, I would vote this down from must have to
> should have- as in we can fix the harness issues, but core sheriffing would
> not be affected the majority of the time.  :kwierso, do chime in if you feel
> I am off base.

Yeah, at the moment, I either star and move on with my day, or star then retrigger then move on with my day. It's possible these could hide some actual bustage, but it's unlikely that it would happen frequently enough to truly make things unreasonably bad.
I'm not reading the visibility policy's failure output requirements strictly enough to say these non-parsed instances mean the tc jobs can't be tier-1, though I'll defer to the other sheriffs on this if they disagree.
I already filed this as bug 1254063 a couple days ago.
Depends on: 1231200
Duplicate of this bug: 1254063
From bug 1254063:

---

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=Firefox%20UI&filter-tier=2&filter-tier=3&fromchange=71c56930fb7f&selectedJob=23106936

This job has no failure summary set, so you are forced to manually open the log and check for the failure. In this case it is because of:

Failed to start Xvfb, retry: 2
+ sleep 2
+ '[' 2 -gt 2 ']'
+ xvinfo
xvinfo:  Unable to open display :0

---

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=Firefox%20UI&filter-tier=2&filter-tier=3&fromchange=41d7e0f6b503&selectedJob=23190672

 curl --fail -o mozharness.zip --retry 10 -L $MOZHARNESS_URL
 + curl --fail -o mozharness.zip --retry 10 -L https://queue.taskcluster.net/v1/task/VwFh2hFKQH-6maIm5zyNUQ/artifacts/public/build/mozharness.zip
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
[..]
 72  657k   72  475k    0     0   6045      0  0:01:51  0:01:20  0:00:31     0
 72  657k   72  475k    0     0   6045      0  0:01:51  0:01:20  0:00:31     0

curl: (18) transfer closed with 187040 bytes remaining to read

---

The first of those is
 https://tools.taskcluster.net/task-inspector/#e6AI4R1lQkOiW6LfiBiwjA/0
and the second
 https://tools.taskcluster.net/task-inspector/#WRuZ43i_TI2z1p2yQOCo9Q/0

In the first case, it looks like xvfb was the issue, and I don't see any more useful information - it looks like it just died.  In the second case, the curl download failed, always a possibility with long HTTP downloads.
Assignee: nobody → dustin
So I think the ask here is better compliance with "Failures must appear in the Treeherder failure summary in order to avoid having to open the full log for every failure."  I'm not sure what that means, though.  https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy linked to a utils.py that didn't seem relevant, but going back to a refactor it seems like the interesting stuff was moved to https://github.com/mozilla/treeherder/blob/master/treeherder/model/error_summary.py.  Even then, I don't really see what the format is (I don't see where the 'all_errors' attribute of an artifact blob is set).

From context, I'm guessing it's looking for lines matching /^\[(?<component>.*):error\]/
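In Python terms, that guessed pattern behaves like the following (a sketch only; the authoritative regexes live in treeherder's error_summary.py):

```python
import re

# Python spelling of the guessed pattern /^\[(?<component>.*):error\]/
pattern = re.compile(r"^\[(?P<component>.*):error\]")

ok = pattern.match("[taskcluster:error] Task timeout after 5400 seconds.")
print(ok.group("component"))  # -> taskcluster

# The comment-1 failure: curl's progress meter left the prefix mid-line,
# so the anchored ^ never matches and the line is dropped.
mid_line = pattern.match("  0:07:41     0[taskcluster:error] Task timeout")
print(mid_line)  # -> None
```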

So more specifically, we're looking at having all possible failures include a log artifact with a line matching that pattern.  I'm not sure "all" is achievable, but certainly bug 1231200 addresses a class of failure modes (including Henrik's first comment), and including an extra newline before the timeout error will help with comment 1.  Comment 0 and Henrik's second comment are the same, and can also be addressed in bug 1231200.

But more generally, I think this will have to be an ongoing process - we've identified a number of failure modes, but it's impossible to ever say we've identified all failure modes.  As we discover more, we'll make sure they're identified correctly.

Wes, did I get my guesses correct at the top?  If so, I'll edit the wiki to match.  If not, please edit it to enlighten me :)
Flags: needinfo?(wkocher)
Worth noting, I would have thought curl's --retry option would retry the mozharness download :(
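One plausible explanation (an assumption, not verified against the curl version on these workers) is that `--retry` only treats certain errors as transient, and a connection closed mid-transfer (curl exit 18) may not qualify.  A belt-and-braces loop around the whole invocation would cover it; this sketch is illustrative, with a `run` hook injected so it can be exercised without the network:

```python
import subprocess
import time

def fetch_with_retries(url, dest, attempts=10, delay=5, run=None):
    # Retry on ANY non-zero exit, since curl's --retry may not cover a
    # mid-transfer close (exit 18, "transfer closed with ... remaining").
    # `run` is injectable for testing; by default it shells out to curl.
    if run is None:
        run = lambda: subprocess.call(
            ["curl", "--fail", "--retry", "10", "-L", "-o", dest, url])
    for attempt in range(1, attempts + 1):
        if run() == 0:
            return True
        print("[task:error] download attempt %d/%d failed" % (attempt, attempts))
        if attempt < attempts:
            time.sleep(delay)
    return False
```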
Attachment #8729237 - Flags: review?(garndt) → review+
> Wes, did I get my guesses correct at the top?  If so, I'll edit the wiki to match.  If not, please edit it to enlighten me :)

Yeah, looks like it was already changed.
Flags: needinfo?(wkocher)
Greg will be deploying this.
Assignee: dustin → garndt
New AMI started rolling out yesterday and all workers should be cycled by the weekend.  Considering this resolved.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED