Closed Bug 1394779 Opened 4 years ago Closed 4 years ago

unable to backfill or add new jobs both TC and BBB

Categories

(Taskcluster :: General, enhancement)

Tracking

(Not tracked)

RESOLVED FIXED
mozilla57

People

(Reporter: jmaher, Assigned: bstack)

Attachments

(2 files)

We are stuck on a lot of perf regressions: backfill attempts yesterday and today have resulted in no jobs being added (this affects both the 'backfill' and the 'add new jobs' actions from Treeherder).

For a talos job that uses BBB, I see this:
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=201c0c94bae0f87ce4b9af5ba21465761b0fc987&selectedJob=126717516&filter-searchStr=action


and for an android job that is 100% TC, I see this:
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=2932577f253f4d2fe8f459bb281a1d92695c417a&selectedJob=126717261&filter-searchStr=action

the action task failures are identical with this text:

[taskcluster 2017-08-29 12:07:47.966Z] === Task Starting ===

[taskcluster:error] Failure to properly start execution environment.

[taskcluster:error] (HTTP code 404) no such container - invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"exec: \\\"/builds/worker/bin/run-task\\\": stat /builds/worker/bin/run-task: no such file or directory\"\n" 
[taskcluster 2017-08-29 12:07:48.291Z] === Task Finished ===
[taskcluster 2017-08-29 12:07:48.353Z] Artifact "public" not found at "/builds/worker/artifacts"
[taskcluster 2017-08-29 12:07:48.687Z] Unsuccessful task run with exit code: -1 completed in 1.371 seconds

:garndt, can you find someone on the TC team to look into this and get it resolved?  If this isn't a TC issue, would you know which team should be working on it?  I assume TC, given the use of Go code.
Flags: needinfo?(garndt)
That's an old-style actions.yml task, so that will be going away soon.  It's using an old decision task image (0.1.7; the newest is 0.1.10).  Wander just moved everything from /home/worker to /builds/worker, but that directory does not exist on this image, or on 0.1.10.  So I think the fix is to revert that change to actions.yml.
Flags: needinfo?(garndt)
Thanks for the reply, :dustin. Will this work retroactively on the tree?  I assume so, since I don't see actions.yml in-tree.
No, it won't, but it's a one-line patch so you could push it to try.  The rename only landed yesterday, though.
Wander, as a side-note -- I see that .taskcluster.yml still has /home/worker.  Should we fix that up and generate a new decision image, so that everything is consistently /builds/worker?
Flags: needinfo?(wcosta)
Using the trick of editing an action task with s/builds/home/ and then creating a new task worked to get green action tasks.  I have jobs for the Taskcluster tests, but I do not have jobs for BBB yet; I will try a few more times there.
and this trick worked for the BBB jobs as well.
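For anyone repeating the s/builds/home/ workaround by hand, here is a minimal sketch of the same substitution applied to a task definition before resubmitting it. The helper name and the trimmed task dict are hypothetical and only illustrate the idea. The two paths come from the log in the first comment. Actually creating the edited task is left out.

```python
def swap_worker_home(value, old="/builds/worker", new="/home/worker"):
    """Return a copy of value with every occurrence of old replaced by new."""
    if isinstance(value, str):
        return value.replace(old, new)
    if isinstance(value, list):
        return [swap_worker_home(item, old, new) for item in value]
    if isinstance(value, dict):
        return {key: swap_worker_home(item, old, new) for key, item in value.items()}
    # Numbers, booleans and None are returned unchanged.
    return value


# Hypothetical, heavily trimmed task definition, for illustration only.
task = {
    "payload": {
        "command": ["/builds/worker/bin/run-task", "--", "bash", "-c", "echo hello"],
        "artifacts": {"public": {"path": "/builds/worker/artifacts"}},
        "maxRunTime": 1800,
    }
}

fixed = swap_worker_home(task)
print(fixed["payload"]["command"][0])
```

The function builds a new structure rather than editing in place, so the original definition stays available for comparison.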
(In reply to Dustin J. Mitchell [:dustin] from comment #5)
> Wander, as a side-note -- I see that .taskcluster.yml still has
> /home/worker.  Should we fix that up and generate a new decision image, so
> that everything is consistently /builds/worker?

I really don't know, that's why I kept it untouched.
Flags: needinfo?(wcosta)
Comment on attachment 8902240 [details]
Bug 1394779: decision image still uses /home;

https://reviewboard.mozilla.org/r/173766/#review179116
Attachment #8902240 - Flags: review?(wcosta) → review+
Pushed by dmitchell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6490ba9e0ec7
decision image still uses /home; r=wcosta
https://hg.mozilla.org/mozilla-central/rev/6490ba9e0ec7
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
This was only a partial fix.  It fixes action tasks that use action.yml, but for backfilling the command is hardcoded here:
https://hg.mozilla.org/integration/autoland/file/6b9d06ba6f769234530ae67d8353377d58a93fd0/taskcluster/taskgraph/actions/registry.py#l243

Either we push a change out for this as well (along with the other references to builds/worker in that file), or we can try to get bug 1394883 landed.
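To get a rough list of the remaining hardcoded references, a sketch like the following could be run over the file's text. The sample string is a hypothetical stand-in and not the real contents of registry.py, and the function name is made up for this example.

```python
def find_hardcoded_paths(source, needle="builds/worker"):
    """Return (line_number, stripped_line) pairs for lines containing needle."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if needle in line
    ]


# Hypothetical stand-in for a Python source file; NOT the real registry.py.
sample = '''\
command = [
    "/builds/worker/bin/run-task",
    "--", "bash", "-c",
    "cd /builds/worker/checkouts && echo placeholder",
]
'''

hits = find_hardcoded_paths(sample)
for number, text in hits:
    print(number, text)
```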
See Also: → 1395563
I think bug 1394883 is close to landing (of course, it won't help with regressions)
Depends on: 1395724
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
As a note, this was from pushes on September 8th; luckily, add new jobs worked!
We took a brief look into this issue but were unsure what the steps to reproduce were.  What job was being backfilled? I see some other backfill requests successfully completing.

On the failed actions, we noticed that the action task and action task ID were not filled out in the task payload (shows as "null" whereas a successful run has much more data there).

I am not sure how an action task could be scheduled without providing that information but some STR would help track it down.
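One way to spot this kind of gap when comparing a failed and a successful action task is to list every null value in the payload. The sketch below is generic. The key names in the example payload are hypothetical placeholders and are not confirmed field names from the failing tasks.

```python
import json


def find_null_fields(value, path=""):
    """Return dotted paths of every None (JSON null) value in nested data."""
    if value is None:
        return [path or "<root>"]
    found = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else str(key)
            found += find_null_fields(child, child_path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            found += find_null_fields(child, f"{path}[{index}]")
    return found


# Hypothetical payload fragment; the key names are placeholders only.
payload = json.loads(
    '{"env": {"ACTION_TASK": null, "ACTION_TASK_ID": null, "ACTION_INPUT": "{}"}}'
)

missing = find_null_fields(payload)
print(missing)
```

If the payload carries the literal string "null" rather than a JSON null, the check would need to compare against that string instead.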
Flags: needinfo?(garndt) → needinfo?(jmaher)
typically I am trying to backfill an AWSY job, here are some repro steps:
1) go to mozilla inbound and filter on awsy: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=awsy
1.5) I narrowed the range down to focus on specific revisions: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=awsy&fromchange=bc1f526a6152eb8a810c78041678b249c0906314&tochange=0e2f9e7b7fd7ab31640383e64c8b7bf4c602d828
2) select linux64 awsy: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=awsy&fromchange=bc1f526a6152eb8a810c78041678b249c0906314&tochange=0e2f9e7b7fd7ab31640383e64c8b7bf4c602d828&selectedJob=130126553
3) from the popup pane, click the '...' and click 'backfill'.
4) verify green bar with text as a dialog popup saying | Request sent to backfill job via actions.json ...|
5) look at the action tasks: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=action&fromchange=bc1f526a6152eb8a810c78041678b249c0906314&tochange=0e2f9e7b7fd7ab31640383e64c8b7bf4c602d828
6) verify we have a red Bk job

As a note, I didn't need to backfill that specific job, but went through the exercise in detail; it is not a big deal to backfill a random job.
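The Treeherder links in the steps above all follow the same query-string pattern, so they can be generated rather than copied. This is a small sketch with a hypothetical helper name. The parameter names (repo, filter-searchStr, fromchange, tochange) are taken from the URLs in the steps.

```python
from urllib.parse import urlencode

TREEHERDER_JOBS = "https://treeherder.mozilla.org/#/jobs"


def treeherder_url(repo, search, fromchange=None, tochange=None):
    """Build a Treeherder jobs URL filtered by a search string and revision range."""
    params = {"repo": repo, "filter-searchStr": search}
    if fromchange:
        params["fromchange"] = fromchange
    if tochange:
        params["tochange"] = tochange
    return f"{TREEHERDER_JOBS}?{urlencode(params)}"


# Step 5: the action tasks for the narrowed revision range.
action_url = treeherder_url(
    "mozilla-inbound",
    "action",
    fromchange="bc1f526a6152eb8a810c78041678b249c0906314",
    tochange="0e2f9e7b7fd7ab31640383e64c8b7bf4c602d828",
)
print(action_url)
```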
Flags: needinfo?(jmaher)
Assignee: nobody → bstack
Attachment #8907427 - Flags: review?(cdawson)
Thanks for the good repro steps! Figured out what I had missed the first time around. Hopefully this patch fixes it, although backfilling is difficult to test.
I've started pulse_actions back again just in case there's something the action tasks are not handling.
Let me know when you think I can shut it off again.
Attachment #8907427 - Flags: review?(cdawson) → review+
Status: REOPENED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED