Closed Bug 1301505 Opened 4 years ago Closed 4 years ago

Regression on action tasks - task_dict KeyError

Categories

(Taskcluster :: General, defect)

Tracking

(Not tracked)

RESOLVED FIXED
mozilla51

People

(Reporter: armenzg, Assigned: armenzg)

Attachments

(1 file)

kats was trying to add TaskCluster jobs to this push [1].
The action task jobs turned red (system transparency ftw!).
It seems that there's been some sort of regression. I can reproduce locally [2][3].

[1]
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b12674382576&group_state=expanded

[2]
https://public-artifacts.taskcluster.net/f7L5TCLiRcuQkZZC-lCnxg/0/public/logs/live_backing.log

[3] 
armenzg@armenzg-thinkpad:~/repos/firefox$ hg id && ./mach --log-no-times taskgraph action-task --decision-id=fxN3ej-mSOK1yjgF8nnkqQ --task-label=android-test-android-4.3-arm7-api-15/debug-reftest-6
938ce16be25f tip
Starting new HTTPS connection (1): queue.taskcluster.net
"GET /v1/task/fxN3ej-mSOK1yjgF8nnkqQ/artifacts/public/full-task-graph.json HTTP/1.1" 303 29
Starting new HTTPS connection (1): public-artifacts.taskcluster.net
"GET /fxN3ej-mSOK1yjgF8nnkqQ/0/public/full-task-graph.json HTTP/1.1" 200 4189023
Traceback (most recent call last):
  File "/home/armenzg/repos/firefox/taskcluster/mach_commands.py", line 187, in taskgraph_action
    return taskgraph.action.taskgraph_action(options)
  File "/home/armenzg/repos/firefox/taskcluster/taskgraph/action.py", line 36, in taskgraph_action
    all_tasks, full_task_graph = TaskGraph.from_json(full_task_json, options['root'])
  File "/home/armenzg/repos/firefox/taskcluster/taskgraph/taskgraph.py", line 76, in from_json
    tasks[key] = task_kind.from_json(value)
  File "/home/armenzg/repos/firefox/taskcluster/taskgraph/task/base.py", line 108, in from_json
    task=task_dict['task'])
  File "/home/armenzg/repos/firefox/taskcluster/taskgraph/task/nightly_fennec.py", line 28, in __init__
    self.task_dict = kwargs.pop('task_dict')
KeyError: u'task_dict'
dustin: any suggestions on how to solve this?
I'm happy to provide a patch.

I wonder if this has been broken since last week.
I wonder what we can do to prevent future regressions like this. Perhaps run the mach target with --dry-run as part of some tc tests? Or the gecko decision task?
Flags: needinfo?(dustin)
I've been noticing similar things.  I think the fundamental issue is that the to_json/from_json logic is fragile.  I have some longer-term ideas of how to fix this, but for the moment if this is just caused by the nightly-fennec kind, it might be easiest to just wait until it's deleted (it's very temporary).  Jordan can give you a timeline for that.
Flags: needinfo?(dustin)
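The fragility described above can be illustrated with a minimal sketch. The names `BaseTask`, `StrictTask` and `missing_key_on_load`, and the `kind`/`label` fields, are hypothetical stand-ins and not the real taskgraph code. The only details taken from the traceback are that the base `from_json` passes `task=task_dict['task']` and that the subclass constructor pops `task_dict`.

```python
class BaseTask(object):
    """Hypothetical stand-in for the base task class."""

    def __init__(self, kind, label, task):
        self.kind = kind
        self.label = label
        self.task = task

    def to_json(self):
        # Only these fields are serialized (assumed field set).
        return {'kind': self.kind, 'label': self.label, 'task': self.task}

    @classmethod
    def from_json(cls, task_dict):
        # The base class rebuilds the object from the serialized fields only.
        return cls(kind=task_dict['kind'],
                   label=task_dict['label'],
                   task=task_dict['task'])


class StrictTask(BaseTask):
    """Hypothetical kind whose constructor needs an extra keyword argument."""

    def __init__(self, *args, **kwargs):
        # Raises KeyError when the inherited from_json() omits 'task_dict'.
        self.task_dict = kwargs.pop('task_dict')
        super(StrictTask, self).__init__(*args, **kwargs)


def missing_key_on_load(cls, serialized):
    """Return the missing key name if from_json() raises KeyError, else None."""
    try:
        cls.from_json(serialized)
    except KeyError as exc:
        return exc.args[0]
    return None


serialized = BaseTask('nightly-fennec', 'example-label', {'payload': {}}).to_json()
print(missing_key_on_load(BaseTask, serialized))    # prints None
print(missing_key_on_load(StrictTask, serialized))  # prints task_dict
```

A helper like `missing_key_on_load` could, under these assumptions, be run over every kind in a unit test so that a constructor and its `from_json` drifting apart is caught before action tasks break.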
Thanks dustin!

Hi jlund, adding new jobs to TaskCluster is currently broken due to a bug in the nightly-fennec kind.
What is the timeline for deleting it?

Also, would this fix be good?
-        self.task_dict = kwargs.pop('task_dict')
+        try:
+            self.task_dict = kwargs.pop('task_dict')
+        except KeyError:
+            pass
Flags: needinfo?(jlund)
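One side effect of the proposed change is worth noting, shown here as a sketch with hypothetical class names (`PatchedTask`, `DefaultingTask`) rather than the real nightly-fennec class: when the `KeyError` is swallowed with `pass`, the `task_dict` attribute is never set on objects loaded from JSON. Using `dict.pop` with a default is one possible alternative that always defines the attribute.

```python
class PatchedTask(object):
    """Hypothetical class mirroring the try/except variant from the diff."""

    def __init__(self, **kwargs):
        try:
            self.task_dict = kwargs.pop('task_dict')
        except KeyError:
            # Attribute stays undefined when the key is missing.
            pass
        self.task = kwargs.get('task')


class DefaultingTask(object):
    """Hypothetical alternative: fall back to None when the key is missing."""

    def __init__(self, **kwargs):
        # dict.pop(key, default) never raises KeyError.
        self.task_dict = kwargs.pop('task_dict', None)
        self.task = kwargs.get('task')


patched = PatchedTask(task={})
defaulting = DefaultingTask(task={})

print(hasattr(patched, 'task_dict'))  # prints False
print(defaulting.task_dict)           # prints None
```

Whether the missing attribute matters depends on whether any code reads `task_dict` after deserialization, which this bug does not state.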
martianwars: How can I test locally whether my change works, or do I have to push to try?

armenzg@armenzg-thinkpad:~/repos/firefox$ ./mach --log-no-times taskgraph action-task --decision-id=fxN3ej-mSOK1yjgF8nnkqQ --task-label=android-test-android-4.3-arm7-api-15/debug-reftest-6
Starting new HTTPS connection (1): queue.taskcluster.net
"GET /v1/task/fxN3ej-mSOK1yjgF8nnkqQ/artifacts/public/full-task-graph.json HTTP/1.1" 303 29
Starting new HTTPS connection (1): public-artifacts.taskcluster.net
"GET /fxN3ej-mSOK1yjgF8nnkqQ/0/public/full-task-graph.json HTTP/1.1" 200 4189023
Starting new HTTPS connection (1): queue.taskcluster.net
"GET /v1/task/fxN3ej-mSOK1yjgF8nnkqQ/artifacts/public/label-to-taskid.json HTTP/1.1" 303 29
Starting new HTTPS connection (1): public-artifacts.taskcluster.net
"GET /fxN3ej-mSOK1yjgF8nnkqQ/0/public/label-to-taskid.json HTTP/1.1" 200 251
optimizing `build-docker-image-desktop-build`, replacing with task `Ti11U-EtSHKGyS53aFHVYA`
optimizing `build-docker-image-desktop-test`, replacing with task `ZyLlvIg0SSyUdpH-g4raDg`
optimizing `TaskLabel==R8bgT3pzQD6sFDA8L2oZAQ`, replacing with task `TzpGKp1PTZCb2ThORaDSCQ`
writing artifact file `task-graph.json`
writing artifact file `label-to-taskid.json`
Creating task with taskId SDhMt0W4QL20cUgRP_OyJg for android-test-android-4.3-arm7-api-15/debug-reftest-6
Starting new HTTP connection (1): taskcluster
Traceback (most recent call last):
  File "/home/armenzg/repos/firefox/taskcluster/mach_commands.py", line 187, in taskgraph_action
    return taskgraph.action.taskgraph_action(options)
  File "/home/armenzg/repos/firefox/taskcluster/taskgraph/action.py", line 57, in taskgraph_action
    create_tasks(optimized_graph, label_to_taskid)
  File "/home/armenzg/repos/firefox/taskcluster/taskgraph/create.py", line 84, in create_tasks
    f.result()
  File "/home/armenzg/repos/firefox/python/futures/concurrent/futures/_base.py", line 396, in result
    return self.__get_result()
  File "/home/armenzg/repos/firefox/python/futures/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/armenzg/repos/firefox/taskcluster/taskgraph/create.py", line 97, in _create_task
    data=json.dumps(task_def))
  File "/home/armenzg/repos/firefox/python/requests/requests/sessions.py", line 521, in put
    return self.request('PUT', url, data=data, **kwargs)
  File "/home/armenzg/repos/firefox/python/requests/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/armenzg/repos/firefox/python/requests/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/home/armenzg/repos/firefox/python/requests/requests/adapters.py", line 437, in send
    raise ConnectionError(e, request=request)
ConnectionError: HTTPConnectionPool(host='taskcluster', port=80): Max retries exceeded with url: /queue/v1/task/SDhMt0W4QL20cUgRP_OyJg (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f194a1df410>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Flags: needinfo?(kalpeshk2011)
Yeah, you would have to push to try. After that, in case the Treeherder / pulse actions are not working correctly, you could use this script to schedule action tasks: https://github.com/armenzg/TC_developer_scheduling_experiments/blob/master/schedule_action_task.py
Flags: needinfo?(kalpeshk2011)
However, in this case the HTTPConnectionPool error is okay. I think this error is only generated at the very end, when the tasks are actually being created, so everything before that point worked. Maybe you could try a few more task label combinations to check that everything is correct!
(In reply to Armen Zambrano [:armenzg] (EDT/UTC-4) from comment #0)
>
> armenzg@armenzg-thinkpad:~/repos/firefox$ hg id && ./mach --log-no-times
> taskgraph action-task --decision-id=fxN3ej-mSOK1yjgF8nnkqQ
> --task-label=android-test-android-4.3-arm7-api-15/debug-reftest-6
> 938ce16be25f tip
...
>     all_tasks, full_task_graph = TaskGraph.from_json(full_task_json,

I wonder if nightly-fennec should have its own from_json like other kinds do: https://dxr.mozilla.org/mozilla-central/search?q=path%3Ataskcluster+from_json&redirect=false

I'll have a look now
Flags: needinfo?(jlund)
Comment on attachment 8790319 [details]
Bug 1301505 - Gracefully handle missing key for nightly fennec class.

https://reviewboard.mozilla.org/r/78200/#review76694

let's give it a try :)

as dustin mentioned, this is a temp fix. If this doesn't work, we may need to have a custom from_json defined for nightly-fennec
Attachment #8790319 - Flags: review?(jlund) → review+
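For reference, a custom from_json for such a kind might look roughly like the sketch below. The class names `SketchBaseTask` and `SketchNightlyTask` and the `kind`/`label` fields are hypothetical. What the real nightly-fennec kind would need in `task_dict` after deserialization is not stated in this bug; passing the serialized dictionary through is purely an assumption made for illustration.

```python
class SketchBaseTask(object):
    """Hypothetical stand-in for the base task class."""

    def __init__(self, kind, label, task):
        self.kind = kind
        self.label = label
        self.task = task

    @classmethod
    def from_json(cls, task_dict):
        return cls(kind=task_dict['kind'],
                   label=task_dict['label'],
                   task=task_dict['task'])


class SketchNightlyTask(SketchBaseTask):
    """Hypothetical kind that requires 'task_dict' and supplies it itself."""

    def __init__(self, *args, **kwargs):
        self.task_dict = kwargs.pop('task_dict')
        super(SketchNightlyTask, self).__init__(*args, **kwargs)

    @classmethod
    def from_json(cls, task_dict):
        # Override so the constructor receives the extra keyword it requires.
        # Assumption: the serialized dict is an acceptable value for task_dict.
        return cls(kind=task_dict['kind'],
                   label=task_dict['label'],
                   task=task_dict['task'],
                   task_dict=task_dict)


loaded = SketchNightlyTask.from_json(
    {'kind': 'nightly-fennec', 'label': 'example-label', 'task': {'payload': {}}})
print(loaded.label)  # prints example-label
```

The override keeps the constructor strict, so a missing `task_dict` at graph-generation time would still fail loudly instead of being silently ignored.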
It seems that I can add new jobs again.
Assignee: nobody → armenzg
Status: NEW → ASSIGNED
Pushed by armenzg@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6711f5d1f7b6
Gracefully handle missing key for nightly fennec class. r=jlund
https://hg.mozilla.org/mozilla-central/rev/6711f5d1f7b6
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51