Closed
Bug 1197855
Opened 9 years ago
Closed 9 years ago
buildbot bridge needs to learn to cope with tasks that go past their deadline
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: bhearsum)
Details
Attachments
(3 files)
49 bytes,
text/x-github-pull-request
|
jonasfj
:
review+
bhearsum
:
checked-in+
|
Details | Review |
1.25 KB,
patch
|
rail
:
review+
bhearsum
:
checked-in+
|
Details | Diff | Splinter Review |
49 bytes,
text/x-github-pull-request
|
jonasfj
:
review+
bhearsum
:
checked-in+
|
Details | Review |
Looks like it's happening when trying to create artifacts, eg: Traceback (most recent call last): File "/builds/bbb/bin/buildbot-bridge", line 9, in <module> load_entry_point('bbb==1.2', 'console_scripts', 'buildbot-bridge')() File "/builds/bbb/lib/python2.7/site-packages/bbb-1.2-py2.7.egg/bbb/runner.py", line 82, in main service.start() File "/builds/bbb/lib/python2.7/site-packages/bbb-1.2-py2.7.egg/bbb/servicebase.py", line 356, in start connection.drain_events() File "/builds/bbb/lib/python2.7/site-packages/kombu/connection.py", line 275, in drain_events return self.transport.drain_events(self.connection, **kwargs) File "/builds/bbb/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 91, in drain_events return connection.drain_events(**kwargs) File "/builds/bbb/lib/python2.7/site-packages/amqp/connection.py", line 325, in drain_events return amqp_method(channel, args, content) File "/builds/bbb/lib/python2.7/site-packages/amqp/channel.py", line 1908, in _basic_deliver fun(msg) File "/builds/bbb/lib/python2.7/site-packages/kombu/messaging.py", line 592, in _receive_callback return on_m(message) if on_m else self.receive(decoded, message) File "/builds/bbb/lib/python2.7/site-packages/kombu/messaging.py", line 559, in receive [callback(body, message) for callback in callbacks] File "/builds/bbb/lib/python2.7/site-packages/bbb-1.2-py2.7.egg/bbb/services.py", line 164, in handleFinished createJsonArtifact(self.tc_queue, taskid, runid, "public/properties.json", properties, expires) File "/builds/bbb/lib/python2.7/site-packages/bbb-1.2-py2.7.egg/bbb/tcutils.py", line 20, in createJsonArtifact "expires": expires, File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 455, in apiCall return self._makeApiCall(e, *args, **kwargs) File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 232, in _makeApiCall return self._makeHttpRequest(entry['method'], route, payload) File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 417, in _makeHttpRequest superExc=rerr taskcluster.exceptions.TaskclusterAuthFailure: Authorization Failed
Assignee | ||
Comment 1•9 years ago
|
||
Might be related to bug 1197841 in some way.
Assignee | ||
Comment 2•9 years ago
|
||
Jonas, any idea here? AFAICT the clientId and accessToken are correct, and the credentials have "queue:create-artifact:*" as a scope. It looks like one of the tasks in question is https://tools.taskcluster.net/task-inspector/#A18IB6cARBGyKrV7w5YAMA/. Perhaps there's some weirdness because that was cancelled via the taskcluster UI, but the bridge is still trying to create artifacts for it? I would expect to get a different error than "authorization failed" for that though...
Flags: needinfo?(jopsen)
Comment 3•9 years ago
|
||
The excpetions for authentication (bad credentials) and authorization (missing scopes) are very similar. You might be lacking scopes. I faced this a couple of weeks.
Comment 4•9 years ago
|
||
Is this an occasional error? Or persistently there. Create task requires scopes: queue:create-artifact:<name>, assume:worker-id:<workerGroup>/<workerId> You have scopes: queue:create-artifact:* assume:worker-id:buildbot-bridge/buildbot-bridge So you ought to be covered. But if we look at the task in question: A18IB6cARBGyKrV7w5YAMA 1) It wasn't cancelled through the UI (runId=0 has deadline-exceeded) 2) It was never claimed (started has no timestamp, WorkerGroup, WorkerId are null) So since your worker never claimed it, you can't resolve it or upload artifacts for it. Note: Generally you can't upload artifacts after a run is resolved, but when the run is resolved task-exception we do allow this. In this case you can't because your worker never claimed the task in the first place. --- I suspect you're trying to upload a log or something from a task you never claimed. (I would say that you don't have to upload anything for tasks that are cancelled)
Flags: needinfo?(jopsen)
Assignee | ||
Comment 5•9 years ago
|
||
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #4) > Is this an occasional error? Or persistently there. It was persistently there for A18IB6cARBGyKrV7w5YAMA. > Create task requires scopes: > queue:create-artifact:<name>, > assume:worker-id:<workerGroup>/<workerId> > You have scopes: > queue:create-artifact:* > assume:worker-id:buildbot-bridge/buildbot-bridge > So you ought to be covered. > > But if we look at the task in question: A18IB6cARBGyKrV7w5YAMA > 1) It wasn't cancelled through the UI (runId=0 has deadline-exceeded) > 2) It was never claimed (started has no timestamp, WorkerGroup, WorkerId > are null) > > So since your worker never claimed it, you can't resolve it or upload > artifacts for it. > Note: Generally you can't upload artifacts after a run is resolved, but when > the run is > resolved task-exception we do allow this. In this case you can't > because your worker > never claimed the task in the first place. > --- > I suspect you're trying to upload a log or something from a task you never > claimed. > (I would say that you don't have to upload anything for tasks that are > cancelled) Aha! I did some further digging in https://bugzilla.mozilla.org/show_bug.cgi?id=1197841#c4, and you're absolutely right: it was never claimed. This is due to the Buildbot Build not starting until well after the deadline. I'm going to morph this bug into a fix for that. AFAICT, we need it to start handling deadline-exceeded exceptions. Thanks for the help analyzing this!
Summary: buildbot bridge bblistener keeps dying with authorization failures → buildbot bridge needs to learn to cope with tasks that go past their deadline
Assignee | ||
Comment 6•9 years ago
|
||
Attachment #8654148 -
Flags: review?(jopsen)
Comment 7•9 years ago
|
||
Comment on attachment 8654148 [details] [review] handle deadline exceeded exceptions Looks solid to me... bb-bridge is not my strong side though :) But forgetting about a job after deadline-exceeded or cancel seems reasonable. From a worker perspective it's the same thing.
Attachment #8654148 -
Flags: review?(jopsen) → review+
Assignee | ||
Comment 8•9 years ago
|
||
Both the Reflector and the TCListener will be using self serve once this patch lands, need to move that config up to the top level too!
Attachment #8654295 -
Flags: review?(rail)
Updated•9 years ago
|
Attachment #8654295 -
Flags: review?(rail) → review+
Assignee | ||
Updated•9 years ago
|
Attachment #8654295 -
Flags: checked-in+
Assignee | ||
Updated•9 years ago
|
Attachment #8654148 -
Flags: checked-in+
Assignee | ||
Comment 9•9 years ago
|
||
This worked in dev, hooray!
Assignee | ||
Comment 10•9 years ago
|
||
(In reply to Ben Hearsum (:bhearsum) from comment #9) > This worked in dev, hooray! Ugh, spoke too soon. It appeared to work (handled deadline-exceeded correctly), and then got into an infinite loop processing the build finished message (and failing somehow) afterwards.
Assignee | ||
Comment 11•9 years ago
|
||
(In reply to Ben Hearsum (:bhearsum) from comment #10) > (In reply to Ben Hearsum (:bhearsum) from comment #9) > > This worked in dev, hooray! > > Ugh, spoke too soon. It appeared to work (handled deadline-exceeded > correctly), and then got into an infinite loop processing the build finished > message (and failing somehow) afterwards. Looks like the error is: Traceback (most recent call last): File "/builds/bbb/bin/buildbot-bridge", line 9, in <module> load_entry_point('bbb==1.4.1', 'console_scripts', 'buildbot-bridge')() File "/builds/bbb/lib/python2.7/site-packages/bbb/runner.py", line 86, in main service.start() File "/builds/bbb/lib/python2.7/site-packages/bbb/servicebase.py", line 359, in start connection.drain_events() File "/builds/bbb/lib/python2.7/site-packages/kombu/connection.py", line 275, in drain_events return self.transport.drain_events(self.connection, **kwargs) File "/builds/bbb/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 91, in drain_events return connection.drain_events(**kwargs) File "/builds/bbb/lib/python2.7/site-packages/amqp/connection.py", line 325, in drain_events return amqp_method(channel, args, content) File "/builds/bbb/lib/python2.7/site-packages/amqp/channel.py", line 1908, in _basic_deliver fun(msg) File "/builds/bbb/lib/python2.7/site-packages/kombu/messaging.py", line 592, in _receive_callback return on_m(message) if on_m else self.receive(decoded, message) File "/builds/bbb/lib/python2.7/site-packages/kombu/messaging.py", line 559, in receive [callback(body, message) for callback in callbacks] File "/builds/bbb/lib/python2.7/site-packages/bbb/services.py", line 227, in handleFinished self.tc_queue.cancelTask(taskid) File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 455, in apiCall return self._makeApiCall(e, *args, **kwargs) File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 232, in _makeApiCall return self._makeHttpRequest(entry['method'], route, payload) File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 424, in _makeHttpRequest superExc=rerr taskcluster.exceptions.TaskclusterRestFailure: Task can't be cancel task past it's deadline
Assignee | ||
Comment 12•9 years ago
|
||
I had thought this wouldn't be necessary, but I didn't realize that cancelTask would raise an exception if called on a Task that was resolved with deadline-exceeded (it silently does nothing if called on a Task alerady resolved as cancelled). This should fix it up...
Attachment #8657275 -
Flags: review?(jopsen)
Comment 13•9 years ago
|
||
Comment on attachment 8657275 [details] [review] handled deadline exceeded exceptions in bblistener > it silently does nothing if called on a Task alerady resolved as cancelled Well, if a new run was added it will affect that run... Perhaps it would have been better if cancelTask was queue.cancelTask(taskId, runId). I guess we can always add cancelRun :)
Attachment #8657275 -
Flags: review?(jopsen) → review+
Assignee | ||
Comment 14•9 years ago
|
||
Comment on attachment 8657275 [details] [review] handled deadline exceeded exceptions in bblistener I merged this. I'll do some final testing in dev before deploying to prod.
Attachment #8657275 -
Flags: checked-in+
Assignee | ||
Comment 15•9 years ago
|
||
OK, verification complete. https://treeherder.allizom.org/#/jobs?repo=alder&revision=463fc0cc3576 verified that Builds and BuildRequest get cancelled when the deadline is exceeded (the "pending" OS X jobs on there are a treeherder bug, where it doesn't remove pending jobs that never show up in builds-4hr or builds-running). https://treeherder.allizom.org/#/jobs?repo=alder&revision=73b75e139115 verified that cancellation through Taskcluster and Buildbot both work. I cancelled the Linux on that through Buildbot, the Windows ones through Taskcluster. The OS X ones were slow to start, so I took the opportunity to double check that BuildRequest cancellation worked too. As you can see, all the Buildbot and Taskcluster states match up (even though the TC cells are purple, the exceptions are all "cancelled"). Given the success of these tests, I'm going to push this to prod tomorrow \o/.
Assignee | ||
Updated•9 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•6 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•