Closed Bug 1197855 Opened 9 years ago Closed 9 years ago

buildbot bridge needs to learn to cope with tasks that go past their deadline

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Details

Attachments

(3 files)

Looks like it's happening when trying to create artifacts, eg:
Traceback (most recent call last):
  File "/builds/bbb/bin/buildbot-bridge", line 9, in <module>
    load_entry_point('bbb==1.2', 'console_scripts', 'buildbot-bridge')()
  File "/builds/bbb/lib/python2.7/site-packages/bbb-1.2-py2.7.egg/bbb/runner.py", line 82, in main
    service.start()
  File "/builds/bbb/lib/python2.7/site-packages/bbb-1.2-py2.7.egg/bbb/servicebase.py", line 356, in start
    connection.drain_events()
  File "/builds/bbb/lib/python2.7/site-packages/kombu/connection.py", line 275, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
  File "/builds/bbb/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 91, in drain_events
    return connection.drain_events(**kwargs)
  File "/builds/bbb/lib/python2.7/site-packages/amqp/connection.py", line 325, in drain_events
    return amqp_method(channel, args, content)
  File "/builds/bbb/lib/python2.7/site-packages/amqp/channel.py", line 1908, in _basic_deliver
    fun(msg)
  File "/builds/bbb/lib/python2.7/site-packages/kombu/messaging.py", line 592, in _receive_callback
    return on_m(message) if on_m else self.receive(decoded, message)
  File "/builds/bbb/lib/python2.7/site-packages/kombu/messaging.py", line 559, in receive
    [callback(body, message) for callback in callbacks]
  File "/builds/bbb/lib/python2.7/site-packages/bbb-1.2-py2.7.egg/bbb/services.py", line 164, in handleFinished
    createJsonArtifact(self.tc_queue, taskid, runid, "public/properties.json", properties, expires)
  File "/builds/bbb/lib/python2.7/site-packages/bbb-1.2-py2.7.egg/bbb/tcutils.py", line 20, in createJsonArtifact
    "expires": expires,
  File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 455, in apiCall
    return self._makeApiCall(e, *args, **kwargs)
  File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 232, in _makeApiCall
    return self._makeHttpRequest(entry['method'], route, payload)
  File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 417, in _makeHttpRequest
    superExc=rerr
taskcluster.exceptions.TaskclusterAuthFailure: Authorization Failed
Might be related to bug 1197841 in some way.
Jonas, any idea here? AFAICT the clientId and accessToken are correct, and the credentials have "queue:create-artifact:*" as a scope. It looks like one of the tasks in question is https://tools.taskcluster.net/task-inspector/#A18IB6cARBGyKrV7w5YAMA/. Perhaps there's some weirdness because that was cancelled via the taskcluster UI, but the bridge is still trying to create artifacts for it? I would expect to get a different error than "authorization failed" for that though...
Flags: needinfo?(jopsen)
The excpetions for authentication (bad credentials) and authorization (missing scopes) are very similar.
You might be lacking scopes. I faced this a couple of weeks.
Is this an occasional error? Or persistently there.

Create task requires scopes:
  queue:create-artifact:<name>,
  assume:worker-id:<workerGroup>/<workerId>
You have scopes:
  queue:create-artifact:*
  assume:worker-id:buildbot-bridge/buildbot-bridge
So you ought to be covered.

But if we look at the task in question: A18IB6cARBGyKrV7w5YAMA
 1) It wasn't cancelled through the UI (runId=0 has deadline-exceeded)
 2) It was never claimed (started has no timestamp, WorkerGroup, WorkerId are null)

So since your worker never claimed it, you can't resolve it or upload artifacts for it.
Note: Generally you can't upload artifacts after a run is resolved, but when the run is
      resolved task-exception we do allow this. In this case you can't because your worker
      never claimed the task in the first place.
---
I suspect you're trying to upload a log or something from a task you never claimed.
(I would say that you don't have to upload anything for tasks that are cancelled)
Flags: needinfo?(jopsen)
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #4)
> Is this an occasional error? Or persistently there.

It was persistently there for A18IB6cARBGyKrV7w5YAMA.

> Create task requires scopes:
>   queue:create-artifact:<name>,
>   assume:worker-id:<workerGroup>/<workerId>
> You have scopes:
>   queue:create-artifact:*
>   assume:worker-id:buildbot-bridge/buildbot-bridge
> So you ought to be covered.
> 
> But if we look at the task in question: A18IB6cARBGyKrV7w5YAMA
>  1) It wasn't cancelled through the UI (runId=0 has deadline-exceeded)
>  2) It was never claimed (started has no timestamp, WorkerGroup, WorkerId
> are null)
> 
> So since your worker never claimed it, you can't resolve it or upload
> artifacts for it.
> Note: Generally you can't upload artifacts after a run is resolved, but when
> the run is
>       resolved task-exception we do allow this. In this case you can't
> because your worker
>       never claimed the task in the first place.
> ---
> I suspect you're trying to upload a log or something from a task you never
> claimed.
> (I would say that you don't have to upload anything for tasks that are
> cancelled)

Aha! I did some further digging in https://bugzilla.mozilla.org/show_bug.cgi?id=1197841#c4, and you're absolutely right: it was never claimed. This is due to the Buildbot Build not starting until well after the deadline. I'm going to morph this bug into a fix for that. AFAICT, we need it to start handling deadline-exceeded exceptions.

Thanks for the help analyzing this!
Summary: buildbot bridge bblistener keeps dying with authorization failures → buildbot bridge needs to learn to cope with tasks that go past their deadline
Attachment #8654148 - Flags: review?(jopsen)
Comment on attachment 8654148 [details] [review]
handle deadline exceeded exceptions

Looks solid to me... bb-bridge is not my strong side though :)
But forgetting about a job after deadline-exceeded or cancel seems reasonable.
From a worker perspective it's the same thing.
Attachment #8654148 - Flags: review?(jopsen) → review+
Both the Reflector and the TCListener will be using self serve once this patch lands, need to move that config up to the top level too!
Attachment #8654295 - Flags: review?(rail)
Attachment #8654295 - Flags: review?(rail) → review+
Attachment #8654295 - Flags: checked-in+
Attachment #8654148 - Flags: checked-in+
This worked in dev, hooray!
(In reply to Ben Hearsum (:bhearsum) from comment #9)
> This worked in dev, hooray!

Ugh, spoke too soon. It appeared to work (handled deadline-exceeded correctly), and then got into an infinite loop processing the build finished message (and failing somehow) afterwards.
(In reply to Ben Hearsum (:bhearsum) from comment #10)
> (In reply to Ben Hearsum (:bhearsum) from comment #9)
> > This worked in dev, hooray!
> 
> Ugh, spoke too soon. It appeared to work (handled deadline-exceeded
> correctly), and then got into an infinite loop processing the build finished
> message (and failing somehow) afterwards.

Looks like the error is:
Traceback (most recent call last):
  File "/builds/bbb/bin/buildbot-bridge", line 9, in <module>
    load_entry_point('bbb==1.4.1', 'console_scripts', 'buildbot-bridge')()
  File "/builds/bbb/lib/python2.7/site-packages/bbb/runner.py", line 86, in main
    service.start()
  File "/builds/bbb/lib/python2.7/site-packages/bbb/servicebase.py", line 359, in start
    connection.drain_events()
  File "/builds/bbb/lib/python2.7/site-packages/kombu/connection.py", line 275, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
  File "/builds/bbb/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 91, in drain_events
    return connection.drain_events(**kwargs)
  File "/builds/bbb/lib/python2.7/site-packages/amqp/connection.py", line 325, in drain_events
    return amqp_method(channel, args, content)
  File "/builds/bbb/lib/python2.7/site-packages/amqp/channel.py", line 1908, in _basic_deliver
    fun(msg)
  File "/builds/bbb/lib/python2.7/site-packages/kombu/messaging.py", line 592, in _receive_callback
    return on_m(message) if on_m else self.receive(decoded, message)
  File "/builds/bbb/lib/python2.7/site-packages/kombu/messaging.py", line 559, in receive
    [callback(body, message) for callback in callbacks]
  File "/builds/bbb/lib/python2.7/site-packages/bbb/services.py", line 227, in handleFinished
    self.tc_queue.cancelTask(taskid)
  File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 455, in apiCall
    return self._makeApiCall(e, *args, **kwargs)
  File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 232, in _makeApiCall
    return self._makeHttpRequest(entry['method'], route, payload)
  File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 424, in _makeHttpRequest
    superExc=rerr
taskcluster.exceptions.TaskclusterRestFailure: Task can't be cancel task past it's deadline
I had thought this wouldn't be necessary, but I didn't realize that cancelTask would raise an exception if called on a Task that was resolved with deadline-exceeded (it silently does nothing if called on a Task alerady resolved as cancelled). This should fix it up...
Attachment #8657275 - Flags: review?(jopsen)
Comment on attachment 8657275 [details] [review]
handled deadline exceeded exceptions in bblistener

> it silently does nothing if called on a Task alerady resolved as cancelled
Well, if a new run was added it will affect that run...

Perhaps it would have been better if cancelTask was queue.cancelTask(taskId, runId).
I guess we can always add cancelRun :)
Attachment #8657275 - Flags: review?(jopsen) → review+
Comment on attachment 8657275 [details] [review]
handled deadline exceeded exceptions in bblistener

I merged this. I'll do some final testing in dev before deploying to prod.
Attachment #8657275 - Flags: checked-in+
OK, verification complete.
https://treeherder.allizom.org/#/jobs?repo=alder&revision=463fc0cc3576 verified that Builds and BuildRequest get cancelled when the deadline is exceeded (the "pending" OS X jobs on there are a treeherder bug, where it doesn't remove pending jobs that never show up in builds-4hr or builds-running).
https://treeherder.allizom.org/#/jobs?repo=alder&revision=73b75e139115 verified that cancellation through Taskcluster and Buildbot both work. I cancelled the Linux on that through Buildbot, the Windows ones through Taskcluster. The OS X ones were slow to start, so I took the opportunity to double check that BuildRequest cancellation worked too. As you can see, all the Buildbot and Taskcluster states match up (even though the TC cells are purple, the exceptions are all "cancelled").

Given the success of these tests, I'm going to push this to prod tomorrow \o/.
Deployed to production today.
Status: NEW → ASSIGNED
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: