Closed Bug 1148860 Opened 9 years ago Closed 9 years ago

queue: Allow artifacts up to 25min after a run is resolved as exception (for logs)

Categories

(Taskcluster :: Services, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jonasfj, Assigned: jonasfj)

If a run is resolved as exception, we should still allow artifact uploads
until 25min after the resolution time. At the moment the queue rejects
any attempt to upload artifacts after a run is resolved.

This is a special case for "exception": we still want logs, but
we accept that they are best-effort when we have an exception.

Use-cases:
A) docker-worker encounters a spot termination warning from EC2.
   Instead of uploading logs and then calling reportException(worker-shutdown),
   we should call reportException(worker-shutdown) first and then upload logs.
   This way, if we are terminated while uploading a large log file, we will
   already have reported the exception with worker-shutdown, so the queue will
   have scheduled a rerun immediately.

B) A user encounters something weird in the live-log from a task and cancels
   the task. The livelog will not be persisted, so when the user reports the
   issue to us, it'll be hard to debug :)
   As docker-worker calls reclaimTask every 20min, it should be able to detect
   the 409 and upload logs. Note that docker-worker listens for
   task-exception to get the cancel message, so in most cases it'll upload
   immediately after cancelTask. But in the current setup even that will fail.
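
The ordering proposed in use-case A could be sketched roughly as below. This is only an illustration under stated assumptions: `RecordingQueue`, `report_exception`, `create_artifact`, `resolve_then_upload`, and the log name are hypothetical stand-ins, not the real docker-worker code or taskcluster client API.

```python
class RecordingQueue:
    """Hypothetical in-memory stand-in for a queue client; records calls in order."""

    def __init__(self):
        self.calls = []

    def report_exception(self, task_id, run_id, reason):
        self.calls.append(("report_exception", task_id, run_id, reason))

    def create_artifact(self, task_id, run_id, name):
        self.calls.append(("create_artifact", task_id, run_id, name))


def resolve_then_upload(queue, task_id, run_id, log_name="public/logs/live.log"):
    """Report the exception first, then try to upload the log (best-effort).

    Returns True if the log upload succeeded, False otherwise. The run is
    already resolved either way, so a rerun can be scheduled without waiting
    for the upload.
    """
    # Step 1: resolve the run so the queue can schedule a rerun right away.
    queue.report_exception(task_id, run_id, reason="worker-shutdown")
    # Step 2: upload the log; a failure here is tolerated (assumed log name).
    try:
        queue.create_artifact(task_id, run_id, log_name)
    except Exception:
        return False
    return True
```

If the worker is terminated during step 2, only the log is lost; the resolution from step 1 has already reached the queue.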

Note:
We should NOT allow artifacts to be uploaded after reportCompleted or
reportFailed. In these cases we want to ensure that artifacts are present at
the time of resolution, as logs aren't a best-effort service here.
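
The rule requested here could be expressed as a small check like the one below. It is a minimal sketch, not the queue's actual implementation: the function name, argument names, and the way run state is passed in are all assumptions; only the 25min window and the state names come from this bug.

```python
from datetime import datetime, timedelta, timezone

# Grace period from this bug's description; the constant name is hypothetical.
EXCEPTION_ARTIFACT_GRACE = timedelta(minutes=25)


def may_create_artifact(run_state, resolved_at, now):
    """Return True if an artifact upload should be accepted for this run.

    run_state   -- "running", "completed", "failed" or "exception"
    resolved_at -- datetime the run was resolved, or None if still running
    now         -- current datetime (passed in to keep the check testable)
    """
    if run_state == "running":
        return True
    # Special case: logs are best-effort after an exception, so keep
    # accepting uploads for a limited window after resolution.
    if run_state == "exception" and resolved_at is not None:
        return now <= resolved_at + EXCEPTION_ARTIFACT_GRACE
    # completed / failed: artifacts must already exist at resolution time.
    return False
```

For example, with a run resolved as exception at 12:00 UTC, an upload at 12:24 would be accepted and one at 12:26 rejected, while any upload after reportCompleted or reportFailed is rejected regardless of timing.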

The buildbot bridge will throw exceptions like this until this is fixed:
Traceback (most recent call last):
  File "/builds/bbb/bin/buildbot-bridge", line 9, in <module>
    load_entry_point('bbb==0.3', 'console_scripts', 'buildbot-bridge')()
  File "/builds/bbb/lib/python2.7/site-packages/bbb-0.3-py2.7.egg/bbb/runner.py", line 81, in main
    service.start()
  File "/builds/bbb/lib/python2.7/site-packages/bbb-0.3-py2.7.egg/bbb/servicebase.py", line 300, in start
    connection.drain_events()
  File "/builds/bbb/lib/python2.7/site-packages/kombu/connection.py", line 275, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
  File "/builds/bbb/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 91, in drain_events
    return connection.drain_events(**kwargs)
  File "/builds/bbb/lib/python2.7/site-packages/amqp/connection.py", line 325, in drain_events
    return amqp_method(channel, args, content)
  File "/builds/bbb/lib/python2.7/site-packages/amqp/channel.py", line 1908, in _basic_deliver
    fun(msg)
  File "/builds/bbb/lib/python2.7/site-packages/kombu/messaging.py", line 592, in _receive_callback
    return on_m(message) if on_m else self.receive(decoded, message)
  File "/builds/bbb/lib/python2.7/site-packages/kombu/messaging.py", line 559, in receive
    [callback(body, message) for callback in callbacks]
  File "/builds/bbb/lib/python2.7/site-packages/bbb-0.3-py2.7.egg/bbb/services.py", line 118, in handleFinished
    createJsonArtifact(self.tc_queue, taskid, runid, "public/properties.json", properties, expires)
  File "/builds/bbb/lib/python2.7/site-packages/bbb-0.3-py2.7.egg/bbb/tcutils.py", line 20, in createJsonArtifact
    "expires": expires,
  File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 455, in apiCall
    return self._makeApiCall(e, *args, **kwargs)
  File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 232, in _makeApiCall
    return self._makeHttpRequest(entry['method'], route, payload)
  File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 424, in _makeHttpRequest
    superExc=rerr
taskcluster.exceptions.TaskclusterRestFailure: The given run is not running

Not a huge rush to fix, very nice to have though.
Should be a quick fix... I have a lot of reviewed stuff to roll out, also for the queue, but will get back to this soon.
Severity: normal → major
Priority: -- → P2
No longer blocks: 1156301
Attached file Github PR
Ideas for a better code path are welcome... But tests seem to pass.
Add any comments or suggestions in the GitHub PR, thanks.

Feel free to push the merge button, but this still requires a manual push; I'll undertake that when this is r+'ed.
Assignee: nobody → jopsen
Status: NEW → ASSIGNED
Attachment #8610895 - Flags: review?(garndt)
Attachment #8610895 - Flags: review?(garndt) → review+
Blocks: 1168809
Deployed, enjoy.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: TaskCluster → Queue
Product: Testing → Taskcluster
Component: Queue → Services