claimWork requests not being held for 20s (long polling) and not returning results
Categories: Taskcluster :: Services, defect
Tracking: Not tracked
People: Reporter: pmoore; Assigned: dustin
The claimWork endpoint should hold requests for 20s, since this endpoint implements long polling.
Sometimes this isn't happening.
I've been able to reproduce this locally.
$ date; echo; echo '{"workerGroup": "mdc1", "workerId": "t-yosemite-r7-380", "tasks": 1}' | curl -v --header 'Content-Type: application/json' --request POST --data @- http://localhost:8080/queue/v1/claim-work/releng-hardware/gecko-t-osx-1010-beta; echo; echo; date
Mon 14 Jan 2019 15:50:57 CET
Note: Unnecessary use of -X or --request, POST is already inferred.
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 8080 (#0)
> POST /queue/v1/claim-work/releng-hardware/gecko-t-osx-1010-beta HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.54.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 68
>
* upload completely sent off: 68 out of 68 bytes
< HTTP/1.1 200 OK
< Access-Control-Allow-Headers: X-Requested-With,Content-Type,Authorization,Accept,Origin,Cache-Control
< Access-Control-Allow-Methods: OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT
< Access-Control-Allow-Origin: *
< Access-Control-Max-Age: 900
< Access-Control-Request-Method: *
< Cache-Control: no-store no-cache must-revalidate
< Connection: keep-alive
< Content-Length: 17
< Content-Security-Policy: report-uri /__cspreport__;default-src 'none';frame-ancestors 'none';
< Content-Type: application/json; charset=utf-8
< Date: Mon, 14 Jan 2019 14:50:57 GMT
< Etag: W/"11-IIixY78GOTb7jUPZdbQZ7mNo6gk"
< Server: Cowboy
< Via: 1.1 vegur
< X-Content-Type-Options: nosniff
< X-For-Request-Id: 130321b2-6e93-499c-ad03-832e2e8eb7d9
< X-Taskcluster-Endpoint: https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-osx-1010-beta
< X-Taskcluster-Proxy-Perm-Clientid: mozilla-ldap/pmoore@mozilla.com/dev
< X-Taskcluster-Proxy-Revision:
< X-Taskcluster-Proxy-Version: 5.0.1
<
{
"tasks": []
* Connection #0 to host localhost left intact
}
Mon 14 Jan 2019 15:50:57 CET
Here we see this request returns a 200 HTTP response with zero tasks in less than a second.
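For reference, the same check can be scripted. This is a minimal sketch (a hypothetical script, not part of the bug report) that times the claimWork call against the same local taskcluster-proxy on port 8080 as the curl command above, assuming Node 18+ for the global fetch:

```typescript
// Sketch: time a claimWork call against the local taskcluster-proxy on port 8080,
// matching the curl reproduction above. Requires Node 18+ for the global fetch.
const url =
  "http://localhost:8080/queue/v1/claim-work/releng-hardware/gecko-t-osx-1010-beta";

async function timeClaimWork(): Promise<void> {
  const started = Date.now();
  const resp = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      workerGroup: "mdc1",
      workerId: "t-yosemite-r7-380",
      tasks: 1,
    }),
  });
  const body = (await resp.json()) as { tasks: unknown[] };
  const elapsedMs = Date.now() - started;
  console.log(`status=${resp.status} tasks=${body.tasks.length} elapsed=${elapsedMs}ms`);
  if (body.tasks.length === 0 && elapsedMs < 20000) {
    // Expected behaviour: an empty result should only arrive after the ~20s hold.
    console.log("claimWork returned an empty result without the long-poll hold");
  }
}

timeClaimWork().catch(err => console.error(err));
```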
Note, this is especially weird (wrong) because the Queue claims there are no workers of this worker type:
$ curl 'http://localhost:8080/queue/v1/provisioners/releng-hardware/worker-types/gecko-t-osx-1010-beta/workers'
{
"workers": []
}
Even if that worker was quarantined, it should show up in the list of workers if we have just called the claimWork endpoint.
So something is going wrong in the Queue.
Note: both of the above requests were made via the taskcluster-proxy with sufficient scopes.
Reporter • Comment 1 • 6 years ago
Note, this is probably also the cause of both bug 1485380 comment 5 and bug 1493321 comment 8.
Reporter • Comment 2 • 6 years ago
Hey Dustin,
My understanding is that:
- Regardless of whether a worker is quarantined or not, if the queue is going to return 0 tasks to a queue.claimWork call, it should hold the connection for 20s.
- Regardless of whether a worker is quarantined or not, it should show up in the queue.listWorkers endpoint if the worker recently called queue.claimWork.
(A rough sketch of this expected behaviour is included at the end of this comment.)
Above, we see that we get an immediate HTTP 200 response from the queue.claimWork call, with "tasks": [], and that the workerId used in that call does not make it into the (empty) list of workers from the queue.listWorkers endpoint.
This indicates a problem both in the queue's worker tracking and in its serving of tasks, since we can also see there are pending tasks for this worker type:
$ curl 'http://localhost:8080/queue/v1/pending/releng-hardware/gecko-t-osx-1010-beta'
{
"provisionerId": "releng-hardware",
"workerType": "gecko-t-osx-1010-beta",
"pendingTasks": 2
}
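To make the two expectations above concrete, here is a rough sketch of the behaviour I'd expect from the queue-side handler. The helpers (recordWorkerSeen, isQuarantined, findClaimableTasks) are made-up stubs for illustration, not the Queue's actual implementation:

```typescript
// Hypothetical sketch of the expected claimWork behaviour. recordWorkerSeen,
// isQuarantined and findClaimableTasks are made-up stubs, not the Queue's real code.
type Task = { taskId: string };

const seenWorkers = new Map<string, Date>();           // workerId -> last seen
const quarantinedWorkers = new Set<string>();          // workerIds under quarantine
const pendingTasks: Task[] = [{ taskId: "example" }];  // stand-in pending queue

async function recordWorkerSeen(workerId: string): Promise<void> {
  // Expectation 2: the worker is recorded (and later listed by listWorkers)
  // regardless of quarantine status.
  seenWorkers.set(workerId, new Date());
}

async function isQuarantined(workerId: string): Promise<boolean> {
  return quarantinedWorkers.has(workerId);
}

async function findClaimableTasks(count: number): Promise<Task[]> {
  return pendingTasks.splice(0, count);
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function claimWork(workerId: string, count: number): Promise<{ tasks: Task[] }> {
  await recordWorkerSeen(workerId);

  const deadline = Date.now() + 20_000;
  while (Date.now() < deadline) {
    if (!(await isQuarantined(workerId))) {
      const tasks = await findClaimableTasks(count);
      if (tasks.length > 0) {
        return { tasks };
      }
    }
    await sleep(1_000); // keep polling until the 20s window expires
  }
  // Expectation 1: an empty result is only returned after the ~20s hold.
  return { tasks: [] };
}
```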
Assignee • Comment 3 • 6 years ago
I'll defer to John as service owner. It would be helpful to know how often this happens.
Reporter • Comment 4 • 6 years ago
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #3)
I'll defer to John as service owner. It would be helpful to know how often this happens.
This happens consistently. If you manually run the commands from the above comments, you should get the same results as I do.
This seems to happen for special workerIds or worker types only, so I suspect this is probably a bug in the declareWorker/quarantine code, which is why I needinfo'd you rather than John, since I think you wrote that. Let me know if that isn't the case, and I am misremembering! :)
Of course, if by the time you get to this, all the pending tasks have expired for releng-hardware/gecko-t-osx-1010-beta, creating new arbitrary tasks will reproduce the circumstances...
Assignee • Updated • 6 years ago
Assignee • Comment 5 • 6 years ago
It sounds like this is happening only for quarantined tasks -- is that consistent? What did you mean by "special"?
Reporter • Comment 6 • 6 years ago
I'm not sure what a quarantined task is. It seems to happen consistently, at least for "provisionerId": "releng-hardware", "workerType": "gecko-t-osx-1010-beta", "workerId": "t-yosemite-r7-380", as per the above example.
I haven't explored other workerIds, but since it consistently happens for this one, that could be a good avenue to explore.
You'll see that the claimWork call returns in less than a second; that despite making claims using that workerId, it never appears in the list of workers; and that it always gets 0 tasks back, even when there are pending tasks. And since it doesn't appear in the list of quarantined workers, quarantining can't be the reason it isn't getting tasks. Lastly, even if it was quarantined, the connection should probably be held by the queue for 20s, but I suspect the reason it isn't is a symptom of the same underlying bug that causes the other symptoms...
Assignee • Comment 7 • 6 years ago
Sorry, I meant quarantined worker, yes. So if it's not quarantined, what's special about this workerId? Or workerType?
Reporter • Comment 8 • 6 years ago
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #5)
What did you mean by "special"?
Clearly this doesn't happen for all provisionerId/workerType/workerId combinations, otherwise we'd have an outage. But since it happens consistently and reproducibly for the given provisionerId/workerType/workerId combination, that makes it special; it is behaving differently from the many workers that are happily running tasks in production at the moment without exhibiting this issue.
Reporter • Comment 9 • 6 years ago
To recap:
We have three reproducible queries (claimWork, pendingTasks, listWorkers). The claimWork call gets back zero tasks (immediately), despite the fact that:
- the queue should hold the connection for 20s when returning no tasks, since the queue implements 20s long-polling
- there are pending tasks for this worker type (according to the pendingTasks query)
- the worker is not listed as quarantined nor as not-quarantined (according to the listWorkers query)
If the worker is not listed as quarantined, and not listed as not-quarantined, calling claimWork should return one of the pending tasks, and the worker should be listed in future listWorkers calls. This is not happening.
Assignee • Comment 10 • 6 years ago
You're correct that quarantining is not waiting the requisite 20 seconds: https://github.com/taskcluster/taskcluster/blob/master/services/queue/src/api.js#L1336
What I'm confused about is "the worker is not listed as quarantined nor as not-quarantined (according to the listWorkers query)". What does that mean? I don't see any workerType gecko-t-osx-1010-beta defined at all under provisionerId releng-hardware. Perhaps it has just timed out?
Anyway, I'm going to just assume that this worker is (or at least was) quarantined and that's the behavior you're seeing.
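For illustration, the shape of the problem at that line is roughly the following. This is only a sketch of the pattern (with a hypothetical placeholder for the normal claim path), not the actual services/queue/src/api.js code:

```typescript
// Sketch of the behaviour described above (not the actual services/queue/src/api.js code).
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Buggy shape: a quarantined worker gets an empty reply immediately, so it can
// end up hammering the endpoint in a tight loop instead of waiting ~20s per call.
async function claimWorkBuggy(quarantined: boolean): Promise<{ tasks: unknown[] }> {
  if (quarantined) {
    return { tasks: [] }; // returns without the long-poll hold
  }
  return pollForTasksUntilDeadline();
}

// Fixed shape: the quarantine case still yields zero tasks, but only after the
// same long-poll window that an empty queue would get.
async function claimWorkFixed(quarantined: boolean): Promise<{ tasks: unknown[] }> {
  if (quarantined) {
    await sleep(20_000);
    return { tasks: [] };
  }
  return pollForTasksUntilDeadline();
}

// Placeholder for the normal claim path (hypothetical).
async function pollForTasksUntilDeadline(): Promise<{ tasks: unknown[] }> {
  return { tasks: [] };
}
```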
Assignee • Comment 11 • 6 years ago
Assignee • Comment 12 • 6 years ago
dustin@lamport ~ $ echo '{"workerGroup": "mdc1", "workerId": "t-yosemite-r7-380", "tasks": 1}' | time taskcluster api queue claimWork releng-hardawre gecko-t-osx-1010-beta
{
"tasks": []
}
0.06user 0.00system 0:20.66elapsed 0%CPU (0avgtext+0avgdata 15244maxresident)k
0inputs+0outputs (0major+1736minor)pagefaults 0swaps
I'm still confused about the quarantined bit, but for that worker at least this is fixed. Pete, can you confirm?
Reporter • Comment 13 • 6 years ago
For this worker (t-yosemite-r7-380) it does seem to be fixed, but I guess someone just removed the quarantine for that worker explicitly?
I think the events that led to the problem were as follows:
- the worker got quarantined
- it continued to run, and for a while, all seemed ok - it appeared in the list of quarantined workers, so it was visible that there was a worker out of commission, and it didn't get tasks from the queue, so all was good
- when the worker called queue.claimWork, the queue (incorrectly) didn't update its 'last seen' timestamp for the worker, so at some point, the queue considered that the worker hadn't been seen for a long enough amount of time that it was ok to delete its record of that worker's existence (but without deleting the record of it being quarantined)[1]
- somebody noticed that this worker was running, getting no tasks, and no records could be seen in taskcluster of this worker being quarantined, or even existing
I think the important fix here is that when a worker calls claimWork, even if it is quarantined, the queue should update its recorded 'last seen' field for that workerId, so that somebody browsing the worker pool can see that there is an active worker that has recently tried to claim a task but is quarantined.
The other problem is that if a quarantine record exists for a given workerId, that workerId should appear in the list of workers for its worker type: certainly if it has been seen by the queue recently despite being quarantined, and possibly even if it hasn't been seen by the queue for a long time. In other words, I don't think we should ever filter quarantined workerIds out of the list of workers for a given worker type, even if we do choose to filter out workerIds that haven't called claimWork for a long time.
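A sketch of that 'important fix', assuming a hypothetical WorkerStore interface rather than the Queue's real storage layer: claimWork refreshes the worker's last-seen/expiry record even when the worker is quarantined, so it keeps showing up for people browsing the worker pool.

```typescript
// Hypothetical sketch of the 'important fix': claimWork refreshes the worker's
// record even when it is quarantined. WorkerStore is a made-up interface,
// not the Queue's real storage layer.
interface WorkerRecord {
  workerGroup: string;
  workerId: string;
  lastSeen: Date;
  expires: Date;
  quarantineUntil?: Date;
}

interface WorkerStore {
  upsert(record: WorkerRecord): Promise<void>;
}

async function touchWorkerOnClaimWork(
  store: WorkerStore,
  workerGroup: string,
  workerId: string,
  quarantineUntil?: Date,
): Promise<void> {
  const now = new Date();
  // Refresh lastSeen/expires unconditionally, so a quarantined worker that keeps
  // calling claimWork does not silently age out of the worker list.
  await store.upsert({
    workerGroup,
    workerId,
    lastSeen: now,
    expires: new Date(now.getTime() + 24 * 60 * 60 * 1000), // e.g. retain for a day
    quarantineUntil,
  });
}
```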
I hope that makes sense!
--
[1] it could also be that it didn't delete the record, but in queries to show known workers for the given worker type, it chose not to return it in the list
Assignee • Comment 14 • 6 years ago
OK. So I think the first part of this is fixed (the "important fix" in your last message).
But you make a good point that we should not expire quarantined workers. In fact, we don't -- but we also don't display them! So they're stuck in a "limbo" where they are still quarantined but not visible.
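A sketch of that second part, using a hypothetical Worker shape rather than the Queue's real entity: listWorkers treats an unexpired quarantine as sufficient reason to keep a worker in the list, however stale its expiry is.

```typescript
// Sketch of a listWorkers filter that never hides quarantined workers.
// The Worker shape here is hypothetical, not the Queue's real entity.
type Worker = { workerId: string; expires: Date; quarantineUntil?: Date };

function shouldListWorker(w: Worker, now: Date = new Date()): boolean {
  const quarantined = w.quarantineUntil !== undefined && w.quarantineUntil > now;
  // Quarantined workers are always listed, so they cannot end up in the
  // "still quarantined but invisible" limbo; only stale, non-quarantined
  // workers are filtered out of the list.
  return quarantined || w.expires > now;
}
```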
Assignee • Comment 15 • 6 years ago
Assignee • Comment 16 • 6 years ago
OK, that is the second part.
Updated • 6 years ago