claimWork requests not being held for 20s (long polling) and not returning results
Categories: Taskcluster :: Services, defect
Tracking: Not tracked
People: Reporter: pmoore; Assigned: dustin
The claimWork endpoint should hold requests for 20s, since this endpoint implements long polling.
Sometimes this isn't happening.
I've been able to reproduce this locally.
$ date; echo; echo '{"workerGroup": "mdc1", "workerId": "t-yosemite-r7-380", "tasks": 1}' | curl -v --header 'Content-Type: application/json' --request POST --data @- http://localhost:8080/queue/v1/claim-work/releng-hardware/gecko-t-osx-1010-beta; echo; echo; date
Mon 14 Jan 2019 15:50:57 CET
Note: Unnecessary use of -X or --request, POST is already inferred.
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 8080 (#0)
> POST /queue/v1/claim-work/releng-hardware/gecko-t-osx-1010-beta HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.54.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 68
>
* upload completely sent off: 68 out of 68 bytes
< HTTP/1.1 200 OK
< Access-Control-Allow-Headers: X-Requested-With,Content-Type,Authorization,Accept,Origin,Cache-Control
< Access-Control-Allow-Methods: OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT
< Access-Control-Allow-Origin: *
< Access-Control-Max-Age: 900
< Access-Control-Request-Method: *
< Cache-Control: no-store no-cache must-revalidate
< Connection: keep-alive
< Content-Length: 17
< Content-Security-Policy: report-uri /__cspreport__;default-src 'none';frame-ancestors 'none';
< Content-Type: application/json; charset=utf-8
< Date: Mon, 14 Jan 2019 14:50:57 GMT
< Etag: W/"11-IIixY78GOTb7jUPZdbQZ7mNo6gk"
< Server: Cowboy
< Via: 1.1 vegur
< X-Content-Type-Options: nosniff
< X-For-Request-Id: 130321b2-6e93-499c-ad03-832e2e8eb7d9
< X-Taskcluster-Endpoint: https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-osx-1010-beta
< X-Taskcluster-Proxy-Perm-Clientid: mozilla-ldap/pmoore@mozilla.com/dev
< X-Taskcluster-Proxy-Revision:
< X-Taskcluster-Proxy-Version: 5.0.1
<
{
"tasks": []
* Connection #0 to host localhost left intact
}
Mon 14 Jan 2019 15:50:57 CET
Here we see this request returns a 200 HTTP response with zero tasks in less than a second.
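For reference, the same check can be scripted. This is a minimal sketch (a hypothetical script, not part of the bug report) that times the claimWork call against the same local taskcluster-proxy on port 8080 as the curl command above, assuming Node 18+ for the global fetch:

```typescript
// Sketch: time a claimWork call against the local taskcluster-proxy on port 8080,
// matching the curl reproduction above. Requires Node 18+ for the global fetch.
const url =
  "http://localhost:8080/queue/v1/claim-work/releng-hardware/gecko-t-osx-1010-beta";

async function timeClaimWork(): Promise<void> {
  const started = Date.now();
  const resp = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      workerGroup: "mdc1",
      workerId: "t-yosemite-r7-380",
      tasks: 1,
    }),
  });
  const body = (await resp.json()) as { tasks: unknown[] };
  const elapsedMs = Date.now() - started;
  console.log(`status=${resp.status} tasks=${body.tasks.length} elapsed=${elapsedMs}ms`);
  if (body.tasks.length === 0 && elapsedMs < 20000) {
    // Expected behaviour: an empty result should only arrive after the ~20s hold.
    console.log("claimWork returned an empty result without the long-poll hold");
  }
}

timeClaimWork().catch(err => console.error(err));
```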
Note, this is especially weird (wrong) because the Queue claims there are no workers of this worker type:
$ curl 'http://localhost:8080/queue/v1/provisioners/releng-hardware/worker-types/gecko-t-osx-1010-beta/workers'
{
"workers": []
}
Even if that worker was quarantined, it should show up in the list of workers if we have just called the claimWork endpoint.
So something is going wrong in the Queue.
Note: both of the above requests were made via the taskcluster-proxy with sufficient scopes.
Reporter • Comment 1 • 6 years ago
Note, this is probably also the cause of both bug 1485380 comment 5 and bug 1493321 comment 8.
Reporter • Comment 2 • 6 years ago
Hey Dustin,
My understanding is that:
- Regardless of whether a worker is quarantined or not, if the queue is going to return 0 tasks to a queue.claimWork call, it should hold the connection for 20s.
- Regardless of whether a worker is quarantined or not, it should show up in the queue.listWorkers endpoint if the worker recently called queue.claimWork.
(A rough sketch of this expected behaviour is included at the end of this comment.)
Above, we see that we get an immediate HTTP 200 response from the queue.claimWork call, with "tasks": [], and that the workerId used in that call does not make it into the (empty) list of workers from the queue.listWorkers endpoint.
This indicates a problem both in the queue's worker tracking and in its serving of tasks, since we can also see there are pending tasks for this worker type:
$ curl 'http://localhost:8080/queue/v1/pending/releng-hardware/gecko-t-osx-1010-beta'
{
"provisionerId": "releng-hardware",
"workerType": "gecko-t-osx-1010-beta",
"pendingTasks": 2
}
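To make the two expectations above concrete, here is a rough sketch of the behaviour I'd expect from the queue-side handler. The helpers (recordWorkerSeen, isQuarantined, findClaimableTasks) are made-up stubs for illustration, not the Queue's actual implementation:

```typescript
// Hypothetical sketch of the expected claimWork behaviour. recordWorkerSeen,
// isQuarantined and findClaimableTasks are made-up stubs, not the Queue's real code.
type Task = { taskId: string };

const seenWorkers = new Map<string, Date>();           // workerId -> last seen
const quarantinedWorkers = new Set<string>();          // workerIds under quarantine
const pendingTasks: Task[] = [{ taskId: "example" }];  // stand-in pending queue

async function recordWorkerSeen(workerId: string): Promise<void> {
  // Expectation 2: the worker is recorded (and later listed by listWorkers)
  // regardless of quarantine status.
  seenWorkers.set(workerId, new Date());
}

async function isQuarantined(workerId: string): Promise<boolean> {
  return quarantinedWorkers.has(workerId);
}

async function findClaimableTasks(count: number): Promise<Task[]> {
  return pendingTasks.splice(0, count);
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function claimWork(workerId: string, count: number): Promise<{ tasks: Task[] }> {
  await recordWorkerSeen(workerId);

  const deadline = Date.now() + 20_000;
  while (Date.now() < deadline) {
    if (!(await isQuarantined(workerId))) {
      const tasks = await findClaimableTasks(count);
      if (tasks.length > 0) {
        return { tasks };
      }
    }
    await sleep(1_000); // keep polling until the 20s window expires
  }
  // Expectation 1: an empty result is only returned after the ~20s hold.
  return { tasks: [] };
}
```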
Assignee • Comment 3 • 6 years ago
I'll defer to John as service owner. It would be helpful to know how often this happens.
Reporter • Comment 4 • 6 years ago
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #3)
I'll defer to John as service owner. It would be helpful to know how often this happens.
This happens consistently. If you manually run the commands from the above comments, you should get the same results as I do.
This seems to happen for special workerIds or worker types only, so I suspect this is probably a bug in the declareWorker/quarantine code, which is why I needinfo'd you rather than John, since I think you wrote that. Let me know if that isn't the case, and I am misremembering! :)
Of course, if by the time you get to this, all the pending tasks have expired for releng-hardware/gecko-t-osx-1010-beta, creating new arbitrary tasks will reproduce the circumstances...
Assignee • Updated • 6 years ago
Assignee • Comment 5 • 6 years ago
It sounds like this is happening only for quarantined tasks -- is that consistent? What did you mean by "special"?
Reporter • Comment 6 • 6 years ago
I'm not sure what a quarantined task is. It seems to happen consistently, at least for "provisionerId": "releng-hardware", "workerType": "gecko-t-osx-1010-beta", "workerId": "t-yosemite-r7-380", as per the above example.
I haven't explored other workerIds, but since it consistently happens for this one, that could be a good avenue to explore.
You'll see that the claimWork call returns in less than a second; that despite making claims using that workerId, it never appears in the list of workers; and that it always gets 0 tasks back, even when there are pending tasks. And since it doesn't appear in the list of quarantined workers, quarantining can't be the reason it isn't getting tasks. Lastly, even if it was quarantined, the connection should probably be held by the queue for 20s, but I suspect the reason it isn't is a symptom of the same underlying bug that causes the other symptoms...
Assignee • Comment 7 • 6 years ago
Sorry, I meant quarantined worker, yes. So if it's not quarantined, what's special about this workerId? Or workerType?
Reporter • Comment 8 • 6 years ago
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #5)
What did you mean by "special"?
Clearly this doesn't happen for all provisionerId/workerType/workerId combinations, otherwise we'd have an outage. But since it happens consistently and reproducibly for the given provisionerId/workerType/workerId combination, that makes it special; it is behaving differently from the many workers that are happily running tasks in production at the moment without exhibiting this issue.
Reporter • Comment 9 • 6 years ago
To recap:
We have three reproducible queries (claimWork, pendingTasks, listWorkers). The claimWork call gets back zero tasks (immediately), despite the fact that:
- the queue should hold the connection for 20s when returning no tasks, since the queue implements 20s long-polling
- there are pending tasks for this worker type (according to the pendingTasks query)
- the worker is not listed as quarantined nor as not-quarantined (according to the listWorkers query)
If the worker is not listed as quarantined, and not listed as not-quarantined, calling claimWork should return one of the pending tasks, and the worker should be listed in future listWorkers calls. This is not happening.
Assignee • Comment 10 • 6 years ago
You're correct that quarantining is not waiting the requisite 20 seconds: https://github.com/taskcluster/taskcluster/blob/master/services/queue/src/api.js#L1336
What I'm confused about is "the worker is not listed as quarantined nor as not-quarantined (according to the listWorkers query)". What does that mean? I don't see any workerType gecko-t-osx-1010-beta defined at all under provisionerId releng-hardware. Perhaps it has just timed out?
Anyway, I'm going to just assume that this worker is (or at least was) quarantined and that's the behavior you're seeing.
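For illustration, the shape of the problem at that line is roughly the following. This is only a sketch of the pattern (with a hypothetical placeholder for the normal claim path), not the actual services/queue/src/api.js code:

```typescript
// Sketch of the behaviour described above (not the actual services/queue/src/api.js code).
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Buggy shape: a quarantined worker gets an empty reply immediately, so it can
// end up hammering the endpoint in a tight loop instead of waiting ~20s per call.
async function claimWorkBuggy(quarantined: boolean): Promise<{ tasks: unknown[] }> {
  if (quarantined) {
    return { tasks: [] }; // returns without the long-poll hold
  }
  return pollForTasksUntilDeadline();
}

// Fixed shape: the quarantine case still yields zero tasks, but only after the
// same long-poll window that an empty queue would get.
async function claimWorkFixed(quarantined: boolean): Promise<{ tasks: unknown[] }> {
  if (quarantined) {
    await sleep(20_000);
    return { tasks: [] };
  }
  return pollForTasksUntilDeadline();
}

// Placeholder for the normal claim path (hypothetical).
async function pollForTasksUntilDeadline(): Promise<{ tasks: unknown[] }> {
  return { tasks: [] };
}
```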
Assignee • Comment 11 • 6 years ago
Assignee • Comment 12 • 6 years ago
dustin@lamport ~ $ echo '{"workerGroup": "mdc1", "workerId": "t-yosemite-r7-380", "tasks": 1}' | time taskcluster api queue claimWork releng-hardawre gecko-t-osx-1010-beta
{
"tasks": []
}
0.06user 0.00system 0:20.66elapsed 0%CPU (0avgtext+0avgdata 15244maxresident)k
0inputs+0outputs (0major+1736minor)pagefaults 0swaps
I'm still confused about the quarantined bit, but for that worker at least this is fixed. Pete, can you confirm?
Reporter • Comment 13 • 6 years ago
For this worker (t-yosemite-r7-380) it does seem to be fixed, but I guess someone just removed the quarantine for that worker explicitly?
I think the events that led to the problem were as follows:
- the worker got quarantined
- it continued to run, and for a while, all seemed ok - it appeared in the list of quarantined workers, so it was visible that there was a worker out of commission, and it didn't get tasks from the queue, so all was good
- when the worker called queue.claimWork, the queue (incorrectly) didn't update its 'last seen' timestamp for the worker, so at some point, the queue considered that the worker hadn't been seen for a long enough amount of time that it was ok to delete its record of that worker's existence (but without deleting the record of it being quarantined)[1]
- somebody noticed that this worker was running, getting no tasks, and no records could be seen in taskcluster of this worker being quarantined, or even existing
I think the important fix here is that when a worker calls claimWork, even if it is quarantined, the queue should update its recorded 'last seen' field for that workerId, so that somebody browsing the worker pool can see that there is an active worker that has recently tried to claim a task but is quarantined.
The other problem is that if a quarantine record exists for a given workerId, that workerId should appear in the list of workers for its worker type: certainly if it has been seen by the queue recently despite being quarantined, and possibly even if it hasn't been seen by the queue for a long time. In other words, I don't think we should ever filter quarantined workerIds out of the list of workers for a given worker type, even if we do choose to filter out workerIds that haven't called claimWork for a long time.
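A sketch of that 'important fix', assuming a hypothetical WorkerStore interface rather than the Queue's real storage layer: claimWork refreshes the worker's last-seen/expiry record even when the worker is quarantined, so it keeps showing up for people browsing the worker pool.

```typescript
// Hypothetical sketch of the 'important fix': claimWork refreshes the worker's
// record even when it is quarantined. WorkerStore is a made-up interface,
// not the Queue's real storage layer.
interface WorkerRecord {
  workerGroup: string;
  workerId: string;
  lastSeen: Date;
  expires: Date;
  quarantineUntil?: Date;
}

interface WorkerStore {
  upsert(record: WorkerRecord): Promise<void>;
}

async function touchWorkerOnClaimWork(
  store: WorkerStore,
  workerGroup: string,
  workerId: string,
  quarantineUntil?: Date,
): Promise<void> {
  const now = new Date();
  // Refresh lastSeen/expires unconditionally, so a quarantined worker that keeps
  // calling claimWork does not silently age out of the worker list.
  await store.upsert({
    workerGroup,
    workerId,
    lastSeen: now,
    expires: new Date(now.getTime() + 24 * 60 * 60 * 1000), // e.g. retain for a day
    quarantineUntil,
  });
}
```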
I hope that makes sense!
--
[1] it could also be that it didn't delete the record, but in queries to show known workers for the given worker type, it chose not to return it in the list
Assignee • Comment 14 • 6 years ago
OK. So I think the first part of this is fixed (the "important fix" in your last message).
But you make a good point that we should not expire quarantined workers. In fact, we don't -- but we also don't display them! So they're stuck in a "limbo" where they are still quarantined but not visible.
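A sketch of that second part, using a hypothetical Worker shape rather than the Queue's real entity: listWorkers treats an unexpired quarantine as sufficient reason to keep a worker in the list, however stale its expiry is.

```typescript
// Sketch of a listWorkers filter that never hides quarantined workers.
// The Worker shape here is hypothetical, not the Queue's real entity.
type Worker = { workerId: string; expires: Date; quarantineUntil?: Date };

function shouldListWorker(w: Worker, now: Date = new Date()): boolean {
  const quarantined = w.quarantineUntil !== undefined && w.quarantineUntil > now;
  // Quarantined workers are always listed, so they cannot end up in the
  // "still quarantined but invisible" limbo; only stale, non-quarantined
  // workers are filtered out of the list.
  return quarantined || w.expires > now;
}
```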
Assignee • Comment 15 • 6 years ago
Assignee • Comment 16 • 6 years ago
OK, that is the second part.
Updated • 6 years ago