idle gecko-t/t-linux-metal workers
Categories
(Taskcluster :: Operations and Service Requests, defect)
Tracking
(Not tracked)
People
(Reporter: aerickson, Assigned: dustin)
Details
Attachments
(1 file)
I've noticed that there are a few gecko-t/t-linux-metal workers that aren't working. One is quarantined and the others just don't seem to be taking jobs.
t-linux-metal.i-0ee0201095f06b0a6 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
t-linux-metal.i-04bde5f7c6e8c6a26 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
t-linux-metal.i-0d0959a99d940981b {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
t-linux-metal.i-03ee7bfd4677f3bc3 {sr: [== ] 20.0%, suc: 4, cmp: 20, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['Low health (less than 0.85)!', 'Quarantined.']}
Assignee
Comment 1•4 years ago
Here's what the worker-manager thinks is running right now, along with the created times (all today):
lamport ~/p/taskcluster [master] $ TASKCLUSTER_ROOT_URL=https://firefox-ci-tc.services.mozilla.com/ taskcluster api workerManager listWorkersForWorkerPool gecko-t/t-linux-metal | jq -r '.workers[] | select(.state == "running") | .workerGroup + " " + .workerId + " -- " + .created'
us-east-1 i-0025a27bb9280ce57 -- 2020-07-06T13:57:38.306Z
us-east-1 i-01ea999f4302badf2 -- 2020-07-06T09:34:31.642Z
us-east-1 i-0b911d2703edefdf2 -- 2020-07-06T09:04:41.412Z
us-east-1 i-0d4f0eec5171d6fb6 -- 2020-07-06T13:17:37.286Z
us-west-1 i-0606ad0a1d0117223 -- 2020-07-06T13:34:41.440Z
us-west-2 i-02d871b1aa48ab102 -- 2020-07-06T13:34:39.919Z
us-west-2 i-063fe378e13132f4f -- 2020-07-06T13:25:01.260Z
us-west-2 i-094a583afaeb7bbae -- 2020-07-06T13:57:34.751Z
us-west-2 i-0ffa4c92fc13b4a6c -- 2020-07-06T13:17:38.826Z
None of that overlaps with the four instances identified above. Looking for those specific instances, three of them (not including the quarantined one) are still known to worker-manager, but in state "stopped". All three of those were also created today.
lamport ~/p/taskcluster [master] $ for wid in i-0ee0201095f06b0a6 i-04bde5f7c6e8c6a26 i-0d0959a99d940981b i-03ee7bfd4677f3bc3; do TASKCLUSTER_ROOT_URL=https://firefox-ci-tc.services.mozilla.com/ taskcluster api workerManager listWorkersForWorkerPool gecko-t/t-linux-metal | jq -r '.workers[] | select(.workerId == "'$wid'")'; done
{
"workerPoolId": "gecko-t/t-linux-metal",
"workerGroup": "us-east-1",
"workerId": "i-0ee0201095f06b0a6",
"providerId": "aws",
"created": "2020-07-06T02:35:21.473Z",
"expires": "2020-07-13T02:35:21.473Z",
"state": "stopped",
"capacity": 15,
"lastModified": "2020-07-06T02:42:31.456Z",
"lastChecked": "2020-07-06T03:03:59.584Z"
}
{
"workerPoolId": "gecko-t/t-linux-metal",
"workerGroup": "us-west-2",
"workerId": "i-04bde5f7c6e8c6a26",
"providerId": "aws",
"created": "2020-07-06T13:57:36.324Z",
"expires": "2020-07-13T13:57:36.324Z",
"state": "stopped",
"capacity": 15,
"lastModified": "2020-07-06T14:04:27.996Z",
"lastChecked": "2020-07-06T14:30:28.957Z"
}
{
"workerPoolId": "gecko-t/t-linux-metal",
"workerGroup": "us-west-2",
"workerId": "i-0d0959a99d940981b",
"providerId": "aws",
"created": "2020-07-06T05:47:17.771Z",
"expires": "2020-07-13T05:47:17.771Z",
"state": "stopped",
"capacity": 32,
"lastModified": "2020-07-06T05:54:27.521Z",
"lastChecked": "2020-07-06T06:19:55.736Z"
}
Assignee
Comment 2•4 years ago
Also, the email I got from Bugzilla contains five hosts:
t-linux-metal.i-0d4f0eec5171d6fb6 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 20, notes: ['No jobs in queue.'], alerts: ['No work done!']}
t-linux-metal.i-0ee0201095f06b0a6 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
t-linux-metal.i-04bde5f7c6e8c6a26 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
t-linux-metal.i-0d0959a99d940981b {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
t-linux-metal.i-03ee7bfd4677f3bc3 {sr: [== ] 20.0%, suc: 4, cmp: 20, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['Low health (less than 0.85)!', 'Quarantined.']}
and it looks like the first was removed... what changed? To avoid confusion, please don't silently update important content in bugs like this! It's fine to just make a new comment, or, if editing in place helps avoid future confusion, a note like [edit: removed instance ... because it was a typo] would help others (me) not lose time chasing phantoms.
Assignee
Comment 3•4 years ago
Looking in the EC2 console, I see roughly what worker-manager sees, shown above. One of the us-east-1 hosts has gone away since I asked worker-manager about it, but otherwise the lists are identical.
So, I don't see anything at all about the four (or five!) instances listed in the first comment, which leads me to ask... where are those from?
Assignee
Updated•4 years ago
Assignee
Updated•4 years ago
Assignee
Comment 4•4 years ago
I filed https://github.com/taskcluster/taskcluster/issues/3166 regarding partial usage of a large group of high-capacity workers. Is that what the monitoring is discovering here?
Reporter
Comment 5•4 years ago
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #2)
> Also, the email I got from Bugzilla contains five hosts:
> t-linux-metal.i-0d4f0eec5171d6fb6 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 20, notes: ['No jobs in queue.'], alerts: ['No work done!']}
> t-linux-metal.i-0ee0201095f06b0a6 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
> t-linux-metal.i-04bde5f7c6e8c6a26 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
> t-linux-metal.i-0d0959a99d940981b {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
> t-linux-metal.i-03ee7bfd4677f3bc3 {sr: [== ] 20.0%, suc: 4, cmp: 20, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['Low health (less than 0.85)!', 'Quarantined.']}
> and it looks like the first was removed... what changed? To avoid confusion, please don't silently update important content in bugs like this! It's fine to just make a new comment, or, if editing in place helps avoid future confusion, a note like [edit: removed instance ... because it was a typo] would help others (me) not lose time chasing phantoms.
I removed a host because it was not in my original email and it was not having the issue described.
Reporter
Comment 6•4 years ago
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #3)
> Looking in the EC2 console, I see roughly what worker-manager sees, shown above. One of the us-east-1 hosts has gone away since I asked worker-manager about it, but otherwise the lists are identical.
> So, I don't see anything at all about the four (or five!) instances listed in the first comment, which leads me to ask... where are those from?
The data is from Taskcluster. https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/t-linux-metal?sortBy=Quarantined&sortDirection=desc
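For reference, here's a minimal sketch of pulling that same list straight from the queue service with the taskcluster CLI instead of the UI. The listWorkers method name and the firstClaim/quarantineUntil field names are my assumptions about the queue API, so double-check them before relying on this:

# List the workers the queue currently knows about for this pool,
# including any quarantine timestamp (field names assumed).
TASKCLUSTER_ROOT_URL=https://firefox-ci-tc.services.mozilla.com/ \
  taskcluster api queue listWorkers gecko-t t-linux-metal | \
  jq -r '.workers[] | .workerGroup + " " + .workerId + " firstClaim=" + .firstClaim + " quarantineUntil=" + (.quarantineUntil // "-")'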
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #4)
> I filed https://github.com/taskcluster/taskcluster/issues/3166 regarding partial usage of a large group of high-capacity workers. Is that what the monitoring is discovering here?
This ticket was about idle and quarantined workers. I was wrong about the idle workers: I kept noticing 3 instances that hadn't done any work and assumed they were the same nodes each time.
t-linux-metal.i-03ee7bfd4677f3bc3 is still reported as quarantined. If the instance has been deleted, can the quarantine record be cleared as well?
Assignee
Comment 7•4 years ago
Ah, so the identified hosts are from the queue-workers data. What about {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']} -- what is that?
That queue-workers data is based on calls from workers to the queue.claimWork and queue.reclaimTask endpoints. Workers are added to the list and updated when they call those endpoints, and they expire from that list when they have not called those endpoints for 5 days (usually because the worker is gone). So presence or absence in that list is really not indicative of anything. We have a lastDateActive for provisioners and worker types, but for some reason not for workers. That would be useful information!
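For example, the worker-type record does carry that timestamp today; a sketch, where getWorkerType and the lastDateActive field name are what I recall of the queue API and should be treated as unverified:

# Worker-type-level activity timestamp; there is no per-worker equivalent.
TASKCLUSTER_ROOT_URL=https://firefox-ci-tc.services.mozilla.com/ \
  taskcluster api queue getWorkerType gecko-t t-linux-metal | jq -r .lastDateActive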
Quarantined workers do not automatically expire from the list while they are still quarantined; without that, a quarantine couldn't last more than 5 days. That makes more sense for hardware workers, where for example a chassis might need to be sent out for repairs. If you reset the quarantine on that worker, it will be deleted in the next expiration run. I don't think I have permissions to do that.
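For anyone with the right scopes, lifting the quarantine would look roughly like this. This is only a sketch: I'm assuming the CLI reads the quarantineWorker payload from stdin, and <workerGroup> is a placeholder for the worker's actual group:

# Setting quarantineUntil in the past effectively un-quarantines the worker,
# so it can then expire from the list as usual (payload shape assumed).
echo '{"quarantineUntil": "2020-07-01T00:00:00.000Z"}' | \
  TASKCLUSTER_ROOT_URL=https://firefox-ci-tc.services.mozilla.com/ \
  taskcluster api queue quarantineWorker gecko-t t-linux-metal <workerGroup> i-03ee7bfd4677f3bc3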
Anyway, this API has proven to be not that useful and has definitely confused a lot of people, so we are planning to eventually replace or supplement it with data from worker-manager, which is actually aware of when a worker starts and stops.
Reporter
Comment 8•4 years ago
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #7)
> Ah, so the identified hosts are from the queue-workers data. What about {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']} -- what is that?
The "No jobs in queue" is from a call to https://firefox-ci-tc.services.mozilla.com/api/queue/v1/pending. "No work done" is from the worker data (if no jobs are running or completed). SR (success rate) is also calculated based on that data. It would be great to have TC keep this data.
> That queue-workers data is based on calls from workers to the queue.claimWork and queue.reclaimTask endpoints. Workers are added to the list and updated when they call those endpoints, and they expire from that list when they have not called those endpoints for 5 days (usually because the worker is gone). So presence or absence in that list is really not indicative of anything. We have a lastDateActive for provisioners and worker types, but for some reason not for workers. That would be useful information!
That would be very handy.
> Quarantined workers do not automatically expire from the list while they are still quarantined; without that, a quarantine couldn't last more than 5 days. That makes more sense for hardware workers, where for example a chassis might need to be sent out for repairs. If you reset the quarantine on that worker, it will be deleted in the next expiration run. I don't think I have permissions to do that.
The worker-manager knows these instances aren't long-lived, so it could clean up the quarantine record when the instance is deleted.
> Anyway, this API has proven to be not that useful and has definitely confused a lot of people, so we are planning to eventually replace or supplement it with data from worker-manager, which is actually aware of when a worker starts and stops.
It's super useful for RelOps; we'd be pretty blind without it (we'd just have to watch queue counts). Supplementing would be preferred, as we have existing tooling built around the API endpoint.
Comment 9•4 years ago
Comment 10•4 years ago
Comment on attachment 9169941 [details]
Bug 1650928 - Move IsDisabled() from nsGeneralHTMLElement to Element r=emilio
Revision D87022 was moved to bug 1659028. Setting attachment 9169941 [details] to obsolete.
Updated•4 years ago
Updated•4 years ago
Updated•4 years ago