Closed Bug 1650928 Opened 4 years ago Closed 4 years ago

idle gecko-t/t-linux-metal workers

Categories

(Taskcluster :: Operations and Service Requests, defect)


Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: aerickson, Assigned: dustin)

Details

Attachments

(1 file)

I've noticed a few gecko-t/t-linux-metal workers that aren't working: one is quarantined, and the others simply don't seem to be taking jobs.

t-linux-metal.i-0ee0201095f06b0a6 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
t-linux-metal.i-04bde5f7c6e8c6a26 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
t-linux-metal.i-0d0959a99d940981b {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
t-linux-metal.i-03ee7bfd4677f3bc3 {sr: [== ] 20.0%, suc: 4, cmp: 20, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['Low health (less than 0.85)!', 'Quarantined.']}

Here's what the worker-manager thinks is running right now, along with the created times (all today):

lamport ~/p/taskcluster [master] $ TASKCLUSTER_ROOT_URL=https://firefox-ci-tc.services.mozilla.com/ taskcluster api workerManager listWorkersForWorkerPool gecko-t/t-linux-metal | jq -r '.workers[] | select(.state == "running") | .workerGroup + " " + .workerId + " -- " + .created'
us-east-1 i-0025a27bb9280ce57 -- 2020-07-06T13:57:38.306Z
us-east-1 i-01ea999f4302badf2 -- 2020-07-06T09:34:31.642Z
us-east-1 i-0b911d2703edefdf2 -- 2020-07-06T09:04:41.412Z
us-east-1 i-0d4f0eec5171d6fb6 -- 2020-07-06T13:17:37.286Z
us-west-1 i-0606ad0a1d0117223 -- 2020-07-06T13:34:41.440Z
us-west-2 i-02d871b1aa48ab102 -- 2020-07-06T13:34:39.919Z
us-west-2 i-063fe378e13132f4f -- 2020-07-06T13:25:01.260Z
us-west-2 i-094a583afaeb7bbae -- 2020-07-06T13:57:34.751Z
us-west-2 i-0ffa4c92fc13b4a6c -- 2020-07-06T13:17:38.826Z

None of that overlaps with the four instances identified above. Looking for those specific instances, three of them (not including the quarantined one) are still known to worker-manager, but in state "stopped". All three of those were also created today.
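The cross-referencing done with jq above can be sketched in a few lines. This is an illustrative helper, not Taskcluster code; the worker IDs are taken from the comments in this bug, and the sample records are abbreviated stand-ins for worker-manager's listWorkersForWorkerPool output.

```python
# Split a list of flagged worker IDs into those that overlap with
# worker-manager's "running" workers and those that don't.
def split_by_overlap(flagged_ids, workers):
    """Return (overlapping, missing) flagged IDs given worker records."""
    running_ids = {w["workerId"] for w in workers if w["state"] == "running"}
    overlapping = [wid for wid in flagged_ids if wid in running_ids]
    missing = [wid for wid in flagged_ids if wid not in running_ids]
    return overlapping, missing

# The four instances flagged in comment 0.
flagged = [
    "i-0ee0201095f06b0a6",
    "i-04bde5f7c6e8c6a26",
    "i-0d0959a99d940981b",
    "i-03ee7bfd4677f3bc3",
]
# Abbreviated stand-in for worker-manager records: one running worker
# from the list above, plus one flagged worker in state "stopped".
workers = [
    {"workerId": "i-0025a27bb9280ce57", "state": "running"},
    {"workerId": "i-0ee0201095f06b0a6", "state": "stopped"},
]
overlap, missing = split_by_overlap(flagged, workers)
```

With these inputs, none of the flagged IDs overlap the running set, matching the observation above.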

lamport ~/p/taskcluster [master] $ for wid in i-0ee0201095f06b0a6 i-04bde5f7c6e8c6a26 i-0d0959a99d940981b i-03ee7bfd4677f3bc3; do TASKCLUSTER_ROOT_URL=https://firefox-ci-tc.services.mozilla.com/ taskcluster api workerManager listWorkersForWorkerPool gecko-t/t-linux-metal | jq -r '.workers[] | select(.workerId == "'$wid'")'; done
{
  "workerPoolId": "gecko-t/t-linux-metal",
  "workerGroup": "us-east-1",
  "workerId": "i-0ee0201095f06b0a6",
  "providerId": "aws",
  "created": "2020-07-06T02:35:21.473Z",
  "expires": "2020-07-13T02:35:21.473Z",
  "state": "stopped",
  "capacity": 15,
  "lastModified": "2020-07-06T02:42:31.456Z",
  "lastChecked": "2020-07-06T03:03:59.584Z"
}
{
  "workerPoolId": "gecko-t/t-linux-metal",
  "workerGroup": "us-west-2",
  "workerId": "i-04bde5f7c6e8c6a26",
  "providerId": "aws",
  "created": "2020-07-06T13:57:36.324Z",
  "expires": "2020-07-13T13:57:36.324Z",
  "state": "stopped",
  "capacity": 15,
  "lastModified": "2020-07-06T14:04:27.996Z",
  "lastChecked": "2020-07-06T14:30:28.957Z"
}
{
  "workerPoolId": "gecko-t/t-linux-metal",
  "workerGroup": "us-west-2",
  "workerId": "i-0d0959a99d940981b",
  "providerId": "aws",
  "created": "2020-07-06T05:47:17.771Z",
  "expires": "2020-07-13T05:47:17.771Z",
  "state": "stopped",
  "capacity": 32,
  "lastModified": "2020-07-06T05:54:27.521Z",
  "lastChecked": "2020-07-06T06:19:55.736Z"
}

Also, the email I got from Bugzilla contains five hosts:

    t-linux-metal.i-0d4f0eec5171d6fb6 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 20, notes: ['No jobs in queue.'], alerts: ['No work done!']}
    t-linux-metal.i-0ee0201095f06b0a6 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
    t-linux-metal.i-04bde5f7c6e8c6a26 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
    t-linux-metal.i-0d0959a99d940981b {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
    t-linux-metal.i-03ee7bfd4677f3bc3 {sr: [== ] 20.0%, suc: 4, cmp: 20, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['Low health (less than 0.85)!', 'Quarantined.']}

and it looks like the first was removed. What changed? To avoid confusion, please don't silently update important content in bugs like this. It's fine to add a new comment instead; or, if editing in place helps avoid future confusion, a note like "[edit: removed instance ... because it was a typo]" would help others (me) avoid losing time chasing phantoms.

Looking in the EC2 console, I see roughly what worker-manager sees, shown above. One of the us-east-1 hosts has gone away since I asked worker-manager about it, but otherwise the lists are identical.

So, I don't see anything at all about the four (or five!) instances listed in the first comment, which leads me to ask: where are those from?

Component: General → Operations and Service Requests
Assignee: nobody → dustin

I filed https://github.com/taskcluster/taskcluster/issues/3166 regarding partial usage of a large group of high-capacity workers. Is that what the monitoring is discovering here?

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #2)

Also, the email I got from Bugzilla contains five hosts:

    t-linux-metal.i-0d4f0eec5171d6fb6 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 20, notes: ['No jobs in queue.'], alerts: ['No work done!']}
    t-linux-metal.i-0ee0201095f06b0a6 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
    t-linux-metal.i-04bde5f7c6e8c6a26 {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
    t-linux-metal.i-0d0959a99d940981b {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']}
    t-linux-metal.i-03ee7bfd4677f3bc3 {sr: [== ] 20.0%, suc: 4, cmp: 20, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['Low health (less than 0.85)!', 'Quarantined.']}

and it looks like the first was removed. What changed? To avoid confusion, please don't silently update important content in bugs like this. It's fine to add a new comment instead; or, if editing in place helps avoid future confusion, a note like "[edit: removed instance ... because it was a typo]" would help others (me) avoid losing time chasing phantoms.

I removed a host because it was not in my original email and it was not having the issue described.

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #3)

Looking in the EC2 console, I see roughly what worker-manager sees, shown above. One of the us-east-1 hosts has gone away since I asked worker-manager about it, but otherwise the lists are identical.

So, I don't see anything at all about the four (or five!) instances listed in the first comment, which leads me to ask: where are those from?

The data is from Taskcluster. https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/t-linux-metal?sortBy=Quarantined&sortDirection=desc

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #4)

I filed https://github.com/taskcluster/taskcluster/issues/3166 regarding partial usage of a large group of high-capacity workers. Is that what the monitoring is discovering here?

The ticket was about idle and quarantined workers. I was wrong about the idle workers: I kept noticing 3 instances that hadn't done any work and assumed they were the same nodes each time.

t-linux-metal.i-03ee7bfd4677f3bc3 is still reported as quarantined. If the instance has been deleted, can the quarantine record be cleared as well?

Ah, so the identified hosts are from the queue-workers data. What about {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']} -- what is that?

That queue-workers data is based on calls from workers to the queue.claimWork and queue.reclaimTask endpoints. Workers are added to the list and updated when they call those endpoints, and expire from that list when they have not called those endpoints for 5 days (usually because the worker is gone). So presence or absence in that list is really not indicative of anything. We have a lastDateActive for provisioners and worker types, but for some reason not for workers. That would be useful information!
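The expiry behavior described above can be sketched as follows. The helper and field names are illustrative (the actual queue service logic lives in Taskcluster, not here); it only captures the two rules stated: records expire after 5 days without claimWork/reclaimTask calls, and an active quarantine prevents expiry.

```python
from datetime import datetime, timedelta

EXPIRY = timedelta(days=5)

def should_expire(last_active, quarantine_until, now):
    """A worker record expires after 5 days without claimWork/reclaimTask
    calls, unless it is still quarantined."""
    if quarantine_until is not None and quarantine_until > now:
        return False  # an active quarantine pins the record in place
    return now - last_active > EXPIRY

now = datetime(2020, 7, 6)
stale = should_expire(now - timedelta(days=6), None, now)              # idle 6 days
pinned = should_expire(now - timedelta(days=30), now + timedelta(days=1), now)
fresh = should_expire(now - timedelta(days=1), None, now)              # idle 1 day
```

This is why presence in the list tells you little: a worker can linger for days after it is gone, and a quarantined record lingers indefinitely.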

Quarantined workers do not automatically expire from the list while they are still quarantined; without that, a quarantine couldn't last more than 5 days. That makes more sense for hardware workers, where for example a chassis might need to be sent out for repairs. If you reset the quarantine on that worker, it will be deleted in the next expiration run. I don't think I have permissions to do that.
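Resetting a quarantine goes through the queue's quarantineWorker endpoint: setting quarantineUntil to a time in the past lifts the quarantine, after which the stale record is removed by the next expiration run. A minimal sketch of the payload (the helper is hypothetical; the CLI invocation in the comment is illustrative and requires the appropriate scopes):

```python
from datetime import datetime, timezone

def lift_quarantine_payload(now=None):
    """Payload for queue.quarantineWorker: a quarantineUntil in the past
    lifts the quarantine."""
    now = now or datetime.now(timezone.utc)
    return {"quarantineUntil": now.isoformat()}

payload = lift_quarantine_payload(datetime(2020, 7, 7, tzinfo=timezone.utc))
# e.g. (illustrative): taskcluster api queue quarantineWorker \
#   gecko-t t-linux-metal us-west-2 i-03ee7bfd4677f3bc3
# with this payload as the request body.
```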

Anyway, this API has proven to be not that useful and has definitely confused a lot of people, so we are planning to eventually replace or supplement it with data from worker-manager, which is actually aware of when a worker starts and stops.

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #7)

Ah, so the identified hosts are from the queue-workers data. What about {sr: [ ] 0.0%, suc: 0, cmp: 0, exc: 0, rng: 0, notes: ['No jobs in queue.'], alerts: ['No work done!']} -- what is that?

The "No jobs in queue" is from a call to https://firefox-ci-tc.services.mozilla.com/api/queue/v1/pending. "No work done" is from the worker data (if no jobs are running or completed). SR (success rate) is also calculated based on that data. It would be great to have TC keep this data.
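Based on that description, the numbers in the report lines above could be derived roughly as follows. The function is illustrative (the actual RelOps tooling is not shown in this bug); the field names suc (successful tasks) and cmp (completed tasks) follow the report format, and the thresholds match the quoted alerts.

```python
# Compute a worker's success rate and alerts from its task counts,
# mirroring the "sr"/"suc"/"cmp" fields in the report lines above.
def worker_health(suc, cmp):
    """Return (success_rate, alerts) for a worker."""
    alerts = []
    sr = suc / cmp if cmp else 0.0
    if cmp == 0:
        alerts.append("No work done!")        # worker completed nothing
    elif sr < 0.85:
        alerts.append("Low health (less than 0.85)!")
    return sr, alerts

# The quarantined worker above: 4 successes out of 20 completed tasks.
sr, alerts = worker_health(suc=4, cmp=20)
# An idle worker: no completed tasks at all.
sr_idle, alerts_idle = worker_health(suc=0, cmp=0)
```

With suc=4, cmp=20 this yields sr=0.2 and the low-health alert, matching the i-03ee7bfd4677f3bc3 line; a worker with cmp=0 gets the "No work done!" alert.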

That queue-workers data is based on calls from workers to the queue.claimWork and queue.reclaimTask endpoints. Workers are added to the list and updated when they call those endpoints, and expire from that list when they have not called those endpoints for 5 days (usually because the worker is gone). So presence or absence in that list is really not indicative of anything. We have a lastDateActive for provisioners and worker types, but for some reason not for workers. That would be useful information!

That would be very handy.

Quarantined workers do not automatically expire from the list while they are still quarantined; without that, a quarantine couldn't last more than 5 days. That makes more sense for hardware workers, where for example a chassis might need to be sent out for repairs. If you reset the quarantine on that worker, it will be deleted in the next expiration run. I don't think I have permissions to do that.

The worker-manager knows these instances aren't long-lived, so it could clean up the quarantine record when the instance is deleted.

Anyway, this API has proven to be not that useful and has definitely confused a lot of people, so we are planning to eventually replace or supplement it with data from worker-manager, which is actually aware of when a worker starts and stops.

It's super useful for RelOps. We'd be pretty blind without it (we'd just have to watch queue counts). Supplementing would be preferred as we have existing tooling around the API endpoint.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INVALID

Comment on attachment 9169941 [details]
Bug 1650928 - Move IsDisabled() from nsGeneralHTMLElement to Element r=emilio

Revision D87022 was moved to bug 1659028. Setting attachment 9169941 [details] to obsolete.

Attachment #9169941 - Attachment is obsolete: true
Attachment #9169941 - Attachment is obsolete: false
Attachment #9169941 - Attachment description: Bug 1650928 - Move IsDisabled() from nsGeneralHTMLElement to Element r=emilio → Bug 1650928 - Move IsDisabled() from nsGeneralHTMLElement to Element r=smaug
Attachment #9169941 - Attachment description: Bug 1650928 - Move IsDisabled() from nsGeneralHTMLElement to Element r=smaug → Bug 1650928 - Move IsDisabled() from nsGeneralHTMLElement to Element r=emilio
