Closed Bug 1493321 Opened 6 years ago Closed 6 years ago

Windows 10 hardware nodes are not getting tasks

Categories

(Taskcluster :: Services, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: markco, Unassigned)

References

Details

Attachments

(2 files)

CiDuty Deployed 30 nodes a few hours ago. All but the nodes below began picking up tasks immediately (1,000+ pending). Sep 21 16:01:41 T-W1064-MS-318.mdc2.mozilla.com generic-worker: 2018/09/21 23:01:40 No task claimed. Idle for 1h6m54.391498s (will exit if no task claimed in 53m5.608502s).#015 Sep 21 16:02:22 T-W1064-MS-320.mdc2.mozilla.com generic-worker: 2018/09/21 23:02:21 No task claimed. Idle for 1h35m23.9761104s (will exit if no task claimed in 24m36.0238896s).#015 Sep 21 16:02:50 T-W1064-MS-321.mdc2.mozilla.com generic-worker: 2018/09/21 23:02:50 No task claimed. Idle for 1h35m28.3924118s (will exit if no task claimed in 24m31.6075882s).#015 Sep 21 16:04:08 T-W1064-MS-322.mdc2.mozilla.com generic-worker: 2018/09/21 23:04:07 No task claimed. Idle for 1h36m13.6820371s (will exit if no task claimed in 23m46.3179629s). Sep 21 16:05:24 T-W1064-MS-324.mdc2.mozilla.com generic-worker: 2018/09/21 23:05:23 No task claimed. Idle for 1h37m23.4765015s (will exit if no task claimed in 22m36.5234985s).#015 Sep 21 16:04:28 T-W1064-MS-323.mdc2.mozilla.com generic-worker: 2018/09/21 23:04:27 No task claimed. Idle for 1h36m18.9467124s (will exit if no task claimed in 23m41.0532876s).#015 Sep 21 15:58:48 T-W1064-MS-345.mdc2.mozilla.com generic-worker: 2018/09/21 22:58:48 No task claimed. Idle for 1h30m4.2717715s (will exit if no task claimed in 29m55.7282285s).#015 What information can I provide that would be helpful?
Blocks: 1451835
Hi Mark, Is this still happening? What we'd need to know is at least: - which worker type are they configured to be? - what is the taskcluster client-id of these machines - were all of these machines configured the same way with something like Puppet or by hand? - more context from the logs Pete, do you have any other ideas of what we'd need to help diagnose this problem?
Flags: needinfo?(pmoore)
Mark/Rob, is this still an issue, or was this resolved? Many Thanks.
Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)
Flags: needinfo?(mcornmesser)
my understanding is that this problem is dealt with in bug 1496526. i believe it is now resolved, but i will defer to mark who will know better about wether tasks are being claimed but it looks like they are (https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw).
Flags: needinfo?(rthijssen)
These nodes still have not picked up tasks. These are all configured to be workerType gecko-t-win10-64-hw, and are being configured by OpenCloudConfig like the other hardware nodes.
Flags: needinfo?(mcornmesser)
This seems to be the case for T-W1064-MS-470: Oct 12 08:15:39 T-W1064-MS-470.mdc2.mozilla.com generic-worker: 2018/10/12 15:15:38 No task claimed. Idle for 1h27m11.6959233s (will exit if no task claimed in 32m48.3040767s).#015 Oct 12 08:15:44 T-W1064-MS-470.mdc2.mozilla.com generic-worker: 2018/10/12 15:15:43 Disk available: 42551791616 bytes#015 Oct 12 08:15:49 T-W1064-MS-470.mdc2.mozilla.com generic-worker: 2018/10/12 15:15:48 Disk available: 42551791616 bytes#015 Oct 12 08:15:54 T-W1064-MS-470.mdc2.mozilla.com generic-worker: 2018/10/12 15:15:53 Disk available: 42551791616 bytes#015 Oct 12 08:15:59 T-W1064-MS-470.mdc2.mozilla.com generic-worker: 2018/10/12 15:15:58 Disk available: 42551791616 bytes#015 Oct 12 08:16:04 T-W1064-MS-470.mdc2.mozilla.com generic-worker: 2018/10/12 15:16:04 Disk available: 42551791616 bytes#015 https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw/workers/mdc2/T-W1064-MS-470 WWjktn28RNusnHWShqS5ww does not correspond to a task that exists. Are you sure this task has already been submitted?
Found another worker, T-W1064-MS-291, exhibiting this problem. From the logs it appears to be healthy but it is missing from the taskcluster worker explorer. Missleading logs are : Oct 15 23:46:25 T-W1064-MS-291.mdc1.mozilla.com generic-worker: 2018/10/16 06:46:24 Disk available: 42545258496 bytes#015 Oct 15 23:46:25 T-W1064-MS-291.mdc1.mozilla.com generic-worker: 2018/10/16 06:46:25 No task claimed. Idle for 31m42.6609644s (will exit if no task claimed in 1h28m17.3390356s).#015 Oct 15 23:46:30 T-W1064-MS-291.mdc1.mozilla.com generic-worker: 2018/10/16 06:46:30 Disk available: 42545258496 bytes#015 Oct 15 23:46:36 T-W1064-MS-291.mdc1.mozilla.com generic-worker: 2018/10/16 06:46:35 Disk available: 42545258496 bytes#015 https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw/workers/mdc1/T-W1064-MS-291
This worker (T-W1064-MS-291.mdc1.mozilla.com) is configured as follows: Config: { "accessToken": "*************", "authBaseURL": "https://auth.taskcluster.net/v1", "availabilityZone": "", "cachesDir": "C:\\caches", "certificate": "", "checkForNewDeploymentEverySecs": 0, "cleanUpTaskDirs": true, "clientId": "project/releng/generic-worker/gecko-t-win10-64-hw/production", "deploymentId": "", "disableReboots": true, "downloadsDir": "C:\\downloads", "idleTimeoutSecs": 7200, "instanceId": "", "instanceType": "", "livelogCertificate": "C:\\generic-worker\\livelog.crt", "livelogExecutable": "C:\\generic-worker\\livelog.exe", "livelogGETPort": 60023, "livelogKey": "C:\\generic-worker\\livelog.key", "livelogPUTPort": 60022, "livelogSecret": "*************", "numberOfTasksToRun": 0, "privateIP": "", "provisionerBaseURL": "", "provisionerId": "releng-hardware", "publicIP": "10.49.40.192", "purgeCacheBaseURL": "https://purge-cache.taskcluster.net/v1", "queueBaseURL": "https://queue.taskcluster.net/v1", "region": "", "requiredDiskSpaceMegabytes": 10240, "runAfterUserCreation": "C:\\generic-worker\\task-user-init.cmd", "runTasksAsCurrentUser": false, "sentryProject": "generic-worker", "shutdownMachineOnIdle": false, "shutdownMachineOnInternalError": false, "signingKeyLocation": "C:\\generic-worker\\generic-worker-gpg-signing-key.key", "subdomain": "taskcluster-worker.net", "taskclusterProxyExecutable": "taskcluster-proxy", "taskclusterProxyPort": 80, "tasksDir": "C:\\Users", "workerGroup": "mdc1", "workerId": "T-W1064-MS-291", "workerType": "gecko-t-win10-64-hw", "workerTypeMetadata": { "config": { "deploymentId": "", "runTasksAsCurrentUser": false }, "generic-worker": { "go-arch": "amd64", "go-os": "windows", "go-version": "go1.10.3", "release": "https://github.com/taskcluster/generic-worker/releases/tag/v10.11.2", "revision": "4b47d5d9f45eabc283947da0a6b1c02eddd0a7fe", "source": "https://github.com/taskcluster/generic-worker/commits/4b47d5d9f45eabc283947da0a6b1c02eddd0a7fe", "version": "10.11.2" }, "machine-setup": { "created": "20170404", "maintainer": "Markco \u003cmcornmesser@mozilla.com\u003e", "manifest": "https://github.com/markcor/OpenCloudConfig/blob/master/userdata/Manifest/gecko-t-win10-64-hw.json" } } } These values look correct: "provisionerId": "releng-hardware", "workerType": "gecko-t-win10-64-hw", "workerId": "T-W1064-MS-291", "workerGroup": "mdc1", "queueBaseURL": "https://queue.taskcluster.net/v1", "clientId": "project/releng/generic-worker/gecko-t-win10-64-hw/production", I notice that the "Disk available" message appears every 5 seconds, whereas the http request to the queue should hold the connection for 20 seconds (long polling), so normally on successful workers, we'll see "Disk available" messages with > 20s time lapse between them. Since we don't see HTTP failures in the logs, this suggests to me that maybe the queue is returning no tasks with an HTTP 200 status code. Alternatively, the worker isn't logging bad HTTP responses. I've just checked what happens when the worker has no network connection by disconnecting a running worker from the network, and it logs: > 2018/10/16 19:08:56 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: dial tcp: lookup queue.taskcluster.net: no such host So I think the problem *isn't* that it can't reach the queue at all. I'm a bit baffled at the moment. The next step is to find out if the connection to the queue is made to claim work, and immediately dropped, or the HTTP request to the queue never arrives.
(In reply to Pete Moore [:pmoore][:pete] from comment #8) > I notice that the "Disk available" message appears every 5 seconds, whereas > the http request to the queue should hold the connection for 20 seconds > (long polling), so normally on successful workers, we'll see "Disk > available" messages with > 20s time lapse between them. I remembered seeing this before, and tracked down the previous occurrence to bug 1485380 comment 5. It turns out to be the same host. But ... why?! :/
See Also: → 1485380
Note, I can make a generic-worker release which provides more debugging information... Let's do that.
I am taking a look at ms-318 to verify if it is talking to the queue. I am assuming that ec2-50-17-218-87.compute-1.amazonaws.com:https is the queue. C:\Users\task_1538190400>netstat -f Active Connections Proto Local Address Foreign Address State TCP 10.51.40.4:5900 rejh1.srv.releng.mdc2.mozilla.com:40876 ESTABLISHED TCP 10.51.40.4:5900 rejh1.srv.releng.mdc2.mozilla.com:40880 ESTABLISHED TCP 10.51.40.4:49672 log-aggregator1.srv.releng.mdc2.mozilla.com:cmd ESTABLISHED TCP 10.51.40.4:49677 52.165.175.144:https ESTABLISHED TCP 10.51.40.4:49763 ec2-50-17-218-87.compute-1.amazonaws.com:https ESTABLISHED TCP 10.51.40.4:49776 a-0001.a-msedge.net:https ESTABLISHED TCP 10.51.40.4:49777 13.107.9.158:https ESTABLISHED TCP 10.51.40.4:49779 204.79.197.222:https ESTABLISHED TCP 10.51.40.4:49780 13.107.3.254:https ESTABLISHED TCP 10.51.40.4:49781 13.107.56.254:https ESTABLISHED TCP 10.51.40.4:49782 204.79.197.254:https ESTABLISHED TCP 10.51.40.4:49783 40.114.105.100:https TIME_WAIT TCP 127.0.0.1:49670 T-W1064-MS-318.mdc2.mozilla.com:49671 ESTABLISHED TCP 127.0.0.1:49671 T-W1064-MS-318.mdc2.mozilla.com:49670 ESTABLISHED For comparison from ms-581, which has behaved as expected: C:\Users\task_1539641194>netstat -f Active Connections Proto Local Address Foreign Address State TCP 10.51.40.182:5900 rejh1.srv.releng.mdc2.mozilla.com:38046 ESTABLISHED TCP 10.51.40.182:49672 log-aggregator1.srv.releng.mdc2.mozilla.com:cmd ESTABLISHED TCP 10.51.40.182:49677 52.173.24.17:https ESTABLISHED TCP 10.51.40.182:49761 ec2-50-16-192-183.compute-1.amazonaws.com:https ESTABLISHED TCP 10.51.40.182:49772 65.55.44.109:https TIME_WAIT TCP 10.51.40.182:49773 65.55.44.109:https TIME_WAIT TCP 10.51.40.182:49775 a-0001.a-msedge.net:https ESTABLISHED TCP 10.51.40.182:49776 bingforbusiness.com:https ESTABLISHED TCP 10.51.40.182:49777 204.79.197.222:https ESTABLISHED TCP 10.51.40.182:49778 13.107.51.254:https ESTABLISHED TCP 10.51.40.182:49779 13.107.136.254:https ESTABLISHED TCP 10.51.40.182:49780 13.107.128.254:https ESTABLISHED TCP 127.0.0.1:49670 T-W1064-MS-581.mdc2.mozilla.com:49671 ESTABLISHED TCP 127.0.0.1:49671 T-W1064-MS-581.mdc2.mozilla.com:49670 ESTABLISHED
All the bellow MSs are missing from TC with Error ResourceNotFound $RANDOMSTRING does not correspond to a task that exists. Are you sure this task has already been submitted? T-W1064-MS-318 | https://bugzilla.mozilla.org/show_bug.cgi?id=1495255 T-W1064-MS-320 | https://bugzilla.mozilla.org/show_bug.cgi?id=1495257 T-W1064-MS-321 | https://bugzilla.mozilla.org/show_bug.cgi?id=1482758 T-W1064-MS-322 | https://bugzilla.mozilla.org/show_bug.cgi?id=1480216 T-W1064-MS-323 | https://bugzilla.mozilla.org/show_bug.cgi?id=1495263 T-W1064-MS-324 | https://bugzilla.mozilla.org/show_bug.cgi?id=1480778
Depends on: 1495256
Blocks: 1495265
Reimaged 318 and is running tasks. Will do the rest from comment #12.
Attached image quarantine.PNG
All machines from this bug where fund in quarantine. I have reimaged them, removed from Q and they all running tasks.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Depends on: 1519849
Component: Queue → Services
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: