Bug 1493321 (Closed) - Windows 10 hardware nodes are not getting tasks
Opened 6 years ago; closed 6 years ago
Categories: Taskcluster :: Services (defect)
Tracking: not tracked
Status: RESOLVED FIXED
Reporter: markco; Assignee: Unassigned
Attachments: 2 files
CiDuty deployed 30 nodes a few hours ago. All but the nodes below began picking up tasks immediately (1,000+ tasks pending).
Sep 21 16:01:41 T-W1064-MS-318.mdc2.mozilla.com generic-worker: 2018/09/21 23:01:40 No task claimed. Idle for 1h6m54.391498s (will exit if no task claimed in 53m5.608502s).
Sep 21 16:02:22 T-W1064-MS-320.mdc2.mozilla.com generic-worker: 2018/09/21 23:02:21 No task claimed. Idle for 1h35m23.9761104s (will exit if no task claimed in 24m36.0238896s).
Sep 21 16:02:50 T-W1064-MS-321.mdc2.mozilla.com generic-worker: 2018/09/21 23:02:50 No task claimed. Idle for 1h35m28.3924118s (will exit if no task claimed in 24m31.6075882s).
Sep 21 16:04:08 T-W1064-MS-322.mdc2.mozilla.com generic-worker: 2018/09/21 23:04:07 No task claimed. Idle for 1h36m13.6820371s (will exit if no task claimed in 23m46.3179629s).
Sep 21 16:05:24 T-W1064-MS-324.mdc2.mozilla.com generic-worker: 2018/09/21 23:05:23 No task claimed. Idle for 1h37m23.4765015s (will exit if no task claimed in 22m36.5234985s).
Sep 21 16:04:28 T-W1064-MS-323.mdc2.mozilla.com generic-worker: 2018/09/21 23:04:27 No task claimed. Idle for 1h36m18.9467124s (will exit if no task claimed in 23m41.0532876s).
Sep 21 15:58:48 T-W1064-MS-345.mdc2.mozilla.com generic-worker: 2018/09/21 22:58:48 No task claimed. Idle for 1h30m4.2717715s (will exit if no task claimed in 29m55.7282285s).
What information can I provide that would be helpful?
Comment 1•6 years ago
Hi Mark,
Is this still happening?
What we'd need to know is at least:
- which worker type are they configured to be?
- what is the taskcluster client-id of these machines?
- were all of these machines configured the same way, with something like Puppet, or by hand?
- more context from the logs
Pete, do you have any other ideas of what we'd need to help diagnose this problem?
Flags: needinfo?(pmoore)
Comment 2•6 years ago
Mark/Rob, is this still an issue, or was this resolved?
Many Thanks.
Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)
Flags: needinfo?(mcornmesser)
Comment 3•6 years ago
My understanding is that this problem is dealt with in bug 1496526. I believe it is now resolved, but I will defer to Mark, who will know better about whether tasks are being claimed; it looks like they are (https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw).
Flags: needinfo?(rthijssen)
Comment 4 • 6 years ago (Reporter)
These nodes still have not picked up tasks.
These are all configured as workerType gecko-t-win10-64-hw, and are set up by OpenCloudConfig like the other hardware nodes.
Flags: needinfo?(mcornmesser)
Comment 5 • 6 years ago (Reporter)
Comment 6 • 6 years ago (Reporter)
This seems to be the case for T-W1064-MS-470:
Oct 12 08:15:39 T-W1064-MS-470.mdc2.mozilla.com generic-worker: 2018/10/12 15:15:38 No task claimed. Idle for 1h27m11.6959233s (will exit if no task claimed in 32m48.3040767s).
Oct 12 08:15:44 T-W1064-MS-470.mdc2.mozilla.com generic-worker: 2018/10/12 15:15:43 Disk available: 42551791616 bytes
Oct 12 08:15:49 T-W1064-MS-470.mdc2.mozilla.com generic-worker: 2018/10/12 15:15:48 Disk available: 42551791616 bytes
Oct 12 08:15:54 T-W1064-MS-470.mdc2.mozilla.com generic-worker: 2018/10/12 15:15:53 Disk available: 42551791616 bytes
Oct 12 08:15:59 T-W1064-MS-470.mdc2.mozilla.com generic-worker: 2018/10/12 15:15:58 Disk available: 42551791616 bytes
Oct 12 08:16:04 T-W1064-MS-470.mdc2.mozilla.com generic-worker: 2018/10/12 15:16:04 Disk available: 42551791616 bytes
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw/workers/mdc2/T-W1064-MS-470
That page shows the error: "WWjktn28RNusnHWShqS5ww does not correspond to a task that exists. Are you sure this task has already been submitted?"
Updated•6 years ago
Blocks: t-w1064-ms-470
Comment 7•6 years ago
Found another worker, T-W1064-MS-291, exhibiting this problem. From the logs it appears to be healthy, but it is missing from the Taskcluster worker explorer. The misleading logs are:
Oct 15 23:46:25 T-W1064-MS-291.mdc1.mozilla.com generic-worker: 2018/10/16 06:46:24 Disk available: 42545258496 bytes
Oct 15 23:46:25 T-W1064-MS-291.mdc1.mozilla.com generic-worker: 2018/10/16 06:46:25 No task claimed. Idle for 31m42.6609644s (will exit if no task claimed in 1h28m17.3390356s).
Oct 15 23:46:30 T-W1064-MS-291.mdc1.mozilla.com generic-worker: 2018/10/16 06:46:30 Disk available: 42545258496 bytes
Oct 15 23:46:36 T-W1064-MS-291.mdc1.mozilla.com generic-worker: 2018/10/16 06:46:35 Disk available: 42545258496 bytes
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw/workers/mdc1/T-W1064-MS-291
Comment 8•6 years ago
This worker (T-W1064-MS-291.mdc1.mozilla.com) is configured as follows:
Config: {
"accessToken": "*************",
"authBaseURL": "https://auth.taskcluster.net/v1",
"availabilityZone": "",
"cachesDir": "C:\\caches",
"certificate": "",
"checkForNewDeploymentEverySecs": 0,
"cleanUpTaskDirs": true,
"clientId": "project/releng/generic-worker/gecko-t-win10-64-hw/production",
"deploymentId": "",
"disableReboots": true,
"downloadsDir": "C:\\downloads",
"idleTimeoutSecs": 7200,
"instanceId": "",
"instanceType": "",
"livelogCertificate": "C:\\generic-worker\\livelog.crt",
"livelogExecutable": "C:\\generic-worker\\livelog.exe",
"livelogGETPort": 60023,
"livelogKey": "C:\\generic-worker\\livelog.key",
"livelogPUTPort": 60022,
"livelogSecret": "*************",
"numberOfTasksToRun": 0,
"privateIP": "",
"provisionerBaseURL": "",
"provisionerId": "releng-hardware",
"publicIP": "10.49.40.192",
"purgeCacheBaseURL": "https://purge-cache.taskcluster.net/v1",
"queueBaseURL": "https://queue.taskcluster.net/v1",
"region": "",
"requiredDiskSpaceMegabytes": 10240,
"runAfterUserCreation": "C:\\generic-worker\\task-user-init.cmd",
"runTasksAsCurrentUser": false,
"sentryProject": "generic-worker",
"shutdownMachineOnIdle": false,
"shutdownMachineOnInternalError": false,
"signingKeyLocation": "C:\\generic-worker\\generic-worker-gpg-signing-key.key",
"subdomain": "taskcluster-worker.net",
"taskclusterProxyExecutable": "taskcluster-proxy",
"taskclusterProxyPort": 80,
"tasksDir": "C:\\Users",
"workerGroup": "mdc1",
"workerId": "T-W1064-MS-291",
"workerType": "gecko-t-win10-64-hw",
"workerTypeMetadata": {
"config": {
"deploymentId": "",
"runTasksAsCurrentUser": false
},
"generic-worker": {
"go-arch": "amd64",
"go-os": "windows",
"go-version": "go1.10.3",
"release": "https://github.com/taskcluster/generic-worker/releases/tag/v10.11.2",
"revision": "4b47d5d9f45eabc283947da0a6b1c02eddd0a7fe",
"source": "https://github.com/taskcluster/generic-worker/commits/4b47d5d9f45eabc283947da0a6b1c02eddd0a7fe",
"version": "10.11.2"
},
"machine-setup": {
"created": "20170404",
"maintainer": "Markco \u003cmcornmesser@mozilla.com\u003e",
"manifest": "https://github.com/markcor/OpenCloudConfig/blob/master/userdata/Manifest/gecko-t-win10-64-hw.json"
}
}
}
These values look correct:
"provisionerId": "releng-hardware",
"workerType": "gecko-t-win10-64-hw",
"workerId": "T-W1064-MS-291",
"workerGroup": "mdc1",
"queueBaseURL": "https://queue.taskcluster.net/v1",
"clientId": "project/releng/generic-worker/gecko-t-win10-64-hw/production",
I notice that the "Disk available" message appears every 5 seconds, whereas the http request to the queue should hold the connection for 20 seconds (long polling), so normally on successful workers, we'll see "Disk available" messages with > 20s time lapse between them.
Since we don't see HTTP failures in the logs, this suggests to me that maybe the queue is returning no tasks with an HTTP 200 status code. Alternatively, the worker isn't logging bad HTTP responses. I've just checked what happens when the worker has no network connection by disconnecting a running worker from the network, and it logs:
> 2018/10/16 19:08:56 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: dial tcp: lookup queue.taskcluster.net: no such host
So I think the problem *isn't* that it can't reach the queue at all.
I'm a bit baffled at the moment. The next step is to find out if the connection to the queue is made to claim work, and immediately dropped, or the HTTP request to the queue never arrives.
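For reference, here is a minimal sketch in Go (not the actual generic-worker code) of the claim-work long poll described above. Hawk authentication with the worker's clientId/accessToken is elided, so as written this would get a 401, but it shows the request shape and how one could time the queue's response from a suspect host:

package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
    "time"
)

func main() {
    // Same provisionerId/workerType/workerGroup/workerId as in the config dump above.
    url := "https://queue.taskcluster.net/v1/claim-work/releng-hardware/gecko-t-win10-64-hw"
    body := []byte(`{"workerGroup":"mdc1","workerId":"T-W1064-MS-291","tasks":1}`)

    client := &http.Client{Timeout: 30 * time.Second}
    start := time.Now()
    resp, err := client.Post(url, "application/json", bytes.NewReader(body))
    if err != nil {
        // Network-level failures look like the "no such host" error quoted above.
        fmt.Println("Error:", err)
        return
    }
    defer resp.Body.Close()
    payload, _ := io.ReadAll(resp.Body)

    // For an authenticated worker the queue should hold this request open for
    // up to ~20s when no tasks are available and return 200 with an empty
    // "tasks" array; a response that comes back almost immediately (every ~5s
    // in the logs above) is the suspicious case.
    fmt.Printf("status=%d elapsed=%s body=%s\n", resp.StatusCode, time.Since(start), payload)
}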
Comment 9•6 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #8)
> I notice that the "Disk available" message appears every 5 seconds, whereas
> the http request to the queue should hold the connection for 20 seconds
> (long polling), so normally on successful workers, we'll see "Disk
> available" messages with > 20s time lapse between them.
I remembered seeing this before, and tracked down the previous occurrence to bug 1485380 comment 5. It turns out to be the same host. But ... why?! :/
Comment 10•6 years ago
Note, I can make a generic-worker release which provides more debugging information...
Let's do that.
Comment 11 • 6 years ago (Reporter)
I am taking a look at ms-318 to verify whether it is talking to the queue. I am assuming that ec2-50-17-218-87.compute-1.amazonaws.com:https is the queue (a quick DNS cross-check is sketched after the netstat output below).
C:\Users\task_1538190400>netstat -f
Active Connections
Proto Local Address Foreign Address State
TCP 10.51.40.4:5900 rejh1.srv.releng.mdc2.mozilla.com:40876 ESTABLISHED
TCP 10.51.40.4:5900 rejh1.srv.releng.mdc2.mozilla.com:40880 ESTABLISHED
TCP 10.51.40.4:49672 log-aggregator1.srv.releng.mdc2.mozilla.com:cmd ESTABLISHED
TCP 10.51.40.4:49677 52.165.175.144:https ESTABLISHED
TCP 10.51.40.4:49763 ec2-50-17-218-87.compute-1.amazonaws.com:https ESTABLISHED
TCP 10.51.40.4:49776 a-0001.a-msedge.net:https ESTABLISHED
TCP 10.51.40.4:49777 13.107.9.158:https ESTABLISHED
TCP 10.51.40.4:49779 204.79.197.222:https ESTABLISHED
TCP 10.51.40.4:49780 13.107.3.254:https ESTABLISHED
TCP 10.51.40.4:49781 13.107.56.254:https ESTABLISHED
TCP 10.51.40.4:49782 204.79.197.254:https ESTABLISHED
TCP 10.51.40.4:49783 40.114.105.100:https TIME_WAIT
TCP 127.0.0.1:49670 T-W1064-MS-318.mdc2.mozilla.com:49671 ESTABLISHED
TCP 127.0.0.1:49671 T-W1064-MS-318.mdc2.mozilla.com:49670 ESTABLISHED
For comparison, here is ms-581, which has behaved as expected:
C:\Users\task_1539641194>netstat -f
Active Connections
Proto Local Address Foreign Address State
TCP 10.51.40.182:5900 rejh1.srv.releng.mdc2.mozilla.com:38046 ESTABLISHED
TCP 10.51.40.182:49672 log-aggregator1.srv.releng.mdc2.mozilla.com:cmd ESTABLISHED
TCP 10.51.40.182:49677 52.173.24.17:https ESTABLISHED
TCP 10.51.40.182:49761 ec2-50-16-192-183.compute-1.amazonaws.com:https ESTABLISHED
TCP 10.51.40.182:49772 65.55.44.109:https TIME_WAIT
TCP 10.51.40.182:49773 65.55.44.109:https TIME_WAIT
TCP 10.51.40.182:49775 a-0001.a-msedge.net:https ESTABLISHED
TCP 10.51.40.182:49776 bingforbusiness.com:https ESTABLISHED
TCP 10.51.40.182:49777 204.79.197.222:https ESTABLISHED
TCP 10.51.40.182:49778 13.107.51.254:https ESTABLISHED
TCP 10.51.40.182:49779 13.107.136.254:https ESTABLISHED
TCP 10.51.40.182:49780 13.107.128.254:https ESTABLISHED
TCP 127.0.0.1:49670 T-W1064-MS-581.mdc2.mozilla.com:49671 ESTABLISHED
TCP 127.0.0.1:49671 T-W1064-MS-581.mdc2.mozilla.com:49670 ESTABLISHED
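To double-check the assumption above that the established EC2 connection really is the queue, a quick cross-check is to resolve queue.taskcluster.net and compare it against the foreign address from netstat. A sketch, with the caveat that the queue's EC2-hosted IPs can rotate, so a miss is not conclusive:

package main

import (
    "fmt"
    "net"
)

func main() {
    // Foreign address seen on ms-318 (ec2-50-17-218-87.compute-1.amazonaws.com).
    suspect := "50.17.218.87"

    addrs, err := net.LookupHost("queue.taskcluster.net")
    if err != nil {
        fmt.Println("lookup failed:", err)
        return
    }
    for _, a := range addrs {
        if a == suspect {
            fmt.Println("match: that established connection is (currently) to the queue")
            return
        }
    }
    fmt.Println("no match; either the queue IPs have rotated or the connection is to something else:", addrs)
}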
Comment 12•6 years ago
All of the MS workers below are missing from Taskcluster with error ResourceNotFound:
$RANDOMSTRING does not correspond to a task that exists. Are you sure this task has already been submitted?
T-W1064-MS-318 | https://bugzilla.mozilla.org/show_bug.cgi?id=1495255
T-W1064-MS-320 | https://bugzilla.mozilla.org/show_bug.cgi?id=1495257
T-W1064-MS-321 | https://bugzilla.mozilla.org/show_bug.cgi?id=1482758
T-W1064-MS-322 | https://bugzilla.mozilla.org/show_bug.cgi?id=1480216
T-W1064-MS-323 | https://bugzilla.mozilla.org/show_bug.cgi?id=1495263
T-W1064-MS-324 | https://bugzilla.mozilla.org/show_bug.cgi?id=1480778
Comment 14•6 years ago
Reimaged 318 and it is now running tasks. Will do the rest from comment #12.
Comment 15•6 years ago
All machines from this bug were found in quarantine. I have reimaged them and removed them from quarantine, and they are all running tasks.
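Since quarantine turned out to be the cause, one way this could have been spotted earlier is by querying the queue's getWorker endpoint (the same data behind the worker explorer UI). A sketch, assuming the response carries a quarantineUntil timestamp for quarantined workers; a quarantined worker keeps polling but is never offered tasks, which matches the logs above:

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    url := "https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw/workers/mdc1/T-W1064-MS-291"

    resp, err := http.Get(url)
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()

    var worker struct {
        WorkerID        string `json:"workerId"`
        QuarantineUntil string `json:"quarantineUntil"` // assumed field; empty/absent for healthy workers
    }
    if err := json.NewDecoder(resp.Body).Decode(&worker); err != nil {
        fmt.Println("decode failed:", err)
        return
    }
    if worker.QuarantineUntil != "" {
        fmt.Printf("%s is quarantined until %s and will not be offered tasks\n", worker.WorkerID, worker.QuarantineUntil)
    } else {
        fmt.Printf("%s is not quarantined\n", worker.WorkerID)
    }
}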
Updated•6 years ago
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Updated • 6 years ago (Assignee)
Component: Queue → Services