Only 1/3 of Windows 10 hardware machines (gecko-t-win10-64-hw) run jobs due to long puppet runs
Categories: Infrastructure & Operations :: RelOps: Hardware (defect)
Tracking: Not tracked
People: Reporter: aryx; Assigned: markco

Description
https://earthangel-b40313e5.influxcloud.net/d/slXwf4emz/workers?orgId=1&var-workerType=gecko-t-win10-64-hw shows that only a third of the Windows hardware workers (gecko-t-win10-64-hw) are taking jobs.
<grenade> looks to me like the puppet run is taking excessively long (like more than 20 minutes) on each reboot. i followed links from here: https://papertrailapp.com/groups/13938481/events?q=program%3Auser32
Comment 1•5 years ago (Assignee)
This looks like something is happening with the generic-worker config file. For example:
Aug 14 08:44:33 T-W1064-MS-167.mdc1.mozilla.com generic-worker: Detected windows platform
Aug 14 08:44:33 T-W1064-MS-167.mdc1.mozilla.com generic-worker:
goroutine 1 [running]:
runtime/debug.Stack(0xc042473930, 0x4acc89, 0xc042015630)
	/home/travis/.gimme/versions/go1.10.8.src/src/runtime/debug/stack.go:24 +0xae
main.HandleCrash(0x91fc80, 0xc0423b3a70)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:393 +0x2d
main.RunWorker.func1(0xc042473da0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:412 +0x59
panic(0x91fc80, 0xc0423b3a70)
	/home/travis/.gimme/versions/go1.10.8.src/src/runtime/panic.go:502 +0x237
main.ReadTasksResolvedFile(0x8f9360)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:377 +0xdb
main.RunWorker(0x0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:435 +0x27e
main.main()
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:186 +0xa26
Aug 14 08:44:33 T-W1064-MS-167.mdc1.mozilla.com generic-worker: *********** PANIC occurred! ***********
Aug 14 08:44:33 T-W1064-MS-167.mdc1.mozilla.com generic-worker: strconv.Atoi: parsing "\x00\x00\x00": invalid syntax
Aug 14 08:44:35 T-W1064-MS-167.mdc1.mozilla.com generic-worker-wrapper: "Generic-worker panic! Issue with enviroment or worker bug. (exit code 69)"
Aug 14 08:44:35 T-W1064-MS-167.mdc1.mozilla.com generic-worker-wrapper: "For more information on the exit codes run C:\generic-worker\generic-worker.exe --help or go to https://github.com/taskcluster/generic-worker"
Aug 14 08:44:37 T-W1064-MS-167.mdc1.mozilla.com generic-worker: Exiting worker with exit code 69
Aug 14 08:44:43 T-W1064-MS-167.mdc1.mozilla.com generic-worker-wrapper: Generic worker exit with code 69 more than once; Rebooting to recover
Aug 14 08:44:43 T-W1064-MS-167.mdc1.mozilla.com generic-worker: The system cannot find the file specified.
Aug 14 08:44:45 T-W1064-MS-167.mdc1.mozilla.com User32: The process C:\windows\system32\shutdown.exe (T-W1064-MS-167) has initiated the restart of computer T-W1064-MS-167 on behalf of user NT AUTHORITY\SYSTEM for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: restart Comment: Generic worker has not recovered; Rebooting
I am taking a look into this today. I may end up reimaging some of the nodes.
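(The trace points at main.ReadTasksResolvedFile: the worker keeps a resolved-task counter in a file and parses it with strconv.Atoi, which rejects the NUL bytes it found there. A minimal sketch of that failure mode, assuming a zero-filled counter file; this is not the actual generic-worker source:)

package main

import (
	"fmt"
	"strconv"
)

func main() {
	// Simulate a counter file whose bytes were zeroed out on disk,
	// e.g. by an unclean shutdown or power loss (an assumption here).
	data := []byte("\x00\x00\x00")

	// NUL bytes are not a valid decimal integer, so Atoi returns the
	// exact error seen in the log, which the worker turns into a panic.
	if _, err := strconv.Atoi(string(data)); err != nil {
		fmt.Println(err) // strconv.Atoi: parsing "\x00\x00\x00": invalid syntax
	}
}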
Comment 2•5 years ago (Assignee)
I am seeing continuous generic-worker exits with code 69 on ms-167, but the config file looks OK.
Miles: do you have ideas?
{
    "accessToken": "----------",
    "certificate": "",
    "livelogSecret": "",
    "authBaseURL": "",
    "availabilityZone": "",
    "cachesDir": "C:\\cache",
    "checkForNewDeploymentEverySecs": 0,
    "cleanUpTaskDirs": true,
    "clientId": "project/releng/generic-worker/datacenter-gecko-t-win10-64-hw",
    "deploymentId": "",
    "disableReboots": true,
    "downloadsDir": "C:\\downloads",
    "ed25519SigningKeyLocation": "C:\\generic-worker\\ed25519-private.key",
    "idleTimeoutSecs": 7200,
    "instanceId": "",
    "instanceType": "",
    "livelogExecutable": "C:\\generic-worker\\livelog.exe",
    "livelogPUTPort": 60022,
    "numberOfTasksToRun": 0,
    "privateIP": "",
    "provisionerBaseURL": "",
    "provisionerId": "releng-hardware",
    "publicIP": "10.49.40.108",
    "purgeCacheBaseURL": "",
    "queueBaseURL": "",
    "region": "",
    "requiredDiskSpaceMegabytes": 10240,
    "rootURL": "https://taskcluster.net",
    "runAfterUserCreation": "C:\\generic-worker\\task-user-init.cmd",
    "runTasksAsCurrentUser": false,
    "secretsBaseURL": "",
    "sentryProject": "generic-worker",
    "shutdownMachineOnIdle": false,
    "shutdownMachineOnInternalError": false,
    "taskclusterProxyExecutable": "C:\\generic-worker\\taskcluster-proxy.exe",
    "taskclusterProxyPort": 80,
    "tasksDir": "C:\\Users",
    "wstAudience": "taskcluster-net",
    "wstServerURL": "https://websocktunnel.tasks.build",
    "workerGroup": "mdc1",
    "workerId": "T-W1064-MS-167",
    "workerType": "gecko-t-win10-64-hw",
    "workerTypeMetadata": {
        "config": {
            "deploymentId": "",
            "runTasksAsCurrentUser": false
        },
        "generic-worker": {
            "go-arch": "amd64",
            "go-os": "windows",
            "release": "https://github.com/taskcluster/generic-worker/releases/download/v14.1.0/generic-worker-nativeEngine-windows-amd64.exe",
            "version": "14.1.0"
        }
    }
}
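(One note on the config paste: in the on-disk JSON the Windows paths have to be double-escaped, e.g. "C:\\cache"; a single backslash is an invalid JSON escape and encoding/json refuses to parse it. A minimal sketch, hypothetical and not generic-worker code, makes that easy to verify:)

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	type config struct {
		CachesDir string `json:"cachesDir"`
	}
	var c config

	// Invalid: \c is not a legal JSON escape sequence.
	err := json.Unmarshal([]byte(`{"cachesDir": "C:\cache"}`), &c)
	fmt.Println(err) // invalid character 'c' in string escape code

	// Valid: the backslash itself must be escaped.
	err = json.Unmarshal([]byte(`{"cachesDir": "C:\\cache"}`), &c)
	fmt.Println(err, c.CachesDir) // <nil> C:\cache
}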
Comment 3•5 years ago
Hi Mark,
It looks like the source of the error is the resolved tasks file, which should be in the generic-worker directory under the name "tasks-resolved-count.txt". From the traceback it looks as though the content of that file is malformed; could you paste it here? I would try deleting that file and re-running generic-worker.
miles
Comment 4•5 years ago (Assignee)
The tasks-resolved file was blank. I deleted it and rebooted the node. The result:
Aug 14 10:13:23 T-W1064-MS-167.mdc1.mozilla.com generic-worker:
goroutine 1 [running]:
runtime/debug.Stack(0x0, 0xc042470848, 0x0)
	/home/travis/.gimme/versions/go1.10.8.src/src/runtime/debug/stack.go:24 +0xae
main.HandleCrash(0x91b120, 0xc0421e3de0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:393 +0x2d
main.RunWorker.func1(0xc042469da0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:412 +0x59
panic(0x91b120, 0xc0421e3de0)
	/home/travis/.gimme/versions/go1.10.8.src/src/runtime/panic.go:502 +0x237
main.(*CacheMap).LoadFromFile(0xd565e8, 0x9d31f0, 0x15, 0xc0421e6da0, 0xc)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:146 +0x4a5
main.(*MountsFeature).Initialise(0xd74bc8, 0x1f, 0xc042469a78)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:155 +0xaa
main.initialiseFeatures(0xc0420c8000, 0xc0420745c0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:110 +0x49b
main.RunWorker(0x0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:448 +0x30b
main.main()
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:186 +0xa26
Aug 14 10:13:23 T-W1064-MS-167.mdc1.mozilla.com generic-worker: *********** PANIC occurred! ***********
Aug 14 10:13:23 T-W1064-MS-167.mdc1.mozilla.com generic-worker: invalid character '\x00' looking for beginning of value
Aug 14 10:13:23 T-W1064-MS-167.mdc1.mozilla.com generic-worker: Exiting worker with exit code 69
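(This is the same corruption pattern one file further along: the trace is now in main.(*CacheMap).LoadFromFile, so encoding/json is hitting NUL bytes in one of the cache state files. A minimal reproduction of the error, not generic-worker code:)

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// A zero-filled file handed to the JSON decoder reproduces the
	// exact error from the log.
	var v interface{}
	err := json.Unmarshal([]byte{0x00, 0x00, 0x00}, &v)
	fmt.Println(err) // invalid character '\x00' looking for beginning of value
}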
Comment 5•5 years ago
Let's also try removing "file-caches.json" and "directory-caches.json".
Comment 6•5 years ago (Assignee)
I removed that file; the result:
Aug 14 10:35:12 T-W1064-MS-167.mdc1.mozilla.com generic-worker: Creating directory C:\cache with permissions 700
Aug 14 10:35:12 T-W1064-MS-167.mdc1.mozilla.com generic-worker: Could not load file directory-caches.json into object *main.CacheMap - is it json?
Aug 14 10:35:12 T-W1064-MS-167.mdc1.mozilla.com generic-worker:
goroutine 1 [running]:
runtime/debug.Stack(0x0, 0xc0424b67f8, 0x0)
	/home/travis/.gimme/versions/go1.10.8.src/src/runtime/debug/stack.go:24 +0xae
main.HandleCrash(0x91b120, 0xc04206e4c0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:393 +0x2d
main.RunWorker.func1(0xc042471da0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:412 +0x59
panic(0x91b120, 0xc04206e4c0)
	/home/travis/.gimme/versions/go1.10.8.src/src/runtime/panic.go:502 +0x237
main.(*CacheMap).LoadFromFile(0xd565e8, 0x9d31f0, 0x15, 0xc042078480, 0xc)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:146 +0x4a5
main.(*MountsFeature).Initialise(0xd74bc8, 0x1f, 0xc042471a78)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:155 +0xaa
main.initialiseFeatures(0xc0420c8000, 0xc04237a480)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:110 +0x49b
main.RunWorker(0x0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:448 +0x30b
main.main()
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:186 +0xa26
Aug 14 10:35:12 T-W1064-MS-167.mdc1.mozilla.com generic-worker: *********** PANIC occurred! ***********
Comment 7•5 years ago (Assignee)
Also, I am going to go through and reimage the majority of the pool. These issues seem to be associated with the lifespan of the static instances. I am going to leave chassis 1 as is, so that further troubleshooting may be performed.
Comment 8•5 years ago
OK. Do the files "file-caches.json" and "directory-caches.json" exist?
The error "invalid character '\x00' looking for beginning of value" makes it sound as though the files exist but the content is invalid JSON, NULL chars, or similar.
The error "Could not load file directory-caches.json into object *main.CacheMap - is it json?" similarly indicates that the files exist but have malformed content.
Do you have a wrapper script that might be creating these files?
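(A quick way to test the malformed-content hypothesis before deleting anything else would be to scan the worker's state files for NUL bytes. A hypothetical diagnostic sketch; file names are from this bug, and the install path is assumed from the config in comment 2:)

package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dir := `C:\generic-worker` // assumed install directory
	for _, name := range []string{
		"tasks-resolved-count.txt",
		"file-caches.json",
		"directory-caches.json",
	} {
		path := filepath.Join(dir, name)
		data, err := os.ReadFile(path)
		if err != nil {
			fmt.Printf("%s: %v\n", path, err)
			continue
		}
		// Any NUL byte means the file could not have been written by
		// a healthy worker run and would explain both panics above.
		if bytes.IndexByte(data, 0x00) >= 0 {
			fmt.Printf("%s: %d bytes, contains NUL bytes (corrupt)\n", path, len(data))
		} else {
			fmt.Printf("%s: %d bytes, no NUL bytes\n", path, len(data))
		}
	}
}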
Comment 9•5 years ago (Assignee)
I think those are created after the wrapper script launches generic-worker.
For reference, the wrapper script can be found here: https://github.com/markcor/ronin_puppet/blob/master/modules/win_generic_worker/templates/run-hw-generic-worker-and-reboot.bat.epp
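(The wrapper itself is a batch template; its behavior as observed in the logs above, re-running generic-worker and rebooting once exit code 69 repeats, can be sketched in Go roughly as follows. Paths and the retry logic are reconstructed from the log lines, not taken from the actual script:)

package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	const panicExitCode = 69 // generic-worker's "internal error / panic" exit code
	failures := 0
	for {
		// Config path is an assumption; the worker binary path matches the logs.
		cmd := exec.Command(`C:\generic-worker\generic-worker.exe`, "run",
			"--config", `C:\generic-worker\generic-worker.config`)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		err := cmd.Run()
		if err == nil {
			failures = 0
			continue
		}
		exitErr, ok := err.(*exec.ExitError)
		if !ok || exitErr.ExitCode() != panicExitCode {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		failures++
		if failures > 1 {
			// Mirrors: "Generic worker exit with code 69 more than once; Rebooting to recover"
			fmt.Println("Generic worker exit with code 69 more than once; Rebooting to recover")
			exec.Command(`C:\windows\system32\shutdown.exe`, "/r", "/t", "0",
				"/c", "Generic worker has not recovered; Rebooting").Run()
			return
		}
	}
}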
Comment 10•5 years ago (Assignee)
I have redeployed about 250 nodes and will redeploy an additional 90 tomorrow. From some spot checking, things generally look good. I will do some additional checking in the morning. Hopefully this will prevent large backlogs for the near future.
There are two distinct issues:
On several nodes, generic-worker exited with code 69, possibly, as mentioned above, an issue with "file-caches.json" and "directory-caches.json". I will continue to make notes on that issue in this bug.
The other issue is the 20 minutes between tasks on long-lived nodes. I will track that in Bug 1574026.
Comment 11•5 years ago (Assignee)
I deployed an additional 30 nodes this afternoon. Newly deployed nodes show around 3 minutes between tasks. I will continue to deploy the other 60 in the coming days.
As for the generic-worker exit code 69: yesterday, the Papertrail link in the description showed several nodes logging that exit. Today I am not seeing that exit code.
Comment 15•4 years ago (Assignee)
This issue was resolved with the automated restore function in Bug 1590791.