Bug 1573864 (Closed) - Opened 5 years ago, Closed 4 years ago

Only 1/3 of Windows 10 hardware machines (gecko-t-win10-64-hw) run jobs due to long puppet runs

Categories: Infrastructure & Operations :: RelOps: Hardware (defect)
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: aryx
Assignee: markco

https://earthangel-b40313e5.influxcloud.net/d/slXwf4emz/workers?orgId=1&var-workerType=gecko-t-win10-64-hw shows that only a third of the Windows hardware workers (gecko-t-win10-64-hw) are taking jobs.

<grenade> looks to me like the puppet run is taking excessively long (like more than 20 minutes) on each reboot. i followed links from here: https://papertrailapp.com/groups/13938481/events?q=program%3Auser32

Assignee: nobody → mcornmesser

This looks like something is happening with the generic-worker config file. For example:

Aug 14 08:44:33 T-W1064-MS-167.mdc1.mozilla.com generic-worker: Detected windows platform
Aug 14 08:44:33 T-W1064-MS-167.mdc1.mozilla.com generic-worker: goroutine 1 [running]:
runtime/debug.Stack(0xc042473930, 0x4acc89, 0xc042015630)
	/home/travis/.gimme/versions/go1.10.8.src/src/runtime/debug/stack.go:24 +0xae
main.HandleCrash(0x91fc80, 0xc0423b3a70)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:393 +0x2d
main.RunWorker.func1(0xc042473da0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:412 +0x59
panic(0x91fc80, 0xc0423b3a70)
	/home/travis/.gimme/versions/go1.10.8.src/src/runtime/panic.go:502 +0x237
main.ReadTasksResolvedFile(0x8f9360)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:377 +0xdb
main.RunWorker(0x0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:435 +0x27e
main.main()
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:186 +0xa26
Aug 14 08:44:33 T-W1064-MS-167.mdc1.mozilla.com generic-worker: *********** PANIC occurred! ***********
Aug 14 08:44:33 T-W1064-MS-167.mdc1.mozilla.com generic-worker: strconv.Atoi: parsing "\x00\x00\x00": invalid syntax
Aug 14 08:44:35 T-W1064-MS-167.mdc1.mozilla.com generic-worker-wrapper: "Generic-worker panic! Issue with enviroment or worker bug. (exit code 69)"
Aug 14 08:44:35 T-W1064-MS-167.mdc1.mozilla.com generic-worker-wrapper: "For more information on the exit codes run C:\generic-worker\generic-worker.exe --help or go to https://github.com/taskcluster/generic-worker"
Aug 14 08:44:37 T-W1064-MS-167.mdc1.mozilla.com generic-worker: Exiting worker with exit code 69
Aug 14 08:44:43 T-W1064-MS-167.mdc1.mozilla.com generic-worker-wrapper: Generic worker exit with code 69 more than once; Rebooting to recover
Aug 14 08:44:43 T-W1064-MS-167.mdc1.mozilla.com generic-worker: The system cannot find the file specified.
Aug 14 08:44:45 T-W1064-MS-167.mdc1.mozilla.com User32: The process C:\windows\system32\shutdown.exe (T-W1064-MS-167) has initiated the restart of computer T-W1064-MS-167 on behalf of user NT AUTHORITY\SYSTEM for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: restart Comment: Generic worker has not recovered; Rebooting
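
For context: the traceback points at main.ReadTasksResolvedFile, and the panic message shows strconv.Atoi choking on NUL bytes. A minimal Go sketch of the failure mode (an illustration, not the generic-worker source; it assumes the file holds a plain integer counter) that reproduces the exact error:

package main

import (
	"fmt"
	"io/ioutil"
	"strconv"
	"strings"
)

func main() {
	// Simulate tasks-resolved-count.txt after an unclean shutdown left
	// the file allocated but its contents zeroed out.
	_ = ioutil.WriteFile("tasks-resolved-count.txt", []byte{0, 0, 0}, 0644)

	data, _ := ioutil.ReadFile("tasks-resolved-count.txt")
	// NUL is not whitespace, so TrimSpace leaves the bytes in place...
	count, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		// ...and Atoi fails with exactly the message from the log:
		// strconv.Atoi: parsing "\x00\x00\x00": invalid syntax
		fmt.Println(err)
		return
	}
	fmt.Println("tasks resolved:", count)
}

That would explain why a zeroed counter file crashes the worker on every restart until the file is removed.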

I am taking a look into this today. I may end up reimaging some of the nodes.

I am seeing continuous generic-worker exits with code 69 on ms-167, but the config file looks OK.

Miles: do you have ideas?

{
  "accessToken": "----------",
  "certificate": "",
  "livelogSecret": "",
  "authBaseURL": "",
  "availabilityZone": "",
  "cachesDir": "C:\\cache",
  "checkForNewDeploymentEverySecs": 0,
  "cleanUpTaskDirs": true,
  "clientId": "project/releng/generic-worker/datacenter-gecko-t-win10-64-hw",
  "deploymentId": "",
  "disableReboots": true,
  "downloadsDir": "C:\\downloads",
  "ed25519SigningKeyLocation": "C:\\generic-worker\\ed25519-private.key",
  "idleTimeoutSecs": 7200,
  "instanceId": "",
  "instanceType": "",
  "livelogExecutable": "C:\\generic-worker\\livelog.exe",
  "livelogPUTPort": 60022,
  "numberOfTasksToRun": 0,
  "privateIP": "",
  "provisionerBaseURL": "",
  "provisionerId": "releng-hardware",
  "publicIP": "10.49.40.108",
  "purgeCacheBaseURL": "",
  "queueBaseURL": "",
  "region": "",
  "requiredDiskSpaceMegabytes": 10240,
  "rootURL": "https://taskcluster.net",
  "runAfterUserCreation": "C:\\generic-worker\\task-user-init.cmd",
  "runTasksAsCurrentUser": false,
  "secretsBaseURL": "",
  "sentryProject": "generic-worker",
  "shutdownMachineOnIdle": false,
  "shutdownMachineOnInternalError": false,
  "taskclusterProxyExecutable": "C:\\generic-worker\\taskcluster-proxy.exe",
  "taskclusterProxyPort": 80,
  "tasksDir": "C:\\Users",
  "wstAudience": "taskcluster-net",
  "wstServerURL": "https://websocktunnel.tasks.build",
  "workerGroup": "mdc1",
  "workerId": "T-W1064-MS-167",
  "workerType": "gecko-t-win10-64-hw",
  "workerTypeMetadata": {
    "config": {
      "deploymentId": "",
      "runTasksAsCurrentUser": false
    },
    "generic-worker": {
      "go-arch": "amd64",
      "go-os": "windows",
      "release": "https://github.com/taskcluster/generic-worker/releases/download/v14.1.0/generic-worker-nativeEngine-windows-amd64.exe",
      "version": "14.1.0"
    }
  }
}
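
Side note: since the config is JSON, the Windows paths above must use escaped backslashes ("C:\\cache" etc.) or the file will not parse at all. A quick standalone validity check (a generic sketch, not a generic-worker tool; the config path is an assumption for illustration):

package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"os"
)

func main() {
	// Default path is an assumption; pass the real config path as an argument.
	path := `C:\generic-worker\generic-worker.config`
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	data, err := ioutil.ReadFile(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var cfg map[string]interface{}
	if err := json.Unmarshal(data, &cfg); err != nil {
		fmt.Fprintln(os.Stderr, "config is not valid JSON:", err)
		os.Exit(1)
	}
	fmt.Printf("config parses OK (%d top-level keys)\n", len(cfg))
}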

Flags: needinfo?(miles)

Hi Mark,

It looks like the source of the error is related to the resolved-tasks file, which should be in the generic-worker directory under the name "tasks-resolved-count.txt". From the traceback it looks as though the content of that file is malformed; could you paste it here? I would try deleting that file and re-running generic-worker.

miles

Flags: needinfo?(miles)

The tasks-resolved-count.txt file was blank. I deleted it and rebooted the node. The result:

Aug 14 10:13:23 T-W1064-MS-167.mdc1.mozilla.com generic-worker: goroutine 1 [running]:
runtime/debug.Stack(0x0, 0xc042470848, 0x0)
	/home/travis/.gimme/versions/go1.10.8.src/src/runtime/debug/stack.go:24 +0xae
main.HandleCrash(0x91b120, 0xc0421e3de0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:393 +0x2d
main.RunWorker.func1(0xc042469da0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:412 +0x59
panic(0x91b120, 0xc0421e3de0)
	/home/travis/.gimme/versions/go1.10.8.src/src/runtime/panic.go:502 +0x237
main.(*CacheMap).LoadFromFile(0xd565e8, 0x9d31f0, 0x15, 0xc0421e6da0, 0xc)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:146 +0x4a5
main.(*MountsFeature).Initialise(0xd74bc8, 0x1f, 0xc042469a78)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:155 +0xaa
main.initialiseFeatures(0xc0420c8000, 0xc0420745c0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:110 +0x49b
main.RunWorker(0x0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:448 +0x30b
main.main()
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:186 +0xa26
Aug 14 10:13:23 T-W1064-MS-167.mdc1.mozilla.com generic-worker: *********** PANIC occurred! ***********
Aug 14 10:13:23 T-W1064-MS-167.mdc1.mozilla.com generic-worker: invalid character '\x00' looking for beginning of value
Aug 14 10:13:23 T-W1064-MS-167.mdc1.mozilla.com generic-worker: Exiting worker with exit code 69
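
The main.(*CacheMap).LoadFromFile frame suggests the same corruption pattern in one of the JSON cache files: json.Unmarshal on zeroed-out bytes produces exactly this message. A tiny Go sketch (illustrative only, assuming the cache files hold JSON):

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Contents of e.g. file-caches.json after its bytes were zeroed.
	zeroed := []byte{0, 0, 0, 0}
	var caches map[string]interface{}
	if err := json.Unmarshal(zeroed, &caches); err != nil {
		// Prints: invalid character '\x00' looking for beginning of value
		fmt.Println(err)
	}
}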

Let's also try removing "file-caches.json" and "directory-caches.json".

I removed that file; the result:

Aug 14 10:35:12 T-W1064-MS-167.mdc1.mozilla.com generic-worker: Creating directory C:\cache with permissions 700
Aug 14 10:35:12 T-W1064-MS-167.mdc1.mozilla.com generic-worker: Could not load file directory-caches.json into object *main.CacheMap - is it json?
Aug 14 10:35:12 T-W1064-MS-167.mdc1.mozilla.com generic-worker: goroutine 1 [running]:
runtime/debug.Stack(0x0, 0xc0424b67f8, 0x0)
	/home/travis/.gimme/versions/go1.10.8.src/src/runtime/debug/stack.go:24 +0xae
main.HandleCrash(0x91b120, 0xc04206e4c0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:393 +0x2d
main.RunWorker.func1(0xc042471da0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:412 +0x59
panic(0x91b120, 0xc04206e4c0)
	/home/travis/.gimme/versions/go1.10.8.src/src/runtime/panic.go:502 +0x237
main.(*CacheMap).LoadFromFile(0xd565e8, 0x9d31f0, 0x15, 0xc042078480, 0xc)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:146 +0x4a5
main.(*MountsFeature).Initialise(0xd74bc8, 0x1f, 0xc042471a78)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/mounts.go:155 +0xaa
main.initialiseFeatures(0xc0420c8000, 0xc04237a480)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:110 +0x49b
main.RunWorker(0x0)
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:448 +0x30b
main.main()
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:186 +0xa26
Aug 14 10:35:12 T-W1064-MS-167.mdc1.mozilla.com generic-worker: *********** PANIC occurred! ***********

Also, I am going to go through and reimage the majority of the pool. These issues seem to be associated with the lifespan of the static instances. I am going to leave chassis 1 as is, so that further troubleshooting may be performed.

OK. Do the files "file-caches.json" and "directory-caches.json" exist?

The error "invalid character '\x00' looking for beginning of value" makes it sound as though the files exist but the content is invalid JSON, NULL chars, or similar.

The error "Could not load file directory-caches.json into object *main.CacheMap - is it json?" similarly indicates that the files exist but have malformed content.

Do you have a wrapper script that might be creating these files?

I think those are created after the wrapper script launches generic-worker.

For reference, the wrapper script can be found here: https://github.com/markcor/ronin_puppet/blob/master/modules/win_generic_worker/templates/run-hw-generic-worker-and-reboot.bat.epp
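
One possible hardening for that wrapper (a sketch only, assuming zeroed state files are the root cause; this is not part of the actual batch script or of generic-worker) is a pre-start step that deletes any state file consisting entirely of NUL bytes, so the worker regenerates it instead of panicking with exit code 69:

package main

import (
	"bytes"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
)

func main() {
	dir := `C:\generic-worker` // install dir, per the config above
	stateFiles := []string{
		"tasks-resolved-count.txt",
		"file-caches.json",
		"directory-caches.json",
	}
	for _, name := range stateFiles {
		path := filepath.Join(dir, name)
		data, err := ioutil.ReadFile(path)
		if err != nil {
			continue // a missing file is fine; the worker recreates it
		}
		// Empty or all-NUL contents indicate corruption from an unclean shutdown.
		if len(data) == 0 || bytes.Count(data, []byte{0}) == len(data) {
			log.Printf("removing corrupt state file %s", path)
			os.Remove(path)
		}
	}
}

An equivalent check could be done in the batch wrapper itself; the Go version is shown only to keep the examples in one language.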

Depends on: 1574026

I have redeployed about 250 nodes and will redeploy an additional 90 tomorrow. Doing some spot checking, things generally look good. I will do some additional checking in the morning. Hopefully this will prevent large backlogs in the near future.

There are two distinct issues.

Several nodes had generic-worker exiting with code 69, possibly, as mentioned above, an issue with "file-caches.json" and "directory-caches.json". I will continue to make notes on that issue in this bug.

The other issue is the 20 minutes between tasks on long-lived nodes. I will start tracking that in Bug 1574026.

I deployed an additional 30 nodes this afternoon. Newly deployed nodes seem to be around 3 minutes between tasks. I will continue to deploy the other 60 in the upcoming days.

As for the generic-worker exit 69: yesterday, when I clicked on the Papertrail link in the description, it showed several nodes logging that exit. Today I am not seeing that exit code.

Depends on: 1590791

This issue was resolved with the automated restore function in Bug 1590791.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED