Run generic-worker.exe as a start up item rather than a Windows service

RESOLVED FIXED

People

(Reporter: pmoore, Unassigned)

For an as-yet-unknown reason, running the generic worker as a Windows service occasionally (not for all tasks) seems to introduce a performance penalty.

When running the worker from an interactive shell, this problem does not appear.

We do not yet know why, but as an interim (or even permanent) solution, we should run the worker as a startup item rather than as a Windows service.

This task command took 15m10s when the worker ran as a Windows service:
https://public-artifacts.taskcluster.net/NA3oCERsS0yto7KCsXN3ag/0/public/logs/command_000013.log

The same task command took 1m15s when the worker ran from an interactive shell:
https://public-artifacts.taskcluster.net/ermG_PBKT62Dc9qg9gCZ6g/0/public/logs/command_000013.log

This is the same task command, running against a machine based on the same AWS EC2 AMI. Running as a Windows service was > 12 times slower.

This is the code that installs the generic worker as a Windows service:

https://github.com/taskcluster/generic-worker/blob/e60bca12ad095e275e29b63fe3fb9a384f8fbd44/plat_windows.go#L379-L415

This is therefore the code that needs to be rewritten to install as a Start Up item. Please also note, nssm will then no longer be a dependency on the worker types, so references to nssm in worker type userdata files under https://github.com/taskcluster/generic-worker/tree/e60bca12ad095e275e29b63fe3fb9a384f8fbd44/worker_types can also be removed.
Note: this mach build from last week did not show any slowdown. So it is not yet clear why a desktop build run from mozharness (which calls mach) is slow, while calling mach directly in a task is fast, when in both cases the generic worker runs as a Windows service.

https://tools.taskcluster.net/task-inspector/#Cia4h2g4QUmU89LK-IYLjg/0
I'm thinking the easiest solution is probably something like creating this file:

C:\generic-worker\generic-worker.bat

And the contents of the file would be:

C:\generic-worker\generic-worker.exe run > C:\generic-worker\generic-worker.log 2>&1

Then we could add a shortcut to C:\generic-worker\generic-worker.bat in the GenericWorker user account's startup folder. Then we'd need to get the machine to auto-login to the GenericWorker user account on startup...

:grenade, how do we get the machine to auto-login with the cltbld user account in buildbot, and could we use the same technique for doing this with the GenericWorker account on the TC machines?
Flags: needinfo?(rthijssen)
(In reply to Pete Moore [:pmoore][:pete] from comment #0)
> This is the code that installs the generic worker as a Windows service:
> 
> https://github.com/taskcluster/generic-worker/blob/
> e60bca12ad095e275e29b63fe3fb9a384f8fbd44/plat_windows.go#L379-L415
> 
> This is therefore the code that needs to be rewritten to install as a Start
> Up item.

Note, it is not mandatory that the installation is done by the go code itself. In the interests of rolling out quickly, it can be installed as part of the AMI setup steps, and the existing go code can be left as it is. After all, this is just a convenience function.
Looking at those two logs, the first discrepancy I saw was this simple echo command:
From the first log, as a service: 
```
09:01:38     INFO -  (  echo 'export MOZ_AUTOMATION_BUILD_SYMBOLS=1'; <...> ) > C:/Users/Task_1461744498/build/src/obj-firefox/.mozconfig.mk
09:03:44     INFO -  C:/Users/Task_1461744498/build/src/mozmake.EXE -f C:/Users/Task_1461744498/build/src/client.mk realbuild CREATE_MOZCONFIG_JSON=
```

From the second log, from a shell:
```
10:20:58     INFO -  (  echo 'export MOZ_AUTOMATION_BUILD_SYMBOLS=1'; <...> ) > C:/Users/Task_1461748270/build/src/obj-firefox/.mozconfig.mk
10:21:02     INFO -  C:/Users/Task_1461748270/build/src/mozmake.EXE -f C:/Users/Task_1461748270/build/src/client.mk realbuild CREATE_MOZCONFIG_JSON=
```

It looks like that echo command took 2 minutes(!) in the service case, and only 4 seconds in the shell case.
Note: generic worker requires admin privileges to create task users, so running from the Startup folder is not an option. Because of the integrity level of a startup item, it won't run with administrator privileges, even if the user is in the Administrators group:
https://msdn.microsoft.com/en-us/library/bb625963.aspx

However, there is a solution, which is to use the task scheduler, e.g. triggered by an auto-login:
http://stackoverflow.com/questions/5427673/how-to-run-a-program-automatically-as-admin-on-windows-startup
I saw mention of UAC disabling in the thread from comment 6.  Do we have UAC disabled on this AMI?
No need Joel, we can run with elevated privileges.
Created attachment 8746618 [details]
Run Generic Worker.xml

Something like this will do the trick.

I created this, it works, and I exported it. We can tweak the contents not to include explicit machine name, etc.

It can be installed with something like:

schtasks /create /tn "start generic worker on login" /xml "Run Generic Worker.xml"
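If the install step ends up living in the Go code, the same call could be shelled out to roughly like this. This is only a sketch: `schtasksCreateArgs` is a hypothetical helper, not something that exists in generic-worker, and it uses only the `/create`, `/tn` and `/xml` flags from the command above.

```go
package main

import (
	"fmt"
	"os/exec"
)

// schtasksCreateArgs builds the arguments for registering a scheduled task
// from an exported XML definition. Hypothetical helper for illustration.
func schtasksCreateArgs(taskName, xmlPath string) []string {
	return []string{"/create", "/tn", taskName, "/xml", xmlPath}
}

func main() {
	args := schtasksCreateArgs("start generic worker on login", "Run Generic Worker.xml")
	cmd := exec.Command("schtasks", args...)
	// Print the command instead of running it, so the sketch is harmless
	// on machines without schtasks. On Windows, cmd.Run() would execute it.
	fmt.Println(cmd.Args)
}
```

Passing the task name and XML path as separate arguments avoids any quoting issues with the spaces in both.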
I didn't think we were using powershell to run the build. Just to confirm: this is for setup only, possibly to run the worker, but not the build?
We're not using powershell to run the build - we're using it to set up the machine, and to configure it to login-on-boot. :)
Note, I ran the powershell code to set up auto-login-on-boot, and I ran the code to install the scheduled task to auto-run-generic-worker-on-login, and rebooted the machine, and .... *tada* it worked.

So now this just needs to be added to userdata so this all gets set up during AMI creation, and we should be good to go.

I'll work on this in the morning. I'll do it in the generic-worker install target after all, as it relies on generating random passwords, creating an account, and embedding xml, and this will be easier for me to do in go than in powershell. It also makes it trivial to install the worker from the binary without needing to write a lot of supporting powershell (i.e. as a self-contained executable, like it is now for installing as a service).
The world hates Windows, it seems...

Administrator@WIN-I5S5J090NG0:/cygdrive/c/gopath/src/github.com/taskcluster/pete $ cat main.go 
package main

import (
	"fmt"
	"os"
	"path"
	"runtime"
)

func main() {
	fmt.Println("OS:          " + runtime.GOOS)
	fmt.Println("Me:          " + os.Args[0])
	fmt.Println("My dir:      " + path.Dir(os.Args[0]))

	// pretend we're unix
	unixFile := "/usr/bin/gopath/bin/pete"
	fmt.Println("UNIX file    " + unixFile)
	fmt.Println("Its dir:     " + path.Dir(unixFile))
}

Administrator@WIN-I5S5J090NG0:/cygdrive/c/gopath/src/github.com/taskcluster/pete $ pete
OS:          windows
Me:          C:\gopath\bin\pete.exe
My dir:      .
UNIX file    /usr/bin/gopath/bin/pete
Its dir:     /usr/bin/gopath/bin



The problem here is that path.Dir assumes you have a unix (forward-slash) path, even when running on Windows. So it seems you have to use filepath.Dir (from the path/filepath package) instead.... bah
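To make the difference concrete, here is a small sketch that runs both functions on the same inputs. The helper `dirs` is made up for illustration. The `path.Dir` result is the same on every OS, while the `filepath.Dir` result depends on the OS the program is built for, so only on Windows will it treat the backslashes as separators.

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
)

// dirs returns the directory of p as computed by path.Dir (forward slashes
// only, on every OS) and by filepath.Dir (separator of the target OS).
// Hypothetical helper for illustration.
func dirs(p string) (slashOnly, osAware string) {
	return path.Dir(p), filepath.Dir(p)
}

func main() {
	for _, p := range []string{`C:\gopath\bin\pete.exe`, "/usr/bin/gopath/bin/pete"} {
		slashOnly, osAware := dirs(p)
		fmt.Println("file:         ", p)
		fmt.Println("path.Dir:     ", slashOnly)
		fmt.Println("filepath.Dir: ", osAware)
	}
}
```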

Hopefully this time it will work....

https://github.com/taskcluster/generic-worker/compare/v2.0.0alpha25...v2.0.0alpha30
That seems to have done it!

My task is now running automatically from a provisioner-spawned instance, as a startup item for the GenericWorker user, which has auto-logged in. I'm able to RDP in as that user and see the cmd.exe window where the generic worker is running.

This also means that the task isolation is now working properly as we are running in user space. The worker is spawning new task users for each task with limited privileges, so they should not be able to dirty the worker. When we were running as a windows service we had to disable this feature and run all tasks as System. So we've killed two birds with one stone, so to speak.
So performance problems have gone away. We seem to have hit a different issue now - but this particular task was not the latest working version, so I did not expect it to complete successfully anyway - this is one of my AMIs rather than the target ones Rob has been working on, just for testing purposes.

The task execution however was quick, compared to the slow version before from comment 0, so this bug can be closed.

https://tools.taskcluster.net/task-inspector/#BEZui4EyQIWx1tOlPMCvuQ/

The only changes needed for the AMI creation are:

1) No need to install nssm (we also don't need pstools, ftr)
2) Run `generic-worker.exe install startup` instead of the previous `generic-worker.exe install` that we had before.

The old `install` target has now been replaced with `install service` (the legacy installation method for installing as a windows service) and `install startup` (the new installation mechanism to install as a startup item on login, and to auto-login).
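For illustration only, the split into two targets could be dispatched along these lines. This is not the actual generic-worker argument handling: `installTarget` is a hypothetical helper, and treating a bare `install` as an error is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// installTarget validates "install service" / "install startup" style
// arguments and returns the chosen target. Hypothetical helper; the real
// generic-worker command-line parsing may differ.
func installTarget(args []string) (string, error) {
	if len(args) != 2 || args[0] != "install" {
		return "", errors.New("usage: generic-worker.exe install (service|startup)")
	}
	switch args[1] {
	case "service", "startup":
		return args[1], nil
	default:
		return "", fmt.Errorf("unknown install target %q", args[1])
	}
}

func main() {
	examples := [][]string{{"install", "startup"}, {"install", "service"}, {"install"}}
	for _, args := range examples {
		target, err := installTarget(args)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("would install as:", target)
	}
}
```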

Resolved in v2.0.0alpha30: https://github.com/taskcluster/generic-worker/releases/tag/v2.0.0alpha30
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Flags: needinfo?(rthijssen)
Resolution: --- → FIXED
(In reply to Pete Moore [:pmoore][:pete] from comment #18)
> We seem to have hit a different issue now

OK - I see the (new) bug!

Basically, the process is inheriting properties from the generic worker (e.g. USERPROFILE) but is actually running as a different user, so shouldn't inherit this stuff. Instead we should probably call e.g.

https://msdn.microsoft.com/en-us/library/windows/desktop/bb762281%28v=vs.85%29.aspx

I'll have to fix this on Monday....
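To show the inheritance problem itself (not the Windows API fix linked above), here is a minimal Go sketch that builds a task environment in which the worker's values are replaced rather than inherited. `withOverrides` is a hypothetical helper, and the GenericWorker profile path is an assumption; the task user name is taken from the earlier log.

```go
package main

import (
	"fmt"
	"strings"
)

// withOverrides returns a copy of env (KEY=VALUE entries) with every variable
// named in overrides replaced. Names are compared case-insensitively, as
// Windows does. Hypothetical helper for illustration.
func withOverrides(env []string, overrides map[string]string) []string {
	out := make([]string, 0, len(env)+len(overrides))
	for _, kv := range env {
		name := kv
		if i := strings.Index(kv, "="); i >= 0 {
			name = kv[:i]
		}
		overridden := false
		for k := range overrides {
			if strings.EqualFold(k, name) {
				overridden = true
				break
			}
		}
		if !overridden {
			out = append(out, kv)
		}
	}
	for k, v := range overrides {
		out = append(out, k+"="+v)
	}
	return out
}

func main() {
	// Assumed worker environment; the profile path is a guess.
	workerEnv := []string{`USERPROFILE=C:\Users\GenericWorker`, `PATH=C:\Windows\System32`}
	taskEnv := withOverrides(workerEnv, map[string]string{
		"USERPROFILE": `C:\Users\Task_1461744498`,
	})
	for _, kv := range taskEnv {
		fmt.Println(kv)
	}
}
```

The resulting slice could be assigned to the `Env` field of an `exec.Cmd`, though a proper fix would more likely ask Windows for the task user's own environment.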
Component: Generic-Worker → Workers
Product: Taskcluster → Taskcluster