Closed Bug 1459203 Opened 7 years ago Closed 4 years ago

Remove dependency on nssm

Categories

(Taskcluster :: Workers, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: pmoore, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 obsolete file)

generic-worker should be able to install itself as a native Windows service without needing nssm. This should reduce the number of issues we have due to nssm, and also simplify the deployment process.
Attached file Github Pull Request for generic-worker (obsolete) —
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Pete: what's the status here?
Flags: needinfo?(pmoore)
This is still on the TODO list I'm afraid - I really wish to implement this at some point.
Flags: needinfo?(pmoore)
Component: Generic-Worker → Workers

The install happens here.

It calls deployService which installs the executable using nssm.

The nssm configuration settings used here are described on the nssm website.

Note, it relies on the nssm executable being on the local filesystem, and uses it to execute the nssm commands to create the windows service.

In order not to use nssm, we should use the native packages of the go standard library for creating a Windows service.

The standard library also comes with an example implementation.

Any questions, let me know!

Keywords: good-first-bug
Comment on attachment 8973229 [details] [review] Github Pull Request for generic-worker I suspect this has somewhat bit-rotted, and was probably also not implemented correctly or completely, so it probably can be mostly ignored.
Comment on attachment 8973229 [details] [review] Github Pull Request for generic-worker Setting attachment as obsolete...
Attachment #8973229 - Attachment is obsolete: true

Note, here is an example of generic-worker being installed, including downloading nssm, and then running the generic-worker install command, specifying the location of the nssm binary to use.

When we stop using nssm, we'll need to update the workers to no longer download nssm, and to call the install target without specifying the nssm executable filepath location.

I started looking into this and have a few questions:

  1. The way that logging is handled currently, it looks like we log to a file generic-worker-service.log adjacent to the bat script / exe. NSSM comes with some niceties (log rotation) but the go example uses a default eventlog. We can keep logging to a file and still use eventlog, but I don't think we get rotation for free. Do we need the custom logfile? Should we just use the system eventlog?

  2. Do we need the bat script? [0] We create a bat script that essentially wraps the generic-worker.exe, passes arguments, and redirects output. My impression is that we can just include those arguments in the service creation instead, eliminating the need for the script. (though I'm not sure how input redirection works in this case)

  3. It is mentioned here [1] that the service is updated if already installed. My impression is that doing this in Go means removing and reinstalling the service - does that sound reasonable? Do we care if the service is already installed? (i.e. do we actually want to reinstall it?) Really what we want is for the generic-worker.exe to by up to date, right?

  4. There's a configFile variable passed to deployService. Best I can tell, it is unused. Any idea what it's there for? The worker itself obviously takes a config file, but generic-worker service install --config $filename doesn't seem to set that up (i.e. setting up the bash script to use the config file).

I have some WIP (untested because I don't have a working dev environment yet) code up here: https://github.com/taskcluster/generic-worker/compare/master...milescrabill:no-more-nssm-20190430#diff-04c6e90faac2675aa89e2176d2eec7d8

[0] https://github.com/taskcluster/generic-worker/blob/v14.1.0/plat_windows.go#L490-L520
[1] https://github.com/taskcluster/generic-worker/blob/v14.1.0/plat_windows.go#L548

PS: You answered my 5th question (should I clean up the worker_types) in flight! Thanks! :)

I got my dev environment figured out and was able to test on a Windows 10 VM. The service installs correctly, but fails to start because I don't have a proper config file. Progress!

I have some findings regarding my questions:

  1. The bat script serves to change the working directory of the service (effectively) to match the generic-worker executable. This makes logs and other things also be in that same location. That's useful.

  2. The naive approach to do this, deleting and recreating the service, doesn't really work - in practice I saw this error: The specified service has been marked for deletion. <- i.e. it takes time to delete a service, you can't just delete and recreate it in subsequent calls. It's probably fine to not delete the service if it already exists. Is there a case where we care about updating the service? I'd imagine the service itself doesn't need updating, but the generic-worker executable would (and could be).

  3. It might make sense to interpolate the configFile into the bat script. That's already effectively what happens with the default config from what I can tell. i.e. generic-worker install service --config $file makes the bat script have generic-worker run --config $file instead of default.

My patch is more tested now but still needs a little more work.

(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #8)

I started looking into this and have a few questions:

  1. The way that logging is handled currently, it looks like we log to a file generic-worker-service.log adjacent to the bat script / exe. NSSM comes with some niceties (log rotation) but the go example uses a default eventlog. We can keep logging to a file and still use eventlog, but I don't think we get rotation for free. Do we need the custom logfile? Should we just use the system eventlog?

I wasn't aware that the current nssm integration is writing to the system event log, but funnily enough this has been requested in bug 1436274. If it is relatively straightforward to write to the event log, indeed we may wish to stop writing to the local log file, although I think we should make sure that when manually running the worker we can still write to standard out/standard error, e.g. so that you don't need to inspect the event log to find the cause of a unit test failure, etc. We would want to make sure that the generic-worker CI task logs contain the complete output of the worker, for example.

  1. Do we need the bat script? [0] We create a bat script that essentially wraps the generic-worker.exe, passes arguments, and redirects output. My impression is that we can just include those arguments in the service creation instead, eliminating the need for the script. (though I'm not sure how input redirection works in this case)

The bat script was created mostly for the purposes of customisation. In OpenCloudConfig, this version is swapped out with the version generated by the install command. If we were to remove it, we'd need to make sure that RelOps still had a way to customise this behaviour, or we should look at integrating their customised version, if it makes sense for it to be standard.

  1. It is mentioned here [1] that the service is updated if already installed. My impression is that doing this in Go means removing and reinstalling the service - does that sound reasonable? Do we care if the service is already installed? (i.e. do we actually want to reinstall it?) Really what we want is for the generic-worker.exe to by up to date, right?

This is a good question. I think on cloud instances we would never need to install more than once; if we are making system changes, we create a fresh instance and install from scratch. This isn't the case on all hardware workers - I believe they are typically not reimaged when they are upgraded. Therefore I think it probably does make sense to have an option to uninstall the worker.

In general, I think even if it is more work, providing the option to uninstall the worker would be a good thing, and we should ship that with the worker. Admittedly we don't provide a command line option for explicitly uninstalling the worker at the moment (the user would need to run something like nssm remove 'Generic Worker') but once we stop using nssm, I think we should probably include a generic-worker uninstall target or similar.

Note, I agree that probably the only thing that changes when we install a newer version of a worker is that the generic-worker binary changes (so just swapping out the generic-worker.exe binary might be enough) but we can't easily guarantee that we won't make changes to the install footprint, so having an explicit uninstall would be nice to capture this. It also allows people who wish to play with the worker locally on a Windows machine to uninstall it after they no longer wish to use the worker / have it installed on their system.

  1. There's a configFile variable passed to deployService. Best I can tell, it is unused. Any idea what it's there for? The worker itself obviously takes a config file, but generic-worker service install --config $filename doesn't seem to set that up (i.e. setting up the bash script to use the config file).

Whoops, indeed that is a bug! Indeed, it should set the config file to use in the run-generic-worker.bat script, but at the moment it isn't passing it through to the CreateRunGenericWorkerBatScript method.

I have some WIP (untested because I don't have a working dev environment yet) code up here: https://github.com/taskcluster/generic-worker/compare/master...milescrabill:no-more-nssm-20190430#diff-04c6e90faac2675aa89e2176d2eec7d8

Great, I'll take a look!

[0] https://github.com/taskcluster/generic-worker/blob/v14.1.0/plat_windows.go#L490-L520
[1] https://github.com/taskcluster/generic-worker/blob/v14.1.0/plat_windows.go#L548

PS: You answered my 5th question (should I clean up the worker_types) in flight! Thanks! :)

See Also: → 1178661
Assignee: pmoore → miles
Blocks: 1436274
Blocks: 1565390
No longer blocks: 1565390
Keywords: good-first-bug
QA Whiteboard: [lang=go]

Miles: how much work to finish this off?

Flags: needinfo?(miles)

Pete and I discussed this several months ago now, the scope of the changes required to remove NSSM (in particular, the question of how generic-worker would be run as a service) was large enough that this work was tabled. Correct me if I'm wrong Pete, but you had opted to keep NSSM to avoid some of the complexity incurred by doing things in Go, right?

Unassigning myself per above.

Assignee: miles → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(miles) → needinfo?(pmoore)

Yeah, let's leave things as they are, they aren't causing us too much pain at the moment, and if NSSM goes away and we need an alternative solution, I think we can tackle it then. Thanks for following up!

Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(pmoore)
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: