Closed Bug 1329944 Opened 7 years ago Closed 7 years ago

[Puppet] Make pushapkworker use scriptworker module

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlorenzo, Assigned: jlorenzo)

References

Details

Attachments

(1 file)

This will introduce consistency with other scriptworker instances (like signing worker). This is also needed to complete bug 1321513.

In order to make that possible, pushapkscript had to be upgraded to support scriptworker 1.0.0b4. This has been done at: https://github.com/mozilla-releng/pushapkscript/pull/5
Depends on: 1330276
Comment on attachment 8825813 [details]
Bug 1329944 - [Puppet] Make pushapkworker use scriptworker module

Following up yesterday's discussion, here's a proposal for the scriptworker puppet module:

* More default values are exposed. For instance, by default the expected script is a python one with the location to script_config.json passed.
* The module takes care of the regular
  * nagios config + pending tasks and file age checks
  * supervisord config
* A warning is displayed when the worker_id is too long (usually happens with personal loaners)
* Chain of trust instructions are moved to another file. scriptworker::instance calls it if one of ($sign_chain_of_trust, $verify_chain_of_trust, $verify_cot_signature) is true.

I tested it against pushapkworker, which works in staging. I'll wait until balrogworker and beetmover have COT enabled (bug 1329944) before going further. In the meantime, I'd like your point of view about the module.
Attachment #8825813 - Flags: feedback?(mtabara)
Attachment #8825813 - Flags: feedback?(aki)
(responding to your comment here; I'll also take a look at the patch.)

(In reply to Johan Lorenzo [:jlorenzo] from comment #2)
> Comment on attachment 8825813 [details]
> Bug 1329944 - [Puppet] Make pushapkworker use scriptworker module
>
> Following up yesterday's discussion, here's a proposal for the scriptworker
> puppet module:
>
> * More default values are exposed. For instance, by default the expected
> script is a python one with the location to script_config.json passed.

Meaning you'd set defaults for $task_script_executable and $task_script_config?  That makes sense.
We should use the python in the venv by default ($basedir/bin/python).

> * The module takes care of the regular
>   * nagios config + pending tasks and file age checks

I think it does this already, though Simon may have added something to signing_scriptworker that should probably be handled in the shared module.

>   * supervisord config

The reason I kept this in signing_scriptworker is because there are instance-specific files we need to watch for, e.g. passwords.json for signing scriptworker.  We shouldn't start scriptworker without that file being in place.  (We don't need to restart it if it changes, though.)

If we're able to handle that type of thing in the shared module, then agreed, supervisord belongs there.  I wasn't able to pass in lists to instance.pp, but it's quite likely I was doing something wrong.

> * A warning is displayed when the worker_id is too long (usually happens
> with personal loaners)

I think scriptworker will barf on run, but it would be good to have an earlier error.
It might be good to be able to override @hostname with a different worker_id for long-hostname loaners.

> * Chain of trust instructions are moved to another file.
> scriptworker::instance calls it if one of ($sign_chain_of_trust,
> $verify_chain_of_trust, $verify_cot_signature) is true.

Ok.

> I tested it against pushapkworker, which works in staging. I'll wait until
> balrogworker and beetmover have COT enabled (bug 1329944) before going
> further. In the meantime, I'd like your point of view about the module.

I've got a patch feedback draft started.
Comment on attachment 8825813 [details]
Bug 1329944 - [Puppet] Make pushapkworker use scriptworker module

https://reviewboard.mozilla.org/r/103514/#review104640

Hm, it looks like reviewboard doesn't allow for f+, just r+/r-/r?
I'll f+ manually after posting this.

Overall I think this is good.  I've commented where I have feedback.  The other point I have is this will change behavior behind signing/balrog/beetmover, so if this lands before they're ready, those may break.  We should land in a coordinated manner.  (You're probably aware; I'm just stating this to be explicit.)

::: manifests/moco-config.pp:454
(Diff revision 1)
>          }
>      }
>  
>      ## TC pushapk scriptworkers
> -    $pushapk_scriptworker_old_root = '/builds/pushapkworker' # TODO Remove this line once bug 1321513 reaches production
>      $pushapk_scriptworker_root = $scriptworker_root

We've been moving our scriptworker config into `settings.pp` files. I'm on the fence about which is better, one large config file or multiple discrete config files.

::: modules/pushapkworker/manifests/init.pp
(Diff revision 1)
> -    include ::config
> -    include pushapkworker::services
>      include pushapkworker::settings
>      include dirs::builds
>      include packages::mozilla::python35
> -    include users::builder

Aiui, the users::builder include means we create the cltbld account.  Do you still have a way to create that (or other) account?

::: modules/pushapkworker/manifests/init.pp:48
(Diff revision 1)
>                  'python-dateutil==2.5.3',
>                  'python-gnupg==0.3.9',
> -                'requests==2.11.1',
> +                'PyYAML==3.12',
> +                'requests==2.12.4',
>                  'rsa==3.4.2',
> -                'scriptworker==0.7.2',
> +                'scriptworker==1.0.0b4',

beta 5 is out!

::: modules/pushapkworker/templates/script_config.json.erb:22
(Diff revision 1)
>              "certificate": "<%= @google_play_config['release']['certificate_target_location'] %>"
>          }
>      },
>  
> -    "jarsigner_key_store": "<%= scope.lookupvar('config::pushapk_scriptworker_jarsigner_keystore') %>",
> +    "jarsigner_key_store": "<%= scope['pushapkworker::settings::jarsigner_keystore'] %>",
>      "jarsigner_certificate_aliases": {

I understand this is the keystore alias, and you're mapping those to aurora/beta/release.

I'm not sure if you already have this: we should hardcode aurora/beta/release to scopes that must be set in the task definition, so we can't use the wrong one on the wrong tree.  That hardcode would belong in the pushapk script repo.

::: modules/scriptworker/manifests/instance.pp:43
(Diff revision 1)
> -    # some constants
> +    # These constants need to be filled in $script_worker_config, even though Chain of Trust is not enabled.
>      $git_key_repo_dir = "${basedir}/gpg_key_repo/"
>      $git_pubkey_dir = "${basedir}/git_pubkeys/"
>  
> -    # This git repo has the various worker pubkeys
> -    git::repo {
> +    # Sanitize worker_id, so it doesn't exceed TC's 22-character-limit
> +    $worker_id = regsubst($hostname, '^(.{22}).+', '\1')

This is one way, but may end up with duplicate `worker_id`s if the first 22 chars of a hostname matches another.

Instead of this automatic trim, I'd rather have `worker_id` default to `$hostname`, fail out if `worker_id` > 22 chars, and allow for explicitly overriding the `worker_id`.

::: modules/scriptworker/manifests/supervisord.pp:12
(Diff revision 1)
> +) {
> +  supervisord::supervise {
> +      $instance_name:
> +          command      => "${basedir}/bin/scriptworker ${script_worker_config}",
> +          user         => $username,
> +          require      => [ File[$task_script_config]],

As mentioned in the bz comment, we may need to require additional files. This would make sense to pass into instance.pp as a list, but I haven't been successful in passing lists in (otherwise the entire task script command would have been a single list, rather than executable + script + config)
Attachment #8825813 - Flags: feedback?(aki) → feedback+
Comment on attachment 8825813 [details]
Bug 1329944 - [Puppet] Make pushapkworker use scriptworker module

https://reviewboard.mozilla.org/r/103514/#review105702

::: manifests/moco-config.pp:455
(Diff revision 1)
> -    $pushapk_scriptworker_taskcluster_artifact_expiration_hours = 336
> -    $pushapk_scriptworker_taskcluster_artifact_upload_timeout = 1200
> -    $pushapk_scriptworker_task_max_timeout = 1200
> -    $pushapk_scriptworker_artifact_expiration_hours = 336
> -    $pushapk_scriptworker_artifact_upload_timeout = 600
>      $pushapk_scriptworker_env_config = {

You can move the env_config dict too in settings should you want to, to avoid the use of "_env_config"

::: modules/pushapkworker/manifests/init.pp:48
(Diff revision 1)
>                  'python-dateutil==2.5.3',
>                  'python-gnupg==0.3.9',
> -                'requests==2.11.1',
> +                'PyYAML==3.12',
> +                'requests==2.12.4',
>                  'rsa==3.4.2',
> -                'scriptworker==0.7.2',
> +                'scriptworker==1.0.0b4',

beta 6 is out! ;)

::: modules/pushapkworker/manifests/init.pp:76
(Diff revision 1)
> +            taskcluster_access_token => $pushapkworker::settings::taskcluster_access_token,
> +            worker_group             => $pushapkworker::settings::worker_group,
> +            worker_type              => $pushapkworker::settings::worker_type,
> +
> +            # TODO Enable one of the next 3 lines to turn on Chain of Trust (bug 1317783)
> +            sign_chain_of_trust      => false,

AFAIK, these three lines need to be all toggled to True but I could be wrong.

::: modules/pushapkworker/manifests/init.pp:79
(Diff revision 1)
> +
> +            # TODO Enable one of the next 3 lines to turn on Chain of Trust (bug 1317783)
> +            sign_chain_of_trust      => false,
> +            verify_chain_of_trust    => false,
> +            verify_cot_signature     => false,
> +            cot_job_type             => 'cot_unused',

Not sure all the details under scriptworker's hood but might need to add a new cot job type in scriptworker to support pushapk too.

::: modules/pushapkworker/manifests/settings.pp:10
(Diff revision 1)
> +    $_env_config = $config::pushapk_scriptworker_env_config[$pushapkworker_env]
>  
>      $root = $config::pushapk_scriptworker_root
> +    $schema_file = "${root}/lib/python3.5/site-packages/pushapkscript/data/pushapk_task_schema.json"
> +    $work_dir = "${root}/work"
> +    $script_config = "${root}/script_config.json"

Since you're using defaults for task-script-executable and script-config in the init.pp/instance declaration, I don't think this variable is being used anywhere, anymore.

::: modules/pushapkworker/manifests/settings.pp:16
(Diff revision 1)
> +    $task_script = "${root}/bin/pushapkscript"
> +
> +    $user = $users::builder::username
> +    $group = $users::builder::group
> +
> +    $taskcluster_client_id = $_env_config['taskcluster_client_id']

As said above, might be worth bringing the _env_config_ dict content from moco-config.pp in here.
Attachment #8825813 - Flags: feedback?(mtabara) → feedback+
Thank you guys for your feedback! I'll address the points your raised once bug 1330276 is in tree. That'll prevent to resolve conflicts more than once.
Comment on attachment 8825813 [details]
Bug 1329944 - [Puppet] Make pushapkworker use scriptworker module

https://reviewboard.mozilla.org/r/103514/#review105702

> AFAIK, these three lines need to be all toggled to True but I could be wrong.

Not sure of the details yet, as well. Aki told me all 3 needed to be turned off to take out scriptworker. Do you confirm, Aki?

> Not sure all the details under scriptworker's hood but might need to add a new cot job type in scriptworker to support pushapk too.

Done by Aki in https://github.com/mozilla-releng/scriptworker/pull/60

> Since you're using defaults for task-script-executable and script-config in the init.pp/instance declaration, I don't think this variable is being used anywhere, anymore.

It's actually used by pushapkworker/manifests/init.pp. You need to define where the template should be copied to.
Comment on attachment 8825813 [details]
Bug 1329944 - [Puppet] Make pushapkworker use scriptworker module

https://reviewboard.mozilla.org/r/103514/#review104640

> We've been moving our scriptworker config into `settings.pp` files. I'm on the fence about which is better, one large config file or multiple discrete config files.

Done.

> Aiui, the users::builder include means we create the cltbld account.  Do you still have a way to create that (or other) account?

users::builder is included in settings.pp. That's what triggers the creation of account.

> beta 5 is out!

Upgraded to beta7

> I understand this is the keystore alias, and you're mapping those to aurora/beta/release.
> 
> I'm not sure if you already have this: we should hardcode aurora/beta/release to scopes that must be set in the task definition, so we can't use the wrong one on the wrong tree.  That hardcode would belong in the pushapk script repo.

Good idea! I filed https://github.com/mozilla-releng/pushapkscript/issues/7 to track this.

> This is one way, but may end up with duplicate `worker_id`s if the first 22 chars of a hostname matches another.
> 
> Instead of this automatic trim, I'd rather have `worker_id` default to `$hostname`, fail out if `worker_id` > 22 chars, and allow for explicitly overriding the `worker_id`.

In the latest revisions, I exposed worker_id in the instance's args. The default value remains $hostname. In my opinion, having to manually force worker_id in your module is quite a pain when you're working with loaners.

That's why I kept the strip (now done at the beginning of the string) if worker_id points to the default value. If manually forced to something else, longer names will throw exceptions.

I also added checks on the other identifiers (worker_type and worker_group)

Do you think that's a good compromise?

> As mentioned in the bz comment, we may need to require additional files. This would make sense to pass into instance.pp as a list, but I haven't been successful in passing lists in (otherwise the entire task script command would have been a single list, rather than executable + script + config)

I found a way to pass arrays. It's not particularly beautiful, but it worked with the examples I tried. This makes me wonder if we shouldn't also set up Python35::VirtualEnv in instance.pp thanks to lists. If you agree, I'll handle that in a follow up.
Comment on attachment 8825813 [details]
Bug 1329944 - [Puppet] Make pushapkworker use scriptworker module

Here's a new revision that includes beetmover and balrog. I tested this code against my personal loaner. Here's what I tested:
* pushapkscript
  1. rm -rf /build/scriptworker
  2. puppet agent --test
  3. I ran this task https://tools.taskcluster.net/task-inspector/#EZ-ixw3YQX2j9QBp-W00kw/ . It failed (as excepted) in one of the latest step of pushapk. Google Play refuses to reupload APKs that were already published, even if you don't commit changes.

* balrog and beetmover
  1. rm -rf /build/scriptworker
  2. Turned on COT:
    a. Generated a gpg key for my loaner
    b. Signed it with my dummy key for scriptworker
    c. Pushed the signed public key to https://github.com/JohanLorenzo/cot-gpg-keys/tree/puppet-test-1 . This must be done on the master branch.
    d. Dropped the public and private keys in jlorenzo_secrets.eyaml
    e. Manually edited modules/scriptworker/manifests/instance/settings.pp to point to my fork
  3. puppet agent --test
  4. Looked at logs and checked the process is still running

I didn't go further by creating a dummy balrog/beetmover task. The main reason is: At the moment, workers are not under a different group than the production ones. I preferred not going too far.
Also, after thinking more closely about it, I believe the risks caused by my changes are mitigated:
=> I didn't upgrade python packages (which may cause runtime errors)
=> I didn't change any bits of configs, except cot-related. Cot errors have been detected during `puppet apply`, when rebuild_gpg_homedirs is called. 

I'm probably missing some bits, if you see some errors that might be caught by tasks being run, please tell me.

Appart from that, I thought of a way to tests scriptworker changes. How about:
* Some docker containers upon which we apply a dev config.
* A test runner that creates some tasks (targeted to the docker containers)
* Once the tasks are done, the test runner will check if the task passed (or, in the case of pushapk, failed with the correct error message)? 

This is likely a bug task. If you guys agree, I'll open a bug with the proposal.
Attachment #8825813 - Flags: review?(mtabara)
Attachment #8825813 - Flags: review?(aki)
(In reply to Johan Lorenzo [:jlorenzo] from comment #12)
> Comment on attachment 8825813 [details]
> Bug 1329944 - [Puppet] Make pushapkworker use scriptworker module
> 
> https://reviewboard.mozilla.org/r/103514/#review105702
> 
> > AFAIK, these three lines need to be all toggled to True but I could be wrong.
> 
> Not sure of the details yet, as well. Aki told me all 3 needed to be turned
> off to take out scriptworker. Do you confirm, Aki?

if we're signing, we need the gnupg homedirs.  If we're verifying signatures we need gnupg homedirs as well.  We might be able to get away with cot verification without them, if the other two are off, but without signature verification it's fairly meaningless.  So I think this is good, and they should all be on as a rule (pushapk is currently the exception).
(In reply to Johan Lorenzo [:jlorenzo] from comment #14)
> I thought of a way to tests scriptworker changes. How
> about:
> * Some docker containers upon which we apply a dev config.
> * A test runner that creates some tasks (targeted to the docker containers)
> * Once the tasks are done, the test runner will check if the task passed
> (or, in the case of pushapk, failed with the correct error message)? 
> 
> This is likely a bug task. If you guys agree, I'll open a bug with the
> proposal.

This gets even bigger if we want to test chain of trust.

* generate gpg keypairs for the docker images
* sign the public portions, and create a staging repo like https://github.com/mozilla-releng/cot-gpg-keys

I'm not sure if there's a good way around actually having a build happen on a set of docker-workers.  We could create a mozilla-central fork where we replace |mach build| with a dummy artifact creator.  This does sound like a lot of work, as well as a lot of ongoing maintenance.  If we can figure out a way that doesn't compromise security, being able to test may be a big enough win to attempt this.
Comment on attachment 8825813 [details]
Bug 1329944 - [Puppet] Make pushapkworker use scriptworker module

https://reviewboard.mozilla.org/r/103514/#review107294

::: modules/pushapkworker/manifests/init.pp
(Diff revision 6)
> -    include ::config
> -    include pushapkworker::services
>      include pushapkworker::settings
>      include dirs::builds
>      include packages::mozilla::python35
> -    include users::builder

I'm guessing we'll readd this when we apply a similar fix as bug 1298105 to the pushapk workers, but I'm assuming the cltbld user already exists.

::: modules/scriptworker/manifests/instance.pp:50
(Diff revision 6)
> -    # This git repo has the various worker pubkeys
> -    git::repo {
> -        "scriptworker-${git_key_repo_dir}":
> -            repo    => "${scriptworker::instance::settings::git_key_repo_url}",
> -            dst_dir => $git_key_repo_dir,
> -            user    => "${username}",
> +    validate_taskcluster_identifier($worker_group)
> +    validate_taskcluster_identifier($worker_type)
> +    # Hostname may be longer than 22 characters. Getting an error is painful especially in dev environments.
> +    # That's why we strip worker_id if the default value (aka hostname) is used.
> +    if $worker_id == $hostname {
> +      $sanitized_worker_id = regsubst($hostname, '^.*(.{22})$', '\1')

Carryover comment:

Instead of this automatic trim, I'd rather have `worker_id` default to `$hostname`, fail out if `worker_id` > 22 chars, and allow for explicitly overriding the `worker_id`.

::: modules/scriptworker/manifests/instance.pp:66
(Diff revision 6)
>  
> -    nrpe::custom {
> -        "scriptworker.cfg":
> -            content => template("scriptworker/nagios.cfg.erb");
> +    # XXX Workaround to have arrays as default values
> +    if $restart_process_when_changed == undef {
> +      $_restart_process_when_changed = [Python35::Virtualenv[$basedir], File[$task_script_config]]
> +    } else {
> +      $_restart_process_when_changed = $restart_process_when_changed

Have you successfully passed a list to `instance.pp`?  I haven't been able to yet.

::: modules/scriptworker/manifests/nagios.pp:7
(Diff revision 6)
> +  $basedir,
> +) {
> +
> +  nrpe::custom {
> +      "scriptworker.cfg":
> +          content => template("scriptworker/nagios.cfg.erb");

It's not clear to me if this file is even used (see bug 1314840).  It's probably safer to keep it until we verify we don't need it, though.
Attachment #8825813 - Flags: review?(aki)
Hm, that posted the review and cleared the flag without my intending it to.
I wanted to also add, we may need signing scriptworker patches before this can land, yes?
Comment on attachment 8825813 [details]
Bug 1329944 - [Puppet] Make pushapkworker use scriptworker module

Oops, thank you mozreview!
Attachment #8825813 - Flags: review?(mtabara)
Comment on attachment 8825813 [details]
Bug 1329944 - [Puppet] Make pushapkworker use scriptworker module

https://reviewboard.mozilla.org/r/103514/#review107294

> I'm guessing we'll readd this when we apply a similar fix as bug 1298105 to the pushapk workers, but I'm assuming the cltbld user already exists.

Agreed via IRC: cltbld is added via pushapkworker/settings.pp

> Carryover comment:
> 
> Instead of this automatic trim, I'd rather have `worker_id` default to `$hostname`, fail out if `worker_id` > 22 chars, and allow for explicitly overriding the `worker_id`.

Agreed via IRC: changing catalogs each time you forget to force a worker ID is a pain. Here's a screenshot of how the warning is showed up when you run `puppet agent` ![screenshot](/media/uploaded/files/2017/01/23/9b51e84b-6b60-403a-b5b1-ae6f16433645__Screenshot_from_2017-01-23_18-50-01.png)

> Have you successfully passed a list to `instance.pp`?  I haven't been able to yet.

Yes, see IRC logs:

18:35:34 
<jlorenzo> the tricky part is when you pass variables like Python::Virtualenv 
18:36:04 
<jlorenzo> if that var is a default value of a "define", it gets interpreted a string 
18:36:33 
<jlorenzo> that's why i pass an "undef" that check before assigning another variable 
18:37:27 
<jlorenzo> see line 62 of modules/scriptworker/manifests/instance.pp

> It's not clear to me if this file is even used (see bug 1314840).  It's probably safer to keep it until we verify we don't need it, though.

Sound like a good follow up!
Comment on attachment 8825813 [details]
Bug 1329944 - [Puppet] Make pushapkworker use scriptworker module

https://reviewboard.mozilla.org/r/103514/#review107602

lgtm.  Can you make sure to warn me, :jlund, and :mtabara when you're ready to land? We can keep an eye on the workers and retrigger jobs to make sure they still work afterwards.
Attachment #8825813 - Flags: review?(aki) → review+
Attachment #8825813 - Flags: review?(mtabara) → review+
Rolled out.  Thank you :jlorenzo!
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Blocks: 1334425
No longer depends on: 1330276
See Also: → 1330276
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: